iTrails: Pay-as-you-go Information Integration in Dataspaces
Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
VLDB 2007
Anat Heilper
Jan. 2009
CS Seminar in Databases (236826)
Dec 20, 2015
Problem: Querying Heterogeneous Data Sources
Data sources: Laptop, Email Server, Web Server, DB Server
Query: What is the impact of the global depression in Israel?
Systems: ? ? ? ?
Solution 1: Use a Search Engine
Data sources: Laptop, Email Server, Web Server, DB Server (each exposes text and links)
System: Graph IR search engine
Query: global depression Israel
Examples: TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]
Drawback: query semantics are not precise!
Solution 2: Use an Information Integration System
(Figure: a query is posed through the query interface against a global schema and a result is returned; the global schema is mapped to the source schemas of data sources 1, 2 and 3. Schema elements shown: price index, countries, unemployment, crime rate.)
Drawback: too much effort to provide schema mappings!
Querying heterogeneous data sources: two opposite approaches
• Schema-first approach (SFA)
– Semantically integrated view over the data sources
– Mappings between source schemas and mediated schema
– Queries have clearly defined semantics
– Expensive to construct and maintain
– Not all data sources have schemas
• No-schema approach (NSA)
– Keyword search
– Requires good result ranking methods
– Performs no integration
– Query semantics are not well defined
Motivation of iTrails
Is there an integration solution in between these two extremes?
(Figure: a Dataspace System sits between a Graph IR Search Engine, which sees only text and links, and a Data Integration System with schema elements such as Temps, Cities, CO2, Sunspots.)
Pay-as-you-go information integration: the more effort you pay, the more query power you have.
iTrails Core Idea: Add Integration Hints Incrementally
1) Provide a search service over the data
– Use a general graph data model (iDM)
– Handles unstructured documents, XML, and relations
2) Add integration semantics via hints (trails)
3) If more semantics are needed, apply trails
– Smooth transition between search and data integration
– Semantics added incrementally to improve precision / recall
Example of an iDM
X1 = { .name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = '' }
X2 = { .name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = '' }
. . .
X5 = { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
. . .
(Figure: folder hierarchy with the nodes home, Mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf; the labels 1, 2 and 5 mark the nodes for X1, X2 and X5.)
General graph data model - iDM
iDM (iMeMeX Data Model) represents every structural component of the input data as a node.
Supports unstructured, semi-structured and structured data, e.g., files&folders, XML, relations
iMeMeX – integrated MeMeX
Vannevar Bush introduced the concept of the "memex" in 1945: "a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility."
Bush predicted: "Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified."
Data model
Data is represented by a directed graph G := (RV, E)
RV := {V1, . . . , Vn}: set of nodes, termed resource views
E: set of ordered pairs (Vi, Vj) of resource views
Vi ⇝ Vj: Vj is reachable from Vi by traversing the edges in E
Resource view
A resource view Vi has three components: name, tuple, and content
Component:
– Vi.name: string
– Vi.tuple: sequence of attribute-value pairs ((att0, val0), (att1, val1), ...)
– Vi.content: text
Example: { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
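The data model and the resource view definition can be pictured with a small in-memory structure. This is only a minimal sketch: the class names `ResourceView` and `Graph`, the field name `attrs` (standing in for the tuple component), and the ids "X1", "X2", "X5" used as dictionary keys are assumptions for illustration. Hanging X5 directly under X2 is a simplification of the folder hierarchy on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceView:
    """One node of the graph with its three components."""
    name: str
    attrs: dict = field(default_factory=dict)  # the tuple component: attribute -> value
    content: str = ""


class Graph:
    """Directed graph G = (RV, E) over resource views (hypothetical helper)."""

    def __init__(self):
        self.rv = {}     # id -> ResourceView
        self.edges = {}  # id -> list of child ids

    def add(self, vid, view, parent=None):
        self.rv[vid] = view
        self.edges.setdefault(vid, [])
        if parent is not None:
            self.edges[parent].append(vid)

    def reachable(self, src, dst):
        """True if dst can be reached from src by traversing edges in E."""
        stack, seen = list(self.edges[src]), set()
        while stack:
            v = stack.pop()
            if v == dst:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(self.edges[v])
        return False


g = Graph()
g.add("X1", ResourceView("home", {"owner": "root", "lastmodified": "05.01.2000"}))
g.add("X2", ResourceView("mike", {"owner": "root", "lastmodified": "04.17.2008"}), parent="X1")
# Simplification: intermediate folders between mike and the PDF are omitted.
g.add("X5", ResourceView("SIGMOD42.pdf",
                         {"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"},
                         "@PDF . . ."), parent="X2")
```

With this toy graph, `g.reachable("X1", "X5")` holds while `g.reachable("X5", "X1")` does not, which mirrors the directed reachability relation of the data model.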
Query model
Query expression:
– Query Q selects a set of nodes R := Q(G) ⊆ G.RV
– Example: //mike/papers
Component projection:
– C ∈ {.name, .tuple.<att_i>, .content}: projection of the set of resource views selected by query Q onto component C, i.e. the set of components R' := {Vi.C | Vi ∈ Q(G)}
Component projection example
Example: //mike//PIM/*.tuple.lastmodified
X1 = { .name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = '' }
X2 = { .name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = '' }
. . .
X5 = { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
. . .
(Figure: folder hierarchy with the nodes home, Mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf; the labels 1, 2 and 5 mark the nodes for X1, X2 and X5.)
Syntax of query expression
QUERY_EXPRESSION := (PATH | KT_PREDICATE) (union QUERY_EXPRESSION)*
PATH := (LOCATION_STEP)+
LOCATION_STEP := LS_SEP NAME_PREDICATE (`[` KT_PREDICATE `]`)?
LS_SEP := `//` | `/`
NAME_PREDICATE := `*` | (`*`)? VALUE (`*`)?
KT_PREDICATE := (KEYWORD | TUPLE) (LOGOP KT_PREDICATE)*
KEYWORD := `"` VALUE (WHITESPACE VALUE)* `"` | VALUE (WHITESPACE KEYWORD)*
TUPLE := ATTRIBUTE_IDENTIFIER OPERATOR VALUE
OPERATOR := `=` | `<` | `>`
LOGOP := `AND` | `OR`
Semantics
– All nodes in the graph
– All nodes in the graph that have 'a' in their content
– All nodes in the graph that have 'a' and 'b' in their content
– All nodes in the graph such that .name == 'A'
– All nodes V with V.name == 'B' for which there is an edge (W, V) with W.name == 'A'
Logical algebra for query expressions
Operator   Name                 Semantics
G          All resource views   {V | V ∈ G.RV}
σP(I)      Selection            {V | V ∈ I ∧ P(V)}
μ(I)       Shallow unnest       {W | (V, W) ∈ G.E ∧ V ∈ I}
ω(I)       Deep unnest          {W | V ⇝ W ∧ V ∈ I}
I1 ∩ I2    Intersection         {V | V ∈ I1 ∧ V ∈ I2}
I1 ∪ I2    Union                {V | V ∈ I1 ∨ V ∈ I2}
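The operators of the logical algebra map naturally onto set operations. The following is a minimal sketch under stated assumptions: the toy graph (`NAMES`, `EDGES`), the function names, and the way the path //mike//PIM/* is composed from the operators are illustrative and not taken from the paper.

```python
# Toy graph: node ids, names and edges are illustrative assumptions.
NAMES = {1: "home", 2: "mike", 3: "papers", 4: "PIM", 5: "SIGMOD42.pdf"}
EDGES = {(1, 2), (2, 3), (3, 4), (4, 5)}


def all_views():
    """G: all resource views."""
    return set(NAMES)


def select(pred, views):
    """Selection: keep the views that satisfy the predicate."""
    return {v for v in views if pred(v)}


def shallow_unnest(views):
    """Shallow unnest: direct children of the given views."""
    return {w for (v, w) in EDGES if v in views}


def deep_unnest(views):
    """Deep unnest: everything reachable from the given views."""
    result, frontier = set(), set(views)
    while frontier:
        frontier = shallow_unnest(frontier) - result
        result |= frontier
    return result


def name_is(name):
    return lambda v: NAMES[v] == name


# One possible evaluation of //mike//PIM/* with these operators.
mike = select(name_is("mike"), all_views())
pim = deep_unnest(mike) & select(name_is("PIM"), all_views())
children = shallow_unnest(pim)
```

Intersection and union are the built-in `&` and `|` on Python sets, so they need no helper of their own.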
What have we seen so far?
Problem: querying heterogeneous data sources
Find a solution between SFA and NSA
– A generic graph data model describes the data
– Queries describe paths in the graph
How do iTrails help?
Queries are modified by hints (trails), which add or modify the search paths to look at.
Example: yesterday → //*[date = today() – 1]
iTrails: Defining Trails
Unidirectional trail (basic form):
QL[.CL] → QR[.CR]
Intuition:
– When I query for QL[.CL], you should also query for QR[.CR]
– QL, QR are queries: keyword and path expressions
– .CL, .CR are optional attribute projections
Bidirectional trail:
QL[.CL] ↔ QR[.CR]
Example: ψi := //*.tuple.date ↔ //*.tuple.modified
Query examples: global warming zurich, or //Temperatures/*[celsius > 10]
Trail Examples: Global Warming Zurich
Query: global warming zurich
Trail for implicit meaning: "When I query for global warming, also query for temperature data above 10 degrees"
global warming → //Temperatures/*[celsius > 10]
Trail for an entity: "When I query for zurich, also query for references to Zurich as a region"
zurich → //*[region = "ZH"]

Temperatures
city     date     region   celsius
Bern     24-Sep   BE       20
Uster    24-Sep   ZH       15
Zurich   25-Sep   ZH       14
Zurich   26-Sep   ZH       9
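To see what the two trails select, their right-hand sides can be evaluated over the Temperatures data. This is a sketch only: the row values follow one reading of the slide's figure, the predicates `warm` and `in_zh` are hypothetical names, and combining both trails with a logical "and" is an assumption about how the keyword query is interpreted.

```python
# Rows follow one reading of the Temperatures figure on the slide.
temperatures = [
    {"city": "Bern", "date": "24-Sep", "region": "BE", "celsius": 20},
    {"city": "Uster", "date": "24-Sep", "region": "ZH", "celsius": 15},
    {"city": "Zurich", "date": "25-Sep", "region": "ZH", "celsius": 14},
    {"city": "Zurich", "date": "26-Sep", "region": "ZH", "celsius": 9},
]


def warm(row):
    """Right side of: global warming -> //Temperatures/*[celsius > 10]"""
    return row["celsius"] > 10


def in_zh(row):
    """Right side of: zurich -> //*[region = "ZH"]"""
    return row["region"] == "ZH"


warm_rows = [r for r in temperatures if warm(r)]
zh_rows = [r for r in temperatures if in_zh(r)]
# Assumption: both query keywords must be satisfied by a result.
both = [r for r in temperatures if warm(r) and in_zh(r)]
```

Under these assumptions the rewritten query also reaches structured rows that a plain keyword search for "global warming zurich" would not match by text alone.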
Trail Example: Deep Web Bookmarks
Trail for a bookmark: "When I query for train home, also query the train company's website for origin = Tel Aviv Uni and destination = Haifa Hof Hacarmel"
Query: train home
train home → //trainCompany.com//*[origin = "Tel Aviv Uni" and dest = "Haifa Hof Hacarmel"]
Data source: Web Server
Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
Trail for thesauri: when I query for car, also query for auto
Trail for dictionaries: when I query for car, also query for carro, and vice versa
car auto
car automobile
car → auto
car → automobile
automobile → car
Data sources: Laptop, Email Server
Trail Examples: Schema Equivalences
Trail for schema match on names: when I query for Employee.empName, also query for Person.name
//Employee//*.tuple.empName → //Person//*.tuple.name
Trail for schema match on salaries: when I query for Employee.salary, also query for Person.income
//Employee//*.tuple.salary → //Person//*.tuple.income
Data source: DB Server
Employee(empId, empName, salary)
Person(SSN, name, age, income)
How are Trails Created?
Given by the user
– Explicitly
– Via relevance feedback
(Semi-)automatically
– Automatic schema matching
– Ontologies and thesauri (e.g., WordNet)
– User communities (e.g., trails on gene data, bookmarks)
Uncertainty and Trails
Probabilistic trails:
– Model uncertain trails
– Probabilities are used to rank trails
QL[.CL] → QR[.CR], 0 ≤ p ≤ 1
– Example: car → auto, p = 0.9
The probability p reflects the likelihood that results obtained by the trail are correct.
Uncertainty and Trails (continued)
Scored trails:
– Give higher value to certain trails
– Scoring factors: boost scores of results obtained by the trail
QL[.CL] → QR[.CR], sf ≥ 1
Examples:
– T1: weather → //Temperatures/*, sf ≥ 1
– T2: yesterday → //*[date = today() – 1], sf ≥ 1
Intuition: sf reflects the relevance of the trail.
– Results obtained by the trail are scored sf times higher than results obtained without the trail.
– If no scoring factor is available, sf = 1
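The two annotations can be sketched in a few lines. This is illustrative only: the function names, the base scores, and the probability 0.4 for the second trail are assumptions; only "car → auto, p = 0.9" and the default sf = 1 come from the slides.

```python
def rescore(base_score, sf=1.0):
    """Scale a result's score by the scoring factor of the trail that produced it.

    With no scoring factor available, sf defaults to 1 and the score is unchanged.
    """
    return base_score * sf


def rank_trails(trails):
    """Order (trail, probability) pairs, most probable first."""
    return sorted(trails, key=lambda t: t[1], reverse=True)


# The second probability is a made-up value for illustration.
trails = [("car -> carro", 0.4), ("car -> auto", 0.9)]
ranked = rank_trails(trails)
boosted = rescore(0.5, sf=3.0)
unchanged = rescore(0.5)
```

Keeping the probability (how likely the trail is correct) separate from the scoring factor (how relevant its results are) follows the split made on the two slides.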
Rewriting Queries with Trails
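Query: weather yesterday
T2: yesterday → //*[date = today() – 1]
(1) Matching: T2 matches the leaf yesterday of the query tree
(2) Transformation: the right side of T2, //*[date = today() – 1], is generated
(3) Merging: the matched leaf is replaced by the union yesterday U //*[date = today() – 1]
Rewritten query: weather (yesterday U //*[date = today() – 1])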
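The three rewriting steps can be traced on a tiny query tree. This is a minimal sketch under simplifying assumptions: a tree is a nested `(operator, children)` tuple with strings as leaves, matching is plain equality on a leaf, and the root operator label "root" is a placeholder because the slide's root operator is not recoverable from the extracted text. The `replace` flag models trails that replace the matched leaf instead of adding a union.

```python
RIGHT = "//*[date = today() - 1]"


def apply_trail(node, left, right, replace=False):
    """Rewrite every leaf equal to `left`; returns a new tree."""
    if isinstance(node, str):                      # a leaf of the query tree
        if node != left:                           # (1) matching
            return node
        transformed = right                        # (2) transformation
        if replace:
            return transformed                     # replace semantics
        return ("union", [node, transformed])      # (3) merging
    op, children = node
    return (op, [apply_trail(c, left, right, replace) for c in children])


query = ("root", ["weather", "yesterday"])
rewritten = apply_trail(query, "yesterday", RIGHT)
replaced = apply_trail(query, "yesterday", RIGHT, replace=True)
```

Applying `apply_trail` to `rewritten` a second time would match the leaf yesterday again, which is exactly the recursion problem that a history of applied trails has to prevent.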
Replacing Trails
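Trails that use replace instead of union semantics
Query: weather yesterday
T2: yesterday → //*[date = today() – 1] (replacing trail)
(1) Matching: T2 matches the leaf yesterday of the query tree
(2) Transformation: the right side of T2, //*[date = today() – 1], is generated
(3) Merging: the matched leaf is replaced by the right side of T2 instead of being unioned with it
Rewritten query: weather //*[date = today() – 1]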
Problem: Recursive Matches (1/2)
T2: yesterday → //*[date = today() – 1]
After applying T2, the query tree has the leaves weather, yesterday and //*[date = today() – 1], where the last two are joined by a union (U).
The new query still matches T2, so T2 could be applied again:
– second application: yesterday U //*[date = today() – 1] U //*[date = today() – 1]
– third application: one more copy of //*[date = today() – 1], and so on
Result: infinite recursion
Problem: Recursive Matches (2/2)
Trails may be mutually recursive:
T3: //*.tuple.date → //*.tuple.modified
T10: //*.tuple.modified → //*.tuple.date
Starting from the rewritten query weather (yesterday U //*[date = today() – 1]):
– T3 matches //*[date = today() – 1] and adds //*[modified = today() – 1]
– T10 matches //*[modified = today() – 1] and adds //*[date = today() – 1]
– We again match T3 and enter an infinite loop
Algorithm to solve recursion - MMCA
Multiple Match Coloring Algorithm (MMCA):
– Keep a history of all trails matched or introduced
– Given a set of trails Y, for every trail t in Y:
– Apply t to Q iteratively and color the query tree nodes in Q according to the trails that have already touched those nodes
Multiple Match Coloring Algorithm: example
Trails:
T1: weather → //Temperatures/*
T2: yesterday → //*[date = today() – 1]
T3: //*.tuple.date → //*.tuple.modified
T4: //*.tuple.date → //*.tuple.received
Query: weather yesterday
First level: T1 matches weather and T2 matches yesterday, giving
weather U //Temperatures/* and yesterday U //*[date = today() – 1]
Second level: T3 and T4 match //*[date = today() – 1], giving
//*[date = today() – 1] U //*[modified = today() – 1] U //*[received = today() – 1]
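The coloring idea can be sketched on a flat list of leaves. This is a simplified illustration, not the paper's algorithm: the query tree is reduced to its leaves, a trail is a function that returns a new leaf or `None`, attribute trails are modeled as a textual rename inside a predicate, and the helper names are hypothetical. What it keeps is the core rule that a trail is never applied to a leaf that already carries its color.

```python
def keyword_trail(left, right):
    """Trail whose left side is a keyword; matches by equality."""
    return lambda leaf: right if leaf == left else None


def attribute_trail(a, b):
    """//*.tuple.a -> //*.tuple.b, modeled as renaming a to b in a predicate."""
    return lambda leaf: leaf.replace(f"[{a} ", f"[{b} ") if f"[{a} " in leaf else None


def mmca(leaves, trails, max_levels=10):
    """Return (leaf, colors) pairs after rewriting level by level."""
    colored = [(leaf, frozenset()) for leaf in leaves]
    for _ in range(max_levels):
        next_level, changed = [], False
        for leaf, colors in colored:
            # Only trails that have not colored this leaf may match it.
            hits = [(name, match(leaf)) for name, match in trails.items()
                    if name not in colors and match(leaf) is not None]
            next_level.append((leaf, colors | {name for name, _ in hits}))
            for name, new_leaf in hits:
                next_level.append((new_leaf, colors | {name}))
                changed = True
        colored = next_level
        if not changed:
            break
    return colored


DATE = "//*[date = today() - 1]"
trails = {
    "T1": keyword_trail("weather", "//Temperatures/*"),
    "T2": keyword_trail("yesterday", DATE),
    "T3": attribute_trail("date", "modified"),
    "T4": attribute_trail("date", "received"),
}
result = mmca(["weather", "yesterday"], trails)

# Mutually recursive trails terminate because colors are inherited.
recursive = {
    "T2": keyword_trail("yesterday", DATE),
    "T3": attribute_trail("date", "modified"),
    "T10": attribute_trail("modified", "date"),
}
recursive_result = mmca(["yesterday"], recursive)
```

In this sketch the mutually recursive pair T3 and T10 produces one duplicate date leaf and then stops, since every new leaf inherits the colors of the leaf it was derived from.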
Multiple Match Coloring Algorithm (cont.)
MMCA is exponential in the number of levels
– Any of the trails can be applied to every leaf, and each trail can generate additional leaves
Solution: trail pruning
– Number of levels: punish recursive rewrites
– Top-K trails matched in each level, ranked by probability / certainty / weight
– Other: timeout, progressively compute query results
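The top-K pruning strategy amounts to keeping only the best-ranked matches in each level. A minimal sketch, assuming matches are `(trail name, probability)` pairs; the function name and the probabilities are made up for illustration.

```python
import heapq


def top_k_trails(matches, k):
    """Keep the k most probable trail matches of one level."""
    return heapq.nlargest(k, matches, key=lambda m: m[1])


# Hypothetical matches of one rewrite level with made-up probabilities.
level_matches = [("T1", 0.9), ("T2", 0.2), ("T3", 0.5)]
kept = top_k_trails(level_matches, 2)
```

Bounding both the number of levels and the number of trails kept per level is what keeps the rewritten query from growing exponentially.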
iTrails Evaluation in iMeMex
Main questions in the evaluation
– Quality: top-K precision and recall
– Performance: use of materialization
– Scalability: query-rewrite time vs. number of trails
iTrails Evaluation in iMeMex
Scenario 1: Few high-quality trails
– Closer to information integration use cases
– Obtained real datasets and indexed them
– 18 hand-crafted trails
– 14 hand-crafted queries
Scenario 2: Many low-quality trails
– Closer to search use cases
– Randomly generated up to 10,000 trails and queries with a mutual uniform match probability of 1%
iTrails Evaluation in iMeMex: Scenario 1
Configured iMeMex to act in three modes
– Baseline: Graph / IR search engine
– iTrails: rewrite search queries with trails
– Perfect Query: semantics-aware query
Data: shipped to a central index
(Figure: data sources Laptop, Email Server, Web Server, DB Server, with sizes in MB.)
Trails and queries used in Scenario 1
– Max original tree size: 14
– Max final tree size after applying trails: 35
– Max # of trails applied: 5
Quality: Top-K Precision and Recall (k=20)
Scenario 1: few high-quality trails (18 trails)
(Figure: precision and recall per query; axis label: Queries; series include the perfect query.)
– The search engine misses relevant results
– The search query is partially semantics-aware
– The perfect query always has precision and recall equal to 1
Performance: Use of Materialization
Trail merging adds overhead to query execution
Trail materialization improves performance for almost all queries
Scenario 1: few high-quality trails (18 trails)
Scalability: Query-rewrite Time vs. Number of Trails (Scenario 2)
• Without pruning: exponential growth in the query plan sizes
• Query-rewrite time can be controlled with pruning
Summary
– First framework to explore pay-as-you-go information integration in dataspaces
– iTrails: generic method to model semantic relationships gradually
– iTrails are used to rewrite queries
– Algorithm to control recursive query rewrites
Personal opinion - advantages
The method is incremental
– Integrators can collect statistics, find the most common queries, and define trails for popular queries first.
– Dynamic system: if the popular queries change over time, trails for less popular queries can be disabled to reduce system workload.
Trails can be defined independently by domain experts for each data domain.
Personal opinion - disadvantages
Trails are global: every rewritten query is evaluated over every data source.
– A trail can have a different meaning for different data sources.
For good query result quality, trails have to be defined manually, which is a problem for large systems.
– Solution: use machine learning techniques to improve automatic trail creation.
Overlaps and inconsistencies between trails are possible, since a query returns the union of the results satisfying all trails.
– Solution: trail mining and weighting would be helpful here.
Bibliography
– Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. ETH Zurich, 8092 Zurich, Switzerland. dbis.ethz.ch | iMeMex.org
– Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), David Maier (Portland State University). From Databases to Dataspaces: A New Abstraction for Information Management.
– Wikipedia, dataspace: http://en.wikipedia.org/wiki/Data_Spaces
– Wikipedia, memex: http://en.wikipedia.org/wiki/Vannevar_Bush
– iMeMex information: http://imemex.ethz.ch/
Multiple Match Coloring Algorithm Analysis
Algorithm runtime:
– L: number of leaves in query Q
– M: maximum number of leaves in a query introduced by a trail
– N: number of trails
– d ∈ {1, . . . , N}: number of levels
Theorem: The maximum number of trail applications performed by MMCA and the maximum number of leaves in the merged query tree are both bounded by O(L · M^d)
MMCA run time analysis (O(L · M^d))
– If trail t is matched in query Q, it colors the matched leaf nodes of Q. A subtree containing only these nodes is not matched again by t.
– Worst case: in each level, only one of the trails matches for each of the leaves.
– 1st level: the trail adds M new leaves for each of the L leaves, for a total of LM new leaves plus L old leaves, i.e. L(M+1) leaves and L trail applications.
– 2nd level: t does not match any of these leaves anymore (they were colored in the 1st level). However, all leaves may still be matched by the N − 1 remaining trails. Worst case, again, only one of the trails matches for each of the existing leaf nodes.
– d-th level: this leads to L(M+1)^(d−1) trail applications and a total of L(M+1)^d leaves.
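The worst-case counts in the analysis can be checked numerically. A small sketch, assuming exactly one trail matches every existing leaf in every level; the function name and the sample values L = 2, M = 3, d = 3 are chosen only for illustration.

```python
def worst_case(L, M, d):
    """Per-level (trail applications, total leaves) for d levels."""
    leaves = L
    per_level = []
    for _ in range(d):
        applications = leaves        # one trail application per existing leaf
        leaves = leaves * (M + 1)    # each leaf stays and gains M new leaves
        per_level.append((applications, leaves))
    return per_level


levels = worst_case(L=2, M=3, d=3)
```

For level d the pair equals (L(M+1)^(d−1), L(M+1)^d), so both quantities grow exponentially in the number of levels.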
iDM: Lazily Computed Graph
iDM is not a static model
– Every component of every resource view may be created on demand
– Every resource view may be created on demand
Behind the scenes, obtaining the content may:
– Read a file on the filesystem
– Access a page on the web
– Fetch the data from an index structure
Behind the scenes, obtaining the group may:
– Get the children of a folder in the filesystem
– Look up an edge replica
– Obtain the sections of a document
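On-demand creation of components can be sketched with lazily evaluated properties. This is an illustration only: the class `LazyResourceView` and its loader callbacks are hypothetical stand-ins for whatever reads a file, fetches a page, or consults an index in the real system.

```python
from functools import cached_property


class LazyResourceView:
    """Resource view whose content and group are computed on first access."""

    def __init__(self, name, load_content, load_group):
        self.name = name
        self._load_content = load_content  # hypothetical callback, e.g. reads a file
        self._load_group = load_group      # hypothetical callback, e.g. lists a folder

    @cached_property
    def content(self):
        return self._load_content()

    @cached_property
    def group(self):
        return self._load_group()


calls = []
view = LazyResourceView(
    "notes.txt",
    load_content=lambda: calls.append("read") or "hello",
    load_group=lambda: [],
)
first = view.content    # triggers the loader
second = view.content   # served from the cached value
```

The loader runs once, on first access, so a view that is never inspected costs nothing beyond its name.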
How to implement iDM: Architectural Perspective
(Figure: architecture of the iMeMex PDSMS, placed between an application layer and a data source layer.)
Application layer: Search & Browse, Email, Office Tools, ...
iMeMex PDSMS layers:
– Complex operators (query algebra)
– Indexes & replicas access (warehousing)
– Data source access (mediation)
Components shown: iQL query processor, iDM query processor, data source query processor, physical algebra operators, data cleaning operators, result cache, catalogs, data stores, indexes & replicas, content converters, data source plugins
Data source layer: File System, IMAP, DBMS, ...
Data management approaches
Features              Integration Solution
                      Search               Dataspaces           Data Integration
Integration effort    Low                  Pay-as-you-go        High
Query semantics       Precision / Recall   Precision / Recall   Precise
Need for schema       Schema-never         Schema-later         Schema-first
Canonical form
The canonical form Γ(Q) of a query Q is obtained by decomposing Q into location step separators (LS_SEP) and predicates (P) according to the grammar. Γ(Q) is constructed by the following recursion:
tree :=
  G               if tree is empty
  ω(tree)         if LS_SEP = // and not first location step
  μ(tree)         if LS_SEP = / and not first location step
  tree ∩ σP(G)    otherwise
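One way to read the recursion is as a loop over the location steps of a path. The sketch below is an interpretation, not the paper's definition: it handles only name predicates, skips the selection for `*`, and prints the plan as a string using the ASCII names `omega`, `mu` and `sigma` for the deep unnest, shallow unnest and selection operators and `&` for intersection.

```python
import re


def canonical_form(path):
    """Build a plan string for a PATH made of //name or /name steps."""
    steps = re.findall(r"(//|/)([^/\[\]]+)", path)
    tree = None
    for sep, name in steps:
        if tree is None:
            tree = "G"                      # tree is empty: start from all views
        elif sep == "//":
            tree = f"omega({tree})"         # deep unnest for //
        else:
            tree = f"mu({tree})"            # shallow unnest for /
        if name != "*":
            # otherwise-case: intersect with a selection on the name predicate
            tree = f"({tree} & sigma[name='{name}'](G))"
    return tree


plan = canonical_form("//mike/papers")
```

For `//mike/papers` this yields a selection on mike, a shallow unnest, and a second selection on papers, which matches the intuition that `/` moves to direct children only.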