Why & Where - Technionkanza/dbseminar/2011/WhyWhere.… · OLAP (online analytical processing) Provenance = Lineage = רוקמ ,אצומ ,ןיסחוי תלשוש ,ןיסחוי ןליא
Post on 22-Aug-2020
Why & Where A characterization of Data Provenance
Authors: Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan
Presented by: Tamra Reutlinger
Example
[Figure: relations (Name, Id, address) and (Id, Telephone) feed a view over (Telephone, Name); the view tuple ("John Doe", 1234) is marked "Not Valid!"]
Example
[Figure: the same view over the sources D1 and D2; the tuple ("John Doe", 1234) is traced to (Id, Telephone) and marked "Valid"]
The Two Meanings Of Provenance
Why – why is the tuple in our View?
Where – where did the data "1234" come from? (What path did it go through to get here?)
Importance of Finding Provenance
Sources of different qualities
Scientific databases
On-line monitoring
OLAP (online analytical processing)
Provenance = Lineage (Hebrew terms: source, origin, genealogy, family tree)
Goal
Computing provenance
A syntactic approach
A general data model
Outline:
Introduction to data provenance
A deterministic model
Syntax & operations
Encoding relations
A Query language
Why provenance
Where provenance
Conclusion
Example
[Figure: the view example over the sources D1 and D2, shown with an edge-labeled tree (keys {Id:1}, {Id:2}, {Id:3}) and a "where .. collect .." query]
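Edge-labeled Tree Models For Semi-structured Data
The edges leaving each node carry distinct labels.
[Figure: an edge-labeled tree whose labels include the semi-structured keys {Id:1}, {Id:2}, {Id:3} and the atoms a, b, c, num; its leaves include 3, "a" and 109]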
Syntax & Operations
{x1:y1, x2:y2} denotes a node with an edge labeled x1 leading to y1 and an edge labeled x2 leading to y2.
Paths: x1.x2....xn
Example: the path {Id:1} identifies the value {Name:"Kim", Rate:50};
the path {Id:1}.Rate identifies the value 50.
Path representation of v
The set of all the paths to the constants at the terminal nodes.
{a:{1:c,3:d}} => {(a.1,c),(a.3,d)}
[Figure: example trees, including the tree for {a:{1:c,3:d}}]
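A minimal sketch of the path representation, assuming a value is modeled as a nested Python dict whose leaves are constants (the function name `path_representation` and the tuple encoding of paths are illustrative choices, not part of the slides):

```python
def path_representation(value, prefix=()):
    """Return the set of (path, constant) pairs of a nested-dict tree.

    A path is a tuple of edge labels; a leaf is any non-dict constant.
    """
    if not isinstance(value, dict):
        return {(prefix, value)}
    pairs = set()
    for label, child in value.items():
        pairs |= path_representation(child, prefix + (label,))
    return pairs


# The slide's example: {a:{1:c,3:d}} => {(a.1,c),(a.3,d)}
v = {"a": {1: "c", 3: "d"}}
print(path_representation(v))
```

Each pair holds the labels from the root down to a terminal node, followed by the constant found there.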
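Substructure
w is a substructure of v ($w \sqsubseteq v$) if the path representation of w is a subset of the path representation of v.
$a:\{1:c,\ 3:d\} \sqsubseteq a:\{1:c,\ 2:b,\ 3:d\}$, but $a:\{1:c,\ 3:d\} \not\sqsubseteq b.a:\{1:c,\ 3:d\}$
[Figure: the trees w and v for the first pair]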
Deep Union
v1 U v2 is the union of the path representations of v1 and v2.
[Figure: the trees v1 and v2 and their deep union v1 U v2]
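Deep Union
The union of the path representations may fail to be a partial function (the same path leading to two different constants), in which case the deep union is undefined.
[Figure: v1 contains c:2 and v2 contains c:3, so the union would have to hold both c:2 and c:3]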
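The substructure test and the deep union can be sketched on top of path representations. This is an illustration under assumptions: values are nested dicts, the helper names are hypothetical, and "undefined" is signaled by returning None. Only clashes between constants on the same path are detected.

```python
def paths(value, prefix=()):
    """Path representation: set of (path, constant) pairs of a nested dict."""
    if not isinstance(value, dict):
        return {(prefix, value)}
    result = set()
    for label, child in value.items():
        result |= paths(child, prefix + (label,))
    return result


def is_substructure(w, v):
    """w is a substructure of v if paths(w) is a subset of paths(v)."""
    return paths(w) <= paths(v)


def deep_union(v1, v2):
    """Union of the path representations, or None when undefined."""
    merged = paths(v1) | paths(v2)
    seen = {}
    for path, const in merged:
        # The same path must not lead to two different constants.
        if seen.setdefault(path, const) != const:
            return None
    return merged


print(is_substructure({"a": {1: "c", 3: "d"}}, {"a": {1: "c", 2: "b", 3: "d"}}))
print(deep_union({"a": 1, "c": 2}, {"b": 5, "c": 3}))  # c:2 clashes with c:3
```

Returning the merged path set rather than a rebuilt tree keeps the sketch short; rebuilding the nested dict from the pairs is a straightforward extra step.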
Encoding of Relations
Composers
  {name:"G.F. Handel"}: born 1685, period "baroque"
  {name:"J.S. Bach"}: born 1685, period "baroque"
  {name:"W.A. Mozart"}: born 1756, period "classical"
Works
  {name:"J.S. Bach"}, {opus:"BMV82"}: title "I have enough"
  {name:"J.S. Bach"}, {opus:"BMV552"}: title "-"
  {name:"G.F. Handel"}, {opus:"HMV19"}: title "art thou troubled?"
Encoding of Relations
Tree levels, from the root down: Relation, Key, Tuple
Example
[Figure: the view example over D1 and D2 again, with a "?" where the query that defines the view belongs]
A Query Language
A general syntactic form:
where p1 ∈ e1,
      ..,
      pn ∈ en,
      (condition)
collect e
Example
where Composers.x.born:u ∈ D,
      u < 1700
collect {year:u}:C
Output: {year:1685}:C (x2)
Example
Q =
where Composers.{name:x}.{born:u, period:v} ∈ D,
      Works.{{name:x}.opus:w}:y ∈ D
collect {name:x}.{born:u, {opus:w}:y}
Output:
{
{name:"J.S. Bach"}.{born:1685, {opus:"BMV82"}:"I have enough"},
{name:"J.S. Bach"}.{born:1685, {opus:"BMV552"}:"-"},
{name:"G.F. Handel"}.{born:1685, {opus:"HMV19"}:"Art thou troubled?"}
}
Example
[Figure: the Composers and Works trees next to the output tree of Q, which keeps born:1685 for {name:"G.F. Handel"} and {name:"J.S. Bach"} together with the titles of {opus:"BMV82"}, {opus:"BMV552"} and {opus:"HMV19"}]
"Collect.." - How?
For each pi and each assignment of the variables in pi, evaluate the condition.
True? Add the value of e to the output.
"Union" together the output values.
The output expression e is built from constants (c), variables (x) and tree constructors {e : e}.
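The evaluation steps above can be sketched for one concrete query of the shape "where Composers.x.born:u ∈ D, u < 1700 collect {year:u}:C". The encoding is an assumption made for illustration: the database is a dict from (relation, key, attribute) paths to constants, composer names stand in for the {name:..} keys, the comparison `<` is assumed, and the function name `born_before` is hypothetical.

```python
# Toy database in path form: (relation, key, attribute) -> constant.
D = {
    ("Composers", "G.F. Handel", "born"): 1685,
    ("Composers", "G.F. Handel", "period"): "baroque",
    ("Composers", "J.S. Bach", "born"): 1685,
    ("Composers", "J.S. Bach", "period"): "baroque",
    ("Composers", "W.A. Mozart", "born"): 1756,
    ("Composers", "W.A. Mozart", "period"): "classical",
}


def born_before(db, limit):
    """Evaluate: where Composers.x.born:u in db, u < limit collect {year:u}:C."""
    derivations = []
    for path, const in db.items():
        # Match the pattern Composers.x.born:u, binding x and u.
        if len(path) == 3 and path[0] == "Composers" and path[2] == "born":
            x, u = path[1], const
            if u < limit:  # the condition
                # collect {year:u}:C, encoded as a (path, constant) pair
                derivations.append(((("year", u),), "C"))
    # "Union" the collected values: identical values merge into one.
    return derivations, set(derivations)


derivations, output = born_before(D, 1700)
print(len(derivations), output)
```

Keeping the list of derivations next to the merged output shows why one output value can have more than one justification in the database, which is what the later witness slides build on.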
Example
?
where Emps.Id:x.salary:y ∈ D,
      Emps.Id:y.bonus:z ∈ D
collect Id:x.new_salary:y
Well-Formed Queries
Q is well-formed if:
a) No pi is a single variable
b) Each ei is either a (nested) query or an expression that doesn’t involve a query
c) Each comparison is between variables or between variables and constants only.
Soundness of rewrite rules
[Example not recoverable from the source: a query of the form "where R.x.y:z ∈ D, .. collect x.y:z" and a condition comparing u with 1700]
Well-Defined Queries
A query may be undefined on a certain input.
Q is Well-Defined if it is defined on any input.
- For the rest of the presentation, we will consider only queries that are both well-formed and well-defined.
Singular Expression
A single path terminated by a constant or variable:
$e \neq (e_1 \cup e_2)$ for any non-empty and distinct expressions $e_1$ and $e_2$.
[Figure: an edge-labeled tree with keys {Id:1}, {Id:2}, {Id:3}]
Normal Form
Q = Q1 U .. U Qn and each Qi =
(where sp1 ∈ D1, .., spn ∈ Dn, condition) (collect se)
spi and se - singular pattern and singular expression, respectively.
Di - database constant
condition - Boolean predicate on the variables of the query.
Strong Normalization
The rewrite system R is strongly normalizing.
Therefore: starting from a well-formed query, any sequence of applications of the rewrite rules reaches a normal form in a finite number of steps!
Outline:
Introduction to data provenance
A deterministic model
Why provenance (syntactic characterization and invariance under query rewriting)
Where provenance
Conclusion
Why Is The Tuple In Our View?
[Figure: the view example over D1 and D2, with the tuple ("John Doe", 1234) marked "Valid", an edge-labeled tree with keys {Id:1}, {Id:2}, {Id:3}, and a "where .. collect .." query]
Witnesses
The collection of values taken from D that proves an output.
s is a witness for t with respect to Q and D if:
$t \sqsubseteq Q(s)$ and $s \sqsubseteq D$
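A small sketch of the witness condition, assuming values and databases are sets of (path, constant) pairs so that "substructure" becomes the subset test. The query `Q`, the toy database `D` and the name `is_witness` are illustrative assumptions, not definitions from the slides.

```python
def Q(db):
    """Toy query: where Composers.x.born:u in db, u < 1700 collect {year:u}:C."""
    return {
        ((("year", u),), "C")
        for (path, u) in db
        if len(path) == 3 and path[0] == "Composers" and path[2] == "born" and u < 1700
    }


def is_witness(s, t, query, db):
    """s is a witness for t w.r.t. query and db if t is in query(s) and s is in db."""
    return t <= query(s) and s <= db


D = {
    (("Composers", "G.F. Handel", "born"), 1685),
    (("Composers", "G.F. Handel", "period"), "baroque"),
    (("Composers", "W.A. Mozart", "born"), 1756),
}
t = {((("year", 1685),), "C")}

handel = {(("Composers", "G.F. Handel", "born"), 1685)}
mozart = {(("Composers", "W.A. Mozart", "born"), 1756)}
print(is_witness(handel, t, Q, D), is_witness(mozart, t, Q, D))
```

The Handel fragment alone reproduces t, so it is a witness; the Mozart fragment is part of D but the query produces nothing from it.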
Example
Q =
where Composers.{name:x}.{born:u, period:v} ∈ D,
      Works.{{name:x}.opus:w}:y ∈ D
collect {name:x}.{born:u, {opus:w}:y}
Output:
{
{name:"J.S. Bach"}.{born:1685, {opus:"BMV82"}:"I have enough"},
{name:"J.S. Bach"}.{born:1685, {opus:"BMV552"}:"-"},
{name:"G.F. Handel"}.{born:1685, {opus:"HMV19"}:"Art thou troubled?"}
}
Example
[Figure: the Composers and Works trees next to the output tree of Q]
Example - {name:"G.F. Handel"}.born:1685
[Figure: the Composers and Works trees next to the output tree of Q]
Witnesses
A witness for {name:"G.F. Handel"}.born:1685:
{Composers.{name:"G.F. Handel"}.{born:1685, period:"baroque"},
Works.{{name:"G.F. Handel"}.opus:"HMV19"}.title:"Art thou troubled?"}
Example – Witness Basis
[Figure: the Composers and Works trees, and below them the witness: the {name:"G.F. Handel"} subtree of Composers (born 1685, period "baroque") and the {name:"G.F. Handel"} subtree of Works ({opus:"HMV19"}, title "art thou troubled?")]
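Witness Basis - $W_{Q,D}(t)$
The set of all witnesses for a value t in Q(D).
For t = t1 U t2: $W_{Q,D}(t)$ is assembled from $W_{Q,D}(t_1)$ and $W_{Q,D}(t_2)$ by union.
For Q = Q1 U Q2, where Q(D) combines Q1(D) and Q2(D): $W_{Q,D}(t) = W_{Q_1,D}(t) \cup W_{Q_2,D}(t)$.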
Witness Basis
Lemma 1: If Q ~> Q' via the rewrite system R, then for any value t in the output of Q(D), $W_{Q,D}(t) = W_{Q',D}(t)$.
[Diagram: a well-formed Q rewrites (~>) to a normal form Q'; the witness bases of t under Q and under Q' coincide]
Algorithm: Why(t, Qi, D)
Notation:
Qi = (where p1 ∈ e1, .., pn ∈ en, condition) (collect ei)
[The rest of the slide's notation, which derives a query of the form "(where ..) (collect .. : C)" from Qi and tests ei against t, is not recoverable from the source.]
Minimal Witness Basis
A witness for a value is invariant under all equivalent queries but the witness basis is not.
The minimal witness basis is invariant under certain queries
Minimal Witness, Minimal Witness Basis
s is a minimal witness for t if: $\forall s' \sqsubset s,\ t \not\sqsubseteq Q(s')$.
$M_{Q,D}(t)$ - the minimal witness basis for t, is a maximal subset of $W_{Q,D}(t)$ such that:
$\forall m \in M_{Q,D}(t),\ \nexists w \in W_{Q,D}(t);\ w \sqsubset m$.
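As a sketch of the definition above, the minimal witness basis can be filtered out of a given witness basis by dropping every witness that strictly contains another one. Witnesses are assumed to be sets of path strings, and the function name is hypothetical.

```python
def minimal_witness_basis(witness_basis):
    """Keep the witnesses m for which no witness w is a strict subset of m."""
    basis = [frozenset(w) for w in witness_basis]
    return {m for m in basis if not any(w < m for w in basis)}


# Toy witness basis: the second witness strictly contains the first.
W = [
    {"Composers.Handel.born:1685"},
    {"Composers.Handel.born:1685", "Composers.Handel.period:baroque"},
    {"Composers.Bach.born:1685"},
]
print(minimal_witness_basis(W))
```

The pairwise comparison is quadratic in the number of witnesses, which is fine for a sketch; the slides do not prescribe how the minimal basis is computed.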
Example - 1685
[Figure: the Composers and Works trees, and below them the {name:"G.F. Handel"} subtrees of Composers (born 1685, period "baroque") and Works ({opus:"HMV19"}, title "art thou troubled?")]
Example - 1685
[Figure: the same trees as on the previous slide, with the note "Not a proof tree for the value!" on the {name:"G.F. Handel"} subtrees]
Invariance of Minimal Witness Basis under Equivalent Queries
Q, Q' - two equivalent well-formed queries
t is in Q(D) and Q'(D)
Then: $M_{Q,D}(t) = M_{Q',D}(t)$
Cascaded Witnesses (Query Composition)
Unnesting of Witnesses
Let D = D1 U .. U Dn and V = V(D). For a value t in Q(D,V), where Q' is the rewritten query via our rewrite system R in which view V has been "composed out":
$$W_{Q',D}(t) = \{\, w \cup w' \mid (w \cup v') \in W_{Q,\{D,V(D)\}}(t),\ v' \text{ is the value taken from view } V(D),\ w' \in W_{V,D}(v') \,\}$$
Outline:
Introduction to data provenance
A deterministic model
Why provenance
Where provenance (problems defining,
invariance under query rewriting)
Conclusion
Reminder:
So far we have looked at what pieces of input data validate the existence of an output value. (why provenance)
We now focus on identifying what pieces of input data helped create values that appear in the output. (where provenance)
Example - 1685
[Figure: the witness basis for 1685, i.e. the {name:"G.F. Handel"} subtrees of Composers (born 1685, period "baroque") and Works ({opus:"HMV19"}, title "art thou troubled?"), shown next to the single path Composers.{name:"G.F. Handel"}.born]
Where Provenance
There are many difficulties involved in formalizing this
Invariance Over Equivalent Queries
Looking for employees with a salary of $50:
where Emps.{Id:x}.salary:$50 ∈ D collect {Id:x}.salary:$50
where Emps.{Id:x}.salary:y ∈ D, y = $50 collect {Id:x}.salary:y
The two queries are equivalent, but in the first the output $50 is a constant written in the query, while in the second it is copied from D through y.
What is the where-provenance of $50?
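The problem can be made concrete with a toy illustration. Everything here is an assumption for the sake of the example: the Emps data, the function names, and the idea of tagging each output value with an "origin". Both functions return the same salaries, but a naive origin tag differs.

```python
# Toy Emps relation: employee id -> salary.
EMPS = {1: 50, 2: 70, 3: 50}


def query_with_constant(emps):
    """where Emps.{Id:x}.salary:$50 in D collect {Id:x}.salary:$50"""
    # The output 50 is the constant written in the query text.
    return {x: (50, "constant in query") for x, salary in emps.items() if salary == 50}


def query_with_variable(emps):
    """where Emps.{Id:x}.salary:y in D, y = $50 collect {Id:x}.salary:y"""
    # The output value is copied from the database through y.
    return {x: (y, ("Emps", x, "salary")) for x, y in emps.items() if y == 50}


a = query_with_constant(EMPS)
b = query_with_variable(EMPS)
print({x: v for x, (v, _) in a.items()} == {x: v for x, (v, _) in b.items()})
print(a[1][1], b[1][1])
```

Equal outputs with unequal origin tags is exactly why where-provenance, unlike the query result, is not automatically invariant under query equivalence.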
Multiple Pieces of Data
where Emps.{Id:x}.salary:y ∈ D, Emps.{Id:x}.salary:z ∈ D, Emps.{Id:x}.bonus:z ∈ D
collect {Id:x}.new_salary:y
where Emps.{Id:x}.salary:y ∈ D, Emps.{Id:x}.bonus:y ∈ D
collect {Id:x}.new_salary:y
New_salary is tracked by y
New_salary is tracked by y and z?
Nested Queries
D = {R.1.2:3, S.1.2:3}
where R.x.y:z ∈ D, S.x.y:z ∈ D collect x.y:z
Output: 1.2:3
Where provenance: {R.1:2, S.1:2}
where R.x.y:z ∈ D, S.t.u ∈ D, t:u collect {x.y:z, t:u}
Output: {1.2:3, 1.2:3} => u = y:z
Where provenance: {R.1:2, S.1:2}
where R.x.y:z ∈ D collect x.y:z
Traceable Queries
A restricted class of queries, for which where-provenance is preserved under rewriting.
Example - {name:"G.F. Handel"}.born:1685
[Figure: the witness basis (the {name:"G.F. Handel"} subtrees of Composers and Works) contrasted with the where provenance (the single path Composers.{name:"G.F. Handel"}.born)]
Derivation Basis (Where Provenance)
The derivation basis for l:v finds a variable x in the output expression that will generate v.
[Figure: the Composers tree]
where Composers.x.born:u ∈ D,
      u < 1700
collect {year:u}:C
Output: {year:1685}:C (x2)
Where(l:v, Q, D)
Computes the derivation basis of l:v.
The "collect" clause of the new query returns two things: the patterns, and the paths pointing to x in the "where" clause of Q.
$$Where(l:v, Q, D) = \Gamma_{Q,D}(l:v) = \{([[p_0]] \cup .. \cup [[p_n]],\ S)\}$$
Derivation Basis
Emps
  {Id:1}: bonus 100, salary 2000
  {Id:2}: bonus 17, salary 1700
  {Id:3}: bonus 300, salary 1900
where Emps.Id:x.salary:y ∈ D,    (p1)
      Emps.Id:x.bonus:y ∈ D      (p2)
collect Id:x.new_salary:y
Output: {Id:1}.new_salary:2100, {Id:2}.new_salary:1717, {Id:3}.new_salary:2200
[[p1]] = Emps.Id:1.salary:2000 ∈ D
$\Gamma_{Q,D}(l:v) = \{([[p_1]] \cup [[p_2]],\ Emps.Id{:}1.\{salary, bonus\})\}$
Derivation Basis - $\Gamma_{Q,D}(l:v)$
For Q = Q1 U Q2, where Q(D) combines Q1(D) and Q2(D): $\Gamma_{Q,D}(l:v) = \Gamma_{Q_1,D}(l:v) \cup \Gamma_{Q_2,D}(l:v)$
v is an atomic value
Q Is Traceable If:
1) each pi in the query matches either against some database constant or against a sub-query
2) every sub-query is a view which does not share any variables with the outer scope
3) only a singular pattern is allowed to match against a sub-query
4) the pattern and output expression of the sub-query consist of a sequence of distinct variables and have the same length.
Propositions
Proposition 1: If Q is traceable and Q ~> Q', then Q' is traceable.
Proposition 2: If Q is traceable and Q ~> Q', then $\Gamma_{Q,D}(l:v) = \Gamma_{Q',D}(l:v)$ for any l:v in the output of Q(D).
Outline:
Introduction to data provenance
A deterministic model
Why provenance
Where provenance
Conclusion
Why Is The Tuple In Our View?
[Figure: the view example over D1 and D2, with the tuple ("John Doe", 1234) marked "Valid", an edge-labeled tree with keys {Id:1}, {Id:2}, {Id:3}, and a "where .. collect .." query]
Conclusions
o Describing and understanding provenance of data
o Two perspectives: Why is a piece of data in the output? Where did a piece of data come from?
o A system of rewrite rules where why-provenance is preserved over the class of well-defined queries and where-provenance is preserved over the class of traceable queries.
Thank you for listening!