Why & Where - Technionkanza/dbseminar/2011/WhyWhere.…


Why & Where A characterization of Data Provenance

Authors: Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan

Presented by: Tamra Reutlinger

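Example

[Figure: two source relations, (Name, Id, address) and (Id, Telephone), feed a view (Telephone, Name). The view tuple ("John Doe", 1234) is marked "Not Valid!".]

Example

[Figure: the same sources, labeled D1 and D2, and the same view. With an (Id, Telephone) tuple shown, the view tuple ("John Doe", 1234) is marked "Valid".]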

The Two Meanings Of Provenance

Why – why is the tuple in our View?

Where – where did the data “1234” come from? (What path did it take to get here?)

Importance of Finding Provenance

Sources of different qualities

Scientific databases

On-line monitoring

OLAP (online analytical processing)

Provenance = Lineage = origin, source, pedigree, family tree (Hebrew: מקור, מוצא, שושלת יוחסין, אילן יוחסין)

Goal

Computing provenance

A syntactic approach

A general data model

Outline:

Introduction to data provenance

A deterministic model

Syntax & operations

Encoding relations

A Query language

Why provenance

Where provenance

Conclusion

Example

[Figure: the view example again (sources D1 and D2, view tuple ("John Doe", 1234), marked Valid), next to an edge-labeled tree and a query of the form "where .. collect ..".]

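Edge-labeled Tree Models For Semi-structured Data

The labels on the outgoing edges of each node are distinct.

[Figure: semi-structured data drawn as an edge-labeled tree, with edge labels such as {Id:1}, {Id:2}, {Id:3}, a, b, c, num and leaf values such as 3, "a", 109.]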

Syntax & Operations

[Figure: the value {x1:y1, x2:y2} drawn as a node with two outgoing edges, labeled x1 and x2, leading to y1 and y2.]

Paths: x1.x2....xn

Example: the path {Id:1} identifies the value {Name:"Kim", Rate:50}

the path {Id:1}.Rate identifies the value 50
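A minimal Python sketch of following a path through such a value. It assumes values are nested dicts and, because Python dicts cannot be used as dict keys, it writes a label like {Id:1} as the tuple ("Id", 1). The names `db` and `follow` are illustrative only and do not come from the slides.

```python
# Assumption: a value is a nested dict; a label such as {Id:1} is written
# as the tuple ("Id", 1) because dicts are not hashable in Python.
db = {
    ("Id", 1): {"Name": "Kim", "Rate": 50},
}

def follow(value, path):
    """Walk the edge labels in `path` from the root and return the value reached."""
    for label in path:
        value = value[label]
    return value

print(follow(db, [("Id", 1)]))          # the whole record stored under {Id:1}
print(follow(db, [("Id", 1), "Rate"]))  # 50
```

A path that does not exist in the value raises `KeyError` in this sketch.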

Path representation of v

The set of all the paths to the constants at the terminal nodes.

{a:{1:c,3:d}} => {(a.1,c),(a.3,d)}
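The path representation can be computed by a short recursion. The sketch below assumes values are nested Python dicts whose leaves are constants; the function name `path_representation` and the use of tuples for paths are choices made here, not notation from the slides.

```python
def path_representation(value, prefix=()):
    """Return the set of (path, constant) pairs of a nested-dict value.

    A path is a tuple of edge labels from the root to a terminal node.
    """
    if not isinstance(value, dict):
        # A terminal node: the path so far leads to this constant.
        return {(prefix, value)}
    pairs = set()
    for label, child in value.items():
        pairs |= path_representation(child, prefix + (label,))
    return pairs

v = {"a": {1: "c", 3: "d"}}
print(path_representation(v))  # the pairs (a.1, c) and (a.3, d)
```

An empty dict contributes no pairs in this sketch, since it has no terminal constant.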

Substructure

w is a substructure of v ($w \sqsubseteq v$) if the path representation of w is a subset of the path representation of v.

$a:\{1:c, 3:d\} \sqsubseteq a:\{1:c, 2:b, 3:d\}$, but $a:\{1:c, 3:d\} \not\sqsubseteq b.a:\{1:c, 3:d\}$
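A hedged sketch of the substructure test on nested Python dicts. It assumes every leaf is a constant (no empty dicts), in which case checking labels recursively agrees with the subset test on path representations. The name `is_substructure` is illustrative.

```python
def is_substructure(w, v):
    """True if every path-to-constant of w also occurs in v.

    Assumption: leaves are constants, so comparing the trees label by label
    is equivalent to comparing the two path representations as sets.
    """
    if isinstance(w, dict) and isinstance(v, dict):
        return all(
            label in v and is_substructure(child, v[label])
            for label, child in w.items()
        )
    return w == v

w = {"a": {1: "c", 3: "d"}}
print(is_substructure(w, {"a": {1: "c", 2: "b", 3: "d"}}))  # True
print(is_substructure(w, {"b": {"a": {1: "c", 3: "d"}}}))   # False: paths start with b
```

The second call shows that a substructure is not the same as a subtree: paths are compared from the root.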

Deep Union

v1 ∪ v2 is the union of the path representations of v1 and v2.

[Figure: three trees, v1, v1 ∪ v2 and v2, over the edge labels a, b, c, d, e and the leaf values 1, 2, 4, 5.]

Deep Union

The result may not be a partial function, in which case the deep union is undefined.

[Figure: v1 and v2 disagree on the value reached through c, so the union cannot decide: c:2 or c:3?]
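Deep union, including the undefined case, can be sketched as a recursive merge of nested dicts. This is an illustration under the assumption that values are Python dicts with constant leaves; the example values are made up for the sketch rather than read off the slide drawings, and `deep_union` is a name chosen here.

```python
def deep_union(v1, v2):
    """Merge two nested-dict values; raise ValueError where they disagree."""
    if isinstance(v1, dict) and isinstance(v2, dict):
        merged = dict(v1)
        for label, child in v2.items():
            # Shared labels are merged recursively, new labels are copied over.
            merged[label] = deep_union(v1[label], child) if label in v1 else child
        return merged
    if v1 == v2:
        return v1
    # Same path, two different values: the union is not a partial function.
    raise ValueError(f"deep union undefined: {v1!r} conflicts with {v2!r}")

print(deep_union({"a": 1, "b": {"c": 2}}, {"b": {"d": 4}, "e": 5}))
# {'a': 1, 'b': {'c': 2, 'd': 4}, 'e': 5}

try:
    deep_union({"a": 1, "b": {"c": 2}}, {"b": {"c": 3}, "e": 5})
except ValueError as err:
    print(err)  # c would have to be both 2 and 3
```

Raising an exception is one way to model "undefined"; returning a sentinel such as `None` would work as well.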


Encoding of Relations

Composers

{name:"G.F. Handel"}: born 1685, period "baroque"

{name:"J.S. Bach"}: born 1685, period "baroque"

{name:"W.A. Mozart"}: born 1756, period "classical"

Works

{name:"J.S. Bach"}, {Opus:"BMV82"}: title "I have enough"

{name:"J.S. Bach"}, {Opus:"BMV552"}: title "-"

{name:"G.F. Handel"}, {Opus:"HMV19"}: title "art thou troubled?"

Encoding of Relations

[Figure: the encoded tree read top-down in three levels: relation, key, tuple.]
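The three levels (relation, key, tuple) can be sketched in Python as follows. The function `encode_relation` and its parameters are hypothetical, and a key label such as {name:"J.S. Bach"} is written as a tuple of (attribute, value) pairs so that it can serve as a dict key.

```python
def encode_relation(name, rows, key):
    """Encode a list of row dicts as {relation: {key label: {attribute: value}}}."""
    encoded = {}
    for row in rows:
        # The key attributes become the edge label, e.g. (("name", "J.S. Bach"),).
        key_label = tuple((k, row[k]) for k in key)
        # The remaining attributes form the tuple stored below that edge.
        encoded[key_label] = {a: val for a, val in row.items() if a not in key}
    return {name: encoded}

composers = encode_relation(
    "Composers",
    [
        {"name": "G.F. Handel", "born": 1685, "period": "baroque"},
        {"name": "J.S. Bach", "born": 1685, "period": "baroque"},
        {"name": "W.A. Mozart", "born": 1756, "period": "classical"},
    ],
    key=["name"],
)
print(composers["Composers"][(("name", "W.A. Mozart"),)])
# {'born': 1756, 'period': 'classical'}
```

Because the key values are edge labels, two rows with the same key would collide, which matches the requirement that the outgoing labels of a node are distinct.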


A Query Language

A general syntactic form:

where (p1 ∈ e1,
       ...,
       pn ∈ en,
       condition)
collect e

Example

where Composers.x.born:u ∈ D,
      u < 1700
collect {year:u}:C

Output: {year:1685}:C (produced twice, x2)

Example

Q =

where (Composers.{name:x}.{born:u, period:v} ∈ D,
       Works.{{name:x}.opus:w}:y ∈ D)
collect {name:x}.{born:u, {opus:w}:y}

Output:

{
{name:"J.S. Bach"}.{born:1685, {opus:"BMV82"}:"I have enough"},
{name:"J.S. Bach"}.{born:1685, {opus:"BMV552"}:"-"},
{name:"G.F. Handel"}.{born:1685, {opus:"HMV19"}:"Art thou troubled?"}
}

Example

[Figure: the Composers and Works trees (input) next to the tree of the query result: {name:"G.F Handel"} and {name:"J.S Bach"}, each with Born 1685 and their opus entries {Opus:"BMV82"}, {Opus:"BMV552"}, {Opus:"HMV19"}.]

"Collect.." - How?

For each pi and each assignment of the variables in pi, evaluate the condition.

If it is true, add the value of e to the output.

"Union" together the output values.

In "collect e", the expression e is built from constants (c) and variables (x), for example as {e : e}.
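The evaluation steps above can be sketched for one concrete query, "composers born before a given year, collected as {year:u}:C". This is a hand-written Python loop under the assumption that Composers is a dict from composer name to record; it is not a general evaluator for the query language, and the name `born_before` is made up for the example.

```python
composers = {
    "G.F. Handel": {"born": 1685, "period": "baroque"},
    "J.S. Bach": {"born": 1685, "period": "baroque"},
    "W.A. Mozart": {"born": 1756, "period": "classical"},
}

def born_before(db, limit):
    """Evaluate: where Composers.x.born:u in D, u < limit collect {year:u}:C."""
    output = {}
    for x, record in db.items():       # every assignment of x ...
        u = record["born"]             # ... and of u that matches x.born:u
        if u < limit:                  # evaluate the condition
            # Add the value of e; identical outputs merge when unioned.
            output[("year", u)] = "C"
    return output

print(born_before(composers, 1700))  # {('year', 1685): 'C'}
```

Handel and Bach both produce {year:1685}:C, so the union of the two outputs contains that value once.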

Example

?

where Emps.Id:x.salary:y ∈ D,
      Emps.Id:y.bonus:z ∈ D
collect Id:x.new_salary:y

Well-Formed Queries

Q is well-formed if:

a) No pi is a single variable

b) Each ei is either a (nested) query or an expression that doesn’t involve a query

c) Each comparison is between variables or between variables and constants only.

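These conditions give soundness of the rewrite rules.

[Figure: annotated example patterns and conditions, including the single-variable pattern x ∈ D, a query over R.x.y:z and S.t.u that collects x.y:z, and a comparison of u with 1700.]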

Well-Defined Queries

A query may be undefined on a certain input.

Q is Well-Defined if it is defined on any input.

- For the rest of the presentation, we will consider only queries that are both well-formed and well-defined.

Singular Expression

A single path terminated by a constant or variable, i.e. $e \neq (e_1 \cup e_2)$ for any non-empty and distinct expressions e1 and e2.

Normal Form

Q = Q1 ∪ .. ∪ Qn and each Qi =

where (sp1 ∈ D1, .., spn ∈ Dn, condition) collect (se)

spi and se - singular pattern and singular expression respectively.

Di - database constant

condition - Boolean predicate on the variables of the query.

Strong Normalization

The rewrite system R is strongly normalizing.

Therefore: any sequence of applications of rewrite rules takes a well-formed query to a normal form in a finite number of steps!

Outline:

Introduction to data provenance

A deterministic model

Why provenance (syntactic characterization and invariance under query rewriting)

Where provenance

Conclusion

Why Is The Tuple In Our View?


Witnesses

The collection of values taken from D that proves an output.

s is a witness for t with respect to Q and D if:

$t \sqsubseteq Q(s)$ and $s \sqsubseteq D$


Example - {name:"G.F. Handel"}.born:1685

Witnesses

A witness for {name:"G.F. Handel"}.born:1685:

{Composers.{name:"G.F. Handel"}.{born:1685, period:"baroque"},

Works.{{name:"G.F. Handel"}.opus:"HMV19"}.title:"Art thou troubled?"}

Example – Witness Basis

[Figure: the Composers and Works trees with the witness drawn separately: the Handel entry of Composers (Born 1685, period "baroque") and the Handel entry of Works ({Opus:"HMV19"}, title "art thou troubled?").]

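Witness Basis - $W_{Q,D}(t)$

The set of all witnesses for a value t in Q(D).

[Figure: for t = t1 ∪ t2, $W_{Q,D}(t)$ is assembled from $W_{Q,D}(t_1)$ and $W_{Q,D}(t_2)$; for Q = Q1 ∪ Q2, it is assembled from $W_{Q_1,D}(t)$ and $W_{Q_2,D}(t)$, taken over Q1(D) and Q2(D).]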

Witness Basis

Lemma 1: If Q ~> Q' via the rewrite system R, then for any value t in the output of Q(D), $W_{Q,D}(t) = W_{Q',D}(t)$.

In particular, for a well-formed Q and its normal form Q' (Q ~> Q'), both queries have the same witness basis for t.

Algorithm: Why(t, Qi, D)

Notation:

Qi = where (p1 ∈ e1, .., pn ∈ en, condition) collect e

The algorithm builds a new query Q' of the form "where (..) collect (..) : C" from the where clause p1 ∈ e1, .., pn ∈ en, condition of Qi, and evaluates it on D for the assignments under which e matches t.

Minimal Witness Basis

A witness for a value is invariant under all equivalent queries but the witness basis is not.

The minimal witness basis is invariant under certain queries

Minimal Witness, Minimal Witness Basis

s is a minimal witness for t if:

$\forall s' \sqsubset s,\ t \not\sqsubseteq Q(s')$

$M_{Q,D}(t)$ - the minimal witness basis for t, is a maximal subset of $W_{Q,D}(t)$ such that:

$\forall m \in M_{Q,D}(t),\ \nexists w \in W_{Q,D}(t);\ w \sqsubset m$
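Given a witness basis, the minimal witness basis keeps exactly the witnesses that have no smaller witness below them. The Python sketch below assumes each witness is represented by its path representation as a `frozenset` of path strings, so that the strict substructure order becomes the proper-subset test `<`. The path strings are hypothetical placeholders.

```python
def minimal_witness_basis(witness_basis):
    """Keep the witnesses m for which no other witness w is a proper subset of m."""
    return {
        m for m in witness_basis
        if not any(w < m for w in witness_basis)  # `<` is proper subset on frozensets
    }

# Hypothetical witness basis: three witnesses given as sets of paths.
W = {
    frozenset({"R.1"}),
    frozenset({"R.1", "S.2"}),   # contains the witness {"R.1"}, so it is not minimal
    frozenset({"S.3"}),
}
print(minimal_witness_basis(W))  # the witnesses {"R.1"} and {"S.3"}
```

The quadratic scan is fine for a sketch; it is not meant as an efficient algorithm.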

Example - 1685

[Figure: the Composers and Works trees next to the Handel entries of both relations (Composers: Born 1685, period "baroque"; Works: {Opus:"HMV19"}, title "art thou troubled?"). A second slide repeats the drawing with the remark: not a proof tree for the value!]

Invariance of Minimal Witness Basis under Equivalent Queries

Q, Q' - two equivalent well-formed queries

t is in Q(D) and Q'(D)

Then: $M_{Q,D}(t) = M_{Q',D}(t)$

Cascaded Witnesses (Query Composition)

Unnesting of Witnesses

D = D1 ∪ .. ∪ Dn, V = V(D). For a value t in Q(D,V),

$W_{Q',D}(t) = \{\, w \cup w' \mid (w \cup v') \in W_{Q,\{D,V(D)\}}(t),\ v' \text{ is the value taken from view } V(D),\ w' \in W_{V,D}(v') \,\}$

where Q' is the rewritten query via our rewrite system R in which view V has been "composed out".

Outline:

Introduction to data provenance

A deterministic model

Why provenance

Where provenance (problems defining, invariance under query rewriting)

Conclusion

Reminder:

So far we have looked at what pieces of input data validate the existence of an output value. (why provenance)

We now focus on identifying what pieces of input data helped create values that appear in the output. (where provenance)

Example - 1685

[Figure: for the output value 1685, the witness basis (the Handel entries of Composers and Works) is set against the where provenance (the Born edge under {name:"G.F Handel"} in Composers).]

Where provenance: there are many difficulties involved in formalizing this.

Invariance Over Equivalent Queries

Looking for employees with a salary of $50

where Emps.{Id:x}.salary:$50 ∈ D
collect {Id:x}.salary:$50

where Emps.{Id:x}.salary:y ∈ D, y = $50
collect {Id:x}.salary:y

What is the where-provenance of $50? In the first query the output value is a constant written in the query; in the second it is copied from D through y.

Multiple Pieces of Data

where Emps.{Id:x}.salary:y ∈ D, Emps.{Id:x}.salary:z ∈ D, Emps.{Id:x}.bonus:z ∈ D
collect {Id:x}.new_salary:y

where Emps.{Id:x}.salary:y ∈ D, Emps.{Id:x}.bonus:y ∈ D
collect {Id:x}.new_salary:y

New_salary is tracked by y

New_salary is tracked by y and z?

Nested Queries

D = {R.1.2:3, S.1.2:3}

where R.x.y:z ∈ D, S.x.y:z ∈ D collect x.y:z

Output: 1.2:3

Where provenance: {R.1:2, S.1:2}

where R.x.y:z ∈ D, S.t.u ∈ D, t:u collect {x.y:z, t:u}

=> u = y:z

{1.2:3, 1.2:3}

Where provenance: {R.1:2, S.1:2}

Nested query: where R.x.y:z ∈ D collect x.y:z

Traceable Queries

A restricted class of queries, for which where-provenance is preserved under rewriting.

Example - {name:"G.F. Handel"}.born:1685

[Figure: the witness basis (the Handel entries of Composers and Works) set against the where provenance (the Born edge under {name:"G.F Handel"} in Composers), for the output {name:"G.F. Handel"}.born:1685.]

Derivation Basis (Where Provenance)

The derivation basis for l:v finds a variable x in the output expression that will generate v.

[Figure: the Composers tree: {name:"G.F Handel"} and {name:"J.S Bach"} with Born 1685 and period "baroque", {name:"W.A Mozart"} with Born 1756 and period "classical".]

where Composers.x.born:u ∈ D,
      u < 1700
collect {year:u}:C

Output: {year:1685}:C (produced twice, x2)

Where(l:v,Q,D)

Computes the derivation basis of l:v.

The "collect" clause of the new query returns two things:

the patterns

the paths pointing to x in the "where" clause of Q

$Where(l:v, Q, D) = \Gamma_{Q,D}(l:v) = \{([[p_0]] \cup .. \cup [[p_n]],\ S)\}$

Derivation Basis

[Figure: the Emps tree with entries {Id:1}, {Id:2}, {Id:3}, each with a bonus and a salary edge; the leaf values shown are 100, 2000, 300, 1900, 17, 1700.]

where Emps.Id:x.salary:y ∈ D,    (p1)
      Emps.Id:x.bonus:y ∈ D      (p2)
collect Id:x.new_salary:y

Output: {Id:1}.new_salary:2100, {Id:2}.new_salary:1717, {Id:3}.new_salary:2200

$\psi(p_1)$ = Emps.Id:1.salary:2000 ∈ D

$\Gamma_{Q,D}(l:v) = \{(\psi(p_1) \cup \psi(p_2),\ Emps.Id:1.\{salary, bonus\})\}$
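A rough Python sketch of what the derivation basis records in this example: for each output value, the input paths it was derived from. Two things are assumptions here and are not stated in the surviving slide text: that new_salary is salary plus bonus (inferred from the outputs 2100, 1717 and 2200), and the pairing of the leaf values with the three Ids. The function name and the string form of the paths are illustrative.

```python
# Assumption: pairing of values with Ids, chosen so that salary + bonus
# reproduces the outputs listed on the slide.
emps = {
    1: {"salary": 2000, "bonus": 100},
    2: {"salary": 1700, "bonus": 17},
    3: {"salary": 1900, "bonus": 300},
}

def new_salaries_with_sources(db):
    """Map each output path to (value, set of input paths it was computed from)."""
    out = {}
    for emp_id, record in db.items():
        sources = {f"Emps.Id:{emp_id}.salary", f"Emps.Id:{emp_id}.bonus"}
        # Assumption: the output value combines both matched inputs.
        out[f"Id:{emp_id}.new_salary"] = (record["salary"] + record["bonus"], sources)
    return out

result = new_salaries_with_sources(emps)
print(result["Id:1.new_salary"][0])  # 2100
```

For Id 1 the recorded sources are the salary and bonus paths under Emps.Id:1, which mirrors the pair Emps.Id:1.{salary, bonus} in the slide formula.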

Derivation Basis - $\Gamma_{Q,D}(l:v)$

v is an atomic value

Q = Q1 ∪ Q2 (with outputs Q1(D) and Q2(D)):

$\Gamma_{Q,D}(l:v) = \Gamma_{Q_1,D}(l:v) \cup \Gamma_{Q_2,D}(l:v)$

Q Is Traceable If:

1) each pi in the query matches either against some database constant or against a sub-query

2) every sub-query is a view which does not share any variables with the outer scope

3) only a singular pattern is allowed to match against a sub-query

4) the pattern and output expression of the sub-query consist of a sequence of distinct variables and have the same length.

Propositions

Proposition 1: If Q is traceable and Q ~> Q', then Q' is traceable.

Proposition 2: If Q is traceable and Q ~> Q', then $\Gamma_{Q,D}(l:v) = \Gamma_{Q',D}(l:v)$ for any l:v in the output of Q(D).


Conclusions

o Describing and understanding the provenance of data

o Two perspectives: Why is a piece of data in the output? Where did a piece of data come from?

o A system of rewrite rules where why-provenance is preserved over the class of well-defined queries and where-provenance is preserved over the class of traceable queries.

Thank you for listening!
