Language Design and Data Provenance, 6/3/2019, GeCo Workshop, Como, Val Tannen, University of Pennsylvania

Apr 18, 2022

Page 1: Language Design and Data Provenance

Language Design and Data Provenance

6/3/2019, GeCo Workshop, Como

Val Tannen, University of Pennsylvania

Page 2: Language Design and Data Provenance

Collaborators

T of T award: TJ Green (RelationalAI), Grigoris Karvounarakis (RelationalAI)

G of PODS paper: TJ

ORCHESTRA: Zack Ives (University of Pennsylvania), TJ, Grigoris

Other core papers: Nate Foster (Cornell University), Yael Amsterdamer (Bar-Ilan University), Daniel Deutch (Tel Aviv University), Tova Milo (Tel Aviv University), Sudeepa Roy (Duke University), Yuval Moskovitch (Tel Aviv University)

Recent work: Erich Grädel (RWTH Aachen)

Much gratitude: Peter Buneman (University of Edinburgh)

Page 3: Language Design and Data Provenance

Provenance?

•  Provenance is about

–  trust: propagate it from inputs to outputs

–  diagnostics: faulty outputs come from where?

–  (repairs): fix inputs to fix outputs (reverse provenance analysis).

Page 4: Language Design and Data Provenance

(Binary) Trust with Cat Victims

Sue’s notes*:
  cat mouse
  cat rat

Val’s notes*:
  mouse gray
  mouse red
  rat gray

Zack** computation:
  cat gray
  cat red

[Figure: each tuple in Sue’s and Val’s notes carries a Yes/No trust label, and the Yes/No labels on Zack’s computed tuples follow from them.]

* Sue and Val are noted zoologists.
** Zack is a noted computational zoologist.

Page 5: Language Design and Data Provenance

Confidence Scores (non-binary trust)

Sue’s notes:
  cat mouse     0.9
  cat rat       0.9

Val’s notes:
  mouse gray    0.6
  mouse red     0.1
  rat gray      0.8

Zack computation:
  cat gray      0.72
  cat red       0.09

0.72 = max(0.9 × 0.8, 0.9 × 0.6)

0.09 = 0.9 × 0.1

Page 6: Language Design and Data Provenance

A Simple Model for Data Pricing

Sue’s notes:
  cat mouse     $10
  cat rat       $10

Val’s notes:
  mouse gray    $6
  mouse red     $1
  rat gray      $8

Zack computation:
  cat gray      $16
  cat red       $11

16 = min(10 + 8, 10 + 6)

11 = 10 + 1
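The two slides above run the same computation with different operations: (max, ×) for confidence and (min, +) for prices. A minimal Python sketch of that idea follows; the `derivations` table and the function name `annotate` are illustrative assumptions, written out by hand from the example data rather than taken from the talk.

```python
from functools import reduce

# Input annotations from the slides: p, q label Sue's tuples; r, s, t label Val's.
confidence = {"p": 0.9, "q": 0.9, "r": 0.6, "s": 0.1, "t": 0.8}
price = {"p": 10, "q": 10, "r": 6, "s": 1, "t": 8}

# Hand-written for this example: each output tuple lists its alternative
# derivations, and each derivation lists the inputs it uses jointly.
derivations = {
    ("cat", "gray"): [("p", "r"), ("q", "t")],
    ("cat", "red"): [("p", "s")],
}

def annotate(values, alternative, joint):
    """Combine jointly used inputs with `joint`, alternatives with `alternative`."""
    result = {}
    for out, alts in derivations.items():
        per_derivation = [reduce(joint, (values[x] for x in d)) for d in alts]
        result[out] = reduce(alternative, per_derivation)
    return result

scores = annotate(confidence, max, lambda a, b: a * b)   # confidence scores
prices = annotate(price, min, lambda a, b: a + b)        # data prices
print(scores)
print(prices)
```

Only the pair of operations changes between the two calls, which is the observation the rest of the talk turns into an algebraic framework.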

Page 7: Language Design and Data Provenance

Computation? Expressed in a Query Language

Sue’s notes:
  cat mouse
  cat rat

Val’s notes:
  mouse gray
  mouse red
  rat gray

Zack computation:
  cat gray
  cat red

Zack(x,z) :- Sue(x,y), Val(y,z)

Zack = PROJECT (JOIN (Sue, Val))

Zack = { (u.#pred, v.#color) | u ∈ Sue, v ∈ Val, u.#prey = v.#animal }

Page 8: Language Design and Data Provenance

Do it once and use it repeatedly: provenance

Label (annotate) input items abstractly with provenance tokens.

Provenance tracking: propagate expressions (involving tokens) to annotate intermediate data and, finally, outputs.

Based on query language design, track two distinct ways of using data items by computation primitives:

•  jointly (this alone is basically like keeping a log)

•  alternatively (doing both is essential; think trust)

Input-output compositional; modular (in the primitives).

Later, we want to evaluate the provenance expressions to obtain binary trust, confidence scores, data prices, etc.

Page 9: Language Design and Data Provenance

Algebraic interpretation for RDB

Set X of provenance tokens. Space of annotations, provenance expressions Prov(X).

Prov(X)-relations: every tuple is annotated with some element from Prov(X).

Binary operations on Prov(X):

•  · corresponds to joint use (join, cartesian product)

•  + corresponds to alternative use (union and projection)

Special annotations:

•  “Absent” tuples are annotated with 0.

•  1 is a “neutral” annotation (data we do not track).

Page 10: Language Design and Data Provenance

K-Relational algebra

Algebraic laws of (Prov(X), +, ·, 0, 1)? More generally, for annotations from a structure (K, +, ·, 0, 1)?

K-relations. Generalize RA+ to (positive) K-relational algebra.

The desired optimization equivalences of K-relational algebra hold iff (K, +, ·, 0, 1) is a commutative semiring.

Generalizes SPJU or UCQ or non-rec. Datalog:

•  set semantics (B, ∨, ∧, ⊥, ⊤)

•  bag semantics (N, +, ·, 0, 1)

•  c-table semantics [IL84] (BoolExp(X), ∨, ∧, ⊥, ⊤)

•  event table semantics [FR97, Z97] (P(Ω), ∪, ∩, ∅, Ω)
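To make "one algebra, many semantics" concrete, here is a small Python sketch of the join-then-project query over K-relations, run once with bag semantics and once with set semantics. The names `Semiring` and `join_project` and the dict-based encoding of K-relations are assumptions made for illustration.

```python
from collections import namedtuple

# A commutative semiring given by its two operations and two constants.
Semiring = namedtuple("Semiring", "plus times zero one")
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # set semantics
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)             # bag semantics

def join_project(K, sue, val):
    """Zack(x,z) :- Sue(x,y), Val(y,z) over K-relations (dict: tuple -> annotation)."""
    out = {}
    for (x, y), k1 in sue.items():
        for (y2, z), k2 in val.items():
            if y == y2:
                joint = K.times(k1, k2)                                # join: joint use
                out[(x, z)] = K.plus(out.get((x, z), K.zero), joint)   # projection: alternative use
    # Tuples annotated with zero are absent.
    return {t: k for t, k in out.items() if k != K.zero}

sue_keys = [("cat", "mouse"), ("cat", "rat")]
val_keys = [("mouse", "gray"), ("mouse", "red"), ("rat", "gray")]

bag = join_project(NAT, dict.fromkeys(sue_keys, 1), dict.fromkeys(val_keys, 1))
sets = join_project(BOOL, dict.fromkeys(sue_keys, True), dict.fromkeys(val_keys, True))
print(bag)   # multiplicities: (cat, gray) has two derivations
print(sets)  # plain membership
```

The query code never inspects the annotations; it only calls `plus` and `times`, so swapping the semiring swaps the semantics.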

Page 11: Language Design and Data Provenance

What is a commutative semiring?

An algebraic structure (K, +, ·, 0, 1) where:

•  K is the domain

•  + is associative, commutative, with 0 identity

•  · is associative, with 1 identity

•  · distributes over +

•  a·0 = 0·a = 0

(the properties above define a semiring)

•  · is also commutative

Unlike a ring, there is no requirement for inverses to +.
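The laws above can be spot-checked mechanically on a finite sample of elements. The sketch below is only a test on samples, not a proof, and the function name `is_commutative_semiring` is an assumption of this example. Exact fractions are used for the [0,1] case to avoid floating-point rounding in the associativity check.

```python
import math
from fractions import Fraction
from itertools import product

def is_commutative_semiring(samples, plus, times, zero, one):
    """Check the commutative semiring laws on all triples drawn from `samples`."""
    for a, b, c in product(samples, repeat=3):
        laws = [
            plus(plus(a, b), c) == plus(a, plus(b, c)),                # + associative
            plus(a, b) == plus(b, a),                                  # + commutative
            plus(a, zero) == a,                                        # 0 is the identity for +
            times(times(a, b), c) == times(a, times(b, c)),            # · associative
            times(a, one) == a and times(one, a) == a,                 # 1 is the identity for ·
            times(a, plus(b, c)) == plus(times(a, b), times(a, c)),    # · distributes over +
            times(a, zero) == zero and times(zero, a) == zero,         # 0 annihilates
            times(a, b) == times(b, a),                                # · commutative
        ]
        if not all(laws):
            return False
    return True

add = lambda a, b: a + b
mul = lambda a, b: a * b

nat = is_commutative_semiring(range(4), add, mul, 0, 1)
viterbi = is_commutative_semiring(
    [Fraction(0), Fraction(1, 2), Fraction(1)], max, mul, Fraction(0), Fraction(1)
)
tropical = is_commutative_semiring([0, 1, 2, math.inf], min, add, math.inf, 0)
# (N, max, +, 0, 0) fails: a + 0 = a, so the additive identity 0 does not annihilate.
broken = is_commutative_semiring(range(4), max, add, 0, 0)
print(nat, viterbi, tropical, broken)
```

The failing case shows why the (min, +) structure needs ∞ as its zero: the additive identity must also annihilate under the multiplicative operation.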

Page 12: Language Design and Data Provenance

Provenance: abstract semiring annotation

Sue’s notes:
  cat mouse     p
  cat rat       q

Val’s notes:
  mouse gray    r
  mouse red     s
  rat gray      t

Zack(x,z) :- Sue(x,y), Val(y,z)

Zack:
  cat gray      p·r + q·t
  cat red       p·s

Keep X = {p, q, r, s, t} abstract. Diagnostic for wrong answers; deletion propagation. E.g., r = s = 0.

Provenance polynomials: the (N[X], +, ·, 0, 1) semiring

Page 13: Language Design and Data Provenance

Provenance propagation through language operations

Sue:
  cat mouse     p
  cat rat       q

Val:
  mouse gray    r
  mouse red     s
  rat gray      t

JOIN:
  cat mouse gray    p·r
  cat mouse red     p·s
  cat rat gray      q·t

PROJECT:
  cat gray      p·r + q·t
  cat red       p·s
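The JOIN and PROJECT steps above can be replayed with a tiny polynomial representation. This is a sketch under assumptions: polynomials are encoded as a `Counter` from monomials (sorted tuples of tokens) to natural-number coefficients, and the helper names `token`, `p_add`, `p_mul`, and `show` are invented for the example.

```python
from collections import Counter

def token(x):
    """Polynomial consisting of the single provenance token x."""
    return Counter({(x,): 1})

def p_add(a, b):
    """Sum of polynomials: coefficients of equal monomials are added."""
    return a + b

def p_mul(a, b):
    """Product of polynomials: multiply every pair of monomials."""
    out = Counter()
    for m1, c1 in a.items():
        for m2, c2 in b.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def show(poly):
    return " + ".join(
        (str(c) if c != 1 else "") + "·".join(m) for m, c in sorted(poly.items())
    )

sue = {("cat", "mouse"): token("p"), ("cat", "rat"): token("q")}
val = {("mouse", "gray"): token("r"), ("mouse", "red"): token("s"), ("rat", "gray"): token("t")}

# JOIN on prey = animal: annotations are multiplied.
joined = {
    (pred, prey, color): p_mul(k1, k2)
    for (pred, prey), k1 in sue.items()
    for (animal, color), k2 in val.items()
    if prey == animal
}

# PROJECT away the middle column: annotations of merged tuples are added.
zack = {}
for (pred, _, color), k in joined.items():
    zack[(pred, color)] = p_add(zack.get((pred, color), Counter()), k)

for tup, poly in zack.items():
    print(tup, show(poly))
```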

Page 14: Language Design and Data Provenance

Provenance polynomials

(N[X], +, ·, 0, 1) is the commutative semiring freely generated by X (universality property involving homomorphisms).

Provenance polynomials are PTIME-computable (data complexity). (Query complexity depends on language and representation.)

ORCHESTRA provenance (graph representation): about 30% overhead.

Monomials correspond to logical derivations (proof trees in non-rec. Datalog).

Provenance reading of polynomials:

An output tuple has provenance 2r² + rs: three derivations of the tuple. Two of them use r twice; the third uses r and s, once each.
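The reading of 2r² + rs can be done mechanically. In this small sketch (the dict encoding of the polynomial is an assumption for illustration), coefficients count derivations and exponents count how often each token is used within one derivation.

```python
from collections import Counter

# 2r² + rs as a map from monomials (bags of tokens) to coefficients.
poly = {("r", "r"): 2, ("r", "s"): 1}

# Total number of derivations: the sum of the coefficients.
derivations = sum(poly.values())

# For each monomial: how many derivations it stands for, and token usage per derivation.
reading = [(coeff, dict(Counter(mono))) for mono, coeff in poly.items()]

print(derivations)
print(reading)
```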

Page 15: Language Design and Data Provenance

Specialize provenance for confidence scores

Sue’s notes:
  cat mouse     p    0.9
  cat rat       q    0.9

Val’s notes:
  mouse gray    r    0.6
  mouse red     s    0.1
  rat gray      t    0.8

Zack(x,z) :- Sue(x,y), Val(y,z)

Zack:
  cat gray      pr + qt    0.72
  cat red       ps         0.09

V = ([0,1], max, ·, 0, 1), the Viterbi semiring

f: X → [0,1], with f(p) = f(q) = 0.9, f(r) = 0.6, f(s) = 0.1, f(t) = 0.8

eval(f): N[X] → V, with eval(f)(pr + qt) = 0.72 and eval(f)(ps) = 0.09
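The map eval(f) above can be sketched as a generic polynomial evaluator parameterized by the target semiring. The function name `evaluate` and the dict encoding of polynomials are assumptions of this example; the valuation f and the two polynomials are the ones on the slide.

```python
def evaluate(poly, f, plus, times, zero, one):
    """Evaluate a polynomial with natural-number coefficients in a commutative semiring.

    `poly` maps monomials (tuples of tokens) to coefficients; a coefficient n
    is interpreted as the n-fold sum of the monomial's value.
    """
    total = zero
    for mono, coeff in poly.items():
        value = one
        for x in mono:
            value = times(value, f[x])
        for _ in range(coeff):
            total = plus(total, value)
    return total

f = {"p": 0.9, "q": 0.9, "r": 0.6, "s": 0.1, "t": 0.8}
cat_gray = {("p", "r"): 1, ("q", "t"): 1}   # pr + qt
cat_red = {("p", "s"): 1}                   # ps

# The Viterbi semiring ([0,1], max, ·, 0, 1).
viterbi = dict(plus=max, times=lambda a, b: a * b, zero=0.0, one=1.0)

gray_score = evaluate(cat_gray, f, **viterbi)
red_score = evaluate(cat_red, f, **viterbi)
print(round(gray_score, 2), round(red_score, 2))
```

Passing a different set of operations, for example (min, +, ∞, 0), reuses the same stored polynomials for data pricing without rerunning the query.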

Page 16: Language Design and Data Provenance

Some application semirings

(B, ∧, ∨, ⊤, ⊥): binary trust

(N, +, ·, 0, 1): multiplicity (number of derivations)

(A, min, max, 0, Pub): access control

V = ([0,1], max, ·, 0, 1): Viterbi semiring (MPE), confidence scores

T = ([0,∞], min, +, ∞, 0): tropical semiring (shortest paths), data pricing

F = ([0,1], max, min, 0, 1): “fuzzy logic” semiring

Page 17: Language Design and Data Provenance

Two kinds of semirings in this framework

Provenance semirings, e.g.,

•  (N[X], +, ·, 0, 1): provenance polynomials [GKT07]

•  (Why(X), ∪, ⋓, ∅, {∅}): witness why-provenance [BKT01]

Application semirings, e.g.,

•  (A, min, max, 0, Pub): access control [FGT08]

•  V = ([0,1], max, ·, 0, 1): Viterbi semiring (MPE) [GKIT07]

Provenance specialization relies on

–  Provenance semirings are freely generated by provenance tokens

–  Query commutation with semiring homomorphisms

Page 18: Language Design and Data Provenance

Query commutation with homomorphisms

Query in QL, homomorphism h: K1 → K2

[Diagram: commutative square. Applying the query to a K1-Rel instance and then h yields the same K2-Rel result as applying h first and then the query.]

QL = RA+, Datalog [GKT07] and extensions [FGT08, GP10, ADT11a, T13, DMT15, GUKFC16, T17]
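The commutation property can be checked on a small instance. This sketch uses the homomorphism h(n) = (n > 0) from bag annotations (N, +, ·, 0, 1) to set annotations (B, ∨, ∧, ⊥, ⊤); the multiplicities in `sue` and `val` are made-up test data, and `query` and `lift` are illustrative names.

```python
def query(plus, times, zero, sue, val):
    """Zack(x,z) :- Sue(x,y), Val(y,z), with annotations combined by plus/times."""
    out = {}
    for (x, y), k1 in sue.items():
        for (y2, z), k2 in val.items():
            if y == y2:
                out[(x, z)] = plus(out.get((x, z), zero), times(k1, k2))
    return out

def h(n):
    """Semiring homomorphism from (N, +, ·, 0, 1) to (B, ∨, ∧, ⊥, ⊤)."""
    return n > 0

def lift(hom, rel):
    """Apply a homomorphism to every annotation of a relation."""
    return {t: hom(k) for t, k in rel.items()}

# Hypothetical bag annotations; 0 means the tuple is absent.
sue = {("cat", "mouse"): 2, ("cat", "rat"): 0}
val = {("mouse", "gray"): 1, ("mouse", "red"): 3, ("rat", "gray"): 5}

nat = (lambda a, b: a + b, lambda a, b: a * b, 0)
boo = (lambda a, b: a or b, lambda a, b: a and b, False)

query_then_h = lift(h, query(*nat, sue, val))
h_then_query = query(*boo, lift(h, sue), lift(h, val))
print(query_then_h == h_then_query)
```

Both paths around the square give the same annotated relation, which is what lets provenance be computed once and specialized afterwards.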

Page 19: Language Design and Data Provenance

K-Nested Relational Calculus

K-sets. Every element of the set is annotated with some k ∈ K, where (K, +, ·, 0, 1) is a commutative semiring.

Map f on S: { f(x) | x ∈ S }

If x is annotated by k, then the annotation of f(x) is multiplied by k.

K-sets also form a commutative semiring. This gives annotations for

“FlatMap” g on S: ∪ { g(x) | x ∈ S }
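One way to read the Map and FlatMap annotations is sketched below, assuming K-sets are encoded as dicts from elements to annotations and the semiring as a tuple (plus, times, zero, one). The names `flat_map` and `k_map` and the sample data are assumptions of this example, not definitions from the talk.

```python
def flat_map(K, g, s):
    """∪ { g(x) | x ∈ S }: scale each K-set g(x) by x's annotation, then add pointwise."""
    plus, times, zero, one = K
    out = {}
    for x, k in s.items():
        for y, k2 in g(x).items():
            out[y] = plus(out.get(y, zero), times(k, k2))
    return out

def k_map(K, f, s):
    """{ f(x) | x ∈ S }: flat_map over singletons annotated with 1."""
    one = K[3]
    return flat_map(K, lambda x: {f(x): one}, s)

NAT = (lambda a, b: a + b, lambda a, b: a * b, 0, 1)

s = {"mouse": 2, "rat": 3, "cat": 1}          # a bag of words (N-set)
lengths = k_map(NAT, len, s)                  # word lengths with multiplicities
letters = flat_map(NAT, lambda w: {ch: 1 for ch in set(w)}, s)
print(lengths)
print(letters["a"], letters["m"])
```

Elements that map to the same result have their annotations added, so "rat" and "cat" both contribute to the annotation of length 3.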

Page 20: Language Design and Data Provenance

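A Hierarchy of Provenance Semirings [G09, DMRT14]

[Figure: hierarchy of provenance semirings, from most informative (top) to least informative (bottom). Each arrow is a surjective semiring homomorphism that is the identity on X; arrows are labeled “+ idemp.”, “· idemp.”, or “absorption (ab + a = a)”. The application semirings N, T, V, A, and B appear alongside the hierarchy.]

Example: 2x²y + xy + 5y² + xz

  N[X]          2x²y + xy + 5y² + xz
  B[X]          x²y + xy + y² + xz
  Trio(X)       3xy + 5y + xz
  Why(X)        xy + y + xz
  Sorp(X)       xy + y² + xz
  PosBool(X)    y + xz
  Which(X)      xyz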
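The example polynomial can be pushed down the hierarchy in a few lines. This is a sketch under assumptions: monomials are tuples of variables, and the helper names `drop_coefficients`, `drop_exponents`, `divides`, and `absorb` are invented here to mirror the arrow labels (+ idempotent, · idempotent, absorption).

```python
from collections import Counter

# 2x²y + xy + 5y² + xz in N[X]: monomial -> coefficient.
poly = {("x", "x", "y"): 2, ("x", "y"): 1, ("y", "y"): 5, ("x", "z"): 1}

def drop_coefficients(p):
    """+ idempotent: keep only the set of monomials."""
    return set(p)

def drop_exponents(p):
    """· idempotent: turn each monomial into a set of variables, adding coefficients."""
    out = Counter()
    for mono, c in p.items():
        out[tuple(sorted(set(mono)))] += c
    return dict(out)

def divides(m1, m2):
    """True if monomial m1 divides m2 (bag inclusion)."""
    return not (Counter(m1) - Counter(m2))

def absorb(monos):
    """Absorption (ab + a = a): drop monomials divisible by another monomial."""
    return {m for m in monos if not any(o != m and divides(o, m) for o in monos)}

b_x = drop_coefficients(poly)     # B[X]:       x²y + xy + y² + xz
trio = drop_exponents(poly)       # Trio(X):    3xy + 5y + xz
why = drop_coefficients(trio)     # Why(X):     xy + y + xz
sorp = absorb(b_x)                # Sorp(X):    xy + y² + xz
posbool = absorb(why)             # PosBool(X): y + xz
which = set().union(*why)         # Which(X):   xyz
print(trio, sorp, posbool, which)
```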

Page 21: Language Design and Data Provenance

A menagerie of provenance semirings

(Which(X), ∪, ∪*, ∅, ∅*): sets of contributing tuples. “Lineage” (1) [CWW00]

(Why(X), ∪, ⋓, ∅, {∅}): sets of sets of … Witness why-provenance [BKT01]

(PosBool(X), ∧, ∨, ⊤, ⊥): minimal sets of sets of … Minimal witness why-provenance [BKT01]; also “Lineage” (2), used in probabilistic dbs [SORK11]

(Trio(X), +, ·, 0, 1): bags of sets of … “Lineage” (3) [BDHTW08, G09]

(B[X], +, ·, 0, 1): sets of bags of … Boolean coeff. polynomials [G09]

(Sorp(X), +, ·, 0, 1): minimal sets of bags of … absorptive polynomials [DMRT14]

(N[X], +, ·, 0, 1): bags of bags of … universal provenance polynomials [GKT07]

Page 22: Language Design and Data Provenance

Further aspects of the framework

Extension to tree data (Nested Relational Calculus, structural recursion on trees, unordered XQuery) [FGT08]

Study of CQ/UCQ on provenance-annotated relations [G09]

Extension to aggregates (poly-size overhead) [ADT11a]

Poly-size provenance for Datalog (circuits; PosBool(X), Sorp(X) …) [DMRT14]

Extension to data-dependent finite state processes [DMT15]

Connections to semiring monad [FGT08, T13], to semimodules [ADT11a], to tensor products [ADT11a, DMT15]

Page 23: Language Design and Data Provenance

Provenance for aggregation

9/2/16, Simons Institute

DS:
  a 20    x
  a 10    y
  b 15    q
  b 10    r
  b 25    s

DS-agg (SUM S GROUP BY D):
  a 20+10       ?
  b 15+10+25    ?

Desiderata

1.  Compatibility with set/bag semantics

2.  Fundamental property (commutation with homomorphisms)

3.  Poly-size overhead! 1 + 2 + 4 + … + 2^(n-1) ⇒ 2^n results
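One reading of desideratum 3: if the aggregated values are 1, 2, 4, …, 2^(n-1), every subset of present tuples yields a different SUM, so listing each possible result separately would take 2^n entries. The sketch below just counts them; `distinct_sums` is a name made up for this illustration.

```python
from itertools import combinations

def distinct_sums(values):
    """Sums obtainable by keeping any subset of the input tuples."""
    return {sum(c) for k in range(len(values) + 1) for c in combinations(values, k)}

n = 5
values = [2 ** i for i in range(n)]   # 1, 2, 4, ..., 2^(n-1)
count = len(distinct_sums(values))    # one possible SUM per subset
print(count)
```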

Page 24: Language Design and Data Provenance

Solution inspired by (semi)linear algebra

DS:
  a 20    x
  a 10    y
  b 15    q
  b 10    r
  b 25    s

DS-agg:
  a    x 20 + y 10             ?
  b    q 15 + r 10 + s 25      ?

(R, +, 0) is not a Prov(X)-semimodule, but…

(K-Rel, ∪, ∅) is a K-semimodule with the singletons as basis.

Relations are the result of ∪-aggregation! What if (R, +, 0) were a Prov(X)-semimodule?

Page 25: Language Design and Data Provenance

Tensor product construction

DS-agg:
  a    x⊗20 + y⊗10             x + y
  b    q⊗15 + r⊗10 + s⊗25      q + r + s

Embed a commutative monoid M (for sum, max or min) into a K-semimodule K⊗M (new values!)

Consistency: embedding should be faithful.
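A rough sketch of how the symbolic aggregates above might be used, assuming the simplest specialization: tokens are mapped to natural-number multiplicities, and k ⊗ m collapses to k copies of m under SUM. The encoding of formal sums as lists of (token, value) pairs and the name `specialize` are assumptions of this example, not the construction from [ADT11a].

```python
# DS-agg entries as formal sums of (token ⊗ value) terms.
ds_agg = {
    "a": [("x", 20), ("y", 10)],
    "b": [("q", 15), ("r", 10), ("s", 25)],
}

def specialize(terms, multiplicity):
    """Map each token to a natural number and collapse k ⊗ m to k·m (k copies of m)."""
    return sum(multiplicity[tok] * value for tok, value in terms)

all_once = dict.fromkeys("xyqrs", 1)          # every input tuple present once
without_r = {**all_once, "r": 0}              # deletion propagation: r = 0
x_twice = {**all_once, "x": 2}                # bag semantics: x occurs twice

sums_all = {d: specialize(t, all_once) for d, t in ds_agg.items()}
sums_without_r = {d: specialize(t, without_r) for d, t in ds_agg.items()}
sums_x_twice = {d: specialize(t, x_twice) for d, t in ds_agg.items()}
print(sums_all, sums_without_r, sums_x_twice)
```

The symbolic entry has one term per input tuple, so its size stays linear while still determining the aggregate under every valuation of the tokens.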

Page 26: Language Design and Data Provenance

Negative information; non-monotone operations (difference)

Boolean expressions [IL84]. Limited.

Add a binary operation corresponding to difference:

•  m-semirings (common gen. of set and bag difference) [GP10]

•  spm-semirings (OPTIONAL in SPARQL) [GUKFC16]

Encode difference by aggregation [ADT11a]

Different equational theories, different algebraic optimizations [ADT11b]

Still not clear how to track negative information. Useful for: non-answers (why not?), insertion propagation.

Logical model checking (“provenance of … truth?”): negation as duality (NNFs), logical games; ongoing work with Grädel [T16, T17]

Page 27: Language Design and Data Provenance

Current targets

ANALYTICS COMPUTATIONS

“Fine-grained provenance for linear algebra operators”, Yan, Tannen, Ives, TaPP 2016

DISTRIBUTED SYSTEMS / NETWORK PROVENANCE

“Time-aware provenance for distributed systems”, Zhou, Ding, Haeberlen, Ives, Loo, TaPP 2011

“Diagnosing missing events in distributed systems with negative provenance”, Wu, Zhao, Haeberlen, Zhou, Loo, SIGCOMM 2014

STATIC ANALYSIS OF SOFTWARE

“On abstraction refinement for program analyses in Datalog”, Zhang, Mangal, Grigore, Naik, PLDI 2014

Page 28: Language Design and Data Provenance

Framework references (I)

[GKT07] “Provenance semirings”, Green, Karvounarakis, Tannen, PODS 2007.

[GKIT07] “Update exchange with mappings and provenance”, Green, Karvounarakis, Ives, Tannen, VLDB 2007.

[FGT08] “Annotated XML: queries and provenance”, Foster, Green, Tannen, PODS 2008.

[G09] “Containment of conjunctive queries on annotated relations”, Green, ICDT 2009.

[GP10] “On database query languages for K-relations”, Geerts, Poggi, J. Appl. Logic 2010.

Page 29: Language Design and Data Provenance

Framework references (II)

[ADT11a] “Provenance for aggregate queries”, Amsterdamer, Deutch, Tannen, PODS 2011.

[ADT11b] “On the limitations of provenance for queries with difference”, Amsterdamer, Deutch, Tannen, TaPP 2011.

[T13] “Provenance propagation in complex queries”, Tannen, Buneman Festschrift 2013.

[DMRT14] “Circuits for Datalog provenance”, Deutch, Milo, Roy, Tannen, ICDT 2014.

[DMT15] “Provenance-based analysis of data-centric processes”, Deutch, Moskovitch, Tannen, VLDB J. 2015.

Page 30: Language Design and Data Provenance

Framework references (III)

[GUKFC16] “Algebraic structures for capturing the provenance of SPARQL queries”, Geerts, Unger, Karvounarakis, Fundulaki, Christophides, JACM 2016.

[T16] “About the provenance of truth”, Tannen, Simons Inst. Website 2016. https://simons.berkeley.edu/talks/val-tannen-2016-12-09

[T17] “Provenance analysis for FOL model checking”, Tannen, SIGLOG News 2017.

[GT17a] “The semiring framework for database provenance”, Green, Tannen, PODS 2017.

[GT17b] “Semiring provenance for first-order model checking”, Grädel, Tannen, CoRR abs/1712.01980 (2017).

Page 31: Language Design and Data Provenance

Other references

[IL84] “Incomplete information in relational databases”, Imieliński, Lipski, JACM 1984.

[FR97] “A probabilistic relational algebra”, Fuhr, Rölleke, TOIS 1997.

[Z97] “Query evaluation in probabilistic relational databases”, Zimányi, DDS 1997.

[CWW00] “Tracing the lineage of view data in a warehousing environment”, Cui, Widom, Wiener, TODS 2000.

[BKT01] “Why and where: a characterization of data provenance”, Buneman, Khanna, Tan, ICDT 2001.

[BDHTW08] “Databases with uncertainty and lineage”, Benjelloun, Das Sarma, Halevy, Theobald, Widom, VLDB J. 2008.

[SORK11] “Probabilistic databases”, Suciu, Olteanu, Ré, Koch, SLDM 2011.

Page 32: Language Design and Data Provenance

Thank you!