Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance.

Paolo Missier, Norman Paton, Khalid Belhajjame

Information Management Group

School of Computer Science, University of Manchester, UK

EDBT Conference, Lausanne, Switzerland, March 2010


Problem statement and outline

• Setting:
  – Black box provenance of workflow data products
• Fine-grained provenance:
  – tracking provenance through collections: motivation
  – functional model of collection-oriented workflow processing
• Efficient query processing:
  – leveraging the functional model to achieve efficient processing for a simple query model
• Experimental evaluation


Computing provenance through a graph

• Provenance graph is an unfolding of the workflow graph structure
  – large: grows with size of input
  – lineage queries involve graph traversal

From: Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," Procs. ICDE, 2009.


Main result

• Query the provenance of individual collection elements
• But avoid computing transitive closures on the provenance graph
  – potentially very large
• Traverse the workflow graph instead: it is much smaller
• This results in substantial performance improvement for typical queries

[Figure: workflow graph (processors P, Q, R; P has input ports X1, X2, X3 and output port Y; Q and R each have ports X and Y) next to the corresponding provenance graph over the elements of y = [ [y11 ... y1n], ... [ym1 ... ymn] ] and the values a1 ... an, b1 ... bm, v1 ... vn, w. Later builds of the slide annotate ports of the workflow graph with the element indexes [1][n], [] and [][1].]


Workflow as data integrator



[Figure: workflow with stages labelled "QTL genomic regions", "genes in QTL" and "metabolic pathways (KEGG)"]


Motivation for fine-grained provenance

List-structured KEGG gene ids:

[ [ mmu:26416 ], [ mmu:328788 ] ]

[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]



[Figure: the geneIDs input [ [ mmu:26416 ], [ mmu:328788 ] ] shown against the pathways outputs, element by element, over several builds of the slide]


Problem statement and outline

• Setting:
  – Black box provenance of workflow data products
• Fine-grained provenance:
  – tracking provenance through collection elements
  – motivation
  ➡ functional model of collection-oriented workflow processing


Functional model for collection processing /1

① Simple processing: service expects atomic values, receives atomic values

[Figure: processor P with inputs v1, v2, v3 on ports X1, X2, X3 and outputs w1, w2 on ports Y1, Y2]

② Simple iteration: service expects atomic values, receives input list

[Figure: v = [v1 ... vn] on port X of P ➠ n instances P1 ... Pn, each taking vi on X and producing wi on Y, giving w = [w1 ... wn]]

dd = 0, ad = 1, δ = 1
lineage(wi) = vi

③ Extension: service expects atomic values, receives input nested list

v = [[...], ...[...]] ➠ w = [[..] ...[...]]
dd = 0, ad = 2, δ = 2
lineage(wij) = vij


Functional model /2

[Figure: processor P with input port X (dd = m) receiving v = [[...], ...[...]] (ad = n) and output port Y producing w = [[..] ...[...]] of depth n-m]

δ = n-m ≥ 0

The simple iteration model generalises by induction to a generic δ = n-m.

This leads to a recursive functional formulation for simple collection processing:

\[ v = [a_1 \ldots a_n] \]

\[
(\mathit{eval}_l\; P\; v) =
\begin{cases}
(P\; v) & \text{if } l = 0 \\
(\mathit{map}\; (\mathit{eval}_{l-1}\; P)\; v) & \text{if } l > 0
\end{cases}
\]
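The recursive formulation of eval_l can be tried out directly. The following is a minimal Python sketch, assuming that l is the depth mismatch δ = ad - dd (dd and ad presumably standing for declared and actual depth) and that Python lists stand in for workflow collections. The service `upper` is hypothetical and only illustrates a processor that expects one atomic value.

```python
from typing import Any, Callable


def eval_l(level: int, P: Callable[[Any], Any], v: Any) -> Any:
    """Apply P at nesting depth `level` of v.

    level == 0: P receives v as is.
    level  > 0: map eval_(level-1) over the elements of v.
    """
    if level == 0:
        return P(v)
    return [eval_l(level - 1, P, a) for a in v]


def upper(s: str) -> str:
    """Hypothetical service that expects a single atomic string (dd = 0)."""
    return s.upper()


atomic = eval_l(0, upper, "a")                   # delta = 0: plain call
flat = eval_l(1, upper, ["a", "b"])              # delta = 1: simple iteration
nested = eval_l(2, upper, [["a"], ["b", "c"]])   # delta = 2: nested iteration
```

Because the output keeps the shape of the input, the element at index path [i.j] of `nested` was computed from the element at the same index path of the input, which is the lineage(wij) = vij property stated for the nested case.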


Functional model - multiple inputs /3

[Figure: processor P with input ports X1, X2, X3 and output port Y]

v1 = [v11 ... v1n], v2 = [v21 ... v2k], v3 = [v31 ... v3m]

w = [ [w11 ... w1m], ... [wn1 ... wnm] ]

dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1

Cross-product involving v1 and v3 (but not v2, since δ2 = 0):

v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ] // cross product

and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]

lineage(wij) = < v1i, v2, v3j >

\[ a \otimes b = [[\langle a_i, b_j \rangle \mid b_j \leftarrow b] \mid a_i \leftarrow a] \]

\[ (\mathit{eval}_2\; P\; \langle a, b \rangle) = (\mathit{map}\; (\mathit{eval}_1\; P)\; a \otimes b) \]


Generalised cross product

Binary product, δ = 1:

\[ a \otimes b = [[\langle a_i, b_j \rangle \mid b_j \leftarrow b] \mid a_i \leftarrow a] \]

Generalized to arbitrary depths:

\[
(v, d_1) \otimes (w, d_2) =
\begin{cases}
[[(v_i, w_j) \mid w_j \leftarrow w] \mid v_i \leftarrow v] & \text{if } d_1 > 0,\ d_2 > 0 \\
[(v_i, w) \mid v_i \leftarrow v] & \text{if } d_1 > 0,\ d_2 = 0 \\
[(v, w_j) \mid w_j \leftarrow w] & \text{if } d_1 = 0,\ d_2 > 0 \\
(v, w) & \text{if } d_1 = 0,\ d_2 = 0
\end{cases}
\]

...and to n operands:

\[ \bigotimes_{i:1 \ldots n} (v_i, d_i) \]
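The four cases of the binary product translate almost line by line into code. This is a minimal Python sketch, assuming each operand is a (value, depth mismatch) pair and that Python lists and tuples stand in for collections and tuples. It covers only the binary case as written on the slide and says nothing about how the n-operand product is folded.

```python
def cross(left, right):
    """Binary generalised cross product of two (value, depth-mismatch) pairs."""
    (v, d1), (w, d2) = left, right
    if d1 > 0 and d2 > 0:
        # Iterate over both operands: a list of rows of pairs.
        return [[(vi, wj) for wj in w] for vi in v]
    if d1 > 0:
        # d2 == 0: w is passed whole to every iteration over v.
        return [(vi, w) for vi in v]
    if d2 > 0:
        # d1 == 0: v is passed whole to every iteration over w.
        return [(v, wj) for wj in w]
    # No iteration at all: a single pair.
    return (v, w)


both = cross((["a", "b"], 1), ([1, 2, 3], 1))
left_only = cross((["a", "b"], 1), ([1, 2, 3], 0))
neither = cross(("a", 0), ([1, 2, 3], 0))
```

In `left_only` the second operand has no mismatch, so the whole list [1, 2, 3] appears unchanged in every pair, which mirrors how v2 is carried whole through the product in the three-port example.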


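Finally: general functional semantics for collection-based processing

\[
(\mathit{eval}_l\; P\; \langle (v_1, d_1), \ldots, (v_n, d_n) \rangle) =
\begin{cases}
(P\; \langle v_1, \ldots, v_n \rangle) & \text{if } l = 0 \\
(\mathit{map}\; (\mathit{eval}_{l-1}\; P)\; \bigotimes_{i:1 \ldots n} (v_i, d_i)) & \text{if } l > 0
\end{cases}
\]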
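One way to experiment with the general semantics is to evaluate in two phases: build the n-operand product first, then apply P to every tuple at the leaves. The Python sketch below does this under explicit assumptions: operands are iterated left to right, an operand with mismatch d is unfolded d levels, and the total iteration depth is the sum of the mismatches. This is one reading of the recursion, not a transcription of it, and the processor `P` is hypothetical.

```python
def product(operands):
    """n-operand cross product of (value, depth-mismatch) pairs.

    Assumption: operands are unfolded left to right; the result is a nested
    list of depth sum(d) whose leaves are tuples with one entry per operand.
    """
    for k, (v, d) in enumerate(operands):
        if d > 0:
            before, after = operands[:k], operands[k + 1:]
            return [product(before + [(x, d - 1)] + after) for x in v]
    return tuple(v for v, _ in operands)


def map_at(level, f, value):
    """Apply f to every element found at nesting depth `level`."""
    if level == 0:
        return f(value)
    return [map_at(level - 1, f, x) for x in value]


def eval_general(P, operands):
    level = sum(d for _, d in operands)
    return map_at(level, lambda args: P(*args), product(operands))


def P(x1, x2, x3):
    """Hypothetical processor: X1 and X3 atomic (dd = 0), X2 a list (dd = 1)."""
    return f"{x1}|{','.join(x2)}|{x3}"


v1, v2, v3 = ["a", "b"], ["k1", "k2"], ["x", "y", "z"]
w = eval_general(P, [(v1, 1), (v2, 0), (v3, 1)])
```

Here `w[i][j]` is computed from `v1[i]`, the whole of `v2`, and `v3[j]`, matching lineage(wij) = < v1i, v2, v3j > for the three-port processor.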


Static mapping of output to input values

The iteration structure can be determined statically:

dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1

[Figure: processor P with input ports X1, X2, X3 annotated (0,1), (1,1), (0,1) and output port Y annotated (0,2)]

This leads to a simple mapping rule:

index of an output list value ➔ {index of input values}

Y[i.j] → X1[i], X2[], X3[j]

In general, an output index [i1 . i2 . ... . ik] is read left to right as consecutive segments of lengths δ1, δ2, ..., δk, one segment for each input port X1, X2, ..., Xk.

Workflow graph as index for query processing

1. traverse the workflow graph (small) rather than the provenance trace (large)
2. use static prediction of iterations to trace through collection elements
3. “parachute” into the actual trace only at the end
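The three steps can be sketched on a toy workflow graph. Everything in this Python example is hypothetical: the processors P, Q, R, their mismatches, the arcs, and the contents of `TRACE`. It also assumes that index components beyond a processor's total mismatch are simply dropped, since they address structure produced inside the black box.

```python
# Hypothetical workflow graph: per-processor input ports with their depth
# mismatch, and arcs from an input port back to the upstream processor.
DELTAS = {
    "P": {"X1": 1, "X2": 0, "X3": 1},
    "Q": {"X": 1},
    "R": {"X": 0},
}
UPSTREAM = {("P", "X1"): "Q", ("P", "X3"): "R"}

# Hypothetical stored trace, keyed by (processor, port, index path).
TRACE = {("Q", "X", (1,)): "q-in-1", ("R", "X", ()): "r-in"}


def project(index, deltas):
    """Cut an output index into one segment per input port (steps 1 and 2)."""
    out, pos = {}, 0
    for port, delta in deltas.items():
        out[port] = index[pos:pos + delta]
        pos += delta
    return out


def lineage(processor, index, focus):
    """Walk the workflow graph backwards, reporting only at focus processors."""
    found = []
    for port, idx in project(index, DELTAS[processor]).items():
        if processor in focus:
            found.append((processor, port, idx))
        upstream = UPSTREAM.get((processor, port))
        if upstream is not None:
            found.extend(lineage(upstream, idx, focus))
    return found


hits = lineage("P", [1, 4], focus={"Q", "R"})
# Step 3: look values up in the trace only for the final (port, index) pairs.
values = [TRACE[(p, port, tuple(i))] for p, port, i in hits]
```

The traversal touches one node per processor regardless of list sizes; the trace is read once per reported port, at the very end.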

[Figure: the workflow graph (processors P, Q, R) and the provenance graph for y = [ [y11 ... y1n], ... [ym1 ... ymn] ], with the element indexes [1][n], [] and [][1] annotated on the workflow graph ports over successive builds]


From granularity to efficient query processing

Summary so far:

• whenever iterations are involved, we can trace the provenance of individual elements of a processor’s output

• iterations are explained in terms of a functional model and based on list depth discrepancies

• The relationships between output and input indexes are derived using the workflow specification graph (statically)

How about expressivity and efficient processing of lineage queries?


Lineage query model /1

I - Focusing: Not all processors are interesting:
  – report lineage only at specified nodes in the graph


Lineage query model /2

List-structured KEGG gene ids:

[ [ mmu:26416 ], [ mmu:328788 ] ]

II - Granularity: Trace lineage for individual elements within collections
  – when possible!


Lineage query model and language

<pquery>
  <scope workflow="keggPathways">
    <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
  </scope>
  <select>
    <outputPort name="paths_per_gene" index="[1,2]"/>
    <outputPort name="paths_per_gene" index="[3,4]"/>
    <outputPort name="commonPathways" index="[1]"/>
    <processor name="getPathwayDescriptions">
      <outputPort name="return"/>
    </processor>
  </select>
  <focus>
    <processor name="get_pathway_by_genes" />
  </focus>
</pquery>

• scope: workflow scope; optionally specifies one or more runs for the target workflow (defaults to latest run)
• select: port values for which lineage is sought: global outputs or processor-qualified
• focus: processors where lineage is to be reported, possibly workflow-qualified
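To make the structure of such a query concrete, the sketch below reads a shortened copy of the example with Python's standard `xml.etree.ElementTree` and extracts the scope, the selected ports, and the focus processors. It only parses the XML; it is not the query processor itself, and the tuple layout used for `selected` is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

PQUERY = """<pquery>
  <scope workflow="keggPathways">
    <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
  </scope>
  <select>
    <outputPort name="paths_per_gene" index="[1,2]"/>
    <outputPort name="commonPathways" index="[1]"/>
    <processor name="getPathwayDescriptions">
      <outputPort name="return"/>
    </processor>
  </select>
  <focus>
    <processor name="get_pathway_by_genes"/>
  </focus>
</pquery>"""

root = ET.fromstring(PQUERY)

# Scope: target workflow and, optionally, explicit runs.
workflow = root.find("scope").get("workflow")
runs = [run.get("id") for run in root.findall("scope/run")]

# Select: (processor or None for a global output, port name, index or None).
selected = []
select = root.find("select")
for port in select.findall("outputPort"):
    selected.append((None, port.get("name"), port.get("index")))
for proc in select.findall("processor"):
    for port in proc.findall("outputPort"):
        selected.append((proc.get("name"), port.get("name"), port.get("index")))

# Focus: processors at which lineage is reported.
focus = [p.get("name") for p in root.findall("focus/processor")]
```

A port without an `index` attribute, such as `return` here, yields `None`, which a caller could treat as a request for the whole value rather than a single element.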


Fine granularity + efficient processing

• Scalability:
  – query time depends on size of workflow graph, not size of provenance graph
  – workflow graphs are small, fit in memory, can be indexed easily
• Graceful degradation:
  – worst case is a completely unfocused query
  – no worse than other approaches
• Fine-grain answers provided at the same time


Outline

• Assumption:
  – Black box provenance of workflow data products
✓ Fine-grained provenance:
  – tracking provenance through collection elements
  – motivation, functional model of collection-oriented workflow processing
✓ Efficient query processing:
  ✓ leveraging the functional model to achieve efficient processing for a simple query model
➡ Experimental evaluation


Experimental setup - I

• Performance evaluation performed on programmatically generated dataflows
  – the “T-towers”

Parameters:
- size of the lists involved
- length of the paths
- includes one cross product


Experimental results - II

[Figure: workflow pre-processing time by graph size. x-axis: path length l (10, 28, 50, 75, 100, 150, 200); y-axis: time (ms), 0 to 3500; data labels: 124, 256, 440, 700, 1000, 2024, 3257.]

Performance degradation on fully unfocused queries

[Figure: response times for PP on unfocused queries (l=150). x-axis: % of processors in target set (1.33% to 46.67%); y-axis: time (ms), 0 to 130.]

[Figure: response time against path length l (10 to 150); y-axis: time (ms), 0 to 225. Two panels: d=10 (10 elements in input list; series NI, NGQ, PP) and d=150 (150 elements in input list; series NR, NGQ, PP).]

Legend, in series order: naive traversal of provenance graph; single multi-join query; workflow as index (our approach).


Summary

• A simple lineage query model for Taverna
  – grounded in the semantics of collection-oriented processing
  – combines fine-grain answers with efficient query processing
• Ongoing work:
  – space compression, indexing
  – QLP?
  – semantic provenance (initial paper submitted)
• Currently part of the Taverna 2.1 release
