Granular workflow provenance in Taverna. Paolo Missier, Information Management Group, School of Computer Science, University of Manchester, UK. Symposium on Provenance in Scientific Workflows, Salt Lake City, Oct. 2008

Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008

May 11, 2015

Page 1: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008

Granular workflow provenance in Taverna

Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK

Symposium on Provenance in Scientific Workflows
Salt Lake City, Oct. 2008

Page 2:

Outline

• Collection values in [bioinformatics] workflows are important
• Granular provenance over collections: model and issues
• Measuring “provenance friendliness” of dataflows
• Increasing friendliness of existing dataflows
• Extending the Open Provenance Model graph to describe granular data derivations
• Provenance service architecture - brief description

Page 3:

IPAW'08 – Salt Lake City, Utah, June 2008

Example (Taverna) dataflow

QTL -> genes -> Kegg pathways

Page 10:

Collections example: from genes to SNPs

Workflow steps:
• gene -> genomic region
• extend region
• retrieve SNPs in the region
• rearrange SNP details

• See myexperiment.org: http://www.myexperiment.org/workflows/166

[ ENSG00000139618 , ENSG00000083093 ]

[[<1,23554512,16,rs45585833>, <1,23554712,16,rs45594034>,...],[<1,31820153,13,ENSSNP10730823>, <1,31818497,13,ENSSNP10730820>,...] ]

Page 11:

Computational model for collections

Depth mismatch between declared / offered type:

type(P4:X1) = s but type(a) = list(s)

type(P4:X2) = type(c) = list(s)

type(P4:X3) = s but type(c) = list(s)

Execution at P4:

Y = (map P1 <(a ⊗ b) , c>) // cross product

Y = [ (P1 <a1,b1,c>) ... (P1 <an,bm,c>) ]
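
The execution rule above can be sketched in Python. This is a minimal illustration and not Taverna's implementation: the names `iterate_cross` and `sample_proc` are hypothetical, and the result is kept flat, as it is written on the slide.

```python
from itertools import product

def iterate_cross(proc, iterated_inputs, whole_inputs):
    """Call proc once per combination of elements of the iterated lists.

    iterated_inputs: lists whose offered depth exceeds the declared depth
                     (a and b on the slide), combined by cross product.
    whole_inputs:    values whose depth matches the declaration (c on the
                     slide); they are passed unchanged to every call.
    """
    return [proc(*combo, *whole_inputs) for combo in product(*iterated_inputs)]

# Hypothetical processor with two scalar ports and one list port.
def sample_proc(x1, x2, x3):
    return (x1, x2, tuple(x3))

a, b, c = ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"]
y = iterate_cross(sample_proc, [a, b], [c])
print(len(y))  # one call per (a_i, b_j) pair
```

Each call sees one element of a, one element of b, and the whole of c, which is the behaviour the depth mismatch is meant to trigger.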

Page 23:

Collections and iterations

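[Figure: the genes-to-SNPs dataflow, annotated with processor signatures and with the values bound to each port]

Processor signatures: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s

Port values shown in the figure:
[139618, 83093]
[139618, 83093]
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809] (combined by dot product)
Per-iteration values: 139618, 83093
<16, 23560179,..> → [ <1,23553692,16,rs152451>,...]
<13, 31871809,...> → [<1,31840948,13,rs169546>,...]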
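
The dot product used in this dataflow pairs elements by position instead of forming all combinations. The following is a minimal sketch under that reading; `iterate_dot` is a hypothetical name, and the two input lists are the values labelled [16,13] and [23560179, 31871809] on the slide.

```python
def iterate_dot(proc, *lists):
    """Call proc on the elements that share the same index (dot product)."""
    lengths = {len(xs) for xs in lists}
    if len(lengths) > 1:
        raise ValueError("dot product needs lists of equal length")
    return [proc(*args) for args in zip(*lists)]

left = [16, 13]
right = [23560179, 31871809]
pairs = iterate_dot(lambda x1, x2: (x1, x2), left, right)
print(pairs)
```

Because element i of the result is built only from element i of each input, the index alone is enough to trace an output item back to its inputs.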

Page 24:

Tracing granular lineage

• Provenance traces are most useful when they are granular
  – trace individual items in a collection
  – “which geneID is responsible for the presence of SNP rs169546 in the output?”
• Curse of black box processors:
  – M-M (many-many) and M-1 (many-one) processors destroy granularity

Page 25:

Granular lineage I: no loss of precision

[Figure: P0 has outputs Y1: l(s) and Y2: l(s). Y1 feeds P1 (X: s → Y: s) and Y2 feeds P2 (X: s → Y: s); both outputs feed P3 (X1: s, X2: s → Y), which iterates using a cross product.]

$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, 2X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$

Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.

Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = [2b_1 \ldots 2b_m]$, $P_3{:}Y = [a_1^2 + 2b_1 \ldots a_n^2 + 2b_m]$.

And $\mathrm{lineage}(P_3{:}Y[i], \{P_0\}) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2[j] \}$
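
The case above can be replayed with concrete numbers. This sketch assumes that cross-product results are addressed by a pair of indices (i, j), which is one reading of the slide's notation, and it uses 0-based indices; the function names are illustrative only.

```python
def p1(x):
    return x * x      # P1: square

def p2(x):
    return 2 * x      # P2: double

def p3(x1, x2):
    return x1 + x2    # P3: sum

def run(a, b):
    """Execute P1 and P2 element-wise, then P3 over the cross product."""
    y1 = [p1(x) for x in a]
    y2 = [p2(x) for x in b]
    return [[p3(u, v) for v in y2] for u in y1]

def lineage(i, j):
    """P1 and P2 are one-to-one, so indices propagate back to P0 unchanged."""
    return {("P0:Y1", i), ("P0:Y2", j)}

y = run([1, 2, 3], [10, 20])
print(y[2][1], lineage(2, 1))
```

No information is lost because every processor on the path maps one element to one element.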

Page 27:

Granular lineage II: loss of precision

[Figure: the same dataflow, except that P2 now has signature X: l(s) → Y: s, so it consumes the whole list P0:Y2.]

$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, \min X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$

Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.

Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = c = \min \{b_1 \ldots b_m\}$, $P_3{:}Y = [a_1^2 + c \ldots a_n^2 + c]$.

And $\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2 \}$
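
A small sketch of this case, with 0-based indices and illustrative names. A path is written as a tuple, and the empty tuple stands for "the whole list", which is an encoding chosen here and not one taken from the talk.

```python
def run_min_case(a, b):
    y1 = [x * x for x in a]   # P1, applied element by element
    c = min(b)                # P2 consumes the whole list b
    return [u + c for u in y1]

def lineage_min_case(i):
    # P2 is a black box: nothing says which b_j produced c,
    # so the trace can only point at the entire list P0:Y2.
    return {("P0:Y1", (i,)), ("P0:Y2", ())}

y = run_min_case([1, 2], [7, 5, 9])
print(y, lineage_min_case(1))
```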

Page 32:

III: recoverable loss of precision

[Figure: the same dataflow, except that P2 now has signature X: l(s) → Y: l(s) and carries the annotation “f is index-preserving”.]

$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, f\, X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$

Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.

Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = c = [c_1 \ldots c_m]$, $P_3{:}Y = [a_1^2 + c_1 \ldots a_m^2 + c_m]$.

Without the annotation: $\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2 \}$

With the annotation “f is index-preserving”: $\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2[i] \}$
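
The effect of the annotation on the lineage answer can be sketched as a single switch. The function name and the tuple encoding of paths (empty tuple for the whole list, 0-based indices) are assumptions made for this illustration.

```python
def lineage_with_annotation(i, f_is_index_preserving):
    """Lineage of P3:Y[i] back to P0, with or without the annotation on P2."""
    if f_is_index_preserving:
        y2_path = (i,)   # element i of P2's output comes from element i of its input
    else:
        y2_path = ()     # black box: fall back to the whole list
    return {("P0:Y1", (i,)), ("P0:Y2", y2_path)}

coarse = lineage_with_annotation(1, False)
fine = lineage_with_annotation(1, True)
print(coarse, fine)
```

The processor itself is unchanged; only the declared knowledge about it differs, and that is enough to recover the element-level trace.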

Page 33:

Multi-level nesting and lineage precision

Page 39:

Adding annotations to the original workflow

[Figure: the genes-to-SNPs dataflow with the processor signatures and port values of the “Collections and iterations” slide, plus two “f is index-preserving” annotations. The output elements CR:result[0,i] and CR:result[1,j] are highlighted.]

geneIdList: [139618, 83093]

Without annotations:
lineage(CR:result[0,i]) = { geneIdList }
lineage(CR:result[1,j]) = { geneIdList }

With the “f is index-preserving” annotations:
lineage(CR:result[0,i]) = { geneIdList[0] }
lineage(CR:result[1,j]) = { geneIdList[1] }

Page 40:

Granular lineage: recap

• Lineage query model accounts for granular traces over nested collections
• Arbitrary nesting levels:
  – values are trees in general
  – lineage query identifies the correct sub-trees
• Lineage queries are efficient
  – recursion problem “compiled away” by query rewriting
  – (shameless claim - details omitted)
• But:
  – One single M-* processor can destroy granularity
  – in some cases annotations are a remedy
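
The idea that values are trees and that a lineage answer names a sub-tree can be made concrete with a path lookup. This is only a sketch: `subtree` is a hypothetical helper, paths are 0-based tuples, and the sample value is made up.

```python
def subtree(value, path):
    """Follow a path of indices into a nested list and return what it names."""
    for index in path:
        value = value[index]
    return value

nested = [["s00", "s01"], ["s10", ["s110", "s111"]]]
print(subtree(nested, ()))         # the whole value
print(subtree(nested, (1,)))       # one sub-list
print(subtree(nested, (1, 1, 0)))  # a single leaf
```

A shorter path denotes a coarser answer, so the length of the path returned by a lineage query is a direct indication of how much granularity survived.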

Page 43:

Towards provenance-friendly workflows

1. Define metrics for workflow provenance precision
  – how well is granularity preserved over a lineage trace?
  – what is the impact of M-* processors?
  – use to prioritize remedial actions

2. Make workflows more provenance friendly:
  – Add knowledge (static):
    • “lightweight annotations” [MBZ+08] -- see IPAW08
  – Add knowledge (dynamic):
    • provenance-active workflow processors
  – Redesign processors / workflow
    • general guidelines, provenance friendly patterns

[MBZ+08] Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble, Data lineage model for Taverna workflows with lightweight annotation requirements, Procs. International Provenance and Annotation Workshop (IPAW 2008)

Page 50:

Lineage precision: example

[Figure: example dataflow whose links carry the lists a = [a1, a2], b = [b1, b2], c = [c1, c2, c3], d = [d1, d2], e = [e1, e2], and a value f]

lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }

precision = (1 + .5 + .5) / 3 = 2/3
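
The arithmetic on this slide is an unweighted average of one score per variable in the lineage answer. The sketch below takes the three scores exactly as given on the slide (1 for the single element a1, .5 each for the whole lists c and e) and does not attempt to derive them.

```python
from fractions import Fraction

# Per-variable scores copied from the slide, not computed here.
scores = {
    "P0:Y[1]": Fraction(1),     # a single element was identified
    "P2:X": Fraction(1, 2),     # only the whole list c is known
    "P3:X": Fraction(1, 2),     # only the whole list e is known
}

precision = sum(scores.values()) / len(scores)
print(precision)  # 2/3
```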

Page 52:

Precision relative to a sub-graph

• Refining the previous idea:
  – precision relative to a set O of output variables and a set I of input variables
    • because not all variables are equally interesting...
    • weights WI, WO account for relative importance of variables

[Figure: dataflow sub-graph with input variables I1, I2 and output variables O1, O2, O3]

$$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$

$$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j:1 \ldots |O|} \left( W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)} \right)$$

Page 53:

Impact of M-* processors on precision

[Figure: the same sub-graph, with input variables I1, I2, output variables O1, O2, O3 and an M-* processor P]

Count the number of variables in O that can be reached from P
• weighted sum

$$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$
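
The two definitions translate directly into a reachability search followed by a weighted sum. The graph, node names, and weights below are invented for illustration; only the definitions of reach and impact come from the slide.

```python
def reachable(graph, start):
    """Return every node that can be reached from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen

def impact(graph, processor, weights):
    """Weighted count of the output variables reachable from the processor."""
    reached = reachable(graph, processor)
    return sum(w for output, w in weights.items() if output in reached)

# Hypothetical dataflow: P feeds O1 and O2, Q feeds O3.
graph = {"I1": ["P"], "I2": ["Q"], "P": ["O1", "O2"], "Q": ["O3"]}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}
print(impact(graph, "P", weights), impact(graph, "Q", weights))
```

A processor with a higher impact affects more of the outputs the user cares about, so it is the better candidate for annotation or refactoring.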

Page 54:

Improving provenance precision

• Impact used to prioritize user actions on processors
• Precision used to assess improvement
• add index-preserving annotations
  ✓ illustrated earlier
• refactor M-* processors
• make processors provenance-active

Page 62:

Refactoring M-* → 1-1

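[Figure: the genes-to-SNPs dataflow after refactoring. The processor signatures l(s) → l(s), l(s) → l(s), s → s, s → l(s), s → s are shown together with an additional s → s processor that is invoked once per gene ID.]

Port values shown in the figure:
[139618, 83093]
[139618, 83093]
Per-iteration inputs of the s → s processor: 139618, 83093
Per-iteration outputs of the s → s processor: <16, 23520984>, <13, 31786617>
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809] (combined by dot product)
<16, 23560179> → [ <1,23553692,16,rs152451>,...]
<13, 31871809> → [<1,31840948,13,rs169546>,...]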
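
The refactoring replaces a list-to-list black box with a scalar processor that the engine applies once per element, so the element-level dependency becomes visible to the engine. The sketch below is an assumption-laden illustration: the lookup table stands in for the real service and contains only the two pairs shown on the slide, and all function names are hypothetical.

```python
# Stand-in for the remote service; values copied from the slide.
REGIONS = {139618: (16, 23520984), 83093: (13, 31786617)}

def gene_to_region(gene_id):
    """Scalar (s -> s) processor: one gene ID in, one region out."""
    return REGIONS[gene_id]

def iterate_with_lineage(proc, items):
    """Apply a scalar processor per element and record index-level lineage."""
    outputs, depends_on = [], {}
    for i, item in enumerate(items):
        outputs.append(proc(item))
        depends_on[i] = i   # output i was produced from input i only
    return outputs, depends_on

regions, depends_on = iterate_with_lineage(gene_to_region, [139618, 83093])
print(regions, depends_on)
```

With the original l(s) → l(s) signature the engine would see one call with a whole list and could not record `depends_on` at all.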

Page 63:

Provenance-active processors

– Passive processors do not contribute explicit provenance info
– provenance-active processors actively feed metadata to the lineage service

[Figure: two processors P. The first maps X: l(s) = [a1, a2, a3] to Y: s = b; the second maps X: l(s) = [a1, a2, a3] to Y: l(s) = [b1, b2].]

Examples of static and dynamic annotations shown on the slide:
• aggregation f(): b = f(X[1]...X[k])
• b = X[i]
• P is index-preserving
• sorting: Y = Π(X)
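
A provenance-active processor can be pictured as an ordinary function that also calls back into the lineage service. In this sketch the `report` callback, its argument layout, and both processors are hypothetical; they illustrate the selection case b = X[i] and the sorting case Y = Π(X) with 0-based indices.

```python
assertions = []

def report(out_var, out_path, in_var, in_path):
    """Stand-in for the lineage service: just collect the assertions."""
    assertions.append((out_var, out_path, in_var, in_path))

def active_min(xs, report):
    """Return min(xs) and say which element it was (b = X[i])."""
    i = min(range(len(xs)), key=lambda k: xs[k])
    report("Y", (), "X", (i,))
    return xs[i]

def active_sort(xs, report):
    """Return sorted xs and report the permutation that was applied."""
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    for out_i, in_i in enumerate(order):
        report("Y", (out_i,), "X", (in_i,))
    return [xs[k] for k in order]

b = active_min([7, 5, 9], report)
print(b, assertions)
```

A passive version of either function would return the same value but leave `assertions` empty, and the lineage service would have to treat the processor as a black box.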

Page 64:

Open Provenance Model

• A graph notation to represent process provenance
  – independent of the provenance producers
  – suitable for exchanging provenance across different workflow systems
• State: draft 1.01 (July 2008)

Page 69:

Mapping to OPM - granularity issue

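[Figure: on the left, a dataflow in which P0 (outputs Y1, Y2) feeds P1 (X: s → Y: s) and P2 (X: s → Y: s), with values a, b, c, d, e, f on the links. On the right, the corresponding OPM graph over processes P0, P1, P2 and artifacts a, b, c, d, with used and wgb (wasGeneratedBy) edges, a wasDerivedFrom edge between whole artifacts, and a second wasDerivedFrom edge between the elements b[p] and d[p’].]

How can this granular dependency be described for all arbitrary paths p?

Currently cannot be expressed using OPM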

Page 72:

Path mapping rules

[Figure: OPM graph over processes P1, P2, P3 and artifacts a, b, c, d, with used, wgb and wasDerivedFrom edges; the actual lineage links the elements b[p] and d[p’].]

Static graph structure sufficient to provide this (in Taverna)

But this is only known at query time
(extensional enumeration not an option)

Observation:
• only need to consider individual processor transformations
• exploit local processor rules for propagating granular lineage

Hint: granularity is only determined by depth of the path

At query time, the Taverna lineage query algorithm encodes a path mapping rule to compute p’ given p
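
One way to picture a local path mapping rule is as a split of the output path according to how many nesting levels each input contributed to the iteration. The sketch below is an assumption about what such a rule could look like for a cross-product iteration, not the algorithm from the talk; `map_path_cross` and its parameters are hypothetical, and indices are 0-based.

```python
def map_path_cross(output_path, iteration_depths):
    """Derive one input path p' per input port from an output path p.

    iteration_depths[k] is the number of nesting levels that input k
    contributed to the iteration (offered depth minus declared depth).
    A port with depth 0 was consumed whole, so its path is empty.
    """
    paths, start = [], 0
    for depth in iteration_depths:
        paths.append(tuple(output_path[start:start + depth]))
        start += depth
    return paths

# Two list inputs iterated one level each: Y[1,2] comes from X1[1] and X2[2].
print(map_path_cross((1, 2), [1, 1]))
# Second input consumed whole: only the first input path is granular.
print(map_path_cross((1,), [1, 0]))
```

Only the depths are needed, which are known from the static workflow graph, so the rule can be applied per processor at query time without enumerating element-level edges in advance.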

Page 75:

Architecture provenance-active processors

[Figure: architecture diagram with the components Taverna workflow engine (inputs, outputs), external services, provenance manager, provenance information repository, lineage query interface and p-active API. Provenance events are a labelled flow, and the lineage query interface is labelled with the query form lin( P:Y, , Psel, E(D)).]

Content of provenance events:

1. Common content:
  – processor execution details
  – binding of input/output variables to values
  – completion status

2. Optional content for provenance-active processors:
  – explicit output → input dependency assertions:
    let I, O be the input, resp. output variables set
    depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
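
The two kinds of event content can be pictured as a record with mandatory fields and an optional list of dependency assertions. Every class name, field name, and the dependency type string below is hypothetical; the talk only specifies what information is carried, not how it is encoded.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Depends:
    """One assertion depends(Y, X[p], depType), with X an input and Y an output."""
    output_var: str
    input_var: str
    input_path: tuple
    dep_type: str

@dataclass
class ProvenanceEvent:
    # Common content, sent for every processor execution.
    processor: str
    bindings: dict
    status: str
    # Optional content, filled in only by provenance-active processors.
    depends: list = field(default_factory=list)

passive = ProvenanceEvent("P1", {"X": 3, "Y": 9}, "completed")
active = ProvenanceEvent(
    "P2",
    {"X": [7, 5, 9], "Y": 5},
    "completed",
    depends=[Depends("Y", "X", (1,), "selection")],
)
print(passive.depends, active.depends)
```

Keeping the assertions optional means passive processors need no changes, while active ones can add element-level detail to the same event stream.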

Page 76:

Ongoing work

• Experimental evaluation:
  – to what extent is granularity a real practical problem?
  – Quantify provenance friendliness by analysing a large collection of workflows from myExperiment
  – Quantify available improvements (i.e. by refactoring)
• Compare collection management in Taverna with other workflow models
  – can we successfully exchange provenance graphs?
• Integration of the provenance service with the new version of Taverna
  – to be released before end of year