Page 1
VisTrails
Second Provenance Challenge
Tommy Ellkvist
David Koop
Juliana Freire
Joint work with: Erik Andersen, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, Cláudio Silva, and Huy T. Vo
Page 2
Outline
VisTrails Introduction
VisTrails Demo
Provenance Model and API
Challenge Results
Issues and Future Work
Page 3
VisTrails
Comprehensive provenance infrastructure for computational tasks
Support for exploratory tasks such as visualization and data mining
Workflows are iteratively refined as users generate and test hypotheses
New change-based provenance model
Uniformly captures data and workflow provenance
Page 4
Change-based Provenance
Provenance is stored as a tree of actions
add module
add connection
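The tree of actions can be sketched as follows. This is a minimal illustration, not the VisTrails implementation: the `Action` and `Workflow` classes, the `materialize` function, and the sample module names are assumptions made for the example. It shows how a workflow version might be rebuilt by replaying the actions on the path from the root to the chosen node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Action:
    """One node of the version tree (hypothetical structure)."""
    id: int
    prev_id: Optional[int]   # parent action; None for the root
    kind: str                # "add_module" or "add_connection"
    payload: tuple           # (module_id, name) or (source_id, destination_id)


@dataclass
class Workflow:
    modules: Dict[int, str] = field(default_factory=dict)
    connections: Set[Tuple[int, int]] = field(default_factory=set)


def materialize(actions: Dict[int, Action], version: int) -> Workflow:
    """Rebuild the workflow at `version` by replaying actions from the root."""
    # Walk parent pointers up to the root, then replay in reverse order.
    path: List[Action] = []
    current: Optional[int] = version
    while current is not None:
        action = actions[current]
        path.append(action)
        current = action.prev_id

    workflow = Workflow()
    for action in reversed(path):
        if action.kind == "add_module":
            module_id, name = action.payload
            workflow.modules[module_id] = name
        elif action.kind == "add_connection":
            workflow.connections.add(action.payload)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return workflow


# Two branches share actions 1 and 2; each leaf is a distinct workflow version.
# Module names other than vtkProperty are illustrative.
tree = {
    1: Action(1, None, "add_module", (10, "vtkActor")),
    2: Action(2, 1, "add_module", (12, "vtkProperty")),
    3: Action(3, 2, "add_connection", (10, 12)),
    4: Action(4, 2, "add_module", (14, "vtkRenderer")),
}
v3 = materialize(tree, 3)
v4 = materialize(tree, 4)
```

Because versions 3 and 4 share their first two actions, the common prefix is stored once and only the differing actions are added per branch.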
Page 5
Provenance: Storing Actions
Each change writes new actions to the tree

    <action id="27" prevId="26" user="dakoop" date="2007-06-20">
      <add what="module" objectId="12">
        <module id="12" name="vtkProperty" cache="1">
          <location id="17" x="-7.0" y="97.0"/>
        </module>
      </add>
      <add what="connection" objectId="13">
        <connection id="13">
          <port type="source" moduleId="10"/>
          <port type="destination" moduleId="12"/>
        </connection>
      </add>
    </action>
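As a rough illustration of how such an action record could be read, the sketch below parses the XML from this slide with Python's standard `xml.etree.ElementTree` module. The `summarize_action` helper and its output format are assumptions for the example; the real VisTrails loader may work quite differently.

```python
import xml.etree.ElementTree as ET

# The action record from the slide, with plain ASCII quotes.
ACTION_XML = """
<action id="27" prevId="26" user="dakoop" date="2007-06-20">
  <add what="module" objectId="12">
    <module id="12" name="vtkProperty" cache="1">
      <location id="17" x="-7.0" y="97.0"/>
    </module>
  </add>
  <add what="connection" objectId="13">
    <connection id="13">
      <port type="source" moduleId="10"/>
      <port type="destination" moduleId="12"/>
    </connection>
  </add>
</action>
"""


def summarize_action(xml_text):
    """Return (action id, parent action id, list of readable changes)."""
    action = ET.fromstring(xml_text.strip())
    changes = []
    for add in action.findall("add"):
        what = add.get("what")
        if what == "module":
            module = add.find("module")
            changes.append(
                f"add module {module.get('id')} ({module.get('name')})")
        elif what == "connection":
            # Map port type ("source"/"destination") to the module it touches.
            ports = {port.get("type"): port.get("moduleId")
                     for port in add.find("connection").findall("port")}
            changes.append(
                f"add connection {ports['source']} -> {ports['destination']}")
    return action.get("id"), action.get("prevId"), changes


action_id, prev_id, changes = summarize_action(ACTION_XML)
```

The `prevId` attribute is what links each action to its parent in the tree, so following it repeatedly leads back to the root.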
Page 6
Change-based Provenance
Data provenance: where does a specific data product come from?
Workflow evolution: how has workflow structure changed over time?
Treat workflow versions as data: store provenance of workflows
Page 7
Layered Provenance
Page 8
Layered Provenance
Page 9
Layered Provenance
Page 10
Layered Provenance
Page 11
VisTrails Provenance
Normalized information: no redundancy!
Each layer provides more specific information but refers to parent layers
  Workflow Evolution, Workflow, Execution
Extensible storage options
  Support for both relational and XML
Flexible annotation framework: users can specify application-specific provenance information
Page 12
Provenance for Reproducibility and Beyond
Infrastructure for querying and reusing provenance
  Query workflows by example
  Create workflows by analogy
Collaborative exploration
Scalable derivation of data products
Page 13
VisTrails Demo
Page 14
Supporting Different Provenance Backends
VisTrails has powerful tools to query and reuse provenance information
There are many powerful workflow systems that produce such information
Problem: How to integrate different provenance backends?
Our approach: mediation-based provenance interoperability
Page 15
Mediator Architecture
Mapping from global schema to data source specific schema
Page 16
Mediated Provenance
Mapping from general model to engine-specific model
Page 17
Combining Provenance
Establish model
Produce an API for this model
Wrap provenance access for each system so that queries become native over their provenance data
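The three steps above (a model, an API over it, and one wrapper per system) can be sketched as below. This is only an illustration under stated assumptions: `ProvenanceWrapper`, `InMemoryWrapper`, `Mediator`, the prefix-based routing, and the sample execution data are all hypothetical and are not taken from the VisTrails code base.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ProvenanceWrapper(ABC):
    """Common interface that each system-specific wrapper implements."""

    @abstractmethod
    def get_executed_modules(self, wf_exec_id: str) -> List[str]:
        """Return the ids of the modules run in the given execution."""


class InMemoryWrapper(ProvenanceWrapper):
    """Stand-in for a real wrapper, which would issue XPath or SPARQL."""

    def __init__(self, executions: Dict[str, List[str]]):
        self._executions = executions

    def get_executed_modules(self, wf_exec_id: str) -> List[str]:
        return list(self._executions.get(wf_exec_id, []))


class Mediator:
    """Routes a query to the wrapper registered for the id's prefix."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, ProvenanceWrapper] = {}

    def register(self, prefix: str, wrapper: ProvenanceWrapper) -> None:
        self._wrappers[prefix] = wrapper

    def get_executed_modules(self, qualified_id: str) -> List[str]:
        # Ids are assumed to look like "<prefix>:<local id>".
        prefix, _, local_id = qualified_id.partition(":")
        if prefix not in self._wrappers:
            raise KeyError(f"no wrapper registered for prefix {prefix!r}")
        modules = self._wrappers[prefix].get_executed_modules(local_id)
        # Re-qualify results so ids stay unique across systems.
        return [f"{prefix}:{module}" for module in modules]


# Made-up data: two systems, each answering through its own wrapper.
mediator = Mediator()
mediator.register("vt3", InMemoryWrapper({"run1": ["0", "1"]}))
mediator.register("pas2", InMemoryWrapper({"runA": ["softmean"]}))

vt_modules = mediator.get_executed_modules("vt3:run1")
pas_modules = mediator.get_executed_modules("pas2:runA")
```

Callers only see the shared interface; each wrapper is free to translate the call into whatever query language its backend understands.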
Page 18
Provenance Model
Follows the layered architecture
Versions map to workflows
Workflows are modeled as graphs
Parameters capture module state
User-defined annotations are available at each layer of the model
Module Definition stores information about the computational pieces
Page 19
Provenance Model
Page 20
Provenance API
Implements common access queries and operations over the provenance model
Examples:
getParent(module)
getChildren(module)
getUpstream(module)
getDownstream(module)
getAnnotations(module | workflow | …)
getDataItems(module_exec)
getParameters(module)
getVersion(time)
getExecutedModules(workflow)
getConnection(data_item)
getPorts(connection)
findModulesByParameter(search_params)
findModulesByAnnotation(search_params)
findExecutionsByAnnotation(search_params)
findVersionsByModules(search_params)
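Since workflows are modeled as graphs, calls such as getUpstream and getDownstream amount to graph traversals. The sketch below assumes they return the transitive set of modules reachable against or along the connections; the slides do not define the exact semantics, and the function names, the edge-list representation, and the sample workflow are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Connection = Tuple[str, str]  # (source module, destination module)


def build_adjacency(connections: Iterable[Connection],
                    reverse: bool = False) -> Dict[str, List[str]]:
    """Index connections by source (or by destination when reverse=True)."""
    adjacency: Dict[str, List[str]] = {}
    for source, destination in connections:
        if reverse:
            source, destination = destination, source
        adjacency.setdefault(source, []).append(destination)
    return adjacency


def reachable(start: str, adjacency: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search for every module reachable from `start`."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def get_upstream(module: str, connections: Iterable[Connection]) -> Set[str]:
    """All modules whose output can flow into `module`."""
    return reachable(module, build_adjacency(connections, reverse=True))


def get_downstream(module: str, connections: Iterable[Connection]) -> Set[str]:
    """All modules that can receive data derived from `module`."""
    return reachable(module, build_adjacency(connections))


# A small fragment shaped like the challenge workflow (illustrative only).
connections = [
    ("align_warp1", "reslice1"),
    ("align_warp2", "reslice2"),
    ("reslice1", "softmean"),
    ("reslice2", "softmean"),
]
upstream_of_softmean = get_upstream("softmean", connections)
downstream_of_align_warp1 = get_downstream("align_warp1", connections)
```

Under this reading, getParent and getChildren would correspond to a single step of the same traversal.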
Page 21
Provenance API Example
getExecutedModules(wf_exec)

VisTrails (XPath)
    def getExecutedModules(self, wf_exec):
        newdataitems = []
        q = '//exec[@id="' + wf_exec.pid.key + '"]/@moduleId'
        dataitems = self.logcontext.xpathEval(q)

Pasoa (XPath)
    def getExecutedModules(self, wf_exec):
        q = "//ps:relationshipPAssertion[ps:localPAssertionId='" + wf_exec.pid.key + "']/ps:relation"
        dataitems = self.context.xpathEval(q)

Taverna (SPARQL)
    def getExecutedModules(self, wf_exec):
        " "
        q = '''
            SELECT ?mi
            FROM <''' + self.path + '''>
            WHERE { <''' + wf_exec.pid.key + '''> <http://www.mygrid.org.uk/provenance#runsProcess> ?mi }
        '''
        return self.processQueryAsList(q, pModuleInstance)
Page 22
Provenance API Results
Implemented queries for each system and a combination of all three
Annotation issues for a couple queries
Example: Query 1 Results
vt3:4 --> vt3:7
vt3:1 --> vt3:4
vt3:0 --> vt3:1
pas2:http://relation.org/softmean --> vt3:0
myg1:urn:www.mygrid.org.uk/process#reslice1 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice2 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice3 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice4 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#align_warp1 --> myg1:urn:www.mygrid.org.uk/process#reslice1
myg1:urn:www.mygrid.org.uk/process#align_warp2 --> myg1:urn:www.mygrid.org.uk/process#reslice2
myg1:urn:www.mygrid.org.uk/process#align_warp3 --> myg1:urn:www.mygrid.org.uk/process#reslice3
myg1:urn:www.mygrid.org.uk/process#align_warp4 --> myg1:urn:www.mygrid.org.uk/process#reslice4
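To make the combined result easier to read, the short sketch below parses the Query 1 edges and picks out those that cross from one system's provenance into another's. Treating the text before the first colon (`vt3`, `pas2`, `myg1`) as the identifier of the originating system is an assumption, as are the helper names `parse_edge` and `system_of`.

```python
from collections import Counter

# Query 1 results as listed on the slide, one "source --> target" per line.
RESULT_LINES = """\
vt3:4 --> vt3:7
vt3:1 --> vt3:4
vt3:0 --> vt3:1
pas2:http://relation.org/softmean --> vt3:0
myg1:urn:www.mygrid.org.uk/process#reslice1 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice2 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice3 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#reslice4 --> pas2:http://relation.org/softmean
myg1:urn:www.mygrid.org.uk/process#align_warp1 --> myg1:urn:www.mygrid.org.uk/process#reslice1
myg1:urn:www.mygrid.org.uk/process#align_warp2 --> myg1:urn:www.mygrid.org.uk/process#reslice2
myg1:urn:www.mygrid.org.uk/process#align_warp3 --> myg1:urn:www.mygrid.org.uk/process#reslice3
myg1:urn:www.mygrid.org.uk/process#align_warp4 --> myg1:urn:www.mygrid.org.uk/process#reslice4
"""


def parse_edge(line):
    """Split 'source --> target' into a (source, target) pair."""
    source, target = (part.strip() for part in line.split("-->"))
    return source, target


def system_of(qualified_id):
    """Return the prefix before the first colon, e.g. 'vt3'."""
    return qualified_id.split(":", 1)[0]


edges = [parse_edge(line) for line in RESULT_LINES.splitlines() if line.strip()]

# Edges whose endpoints live in different provenance stores.
cross_system = [(s, t) for s, t in edges if system_of(s) != system_of(t)]

# Number of edges contributed by each source system.
edges_per_system = Counter(system_of(s) for s, _ in edges)
```

The cross-system edges are the points where the mediator has to match an identifier produced by one system with an identifier consumed by another.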
Page 23
Provenance API Integration
Developed VisTrails Provenance Query Language for first challenge
Plan to integrate API with query language
Plan to integrate query language with VisTrails interfaces
Page 24
Interoperability Issues
Uniquely identifying intermediate results
  Intermediate file names were not specified and varied
Tracing ids is difficult for users; this should be transparent
A common query language should use concepts familiar to users
Mediator vs. Warehousing approach
Page 25
Performance Issues
Redundant information can make queries inefficient
What is the best storage backend?
  RDBMS vs. XML database?
What is the best data model?
  XML vs. Relational vs. RDF?
Need good benchmarks: large data!
Page 26
Questions?
Page 27
Mediated Provenance
[Diagram labels: User queries; Prov API; General Provenance Model; wrapper (three instances); Taverna; Pasoa; …]
Mapping from generic provenance model into the models of different systems
Page 28
Mediator Architecture
[Diagram labels: User SQL/ODBC queries; Mediator; Global Schema; wrapper (three instances); Data Source (three instances)]
Mapping from global schema into source schemas