Layering in Provenance Systems
Margo Seltzer
May 13, 2009
Provenance in Secure and Advanced Computer Systems
PSACS: May 2009 2
The Vision: Provenance Everywhere
• All data has provenance.
• Applications generate provenance.
• Systems generate provenance.
• Users generate provenance.
• Provenance is:
  – Secure.
  – Queryable.
  – Globally searchable.
• There are provenance-aware algorithms.
The Problem: Provenance Comes from Different Places
• Depending on the source, provenance is attached to different kinds of objects:
  – Operating system: files
  – Database systems: tuples
  – Workflow engines: objects
  – Applications:
    • Variables (from an interpreter)
    • Links (from a browser)
Data are related
• Tuples live in files.
• Files comprise data sets.
• Browsers write files.
• Variables relate to each other.
• Objects may be files, tuples, or data sets.
Must integrate provenance from different representations.
Why Integrate Provenance?
Outline
• Provenance disclosure and integration
• Layering and provenance
• Parting remarks
Provenance Observation versus Disclosure
• Disclosed provenance:
  – Provenance that is explicitly provided.
  – Provider understands semantics of the data referenced by provenance.
  – Example: This image is the result of aligning these other two images.
• Observed provenance:
  – Provenance deduced by interpreting events.
  – Observer translates event into a provenance relationship.
  – Example: Process P wrote file F, therefore file F depends on process P.
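The two kinds of provenance can be sketched in a few lines of Python. This is an illustrative model only: the `Event` and `Dependency` types, the `observe` and `disclose` functions, and the image file names are hypothetical and are not part of PASS or the DPAPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A low-level event seen by an observer (hypothetical model)."""
    actor: str    # e.g. a process name
    action: str   # "read" or "write"
    target: str   # e.g. a file name


@dataclass(frozen=True)
class Dependency:
    """A provenance relationship: `descendant` depends on `ancestor`."""
    descendant: str
    ancestor: str
    source: str   # "observed" or "disclosed"


def observe(event: Event) -> Dependency:
    """Translate an observed event into a provenance relationship."""
    if event.action == "write":
        # Process P wrote file F, so F depends on P.
        return Dependency(event.target, event.actor, "observed")
    if event.action == "read":
        # Process P read file F, so P depends on F.
        return Dependency(event.actor, event.target, "observed")
    raise ValueError(f"unknown action: {event.action}")


def disclose(result: str, inputs: list[str]) -> list[Dependency]:
    """Record relationships stated explicitly by a provider that
    understands the semantics of the data."""
    return [Dependency(result, name, "disclosed") for name in inputs]


# Observed: deduced from an event.
observed = observe(Event("P", "write", "F"))

# Disclosed: the provider states that one image results from two others.
disclosed = disclose("aligned.img", ["a.img", "b.img"])
```

The observer only sees that something was written; the discloser can say what the result means in terms of its inputs.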
Your observed provenance is my disclosed provenance.
• The distinction between observed and disclosed provenance is one of vantage point.
• A file system observes that the workflow engine produced the file atlas.x.gif.
• The workflow engine can disclose that atlas.x.gif is the result of a 5-step process that began with reading warp.air.
Problem Overview
• Systems capture provenance at different levels of abstraction:
  – File systems: files and processes
  – Database systems: tuples and queries
  – Workflow engines: objects and operators
  – Interpreters: variables and operations
  – Browsers: URLs and traversals
• Users want to query across these abstractions.
Use Case: PA-Browser
• Browsers capture a user’s search and traversal patterns.
• Action: User inadvertently downloads a virus.
• Without layering:
  – Browser knows this came from virus.com.
  – File system knows what files were affected.
• With layering:
  – How did user get to the virus?
  – What else was downloaded from that site?
  – Are there other files that might be similarly tainted?
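The three questions under "With layering" are graph queries that cross the browser and file system layers. A minimal sketch, assuming provenance from both layers is available as one dependency graph; every object name below other than virus.com is hypothetical, and the graph representation is an assumption rather than how PASS stores provenance.

```python
from collections import defaultdict, deque

# ancestors[x] holds the objects that x depends on.
ancestors = defaultdict(set)


def depends_on(descendant, ancestor):
    ancestors[descendant].add(ancestor)


# Browser layer: URLs and traversals.
depends_on("url:virus.com", "url:search.example/?q=free+games")
depends_on("url:virus.com/free.exe", "url:virus.com")
depends_on("url:virus.com/toolbar.exe", "url:virus.com")

# Layer boundary: the browser discloses which URL each file came from.
depends_on("file:/home/u/free.exe", "url:virus.com/free.exe")
depends_on("file:/home/u/toolbar.exe", "url:virus.com/toolbar.exe")

# File system layer: files and processes.
depends_on("proc:free.exe", "file:/home/u/free.exe")
depends_on("file:/home/u/notes.txt", "proc:free.exe")
depends_on("file:/home/u/todo.txt", "proc:editor")


def reverse(edges):
    """Build the opposite map: object -> objects that depend on it."""
    out = defaultdict(set)
    for node, targets in edges.items():
        for target in targets:
            out[target].add(node)
    return out


def reachable(start, edges):
    """Breadth-first walk over `edges`, returning every node reached."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# How did the user get to the virus? Walk up from the downloaded file.
how_reached = reachable("file:/home/u/free.exe", ancestors)

# What else came from that site, and which files might be tainted?
tainted = reachable("url:virus.com", reverse(ancestors))
tainted_files = sorted(n for n in tainted if n.startswith("file:"))
```

Neither layer can answer these alone: the walk has to pass through the edge that links a file to the URL it was downloaded from.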
Use Case: PA-Python Applications
• Python wrappers generate a trace of processing steps internal to Python.
• Usage: Program reads 100 input files, uses two of them to produce a graph.
• Without layering:
  – Python knows which files were actually used to produce the graph.
  – File system knows that Python read 100 files and produced an output file.
• With layering:
  – Can identify that two input files lead directly to output file.
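One possible way to picture the combination, as a sketch only: the file system's coarse observed view is narrowed by the finer view that the Python layer discloses. The file names, the two dictionaries, and the `direct_inputs` function are hypothetical.

```python
# File system layer (observed): the interpreter process read every input,
# so the output appears to depend on all 100 files.
inputs = [f"in{i:03d}.dat" for i in range(100)]
fs_view = {"graph.png": set(inputs)}

# Python layer (disclosed): only two inputs were used to produce the graph.
py_view = {"graph.png": {"in007.dat", "in042.dat"}}


def direct_inputs(output, coarse, fine):
    """Return the inputs that led directly to `output`.

    Uses the finer-grained disclosed provenance when the upper layer
    provides it; otherwise falls back to the coarse observed view.
    """
    candidates = coarse.get(output, set())
    disclosed = fine.get(output)
    if disclosed is None:
        return candidates
    # Keep only disclosed inputs that the lower layer also saw being read.
    return candidates & disclosed


with_layering = direct_inputs("graph.png", fs_view, py_view)
without_layering = direct_inputs("graph.png", fs_view, {})
```

Here `with_layering` contains the two files that mattered, while `without_layering` still contains all 100.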
Integrating Requires Layering
• Layering implies that provenance collection and tracking systems interact directly with one another.
• Why not a centralized provenance repository?
  – Requires a mechanism to translate names.
  – Every participant must agree on naming convention.
  – Must be able to generate references to objects created by other participants.
  – What happens when you add a new participant with a new naming mechanism?
• Layering provides a natural way to transmit and integrate provenance.
Outline
• Provenance disclosure and integration
• Layering and provenance
• Parting remarks
Provenance-Aware Agents
• An agent that is provenance-aware:
  – Accepts disclosed provenance from others.
  – Observes events and generates provenance from them.
  – Discloses provenance to others.
• Implications:
  – Both input and output are disclosed provenance.
  – Participation in an integrated provenance-aware system requires an API for disclosed provenance.
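The three responsibilities of an agent can be sketched as a small class. This is a hypothetical model, not the PASS implementation: the `Agent` class, its method names, and the example events are assumptions chosen to mirror the bullets above.

```python
class Agent:
    """A hypothetical provenance-aware layer in a stack."""

    def __init__(self, name, lower=None):
        self.name = name
        self.lower = lower   # next layer down, or None at the bottom
        self.stored = []     # records kept by the bottom layer

    def accept(self, records):
        """Accept disclosed provenance from the layer above."""
        self.disclose(list(records))

    def observe(self, event):
        """Turn an observed event into a record and pass it on."""
        self.disclose([(self.name, event)])

    def disclose(self, records):
        """Disclose provenance to the layer below, or store it here
        when there is no lower layer."""
        if self.lower is None:
            self.stored.extend(records)
        else:
            self.lower.accept(records)


# A three-layer stack: application on interpreter on storage.
storage = Agent("storage")
interpreter = Agent("interpreter", lower=storage)
application = Agent("application", lower=interpreter)

application.observe("computed graph from two inputs")
interpreter.observe("opened in007.dat")
storage.observe("wrote graph.png")
```

Every layer both receives and emits disclosed provenance, which is why a single API can serve all the boundaries in the stack.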
DPAPI: The Disclosed Provenance API
• Grew out of our experience designing and building PASS (Provenance-Aware Storage Systems).
• Used as the universal internal API between components in the PASS architecture.
• Used to extend PASS to NFS.
• Used by provenance-aware applications.
• Has evolved through three generations.
DPAPI Concepts
• Pnode
  – Unique ID assigned at object creation.
  – Never recycled.
  – Used to access an object’s provenance.
• Provenance record
  – An attribute/value pair.
  – Plain value or cross-reference.
• Version
  – Objects change; changes are reflected in versions.
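These three concepts map naturally onto a few small data types. The sketch below is a Python model for illustration; the class names, the attribute names `TYPE` and `INPUT`, and the choice to make a cross-reference a (pnode, version) pair are assumptions, not the DPAPI's actual data layout.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Union

# A monotonically increasing counter, so pnode IDs are never recycled.
_next_pnode = count(1)


@dataclass(frozen=True)
class CrossRef:
    """A reference to a specific version of another object (assumed form)."""
    pnode: int
    version: int


@dataclass(frozen=True)
class Record:
    """A provenance record: an attribute/value pair."""
    attribute: str
    value: Union[str, CrossRef]   # plain value or cross-reference


@dataclass
class ProvObject:
    """An object with a pnode, a current version, and per-version records."""
    pnode: int = field(default_factory=lambda: next(_next_pnode))
    version: int = 0
    records: dict = field(default_factory=dict)   # version -> list of Record

    def add(self, record):
        """Attach a record to the current version."""
        self.records.setdefault(self.version, []).append(record)


warp = ProvObject()
atlas = ProvObject()
atlas.add(Record("TYPE", "FILE"))                                # plain value
atlas.add(Record("INPUT", CrossRef(warp.pnode, warp.version)))   # cross-reference
```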
DPAPI Functions
• Pass_read: Reads data with a reference to its provenance.
• Pass_write: Writes data with provenance.
• Pass_freeze: Subsequent modifications to object create a new version.
• Pass_mkobj: Create an object to represent something at a different abstraction layer.
• Pass_reviveobj: Given a pnode number, obtain a reference to the appropriate object.
• Pass_sync: Flush an object’s provenance to disk.
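To make the roles of these six calls concrete, here is a toy in-memory stand-in written in Python. It is a sketch under stated assumptions, not the DPAPI: the `ToyStore` class, its method signatures, and the attribute names are hypothetical, and `freeze` is simplified to bump the version counter immediately rather than on the next modification.

```python
class ToyStore:
    """In-memory stand-in for a DPAPI-style layer (hypothetical)."""

    def __init__(self):
        self._next_pnode = 1
        self.objects = {}   # pnode -> {"version", "data", "prov"}
        self.durable = {}   # pnode -> provenance flushed by sync()

    def mkobj(self):
        """Create an object that represents something from another layer."""
        pnode = self._next_pnode
        self._next_pnode += 1
        self.objects[pnode] = {"version": 0, "data": b"", "prov": []}
        return pnode

    def write(self, pnode, data, records):
        """Write data together with the provenance that describes it."""
        obj = self.objects[pnode]
        obj["data"] += data
        obj["prov"].extend((obj["version"], a, v) for a, v in records)
        return len(data)

    def read(self, pnode):
        """Read data plus a (pnode, version) reference to its provenance."""
        obj = self.objects[pnode]
        return obj["data"], (pnode, obj["version"])

    def freeze(self, pnode):
        """Make later modifications belong to a new version."""
        self.objects[pnode]["version"] += 1

    def reviveobj(self, pnode):
        """Given a pnode number, get back the object."""
        return self.objects[pnode]

    def sync(self, pnode):
        """Flush the object's provenance to (simulated) durable storage."""
        self.durable[pnode] = list(self.objects[pnode]["prov"])


store = ToyStore()
src = store.mkobj()
store.write(src, b"raw", [("TYPE", "input")])

data, ref = store.read(src)              # ref identifies what was read
out = store.mkobj()
store.write(out, data.upper(), [("INPUT", ref)])
store.freeze(out)                        # next write lands in version 1
store.write(out, b"!", [("NOTE", "appended after freeze")])
store.sync(out)
```

The reference returned by `read` is what lets the later `write` record a cross-reference from the output back to the exact version of the input.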
Example Stack: NFS
[Figure: stack diagram. Components: Application, PA-Application (libpass), PASS, NFS. Interfaces between them: Syscall API, DPAPI.]
Example 5-stack
[Figure: stack diagram. Components: PA-Python Application, PA Python Library, PA-Python Interpreter, PASS, NFS. Interfaces between them: lib API, Syscall API, DPAPI.]
Benefits to Layering
• Ability to query across layers.
• Access objects by the name that is meaningful to the user.
• Automatic association between names at different layers.
• Associate related objects named differently.
• Extensible data model.
Outline
• Provenance disclosure and integration
• Layering and provenance
• Parting remarks
Lessons Learned (1)
• Guidelines for making applications or systems provenance-aware:
  – Identify what provenance you want to collect.
    • Create objects as necessary using dpapi_mkobj.
    • Accumulate provenance records for those objects.
  – Replace read calls with dpapi_read calls.
  – Replace write calls with dpapi_write calls.
  – Use cross-references to relate objects.
  – If necessary, export DPAPI to higher layers.
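The guidelines above can be illustrated with a small application that merges inputs into one output. This is a Python sketch, not DPAPI code: `Tracker`, `pa_read`, `pa_write`, and `merge` are hypothetical stand-ins for the provenance-aware calls, and the `INPUT` attribute name is an assumption.

```python
import io


class Tracker:
    """Hypothetical helper that plays the role of provenance-aware I/O."""

    def __init__(self):
        self.records = []   # (output name, attribute, value)

    def pa_read(self, name, stream):
        """Stand-in for a provenance-aware read: data plus a reference."""
        return stream.read(), ("ref", name)

    def pa_write(self, name, stream, data, records):
        """Stand-in for a provenance-aware write: data plus its records."""
        stream.write(data)
        self.records.extend((name, attr, value) for attr, value in records)
        return len(data)


def merge(tracker, inputs, out_name, out_stream):
    """Concatenate inputs, relating the output to each input it used."""
    parts, refs = [], []
    for name, stream in inputs:
        data, ref = tracker.pa_read(name, stream)   # was: stream.read()
        parts.append(data)
        refs.append(ref)
    # Cross-references tie the output to the objects it was derived from.
    records = [("INPUT", ref) for ref in refs]
    tracker.pa_write(out_name, out_stream, "".join(parts), records)   # was: write()


tracker = Tracker()
out = io.StringIO()
merge(
    tracker,
    [("a.txt", io.StringIO("foo")), ("b.txt", io.StringIO("bar"))],
    "out.txt",
    out,
)
```

The application logic is unchanged; only the read and write calls were swapped, and the references returned by reads were carried into the write.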
Lessons Learned (2)
• Application architecture dictates how difficult this is.
  – Firefox’s modular architecture makes it difficult to have provenance and data flow together through the browser.
• APIs are never done.
  – DPAPI continues to evolve.
  – Added two new calls early in 2009.
Lessons Learned (3)
• Differentiating applications from substrates:
  – We initially thought that our Python wrappers made Python provenance-aware.
  – Instead they enabled provenance-aware Python applications.
  – Making Python provenance-aware requires changes to the interpreter, similar to those needed to make an operating system provenance-aware.
Making Provenance Ubiquitous
• One size does not fit all.
• Provenance is useful at all levels of the system:
  – Capture semantics of applications.
  – Capture execution mode of interpreter.
  – Capture system dependencies.
• Data and provenance live in a world with many names.
Layering Enables Interoperability
• Data objects are the point of interoperability.
  – Users exchange or share data, not provenance.
  – Users query provenance.
• The names people associate with their data must be available in provenance queries.
• A layered approach associates names with one another.
• Layering enables consistency between provenance and data.
New Layers
• We have explored layering in:
  – Operating system
  – Network-attached storage
  – Interpreters
  – Language libraries
  – Browsers
  – Workflow engines (Kepler)
• We welcome new layers to our stack:
  – Database?
Thank You!
DPAPI (detail)
int dpapi_freeze(int fd);
int dpapi_mkobj(int reference_fd);
int dpapi_revive_obj(int reference_fd, __pnode_t pnode, version_t version);
ssize_t paread(int fd, void *data, size_t datalen,
               __pnode_t *pnode_ret, version_t *version_ret);
ssize_t pawrite(int fd, const void *data, size_t datalen,
                const struct dpapi_addition *records, unsigned numrecords);
int dpapi_sync(int fd);