Layering in Provenance Systems Margo Seltzer May 13, 2009 Provenance in Secure and Advanced Computer Systems

Layering in Provenance Systems

Margo Seltzer

May 13, 2009

Provenance in Secure and Advanced Computer Systems

PSACS: May 2009 2

The Vision: Provenance Everywhere

• All data has provenance.
• Applications generate provenance.
• Systems generate provenance.
• Users generate provenance.
• Provenance is:
  – Secure.
  – Queryable.
  – Globally searchable.
• There are provenance-aware algorithms.

The Problem: Provenance Comes from Different Places

• Depending on the source, provenance is attached to different kinds of objects:
  – Operating system: files
  – Database systems: tuples
  – Workflow engines: objects
  – Applications:
    • Variables (from an interpreter)
    • Links (from a browser)

Data are related

• Tuples live in files.
• Files comprise data sets.
• Browsers write files.
• Variables relate to each other.
• Objects may be files, tuples, or data sets.

Must integrate provenance from different representations.

Why Integrate Provenance?

Outline

• Provenance disclosure and integration
• Layering and provenance
• Parting remarks

Provenance Observation versus Disclosure

• Disclosed provenance:
  – Provenance that is explicitly provided.
  – Provider understands semantics of the data referenced by provenance.
  – Example: This image is the result of aligning these other two images.
• Observed provenance:
  – Provenance deduced by interpreting events.
  – Observer translates event into a provenance relationship.
  – Example: Process P wrote file F, therefore file F depends on process P.
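The observed-provenance example above can be sketched in C as a translation from a raw event to a dependency. This is only an illustration: the types `event` and `dependency` and the function `observe` are hypothetical names, not part of PASS or the DPAPI, and the rule for reads (the process depends on the file it read) is an assumption added for symmetry with the write rule on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds an observer (e.g. a file system) might see. */
enum event_kind { EV_READ, EV_WRITE };

/* Hypothetical raw event: a process touched a file. */
struct event {
    enum event_kind kind;
    const char *process;
    const char *file;
};

/* Hypothetical provenance relationship: `dependent` depends on `source`. */
struct dependency {
    const char *dependent;
    const char *source;
};

/* Translate an observed event into a provenance relationship.
 * Write: the file depends on the process that wrote it (from the slide).
 * Read:  the process depends on the file it read (assumed here). */
struct dependency observe(struct event ev)
{
    struct dependency dep;

    assert(ev.process != NULL && ev.file != NULL);
    if (ev.kind == EV_WRITE) {
        dep.dependent = ev.file;
        dep.source = ev.process;
    } else {
        dep.dependent = ev.process;
        dep.source = ev.file;
    }
    return dep;
}
```

The observer never learns why P wrote F; it only records that the event happened. That missing "why" is what disclosed provenance supplies.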

Your observed provenance is my disclosed provenance.

• The distinction between observed and disclosed provenance is one of vantage point.

• A file system observes that the workflow engine produced the file atlas.x.gif.

• The workflow engine can disclose that atlas.x.gif is the result of a 5-step process that began with reading warp.air.

Problem Overview

• Systems capture provenance at different levels of abstraction:
  – File systems: files and processes
  – Database systems: tuples and queries
  – Workflow engines: objects and operators
  – Interpreters: variables and operations
  – Browsers: URLs and traversals

• Users want to query across these abstractions.

Use Case: PA-Browser

• Browsers capture a user’s search and traversal patterns.

• Action: User inadvertently downloads a virus.
• Without layering:
  – Browser knows this came from virus.com.
  – File system knows what files were affected.
• With layering:
  – How did user get to the virus?
  – What else was downloaded from that site?
  – Are there other files that might be similarly tainted?

Use Case: PA-Python Applications

• Python wrappers generate a trace of processing steps internal to Python.

• Usage: Program reads 100 input files, uses two of them to produce a graph.

• Without layering:
  – Python knows which files were actually used to produce the graph.
  – File system knows that Python read 100 files and produced an output file.
• With layering:
  – Can identify that two input files lead directly to output file.
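A minimal C sketch of what the layered query buys in this use case: combine the file system's coarse view (every file the process read) with the application's disclosed view (the files actually used) to recover the direct inputs of the output. Files are represented by integer IDs and the function `direct_inputs` is hypothetical; a real system would answer this through a provenance query, not an array scan.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if `id` appears among the first `n` entries of `ids`. */
static int contains(const int *ids, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (ids[i] == id)
            return 1;
    return 0;
}

/* Keep only the disclosed inputs that the lower layer also observed.
 * `observed`  : files the file system saw the process read (e.g. 100 files).
 * `disclosed` : files the application says it used (e.g. 2 files).
 * Writes at most `out_cap` IDs to `out` and returns how many were written. */
size_t direct_inputs(const int *observed, size_t n_observed,
                     const int *disclosed, size_t n_disclosed,
                     int *out, size_t out_cap)
{
    size_t n = 0;

    assert(observed != NULL && disclosed != NULL && out != NULL);
    for (size_t i = 0; i < n_disclosed && n < out_cap; i++)
        if (contains(observed, n_observed, disclosed[i]))
            out[n++] = disclosed[i];
    return n;
}
```

Neither layer can produce this answer alone: the file system has names for all 100 files but no notion of "used", and the application knows what it used but not how those names map to stored files.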

Integrating Requires Layering

• Layering implies that provenance collection and tracking systems interact directly with one another.

• Why not a centralized provenance repository?
  – Requires a mechanism to translate names.
  – Every participant must agree on naming convention.
  – Must be able to generate references to objects created by other participants.
  – What happens when you add a new participant with a new naming mechanism?

• Layering provides a natural way to transmit and integrate provenance.

Outline

• Provenance disclosure and integration
• Layering and provenance
• Parting remarks

Provenance-Aware Agents

• An agent that is provenance-aware:
  – Accepts disclosed provenance from others.
  – Observes events and generates provenance from them.
  – Discloses provenance to others.
• Implications:
  – Both input and output are disclosed provenance.
  – Participation in an integrated provenance-aware system requires an API for disclosed provenance.

DPAPI: The Disclosed Provenance API

• Grew out of our experience designing and building PASS (Provenance-Aware Storage Systems).

• Used as the universal internal API between components in the PASS architecture.

• Used to extend PASS to NFS.
• Used by provenance-aware applications.
• Has evolved through three generations.

DPAPI Concepts

• Pnode
  – Unique ID assigned at object creation.
  – Never recycled.
  – Used to access an object’s provenance.
• Provenance record
  – An attribute/value pair.
  – Plain value or cross-reference.
• Version
  – Objects change; changes are reflected in versions.
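One way to picture these three concepts is as C data structures. The sketch below is an assumption-laden illustration, not the PASS definitions: the type names `pnode_id`, `version_no`, `xref`, and `prov_record`, the helper functions, and the attribute strings used with them are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pnode_id;    /* hypothetical: unique object ID */
typedef uint32_t version_no;  /* hypothetical: version of an object */

/* A cross-reference names a specific version of another object. */
struct xref {
    pnode_id pnode;
    version_no version;
};

enum value_kind { VAL_PLAIN, VAL_XREF };

/* A provenance record: an attribute plus a plain value or a cross-reference. */
struct prov_record {
    const char *attribute;
    enum value_kind kind;
    union {
        const char *plain;
        struct xref ref;
    } value;
};

/* IDs come from a counter that only increases, so none is ever reused
 * (within one run of this sketch; a real system must persist the counter). */
static pnode_id next_pnode = 1;

pnode_id pnode_alloc(void)
{
    return next_pnode++;
}

struct prov_record record_plain(const char *attribute, const char *text)
{
    struct prov_record r;

    assert(attribute != NULL && text != NULL);
    r.attribute = attribute;
    r.kind = VAL_PLAIN;
    r.value.plain = text;
    return r;
}

struct prov_record record_xref(const char *attribute, pnode_id pnode,
                               version_no version)
{
    struct prov_record r;

    assert(attribute != NULL);
    r.attribute = attribute;
    r.kind = VAL_XREF;
    r.value.ref.pnode = pnode;
    r.value.ref.version = version;
    return r;
}
```

Because a cross-reference carries a version as well as a pnode, a record can point at the exact state of an object that was used, even after that object changes.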

DPAPI Functions

• Pass_read: Reads data with a reference to its provenance.
• Pass_write: Writes data with provenance.
• Pass_freeze: Subsequent modifications to object create a new version.
• Pass_mkobj: Create an object to represent something at a different abstraction layer.
• Pass_reviveobj: Given a pnode number, obtain a reference to the appropriate object.
• Pass_sync: Flush an object’s provenance to disk.
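The freeze semantics described above can be modeled in a few lines of C. This is a toy model of the stated behavior only (a freeze makes the next modification start a new version); `versioned_object` and the `obj_*` functions are hypothetical and do not reflect how PASS implements versioning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an object whose changes are reflected in versions. */
struct versioned_object {
    unsigned version;  /* current version number, starting at 0 */
    bool frozen;       /* true if the next write must start a new version */
};

void obj_init(struct versioned_object *o)
{
    assert(o != NULL);
    o->version = 0;
    o->frozen = false;
}

/* Freeze: the current version will receive no further modifications. */
void obj_freeze(struct versioned_object *o)
{
    assert(o != NULL);
    o->frozen = true;
}

/* Write: if the object was frozen, begin a new version first.
 * Returns the version that this write modified. */
unsigned obj_write(struct versioned_object *o)
{
    assert(o != NULL);
    if (o->frozen) {
        o->version++;
        o->frozen = false;
    }
    return o->version;
}
```

In this model, writes without an intervening freeze accumulate in the same version, so the number of versions is driven by freezes, not by the number of writes.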

Example Stack: NFS

[Diagram: layered stack. Components: Application, PA-Application with libpass, PASS, NFS. Interface labels: Syscall API, DPAPI (shown twice).]

Example 5-stack

[Diagram: layered stack. Components: PA-Python Application, PA Python Library, PA-Python Interpreter, PASS, NFS. Interface labels: lib API, Syscall API, DPAPI (shown five times).]

Benefits to Layering

• Ability to query across layers.
• Access objects by the name that is meaningful to the user.
• Automatic association between names at different layers.
• Associate related objects named differently.
• Extensible data model.

Outline

• Provenance disclosure and integration
• Layering and provenance
• Parting remarks

Lessons Learned (1)

• Guidelines for making applications or systems provenance-aware:
  – Identify what provenance you want to collect.
    • Create objects as necessary using dpapi_mkobj.
    • Accumulate provenance records for those objects.
  – Replace read calls with dpapi_read calls.
  – Replace write calls with dpapi_write calls.
  – Use cross-references to relate objects.
  – If necessary, export DPAPI to higher layers.
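To make the "replace write calls" guideline concrete, here is a stand-in for a provenance-aware write, assuming a POSIX system. It is not the DPAPI: `sketch_pawrite`, `sketch_record`, and the in-memory log are hypothetical, and the real call hands its records to the layer below instead of to a local array. The point is only the calling pattern, in which data and its provenance records travel in one call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical attribute/value record; NOT the real struct dpapi_addition. */
struct sketch_record {
    const char *attribute;
    const char *value;
};

/* In-memory stand-in for wherever a lower layer would store provenance. */
#define SKETCH_LOG_CAP 64
static struct sketch_record sketch_log[SKETCH_LOG_CAP];
static size_t sketch_log_count = 0;

/* Stand-in for a provenance-aware write.
 * Appends the records to the log, then writes the data with write(2).
 * Returns -1 without writing if the log cannot hold all the records;
 * otherwise returns whatever write(2) returns. Records stay in the log
 * even if write(2) fails, which a real system would have to handle. */
ssize_t sketch_pawrite(int fd, const void *data, size_t datalen,
                       const struct sketch_record *records,
                       unsigned numrecords)
{
    assert(data != NULL || datalen == 0);
    assert(records != NULL || numrecords == 0);

    if (numrecords > SKETCH_LOG_CAP - sketch_log_count)
        return -1;
    for (unsigned i = 0; i < numrecords; i++)
        sketch_log[sketch_log_count++] = records[i];
    return write(fd, data, datalen);
}
```

An application would change `write(fd, buf, len)` into a call like `sketch_pawrite(fd, buf, len, recs, nrecs)`, where `recs` holds the records it accumulated for that data (for example, a hypothetical "INPUT" record naming the file the data was derived from).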

Lessons Learned (2)

• Application architecture dictates how difficult this is.
  – Firefox’s modular architecture makes it difficult to have provenance and data flow together through the browser.
• APIs are never done.
  – DPAPI continues to evolve.
  – Added two new calls early in 2009.

Lessons Learned (3)

• Differentiating applications from substrates:
  – We initially thought that our Python wrappers made Python provenance-aware.
  – Instead they enabled provenance-aware Python applications.
  – Making Python provenance-aware requires changes to the interpreter, similar to those needed to make an operating system provenance-aware.

Making Provenance Ubiquitous

• One size does not fit all.
• Provenance is useful at all levels of the system:
  – Capture semantics of applications.
  – Capture execution mode of interpreter.
  – Capture system dependencies.

• Data and provenance live in a world with many names.

Layering Enables Interoperability

• Data objects are the point of interoperability.
  – Users exchange or share data, not provenance.
  – Users query provenance.

• The names people associate with their data must be available in provenance queries.

• A layered approach associates names with one another.

• Layering enables consistency between provenance and data.

New Layers

• We have explored layering in:
  – Operating system
  – Network-attached storage
  – Interpreters
  – Language libraries
  – Browsers
  – Workflow engines (Kepler)
• We welcome new layers to our stack:
  – Database?

Thank You!

Margo Seltzer

May 13, 2009

Provenance in Secure and Advanced Computer Systems

DPAPI (detail)

int dpapi_freeze(int fd);
int dpapi_mkobj(int reference_fd);
int dpapi_revive_obj(int reference_fd, __pnode_t pnode, version_t version);
ssize_t paread(int fd, void *data, size_t datalen, __pnode_t *pnode_ret, version_t *version_ret);
ssize_t pawrite(int fd, const void *data, size_t datalen, const struct dpapi_addition *records, unsigned numrecords);
int dpapi_sync(int fd);

Why Integrate Provenance?