Auditing and Maintaining Provenance in Software Packages
Quan Pham 1, Tanu Malik 2, Ian Foster 1,2
Department of Computer Science 1 and Computation Institute 2, The University of Chicago, Chicago, IL 60637, USA
Presented by Boris Glavic, Illinois Institute of Technology
IPAW14, June 10th, 2014
Outline
1 Introduction
2 Software Pipeline Usecase
3 CDE-SP: Software Provenance in CDE
4 Experiment and Evaluation
5 Related Work
6 Conclusion
Current Solutions for Ensuring Reproducibility and Issues
1 Publish source code and data
  − GitHub, Figshare, Research Compendia
  ✓ Pros: (in many cases) easy to accomplish
  × Cons: need to recompile and re-execute
2 Publish software package including source code, data, and environment dependencies
  − CDE, RunMyCode.org
  ✓ Pros: re-execute without installation
  × Cons: not easy to combine and merge shared packages
3 Publish a virtual machine image (VMI) that includes OS, source code, data, and environment
  − Cloud BioLinux (NEBC), Swift Appliance (RDCEP)
  ✓ Pros: no additional modules or components needed to rerun
  × Problem: too hard to provision and understand
Our philosophy: "... releasing shoddy VMs is easy to do, but it doesn't help you learn how to do a better job of reproducibility along the way. Releasing software pipelines, however crappy, is on the path towards better reproducibility."
C. Titus Brown1
Reproducibility problem: How can we make it easy to combine and merge shared packages, while correctly attributing authorship of software packages?
No need to provision VMIs or to publish only source code and data.
Use CDE2 to capture and create portable software packages
Extend, partially re-use, and combine CDE packages to create new reproducible software pipelines
Attribute authorship of software packages in new software pipelines
CDE has an OVERLAP conflict!
2 Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
Alice, Bob, and Charlie are scientists at the Center for Robust Decision Making on Climate and Energy Policy (RDCEP)
A develops data integration methods to produce higher-resolution datasets depicting inferred land use over time.
B develops computational models to do model-based comparative analysis. B's software environment consists of A's software modules to produce high-resolution datasets.
C uses A's and B's software modules within data-intensive computing methods to run them in parallel.
The Center wants to predict future yields of staple agriculturalcommodities given changes in the climate.
Table 1 : Ratio of different files having the same path in 5 popular AMIs. The denominator is the number of files having the same path in two distributions, and the numerator is the number of files with the same path but a different md5 checksum. Omitted are manual pages in the /usr/share/ directory.
Amz   Amazon Linux AMI
RH    Red Hat Enterprise Linux 6.4
SUSE  SUSE Linux Enterprise Server 11
U12   Ubuntu Server 12.04.3 LTS
U13   Ubuntu Server 13.10
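The ratio defined in the Table 1 caption can be sketched in Python as follows. This is a minimal illustration, not the tooling used for the talk: the function names are hypothetical, the two distributions are assumed to be available as directory trees on local disk, and the exclusion of manual pages under /usr/share/ is not implemented.

```python
import hashlib
import os


def md5_of(path):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def relative_files(root):
    """Map each regular file under root from its root-relative path to its full path."""
    found = {}
    for folder, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            # Skip symlinks so that only real file contents are compared.
            if os.path.isfile(full) and not os.path.islink(full):
                found[os.path.relpath(full, root)] = full
    return found


def conflict_ratio(root_a, root_b):
    """Return (numerator, denominator) as defined in the Table 1 caption.

    denominator: files with the same relative path in both trees
    numerator:   those shared paths whose md5 checksums differ
    """
    files_a = relative_files(root_a)
    files_b = relative_files(root_b)
    shared = sorted(set(files_a) & set(files_b))
    differing = sum(
        1 for rel in shared if md5_of(files_a[rel]) != md5_of(files_b[rel])
    )
    return differing, len(shared)
```

A high numerator relative to the denominator means that merging two packages built on those distributions would silently overwrite many files that share a path but differ in content.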
CDE-SP: Enhanced CDE that includes software provenance
Describe tools and methods to audit, store, and query provenance
Provenance queries
  Determine the environment under which a dependency was built
  Examine the dependencies which must be present
  Answer if packages in a pipeline can satisfy a new package
  Attribute authorship of software packages in a pipeline
Combine and validate authorship from stored provenance
CDE-SP Audit
Objectives
Capture additional details of the origins of a library or a binary
Use these details for compiling and creating software pipelines
Methods
Create a dependency tree
  Process system calls are monitored
  Whenever a process executes a file system call, a dependency of that process is recorded
  Dependency can be a data file or a shared library
Extract information about binaries and required shared libraries
  file, ldd, strings, and objdump UNIX commands
  uname -a and function getpwuid(getuid())
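The extraction step can be illustrated with a short Python sketch that shells out to `file` and `ldd` and reads the system and user identity. It is a sketch under assumptions, not the CDE-SP implementation: the function names and the record layout are hypothetical, `platform.uname()` stands in for `uname -a`, and `strings` and `objdump` are left out for brevity.

```python
import os
import platform
import pwd
import subprocess


def parse_ldd(output):
    """Extract resolved shared-library paths from the text printed by ldd."""
    libraries = []
    for line in output.splitlines():
        parts = line.split()
        if "=>" in parts:
            # Typical line: "libc.so.6 => /lib/.../libc.so.6 (0x...)"
            idx = parts.index("=>")
            if idx + 1 < len(parts) and parts[idx + 1].startswith("/"):
                libraries.append(parts[idx + 1])
        elif parts and parts[0].startswith("/"):
            # Loader line: "/lib64/ld-linux-x86-64.so.2 (0x...)"
            libraries.append(parts[0])
    return libraries


def describe_binary(path):
    """Collect origin details for one binary (hypothetical record layout)."""

    def run(*cmd):
        # check=False: a failing tool yields empty output instead of an exception.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        return result.stdout

    return {
        "path": path,
        "file_type": run("file", "-b", path).strip(),
        "shared_libraries": parse_ldd(run("ldd", path)),
        "system": " ".join(platform.uname()),       # comparable to `uname -a`
        "user": pwd.getpwuid(os.getuid()).pw_name,  # getpwuid(getuid())
    }
```

Calling `describe_binary("/bin/ls")` on a Linux host with `file` and `ldd` installed returns one such record; libraries that ldd reports as "not found" are skipped by `parse_ldd`.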
Storage
Store provenance within the package itself
Use LevelDB: a fast and light-weight key-value storage library
Encode in the key the UNIX process identifier along with spawn time
Key                          Value    Explanation
pid.PID1.exec.TIME           PID2     PID1 wasTriggeredBy PID2
pid.PID.[path, pwd, args]    VALUES   Other properties of PID

Table 2 : LevelDB key-value pairs that store file and process provenance. Words in capital letters are arguments.
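The key layout of Table 2 can be sketched in Python as below. A plain dictionary stands in for LevelDB here, which is an assumption made to keep the example self-contained; the helper names are hypothetical, and reading `pid.PID.[path, pwd, args]` as three separate keys is an interpretation of the table, not something the slides state.

```python
def exec_key(pid, spawn_time):
    """Key recording which process triggered `pid`, tagged with its spawn time."""
    return f"pid.{pid}.exec.{spawn_time}"


def property_key(pid, name):
    """Key for one property of a process: path, pwd, or args."""
    if name not in ("path", "pwd", "args"):
        raise ValueError(f"unknown property: {name}")
    return f"pid.{pid}.{name}"


def record_exec(store, pid, parent_pid, spawn_time, path, pwd, args):
    """Store 'pid wasTriggeredBy parent_pid' plus the process properties."""
    store[exec_key(pid, spawn_time)] = str(parent_pid)
    store[property_key(pid, "path")] = path
    store[property_key(pid, "pwd")] = pwd
    store[property_key(pid, "args")] = " ".join(args)


def triggered_by(store, pid):
    """Return the recorded parents of `pid` by scanning keys with its exec prefix."""
    prefix = f"pid.{pid}.exec."
    return [value for key, value in sorted(store.items()) if key.startswith(prefix)]


# Usage: a dict plays the role of the key-value store in this sketch.
store = {}
record_exec(store, 101, 100, 1402400000, "/usr/bin/python", "/home/alice",
            ["python", "run.py"])
```

Including the spawn time in the key keeps records distinct when the operating system reuses a process identifier, and the shared `pid.PID.` prefix lets a sorted key-value store return everything about one process with a single range scan.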
Query
LevelDB provides a minimal API for querying
Simple, light-weight query interface
Input: a program whose dependencies need to be retrieved
Output: a GraphViz file displaying file and process dependencies
Use a depth-first search algorithm to create a dependency tree with the input program as its root
Exclusion option to remove uninteresting dependencies: /lib/, /usr/lib/, /usr/share/, /etc/
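The query described above can be sketched in Python as a depth-first walk that emits GraphViz DOT text. This is an illustrative sketch only: the function names are hypothetical, and the dependency graph is assumed to be already loaded into a dictionary that maps each program or file path to the paths it depends on.

```python
EXCLUDED_PREFIXES = ("/lib/", "/usr/lib/", "/usr/share/", "/etc/")


def dependency_edges(graph, root, excluded=EXCLUDED_PREFIXES):
    """Depth-first walk from root; return the tree edges in visit order.

    graph maps a path to the list of paths it depends on. Dependencies under
    an excluded prefix are dropped, and each node is attached to the tree once.
    """
    edges = []
    visited = {root}

    def visit(node):
        for child in graph.get(node, []):
            if child.startswith(excluded) or child in visited:
                continue
            visited.add(child)
            edges.append((node, child))
            visit(child)

    visit(root)
    return edges


def to_dot(edges, name="dependencies"):
    """Render the edges as a GraphViz digraph in DOT syntax."""
    lines = [f"digraph {name} {{"]
    for parent, child in edges:
        lines.append(f'    "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Usage with a small made-up graph.
graph = {
    "/bin/app": ["/lib/libc.so.6", "/home/a/data.csv", "/home/a/tool"],
    "/home/a/tool": ["/home/a/data.csv", "/etc/passwd"],
}
edges = dependency_edges(graph, "/bin/app")
dot_text = to_dot(edges)
```

The recursive walk is adequate for small packages; for very deep dependency chains an explicit stack would avoid Python's recursion limit.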
Authorship of Software Modules
Combine authorship of the contributing packages
Validate authorship from the provenance stored in the original package
  Generate the subgraph associated with the part of the new package
  Use subgraph isomorphism (NP-Hard) to validate with the original provenance graph
  Match provenance nodes of processes with the same paths of their binaries and working directories
  Match provenance nodes of files with the same path
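The matching rules above can be illustrated with a simplified Python check. It is not a general subgraph isomorphism solver: it assumes that the matching label of every node (binary path plus working directory for a process, path for a file) is unique within a graph, in which case validation reduces to set containment. The node layout and function names are hypothetical.

```python
def node_label(node):
    """Matching label: (binary path, working dir) for processes, path for files."""
    if node["kind"] == "process":
        return ("process", node["path"], node["pwd"])
    return ("file", node["path"])


def labelled_edges(nodes, edges):
    """Rewrite edges over node ids as edges over matching labels."""
    return {(node_label(nodes[src]), node_label(nodes[dst])) for src, dst in edges}


def is_contained(sub_nodes, sub_edges, orig_nodes, orig_edges):
    """True if every node and edge of the subgraph appears in the original graph.

    Simplification: labels are assumed unique per graph, so no search over
    alternative node mappings is needed.
    """
    orig_labels = {node_label(n) for n in orig_nodes.values()}
    sub_labels = {node_label(n) for n in sub_nodes.values()}
    if not sub_labels <= orig_labels:
        return False
    return labelled_edges(sub_nodes, sub_edges) <= labelled_edges(orig_nodes, orig_edges)


# Usage with made-up provenance graphs; edges point from a node to what it used.
orig_nodes = {
    1: {"kind": "process", "path": "/bin/agg", "pwd": "/work"},
    2: {"kind": "file", "path": "/work/out.nc"},
    3: {"kind": "process", "path": "/bin/img", "pwd": "/work"},
}
orig_edges = [(2, 1), (3, 2)]
sub_nodes = {
    "a": {"kind": "process", "path": "/bin/agg", "pwd": "/work"},
    "b": {"kind": "file", "path": "/work/out.nc"},
}
sub_edges = [("b", "a")]
valid = is_contained(sub_nodes, sub_edges, orig_nodes, orig_edges)
```

When several processes share the same binary path and working directory, labels stop being unique and a real subgraph isomorphism search is required, which is where the NP-hardness noted above comes in.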
Table 3 : The increase in CDE-SP performance overhead is negligible in comparison with CDE
4 Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
Redirection Overhead in CDE-SP
Pipelined output of the Aggregation package to the input of the Generate Image package
3 output files of the Aggregation package were moved to the Generate Image package
2 cross-package execve() system calls
Less than a 1% slowdown of CDE-SP
Kameleon
Use the Kameleon engine to make a bare-bones VM appliance
  Self-written YAML-formatted recipes
  Self-written macrosteps and microsteps
Kameleon can create virtual machine appliances in different formats for different Linux distributions
  Generates bash scripts to create an initial virtual image of a Linux distribution
  Populates the image with more Linux packages
  Populates with content of a CDE-SP package
CDE-SP vs. Kameleon
Figure 1 : Overhead when using CDE with Kameleon VM appliance (bar chart comparing Kameleon and CDE-SP; y-axis in seconds, 0 to 1600)
Related Work
Research Objects: package scientific workflows with auxiliary information about workflows, including provenance information and metadata, such as the authors and the version
CDE and Sumatra can capture an execution environment in a lightweight fashion
SystemTap, being a kernel-based tracing mechanism, has better performance compared to ptrace but needs to run at a higher privilege level
Provenance-to-Use (PTU) and ReproZip include provenance in self-contained software packages
Conclusion
CDE does not encapsulate provenance of associated dependencies in a software package
The lack of information about the origins of dependencies in a software package creates issues when constructing software pipelines from packages
CDE-SP can include software provenance as part of a software package
CDE-SP can use software package provenance to build software pipelines
CDE-SP can maintain provenance when used to construct software pipelines