Recording and Reasoning Over Data Provenance in Web and Grid Services Martin Szomszor and Luc Moreau [email protected] University of Southampton
Recording and Reasoning Over Data Provenance in Web and Grid Services

May 11, 2015

Recording and Reasoning Over Data Provenance in

Web and Grid Services

Martin Szomszor and Luc Moreau

[email protected]

University of Southampton

Contents

A definition of provenance
Example 1: Aerospace engineering
Example 2: Organ transplant management
Example 3: Bioinformatics grid
Provenance architecture
Provenance service
Conclusion

The Grid and Virtual Organisations

The Grid problem is defined as coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01].

Effort is required to allow users to place their trust in the data produced by such virtual organisations.

Understanding how a given service is likely to modify data flowing into it, and how this data has been generated, is crucial.

Provenance and Virtual Organisations

Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result, how can we determine the process that generated the result, especially after the virtual organisation has been disbanded?

The lack of information about the origin of results does not help users to trust such open environments.

Provenance and Workflows

Workflow enactment has become popular in the Web Services and Grid communities.

Workflow enactment can be seen as a scripted form of virtual organisation.

The problem is similar: how can we determine the origin of enactment results?

Provenance: Definition

Provenance is an annotation able to explain how a particular result has been derived.

In a service-oriented architecture, provenance identifies what data is passed between services, what services are available, and what results are generated for particular sets of input values.

Using provenance, a user can trace the “process” that led to the aggregation of services producing a particular output.
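The definition above can be made concrete with a small sketch. It is not the authors' data model: the class names `InvocationRecord` and `ProvenanceTrace`, the method names, and the example service and file names are all assumptions chosen for illustration. The sketch stores, for each invocation, the service name with its inputs and outputs, and lets a user ask which service produced a given value.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class InvocationRecord:
    """One service invocation: which service ran, on what inputs, with what outputs."""
    service: str
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]


@dataclass
class ProvenanceTrace:
    """Ordered invocation records for one run (hypothetical structure)."""
    records: List[InvocationRecord] = field(default_factory=list)

    def record(self, service: str, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        # Copy the dictionaries so later changes by the caller do not alter history.
        self.records.append(InvocationRecord(service, dict(inputs), dict(outputs)))

    def services_producing(self, value: Any) -> List[str]:
        """Names of the services whose recorded outputs include `value`."""
        return [r.service for r in self.records if value in r.outputs.values()]


# Example run with made-up service and file names.
trace = ProvenanceTrace()
trace.record("mesh_generator", {"geometry": "wing.cad"}, {"mesh": "wing.mesh"})
trace.record("flow_solver", {"mesh": "wing.mesh"}, {"pressure": "wing.dat"})
producers = trace.services_producing("wing.dat")
```

Here `producers` names the service whose output was `wing.dat`; following that record's inputs backwards is what allows the whole "process" to be traced.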

Provenance in Aerospace Engineering

Aerospace engineering requires scientific simulations, data pre- and post-processing, and visualisation to be composed into complex workflows.

Provenance in Aerospace Engineering

Provenance is crucial in this context: many customers who use the end results of simulations require a historical record of the outputs from each sub-system.

For instance, provenance data for an aircraft must be kept for up to 99 years when it is sold to some countries.

Currently, however, little direct support is available for this.

Provenance in Organ Transplant Management

Medical information systems, and in particular decision support systems for organ and tissue transplant, rely on a wide range of data sources, patient data, and knowledge added by doctors, surgeons and other individuals using the systems.

Provenance in Organ Transplant Management

Such a domain is heavily regulated:

European, national, regional and site-specific rules govern how decisions are made.

Application of these rules must be ensured and be auditable, and the rules may change over time.

Patient recovery is highly dependent on organ allocation choice, extraction and insertion methods, and care/recovery regime.

Provenance in Organ Transplant Management

Tracking back previous decisions in any one centre to identify whether the best match was made, who was involved in the decision, and what the context was.

Maximise the efficiency of matching and the recovery rate of patients.

Provenance in a Bioinformatics Grid (myGrid)

myGrid aims to build a personalised problem-solving environment, in which the scientist can:

construct in silico experiments,
find and adapt others,
store results in data repositories,
have their own view on public repositories,
be better informed as to the provenance and the currency of the tools and data directly relevant to their experimental space.

Provenance in a Bioinformatics Grid (myGrid)

Two major forms of provenance [Greenwood03]:

The derivation path records the process by which results are generated from input data.

Derivation data provides the answer to questions about what initial data was used for a result, and how the transformation from initial data to result was achieved.

The FDA requires drug companies to keep a record of the provenance of drug discovery for as long as the drug is in use (sometimes up to 50 years).
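A derivation path can be illustrated by walking backwards from a result to the initial data it depends on. The sketch below is only an illustration under stated assumptions: the `derived_from` mapping, the function name `derivation_path`, and the example item and service names are hypothetical and do not come from myGrid.

```python
from typing import Dict, List, Tuple


def derivation_path(
    result: str, derived_from: Dict[str, Tuple[str, List[str]]]
) -> Tuple[List[str], List[Tuple[str, Tuple[str, ...], str]]]:
    """Walk back from `result` to the initial data.

    `derived_from` maps each derived item to (service, [input items]).
    Items that never appear as a key are treated as initial data.
    Returns (sorted initial items, list of (service, inputs, output) steps).
    """
    initial: List[str] = []
    steps: List[Tuple[str, Tuple[str, ...], str]] = []
    pending = [result]
    seen = set()
    while pending:
        item = pending.pop()
        if item in seen:
            continue  # an item shared by several steps is visited once
        seen.add(item)
        if item not in derived_from:
            initial.append(item)
            continue
        service, inputs = derived_from[item]
        steps.append((service, tuple(inputs), item))
        pending.extend(inputs)
    return sorted(initial), steps


# Made-up two-step experiment: align two sequences, then summarise the alignment.
derived_from = {
    "alignment": ("align", ["seq_a", "seq_b"]),
    "report": ("summarise", ["alignment"]),
}
initial, steps = derivation_path("report", derived_from)
```

`initial` answers "what initial data was used for a result", and `steps` answers "how was the transformation achieved", listed from the result backwards.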

Provenance in a Bioinformatics Grid (myGrid)

Two major forms of provenance [Greenwood03]:

Annotations are attached to objects, or collections of objects.

Annotation data provides more contextual information that might be of interest: who performed an experiment and when, any comments they supplied on the specific methods and materials used, when an object was created and last updated, who owns it, and its format.

Useful for providing a personalised environment.

Other Provenance Requirements and Uses

Standard lineage representation, automated lineage recording, unobtrusive information collecting [Frew and Brose]

To give reliability and quality, justification and audit, re-usability, reproducibility and repeatability, change and evolution, ownership, security, credit and copyright [Goble]

What is the problem?

Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments.

Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.

Our Contributions

A service-oriented architecture for provenance support in Grid and Web Services environments, based on the idea of a provenance service;

A client-side API for recording provenance data for Web Service invocation;

A data model for storing provenance data;

A server-side interface for querying provenance data;

Two components making use of provenance: provenance browsing and provenance validation.
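The slides name provenance validation without describing how it works. One plausible reading, offered here purely as an assumption, is to re-execute each recorded invocation against the current services and flag any result that differs from the recorded one. The function name `validate`, the tuple layout of a record, and the toy services are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Tuple[str, Dict[str, Any], Any]  # (service name, inputs, recorded output)


def validate(
    records: List[Record], services: Dict[str, Callable[..., Any]]
) -> List[Tuple[str, Dict[str, Any], Any, Any]]:
    """Re-run every recorded invocation and return those whose output changed.

    Each mismatch is (service name, inputs, recorded output, actual output).
    """
    mismatches = []
    for service_name, inputs, recorded_output in records:
        actual = services[service_name](**inputs)
        if actual != recorded_output:
            mismatches.append((service_name, inputs, recorded_output, actual))
    return mismatches


# Toy services standing in for deterministic Web Services.
services = {"double": lambda x: 2 * x, "increment": lambda x: x + 1}

# The second record is deliberately inconsistent with the current service.
records = [("double", {"x": 3}, 6), ("increment", {"x": 6}, 8)]
mismatches = validate(records, services)
```

This only makes sense for deterministic services; a service whose behaviour legitimately changes over time would need a different notion of validity.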

Overall Architecture

Overall Architecture

Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services.

Provenance data will be submitted to one or more “provenance repositories” acting as storage for provenance data.

Upon user request, analysis, navigation and reasoning over provenance data can be undertaken.

Overall Architecture

Storage could be achieved by a provenance service.

A library, optionally hosted in the provenance service, would perform the analysis, navigation or reasoning.

A client-side library would submit provenance data to the provenance service.
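The division of labour described above (a provenance service for storage and a client-side library for submission) can be sketched in a few lines. This is an in-memory stand-in, not the PASOA implementation: the class names `ProvenanceService` and `ProvenanceClient`, their methods, and the record layout are assumptions, and a real deployment would expose the service over a Web Service interface rather than as a local object.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class ProvenanceService:
    """In-memory stand-in for a provenance repository (hypothetical interface)."""

    def __init__(self) -> None:
        self._store: Dict[str, List[dict]] = defaultdict(list)

    def submit(self, session_id: str, record: dict) -> None:
        self._store[session_id].append(record)

    def query(self, session_id: str) -> List[dict]:
        # Return a copy so callers cannot alter the stored history.
        return list(self._store.get(session_id, []))


class ProvenanceClient:
    """Client-side library: invokes a service and submits a record of the call."""

    def __init__(self, service: ProvenanceService, session_id: str) -> None:
        self._service = service
        self._session_id = session_id

    def invoke(self, name: str, func: Callable[..., Any], **inputs: Any) -> Any:
        output = func(**inputs)
        self._service.submit(
            self._session_id,
            {"service": name, "inputs": inputs, "output": output},
        )
        return output


provenance = ProvenanceService()
client = ProvenanceClient(provenance, session_id="session-1")
total = client.invoke("add", lambda a, b: a + b, a=2, b=3)
history = provenance.query("session-1")
```

The analysis, navigation and reasoning library mentioned on the slide would sit on top of `query`, either inside the service or alongside it.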

System Overview

Sequence Diagram

To identify the interactions between provenance service, client-side library and enactment engine.

Creation of a session.

Need to be able to support the most complex workflows, including conditional branching, iteration, recursion and parallel execution.

Support asynchronous submission of provenance data so that provenance submission does not delay workflow execution.
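Session creation and asynchronous submission can be sketched with a queue drained by a background thread, so that the workflow only pays the cost of an enqueue. This is a minimal illustration using the Python standard library, not the protocol shown in the sequence diagram; `AsyncRecorder`, its methods, and the `sink` callable (standing in for the call to the provenance service) are assumptions.

```python
import queue
import threading
import uuid
from typing import Callable


class AsyncRecorder:
    """Queues provenance records and delivers them from a background thread."""

    def __init__(self, sink: Callable[[str, dict], None]) -> None:
        self._sink = sink  # hypothetical stand-in for the provenance service call
        self._queue: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def create_session(self) -> str:
        """Return a fresh identifier grouping the records of one workflow run."""
        return uuid.uuid4().hex

    def submit(self, session_id: str, record: dict) -> None:
        # Only enqueues; the caller does not wait for delivery.
        self._queue.put((session_id, record))

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel placed by close()
                break
            self._sink(*item)

    def close(self) -> None:
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


received = []
recorder = AsyncRecorder(lambda sid, rec: received.append((sid, rec)))
session = recorder.create_session()
for step in range(3):
    recorder.submit(session, {"step": step})
recorder.close()
```

A single worker keeps records in submission order; parallel workflow branches would still need an ordering or parent identifier in each record, which this sketch leaves out.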

Sequence Diagram

Provenance Data Model

Must support recording of all information necessary to replay execution.

Must support all complex forms of workflows (recursion, iteration, parallel execution).
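One way to meet the recursion requirement is to store invocations as a tree, where each invocation holds the invocations it triggered. The sketch below is an assumption made for illustration, not the data model shown on the slides: `Invocation`, `record_factorial` and `flatten` are hypothetical names, and a recursive factorial stands in for a recursive workflow.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Invocation:
    """A recorded invocation together with the invocations it triggered."""
    service: str
    inputs: Any
    output: Any = None
    children: List["Invocation"] = field(default_factory=list)


def record_factorial(n: int) -> Invocation:
    """Run a recursive computation while recording it as nested invocations."""
    node = Invocation("factorial", n)
    if n <= 1:
        node.output = 1
    else:
        child = record_factorial(n - 1)
        node.children.append(child)
        node.output = n * child.output
    return node


def flatten(node: Invocation, depth: int = 0) -> List[Tuple[int, str, Any, Any]]:
    """Depth-first listing of (depth, service, inputs, output)."""
    rows = [(depth, node.service, node.inputs, node.output)]
    for child in node.children:
        rows.extend(flatten(child, depth + 1))
    return rows


root = record_factorial(3)
rows = flatten(root)
```

Iteration and parallel execution fit the same shape: a loop or a parallel split becomes a parent with several children, although replaying parallel branches faithfully would also need timing or ordering information that this sketch omits.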

Provenance Data Model

Discussion

In order for provenance data to be useful, we expect the protocol for submitting provenance data to support some “classical” properties of distributed algorithms.

Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server, and vice-versa, a provenance server can ensure that it receives data from a given service.

With non-repudiation, we can retain evidence of the fact that a service has committed to executing a particular invocation and has produced a given result.

We anticipate that cryptographic techniques will be useful to ensure such properties.
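As a small illustration of how cryptography could help, the sketch below tags each provenance record with an HMAC computed under a key shared by the invoked service and the provenance server, so that the server can check who sent a record and that it was not altered. The function names, the key and the record contents are assumptions. Note the limitation: because both parties hold the same key, this gives message authentication and integrity but not non-repudiation, which would require asymmetric digital signatures (not available in the Python standard library).

```python
import hashlib
import hmac
import json


def sign(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of `record`."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, tag: str, key: bytes) -> bool:
    """Check the tag in constant time to avoid leaking timing information."""
    return hmac.compare_digest(sign(record, key), tag)


# Illustrative key only; a real key would be generated randomly and kept secret.
key = b"shared-secret-for-illustration"
record = {"service": "flow_solver", "output": "wing.dat"}
tag = sign(record, key)

# A record modified after signing no longer matches its tag.
tampered = dict(record, output="other.dat")
```

Sorting the JSON keys makes the encoding deterministic, so two parties holding the same record always compute the same tag.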

The purpose of the PASOA project is to investigate provenance in Grid architectures

Funded by EPSRC under the “fundamental computer science for e-Science call”

In collaboration with Cardiff

www.pasoa.org

Conclusion

Provenance is a rather unexplored domain.

Strategic for bringing trust to open environments.

Our provenance service is the first attempt to incorporate provenance in the infrastructure of Web and Grid services.

Need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.

Acknowledgements

Syd Chapman, IBM
Omer Rana, Cardiff
Andreas Schreiber and Rolf Hempel, DLR
Laszlo Varga, SZTAKI
Ulises Cortes and Steven Willmott, UPC
Mark Greenwood, Carole Goble, Manchester