Karma Provenance Collection Framework for Data-driven Workflows. Yogesh Simmhan, Microsoft Research; Beth Plale, Dennis Gannon, Ai Zhang, Girish Subramanian, Abhijit Borude, et al., Indiana University

Dec 25, 2015


Philip Harvey
Page 1:

Karma Provenance Collection Framework for Data-driven Workflows

Yogesh Simmhan
Microsoft Research

Beth Plale, Dennis Gannon, Ai Zhang, Girish Subramanian, Abhijit Borude, et al.
Indiana University

Page 2:

Putting the ‘e’ in e-Science

Many scientific domains are moving to in silico experiments… Earth Sciences, Life Sciences, Astronomy

Common requirements
◦ Complex & Dynamic Systems
◦ Adaptive Resources
◦ Data Deluge
◦ Need for Collaboration

Cyberinfrastructure to support these needs
◦ Massively Parallel Systems
◦ High Bandwidth Computer Networks
◦ Petascale Data Archives

Grid Middleware provides the glue to tie these using a Service Oriented Architecture

Page 3:

Workflows as Experiments

Data-driven applications designed as workflows

Data flows across applications as it is transformed, fused and used, generating derived data

Control flow determines the path to execute, but data flow determines data movement and dependency

Manually keeping track of input & derived data to experiments is challenging given the volume of data and complexity of the applications

Page 4:

Data Management Challenges

Complex, dynamic data-processing pipelines
Remote execution on Grid resources
How was a particular dataset created?

Collaboratory environments with shared resources
Large search space & missing metadata
How good is a given dataset for one’s application?

Page 5:

Data Provenance

Metadata that describes the causality of an event
◦ Along with context to interpret it

What, when, where, who, how, …

We consider provenance for
◦ Workflow execution
◦ Service invocations
◦ Data products

Workflow & Service Provenance
◦ Describes execution of a workflow & invocation of service

Data Provenance
◦ Describes usage and generation of data products

Provenance /ˈprɒv ə nəns, -ˌnɑns/: The history or pedigree of a work of art, manuscript, etc. A record of the ultimate derivation and passage of an item through its various owners. Source: The Oxford English Dictionary

Page 6:

Benefits

What if the experiment fails?
◦ Did the workflow run correctly? Completely?
◦ Was the correct data/service/parameter used?
◦ Verification, Validation

Can my peer run the experiment & get the same result?
◦ Repeatability

Can I use the results in my publication?
◦ Attribution, Copyright

Can I trust the results of prediction?
◦ Data Quality

How much did it cost? How much will it cost?
◦ Resource Usage & Prediction

Page 7:

[7/43] [2007-08-16]

LEAD Science Gateway Architecture

[Figure: layered architecture diagram, top to bottom]
◦ User’s Grid Desktop
◦ Grid Portal Server
◦ Gateway Services: Proxy Certificate Server (Vault), Events & Messaging, Resource Broker, Community & User Metadata Catalog, Workflow engine, Resource Registry, Application Deployment
◦ Core Grid Services: Execution Management, Information Services, Self Management, Data Services, Resource Management, Security Services
◦ Resource Virtualization (OGSA)
◦ Compute Resources, Data Resources, Instruments & Sensors

Page 8:

What is Karma

Provenance Framework
◦ A standalone framework to collect data provenance for adaptive workflows, with low overhead and a lightweight schema, able to answer complex queries

Data Provenance is
◦ a form of metadata
◦ to track derivation history of data
◦ created by a workflow run
◦ executing across organizations (space)
◦ over a period of time

Data Usage: Move forward in time

Workflow trace: Inverse view from the actors

Page 9:

A Typical e-Science Experiment

Weather forecast using WRF in LEAD

[Figure: workflow stages labelled Pre-Processing, Assimilation, Forecast, Visualization]

Page 10:

Workflows: Abstract Workflow Model

Temporal & Spatial composition
◦ Data Flow vs. Invocation Flow

Central vs. Distributed Orchestration

Assumption: Directed Graph of Service Nodes & Data Edges
◦ Data Driven Applications

Hierarchical Composition: Workflows a form of Service

Workflow definition not required

Standalone, independent of Workflow System

[Figure labels: Provides Port, Uses Port, Data Flow]

Page 11:

Workflows: Simple & Complex Workflow Models

[Figure: two workflow models. Simple: a Workflow Engine runs Workflow WF (D1 in, D3 out), invoking Service S1 and Service S2 over data D1, D2, D3. Complex: Workflow WF1 and Workflow WF2 with Services S1, S2, S3 and data D1, D2, D3, D4.]

Page 12:

Provenance Framework in Support of Data Quality Estimation

Activities: Collecting Provenance

Activities generated during lifecycle of workflow

“Sensors” generate activities: Instrumentation of services, clients

Track execution across space, time, depth & operation
◦ Space: which service
◦ Time: when (logical time)
◦ Depth: distance from invocation root (client » workflow » service … nested workflows)
◦ Operation: Track dataflow

18 activities defined

Support Dynamic, Adaptive Workflow

Page 13:

Instrumentation of Services & WF

[Figure labels: WF Engine, Web Service, WS Client]

Page 14:

Karma Architecture

[Figure: architecture diagram]
◦ Workflow Instance: 10 Data Products Consumed & Produced by each Service (Service 1, Service 2, …, Service 9, Service 10; edges labelled 10P, 10C, 10P/10C), driven by a Workflow Engine
◦ Publish Provenance Activities as Notifications: Application–Started & –Finished, Data–Produced & –Consumed Activities; Workflow–Started & –Finished Activities
◦ Message Bus: WS-Eventing Service API, WS-Messenger Notification Broker
◦ Karma Provenance Service: Provenance Listener (Subscribe & Listen to Activity Notifications), Activity DB, Provenance Query API
◦ Provenance Browser Client: Query for Workflow, Process, & Data Provenance

A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al.; ICWS, 2006

Page 15:

Service Invocation State Diagram

[Figure: state diagram spanning CLIENT and SERVICE; states: Service Invoked, Data Transfer In, Computation, Data Consumed, Data Produced, Data Transfer Out, Sending Result]

Page 16:

Activities: Types & Source

Activity                                     Generated By   Type
[Service | Workflow] Initialized             Service        Independent
[Service | Workflow] Terminated              Service        Independent
Invoking Service                             Client         Bounding
Service Invoked                              Service        Bounding
Invoking Service [Succeeded | Failed]        Client         Bounding
Data Transfer                                Service        Operational
Computation                                  Service        Operational
Data Produced                                Service        Operational
Data Consumed                                Service        Operational
Sending [Result | Fault]                     Service        Bounding
Received [Result | Fault]                    Client         Bounding
Sending Response [Succeeded | Failed]        Service        Bounding
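The table on this slide can be carried around as plain data, which makes it easy to check which component is expected to emit which activity. The following is a minimal sketch, not Karma's own code: the names `Kind`, `ACTIVITIES`, `activities_generated_by`, and `count_by_kind` are hypothetical, and the slide does not spell out what each category means, so the sketch only records the labels.

```python
from collections import Counter
from enum import Enum


class Kind(Enum):
    """Activity categories as labelled on the slide (hypothetical enum name)."""
    INDEPENDENT = "Independent"
    BOUNDING = "Bounding"
    OPERATIONAL = "Operational"


# (activity, generated by, kind), transcribed row by row from the slide's table.
ACTIVITIES = [
    ("[Service | Workflow] Initialized", "Service", Kind.INDEPENDENT),
    ("[Service | Workflow] Terminated", "Service", Kind.INDEPENDENT),
    ("Invoking Service", "Client", Kind.BOUNDING),
    ("Service Invoked", "Service", Kind.BOUNDING),
    ("Invoking Service [Succeeded | Failed]", "Client", Kind.BOUNDING),
    ("Data Transfer", "Service", Kind.OPERATIONAL),
    ("Computation", "Service", Kind.OPERATIONAL),
    ("Data Produced", "Service", Kind.OPERATIONAL),
    ("Data Consumed", "Service", Kind.OPERATIONAL),
    ("Sending [Result | Fault]", "Service", Kind.BOUNDING),
    ("Received [Result | Fault]", "Client", Kind.BOUNDING),
    ("Sending Response [Succeeded | Failed]", "Service", Kind.BOUNDING),
]


def activities_generated_by(source: str) -> list:
    """Names of the activities that the given component ("Client" or "Service") emits."""
    return [name for name, generated_by, _ in ACTIVITIES if generated_by == source]


def count_by_kind() -> Counter:
    """Number of table rows per category."""
    return Counter(kind for _, _, kind in ACTIVITIES)


if __name__ == "__main__":
    print(activities_generated_by("Client"))
    print(count_by_kind())
```

Rows with bracketed alternatives, such as [Succeeded | Failed], are kept as single entries here, exactly as the table lists them.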

Page 17:

Activities Sequence Diagram for Basic Workflow

[Figure: sequence diagram between Client (C) and Service (S) with input data D1 and output data D2; axes: Time, Space, Depth, Operation; operation groups: Transfer, Consume, Compute, Produce]

Activities in time order:
◦ S: Initialize
◦ C: Invoke Service
◦ S: Invoked
◦ C: Invocation Successful
◦ S: Transfer Input Data D1
◦ S: Consume Data D1
◦ S: Perform Computation
◦ S: Produce Data D2
◦ S: Transfer Output Data D2
◦ S: Send Response
◦ C: Receive Response
◦ S: Send Response Successful
◦ S: Terminate

Page 18:

Sequence Diagram for Simple Workflow

[Figure: sequence diagram for Workflow WF (D1 in, D3 out) run by a Workflow Engine; Service S1 consumes D1 and produces D2, Service S2 consumes D2 and produces D3; axes: Time, Space, Depth, Operation; operation groups: Consume, Produce]

Activities in time order:
◦ S1, S2, WF: Initialize
◦ WF: Invoke Service S1
◦ S1: Invoked
◦ WF: Invocation Successful
◦ S1: Consume Data D1
◦ S1: Produce Data D2
◦ S1: Send Response
◦ WF: Receive Response
◦ S1: Send Response Successful
◦ WF: Invoke Service S2
◦ S2: Invoked
◦ WF: Invocation Successful
◦ S2: Consume Data D2
◦ S2: Produce Data D3
◦ S2: Send Response
◦ WF: Receive Response
◦ S2: Send Response Successful
◦ S1, S2, WF: Terminate

Page 19:

Activities: Naming

Uniquely identifying data & services is critical for provenance

Data product has GUID. Replicas have URLs.

Service & Workflow instances have GUID

Services defined in the context of workflows have a Node ID in the workflow name space

Clients have GUID

Entity: 4-tuple
◦ <Workflow ID, Service ID, Node ID, Timestep>

Invocation: 2-tuple
◦ <Invoker Entity, Invokee Entity>
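The two tuples on this slide map naturally onto small immutable records. The sketch below is an illustration only: the class and field names are hypothetical, the ID values are copied from the XML example elsewhere in this deck, and representing the invoking workflow with empty workflow, node, and timestep fields is an assumption, since the slide does not say how a top-level invoker is filled in.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    """<Workflow ID, Service ID, Node ID, Timestep> as listed on the slide."""
    workflow_id: Optional[str]
    service_id: str
    node_id: Optional[str]
    timestep: Optional[int]


class Invocation(NamedTuple):
    """<Invoker Entity, Invokee Entity> as listed on the slide."""
    invoker: Entity
    invokee: Entity


WORKFLOW_ID = "tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"

# Assumption: the invoking workflow is identified by its own instance ID only.
workflow = Entity(None, WORKFLOW_ID, None, None)

convert = Entity(
    workflow_id=WORKFLOW_ID,
    service_id="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService",
    node_id="ConvertService_4",
    timestep=36,
)

invocation = Invocation(invoker=workflow, invokee=convert)

# Tuples are hashable, so an invocation can key a lookup table of its activities.
activities_by_invocation = {invocation: ["serviceInvoked", "dataProduced"]}
```

Because both records are plain tuples, two activities that name the same four values compare equal, which is what lets separately published notifications be matched to one invocation.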

Page 20:

Activities: Provenance Activity Contents

Activity Type

Source Entity: 4-tuple
◦ <Workflow ID, Service ID, Node ID, Timestep>

Remote Entity: 4-tuple

Attributes
◦ todo

Annotations

Page 21:

Activities: Modeling Activities in XML

    <serviceInvoked xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
      <notificationSource
          workflowNodeID="ConvertService_4"
          workflowTimestep="36"
          workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
          serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
      <timestamp>2006-09-10T23:56:28.677Z</timestamp>
      <description>Convert Service was Invoked</description>
      <request><header>...</header><body>...</body></request>
      <initiator serviceID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" />
    </serviceInvoked>

    <dataProduced xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
      <notificationSource
          workflowNodeID="ConvertService_4"
          workflowTimestep="36"
          workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
          serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
      <timestamp>2006-09-10T23:56:32.324Z</timestamp>
      <dataProduct>
        <id>lead:uuid:1157946992-atlas-x.gif</id>
        <location>gsiftp://tyr1.cs.indiana.edu/tmp/20060910235628_Convert/outputData/atlas-x.gif</location>
        <timestamp>2006-09-10T23:56:32.324Z</timestamp>
      </dataProduct>
    </dataProduced>
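To make the structure of these activity documents concrete, here is a minimal sketch of reading the dataProduced example with Python's standard `xml.etree.ElementTree`. It is not the Karma listener; the function name `parse_activity` and the shape of the returned dictionary are hypothetical, and the XML is the slide's example with straight quotes.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the slide's example; "wt" is an arbitrary local prefix.
NS = {"wt": "http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking"}

DATA_PRODUCED = """<dataProduced xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
  <notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36"
      workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
      serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
  <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  <dataProduct>
    <id>lead:uuid:1157946992-atlas-x.gif</id>
    <location>gsiftp://tyr1.cs.indiana.edu/tmp/20060910235628_Convert/outputData/atlas-x.gif</location>
    <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  </dataProduct>
</dataProduced>"""


def parse_activity(xml_text: str) -> dict:
    """Extract the activity type, source entity and data product IDs (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # root.tag looks like "{namespace}dataProduced"; keep only the local name.
    activity_type = root.tag.split("}", 1)[-1]
    source = root.find("wt:notificationSource", NS)
    return {
        "type": activity_type,
        "workflow_id": source.get("workflowID"),
        "service_id": source.get("serviceID"),
        "node_id": source.get("workflowNodeID"),
        "timestep": int(source.get("workflowTimestep")),
        "timestamp": root.findtext("wt:timestamp", namespaces=NS),
        "data_ids": [e.text for e in root.findall("wt:dataProduct/wt:id", NS)],
    }


if __name__ == "__main__":
    print(parse_activity(DATA_PRODUCED))
```

The four `notificationSource` attributes correspond to the source entity's 4-tuple, so one parsed record carries everything needed to attach the data product to an invocation.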

Page 22:

Activities: Publishing Activities as Notifications

Activities are modeled as notifications that are sent by different components
◦ Loosely coupled, easy to generate provenance

XML Representation of provenance activities

WS-Messenger Notification Broker acts as message bus
◦ WS-Eventing & WS-Notification

Provenance service & interested clients subscribe to notification
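The loose coupling described here can be illustrated with a toy in-process publish/subscribe broker. This is only a sketch of the pattern: `InMemoryBroker` is a hypothetical stand-in, it does not use or imitate the WS-Messenger, WS-Eventing, or WS-Notification APIs, and the topic name and activity dictionaries are made up for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBroker:
    """Toy notification broker: publishers and subscribers only share a topic name."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, activity: dict) -> None:
        # Deliver to every subscriber; the publisher never learns who is listening.
        for callback in list(self._subscribers[topic]):
            callback(activity)


# Two independent listeners: a provenance store and a monitoring client.
activity_db: List[dict] = []
monitor_log: List[str] = []

broker = InMemoryBroker()
broker.subscribe("workflow-17", activity_db.append)
broker.subscribe("workflow-17", lambda activity: monitor_log.append(activity["type"]))

# An instrumented service publishes activities as it runs.
broker.publish("workflow-17", {"type": "serviceInvoked", "service": "ConvertService"})
broker.publish("workflow-17", {"type": "dataProduced", "data_id": "atlas-x.gif"})
```

The service publishing the activities has no reference to either listener, which is the property that lets a provenance service and a monitoring client consume the same stream independently.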

Page 23:

Backend: Provenance Database

~Union of provenance model
Provenance incrementally built
Relational database (MySQL)

Page 24:

Information Model: Data Provenance View

Data Provenance

Entity is the state of a service or a client

Invocation relates a client (invoker) to a service (invokee). Status.

Data provenance of produced data relates invocation with consumed data

Lightweight schema

Karma2: Provenance Management for Data Driven Workflows, Simmhan, Y., et al.; J. Web Svc. Res., 2008

[Figure labels: Client ENTITY (Invoker), Service ENTITY (Invokee), Request, Response]

Page 25:

Information Model: Data Provenance & Usage Views

[Figure labels: Client ENTITY (Invoker), Service ENTITY (Invokee), Request, Response]

Page 26:

Information Model: Workflow & Process Provenance Views

[Figure labels: Client ENTITY (Invoker), Service ENTITY (Invokee), Request, Response]

Page 27:

Dissemination: Querying Provenance

All 5 provenance models can be queried for by ID
◦ Data Provenance (by Data ID)
◦ Recursive Data Provenance (by Data ID, depth)
◦ Data Usage (by Data ID)
◦ Process provenance (by Invoker & Invokee)
◦ Workflow Trace (by Invoker & Invokee, depth)

Service API to query and return results as XML Document

Provenance Challenge Workshop
◦ Direct API, Incremental client, Graph matching algorithm

Incremental building of complex queries

Query Capabilities of the Karma Provenance Framework, Simmhan, Y., et al.; 1st Provenance Challenge & CCPE J., 2007
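As an illustration of what "recursive data provenance by Data ID and depth" and "data usage by Data ID" might compute, the sketch below walks in-memory invocation records for the simple workflow shown earlier in the deck (S1 consumes D1 and produces D2; S2 consumes D2 and produces D3). It is an assumption-laden toy, not the Karma query API: the record type, function names, and the nested-dictionary result format are all hypothetical, and Karma itself returns XML documents.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InvocationRecord:
    """One invocation together with the data it consumed and produced (hypothetical)."""
    invoker: str
    invokee: str
    consumed: Tuple[str, ...]
    produced: Tuple[str, ...]


# The simple workflow from the sequence diagram: WF invokes S1, then S2.
RECORDS = [
    InvocationRecord("WF", "S1", ("D1",), ("D2",)),
    InvocationRecord("WF", "S2", ("D2",), ("D3",)),
]


def data_provenance(data_id: str, records: List[InvocationRecord]) -> Optional[InvocationRecord]:
    """The invocation that produced data_id, or None if it was never produced here."""
    for record in records:
        if data_id in record.produced:
            return record
    return None


def recursive_data_provenance(data_id: str, depth: int, records: List[InvocationRecord]) -> dict:
    """Trace data_id back through at most `depth` producing invocations."""
    node: dict = {"data": data_id}
    record = data_provenance(data_id, records)
    if record is not None and depth > 0:
        node["produced_by"] = record.invokee
        node["inputs"] = [
            recursive_data_provenance(d, depth - 1, records) for d in record.consumed
        ]
    return node


def data_usage(data_id: str, records: List[InvocationRecord]) -> List[str]:
    """Forward view: the invokees that consumed data_id."""
    return [record.invokee for record in records if data_id in record.consumed]


if __name__ == "__main__":
    print(recursive_data_provenance("D3", 2, RECORDS))
    print(data_usage("D2", RECORDS))
```

The depth argument bounds how far back the trace goes, so a depth of 1 stops at the immediate inputs of D3 while a depth of 2 continues back to D1.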

Page 28:

Applications: Process Monitoring

Realtime Monitoring using XBaya

Page 29:

Applications: Information Integration

Visual Exploration using Karma GUI

Page 30:

Performance & Scalability Study: Experimental Setup

[Figure: experimental setup]
◦ Provenance Clients on odin001 … odin128 (Generate Provenance, Query Provenance): Dual-Processor 2.0 GHz 64-bit Opteron, 4GB RAM
◦ Provenance Service Components on tyr10, tyr11, tyr12, tyr13 (Karma, WS-Messenger Broker, MySQL; PReServ in Tomcat 5.0, Embedded Java DB): Dual-Processor 2.0 GHz 64-bit Opteron, 16GB RAM, Local IDE disk
◦ Gbps Network

Karma Service, WS-Messenger Notification Broker, MySQL
PReServ in Tomcat 5.0 container
Tyr web-services cluster (16 Nodes)
Odin computer cluster (128 Nodes)
Gigabit Ethernet, local IDE disk storage
SLURM job manager for parallel job submission on Odin
Java 1.5, Jython

Page 31:

Performance & Scalability Study: Collecting Provenance

Comparative Study of Karma with PReServ (U. Soton)

Provenance services on tyr (2 GHz/16GB/64bit) & clients on odin (2 GHz/4GB/64bit)

Time to collect provenance activities synchronously
1. Single service with increasing number of service invocations
◦ Karma scales linearly
2. Linear workflow with increasing number of data produced/consumed
◦ Karma scales linearly, PReServ constant

Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006

Page 32:

Performance & Scalability Study: Collecting Provenance

Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006

Time to collect provenance from simulated ensemble WRF forecasting workflow

Scalability with increasing # of parallel runs

1–20 concurrent workflows

Karma scales sub-linearly

Page 33:

Performance & Scalability Study: Querying Provenance

Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006

Response time to query workflow, process, and data provenance from Karma (PReServ was an order of magnitude slower)

Scalability with increasing # of concurrent clients
Karma contains 1000 workflow invocations
Query for 20 workflow/200 process/200 data provenance documents

Page 34:

Related Work

PReServ, U. of Southampton (Luc Moreau)
◦ Standalone, Annotation support
◦ No data provenance, workflow concept; poor performance

VisTrails, U. of Utah (Juliana Freire)
◦ Workflows for graphical modeling
◦ Constrained to browser

PASS, Harvard U. (Margo Seltzer)
◦ System level provenance
◦ No service/data abstraction

Trio, Stanford U. (Jennifer Widom)
◦ Tuple level provenance on Database operations
◦ Restricted to databases

Data Collector, IBM (alphaworks)
◦ Automatically record & track SOAP Messages
◦ No data provenance

Page 35:

What is new in Karma3?

Process control flow tracking

Vertical integration across applications
◦ Support for database queries

Process & data abstraction

Mining provenance logs
◦ WF composition
◦ Semantic support (S-OGSA)