Karma Provenance Collection Framework for Data-driven Workflows
Yogesh Simmhan, Microsoft Research
Beth Plale, Dennis Gannon, Ai Zhang, Girish Subramanian, Abhijit Borude, et al., Indiana University
Putting the ‘e’ in e-Science
Many scientific domains are moving to in silico experiments…
◦ Earth Sciences, Life Sciences, Astronomy
Common requirements
◦ Complex & Dynamic Systems
◦ Adaptive Resources
◦ Data Deluge
◦ Need for Collaboration
Cyberinfrastructure to support these needs
◦ Massively Parallel Systems
◦ High Bandwidth Computer Networks
◦ Petascale Data Archives
Grid Middleware provides the glue to tie these together using a Service Oriented Architecture
Workflows as Experiments
Data-driven applications are designed as workflows
Data flows across applications as it is transformed, fused and used, generating derived data
Control flow determines the path to execute, but data flow determines data movement and dependency
Manually keeping track of the input & derived data of experiments is challenging given the volume of data and the complexity of the applications
Data Management Challenges
Complex, dynamic data-processing pipelines
Remote execution on Grid resources
◦ How was a particular dataset created?
Collaboratory environments with shared resources
Large search space & missing metadata
◦ How good is a given dataset for one’s application?
Data Provenance
Metadata that describes the causality of an event
◦ Along with context to interpret it
What, when, where, who, how, …
We consider provenance for
◦ Workflow execution
◦ Service invocations
◦ Data products
Workflow & Service Provenance
◦ Describes execution of a workflow & invocation of a service
Data Provenance
◦ Describes usage and generation of data products
Provenance /ˈprɒv ə nəns, -ˌnɑns/: The history or pedigree of a work of art, manuscript, etc. A record of the ultimate derivation and passage of an item through its various owners. Source: The Oxford English Dictionary
Benefits
What if the experiment fails?
◦ Did the workflow run correctly? Completely?
◦ Was the correct data/service/parameter used?
◦ Verification, Validation
Can my peer run the experiment & get the same result?
◦ Repeatability
Can I use the results in my publication?
◦ Attribution, Copyright
Can I trust the results of prediction?
◦ Data Quality
How much did it cost? How much will it cost?
◦ Resource Usage & Prediction
LEAD Science Gateway Architecture
[Architecture diagram. Labels: User’s Grid Desktop; Gateway Services; Core Grid Services; Proxy Certificate Server (Vault); Events & Messaging; Resource Broker; Community & User Metadata Catalog; Workflow engine; Resource Registry; Application Deployment; Compute Resources; Data Resources; Instruments & Sensors]
What is Karma
Provenance Framework
A standalone framework to collect data provenance for adaptive workflows, with low overhead and a lightweight schema, able to answer complex queries
Data Provenance is
◦ a form of metadata
◦ to track derivation history of data
◦ created by a workflow run
◦ executing across organizations (space)
◦ over a period of time
Data Usage: Move forward in time
Workflow trace: Inverse view from the actors
A Typical e-Science Experiment
Weather forecast using WRF in LEAD
A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al.; ICWS, 2006
Service Invocation State Diagram
[State diagram between CLIENT and SERVICE. States shown: Service Invoked; Data Transfer In; Computation; Data Consumed; Data Produced; Data Transfer Out; Sending Result]
Activities
Types & Source
Activity                                 Generated By   Type
[Service | Workflow] Initialized         Service        Independent
[Service | Workflow] Terminated          Service        Independent
Invoking Service                         Client         Bounding
Service Invoked                          Service        Bounding
Invoking Service [Succeeded | Failed]    Client         Bounding
Data Transfer                            Service        Operational
Computation                              Service        Operational
Data Produced                            Service        Operational
Data Consumed                            Service        Operational
Sending [Result | Fault]                 Service        Bounding
Received [Result | Fault]                Client         Bounding
Sending Response [Succeeded | Failed]    Service        Bounding
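The activity table on this slide can be written down as a small data structure. The following is a minimal sketch in Python, not Karma's own representation: the names `Source`, `Kind`, `Activity`, `ACTIVITIES` and `by_kind` are hypothetical and introduced only for illustration. The rows pair each activity with its generator and type in the order listed on the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    SERVICE = "Service"
    CLIENT = "Client"


class Kind(Enum):
    INDEPENDENT = "Independent"
    BOUNDING = "Bounding"
    OPERATIONAL = "Operational"


@dataclass(frozen=True)
class Activity:
    name: str
    source: Source  # which party generates the activity
    kind: Kind      # activity type from the slide


# Rows transcribed from the slide's activity table (names are illustrative).
ACTIVITIES = [
    Activity("[Service | Workflow] Initialized", Source.SERVICE, Kind.INDEPENDENT),
    Activity("[Service | Workflow] Terminated", Source.SERVICE, Kind.INDEPENDENT),
    Activity("Invoking Service", Source.CLIENT, Kind.BOUNDING),
    Activity("Service Invoked", Source.SERVICE, Kind.BOUNDING),
    Activity("Invoking Service [Succeeded | Failed]", Source.CLIENT, Kind.BOUNDING),
    Activity("Data Transfer", Source.SERVICE, Kind.OPERATIONAL),
    Activity("Computation", Source.SERVICE, Kind.OPERATIONAL),
    Activity("Data Produced", Source.SERVICE, Kind.OPERATIONAL),
    Activity("Data Consumed", Source.SERVICE, Kind.OPERATIONAL),
    Activity("Sending [Result | Fault]", Source.SERVICE, Kind.BOUNDING),
    Activity("Received [Result | Fault]", Source.CLIENT, Kind.BOUNDING),
    Activity("Sending Response [Succeeded | Failed]", Source.SERVICE, Kind.BOUNDING),
]


def by_kind(kind):
    """Return the names of all activities of the given type."""
    return [a.name for a in ACTIVITIES if a.kind is kind]


print(by_kind(Kind.OPERATIONAL))
```

Grouping by type makes the split visible: operational activities describe what a service does with data, while bounding activities mark the start and end of an invocation.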
Provenance Framework in Support of Data Quality Estimation [2007-08-16]
Activities Sequence Diagram for Basic Workflow
[Sequence diagram: a Client (C) invokes a Service (S) that consumes data D1 and produces data D2. Axes: Time, Space, Depth. Operation groups: Transfer, Consume, Compute, Produce. Activities, listed in invocation order:]
◦ C: Invoke Service
◦ S: Invoked
◦ C: Invocation Successful
◦ S: Transfer Input Data D1
◦ S: Consume Data D1
◦ S: Perform Computation
◦ S: Produce Data D2
◦ S: Transfer Output Data D2
◦ S: Send Response
◦ C: Receive Response
◦ S: Send Response Successful
◦ S: Initialize and S: Terminate (independent of the invocation)
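The activities of a single client-to-service invocation can be generated programmatically. The sketch below is illustrative only: the function name `invocation_activities` is hypothetical, and the ordering (in particular for more than one input or output) is an assumption read off the diagram labels rather than a rule stated in the slides.

```python
def invocation_activities(inputs, outputs):
    """Return (actor, activity) pairs for one invocation.

    "C" is the client and "S" is the service. The ordering is an
    assumption based on the sequence diagram labels.
    """
    seq = [
        ("C", "Invoke Service"),
        ("S", "Invoked"),
        ("C", "Invocation Successful"),
    ]
    for data_id in inputs:
        seq.append(("S", f"Transfer Input Data {data_id}"))
        seq.append(("S", f"Consume Data {data_id}"))
    seq.append(("S", "Perform Computation"))
    for data_id in outputs:
        seq.append(("S", f"Produce Data {data_id}"))
        seq.append(("S", f"Transfer Output Data {data_id}"))
    seq += [
        ("S", "Send Response"),
        ("C", "Receive Response"),
        ("S", "Send Response Successful"),
    ]
    return seq


trace = invocation_activities(["D1"], ["D2"])
for actor, activity in trace:
    print(f"{actor}: {activity}")
```

The bounding activities frame the invocation at both ends, and the operational activities on D1 and D2 sit between them.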
Sequence Diagram for Simple Workflow
[Sequence diagram: a Workflow Engine runs workflow WF (D1 to D3), which invokes Service S1 (consumes D1, produces D2) and then Service S2 (consumes D2, produces D3). Axes: Time, Space, Depth. Operation groups: Consume, Produce. Activities, listed in invocation order:]
◦ WF: Invoke Service S1
◦ S1: Invoked
◦ WF: Invocation Successful
◦ S1: Consume Data D1
◦ S1: Produce Data D2
◦ S1: Send Response
◦ WF: Receive Response
◦ S1: Send Response Successful
◦ WF: Invoke Service S2
◦ S2: Invoked
◦ WF: Invocation Successful
◦ S2: Consume Data D2
◦ S2: Produce Data D3
◦ S2: Send Response
◦ WF: Receive Response
◦ S2: Send Response Successful
◦ S1, S2, WF: Initialize and Terminate (independent of the invocations)
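The consume and produce activities in this workflow are enough to recover which data product was derived from which. The following sketch shows one way to pair them up. It assumes each service instance is invoked once, so grouping by actor is equivalent to grouping by invocation. The `TRACE` tuples and the `derivations` function are hypothetical names used for illustration.

```python
from collections import defaultdict

# (actor, activity, data) triples for the simple workflow; only the
# data-related activities are listed here.
TRACE = [
    ("S1", "Consume Data", "D1"),
    ("S1", "Produce Data", "D2"),
    ("S2", "Consume Data", "D2"),
    ("S2", "Produce Data", "D3"),
]


def derivations(trace):
    """Map each produced data ID to (producing service, consumed data IDs)."""
    consumed = defaultdict(list)
    derived = {}
    for actor, activity, data_id in trace:
        if activity == "Consume Data":
            consumed[actor].append(data_id)
        elif activity == "Produce Data":
            # Everything the service consumed so far counts as an input.
            derived[data_id] = (actor, list(consumed[actor]))
    return derived


print(derivations(TRACE))
```

Chaining the two entries gives the derivation path from D3 back through S2 and D2 to S1 and D1.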
Activities
Naming
Uniquely identifying data & services is critical for provenance
Data product has GUID. Replicas have URLs.
Service & Workflow instances have GUID
Services defined in the context of workflows have a Node ID in the workflow name space
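The naming scheme separates the identity of a data product (one GUID) from its physical copies (replica URLs). A minimal sketch, assuming UUID-based GUIDs: the class names, field names and URLs below are hypothetical, and the slides do not state how Karma generates its GUIDs.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_guid():
    # Assumption: a random UUID URN serves as the GUID.
    return f"urn:uuid:{uuid.uuid4()}"


@dataclass
class DataProduct:
    guid: str = field(default_factory=new_guid)
    replicas: list = field(default_factory=list)  # URLs of physical copies

    def add_replica(self, url):
        """Register a replica URL; the GUID never changes."""
        if url not in self.replicas:
            self.replicas.append(url)


@dataclass(frozen=True)
class ServiceInstance:
    guid: str
    workflow_guid: Optional[str] = None  # set when run within a workflow
    node_id: Optional[str] = None        # ID in the workflow name space


d1 = DataProduct()
d1.add_replica("gsiftp://host-a.example.org/data/d1.nc")  # hypothetical URL
d1.add_replica("gsiftp://host-b.example.org/data/d1.nc")  # hypothetical URL
wf = ServiceInstance(guid=new_guid())
s1 = ServiceInstance(guid=new_guid(), workflow_guid=wf.guid, node_id="node-1")
print(d1.guid, len(d1.replicas), s1.node_id)
```

Provenance records can then refer to `d1.guid` regardless of which replica a service actually read.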
Publishing Activities as Notifications
Activities are modeled as notifications that are sent by different components
◦ Loosely coupled, easy to generate provenance
XML Representation of provenance activities
WS-Messenger Notification Broker acts as message bus
◦ WS-Eventing & WS-Notification
Provenance service & interested clients subscribe to notifications
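The publish/subscribe pattern described here can be sketched with an in-memory broker. This is a stand-in, not WS-Messenger: the `Broker` class, the topic name and the XML element names are all hypothetical, and the real activity schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class Broker:
    """In-memory stand-in for a notification broker (not WS-Messenger)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


def activity_xml(activity, entity_id, data_id=None):
    """Serialize an activity as XML; element names are illustrative."""
    root = ET.Element("activity", {"type": activity})
    ET.SubElement(root, "entity").text = entity_id
    if data_id is not None:
        ET.SubElement(root, "data").text = data_id
    return ET.tostring(root, encoding="unicode")


received = []   # what a provenance service would store
monitored = []  # what another interested client would see

broker = Broker()
broker.subscribe("provenance", lambda msg: received.append(ET.fromstring(msg)))
broker.subscribe("provenance", monitored.append)

# A service publishes without knowing who is listening.
broker.publish("provenance", activity_xml("DataProduced", "service-1", "data-2"))
print(monitored[0])
```

The publishing component only depends on the broker, which is what keeps the provenance service and other subscribers loosely coupled to the workflow.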
Backend
Provenance Database
Approximately a union of the provenance models
Provenance incrementally built
Relational database (MySQL)
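Building provenance incrementally means each incoming activity adds to or updates rows rather than writing one complete record at the end. The sketch below illustrates the idea with SQLite from Python's standard library in place of MySQL. The two-table schema, the `record` helper and the activity labels are assumptions made for this example and are not Karma's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE invocation (
    id      INTEGER PRIMARY KEY,
    invoker TEXT NOT NULL,
    invokee TEXT NOT NULL,
    status  TEXT,
    UNIQUE (invoker, invokee)
);
CREATE TABLE data_event (
    invocation_id INTEGER NOT NULL REFERENCES invocation(id),
    data_id       TEXT NOT NULL,
    role          TEXT NOT NULL CHECK (role IN ('consumed', 'produced'))
);
""")


def record(conn, invoker, invokee, activity, data_id=None):
    """Fold one activity into the database as it arrives."""
    conn.execute(
        "INSERT OR IGNORE INTO invocation (invoker, invokee) VALUES (?, ?)",
        (invoker, invokee),
    )
    (inv_id,) = conn.execute(
        "SELECT id FROM invocation WHERE invoker = ? AND invokee = ?",
        (invoker, invokee),
    ).fetchone()
    if activity in ("consumed", "produced"):
        conn.execute(
            "INSERT INTO data_event VALUES (?, ?, ?)", (inv_id, data_id, activity)
        )
    else:
        conn.execute(
            "UPDATE invocation SET status = ? WHERE id = ?", (activity, inv_id)
        )


record(conn, "WF", "S1", "invoked")
record(conn, "WF", "S1", "consumed", "D1")
record(conn, "WF", "S1", "produced", "D2")
record(conn, "WF", "S1", "succeeded")

# Which data was consumed by the invocation that produced D2?
inputs = conn.execute(
    """SELECT c.data_id
       FROM data_event p JOIN data_event c ON c.invocation_id = p.invocation_id
       WHERE p.data_id = ? AND p.role = 'produced' AND c.role = 'consumed'""",
    ("D2",),
).fetchall()
print(inputs)
```

Because each activity is folded in on arrival, a partially executed workflow already has queryable provenance.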
Information Model
Data Provenance View
Entity is the state of a service or a client
Invocation relates a client (invoker) to a service (invokee). Status.
Data provenance of produced data relates invocation with consumed data
Lightweight schema
Karma2: Provenance Management for Data Driven Workflows, Simmhan, Y., et al.; J. Web Svc. Res., 2008
[Diagram labels: Client ENTITY (Invoker); Service ENTITY (Invokee); Request; Response]
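The three concepts on this slide (entity, invocation, data provenance) map naturally onto small record types. A minimal sketch under stated assumptions: the class and field names are hypothetical, and the status values are taken to be succeeded or failed, following the activity names used earlier in the deck.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class Entity:
    """A service or a client taking part in an invocation."""
    entity_id: str


@dataclass
class Invocation:
    """Relates a client (invoker) to a service (invokee), with a status."""
    invoker: Entity
    invokee: Entity
    status: Status
    consumed: list = field(default_factory=list)  # data IDs used
    produced: list = field(default_factory=list)  # data IDs generated


@dataclass(frozen=True)
class DataProvenance:
    """Relates a produced data product to an invocation and its inputs."""
    data_id: str
    invocation: Invocation
    inputs: tuple


def data_provenance(invocation, data_id):
    if data_id not in invocation.produced:
        raise ValueError(f"{data_id} was not produced by this invocation")
    return DataProvenance(data_id, invocation, tuple(invocation.consumed))


inv = Invocation(Entity("WF"), Entity("S1"), Status.SUCCEEDED, ["D1"], ["D2"])
prov = data_provenance(inv, "D2")
print(prov.data_id, prov.invocation.invokee.entity_id, prov.inputs)
```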
Information Model
Data Provenance & Usage Views
[Diagram labels: Client ENTITY (Invoker); Service ENTITY (Invokee); Request; Response]
Information Model
Workflow & Process Provenance Views
[Diagram labels: Client ENTITY (Invoker); Service ENTITY (Invokee); Request; Response]
Dissemination
Querying Provenance
All 5 provenance models can be queried by ID
◦ Data Provenance (by Data ID)
◦ Recursive Data Provenance (by Data ID, depth)
◦ Data Usage (by Data ID)
◦ Process provenance (by Invoker & Invokee)
◦ Workflow Trace (by Invoker & Invokee, depth)
Service API to query and return results as an XML document
Query Capabilities of the Karma Provenance Framework, Simmhan, Y., et al.; 1st Provenance Challenge & CCPE J., 2007
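Three of the data-centric queries listed above can be illustrated over a tiny in-memory store. This is a sketch, not Karma's service API: the function names, the `PRODUCED_BY` sample records and the dictionary result format are assumptions (the slides say the real service returns XML documents).

```python
# Hypothetical sample records: produced data ID -> (service, consumed data IDs)
PRODUCED_BY = {
    "D2": ("S1", ["D1"]),
    "D3": ("S2", ["D2"]),
}


def data_provenance(data_id):
    """Data Provenance (by Data ID): the immediate derivation, if known."""
    return PRODUCED_BY.get(data_id)


def recursive_data_provenance(data_id, depth):
    """Recursive Data Provenance: follow derivations back at most `depth` steps."""
    record = PRODUCED_BY.get(data_id)
    if record is None or depth <= 0:
        return {"data": data_id}
    service, inputs = record
    return {
        "data": data_id,
        "producedBy": service,
        "inputs": [recursive_data_provenance(i, depth - 1) for i in inputs],
    }


def data_usage(data_id):
    """Data Usage (by Data ID): services that consumed the data product."""
    return sorted(s for s, inputs in PRODUCED_BY.values() if data_id in inputs)


print(recursive_data_provenance("D3", 2))
print(data_usage("D2"))
```

The depth parameter bounds how far back the derivation history is expanded, which keeps responses small for long workflow chains. Data usage walks the same records in the opposite direction.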
Applications: Process Monitoring
Realtime Monitoring using XBaya
Applications: Information Integration
Visual Exploration using Karma GUI
Provenance Service Components
[Diagram of the experimental setup. Labels:]
◦ Dual-processor 2.0 GHz 64-bit Opteron, 16GB RAM, local IDE disk
◦ Generate Provenance
◦ Query Provenance
◦ Karma Service, WS-Messenger Notification Broker, MySQL
◦ PReServ in Tomcat 5.0 container
◦ Tyr web-services cluster (16 nodes)
◦ Odin computer cluster (128 nodes)
◦ Gigabit Ethernet, local IDE disk storage
◦ SLURM job manager for parallel job submission on Odin
◦ Java 1.5, Jython
Performance & Scalability Study
Collecting Provenance
Comparative study of Karma with PReServ (U. Soton)
Provenance services on tyr (2 GHz/16GB/64-bit) & clients on odin (2 GHz/4GB/64-bit)
Time to collect provenance activities synchronously
1. Single service with increasing number of service invocations
◦ Karma scales linearly
2. Linear workflow with increasing number of data produced/consumed
◦ Karma scales linearly, PReServ constant
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
Time to collect provenance from simulated ensemble WRF forecasting workflow
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
Response time to query workflow, process, and data provenance from Karma (PReServ was an order of magnitude slower)
Scalability with increasing # of concurrent clients
◦ Karma contains 1000 workflow invocations
◦ Query for 20 workflow/200 process/200 data provenance documents
Related Work
PReServ, U. of Southampton (Luc Moreau)
◦ Standalone, annotation support
◦ No data provenance, workflow concept; poor performance
VisTrails, U. of Utah (Juliana Freire)
◦ Workflows for graphical modeling
◦ Constrained to browser
PASS, Harvard U. (Margo Seltzer)
◦ System level provenance
◦ No service/data abstraction
Trio, Stanford U. (Jennifer Widom)
◦ Tuple level provenance on database operations
◦ Restricted to databases
Data Collector, IBM (alphaWorks)
◦ Automatically record & track SOAP messages
◦ No data provenance
What is new in Karma3?
Process control flow tracking
Vertical integration across