
National Center for Supercomputing Applications

The Way Things Go

e-Science is a complex activity

Scientific knowledge is comprehensible only in the context of those activities

Adopt the Rube Goldberg view

Rube Goldberg


Grand challenge: systems-scale science

Observation and modeling of multiple systems at multiple scales

Linking data and tools from different disciplines

to get a valid global result!

“... modeling complex systems will be a major research challenge for the 21st century” – National Science Foundation


Building current practices up isn't working

Heterogeneous tools, data formats

Little global coordination of research

Little funding for sustained stewardship of tools and data

M.C. Escher, “Tower of Babel” (1928)


Proposed solutions aren't working

e-Journals – not machine-interpretable

Collaboration tools – scientists just use email like everyone else

Portals and digital libraries – typically centralized, domain-specific

The Grid – can orchestrate complex processing jobs, but that's not science


Only networks work at scale

Single researcher (desktop): ad hoc data mgt, single-user apps

Community (workgroup): community tools, resources, control

Global (network): no global practice, tools, control


How do we get there?

e-Science means managing process and data

Current approaches favor one or the other

Information is getting lost

(Figure labels: observe, model, predict, refine; data; critical interface)


Trends: process data

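(Figure: technologies placed along a process axis and a data axis)

Process axis labels: batch, interactive, workflow

Data axis labels: data, metadata, semantics

Technologies shown: mainframes, desktop apps, the grid, formats, digital libraries, portals, e-notebooks, ontologies, rules, provenance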


Key technologies

Semantic web: data/metadata

Provides means of merging descriptive information even if it only partially agrees (e.g., comes from two different communities)

Workflow: process

Describes complex procedures independently of how they are executed

Provenance: process + data/metadata

Links workflow, data, and any ancillary descriptive information (e.g., attribution)


Semantics: data to knowledge

From concrete to abstract:

Data: streams, arrays, swaths, etc. (a.k.a. files)

Information: collections, tags, attributes, etc. (a.k.a. metadata)

Knowledge: ontologies, rules, models, etc. (a.k.a. semantics)

Data -> information: aggregation, annotation

Information -> knowledge: learning, inference

(cf. Reagan Moore)


Semantic web: RDF triple

Declarative: asserts a fact

Subject and object URIs identify arbitrary entities (things, people, concepts, events)

Predicate identifies the relationship between them

(Figure: subject -> predicate -> object)


Triples form an open network

Subject nodes aren't “owned” by any single agent or container

Any actor can add arcs to the implicit, total, world graph

Any two graphs can be joined

hasBreed
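
As a minimal sketch of the points on this slide, triples can be modeled as plain tuples and a graph as a set of them, so that joining two graphs is just set union. The URIs and the `hasName` predicate are made up for illustration (only `hasBreed` appears on the slide); a real system would use an RDF library rather than bare tuples.

```python
# Hypothetical URIs; only the predicate name "hasBreed" comes from the slide.
EX = "http://example.org/"

# A triple is (subject, predicate, object); a graph is a set of triples.
kennel_graph = {
    (EX + "rex", EX + "hasBreed", EX + "Collie"),
}
vet_graph = {
    (EX + "rex", EX + "hasName", "Rex"),
    (EX + "rex", EX + "hasBreed", EX + "Collie"),  # same fact, asserted independently
}


def join(*graphs):
    """Join graphs by set union; identical triples collapse into one."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged


def describe(graph, subject):
    """Return all (predicate, object) arcs leaving a subject node."""
    return sorted((p, o) for s, p, o in graph if s == subject)


world = join(kennel_graph, vet_graph)
```

Neither source graph "owns" the node for `rex`: each actor adds arcs to it, and the union carries both descriptions.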


Non satis scire (to know is not enough)

Semantic web “layer cake”

Where do we manage process? User interface? Applications?

“Semantic Grid” (D. DeRoure, C. Goble)

(source: World Wide Web Consortium)


Workflow: process description

Describe complex operations as networks of simpler operations

Abstract operation execution from description

Can be shared (but may not be portable)

(Taverna)

(Kepler)


Anatomy of a workflow

Declarative: says what to do

Modules identify arbitrary procedures

Arcs identify flow of control and/or data (data flow is usually implicit)

“Module”

Control flow

Execution model (usu. implicit)
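
The separation between a workflow's description and its execution can be sketched as follows. The module names and their bodies are hypothetical placeholders, and the serial, dependency-ordered `run` function is only one possible execution model, not that of any particular workflow system.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Modules: named procedures. These three are hypothetical placeholders.
modules = {
    "load": lambda inputs: [3, 1, 2],
    "sort": lambda inputs: sorted(inputs["load"]),
    "report": lambda inputs: "min=%d" % inputs["sort"][0],
}

# Arcs: each module maps to the set of modules it depends on.
# This dictionary is the declarative description; it says nothing about execution.
arcs = {"load": set(), "sort": {"load"}, "report": {"sort"}}


def run(modules, arcs):
    """One possible execution model: run modules serially in dependency order."""
    results = {}
    for name in TopologicalSorter(arcs).static_order():
        inputs = {dep: results[dep] for dep in arcs[name]}
        results[name] = modules[name](inputs)
    return results


results = run(modules, arcs)
```

A different engine could run independent modules in parallel or on remote machines without any change to `modules` or `arcs`.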


Workflow systems

Modules representing units of computation

Language for specifying WF: modules, control flow

Engine for executing WF

D2K (source: NCSA)


Work vs. workflow systems

Scientists are not WF modules

Science work also involves:

social organization, incl. funding

field and “wet lab” manual work

discourse: review, validation

(source: CNRS/UCSD)


Provenance: what happened

Answers critical questions

What led to this result?

When and how were observations made, conclusions reached?

Is a causal network of events
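
If provenance is stored as a causal network, the question "what led to this result?" becomes a graph traversal. The sketch below assumes a simple dictionary mapping each item to its direct causes; the item names are invented for illustration.

```python
# Each key was caused by the items it maps to (hypothetical example events).
caused_by = {
    "figure.png": ["plot step"],
    "plot step": ["model output"],
    "model output": ["model run"],
    "model run": ["sensor readings", "model parameters"],
}


def what_led_to(result, caused_by):
    """Collect everything upstream of `result` by walking the causal arcs."""
    seen = []
    stack = [result]
    while stack:
        node = stack.pop()
        for cause in caused_by.get(node, []):
            if cause not in seen:
                seen.append(cause)
                stack.append(cause)
    return seen


history = what_led_to("figure.png", caused_by)
```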


Complementary incomplete notions of provenance

Artifact-centric (e.g., digital libraries)

“lineage” = events in lifecycle of artifact, e.g., custody

IR's focus on curation events (not antecedent processes)

Process-centric (e.g., workflow)

computational events (e.g., service invocations), control flow

artifacts are either not mentioned or opaque (tool-specific)


Provenance Challenges 1 & 2

IPAW 2006, HPDC 2007

20 teams, 1 workflow, 9 queries

major players

Interoperability?

lots of manual work required

call for standards

(source: gridprovenance.org)


Artifact + process provenance = “open provenance”

Can describe any process, not just WF execution (e.g., science!)

Allows alternate accounts by different observers

Rules for inferring transitive causal relationships

(source: Luc Moreau et al)


Open Provenance Model

3 node types – artifact, process, agent

5 arc types – used, generated, triggered, derived, controlled – and inference rules

Generic – extensibility via annotation

Choice of granularity and focus (e.g., artifact or process-centric)

(source: Luc Moreau et al)
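
A minimal sketch of the node types, arc types, and one inference rule listed above. The arc names use the long forms from the OPM specification (`used`, `wasGeneratedBy`, `wasTriggeredBy`, `wasDerivedFrom`, `wasControlledBy`); the `ProvenanceGraph` class, its methods, and the example node ids are hypothetical and are not the API of the OPM toolkit. Only the transitive closure of derivation is implemented here.

```python
from dataclasses import dataclass, field

NODE_TYPES = {"artifact", "process", "agent"}

# Arc type -> (type of effect node, type of cause node); arcs point from effect to cause.
ARC_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasControlledBy": ("process", "agent"),
}


@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    arcs: set = field(default_factory=set)     # (effect, arc type, cause)

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError("unknown node type: " + node_type)
        self.nodes[node_id] = node_type

    def add_arc(self, effect, arc_type, cause):
        expected = ARC_TYPES[arc_type]
        actual = (self.nodes[effect], self.nodes[cause])
        if actual != expected:
            raise ValueError("%s cannot link %s to %s" % ((arc_type,) + actual))
        self.arcs.add((effect, arc_type, cause))

    def derived_from_all(self, artifact):
        """Multi-step derivation: transitive closure over wasDerivedFrom arcs."""
        found, frontier = set(), [artifact]
        while frontier:
            current = frontier.pop()
            for effect, arc_type, cause in self.arcs:
                if (effect == current and arc_type == "wasDerivedFrom"
                        and cause not in found):
                    found.add(cause)
                    frontier.append(cause)
        return found


g = ProvenanceGraph()
for node_id, node_type in [("raw", "artifact"), ("clean", "artifact"),
                           ("plot", "artifact"), ("cleaning", "process"),
                           ("alice", "agent")]:
    g.add_node(node_id, node_type)
g.add_arc("cleaning", "used", "raw")
g.add_arc("clean", "wasGeneratedBy", "cleaning")
g.add_arc("cleaning", "wasControlledBy", "alice")
g.add_arc("clean", "wasDerivedFrom", "raw")
g.add_arc("plot", "wasDerivedFrom", "clean")
```

Here `g.derived_from_all("plot")` reaches both `clean` and `raw`, even though no arc links `plot` to `raw` directly.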


NCSA Provenance Infrastructure

(Figure: layered architecture, top to bottom)

Desktop, portal, etc.: visualization, interaction

OPM toolkit (Open Provenance Model): tracking, modeling, presentation

Tupelo Semantic Content Repository (Contexts over Stores): abstraction, inference, storage


Tupelo: semantic content

Abstracts content from storage impls (e.g., Sesame, Mulgara)

Provides location-independent addressing of content and metadata

Supports transparent mirroring, caching, failover, etc.

(tupeloproject.org)
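
The idea of hiding storage implementations behind a single interface, with mirroring and failover, can be sketched as below. This is not Tupelo's API: `MemoryStore`, `MirroredContext`, and the `urn:example:` identifier are hypothetical stand-ins chosen only to illustrate the pattern.

```python
class MemoryStore:
    """Stand-in for one storage implementation, keyed by URI."""

    def __init__(self, available=True):
        self.data = {}
        self.available = available

    def read(self, uri):
        if not self.available:
            raise ConnectionError("store unavailable")
        return self.data[uri]

    def write(self, uri, content):
        if not self.available:
            raise ConnectionError("store unavailable")
        self.data[uri] = content


class MirroredContext:
    """Writes to every reachable store; reads from the first one that answers."""

    def __init__(self, stores):
        self.stores = stores

    def write(self, uri, content):
        for store in self.stores:
            try:
                store.write(uri, content)
            except ConnectionError:
                pass  # a real system would queue the write for later repair

    def read(self, uri):
        for store in self.stores:
            try:
                return store.read(uri)
            except (ConnectionError, KeyError):
                continue  # fail over to the next mirror
        raise KeyError(uri)


primary, mirror = MemoryStore(), MemoryStore()
context = MirroredContext([primary, mirror])
context.write("urn:example:dataset-1", b"observations")
primary.available = False  # simulate an outage of the primary store
```

Callers address content by URI only, so the read after the simulated outage is served by the mirror without the caller knowing which store answered.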


CyberIntegrator: workflow by example

Records what users do as provenance

source, intermediate, and final artifacts

steps and parameters

Can re-enact interaction as a workflow
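
A rough sketch of "workflow by example": record each step and its parameters as the user works, then replay the log as a workflow. The `Recorder` class and the `scale` and `total` steps are hypothetical and do not reflect how CyberIntegrator is implemented. This simplified version replays the literal parameter values that were recorded rather than re-deriving the data flow between steps.

```python
class Recorder:
    """Records each step a user performs, with its parameters and result."""

    def __init__(self):
        self.log = []  # list of (function, keyword parameters, result)

    def do(self, func, **params):
        result = func(**params)
        self.log.append((func, params, result))
        return result

    def as_workflow(self):
        """Turn the recorded steps into a callable that repeats them in order."""
        steps = [(func, dict(params)) for func, params, _ in self.log]

        def workflow():
            return [func(**params) for func, params in steps]

        return workflow


def scale(values, factor):
    return [v * factor for v in values]


def total(values):
    return sum(values)


session = Recorder()
scaled = session.do(scale, values=[1, 2, 3], factor=10)
session.do(total, values=scaled)
replay = session.as_workflow()
```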


MAEviz: analysis/viz app, workflow “behind the scenes”

GIS app. platform

Earthquake hazard analysis plug-in

Data catalog

built environment

fragility/hazard models

Driven by workflow -> provenance


CyberCollaboratory: collaboration + provenance

User interaction with tools generates events

Events are captured using the OPM and published to Tupelo

Non-portal apps can browse / use provenance


Summary

“The way things go” is critical to e-Science at scale

Provenance is an open causal network

New infrastructure supports provenance


Resources / acknowledgements

Grid Provenance Challenge: http://twiki.gridprovenance.org/

NCSA technologies

Tupelo: http://tupeloproject.org/

CyberIntegrator: http://isda.ncsa.uiuc.edu/

MAEviz: http://maeviz.cee.uiuc.edu/

CyberCollaboratory: http://ecid.ncsa.uiuc.edu/cybercollab/

Acknowledgements: Jim Myers, Luc Moreau, Juliana Freire, Patrick Paulson, Simon Miles, Bob McGrath, and more ...
