Dec 17, 2015
National Center for Supercomputing Applications
The Way Things Go
e-Science is a complex activity
Scientific knowledge is comprehensible only in the context of those activities
Adopt the Rube Goldberg view
Rube Goldberg
Grand challenge: systems-scale science
Observation and modeling of multiple systems at multiple scales
Linking data and tools from different disciplines
to get a valid global result!
“... modeling complex systems will be a major research challenge for the 21st century” (National Science Foundation)
Building current practices up isn't working
Heterogeneous tools, data formats
Little global coordination of research
Little funding for sustained stewardship of tools and data
M.C. Escher, “Tower of Babel” (1928)
Proposed solutions aren't working
e-Journals – not machine-interpretable
Collaboration tools – scientists just use email like everyone else
Portals and digital libraries – typically centralized and domain-specific
The Grid – can orchestrate complex processing jobs, but that's not science
Only networks work at scale
Desktop – single researcher: ad hoc data mgt, single-user apps
Workgroup – community: community tools, resources, control
Network – global: no global practice, tools, control
How do we get there?
e-Science means managing both process and data
Current approaches favor one or the other
Information is getting lost
[Figure: cycle labeled model, refine, observe, predict; data; critical interface]
Trends: process data
[Figure: technologies placed along a process axis and a data axis]
Process axis: batch, interactive, workflow
Data axis: data, metadata, semantics
Technologies shown: mainframes, digital libraries, portals, ontologies, provenance, desktop apps, formats, e-notebooks, the grid, rules
Key technologies
Semantic web: data/metadata
Provides means of merging descriptive information even if it only partially agrees (e.g., comes from two different communities)
Workflow: process
Describes complex procedures independently of how they are executed
Provenance: process + data/metadata
Links workflow, data, and any ancillary descriptive information (e.g., attribution)
Semantics: data to knowledge
Data (concrete): streams, arrays, swaths, etc. (a.k.a. files)
Information: collections, tags, attributes, etc. (a.k.a. metadata)
Knowledge (abstract): ontologies, rules, models, etc. (a.k.a. semantics)
Data to information: aggregation, annotation
Information to knowledge: learning, inference
(cf. Reagan Moore)
Semantic web: RDF triple
Declarative: asserts a fact
Subject and object URIs identify arbitrary entities (things, people, concepts, events)
Predicate identifies the relationship between them
[Figure: subject -> predicate -> object]
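The triple structure can be sketched as a small data type. This is a minimal illustration, not an RDF library: the `Triple` class, the `describe` helper, and the `http://example.org/` URIs are all hypothetical, and only URI-valued subjects and objects are modeled, as on the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement of fact: subject --predicate--> object."""
    subject: str    # URI of the entity the statement is about
    predicate: str  # URI naming the relationship
    obj: str        # URI of the related entity


# Hypothetical namespace and entities, used only for illustration.
EX = "http://example.org/"
fact = Triple(EX + "fido", EX + "hasBreed", EX + "beagle")


def describe(t: Triple) -> str:
    """Render the triple as one line, each URI in angle brackets."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


print(describe(fact))
```

The statement carries no procedure: it only asserts that the relationship holds, which is what makes it declarative.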
Triples form an open network
Subject nodes aren't “owned” by any single agent or container
Any actor can add arcs to the implicit, total, world graph
Any two graphs can be joined
[Figure: example graph with a "hasBreed" arc]
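Why any two graphs can be joined is easiest to see if a graph is treated as a plain set of (subject, predicate, object) tuples: joining is then set union. The sketch below assumes that representation; the two sources, the `arcs_from` helper, and every URI are hypothetical.

```python
# Hypothetical namespace; each graph is a set of (subject, predicate, object) tuples.
EX = "http://example.org/"

kennel_club = {
    (EX + "fido", EX + "hasBreed", EX + "beagle"),
    (EX + "beagle", EX + "type", EX + "Breed"),
}
vet_clinic = {
    (EX + "fido", EX + "hasBreed", EX + "beagle"),  # same fact, asserted independently
    (EX + "fido", EX + "hasOwner", EX + "alice"),   # an arc the other source lacks
}

# Joining two graphs is set union: shared facts collapse, new arcs accumulate.
merged = kennel_club | vet_clinic


def arcs_from(graph, subject):
    """All (predicate, object) pairs asserted about one subject node."""
    return sorted((p, o) for s, p, o in graph if s == subject)


for predicate, obj in arcs_from(merged, EX + "fido"):
    print(predicate, "->", obj)
```

Neither source had to own the `fido` node or coordinate with the other; sharing the same URI is enough for the arcs to attach to the same node.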
Non satis scire (to know is not enough)
Semantic web “layer cake”
Where do we manage process? User interface? Applications?
“Semantic Grid” (D. DeRoure, C. Goble)
(source: World Wide Web Consortium)
Workflow: process description
Describe complex operations as networks of simpler operations
Abstract operation execution from description
Can be shared (but may not be portable)
(Taverna)
(Kepler)
Anatomy of a workflow
Declarative: says what to do
Modules identify arbitrary procedures
Arcs identify flow of control and/or data (data flow is usually implicit)
[Figure labels: “Module”, control flow, execution model (usu. implicit)]
Workflow systems
Modules representing units of computation
Language for specifying WF modules and control flow
Engine for executing WF
D2K (source: NCSA)
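The three parts listed above (modules, a description language, an engine) can be sketched in a few lines. This is a toy illustration under stated assumptions, not how D2K, Taverna, or Kepler work: the module names, the dictionary used as the "language", and the `run` function are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Modules: named units of computation (hypothetical examples).
modules = {
    "load": lambda: [3, 1, 2],
    "sort": lambda xs: sorted(xs),
    "total": lambda xs: sum(xs),
    "report": lambda ordered, total: f"{ordered} sums to {total}",
}

# Description: for each module, the modules whose outputs it consumes.
# It says what to do, not how or where to run it.
workflow = {
    "load": [],
    "sort": ["load"],
    "total": ["load"],
    "report": ["sort", "total"],
}


def run(workflow, modules):
    """A toy engine: execute modules in dependency order, in-process."""
    results = {}
    graph = {name: set(inputs) for name, inputs in workflow.items()}
    for name in TopologicalSorter(graph).static_order():
        args = [results[dep] for dep in workflow[name]]
        results[name] = modules[name](*args)
    return results


results = run(workflow, modules)
print(results["report"])
```

Because the `workflow` dictionary never mentions how modules execute, a different engine (parallel, remote, batch) could run the same description, which is the separation the slides describe.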
Work vs. workflow systems
Scientists are not WF modules
Science work also involves:
social organization, incl. funding
field and “wet lab” manual work
discourse: review, validation
(source: CNRS/UCSD)
Provenance: what happened
Answers critical questions:
What led to this result?
When and how were observations made, conclusions reached?
Is a causal network of events
Complementary incomplete notions of provenance
Artifact-centric (e.g., digital libraries)
“lineage” = events in lifecycle of artifact, e.g., custody
IRs focus on curation events (not antecedent processes)
Process-centric (e.g., workflow)
computational events (e.g., service invocations)
control flow
artifacts are either not mentioned or opaque (tool-specific)
Provenance Challenges 1 & 2
IPAW 2006, HPDC 2007
20 teams, 1 workflow, 9 queries
major players
Interoperability?
lots of manual work required
call for standards
(source: gridprovenance.org)
Artifact + process provenance = “open provenance”
Can describe any process, not just WF execution (e.g., science!)
Allows alternate accounts by different observers
Rules for inferring transitive causal relationships
(source: Luc Moreau et al)
Open Provenance Model
3 node types – artifact, process, agent
5 arc types – used, generated, triggered, derived, controlled – and inference rules
Generic – extensibility via annotation
Choice of granularity and focus (e.g., artifact- or process-centric)
(source: Luc Moreau et al)
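The node and arc types above can be sketched as a small graph structure that answers "what led to this result?". This is a simplified illustration, not the OPM specification or the OPM toolkit: the `ProvenanceGraph` class and the example nodes are hypothetical, arcs are assumed to point from effect to cause, and `led_to` takes a plain transitive closure over all arcs rather than applying the model's exact inference rules.

```python
from collections import defaultdict

NODE_TYPES = {"artifact", "process", "agent"}
# Arc types named on the slide; each arc is stored pointing from effect to cause.
ARC_TYPES = {"used", "generated", "triggered", "derived", "controlled"}


class ProvenanceGraph:
    """Hypothetical, minimal container for typed nodes and causal arcs."""

    def __init__(self):
        self.kind = {}                   # node id -> node type
        self.causes = defaultdict(set)   # effect id -> {(arc type, cause id)}

    def add_node(self, node_id, kind):
        if kind not in NODE_TYPES:
            raise ValueError(f"unknown node type: {kind}")
        self.kind[node_id] = kind

    def add_arc(self, effect, arc_type, cause):
        if arc_type not in ARC_TYPES:
            raise ValueError(f"unknown arc type: {arc_type}")
        self.causes[effect].add((arc_type, cause))

    def led_to(self, node_id):
        """Every node reachable by following causal arcs transitively."""
        seen, stack = set(), [node_id]
        while stack:
            current = stack.pop()
            for _, cause in self.causes.get(current, ()):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


g = ProvenanceGraph()
for node_id, kind in [("raw.csv", "artifact"), ("clean", "process"),
                      ("clean.csv", "artifact"), ("plot", "process"),
                      ("figure.png", "artifact"), ("alice", "agent")]:
    g.add_node(node_id, kind)

g.add_arc("clean", "used", "raw.csv")           # process used artifact
g.add_arc("clean.csv", "generated", "clean")    # artifact was generated by process
g.add_arc("plot", "used", "clean.csv")
g.add_arc("figure.png", "generated", "plot")
g.add_arc("clean", "controlled", "alice")       # process was controlled by agent

print(sorted(g.led_to("figure.png")))
```

Nothing in the structure requires the processes to be workflow steps; a manual lab procedure or a review could be recorded as a process node in the same way.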
NCSA Provenance Infrastructure
[Figure: layered architecture]
Visualization, interaction: desktop, portal, etc.
Tracking, modeling, presentation: OPM toolkit (Open Provenance Model)
Abstraction, inference, storage: Tupelo Semantic Content Repository (Contexts over Stores)
Tupelo: semantic content
Abstracts content from storage impls (e.g., Sesame, Mulgara)
Provides location-independent addressing of content and metadata
Supports transparent mirroring, caching, failover, etc.
(tupeloproject.org)
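The idea of abstracting content from storage, with mirroring and failover hidden from the caller, can be sketched as follows. This is not Tupelo's API: `MemoryContext`, `MirroredContext`, their methods, and the `urn:example:` identifier are hypothetical names chosen for illustration, and an in-memory dictionary stands in for a real store.

```python
class MemoryContext:
    """Hypothetical store: keeps content in a dict keyed by URI."""

    def __init__(self):
        self._blobs = {}
        self.available = True   # flip to False to simulate an outage

    def write(self, uri, data):
        if not self.available:
            raise ConnectionError("store unavailable")
        self._blobs[uri] = data

    def read(self, uri):
        if not self.available:
            raise ConnectionError("store unavailable")
        return self._blobs[uri]   # KeyError if this store lacks the URI


class MirroredContext:
    """Writes to every reachable child; reads from the first child that answers."""

    def __init__(self, *children):
        self.children = children

    def write(self, uri, data):
        written = 0
        for child in self.children:
            try:
                child.write(uri, data)
                written += 1
            except ConnectionError:
                continue
        if written == 0:
            raise ConnectionError("no store accepted the write")

    def read(self, uri):
        for child in self.children:
            try:
                return child.read(uri)
            except (ConnectionError, KeyError):
                continue
        raise KeyError(uri)


primary, backup = MemoryContext(), MemoryContext()
repo = MirroredContext(primary, backup)
repo.write("urn:example:dataset/1", b"observations")

primary.available = False                  # simulate the primary going down
data = repo.read("urn:example:dataset/1")  # served by the backup
print(data)
```

The caller addresses content only by URI and never names a store, so stores can be swapped, mirrored, or lost without changing client code.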
CyberIntegrator: workflow by example
Records what users do as provenance:
source, intermediate, and final artifacts
steps and parameters
Can re-enact interaction as a workflow
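Recording an interactive session and re-enacting it as a workflow can be sketched as a log of steps that is later replayed on new inputs. This is an illustration of the idea only, not CyberIntegrator's implementation: the `Recorder` class and the `scale` and `threshold` tools are hypothetical.

```python
class Recorder:
    """Logs each tool invocation: the tool, its inputs, its output, its parameters."""

    def __init__(self):
        self.steps = []       # ordered log of what the user did
        self.artifacts = {}   # artifact name -> value

    def put(self, name, value):
        """Register a source artifact."""
        self.artifacts[name] = value

    def apply(self, tool, inputs, output, **params):
        """Run a tool interactively and record the step."""
        args = [self.artifacts[name] for name in inputs]
        self.artifacts[output] = tool(*args, **params)
        self.steps.append((tool, tuple(inputs), output, params))
        return self.artifacts[output]

    def reenact(self, **sources):
        """Replay the recorded steps, in order, on new source artifacts."""
        replay = Recorder()
        for name, value in sources.items():
            replay.put(name, value)
        for tool, inputs, output, params in self.steps:
            replay.apply(tool, inputs, output, **params)
        return replay.artifacts


def scale(values, factor):
    return [v * factor for v in values]


def threshold(values, minimum):
    return [v for v in values if v >= minimum]


session = Recorder()
session.put("raw", [1, 2, 3])
session.apply(scale, ["raw"], "scaled", factor=10)
session.apply(threshold, ["scaled"], "kept", minimum=20)

rerun = session.reenact(raw=[5, 0, 2])
print(session.artifacts["kept"], rerun["kept"])
```

The log holds the same information the slide lists (source, intermediate, and final artifacts, plus steps and parameters), so it serves as both a provenance record and a reusable workflow.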
MAEviz: analysis/viz app, workflow “behind the scenes”
GIS app. platform
Earthquake hazard analysis plug-in
Data catalog: built environment, fragility/hazard models
Driven by workflow -> provenance
CyberCollaboratory: collaboration + provenance
User interaction with tools generates events
Events are captured using the OPM and published to Tupelo
Non-portal apps can browse / use provenance
Summary
“The way things go” is critical to e-Science at scale
Provenance is an open causal network
New infrastructure supports provenance
Resources / acknowledgements
Grid Provenance Challenge: http://twiki.gridprovenance.org/
NCSA technologies
Tupelo: http://tupeloproject.org/
CyberIntegrator: http://isda.ncsa.uiuc.edu/
MAEviz: http://maeviz.cee.uiuc.edu/
CyberCollaboratory: http://ecid.ncsa.uiuc.edu/cybercollab/
Acknowledgements: Jim Myers, Luc Moreau, Juliana Freire, Patrick Paulson, Simon Miles, Bob McGrath, and more ...