eScience Workshop on Scientific Workflows Matthew B. Jones National Center for Ecological Analysis and Synthesis University of California Santa Barbara

Jan 20, 2016

Transcript
Page 1

eScience Workshop on Scientific Workflows

Matthew B. Jones
National Center for Ecological Analysis and Synthesis
University of California Santa Barbara

Page 2

Outline

- What is SEEK?
- What is a scientific workflow system?
- Kepler as an example system
- Interoperability among workflow systems
- Models of computation
- Incorporating space, time, and other constraints
- Languages for representing scientific workflows
- Distributed computation and the Grid
- Challenges from existing scientific codes
- Data and model integration and semantics
- Discussion sessions for the day

Page 3

What is SEEK?

Science Environment for Ecological Knowledge (SEEK)

- Multidisciplinary research project to create:
  - Distributed data network (EcoGrid)
    - Environmental, ecological, and systematics data
  - Scalable systems for scientific analysis (workflow systems)
  - Systems for semi-automated data and model integration
- Collaborators
  - NCEAS, UNM, SDSC, U Kansas
  - Vermont, Napier, ASU, UNC

Page 4

What is a scientific workflow?

- Scientists conduct analyses in varied systems
- They mentally coordinate the export and import of data across these systems
- This is a flow of data, analogous to business workflows
  - Strong parallels with scripting and visual programming
- Scientific workflows formalize this process to design, execute, and communicate analytical procedures efficiently
- Systems: Kepler/PtolemyII, DiscoveryNet, Pipeline Pilot, Taverna, Triana, Chimera, Pegasus, …

Page 5

A Trivial Workflow

- Modeled as a directed graph
- Data ingestion/cleaning can be metadata driven
- Output generation includes creating appropriate metadata
- The analysis pipeline itself becomes metadata

Query Grid to find data

Archive output to Grid
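
The idea of a workflow modeled as a directed graph can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not how Kepler or any other system named in these slides is implemented: the step names (`query_grid`, `clean`, `analyze`, `archive`) and the `run_workflow` helper are hypothetical, and the "Grid" is replaced by in-memory data.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps; each one receives the outputs of its upstream steps.
def query_grid():
    # Stand-in for a Grid query: raw records, one of them missing.
    return [3.0, None, 5.0]

def clean(raw):
    # Data cleaning: drop missing values.
    return [x for x in raw if x is not None]

def analyze(values):
    # A trivial analysis: the mean.
    return sum(values) / len(values)

def archive(result):
    # Stand-in for archiving the output together with new metadata.
    return {"value": result, "metadata": {"derived_by": "analyze"}}

# The workflow itself is data: step name -> (function, upstream step names).
WORKFLOW = {
    "query_grid": (query_grid, []),
    "clean": (clean, ["query_grid"]),
    "analyze": (analyze, ["clean"]),
    "archive": (archive, ["analyze"]),
}

def run_workflow(workflow):
    """Run every step once, after all of its upstream steps have run."""
    graph = {name: deps for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results

results = run_workflow(WORKFLOW)
```

Because `WORKFLOW` is an ordinary data structure, it could be stored alongside the archived output, which is one way to read the point that the analysis pipeline itself becomes metadata.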

Page 6

More realistic workflows

Scientific workflows represent knowledge about the analytical and modeling process

Page 7

GARP Invasive Species Model

[Figure: workflow diagram; layout not recoverable from the transcript. Labeled data products: (a) DiGIR species presence & absence points (native range; invasion area); (b) SRB environmental layers (native range; invasion area); (c) integrated layers (native range; invasion area); (d) training sample and test sample; (e) GARP rule set; (f) prediction map (native range; invasion area); (g) model quality parameter. Step labels as extracted: EcoGrid Query, Layer Integration, Sample, Data Calculation, Map Validation, User Validation.]

Slide from D. Pennington

Page 8

Metadata driven data ingestion

- Key information needed to read and machine process a data file is in the metadata
  - Physical descriptors (CSV, Excel, RDBMS, etc.)
  - Logical Entity (table, image, etc.) and Attribute (column) descriptions
    - Name
    - Type (integer, float, string, etc.)
    - Codes (missing values, nulls, etc.)
    - Integrity constraints
  - Semantic descriptions (ontology-based type systems)
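
As a rough illustration of metadata-driven ingestion, the sketch below parses a small CSV file using nothing but a metadata record. The `METADATA` dictionary is a hypothetical, heavily simplified stand-in for the kinds of descriptors listed above (physical format, attribute names, types, missing-value codes, one integrity constraint); it is not EML syntax, and `ingest` is an assumed helper name.

```python
import csv
import io

# Hypothetical metadata record; a simplified stand-in, not real EML.
METADATA = {
    "physical": {"format": "csv", "delimiter": ",", "header_lines": 1},
    "attributes": [
        {"name": "site", "type": str, "missing": [""]},
        {"name": "count", "type": int, "missing": ["-999"]},
        {"name": "area_m2", "type": float, "missing": ["NA"], "minimum": 0.0},
    ],
}

def ingest(text, metadata):
    """Parse delimited text into typed records using only the metadata."""
    physical = metadata["physical"]
    reader = csv.reader(io.StringIO(text), delimiter=physical["delimiter"])
    rows = list(reader)[physical["header_lines"]:]
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(metadata["attributes"], row):
            if raw in attr["missing"]:
                value = None  # missing-value code declared in the metadata
            else:
                value = attr["type"](raw)
                # Integrity constraint, if the metadata declares one.
                if "minimum" in attr and value < attr["minimum"]:
                    raise ValueError(f"{attr['name']} below minimum: {value}")
            record[attr["name"]] = value
        records.append(record)
    return records

SAMPLE = "site,count,area_m2\nA,12,4.0\nB,-999,NA\n"
records = ingest(SAMPLE, METADATA)
```

The parsing code contains no knowledge of this particular file; changing the delimiter, column types, or missing-value codes only requires editing the metadata.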

Page 9

Provenance of derived data

- Metadata needs to be revised following any data transformation
- Versioning metadata and data is important to reuse/repeatability
- The workflow describes the lineage of data processing
- Derived data sets can be stored in Grid with provenance
- Question: which workflow languages are most effective for archiving?
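
One possible way to picture revising metadata after every transformation is sketched below. It is an assumption-laden toy, not a description of how SEEK or Kepler records provenance: the `fingerprint` and `derive` helpers, the dictionary layout, and the use of a truncated SHA-256 hash as a version identifier are all hypothetical choices made for illustration.

```python
import hashlib
import json

def fingerprint(obj):
    """Content hash used here as a version identifier (an illustrative choice)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def derive(step_name, func, dataset):
    """Apply a transformation and return a new dataset with revised metadata."""
    data = func(dataset["data"])
    # Extend the lineage without mutating the parent's metadata.
    lineage = dataset["metadata"]["lineage"] + [
        {"step": step_name, "input_version": dataset["metadata"]["version"]}
    ]
    return {"data": data, "metadata": {"version": fingerprint(data), "lineage": lineage}}

values = [1, 2, None, 4]
raw = {"data": values, "metadata": {"version": fingerprint(values), "lineage": []}}

cleaned = derive("drop_missing", lambda xs: [x for x in xs if x is not None], raw)
doubled = derive("double", lambda xs: [2 * x for x in xs], cleaned)
```

Each derived data set carries the ordered list of steps that produced it and the version of every input, so a stored product can be traced back to its source.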

Page 10

Kepler: scientific workflows

- Open, collaborative effort of: SEEK, SciDAC/SDM, GEON, Ptolemy Project
  - Ecology, biodiversity, molecular bio, geology, engineering
- Kepler aims to extend the Ptolemy system with:
  - Domain-specific computational models
  - Web and grid service access
  - Data integration support
  - Semantic reasoning
- Kepler actors are written in Java but can wrap other applications (such as MATLAB, GRASS)
- Actors can call arbitrary Web (or Grid) Services
- Ptolemy already has a very large inventory of actors

Page 11

Kepler understands EML data*

* EML = Ecological Metadata Language; support is only partially implemented

Page 12

Kepler: database access

Page 13

Kepler: web services access

Page 14

Kepler: grid services access

Page 15

Kepler: ecological modeling

Page 16

Models of Computation

How data flows among workflow nodes is typically not explicitly represented

Scientific models have specific data flow requirements

E.g., simulations sometimes use discrete and sometimes continuous time

Ptolemy introduced specific “Directors” that explicitly control data flow

Process Networks, Discrete Event, Continuous Time, Synchronous Data Flow

Spatial/Temporal/Taxonomic domains
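
To make the notion of a director a little more concrete, the toy sketch below runs the same two actors under two different execution policies. It is not the Ptolemy or Kepler API: the `Scale` actor and the `sdf_director` and `de_director` functions are hypothetical names, and each is a drastic simplification of the Synchronous Data Flow and Discrete Event models it is loosely named after.

```python
import heapq

class Scale:
    """A toy actor: multiplies each input token by a constant."""
    def __init__(self, factor):
        self.factor = factor

    def fire(self, token):
        return token * self.factor

def sdf_director(actors, tokens):
    """Synchronous-dataflow style: a fixed firing schedule, tokens taken in arrival order."""
    outputs = []
    for token in tokens:
        for actor in actors:  # the schedule is known before execution starts
            token = actor.fire(token)
        outputs.append(token)
    return outputs

def de_director(actors, events):
    """Discrete-event style: (time, token) events are processed in timestamp order."""
    queue = list(events)
    heapq.heapify(queue)
    outputs = []
    while queue:
        time, token = heapq.heappop(queue)
        for actor in actors:
            token = actor.fire(token)
        outputs.append((time, token))
    return outputs

pipeline = [Scale(2), Scale(10)]
sdf_out = sdf_director(pipeline, [1, 2, 3])
de_out = de_director(pipeline, [(5.0, 2), (0.5, 1), (2.0, 3)])
```

The actors are identical in both runs; only the policy that decides when they fire differs, which is the separation the slide attributes to Ptolemy's directors.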

Page 17

Workflow languages

- Modeling Markup Language (MoML)
- Discovery Process Markup Language (DPML)
- …
- BPEL
- WS Invocation Framework (WSIF)
- WS Choreography

Page 18

Distributed Computation

- Traditional distributed systems
  - CORBA, DCOM, RMI
- Emerging distributed systems
  - Web services
  - Grid
- Existing scheduling systems
- Challenge of linking these together in integrated workflows
  - Data movement can be limiting, so mobile code is attractive
  - Moving code among computational nodes is limiting
  - Security issues for mobile code
- Implicit models of computation hinder interoperability
  - Among workflow execution systems
  - Among existing scientific models

Page 19

Existing scientific codes

- Many existing applications in science
  - Codes in analytical environments (SAS, Matlab, ArcGIS, R, …)
  - Custom models and simulations (C, C++, FORTRAN, …)
  - Network-accessible services (e.g., Web and Grid services)
- All use different models of computation
- Granularity of implementation is always an issue for use in modular workflows

Page 20

Data and Model Integration

- Complex workflows utilize a variety of data
  - E.g., in ecology: species distribution, climate, hydrology, molecular genetics, physiology
- Challenges
  - Easily bind heterogeneous data to workflows
  - Locate type-compatible workflow components
  - Create semantically-correct metadata for derived products of workflows

Page 21

Homogeneous data integration

Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward

Page 22

Heterogeneous Data integration

- Requires advanced metadata and processing
  - Attributes must be semantically typed
  - Collection protocols must be known
  - Units and measurement scale must be known
  - Measurement relationships must be known
    - e.g., that ArealDensity = Count/Area
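
The measurement relationship ArealDensity = Count/Area can be written down as a small sketch in which values carry a semantic type and a unit, and a derived value is only computed when the declared relationship is satisfied. Everything beyond that formula is an assumption for illustration: the `Measurement` class, the `RELATIONSHIPS` table, the `derive_ratio` helper, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    value: float
    semantic_type: str  # e.g. "Count", "Area", "ArealDensity"
    unit: str

# Hypothetical relationship table: derived type -> (numerator type, denominator type).
RELATIONSHIPS = {"ArealDensity": ("Count", "Area")}

def derive_ratio(target, numerator, denominator):
    """Compute a derived measurement, but only if the semantic types match."""
    expected = RELATIONSHIPS[target]
    actual = (numerator.semantic_type, denominator.semantic_type)
    if actual != expected:
        raise TypeError(f"{target} needs {expected}, got {actual}")
    return Measurement(
        value=numerator.value / denominator.value,
        semantic_type=target,
        unit=f"{numerator.unit}/{denominator.unit}",
    )

count = Measurement(120.0, "Count", "individuals")
area = Measurement(40.0, "Area", "m^2")
density = derive_ratio("ArealDensity", count, area)
```

Swapping the arguments raises a `TypeError` instead of silently producing a number with the wrong meaning, which is the kind of check that semantic typing of attributes makes possible.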

Page 23

Semantic Mediation

- Label data with semantic types
- Label inputs and outputs of analytical components with semantic types
- Use reasoning engines to generate transformation steps
  - Beware analytical constraints
- Use reasoning engine to discover relevant components

[Figure labels: Data, Ontology, Workflow Components]
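
Component discovery by semantic type can be sketched with a tiny hand-written type hierarchy standing in for an ontology and a linear search standing in for a reasoning engine. The type names, the `COMPONENTS` list, and the `is_a` and `find_components` helpers are all hypothetical; a real system would use an ontology language and a reasoner rather than a Python dictionary.

```python
# Hypothetical mini-ontology: child type -> parent type.
ONTOLOGY = {
    "ArealDensity": "Density",
    "Density": "Measurement",
    "Count": "Measurement",
    "Biomass": "Measurement",
}

def is_a(semantic_type, ancestor):
    """True if semantic_type equals ancestor or is one of its descendants."""
    while semantic_type is not None:
        if semantic_type == ancestor:
            return True
        semantic_type = ONTOLOGY.get(semantic_type)
    return False

# Hypothetical components, labeled with the semantic types they accept and produce.
COMPONENTS = [
    {"name": "density_mapper", "input": "Density", "output": "Map"},
    {"name": "count_summarizer", "input": "Count", "output": "Count"},
    {"name": "generic_plotter", "input": "Measurement", "output": "Plot"},
]

def find_components(data_type):
    """Return the names of components whose input type subsumes data_type."""
    return [c["name"] for c in COMPONENTS if is_a(data_type, c["input"])]

matches = find_components("ArealDensity")
```

A data set labeled `ArealDensity` matches the component that accepts any `Density` and the one that accepts any `Measurement`, but not the one that requires a `Count`.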

Page 24

Discussion sessions

- Challenges with making web services work together
  - Compatibility, composition
- Workflow language interoperability
- Workflow environment interoperability
- Distributed computation
- Models of computation
- Workshop findings

Page 25

Discussion Points 1

- Workflows are not necessary in some contexts
  - Pre-compute intermediate products that can then be accessed by db lookup, especially when it is expensive to compute that product
- Workflows are a way of documenting what has been done (provenance)
  - Can be seen as the scientists' conceptual model of what needs to be done; more descriptive information is needed in the process
  - Highlights a hot topic: combine the conceptual view with the executable workflow
  - Go from napkin diagram to formal conceptual workflow to executable workflow
  - Designing the workflow is as important as executing it, or more so
  - Need to be able to get more information about the workflow than the WSDL provides
  - Existing work has been done on getting people involved in the documentation of processes: see Soft Systems Methodology by Peter Checkland
  - Documentation contributes to reproducibility of results because of the exact record a workflow creates
  - Annotation of usage history for workflows gives new users an idea of the quality, appropriateness, and reliability of the workflow for their own usage
  - Useful to be able to print the WF out in a reference, maybe as part of the methods, or at least cite it

Page 26

Discussion Points 2

- Distributed computing with workflows
  - Good idea, but the human cost of coordinating the system is still too high to be practical
  - Still need to make progress through projects that focus on infrastructure
  - Process flows could also demonstrate the benefits of infrastructure development to the domain scientists
  - The last mile in terms of usability is often missed by pure infrastructure efforts; need domain investment to make it seamless
  - Build collaboration into the proposals, but what is the real research reward in that for the domain scientists?
- WebServices++: includes “agreement” on how to pass data by reference (e.g., by LSID)
  - Also need this to be a long-term solution, which is harder to achieve, yet can’t really wait for the WS-* standards before we try to make progress

Page 27

Discussion Points 3

- Models of computation
  - There’s an important point in them, but it has as much to do with how you separate different scientific problems; i.e., does ecology have different needs than bioinformatics that are implicit in the discipline?
  - Need much clearer ways of communicating about these models, and the need for different models may not ever arise
  - Partly driven by how you scope the domain of usefulness for a tool; for example, if you’re handling just web services you’ll never need a continuous time model
  - User probably shouldn’t have to select the model of computation, especially for workflows that can only use one model
- How should an end-user choose a workflow system?
  - Don’t really have a good comparison of the various workflow systems out there
  - Track time to create workflows to get an estimate of effort

Page 28

Discussion Points 4

- Workflow languages
  - It doesn’t matter too much that they don’t interoperate because there are so few workflows
  - People aren’t used to digitizing these methodologies so it’s not considered an issue
- Two separate languages: for designing the actors and the workflow
  - You can describe the workflow without understanding what each component does
  - Need another language to describe semantics of individual components (e.g., OWL-S, Web Service Modeling Ontology (WSMO))
  - Our current efforts focus on describing semantics of data flow, not processing
  - The simplest description of a component is its name; can extend it over time with better and better approximations of a formal specification
  - Inputs and outputs alone don’t cut it
  - Mathematical description alone doesn’t cut it
  - Really need a concept that constrains how the statistical approach is used
  - Mathematically simple models are rare in ecology; complex arbitrary designs are common and extremely difficult to design
  - Until we learn how to represent models declaratively, we’ll never fully understand these complex models

Page 29

Acknowledgements

This material is based upon work supported by the National Science Foundation under awards 0225676 for SEEK and 0225673 (AWSFL008-DS3) for GEON, by the Department of Energy under Contract No. DE-FC02-01ER25486 for SciDAC/SDM, and by DARPA under Contract No. F33615-00-C-1703 for Ptolemy. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

The Andrew W. Mellon Foundation.

PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)

Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON