eScience Workshop on Scientific Workflows
Matthew B. Jones
National Center for Ecological Analysis and Synthesis
University of California Santa Barbara
Jan 20, 2016
Outline
What is SEEK? What is a scientific workflow system?
Kepler as an example system
Interoperability among workflow systems
Models of computation
Incorporating space, time, and other constraints
Languages for representing scientific workflows
Distributed computation and the Grid
Challenges from existing scientific codes
Data and model integration and semantics
Discussion sessions for the day
What is SEEK?
Science Environment for Ecological Knowledge (SEEK)
Multidisciplinary research project to create:
Distributed data network (EcoGrid) for environmental, ecological, and systematics data
Scalable systems for scientific analysis (workflow systems)
Systems for semi-automated data and model integration
Collaborators: NCEAS, UNM, SDSC, U Kansas, Vermont, Napier, ASU, UNC
What is a scientific workflow?
Scientists conduct analyses in varied systems
They mentally coordinate the export and import of data across these systems
This is a flow of data, analogous to business workflows
Strong parallels with scripting and visual programming
Scientific workflows formalize this process to design, execute, and communicate analytical procedures efficiently
Systems: Kepler/Ptolemy II, DiscoveryNet, Pipeline Pilot, Taverna, Triana, Chimera, Pegasus, …
A Trivial Workflow
Modeled as a directed graph (see the sketch below)
Data ingestion/cleaning can be metadata driven
Output generation includes creating appropriate metadata
The analysis pipeline itself becomes metadata
[Figure: a trivial workflow graph running from "Query Grid to find data" through the analysis to "Archive output to Grid"]
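A minimal sketch of the directed-graph view, in plain Java rather than any workflow system; the step names and actions are hypothetical stand-ins for the boxes in the figure:

// A workflow as a directed acyclic graph of steps, run in dependency order.
// Generic illustration only; not Kepler code. Step names are hypothetical.
import java.util.*;

public class TrivialWorkflow {
    record Step(String name, List<String> upstream, Runnable action) {}

    public static void main(String[] args) {
        List<Step> steps = List.of(
            new Step("query",   List.of(),          () -> System.out.println("Query Grid to find data")),
            new Step("ingest",  List.of("query"),   () -> System.out.println("Metadata-driven ingestion/cleaning")),
            new Step("analyze", List.of("ingest"),  () -> System.out.println("Run the analysis")),
            new Step("archive", List.of("analyze"), () -> System.out.println("Archive output and its metadata to Grid")));

        // Fire each step once all of its upstream steps have completed.
        Set<String> done = new HashSet<>();
        Deque<Step> pending = new ArrayDeque<>(steps);
        while (!pending.isEmpty()) {
            Step s = pending.poll();
            if (done.containsAll(s.upstream())) { s.action().run(); done.add(s.name()); }
            else pending.add(s); // dependencies not ready yet; retry later
        }
    }
}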
More realistic workflows
Scientific workflows represent knowledge about the analytical and modeling process
GARP Invasive Species Model
[Figure (slide from D. Pennington): the GARP workflow. EcoGrid queries retrieve DiGIR species presence & absence points (a) and SRB environmental layers (b) for both the native and invasion ranges; layer integration produces integrated layers (c); sampling yields training and test samples (d); a GARP rule set (e) is computed from the training data and applied to generate native range and invasion area prediction maps (f); map validation against the test sample yields model quality parameters (g).]
Metadata driven data ingestion
Key information needed to read and machine-process a data file is in the metadata (a minimal sketch follows this slide):
Physical descriptors (CSV, Excel, RDBMS, etc.)
Logical Entity (table, image, etc.) and Attribute (column) descriptions: name; type (integer, float, string, etc.); codes (missing values, nulls, etc.); integrity constraints
Semantic descriptions (ontology-based type systems)
Provenance of derived data:
Metadata needs to be revised following any data transformation
Versioning metadata and data is important to reuse/repeatability
The workflow describes the lineage of data processing
Derived data sets can be stored in the Grid with provenance
Question: which workflow languages are most effective for archiving?
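A minimal sketch of metadata-driven parsing, assuming a hypothetical attribute description (name, type, missing-value code) of the kind EML provides; this is illustrative Java, not the actual Kepler/EML ingestion code:

// Parse CSV cells using attribute metadata: declared types and missing-value codes.
import java.util.*;

public class MetadataDrivenReader {
    // A simplified stand-in for an EML attribute description.
    record Attribute(String name, String type, String missingCode) {}

    static Object parseCell(String cell, Attribute attr) {
        if (cell.equals(attr.missingCode())) return null;  // declared missing-value code
        return switch (attr.type()) {                      // declared storage type
            case "integer" -> Integer.parseInt(cell);
            case "float"   -> Double.parseDouble(cell);
            default        -> cell;                        // fall back to string
        };
    }

    public static void main(String[] args) {
        // Hypothetical metadata for a two-column table.
        List<Attribute> attrs = List.of(
            new Attribute("site", "string", "NA"),
            new Attribute("count", "integer", "-999"));
        String[] cells = "plot1,-999".split(",");          // one CSV record
        for (int i = 0; i < cells.length; i++)
            System.out.println(attrs.get(i).name() + " = " + parseCell(cells[i], attrs.get(i)));
    }
}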
Kepler: scientific workflows
Open, collaborative effort of: SEEK, SciDAC/SDM, GEON, Ptolemy Project
Ecology, biodiversity, molecular bio, geology, engineering
Kepler aims to extend the Ptolemy system with:
Domain-specific computational models
Web and grid service access
Data integration support
Semantic reasoning
Kepler actors are written in Java but can wrap other applications (such as MATLAB, GRASS); a minimal actor sketch follows at the end of this slide
Actors can call arbitrary Web (or Grid) Services
Ptolemy already has a very large inventory of actors
Kepler understands EML data*
* EML = Ecological Metadata Language; support is only partially implemented
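The actor sketch referenced above: a pass-through actor written against the Ptolemy II API as I recall it (the class and method names should be close, but treat the exact signatures as an approximation):

// A minimal Kepler/Ptolemy II actor: one input port, one output port.
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.Token;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

public class PassThrough extends TypedAtomicActor {
    public TypedIOPort input;
    public TypedIOPort output;

    public PassThrough(CompositeEntity container, String name)
            throws NameDuplicationException, IllegalActionException {
        super(container, name);
        input = new TypedIOPort(this, "input", true, false);   // receives tokens
        output = new TypedIOPort(this, "output", false, true); // sends tokens
    }

    // The director decides when fire() runs; the actor only moves tokens.
    public void fire() throws IllegalActionException {
        if (input.hasToken(0)) {
            Token t = input.get(0);
            // ... a real actor might hand t to MATLAB, GRASS, or a web service ...
            output.send(0, t);
        }
    }
}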
Kepler: database access
Kepler: web services access
Kepler: grid services access
Kepler: ecological modeling
Models of Computation
How data flows among workflow nodes is typically not explicitly represented
Scientific models have specific data flow requirements
E.g., simulations sometimes use discrete and sometimes continuous time
Ptolemy introduced specific “Directors” that explicitly control data flow: Process Networks, Discrete Event, Continuous Time, Synchronous Data Flow (see the sketch below)
Spatial/Temporal/Taxonomic domains
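The sketch referenced above illustrates why a Synchronous Data Flow director can schedule statically: token rates are declared per firing, so the director can solve a balance equation for integer firing counts before the workflow runs. Plain Java with assumed rates, not Ptolemy code:

// For one arc A -> B under SDF: produce * fires(A) == consume * fires(B).
public class SdfBalance {
    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    public static void main(String[] args) {
        int produce = 2;              // tokens A emits per firing (assumed rate)
        int consume = 3;              // tokens B absorbs per firing (assumed rate)
        int g = gcd(produce, consume);
        int firesA = consume / g;     // smallest integer solution: 3
        int firesB = produce / g;     // smallest integer solution: 2
        System.out.printf("One iteration: fire A %d times and B %d times%n",
                firesA, firesB);      // 2*3 == 3*2 == 6 tokens cross the arc
    }
}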
Workflow languages
Modeling Markup Language (MoML)
Discovery Process Markup Language (DPML)
…
BPEL
WS Invocation Framework (WSIF)
WS Choreography
Distributed Computation
Traditional distributed systems: CORBA, DCOM, RMI
Emerging distributed systems: Web services, the Grid
Existing scheduling systems
Challenge of linking these together in integrated workflows:
Data movement can be limiting, so mobile code is attractive
But moving code among computational nodes is also limiting, and mobile code raises security issues
Implicit models of computation hinder interoperability, both among workflow execution systems and among existing scientific models
Existing scientific codes
Many existing applications in science:
Codes in analytical environments (SAS, Matlab, ArcGIS, R, …)
Custom models and simulations (C, C++, FORTRAN, …)
Network-accessible services (e.g., Web and Grid services)
All use different models of computation
Granularity of implementation is always an issue for use in modular workflows
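One common answer to the granularity problem is to wrap an entire existing code as a single coarse-grained step. A hedged sketch, assuming Rscript is on the PATH and that analysis.R is a hypothetical script taking input and output file names:

// Wrap an existing R analysis as one workflow step by shelling out.
import java.io.IOException;

public class RWrapperStep {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "Rscript", "analysis.R", "input.csv", "output.csv"); // hypothetical script
        pb.inheritIO();                  // show the external code's console output
        int exit = pb.start().waitFor(); // block until the wrapped code finishes
        if (exit != 0)
            throw new IOException("R step failed with exit code " + exit);
    }
}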
Data and Model Integration
Complex workflows utilize a variety of data
E.g., in ecology: species distribution, climate, hydrology, molecular genetics, physiology
Challenges:
Easily bind heterogeneous data to workflows
Locate type-compatible workflow components
Create semantically correct metadata for derived products of workflows
Homogeneous data integration
Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward
Heterogeneous Data integration
Requires advanced metadata and processing:
Attributes must be semantically typed
Collection protocols must be known
Units and measurement scale must be known
Measurement relationships must be known, e.g., that ArealDensity = Count/Area (see the sketch below)
Label data with semantic types
Label inputs and outputs of analytical components with semantic types
Use reasoning engines to generate transformation steps (beware analytical constraints)
Use reasoning engines to discover relevant components
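The sketch referenced above: once measurements carry semantic types, a single transformation rule (standing in here for a reasoning engine) can derive the value a component needs from the values a data set provides. All type and unit names are hypothetical:

// Derive ArealDensity = Count / Area from semantically typed measurements.
public class SemanticMediation {
    record Measurement(String semanticType, double value, String unit) {}

    // One hand-coded rule; a reasoner would discover such steps automatically.
    static Measurement arealDensity(Measurement count, Measurement area) {
        if (!count.semanticType().equals("Count") || !area.semanticType().equals("Area"))
            throw new IllegalArgumentException("incompatible semantic types");
        return new Measurement("ArealDensity", count.value() / area.value(),
                count.unit() + "/" + area.unit());
    }

    public static void main(String[] args) {
        Measurement count = new Measurement("Count", 120, "individuals");
        Measurement area  = new Measurement("Area", 2.5, "m^2");
        Measurement d = arealDensity(count, area);
        System.out.println(d.semanticType() + " = " + d.value() + " " + d.unit()); // 48.0 individuals/m^2
    }
}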
Semantic Mediation
[Figure: semantic mediation connecting Data, Ontology, and Workflow Components]
Discussion sessions
Challenges with making web services work together: compatibility, composition
Workflow language interoperability
Workflow environment interoperability
Distributed computation
Models of computation
Workshop findings
Discussion Points 1
Workflows are not necessary in some contexts: pre-compute intermediate products that can then be accessed by db lookup, especially when it is expensive to compute that product
Workflows are a way of documenting what has been done (provenance)
Can be seen as the scientists' conceptual model of what needs to be done; need for more descriptive information in the process
Highlights a hot topic: combine the conceptual view with the executable workflow
Go from napkin diagram to formal conceptual workflow to executable workflow
Designing the workflow is as important an aspect as executing it, or more so
Need to be able to get more information about the workflow than the WSDL provides
Existing work has been done on getting people involved in the documentation of processes: see Soft Systems Methodology by Peter Checkland
Documentation contributes to reproducibility of results because of the exact record a workflow creates
Annotation of usage history for workflows gives new users an idea of the quality, appropriateness, and reliability of the workflow for their own usage
Useful to be able to print the workflow out in a reference, maybe as part of the methods, or at least cite it
Discussion Points 2
Distributed computing with workflows is a good idea, but the human cost of coordinating the system is still too high to be practical
But we still need to make progress through projects that focus on infrastructure
Process flows could also demonstrate the benefits of infrastructure development to the domain scientists
The last mile of usability is often missed by pure infrastructure efforts; domain investment is needed to make it seamless
Build collaboration into the proposals, but what is the real research reward in that for the domain scientists?
WebServices++: includes “agreement” on how to pass data by reference (e.g., by LSID)
But this also needs to be a long-term solution, which is harder to achieve, yet we can't really wait for the WS-* standards before we try to make progress
Discussion Points 3
Models of computation: there is an important point in them, but it has as much to do with how you separate different scientific problems, i.e., does ecology have different needs than bioinformatics that are implicit in the discipline?
Need much clearer ways of communicating about these models, and the need for different models may never arise
Partly driven by how you scope the domain of usefulness for a tool; for example, if you are handling just web services you will never need a continuous time model
The user probably shouldn't have to select the model of computation, especially for workflows that can only use one model
How should an end user choose a workflow system? We don't really have a good comparison of the various workflow systems out there
Track time to create workflows to get an estimate of effort
Discussion Points 4
Workflow languages: it doesn't matter too much that they don't interoperate, because there are so few workflows
People aren't used to digitizing these methodologies, so it's not considered an issue
Two separate languages: one for designing the actors and one for the workflow
You can describe the workflow without understanding what each component does
Need another language to describe the semantics of individual components (e.g., OWL-S, Web Service Modeling Ontology (WSMO))
Our current efforts focus on describing the semantics of data flow, not processing
The simplest descriptions of components are names, which can be extended over time with better and better approximations of a formal specification
Inputs and outputs alone don't cut it
A mathematical description alone doesn't cut it
Really need concepts that constrain how the statistical approach is used
Mathematically simple models are rare in ecology; complex arbitrary designs are common and extremely difficult to design
Until we learn how to represent models declaratively, we'll never fully understand these complex models
Acknowledgements
This material is based upon work supported by the National Science Foundation under awards 0225676 for SEEK and 0225673 (AWSFL008-DS3) for GEON and by the Department of Energy under Contract No. DE-FC02-01ER25486 for SciDAC/SDM and by DARPA under Contract No. F33615-00-C-1703 for Ptolemy. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON