Advances in Scientific Workflow Environments

Post on 17-Feb-2017


2016-09-04 BioExcel SIG, ECCB, Amsterdam

Advances in Scientific Workflow Environments

Carole Goble, Stian Soiland-Reyes
The University of Manchester

carole.goble@manchester.ac.uk
http://esciencelab.org.uk/

What is a Workflow?

• Orchestrating multiple computational tasks
• Managing the control and data flow between them
• In a world that is homogeneous or heterogeneous
• Tasks
– Local / remote
– Local / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning infrastructure
– Various access controls
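The definition above (tasks plus the control and data flow between them) can be sketched in a few lines of Python. This is only an illustration: the task names `fetch`, `clean` and `summarise` and the helper `run_workflow` are hypothetical, and a real workflow system adds scheduling, remote execution and error handling on top.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives the results of its upstream tasks.
def fetch(_inputs):
    return [3, 1, 2]

def clean(inputs):
    return sorted(inputs["fetch"])

def summarise(inputs):
    data = inputs["clean"]
    return {"n": len(data), "max": data[-1]}

TASKS = {"fetch": fetch, "clean": clean, "summarise": summarise}
# Data flow: task name -> set of tasks it depends on.
DEPENDS_ON = {"fetch": set(), "clean": {"fetch"}, "summarise": {"clean"}}

def run_workflow(tasks, depends_on):
    """Run every task once, in an order that respects the dependencies."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        upstream = {dep: results[dep] for dep in depends_on[name]}
        results[name] = tasks[name](upstream)
    return results

results = run_workflow(TASKS, DEPENDS_ON)
```

The dependency table is the only place where ordering is stated; the tasks themselves stay unaware of each other, which is what lets a workflow system swap a local task for a remote or third-party one.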

BioExcel: Biomolecular recognition

What is a Workflow?

Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns

Scaling – compute cycles
– Make use of computational infrastructure & handle large data

Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities

Provenance – reporting
– Capture, report and utilize log and data lineage auto-documentation
– Traceable evolution, audit, transparency
– Compare

With thanks to Bertram Ludascher: WORKS 2015 Keynote

Findable, Accessible, Interoperable, Reusable (Reproducible)

https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/

Laser Interferometer Gravitational-Wave Observatory – first detection of gravitational waves from colliding black holes

Morphological, hemodynamic and structural analyses linked to aneurysm genesis, growth and rupture.

[Susheel Varma] http://www.vph-share.eu/

http://taverna.org.uk

Galaxy https://usegalaxy.org/

Marine metagenomics

Workflow Driven

+ Bespoke Scripts

[Rob Finn]

Open PHACTS

https://www.knime.org/

BioExcel workflow

https://www.openphacts.org/

Targets

Pharmacological queries: target, compound and pathway data

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115460

Scripts, Ensemble toolkit, execution patterns

http://www.extasy-project.org/

http://www.myexperiment.org

WF Zoo

Workflow Patterns, templates

Data wrangling & analytics

Simulations

Instrument pipelines++

http://tpeterka.github.io/maui-project/

The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd

Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351

Workflow Patterns, templates

• Long running and complex code
• Tunable parameters and input sets
• Simulation sweeps / iterations
• Ensembles, comparisons
• Tricky set-ups, human-in-the-loop interaction
• Computational steering
• In situ workflows – multiple tasks, same box, within fixed time
– data locality
– human-in-the-loop
– capture provenance
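A simulation sweep over tunable parameters, as listed above, might look like the following minimal sketch. The function `simulate` is a hypothetical stand-in for a long-running code, and the parameter names are invented for illustration.

```python
from itertools import product

def simulate(temperature, pressure):
    """Hypothetical stand-in for a long-running simulation code."""
    return temperature * pressure

def sweep(func, grid):
    """Run func once per combination of the tunable parameters in grid."""
    names = list(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # Keep the parameters next to the result so runs can be compared.
        runs.append({"params": params, "result": func(**params)})
    return runs

runs = sweep(simulate, {"temperature": [280, 300], "pressure": [1, 2]})
best = max(runs, key=lambda run: run["result"])
```

Storing each result together with the parameters that produced it is what makes ensemble comparison possible afterwards.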


Traction + Examples
Reuse behaviours

Exploratory vs Production
Different kinds of user / deployment

Developer – User Ratios

Biologist, Developer, Computational Scientist

Embed in Application

Embed in platform

Embed in infrastructure

Existing computational research workflow systems

https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems

“Multi-scale” WFMS

• Workflow Management System
– Its design and reporting environment
– Its execution environment
• The tasks
– tools, codes and services and their execution environments
• Stack layer
– App level, infrastructure level

Component making

Tasks loosely coupled through files
• execute on geographically distributed clusters, clouds, grids across systems
• execute on multiple facilities
• call host services (web / grid services)

DAIC: Distributed Area/Instrument Computing

“Multi-scale” WFMS

Tasks tightly coupled
• exchanging info over memory/storage
• network of supercomputers
• In situ workflows – multiple tasks, same box, within fixed time

HPC

Interoperability, Portability, Granularity, Maintenance

Workflow Environment Ecosystem

Copernicus workflow engine for parallel adaptive molecular dynamics

• Peer-to-peer distributed computing platform
– high-level parallelization of statistical sampling problems
• Consolidation of heterogeneous compute resources
• Automatic resource matching of jobs against compute resources
• Automatic fault tolerance of distributed work
• Workflow execution engine to define a problem (reporting) and trace its results live (provenance)
• Flexible plugin facilities
– programs to be integrated to the workflow execution engine

Free Energy Workflow using GROMACS

http://copernicus-computing.org/

COMPSs/PyCOMPSs: Programmer Productivity framework

• Sequential programming
– Parallelisation and distribution heavy-lifting
– Dependency detection
• Infrastructure unaware
– Abstract application from underlying infrastructure
– Portability
• Standard Programming Languages
– Java, Python, C/C++
• No (or few!) APIs
– Standard Java
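The idea of writing sequential-looking code while a runtime detects dependencies and parallelises can be illustrated with the standard library alone. The sketch below does not use the framework's own API; the `task` decorator here is a hypothetical stand-in built on `concurrent.futures`, and it treats any `Future` passed as an argument as a data dependency.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

_pool = ThreadPoolExecutor(max_workers=4)

def task(func):
    """Turn a plain function into an asynchronous task (illustrative only)."""
    @wraps(func)
    def submit(*args):
        def run():
            # A Future argument marks a dependency on an earlier task:
            # wait for it, then call the function with the real value.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)
    return submit

@task
def square(x):
    return x * x

@task
def add(a, b):
    return a + b

# Reads like sequential code; the two squares may run in parallel,
# and add() starts only once both of its inputs are available.
a = square(3)
b = square(4)
total = add(a, b)
result = total.result()
```

The caller never names a thread, a queue or a host, which is the sense in which the programmer is shielded from the infrastructure.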

Shield the user/programmer

Exposure to the infrastructure

System Design

Resource provisioning

Adaptive/dynamic workflows

Manage/minimize data transfers

Smart parallelism

Code staging

Data staging

Fail-over

Human in the loop

OS/R Guarantees

Service Guarantees

Stop Press!

GUIs not essential!
• Canvas, drag-drop blocks, arrows, run button
• Command-line & embedding in developer or user applications

Scripts can be workflows!
• WMS <-> Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **


Work close to a problem-specific ad-hoc data model

Domain Specific Language "programming-lite" scripts

• wire with declarative "makefile"-like DAG

Plus

• procedural scripting and expressions in languages like Javascript and Python

Nextflow, SnakeMake, Common Workflow Language
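The combination described above, a declarative "makefile"-like DAG plus small procedural expressions, can be sketched as follows. This is not the syntax of any of the named tools; the rule table `RULES`, the `build` helper and the in-memory `store` (standing in for files on disk) are assumptions made for illustration.

```python
# Declarative part: each rule names its output, its inputs and an action.
# Procedural part: the actions are ordinary Python expressions.
RULES = {
    "counts.txt": {
        "inputs": ["reads.txt"],
        "action": lambda reads: str(len(reads.split())),
    },
    "report.txt": {
        "inputs": ["counts.txt"],
        "action": lambda counts: f"{counts} reads",
    },
}

def build(target, rules, store):
    """Produce target in store, first building any missing inputs."""
    if target in store:
        return store[target]
    if target not in rules:
        raise KeyError(f"no rule to make {target!r}")
    rule = rules[target]
    inputs = [build(name, rules, store) for name in rule["inputs"]]
    store[target] = rule["action"](*inputs)
    return store[target]

# Only the final target is requested; the DAG is derived from the rules.
store = {"reads.txt": "ACGT GGCA TTAG"}
report = build("report.txt", RULES, store)
```

The user asks for the result they want and the engine works backwards through the rules, which is the "programming-lite" part of the appeal.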

GUIs are essential: take-up by the user base

Workflowising script software eco-systems
prime example: provenance

ASAP
• common, interoperable provenance recording
– W3C PROV
• YesWorkflow.org
– Annotations in script yield workflow view
• Library profilers
– noWorkflow
• runtime provenance recorders
– Sumatra, RDataTracker
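How annotations in a script can yield a workflow view may be easier to see with a small example. The sketch below reads `@begin`, `@in`, `@out` and `@end` comment tags in the style used by YesWorkflow, but the parser itself is a simplified, hypothetical one: it ignores nesting and every other tag, and the annotated script is invented.

```python
import re

# An annotated script, held as text; it is parsed, never executed.
SCRIPT = '''
# @begin clean_data
# @in raw_reads
# @out clean_reads
clean_reads = [r for r in raw_reads if r]
# @end clean_data

# @begin align
# @in clean_reads
# @out alignment
alignment = sorted(clean_reads)
# @end align
'''

TAG = re.compile(r"#\s*@(begin|in|out|end)\s+(\w+)")

def workflow_view(source):
    """Collect blocks and their declared inputs/outputs from comment tags."""
    blocks, current = {}, None
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks

def edges(blocks):
    """Link a block to every block that consumes one of its outputs."""
    return sorted(
        (producer, consumer, data)
        for producer, p in blocks.items()
        for consumer, c in blocks.items()
        for data in p["out"] if data in c["in"]
    )

view = workflow_view(SCRIPT)
links = edges(view)
```

Because the tags live in comments, the script keeps running unchanged; the workflow view is derived from it rather than replacing it.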

Provenance: the link between computation and results

W3C PROV model standard

• record for reporting
• compare diffs/discrepancies
• provenance analytics
• track changes, adapt
• partial repeat/reproduce
• carry attributions
• compute credits
• compute data quality/trust
• select data to keep/release
• optimisation and debugging
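As a rough illustration of recording provenance and then using it, the sketch below logs relations named after W3C PROV terms (`used`, `wasGeneratedBy`, `wasDerivedFrom`, `wasAssociatedWith`) as plain tuples and walks them back to find the lineage of a result. It is not a PROV serialisation; the helper names, the agent name and the content-hash identifiers are assumptions.

```python
import hashlib

def entity_id(data: bytes) -> str:
    """Identify a data item by the hash of its content."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def run_with_provenance(name, func, data: bytes, agent: str, log: list):
    """Run func on data and append PROV-style relations to log."""
    output = func(data)
    src, dst = entity_id(data), entity_id(output)
    log.extend([
        ("used", name, src),
        ("wasGeneratedBy", dst, name),
        ("wasDerivedFrom", dst, src),
        ("wasAssociatedWith", name, agent),
    ])
    return output

def lineage(entity, log):
    """Walk wasDerivedFrom links back to the original input."""
    parents = {dst: src for rel, dst, src in log if rel == "wasDerivedFrom"}
    chain = [entity]
    # Stop at the source, or if a step returned its input unchanged.
    while chain[-1] in parents and parents[chain[-1]] not in chain:
        chain.append(parents[chain[-1]])
    return chain

log = []
raw = b"acgt\n"
upper = run_with_provenance("to_upper", bytes.upper, raw, "alice", log)
trimmed = run_with_provenance("strip", bytes.strip, upper, "alice", log)
chain = lineage(entity_id(trimmed), log)
```

The same log supports several of the uses listed above: the `wasAssociatedWith` entries carry attribution, and comparing two lineage chains shows where two runs diverged.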

Metadata propagation –where was the physical sample collected, and who should be attributed?

Task-based abstractions: simplifying provenance using motifs and tool annotations

“Free energy calculation” rather than 5 steps including preparation of PDB files and GROMACS execution
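One way to picture this simplification is to collapse consecutive trace steps that share a task-level annotation. In the sketch below the step names and the second annotation are hypothetical; only the "Free energy calculation" label comes from the slide.

```python
from itertools import groupby

# Hypothetical trace: (step name, task-level annotation supplied by the tool).
TRACE = [
    ("download_structure", "Free energy calculation"),
    ("prepare_pdb", "Free energy calculation"),
    ("build_topology", "Free energy calculation"),
    ("run_gromacs", "Free energy calculation"),
    ("collect_energies", "Free energy calculation"),
    ("plot_results", "Reporting"),
]

def summarise(trace):
    """Collapse consecutive steps sharing an annotation into one task."""
    return [
        {"task": label, "steps": [name for name, _ in group]}
        for label, group in groupby(trace, key=lambda step: step[1])
    ]

summary = summarise(TRACE)
task_names = [entry["task"] for entry in summary]
```

The detailed steps are kept inside each entry, so a reader can still drill down from the task-level view when needed.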

Provenance the link workflow variants and workflow reuse and repurpose

W3C PROV model standard?

• record for reporting
• compare diffs/discrepancies
• provenance analytics
• track changes, adapt
• carry attributions
• compute design credits
• versioning, forking, cloning

Nested workflows functions by stealth

Copy and paste fragmentation

Designing for reuse

Find and Go

Software practices

Systematic reuse

Guidelines for persistently identifying software using DataCite
https://epubs.stfc.ac.uk/work/24058274

https://www.force11.org/software-citation-principles

ASAP Wfms for FAIR Science

Automate: workflows, programs and services folks already use or want to use

Scale: Enable computational productivity

Abstract: Enable human productivity

Provenance: Record and use

Provenance

Reproducibility

Portability

Reuse

Usability

Understanding

Validation

Workflow Plugged in Code

Reporting Comparison

Interoperability

Thanks to Bertram Ludascher

Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations

● Task-specific “mini-workflow” fragments
– e.g. using Gromacs, CPMD, HADDOCK
● Packaged
– EGI VM images and Docker containers
● Backed by existing registries
– ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
– private (OpenNebula, OpenStack)
– public (e.g. Amazon AWS)

BioExcel Use cases

● Genomics
● Ensembl Molecular simulations
● Free Energy simulations
● Multiscale modelling of molecular basis for odor and taste
● Biomolecular recognition
● Pharmacological queries
● Virtual Screening

Finding valid pathways through free-energy landscapes: implementation of the “string of swarms” method using Copernicus as a workflow manager, and GROMACS as a compute engine.

Workflow Interoperability

• Common format for bioinformatics tool & workflow execution
• Community based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps
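Scatter/gather on a step means running the same step once per input item and then combining the results. The following is a generic Python sketch of that pattern, not the syntax of the format described above; `gc_fraction` and `scatter_gather` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_fraction(sequence):
    """Per-item step: fraction of G and C bases in one sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)

def scatter_gather(step, items, gather):
    """Apply step to every item in parallel, then combine the results."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(step, items))  # map keeps the input order
    return gather(parts)

sequences = ["GGCC", "ATAT", "GATC"]
mean_gc = scatter_gather(gc_fraction, sequences, lambda xs: sum(xs) / len(xs))
```

Because the step is declared once and applied per item, the engine is free to decide how many copies run at a time on a laptop, a cluster or a cloud.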

• Develop your pipeline on your local computer (optionally with Docker)

• Execute on your research cluster or in the cloud

• Deliver to users via workbenches

• EDAM ontology (ELIXIR-DK) to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file
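Reasoning about file formats in this way amounts to a subsumption check: a step that accepts FASTQ should also accept anything that is a kind of FASTQ. The sketch below uses a small hand-written hierarchy; apart from the "FASTQ Sanger is a type of FASTQ" relation taken from the slide, the labels and parent links are hypothetical and are not copied from EDAM.

```python
# Hypothetical excerpt of a format hierarchy: child label -> parent label.
PARENT = {
    "FASTQ Sanger": "FASTQ",
    "FASTQ Illumina": "FASTQ",
    "FASTQ": "Sequence format",
    "FASTA": "Sequence format",
}

def is_kind_of(fmt, expected, parent=PARENT):
    """True if fmt equals expected or is one of its descendants."""
    seen = set()
    while fmt is not None and fmt not in seen:
        if fmt == expected:
            return True
        seen.add(fmt)  # guards against accidental cycles in the table
        fmt = parent.get(fmt)
    return False

def can_connect(output_format, accepted_format):
    """Can a step producing output_format feed a step accepting accepted_format?"""
    return is_kind_of(output_format, accepted_format)

sanger_ok = can_connect("FASTQ Sanger", "FASTQ")
generic_ok = can_connect("FASTQ", "FASTQ Sanger")
```

The check is deliberately one-directional: a specific encoding can stand in for the general format, but not the other way round.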

Workflow Research Object Bundle

researchobject.org

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003

application/vnd.wf4ever.robundle+zip
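A bundle with this media type is a ZIP archive, so a minimal one can be sketched with Python's `zipfile` module. The layout below (an uncompressed `mimetype` entry first, and a manifest at `.ro/manifest.json` with an `aggregates` list) is an assumption recalled from the Research Object Bundle specification and should be checked against it; the file names are hypothetical.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"

def make_bundle(files):
    """Pack files (name -> text) into an in-memory bundle-style ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # Assumed convention: mimetype comes first and is stored
        # uncompressed so that tools can sniff the archive type.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Assumed manifest shape: a list of the aggregated resources.
        manifest = {"aggregates": [{"uri": "/" + name} for name in files]}
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
        for name, content in files.items():
            bundle.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()

bundle_bytes = make_bundle({"workflow.cwl": "cwlVersion: v1.0\n",
                            "inputs.json": "{}"})

# Read the archive back to inspect what was written.
with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as archive:
    names = archive.namelist()
    declared_type = archive.read("mimetype").decode("ascii")
    listed = json.loads(archive.read(".ro/manifest.json"))["aggregates"]
```

Keeping the workflow, its inputs and a manifest in one archive is what lets the bundle travel as a single citable object.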

Generic Grid middleware

Workflow bus: provide services for 1) Interoperability and integration, 2) composition, 3) provenance, 4) Enactment, 5) Human in the loop computing

Taverna Kepler Triana VLAMG

Sub workflow 1

Sub workflow 2

Sub workflow 3

Scientific experiment: a meta workflow

Sub workflow 4


Z. Zhao et al., “Workflow bus for e-Science”, in IEEE e-Science 2006, Amsterdam

2007

2015

http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/

Adam Hospital (IRB), Anna Montras (IRB), Stian Soiland-Reyes (UNIMAN), Alexandre Bonvin (UU), Adrien Melquiond (UU), Josep Lluís Gelpí (BSC), Daniele Lezzi (BSC), Steven Newhouse (EBI), Jose A. Dianes (EBI), Mark Abraham (KTH), Rossen Apostolov (KTH), Emiliano Ippoliti (Jülich), Adam Carter (UEDIN), Darren J. White (UEDIN)

Slides: Bertram Ludascher, Ewa Deelman, Vasa Curcin, Paolo Missier, Pinar Alper, Susheel Varma, Rob Finn, Michael Crusoe, Rizos Sakellariou

Sign up ASAP!

Bonus Slides
