Science Demonstrator Panel Session 1 on Life Sciences
PanCancer Science Demonstrator - Sergei Yakneen, EMBL
www.eoscpilot.eu
The European Open Science Cloud for Research pilot project is funded by the European Commission, DG Research & Innovation under contract no. 739563.
The Science Challenge
- Collect Next Generation Sequencing Data from several cohorts of cancer patients generated at multiple sequencing centres and across multiple cancer types.
- Reanalyze the data using a uniform and consistent data processing pipeline utilizing established best practices from the International Cancer Genomics Consortium.
- Analyze the integrated data set to identify patterns of germline and somatic mutation that act across cancer types in a PanCancer fashion.
The Science Demonstrator
- Utilize Butler, a cloud-based large-scale scientific workflow framework developed in the context of ICGC’s Pancancer Analysis of Whole Genomes project, to perform a coordinated data analysis across multiple clouds.
  - Code - https://github.com/llevar/butler
  - Paper - https://doi.org/10.1101/185736
- Perform automated repeatable deployments and configuration of the entire processing infrastructure at three academic cloud computing environments.
  - EMBL-EBI Embassy Cloud
  - ComputeCanada West Cloud
  - Cyfronet
- Deliver a large dataset (>50 TB) to each cloud computing centre.
- Use Butler to run PanCancer pipelines and monitor progress.
Successes
|          | EMBL/EBI Embassy | Compute Canada | Cyfronet |
|----------|------------------|----------------|----------|
| vCPU     | 1000 | 1000 | 700 |
| RAM      | 4 TB | 4 TB | 2.6 TB |
| Disk     | 1 PB | 150 TB | 200 TB |
| Data     | 448 samples from 224 prostate cancer donors | 422 samples from 211 pediatric brain tumour donors | 2081 samples from 1000 Genomes Project |
| Raw data | 71 TB | 62 TB | 50 TB |
| Status   | Alignment and variant calling completed | Alignment and variant calling completed | Alignment completed |
- Developed configurations for each cloud - https://github.com/llevar/eosc_pilot
- Developed extensive documentation and examples - https://butler.readthedocs.io/en/latest/
- Developed Butler self-healing capabilities.
- Performed data staging via Cyfronet Onedata.
Issues
- The biggest issue encountered by the SD was the initial shortage of resources for operating at “cloud scale”.
  - Used 20% of the data set that was utilized for PCAWG
  - < 0.5% of the data set for the 100k Genomes Project
- Repeatable provisioning of large clusters of VMs.
  - >10% of provisioning jobs experience failures
- Data movement and staging.
  - A 50 TB data set takes up to two weeks to move locations
  - Genomics data requires encryption and network security measures
  - Shared access to network-accessible storage creates processing bottlenecks
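To put the data movement figure in perspective, the sustained throughput implied by moving 50 TB in two weeks can be worked out directly. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a transfer that runs continuously for the full 14 days; the function name is illustrative only.

```python
def sustained_throughput(terabytes: float, days: float) -> tuple[float, float]:
    """Return the (MB/s, Mbit/s) rate needed to move `terabytes` in `days`.

    Assumes decimal units and a transfer that never pauses.
    """
    total_bytes = terabytes * 1e12
    seconds = days * 24 * 3600
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


mb_s, mbit_s = sustained_throughput(50, 14)
print(f"{mb_s:.1f} MB/s, {mbit_s:.0f} Mbit/s")  # 41.3 MB/s, 331 Mbit/s
```

Under these assumptions the effective rate is roughly a third of a 1 Gbit/s link, which suggests that overheads such as encryption and shared storage, rather than raw link capacity alone, may dominate the transfer time.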
Lessons Learned
- Effectively supporting life sciences use cases like cancer genomics will require A LOT of resources.
- Diverse data-sets have diverse data handling requirements, thus it is better to provide a variety of tools to make solutions with rather than a single “solution”.
- Automated detection and resolution of issues with infrastructure (a la Butler self-healing) are imperative for effective operation at cloud-scale.
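Why automated recovery matters at scale can be illustrated with a small calculation. The sketch below is not Butler's implementation; it assumes that provisioning failures are independent, uses the reported failure rate of about 10% per provisioning job, and picks a hypothetical cluster size of 100 VMs. The helper names are invented for illustration.

```python
def provision_with_retries(provision, max_attempts=3):
    """Call `provision()` until it returns True or attempts run out.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if provision():
            return attempt
    return None


def cluster_success_probability(n_vms, failure_rate, max_attempts=1):
    """Probability that all `n_vms` come up, assuming independent failures
    and up to `max_attempts` tries per VM."""
    per_vm_ok = 1 - failure_rate ** max_attempts
    return per_vm_ok ** n_vms


# Assumed numbers: 100 VMs, 10% failure rate per provisioning job.
no_retry = cluster_success_probability(100, 0.10)
with_retry = cluster_success_probability(100, 0.10, max_attempts=3)
print(f"no retries: {no_retry:.6f}, three attempts: {with_retry:.2f}")
```

With these assumed numbers, a 100-VM cluster almost never comes up cleanly in a single pass, whereas three automatic attempts per VM bring the chance of a complete cluster to about 90%.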
EGA – FAIR Genomic Datasets - Tony Wildish on behalf of Nino Spataro and the EGA-CRG team
The Science Challenge
The principal objectives of our SD are:
i. Test the feasibility of data reproducibility in genomics
ii. Prove the possibility to remaster genomic datasets
iii. Render genomic datasets more FAIR
The Science Demonstrator
How we did it:
◆ Implementing portable containerized genomic pipelines
◆ Using a language enabling scalable and reproducible scientific workflows (Nextflow, available at: https://www.nextflow.io/)
◆ Storing the pipelines in a public repository together with metadata describing each pipeline step and the tools and versions used
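The kind of per-step metadata mentioned above could be captured in a small machine-readable record. The sketch below is only an illustration and not the schema used by the demonstrator: the field names, the pipeline name, the tool names, the versions and the container references are all placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepMetadata:
    name: str        # pipeline step, e.g. "align"
    tool: str        # tool used by the step (placeholder values below)
    version: str     # exact tool version
    container: str   # container image reference (placeholder registry)


@dataclass
class PipelineMetadata:
    pipeline: str
    steps: list[StepMetadata] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists
        return json.dumps(asdict(self), indent=2, sort_keys=True)


meta = PipelineMetadata(
    pipeline="example-variant-calling",
    steps=[
        StepMetadata("align", "example-aligner", "1.0.0",
                     "registry.example.org/example-aligner:1.0.0"),
        StepMetadata("call", "example-caller", "2.3.1",
                     "registry.example.org/example-caller:2.3.1"),
    ],
)
print(meta.to_json())
```

Storing such a record next to the pipeline definition lets a later reader see which tool version each step ran with, without having to inspect the container images.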
Successes
✓ Genomic pipeline portability
Pipelines were successfully implemented and executed in a third-party infrastructure.
✓ Genomic pipeline FAIRification
Pipelines were deposited jointly with metadata describing the variables relevant for pipeline description and re-use.
Pipelines available at: https://dockstore.org/workflows/github.com/CRG-CNAG/EOSC-Pilot
✓ Feasibility of reproducibility and remastering in genomics
Overall, 97.38% of the obtained variants are shared and 99.66% of the called genotypes agreed perfectly.
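Figures of this kind can be computed by comparing the original and the reproduced call sets. The sketch below is an assumption about how such a comparison might look, not the demonstrator's actual method: it treats "shared" as the intersection over the union of variant keys and measures genotype agreement on the shared variants only. The toy data are invented.

```python
def compare_call_sets(original, replicated):
    """Compare two call sets mapping (chrom, pos, ref, alt) -> genotype.

    Returns (shared_fraction, genotype_concordance):
    - shared_fraction: variants present in both sets / variants in either set
    - genotype_concordance: shared variants with identical genotype / shared
    """
    all_variants = original.keys() | replicated.keys()
    shared = original.keys() & replicated.keys()
    if not all_variants:
        raise ValueError("both call sets are empty")
    shared_fraction = len(shared) / len(all_variants)
    matching = sum(1 for v in shared if original[v] == replicated[v])
    concordance = matching / len(shared) if shared else 0.0
    return shared_fraction, concordance


# Invented toy data: 3 of 5 distinct variants are shared, 2 of those 3 agree.
original = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "1/1",
    ("chr2", 50, "G", "A"): "0/1",
    ("chr3", 75, "T", "C"): "0/1",
}
replicated = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "0/1",
    ("chr2", 50, "G", "A"): "0/1",
    ("chr4", 10, "A", "T"): "1/1",
}
shared_fraction, concordance = compare_call_sets(original, replicated)
print(f"shared: {shared_fraction:.2%}, genotype concordance: {concordance:.2%}")
```

Other denominators are possible (for example, counting shared variants relative to the original call set only), so the definition should be stated alongside any reported percentage.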
Issues
✓ Original versions of some software unavailable
Solved by using the closest available version
✓ Size of the selected dataset to replicate
Solved by limiting the replication to a subset of the original data
Time-consuming understanding of original pipelines
The absence of consolidated standards to store and describe the original pipelines slowed down the pipeline implementation process
Lessons Learned
➢ Reproducibility is a time-consuming task on both the implementation and the computational side.
➢ Universal methods to describe pipelines are required, along with long-term repositories to keep the whole experiment reproducible.
➢ A FAIR-compliant semantic repository on which to represent objects and their relationships is missing in the EOSC ecosystem.
➢ Open science is still not perceived as a scientific obligation by scientific stakeholders. Continuous training and education is required to form a new generation of scientists.
CryoEM - Carlos Oscar Sorzano (CSIC)
The Science Challenge
The CryoEM demonstrator aims to improve the reproducibility of image processing by producing a Scipion workflow file that describes every image processing step. This allows full reproduction of the same results when the data is reprocessed outside the microscope facility. The description can also be uploaded to public databases, so that other users can understand the process followed to achieve a given structure.
The Science Demonstrator
• Adapt Scipion (an image processing workflow engine) to report thoroughly, in a JSON file, all the inputs, outputs, and parameters used, so that the same processing can be reproduced.
• Adapt Scipion to be able to reproduce an already existing workflow, producing exactly the same results as in the first run.
• Connect Scipion to a public database (Electron Microscopy Data Bank) to allow users to submit their results automatically.
• Allow other users to visualize the workflow performed by other scientists.
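The idea of recording a workflow so that it can be replayed with identical results can be sketched generically. The code below does not use Scipion or its file format; the record layout, the `run_step` and `replay` helpers and the `invert` stand-in step are all hypothetical, and reproducibility is checked by comparing SHA-256 digests of inputs and outputs.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_step(name, func, params, input_bytes):
    """Run one processing step and return (output, JSON-serializable record)."""
    output = func(input_bytes, **params)
    record = {
        "step": name,
        "params": params,
        "input_sha256": digest(input_bytes),
        "output_sha256": digest(output),
    }
    return output, record


def invert(data: bytes, offset: int = 0) -> bytes:
    """Hypothetical stand-in for a deterministic image processing step."""
    return bytes((255 - b + offset) % 256 for b in data)


STEPS = {"invert": invert}  # registry of known step implementations


def replay(workflow_json: str, input_bytes: bytes) -> bool:
    """Re-run a recorded workflow and check that every record is reproduced."""
    data = input_bytes
    for rec in json.loads(workflow_json):
        data, new_rec = run_step(rec["step"], STEPS[rec["step"]], rec["params"], data)
        if new_rec != rec:
            return False
    return True


raw = bytes(range(16))                       # toy "image"
_, record = run_step("invert", invert, {"offset": 3}, raw)
workflow_json = json.dumps([record], indent=2)
print(replay(workflow_json, raw))            # same input and parameters
```

Exact reproduction in this sense requires every step to be deterministic; steps that rely on random seeds or hardware-dependent arithmetic would need the seed or a tolerance recorded as well.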
Successes
Issues
1. Create a public repository of acquisition metadata and image processing workflows for new acquisitions, as a temporary repository until the data is finally analyzed and deposited in the standard public databases (EMDB and EMPIAR).
2. Create an authentication policy such that biologists leaving an EM facility can continue the image processing on EOSC cloud machines.
Lessons Learned
• There is a big gap between technological advances and their adoption by EU facilities and scientists, much of it due to funding:
  • Local resources for stream processing
  • Existence of temporary repositories
  • Access to high-end computer clusters
• There is a gap between open science promotion and the obligation of facilities to keep and disclose publicly funded data.
Bioimaging - Beatriz Serrano-Solano, Jean-Karim Hériché
The Science Challenge
▸ Biological images contain more information than described in their original publications.
▸ Re-analyzing the images with machine learning algorithms can extract new knowledge from these unexploited resources.
The Science Demonstrator
Successes
Issues
Lessons Learned
▸ EOSC Ecosystem
  ▸ Technical
    ▸ Lack of high-performance file system
    ▸ Lack of big memory machines (1 TB of RAM)
  ▸ Services
    ▸ User-unfriendly deployment and set-up (e.g. ElastiCluster)
    ▸ Inadequate training
It would have been more efficient to use the local HPC.
Photon and Neutron - Michael Schuh, DESY
The Science Challenge
Data
● Volume of hundreds of PBs
● Fast data ingest, tens of GB/s per detector
● File creation at kHz rates
Computing
● Fast resources for immediate online analysis, monitoring running experiments
● Highly specialized offline analysis frameworks used in physics, chemistry, materials science, biology, nanotechnology
Policy
● Data Management Plans
● Sharing of FAIR data, methods, results between users, sites and communities
● Control access during data embargos
● Persistence, long term archival
Images: desy.de/~twhite/crystfel, cid.cfel.de/research/femtosecond_crystallography
The Science Demonstrator
Motivation: Data sets too large to take home
○ Execute codes on cloud resources close to the data, avoid downloading large amounts of data to user systems
Solution: IaaS and PaaS
○ No stack implementation by the user
○ Efficient resource management
○ Prepare federation of DESY OpenStack as EOSC resource
CaaS
○ Libraries for containerized software, tools and functions
○ Run user defined software stacks
○ Container orchestration
FaaS
○ Containers as cloud functions
Service oriented architecture with cloud computing technologies
Successes
Automated data processing
● Data comes in, FaaS automatically triggered
○ Create derived data
○ Extract metadata
Interactive data analysis
● Share and re-use complete workflows
● Jupyter Notebooks as graphical frontend, run anywhere from EOSC to small remote system
● Notebooks and functions published and continuously integrated via GitLab/Docker
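The automated processing pattern above (data arrives, a function is triggered, metadata is extracted and a derived product is written) can be sketched in plain Python. This is not the demonstrator's code and does not target any particular FaaS platform: `handle_new_file`, the sidecar file naming and the sample file are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def handle_new_file(path: Path) -> dict:
    """Body of a function that a FaaS platform might invoke per ingested file.

    Extracts basic metadata and writes it to a sidecar JSON file, which
    stands in here for a derived data product.
    """
    data = path.read_bytes()
    metadata = {
        "name": path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return metadata


# Simulate the arrival of one detector file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    incoming = Path(tmp) / "frame_0001.raw"
    incoming.write_bytes(b"\x00" * 1024)
    meta = handle_new_file(incoming)
    sidecar_exists = (Path(tmp) / "frame_0001.raw.meta.json").exists()

print(meta["size_bytes"], sidecar_exists)
```

Keeping the handler a pure function of one file path makes it easy to package in a container and to run either as a cloud function or locally from a notebook.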
Issues
● Fully integrated template solutions (Magnum/Heat, TOSCA) for scaling COE clusters (Docker Swarm, Kubernetes, Mesos) are still cumbersome.
○ EOSC can do a great job in facilitating this with a good cluster-on-demand service as an open science solution
● Cloud Functions (FaaS) have proven to be a good solution for short-running functions and micro-services. Integration with present HPC and HTC systems is still undefined; request routing based on job profile needs research.
○ Submitting into present HPC clusters
○ Virtualizing HPC clusters in the EOSC on demand
● Many software licenses do not account for new container distribution channels and deployment as cloud functions or as a service.
● An integrated AAI solution is needed, both technically and policy-wise
● Will EOSC provide cloud application building blocks?
○ Container registries
○ Message hubs
○ GitLab
○ JupyterHub
Lessons Learned
● Scaling highly specialized scientific applications takes effort: splitting into micro-services, containerizing, cloud deployments.
○ Strengthen co-development between cloud, infrastructure and platform DevOps, software developers and data analysts.
● User interaction feels different with graphical applications; window forwarding from cloud resources is often low-performing.
○ Clearly define where batch, headless, API-ready and GUI applications are in focus.
● Fully templated virtualized HPC cluster solutions are still to emerge, as are those for native deployments and for container clusters
○ EOSC to provide collaborative templates as know-how as well as cluster-on-demand solutions.
○ EOSC to provide sufficient resources for large-scale deployments suitable for big data.