“CC*Data: National Cyberinfrastructure for Scientific Data Analysis at Scale (SciDAS)” NSF CC* [Award #1659300]
Claris Castillo, Fan Jiang, Wenzhao Zhang, Paul Ruth, Mert Cevik, Michael Stealey, Hong Yi, Gokkul Sudan, Terrell Russell, Jason Coposky, Ray Idaszak (RENCI); Alex Feltus, Melissa Smith, Ben Shealy, William Poehlman, Nicholas Mills (Clemson University); Stephen Ficklin, Tyler Biggs, Josh Burns (WSU); Blaine Lee (NCBI); Anthony Castronova (CUAHSI); Pabitra Dash (Virginia Tech); Jeff Horsburgh, David Tarboton (Utah State University); Mats Rynge (USC)
F. Alex Feltus, Ph.D.
Clemson Dept. of Genetics & Biochemistry (Associate Professor); Allele Systems LLC (CEO); Internet2 Board of Trustees (Member)
[email protected]
OSG All Hands Meeting: 20 March 2018 @ 4pm
Transcript
Soon, many pharmacies, subways, hospitals, research labs, public health facilities, police stations, etc. will have a DNA sequencer generating Exabytes of data in aggregate each week.
qPCR (a 20th-century technology) is about to be replaced by 21st-century high-throughput DNA sequencing.
Distributed petascale/exascale systems will never be turn-key, so our focus is on the “informaticist”.
SciDAS is Building a Scalable Ecosystem that Works for Researchers
Giga-/Tera-/Peta-scale
END USERS
www.iowaturfgrass.org
“If you build it they will come”
“They will help you build it while using it.”
Geneticist
Storage Engineer
Software Engineer
SciDAS embeds active high-scale data users in the cyberinfrastructure engineering stack.
SciDAS Ecosystem: CI, clouds and community platforms
End User Facing
Cloud/infrastructure/compute
Networks
Storage infrastructure
+100 sites +1500 users
CLI
SciDAS Breakdown
Leverage network capabilities to enable efficient data movement
Infrastructure-agnostic abstraction layer for compute
Programmable and policy-able storage-agnostic abstraction layer for data
User interface that is consumable and builds on reproducible artifacts
+100 sites +1500 users
CLI
SciApps: Towards Reproducible Science
• Scientific applications will be available in the form of SciApps “virtual appliances” (CC-ADAMANT, [works15])
• Concept borrowed from the ‘virtual appliance’, i.e., a virtual machine image
• A SciApp is configured with the application software needed to reproduce an experiment with the highest fidelity possible
• A SciApp may consist of multiple containers spanning a virtual network across multiple clouds and CI facilities
• Parameterizable templates will be provided with sane defaults to meet the needs of the scientists
[works15] Enabling Workflow Repeatability with Virtualization Support, Fan Jiang et al., Workshop on Workflows of Large-Scale Science, Supercomputing Conference (SC15), Austin, Texas, 2015.
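The “parameterizable templates with sane defaults” idea can be sketched as a simple merge of user overrides onto a default launch specification. This is a hypothetical illustration; the field names and image name below are invented, not the actual SciDAS schema.

```python
# Hypothetical sketch of a parameterizable SciApp template: user-supplied
# parameters are merged over sane defaults before the appliance is launched.
# All field names and values here are illustrative, not the real SciDAS schema.

DEFAULTS = {
    "image": "scidas/sciapp:latest",     # container image (hypothetical name)
    "cpus": 4,
    "memory_gb": 8,
    "network": "single-site",            # or "multi-cloud" virtual network
}

def render_sciapp(overrides=None):
    """Return a launch spec: sane defaults overlaid with user overrides."""
    spec = dict(DEFAULTS)
    spec.update(overrides or {})
    return spec

# A scientist overrides only what their experiment needs:
spec = render_sciapp({"cpus": 64, "network": "multi-cloud"})
print(spec["cpus"], spec["memory_gb"])  # 64 8
```

The design point is that a scientist never writes a full deployment description; they touch only the parameters their experiment needs.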
Jupyter SciApp
• Jupyter notebooks are a critical tool for enabling reproducible science
• A Jupyter notebook provides a workspace that presents data and code in a cohesive environment
• Mesos was initially developed to target single-owner cluster environments
• RENCI extended Mesos to support geo-distributed environments
• RENCI developed a meta-orchestrator to which independent Mesos clusters subscribe, and which handles meta-scheduling on behalf of frameworks
• Mesos extensions: (2500 + 200 lines of code)
• Meta-orchestrator (600 lines of code)
• Mesos as a resource discovery layer
(Wenzhao Zhang)
[Diagram: a Requester contacts the Geo-Mesos meta-orchestrator, which performs resource discovery across independent Mesos clusters (each running Chronos and Marathon). Step 1: identify resources available among registered Mesos clusters.]
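The resource-discovery role of the meta-orchestrator can be illustrated as follows. This is a minimal sketch, not the actual RENCI extensions: the class and attribute names are invented, and real clusters would advertise resources through Mesos offers rather than fixed fields.

```python
# Minimal sketch (hypothetical names, not the RENCI code) of a
# meta-orchestrator's resource-discovery step: independent Mesos clusters
# subscribe, and the orchestrator finds clusters that can satisfy a request.

class MesosCluster:
    """Stand-in for a registered Mesos cluster reporting free resources."""
    def __init__(self, name, free_cpus, free_mem_gb):
        self.name = name
        self.free_cpus = free_cpus
        self.free_mem_gb = free_mem_gb

class MetaOrchestrator:
    def __init__(self):
        self.clusters = []

    def subscribe(self, cluster):
        """Independent clusters subscribe to the meta-orchestrator."""
        self.clusters.append(cluster)

    def discover(self, cpus, mem_gb):
        """Step 1: identify registered clusters that can host the request."""
        return [c for c in self.clusters
                if c.free_cpus >= cpus and c.free_mem_gb >= mem_gb]

orch = MetaOrchestrator()
orch.subscribe(MesosCluster("cluster-1", free_cpus=16, free_mem_gb=64))
orch.subscribe(MesosCluster("cluster-2", free_cpus=128, free_mem_gb=512))
candidates = orch.discover(cpus=32, mem_gb=128)
print([c.name for c in candidates])  # ['cluster-2']
```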
iRODS: Integrated Rule Oriented Data System
• Virtualization: system metadata encodes rich information
• Rule engine programmed with rules to enact policies
• Data federation
• iRODS provides a unified namespace over SciDAS storage infrastructure across Clemson, RENCI, and WSU
• iRODS enables the policy-driven management critical to data-sharing collaborations in SciDAS
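The rule-engine idea can be illustrated abstractly: rules are attached to data-grid events and fire automatically when those events occur. This toy Python sketch is not the iRODS rule language; only the event name `acPostProcForPut` (iRODS's post-ingest policy enforcement point) is real, and the resource names are invented.

```python
# Toy illustration (not actual iRODS rule language) of the policy idea:
# a rule engine fires registered rules on data-grid events, e.g. enforcing
# replication when an object is ingested. Resource names are hypothetical.

class RuleEngine:
    def __init__(self):
        self.rules = {}          # event name -> list of rule functions

    def on(self, event, rule):
        self.rules.setdefault(event, []).append(rule)

    def fire(self, event, obj):
        for rule in self.rules.get(event, []):
            rule(obj)

def replicate_to_second_site(obj):
    """Policy: every ingested object gets a replica at a second site."""
    obj["replicas"].append("wsu-resource")

engine = RuleEngine()
# acPostProcForPut is a real iRODS policy enforcement point (fires after put).
engine.on("acPostProcForPut", replicate_to_second_site)

data_object = {"path": "/scidasZone/home/data.fastq",
               "replicas": ["clemson-resource"]}
engine.fire("acPostProcForPut", data_object)
print(data_object["replicas"])  # ['clemson-resource', 'wsu-resource']
```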
WAN
SciDAS: Wide-area iRODS deployment
SciDAS provided a very significant first for the iRODS community
[Diagram: a traditional iRODS deployment has one iCAT and its resource nodes within a single iRODS Zone; the desired SciDAS deployment spans sites, with iRODS federation as a workaround. An iRODS Zone is an administrative unit with full localized governance.]
iRODS over the Wide Area Network (WAN)
• The iRODS team connected iRODS to a MariaDB Galera Cluster to provide a multi-master, distributed iRODS catalog.
[Diagram: SciDAS Zone backed by a MariaDB Galera cluster]
“Distributing the iRODS Catalog: a way forward”, M. Stealey et al., iRODS User Group Meeting (UGM), Netherlands, 2017.
(Michael Stealey, Jason Coposky, Terrell Russell)
SciDAS Breakdown
Leverage network capabilities to enable efficient data movement
Infrastructure-agnostic abstraction layer for compute
Programmable and policy-able storage-agnostic abstraction layer for data
User interface that is consumable and builds on reproducible artifacts
+100 sites +1500 users
CLI
Network-aware Data and Compute Management
• Layer-2 connectivity between NCBI and the iRODS data grid via stitch-ports through dynamic networks provided by ExoGENI (CC-ADAMANT)
• PerfSONAR network deeply integrated with middleware management
• Deployment of a PerfSONAR network across compute and storage nodes to drive intelligent placement of compute and data
• Deploying PerfSONAR infrastructure across commercial clouds comes with financial challenges
• Development of a Network-Optimizer algorithm (as-a-service): bridges PerfSONAR, compute, and storage networks; identifies optimal placement of compute or data based on network monitoring information
• Development of an iRODS shim service: maps the logical path of a data object to the hostname of the iRODS resource node hosting it
• Procurement of FIONA nodes for NRP integration: Clemson done, WSU done, RENCI in progress
SciDAS Data and Network Infrastructure: Network-Aware Compute Placement
[Diagram: storage/FIONA nodes (1 PB, 2 PB, 1 PB) connected by the network; a cost-aware optimizer calls the iRODS shim (aaS) and PerfSONAR shim (aaS) APIs.]
1. Find the host with the best network connectivity for transferring a data object among a set of candidate hosts. Params: (_logicalpath_, {_hostA_, _hostB_, …})
2. Get the _hostname_ of the iRODS resource node hosting _logicalpath_.
3. Get the _networkconnectivity_ (bandwidth) between two hosts (_hostA_, _hostB_).
The PerfSONAR shim service maintains a mapping of _hostname_ to PerfSONAR node.
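The three-step placement flow can be sketched end to end. This is a minimal illustration under stated assumptions: the hostnames, paths, and bandwidth figures are invented, and the two dictionaries stand in for calls to the iRODS shim and PerfSONAR shim APIs.

```python
# Minimal sketch (hypothetical names and toy data) of the placement flow:
# resolve a data object's host via an iRODS-shim lookup, then use
# PerfSONAR-style bandwidth measurements to pick the best candidate host.

# iRODS shim stand-in: logical path -> hostname of the hosting resource node.
IRODS_SHIM = {"/scidasZone/home/rnaseq/sample1.fastq": "storage.clemson.example"}

# PerfSONAR shim stand-in: measured bandwidth (Gbps) between host pairs.
BANDWIDTH = {
    ("storage.clemson.example", "compute.renci.example"): 9.2,
    ("storage.clemson.example", "compute.wsu.example"): 3.1,
}

def network_optimizer(logical_path, candidate_hosts):
    """Steps 1-3: find the candidate with the best connectivity to the data."""
    source = IRODS_SHIM[logical_path]                      # step 2
    def bw(host):                                          # step 3
        return BANDWIDTH.get((source, host), 0.0)
    return max(candidate_hosts, key=bw)                    # step 1

best = network_optimizer("/scidasZone/home/rnaseq/sample1.fastq",
                         ["compute.renci.example", "compute.wsu.example"])
print(best)  # compute.renci.example
```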
SciDAS Current Directions
• Need more democratized compute for scale-up
• We are working with OSG to flock out of SciDAS onto OSG; experimenting with XSEDE allocations; FUTURE: SLATE? NRP-ML? More commercial cloud
• Building Gene Co-expression Networks (GCNs) with the HTCondor (OSG-KINC) SciApp: yeast unit test constructed many times; lung tumor/normal GCN almost complete; 100 TB (raw) Arabidopsis experiment underway; more species in the queue
• Building additional SciApps: SLURM/Nextflow Dockerized “Tuxedo Suite” genomics applications; FUTURE: distributed visualization and deep learning apps we have developed
• iRODS data grid production: build iRODS data retention policies (qualitative, then quantitative); leveraging 1000 species’ indexed genomes with metadata; 100 TB Arabidopsis raw-data input experiment; data movement optimization (NCBI, WSU, RENCI, Clemson, StashCache?)
• Solve authentication issues: CILogon
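The core computation behind GCN construction can be illustrated with toy data: correlate every pair of gene expression profiles and keep the pairs above a similarity threshold. This is a pure-Python sketch of the general technique, not KINC itself, and the gene names and values are invented.

```python
# Illustrative sketch (toy data, not KINC) of the core idea behind gene
# co-expression network (GCN) construction: compute pairwise Pearson
# correlation of expression profiles and keep edges above a threshold.
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def build_gcn(expression, threshold=0.9):
    """Return edges (gene pairs) whose |correlation| meets the threshold."""
    return [(g1, g2) for g1, g2 in combinations(expression, 2)
            if abs(pearson(expression[g1], expression[g2])) >= threshold]

# Toy expression matrix: gene -> expression level across 4 samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.0, 8.2],   # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0],   # uncorrelated
}
print(build_gcn(expr))  # [('geneA', 'geneB')]
```

At tera-/peta-scale the all-pairs computation is what HTCondor distributes: each job handles a block of gene pairs.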
Bio Application
Biology community needs addressed by SciDAS
• Containerized Scientific Workflows: Gene Expression Matrix (GEM) construction, Gene Co-expression Network (GCN) construction, and Gene Network Visualization (GNV). These have broad applicability.
• Automated Resource Wrangling: these workflows need to map onto heterogeneous, user-authorized cloud resources to run at tera-/peta-scale, which will be the normal range for many researchers. We are stress-testing using public RNA-seq data from hundreds of organisms.
• Computational Experimental Design and User Intervention: a UI is needed to predict the wall time of high-scale experiments, monitor jobs, and allow user control of job resets and prioritization.
• Collaborative Data Organization: WAN collaborative storage with standardized data retention policies (iRODS).
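The three-stage workflow named above (GEM, then GCN, then GNV) can be sketched as a chain where each stage consumes the previous stage's artifact. The stage internals here are deliberately trivial stand-ins (the real stages are containerized applications); gene names, values, and the distance rule are invented for illustration.

```python
# Hypothetical sketch of chaining the three workflow stages (GEM -> GCN -> GNV);
# stage bodies are toy stand-ins, the point is the artifact hand-off.

def build_gem(raw_rnaseq_runs):
    """GEM: quantify each run into a gene-by-sample expression matrix."""
    genes = sorted({g for run in raw_rnaseq_runs for g in run})
    return {g: [run.get(g, 0.0) for run in raw_rnaseq_runs] for g in genes}

def build_gcn(gem, threshold=5.0):
    """GCN: connect genes with similar profiles (toy distance rule)."""
    names = list(gem)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if sum(abs(x - y) for x, y in zip(gem[a], gem[b])) < threshold]

def render_gnv(gcn):
    """GNV: emit a trivial edge-list description for visualization."""
    return "\n".join(f"{a} -- {b}" for a, b in gcn)

# Two toy RNA-seq runs: gene -> expression level.
runs = [{"geneA": 1.0, "geneB": 1.2}, {"geneA": 4.0, "geneB": 4.1}]
print(render_gnv(build_gcn(build_gem(runs))))  # geneA -- geneB
```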
Petascale Bio Driver: What are the ancestral Gene Interaction Networks for all Organisms?