Data-Centric I/O and Next Generation Interfaces
Julian M. Kunkel and partners from the NGI forum
HPC-IODC Workshop, 2019-06-20
Department of Computer Science, University of Reading
Copyright University of Reading
https://hps.vi4io.org
Limitless Storage, Limitless Possibilities
LIMITLESS POTENTIAL | LIMITLESS OPPORTUNITIES | LIMITLESS IMPACT
Workflows The NGI Strategy Summary
Workflows
• Consider the workflow from zero to insight
  - Needs/produces data
  - Uses tasks
    * Parallel apps?
    * Big data tools?
    * Manual analysis
  - May need months to complete
  - Manual tasks are unpredictable
  - What are users interested in?
• Not well described in HPC
  - Mostly hardcoded in scripts
• Can we exploit workflows?
  - Does it matter where data is?
  - Vendor simulations?
  - Enforce information lifecycle management (ILM) as needed by users
[Figure: example workflow graph with the nodes Data 1, Data 2, Task 1, Task 2, Task 3, Product 1, Product 2, Product 3, a manual QC check (edge labelled "OK"), and manual usage.]
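A minimal sketch of how such a workflow could be described as a dependency graph instead of being hardcoded in scripts. The node names come from the figure, but the edges between them are assumptions, since the figure's arrows are not recoverable here. Python's standard `graphlib` module is used to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set.
# ASSUMPTION: these edges are illustrative, not taken from the slide.
workflow = {
    "Task 1": {"Data 1"},
    "Product 1": {"Task 1"},
    "Task 2": {"Data 2", "Product 1"},
    "Product 2": {"Task 2"},
    "Manual QC check": {"Product 2"},
    "Task 3": {"Manual QC check"},
    "Product 3": {"Task 3"},
    "Manual usage": {"Product 3"},
}


def execution_order(graph):
    """Return one valid order in which data, tasks and products become available."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
# Manual steps are the ones whose duration the system cannot predict.
manual_steps = [step for step in order if step.startswith("Manual")]
print(order)
print("Steps with unpredictable duration:", manual_steps)
```

Once the graph is explicit, a system can see which products feed a manual step and could, for example, keep them on fast storage until that step is done.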
Planning HPC Resources
Planning for CERN/LHC and other big experiments
• Activities are planned in detail
• Experiments are proposed with plans (time, resource utilization)
Planning for Data Centers
• May include: time needed, CPU (GPU) hours, storage space
• After resources are granted, scientists do what they want
  - Some limitations, e.g., quota, compute limit
  - But access patterns?
  - The system is not aware of what could possibly happen
  - The data center does not know sufficiently well what users do
• Additionally: execution often relies on tools built on 40-year-old concepts
Planning HPC Resources: An Alternative Universe
• Scientists deliver
  - A detailed but abstract workflow orchestration
  - Containers with all software
  - A data management plan with data lifecycle
  - Time constraints and budget
• Data centers and vendors
  - Simulate the execution before the workflow is executed
  - Estimate costs and energy consumption
  - Determine if it is the best option to run
• Systems
  - Utilize the information to orchestrate I/O
  - Make decisions about data location and placement:
    * Trade compute vs. storage and energy/costs vs. runtime
  - Ensure proper execution
• Provocative: Big data is ahead on such an agenda!
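To illustrate the "trade compute vs. storage" decision, here is a small sketch that compares the yearly cost of keeping a product on storage with the cost of recomputing it on every access. The cost model, the prices, and the product figures are all made-up assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    size_tb: float            # size of the stored product
    recompute_core_h: float   # core hours needed to regenerate it
    reads_per_year: int       # expected number of accesses per year


# ASSUMPTION: illustrative prices, not real data-center figures.
STORAGE_COST_PER_TB_YEAR = 50.0
COMPUTE_COST_PER_CORE_H = 0.02


def yearly_cost_store(p):
    """Cost of keeping the product on storage for one year."""
    return p.size_tb * STORAGE_COST_PER_TB_YEAR


def yearly_cost_recompute(p):
    """Cost of regenerating the product each time it is read."""
    return p.reads_per_year * p.recompute_core_h * COMPUTE_COST_PER_CORE_H


def plan(p):
    """Pick the cheaper option for one product."""
    return "store" if yearly_cost_store(p) <= yearly_cost_recompute(p) else "recompute"


raw = Product("raw_output", size_tb=100.0, recompute_core_h=50_000, reads_per_year=1)
summary = Product("summary", size_tb=0.1, recompute_core_h=50_000, reads_per_year=20)
for p in (raw, summary):
    print(p.name, "->", plan(p))
```

A real planner would also weigh energy and runtime constraints; the point is only that the decision becomes computable once the workflow and its access expectations are declared up front.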
An Alternative Universe: Separation of Concerns
Decisions made by users
• Defining relevant metadata
• Declaring workflows
  - Covering data ingestion, processing, product generation and analysis
  - Data life cycle (and archive/exchange file format)
  - Constraints on: accessibility (permissions), ...
  - Expectations: completion time (interactive feedback human/system)
• Monitoring and, if needed, modifying workflows on the fly
• Analyzing data interactively, e.g., Visual Analytics
• Declaring value of data (logfile, data-product, observation)
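The user-side declarations listed above could look roughly like the following sketch. The class names, fields, and the deletion policy are hypothetical; they only show how a declared value of data and a lifecycle could be made machine-readable so that a system can act on them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class DataValue(Enum):
    LOGFILE = "logfile"
    DATA_PRODUCT = "data-product"
    OBSERVATION = "observation"


@dataclass
class DatasetDeclaration:
    name: str
    value: DataValue
    metadata: dict = field(default_factory=dict)   # user-defined metadata
    readers: tuple = ()                            # accessibility constraint
    keep_days: Optional[int] = None                # None means keep indefinitely


def may_delete(decl, age_days):
    """ASSUMED policy: observations are never deleted, others expire after keep_days."""
    if decl.value is DataValue.OBSERVATION:
        return False
    return decl.keep_days is not None and age_days > decl.keep_days


log = DatasetDeclaration("run42.log", DataValue.LOGFILE, keep_days=30)
obs = DatasetDeclaration(
    "station_7", DataValue.OBSERVATION,
    {"instrument": "thermometer"}, ("team_a",), keep_days=30,
)
print(may_delete(log, 45), may_delete(log, 10), may_delete(obs, 3650))
```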
Separation of Concerns
Programmers of models/tools
• Decide on the most appropriate API to use (e.g., NetCDF + X)
• Register compute snippets (analytics) with the API
• Do not care where and how to compute/store
Decisions made by the (compute/storage) system
• Where and how to store data, including file format
• Complete management of available storage space
• Which data transformations to perform, replication factors, storage to use
• Including scheduling of compute/storage/analysis jobs (using, e.g., ML)
• Where to run certain data-driven computations (fluid computing)
  - Client, server, in-network, cloud, your connected laptop
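A toy sketch of such a placement decision: for each candidate location, estimate the time to move the data there plus the time to compute, and pick the minimum. The locations, bandwidths, and throughputs are invented numbers, and the linear cost model is an assumption; a real scheduler would use measured or learned performance models.

```python
from dataclasses import dataclass


@dataclass
class Location:
    name: str
    bandwidth_gb_s: float    # bandwidth from where the data currently resides
    compute_gflop_s: float   # sustained compute throughput


def estimated_seconds(loc, data_gb, work_gflop):
    """Transfer time plus compute time under a simple linear model."""
    return data_gb / loc.bandwidth_gb_s + work_gflop / loc.compute_gflop_s


def choose(locations, data_gb, work_gflop):
    """Return the location with the lowest estimated time."""
    return min(locations, key=lambda loc: estimated_seconds(loc, data_gb, work_gflop))


# ASSUMPTION: illustrative figures; the server holds the data, so no transfer is needed.
locations = [
    Location("storage server", float("inf"), 100.0),
    Location("compute client", 10.0, 1000.0),
    Location("laptop", 0.1, 50.0),
]

data_heavy = choose(locations, data_gb=1000.0, work_gflop=1000.0)
compute_heavy = choose(locations, data_gb=1.0, work_gflop=100_000.0)
print(data_heavy.name, "|", compute_heavy.name)
```

With these numbers the data-heavy job stays on the storage server, while the compute-heavy job moves to the client.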
Outline
1 Workflows
2 The NGI Strategy
3 Summary
The Next Generation Interfaces Initiative
Goals
• The standardization of a high-level data model & interface
  - Targeting data-intensive and HPC workloads
  - Lifting semantic access to a new level
  - To have a future: must be beneficial for Big Data + Desktop, too
• Development of a reference implementation of a smart runtime system
  - Implementing key features
• Demonstration of benefits on socially relevant data-intensive apps
Next Generation Interfaces
Towards a new data centric compute/IO stack considering:
• Smart hardware and software components
• Storage and compute are covered together
• User metadata and workflows as first-class citizens
• Self-aware instead of unconscious
• Improving over time (self-learning, hardware upgrades)
Why do we need a new domain/funding independent API?
• Many domains have similar issues; projects are competitive
• It is a hard problem tackled by countless approaches
• Harness RD&E effort across domains
Development of the Data Model and API
• Establishing a forum similar to the MPI Forum
• Define a data model for HPC
• Open board: encourage community collaboration
[Figure: organization of the Standard-Forum. Labels: Standard 1.0 (Data Model, Interface, Reference Implementation); Bodies (Steering Board, Committee, Workgroup); Members (Industry, Data centers, Scientists); Use-cases (Mini-apps, Workflows, Pseudo code); Next Generation Standard 2.0.]
One Year of NGI in Retrospect
• One year ago, we started NGI as an informal collaboration
• Research: countless storage system prototypes every year
• Projects: EXAHDF, Maestro (FET Proactive)
Challenges Potential Interfaces
Outline
4 Challenges
5 Potential Interfaces
A Pragmatic View
• Take an existing data model like VTK (or NetCDF) as baseline
• With a hint of:
  - Scientific metadata handling
  - Workflow and processing interface
  - Information lifecycle management
  - Hardware model interface (hardware provides its own performance models)
• First prototype utilizes existing software stack
  - Like Cylc for workflows
  - Like MongoDB for metadata
  - Like a parallel file system (or object storage)
• Work on:
  - Scheduler for performant mapping of data/compute to storage/compute
  - A FUSE client for flexible data mappings on semantic metadata
  - Importer/Exporter tools for standard file formats
• Add magic (knowledge of experts developing APIs)
• Next prototype: move on with true implementation
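The idea of "flexible data mappings on semantic metadata" can be sketched without FUSE or a database: the same stored objects are exposed under different directory layouts derived purely from their metadata. Here an in-memory list stands in for the metadata store (for instance MongoDB), and all record fields, object IDs, and function names are hypothetical.

```python
# ASSUMPTION: an in-memory list replaces the metadata database for this sketch.
catalog = [
    {"model": "modelA", "variable": "temperature", "year": 2018, "object_id": "obj-001"},
    {"model": "modelA", "variable": "pressure", "year": 2018, "object_id": "obj-002"},
    {"model": "modelB", "variable": "temperature", "year": 2019, "object_id": "obj-003"},
]


def find(records, **query):
    """Return all records whose metadata matches every key/value in the query."""
    return [r for r in records if all(r.get(k) == v for k, v in query.items())]


def virtual_path(record, layout):
    """Build a path for a record from the metadata keys named in the layout."""
    return "/" + "/".join(str(record[key]) for key in layout)


def view(records, layout):
    """Map each virtual path to the object that actually holds the data."""
    return {virtual_path(r, layout): r["object_id"] for r in records}


by_model = view(catalog, ("model", "year", "variable"))
by_variable = view(find(catalog, variable="temperature"), ("variable", "year", "model"))
print(by_model)
print(by_variable)
```

A FUSE client could serve such views as directory trees, so users choose the hierarchy that suits their analysis while the objects stay where the system placed them.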
Next-Generation HPC I/O API: Key Features
• High-level data model for HPC
  - Storage understands data structures vs. byte array
  - Relaxed consistency
• Semantic namespace and storage-aware data formats
  - Organize based on domain-specific metadata (instead of file system)
  - Support domain-specific operations and addressing schemes
• Integrated processing capabilities
  - Offload data-intensive compute to storage system
  - Managed data-driven workflows supporting events and services
  - Scheduler maps compute and I/O to hardware
• Enhanced data management features
  - Information life-cycle management (and value of data)
  - Embedded performance analysis
  - Resilience, import/export, ...
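To make "storage understands data structures vs. byte array" concrete, the sketch below wraps a raw byte buffer with knowledge of its shape. Because the storage side knows the layout, it can serve a single row or return a reduction instead of shipping all bytes to the client. The class and its methods are hypothetical and run in one process; they stand in for what would be a storage-side service.

```python
import struct


class StructuredVariable:
    """A 2D array of float64 values whose shape is known to the storage layer."""

    ITEM_SIZE = 8  # bytes per float64

    def __init__(self, rows, cols, values):
        if len(values) != rows * cols:
            raise ValueError("number of values does not match shape")
        self.rows, self.cols = rows, cols
        # This buffer is all that a plain byte-array store would know about.
        self._raw = struct.pack(f"<{rows * cols}d", *values)

    def read_row(self, r):
        """Address data by row index instead of by byte offset."""
        start = r * self.cols * self.ITEM_SIZE
        end = start + self.cols * self.ITEM_SIZE
        return list(struct.unpack(f"<{self.cols}d", self._raw[start:end]))

    def row_means(self):
        """Offloaded reduction: only one value per row leaves the storage side."""
        return [sum(self.read_row(r)) / self.cols for r in range(self.rows)]


var = StructuredVariable(2, 3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(var.read_row(1))
print(var.row_means())
```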