Data-Centric I/O and Next Generation Interfaces

Limitless Storage, Limitless Possibilities

https://hps.vi4io.org

Department of Computer Science

Copyright University of Reading

2019-06-20

LIMITLESS POTENTIAL | LIMITLESS OPPORTUNITIES | LIMITLESS IMPACT

Julian M. Kunkel and partners from the NGI forum

HPC-IODC Workshop


Outline

1 Workflows

2 The NGI Strategy

3 Summary


Workflows

• Consider a workflow from zero to insight
  - Needs/produces data
  - Uses tasks
    - Parallel apps?
    - Big data tools?
    - Manual analysis
  - May need a month to complete
  - Manual tasks are unpredictable
  - What are users interested in?
• Not well described in HPC
  - Mostly hardcoded in scripts
• Can we exploit workflows?
  - Does it matter where data is?
  - Simulations by vendors?
  - Enforce ILM (information lifecycle management) as needed by users

[Figure: example workflow graph with the nodes Data 1, Data 2, Task 1, Task 2, Task 3, Product 1, Product 2, Product 3, a manual QC check (outcome: OK) and manual usage]
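
A workflow like the one on this slide can be captured as an explicit dependency graph instead of being hardcoded in scripts. The following is a minimal sketch in Python; the slide only names the nodes, so the edges in the hypothetical `workflow` dictionary are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each key depends on the nodes in its set.
# The node names come from the slide; the edges are assumed.
workflow = {
    "Task 1": {"Data 1"},
    "Product 1": {"Task 1"},
    "Task 2": {"Data 2", "Product 1"},
    "Product 2": {"Task 2"},
    "Manual QC check": {"Product 2"},
    "Task 3": {"Manual QC check"},
    "Product 3": {"Task 3"},
    "Manual usage": {"Product 3"},
}


def execution_order(graph):
    """Return one valid order in which the nodes can be processed."""
    return list(TopologicalSorter(graph).static_order())


def manual_steps(graph):
    """Manual steps are the unpredictable parts of the workflow."""
    return sorted(node for node in graph if node.startswith("Manual"))


order = execution_order(workflow)
```

Once the graph is explicit, a system can see which steps depend on human interaction (`manual_steps`) and which data each task needs before it runs.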


Planning HPC Resources

Planning for CERN/LHC and other big experiments

• A detailed planning of activities is performed
• Experiments are proposed with plans (time, resource utilization)

Planning for Data Centers

• May include: time needed, CPU (GPU) hours, storage space
• After resources are granted, scientists do what they want
  - Some limitations, e.g., quota, compute limit
  - But access patterns?
  - The system is not aware of what could possibly happen
  - The data center does not sufficiently know what users do
• Additionally: execution often uses tools with 40-year-old concepts


Planning HPC Resources: An Alternative Universe

• Scientists deliver
  - a detailed but abstract workflow orchestration
  - containers with all software
  - a data management plan with data lifecycle
  - time constraints and budget
• Data centers and vendors
  - Simulate the execution before the workflow is executed
  - Estimate costs, energy consumption
  - Determine if it is the best option to run
• Systems
  - Utilize the information to orchestrate I/O
  - Make decisions about data location and placement:
    - Trade compute vs. storage and energy/costs vs. runtime
  - Ensure proper execution
• Provocative: Big data is ahead in such an agenda!
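
The trade-off between compute, storage, energy/costs and runtime could be evaluated on simulated estimates before anything runs. The sketch below assumes a hypothetical `PlanEstimate` record per candidate configuration and picks the cheapest plan that meets the scientist's time constraint and budget; all names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanEstimate:
    """Simulated outcome of running the workflow in one configuration."""
    name: str
    runtime_hours: float
    energy_kwh: float
    cost: float  # arbitrary currency units


def choose_plan(estimates, deadline_hours, budget) -> Optional[PlanEstimate]:
    """Pick the cheapest plan that respects the time and budget limits."""
    feasible = [e for e in estimates
                if e.runtime_hours <= deadline_hours and e.cost <= budget]
    if not feasible:
        return None  # constraints cannot be met with the known options
    # Ties on cost are broken by lower energy consumption.
    return min(feasible, key=lambda e: (e.cost, e.energy_kwh))


# Made-up estimates, purely illustrative.
candidates = [
    PlanEstimate("recompute intermediate data", 30.0, 900.0, 120.0),
    PlanEstimate("store intermediates on SSD", 12.0, 500.0, 200.0),
    PlanEstimate("store intermediates on tape", 48.0, 350.0, 80.0),
]

best = choose_plan(candidates, deadline_hours=36.0, budget=250.0)
```

With a tighter deadline the same call would switch to the SSD option, which is the kind of decision the slide assigns to the system rather than to the user.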


An Alternative Universe: Separation of Concerns

Decisions made by users

• Defining relevant metadata
• Declaring workflows
  - Covering data ingestion, processing, product generation and analysis
  - Data life cycle (and archive/exchange file format)
  - Constraints on: accessibility (permissions), ...
  - Expectations: completion time (interactive feedback human/system)
• Monitoring and, if needed, modifying workflows on the fly
• Analyzing data interactively, e.g., Visual Analytics
• Declaring value of data (logfile, data-product, observation)
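
What such a user-side declaration might look like is sketched below. This is not an NGI interface; `Dataset`, `WorkflowDeclaration`, the field names and the default values are hypothetical and only mirror the decisions listed on the slide (metadata, life cycle, constraints, expectations, value of data).

```python
from dataclasses import dataclass, field
from enum import Enum


class DataValue(Enum):
    """User-declared value of data; the names follow the slide's examples."""
    LOGFILE = 1
    DATA_PRODUCT = 2
    OBSERVATION = 3


@dataclass
class Dataset:
    name: str
    value: DataValue
    metadata: dict = field(default_factory=dict)  # user-defined metadata
    keep_days: int = 30                           # hypothetical life-cycle rule
    archive_format: str = "NetCDF"                # archive/exchange file format
    readers: set = field(default_factory=set)     # accessibility constraint


@dataclass
class WorkflowDeclaration:
    stages: list                      # e.g. ingestion, processing, analysis
    datasets: list
    expected_completion_hours: float  # an expectation, not a guarantee

    def expendable(self):
        """Datasets a system might drop first when space is scarce."""
        return [d.name for d in self.datasets if d.value is DataValue.LOGFILE]


declaration = WorkflowDeclaration(
    stages=["ingestion", "processing", "product generation", "analysis"],
    datasets=[
        Dataset("run.log", DataValue.LOGFILE, keep_days=7),
        Dataset("station-data", DataValue.OBSERVATION, {"region": "EU"}),
    ],
    expected_completion_hours=24.0,
)
```

The point of the design is that the user states intent and value, while the decisions on where and how data is stored stay with the system.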


Separation of Concerns

Programmers of models/tools

• Decide about the most appropriate API to use (e.g., NetCDF + X)
• Register compute snippets (analytics) with the API
• Do not care where and how to compute/store

Decisions made by the (compute/storage) system

• Where and how to store data, including file format
• Complete management of available storage space
• Performed data transformations, replication factors, storage to use
• Including scheduling of compute/storage/analysis jobs (using, e.g., ML)
• Where to run certain data-driven computations (Fluid-computing)
  - Client, server, in-network, cloud, your connected laptop
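
Registering compute snippets could look roughly like the sketch below. The registry, the `register_snippet` decorator and `run_snippet` are hypothetical names and not part of NetCDF or any existing library; every location executes locally here, whereas a real runtime would decide where to ship the function.

```python
from typing import Callable, Dict, List

# Hypothetical registry of analytics functions, keyed by name.
_SNIPPETS: Dict[str, Callable[[List[float]], float]] = {}


def register_snippet(name: str):
    """Decorator that registers an analytics function under a name."""
    def decorator(func):
        _SNIPPETS[name] = func
        return func
    return decorator


@register_snippet("mean")
def mean(values):
    return sum(values) / len(values)


@register_snippet("max")
def maximum(values):
    return max(values)


def run_snippet(name, values, location="client"):
    """Run a registered snippet; the system would choose `location`.

    In this sketch the location is only validated and the function runs
    locally. A real runtime could execute it on a server, in the
    network, or in the cloud instead.
    """
    allowed = {"client", "server", "in-network", "cloud"}
    if location not in allowed:
        raise ValueError(f"unknown location: {location}")
    return _SNIPPETS[name](values)
```

The programmer only writes and registers `mean`; nothing in the function refers to where the data lives or where the code executes.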


The Next Generation Interfaces Initiative

Goals

• The standardization of a high-level data model & interface
  - Targeting data-intensive and HPC workloads
  - Lifting semantic access to a new level
  - To have a future: must be beneficial for Big Data + Desktop, too
• Development of a reference implementation of a smart runtime system
  - Implementing key features
• Demonstration of benefits on socially relevant data-intensive apps


Next Generation Interfaces

Towards a new data-centric compute/IO stack considering:

• Smart hardware and software components
• Storage and compute are covered together
• User metadata and workflows as first-class citizens
• Self-aware instead of unconscious
• Improving over time (self-learning, hardware upgrades)

Why do we need a new domain/funding-independent API?

• Many domains have similar issues; projects are competitive
• It is a hard problem tackled by countless approaches
• Harness RD&E effort across domains


Development of the Data Model and API

• Establishing a Forum similar to MPI
• Define data model for HPC
• Open board: encourage community collaboration

[Figure: organization of the standardization effort. Labels: Standard 1.0, Data Model, Interface, Reference Implementation, Standard-Forum, Steering Board, Committee, Workgroup, Use-cases, Mini-apps, Workflows, Industry, Data centers, Scientists, Pseudo code, Members, Bodies, Next Generation Standard 2.0]


One Year of NGI in Retrospect

• One year ago, we started NGI as an informal collaboration
• Fair interest from individuals at
  - Institutions (UCAR, Sandia, Argonne, ...)
  - Vendors (NVIDIA, (Mellanox), Kove, DDN, ...)
• Issue: converting commitments into actions (without funding)
• Started to work on white papers (Minipapers)
  - Use-cases, visions, APIs, coordination
  - Everyone is welcome to contribute
• We will publish the first ones before SC


Summary

• There is a huge potential for the next-generation interface
• Can the community work together to define a vision and next-generation APIs?

Participate in defining NG interfaces

• Join the mailing list or our Slack
• Visit: https://ngi.vi4io.org


Appendix


Challenges Faced by HPC I/O

• Difficulty to analyze behavior and understand performance
  - Unclear access patterns (users, sites)
• Coexistence of access paradigms in workflows
  - File (POSIX, ADIOS, HDF5), SQL, NoSQL
• Semantic information is lost through layers
  - Suboptimal performance, lost opportunities
  - All data treated identically (up to the user)
• Re-implementation of features across stack
  - Unpredictable interactions
  - Wasted resources
• Restricted (performance) portability
  - Optimizing each layer for each system?
  - Users lack technological knowledge for tweaking
• Utilizing the future storage landscapes
  - No performance awareness, manual tuning and mapping to storage needed


Future Systems: Coexistence of Storage Systems

[Figure: coexisting storage systems. Labels: Local facility, Data center, Node, Memory, NVM, HDD, SSD, Burst Buffer, Tape, Cloud, S3, EC2]

• We shall be able to use all storage technologies concurrently
  - Without explicit migration etc., put data where it fits
  - Administrators just add a new technology (e.g., SSD pool) and users benefit
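
The idea that users benefit as soon as an administrator adds a new technology can be illustrated with a toy placement policy. The `Tier` class, the speed ranking and the capacities below are assumptions for illustration only, not a description of any existing system.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    free_gb: float
    relative_speed: int  # higher is faster; illustrative ranking only


def place(size_gb, hot, tiers):
    """Choose a tier: the fastest with room for hot data, the slowest for cold."""
    candidates = [t for t in tiers if t.free_gb >= size_gb]
    if not candidates:
        raise RuntimeError("no tier has enough free space")
    pick = max if hot else min
    chosen = pick(candidates, key=lambda t: t.relative_speed)
    chosen.free_gb -= size_gb  # account for the space just used
    return chosen.name


tiers = [Tier("Tape", 1e6, 1), Tier("HDD", 5e4, 2)]
first = place(100, hot=True, tiers=tiers)    # fastest available tier is HDD
tiers.append(Tier("SSD", 1e3, 3))            # administrator adds an SSD pool
second = place(100, hot=True, tiers=tiers)   # the same call now uses the SSD pool
```

The calling code is identical before and after the SSD pool is added; only the system's view of the available tiers changes.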


Alternative Software Stack

Some examples of the zoo of alternatives

• High-level abstractions: Dataclay, Dataspaces, Mochi
• Data models: ADIOS, HDF5, NetCDF, VTK
• Standard API across file formats: Silo, VTK, CDI, HDF5
• Low-level libraries: SIONlib, PLFS
• Storage interfaces: MPI-IO, POSIX, vendor-specific (e.g., CLOVIS), S3
• Big-data: HDFS, Spark, Flink, MongoDB, Cassandra
• Research: Countless storage system prototypes every year
• Projects: EXAHDF, Maestro (FET Proactive)


Outline

4 Challenges

5 Potential Interfaces


A Pragmatic View

• Take an existing data model like VTK (or NetCDF) as baseline
• With a hint of:
  - Scientific metadata handling
  - Workflow and processing interface
  - Information lifecycle management
  - Hardware model interface (hardware provides its own performance models)
• First prototype utilizes existing software stack
  - Like Cylc for workflows
  - Like MongoDB for metadata
  - Like a parallel file system (or object storage)
• Work on:
  - Scheduler for performant mapping of data/compute to storage/compute
  - A FUSE client for flexible data mappings on semantic metadata
  - Importer/Exporter tools for standard file formats
• Add magic (knowledge of experts developing APIs)
• Next prototype: move on with true implementation


Next-Generation HPC IO API Key Features

• High-level data model for HPC
  - Storage understands data structures vs. byte array
  - Relaxed consistency
• Semantic namespace and storage-aware data formats
  - Organize based on domain-specific metadata (instead of file system)
  - Support domain-specific operations and addressing schemes
• Integrated processing capabilities
  - Offload data-intensive compute to storage system
  - Managed data-driven workflows supporting events and services
  - Scheduler maps compute and I/O to hardware
• Enhanced data management features
  - Information life-cycle management (and value of data)
  - Embedded performance analysis
  - Resilience, import/export, ...

[Figure: NG-HPC-IO, with the labels Management, Information, Intent, Workflow, Operation, Object, Data description]
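
A semantic namespace, where data is organized by domain-specific metadata instead of file system paths, can be illustrated with a toy in-memory catalogue. `SemanticNamespace` and the metadata keys below are hypothetical; the sketch only shows the addressing idea, not a proposed API.

```python
class SemanticNamespace:
    """Toy catalogue that addresses objects by metadata, not by path."""

    def __init__(self):
        self._objects = []  # list of (metadata dict, payload) pairs

    def put(self, payload, **metadata):
        """Store a payload together with its descriptive metadata."""
        self._objects.append((metadata, payload))

    def find(self, **query):
        """Return payloads whose metadata matches every query item."""
        return [payload for meta, payload in self._objects
                if all(meta.get(k) == v for k, v in query.items())]


ns = SemanticNamespace()
ns.put([280.1, 281.4], variable="temperature", model="A", year=2018)
ns.put([279.8, 280.9], variable="temperature", model="B", year=2018)
ns.put([1013.0, 1009.5], variable="pressure", model="A", year=2018)

temperatures = ns.find(variable="temperature")
```

A query such as `ns.find(variable="pressure", model="A")` replaces knowledge of a directory layout; where the arrays are physically stored is left to the system.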
