Post-Petascale System Software: Applying Lessons Learned
Rob Ross
Mathematics and Computer Science Division, Argonne National Laboratory
[email protected]
System Software and HPC
§ “Stuff” that isn’t part of the application
– Operating System
– Resource Manager
– Scheduler
– RAS system
– File System
§ Production versions haven’t changed a lot.
§ Will look at three areas:
– Monitoring
– Resource management
– Data management
Pervasive Monitoring
Photo by Quevaal.
HPC and Monitoring
§ Three major types of monitoring in HPC systems:
– RAS System Monitoring
• Constant tracking of health of certain system components
– Application Profiling/Tracing
• Lighter-weight profiling
• Detailed logging of application behavior for debugging purposes
– Subsystem Monitoring
• Subsystems that independently monitor portions of the system for their own reasons (e.g., the file system; more on this later)
§ All have roles in overall success of systems
We’re Getting Good at Predicting Faults from Logs
[Figure: precision and recall (percentage) of failure prediction.]
Ana Gainaru. Dealing with prediction unfriendly failures: The road to specialized predictors. JLESC Workshop, Chicago, IL. November 2014. Ana Gainaru. Failure avoidance techniques for HPC systems based on failure prediction. SC Doctoral Showcase. New Orleans, LA. November 2014.
[Figure: motivation and impact of coupling failure prediction with proactive and preventive checkpointing.]
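Precision and recall, as plotted in these results, reduce to simple set arithmetic over predicted and observed failure events. A minimal sketch (the event IDs below are hypothetical, not Gainaru's data):

```python
def precision_recall(predicted, actual):
    """Precision: fraction of predictions that matched a real failure.
    Recall: fraction of real failures that were predicted."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical failure-event IDs, for illustration only.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
# p = 0.75 (3 of 4 predictions correct), r = 0.6 (3 of 5 failures caught)
```

The tension the figure illustrates is exactly this trade-off: tuning a predictor to raise recall (catch more failures) typically lowers precision (more false alarms), which matters when each prediction triggers a costly preventive action.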
Application Monitoring with Negligible Overhead
§ Profiling is used in a number of circumstances
§ Q: What can we observe about applications without perturbing their performance?
§ Example: Darshan: a lightweight, scalable I/O characterization tool that captures I/O access pattern information from production applications.
– Low, fixed memory consumption
– No data movement until MPI_Finalize()
– No source code or makefile changes
– Not a trace, not real time
P. Carns et al. 24/7 characterization of petascale I/O workloads. In Proceedings of the First Workshop on Interfaces and Abstractions for Scientific Data Storage (IASDS), New Orleans, LA, September 2009.
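Darshan's design can be illustrated with a toy counter set that performs no I/O or data movement of its own until finalize time. This is only a sketch of the idea, not Darshan's actual implementation (which hooks MPI and POSIX I/O routines and reduces counters across ranks at MPI_Finalize()):

```python
from collections import defaultdict

class IOCounters:
    """Toy Darshan-style instrumentation: small in-memory counters per
    file, and no output of its own until finalize() is called."""
    def __init__(self):
        self.stats = defaultdict(lambda: {"writes": 0, "bytes_written": 0,
                                          "reads": 0, "bytes_read": 0})

    def record_write(self, path, nbytes):
        s = self.stats[path]
        s["writes"] += 1
        s["bytes_written"] += nbytes

    def record_read(self, path, nbytes):
        s = self.stats[path]
        s["reads"] += 1
        s["bytes_read"] += nbytes

    def finalize(self):
        # Only now is a summary materialized; Darshan would reduce
        # counters across ranks and write one compressed log here.
        return {path: dict(s) for path, s in self.stats.items()}

ctr = IOCounters()
ctr.record_write("/scratch/out.h5", 1 << 20)   # hypothetical file path
ctr.record_write("/scratch/out.h5", 1 << 20)
summary = ctr.finalize()
```

Because only fixed-size counters are updated on each operation, the perturbation of the running application stays negligible.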
Job Level
§ Example: the darshan-job-summary.pl tool produces a 3-page PDF file summarizing various aspects of I/O performance
– Example measurements: % of runtime in I/O, access size histogram
§ The figure shows the I/O behavior of a 786,432-process turbulence simulation (production run) on the Mira system at ANL
§ This particular application is write intensive and benefits greatly from collective buffering; no obvious tuning needed
System Level: Aggregated View of Data Volume
§ Example: system-wide analysis of job size vs. data volume for the Mira BG/Q system in 2014 (~128,000 logs as of October, ~8 PiB of traffic)
§ Biggest by volume: ~300 TiB
§ Biggest by scale: 768K processes
– Probably some scaling experiments
§ Most jobs use power-of-2 numbers of processes on Mira
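Analyses like these amount to simple aggregations over per-job log records. A sketch with made-up (nprocs, bytes) tuples standing in for parsed Darshan logs:

```python
# Hypothetical per-job records (nprocs, bytes_moved); the values are
# invented for illustration, not taken from the Mira logs.
jobs = [
    (786432, 50 << 30),   # full-machine run, modest I/O
    (2048, 300 << 40),    # mid-size job moving ~300 TiB
    (16384, 1 << 40),
    (1000, 1 << 30),      # non-power-of-2 process count
]

biggest_by_volume = max(jobs, key=lambda j: j[1])
biggest_by_scale = max(jobs, key=lambda j: j[0])
# Power-of-2 check: n has a single set bit iff n & (n - 1) == 0.
power_of_two = sum(1 for n, _ in jobs if (n & (n - 1)) == 0)
# biggest_by_volume is the 2048-process job; 786,432 (= 3 * 2^18) is
# the biggest by scale but is not itself a power of two.
```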
System Level: I/O Mix
Top 10 data producer/consumers instrumented with Darshan over the month of July, 2011 on the Intrepid BG/P system at Argonne. Surprisingly, three of the top producer/consumers almost exclusively read existing data.
Matching large scale simulations of dense suspensions with empirical measurements to better understand properties of complex materials such as concrete.
Comparing simulations of turbulent mixing of fluids with experimental data to advance our understanding of supernovae explosions, inertial confinement fusion, and supersonic combustion.
[Figure: number of TiB written and read per project (log scale): MaterialsScience, EarthScience1, ParticlePhysics, Combustion, Turbulence1, Chemistry, AstroPhysics, NuclearPhysics, Turbulence2, EarthScience2.]
Processing large-scale seismographic datasets to develop a 3D velocity model used in developing earthquake hazard maps.
Subsystem Monitoring: A Weakness
§ Some subsystems operate independently from other HPC system software
§ File systems (in particular) do not leverage system-level monitoring
– Do their own monitoring
– And get it wrong at scale
– And then bring down huge chunks of the system
Component            2010  2011  2012
GPFS                  101    77    75
Machine                79    35    40
Myrinet HW/SW          28    32    15
Service Node (DB)      29     8    15
PVFS                   15     7     7
DDN                     6     0     2
Service Network         2     0    --
Root cause of interrupt events on the ALCF Intrepid BG/P system for 2010-2012. Thanks to C. Goletz and B. Allcock (ANL).
Pervasive Monitoring: Next Steps
§ Predictions can be used more aggressively to reduce the cost of faults
§ Increased monitoring of applications can assist users and provide insight into application trends – how far can we push this?
§ Decision making by system services needs to be based on the best information, but also the same information
Post-Petascale Resource Management
HPC Resource Management Today
§ Focuses on the resources that may be scheduled for applications
§ Scheduling provides a queuing mechanism for “jobs” that are of fixed size (in compute resources)
§ Some associated resources might also be allocated (e.g., routing or I/O nodes)
§ Other resources are managed by independent subsystems
– HPSS manages tape
– Parallel file system manages disks, storage servers
– DB software manages RAS database resources
Example: ACME – Scientific Infrastructure
[Diagram: DOE Accelerated Climate Modeling for Energy (ACME) testbed workflow. A configuration UI with a rule engine guides valid ESM configurations (machine config, namelist files, input data sets, initialization files); the workflow builds and runs the ESM (model source in svn/git), generates diagnostics, and feeds exploratory and explanatory analysis (UV-CDAT & Dakota, uncertainty quantification). An ACME database enables search/discovery, automated reproducibility, workflow status, a monitoring dashboard, and data archive and sharing (ESGF); a simulation manager (AKUNA + ProvEn) tracks monitoring and provenance dataflow. Globus Nexus provides single sign-on and group management; Globus Online provides rapid, reliable, secure data transport and synchronization.]
Thanks to ACME team and G. Shipman.
Three Views of Resource Management and Scheduling
[Screenshot: first page of the ALPS paper (Karo et al., Cray User Group, May 2006), describing a modular application placement scheduler for Cray systems running Compute Node Linux, with components communicating via XML-RPC and sharing data through memory-mapped files.]
M. Karo and R. Lagerstrom. The Application Level Placement Scheduler. Cray User Group Meeting. May, 2006.
HPC Batch
§ Single service node
– Monolithic scheduler
§ Nodes allocated for job life
§ Low rate of job scheduling
§ MPI model dominates
§ Other subsystems operate independently
[Screenshot: excerpt from the YARN paper, including Figure 1 (YARN architecture: ResourceManager, NodeManagers, and per-application ApplicationMasters with containers), discussing requirements R8 (support for programming model diversity), R9 (flexible resource model), and R10 (backward compatibility).]
V. Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. SOCC. October, 2013.
Apache YARN
§ Single resource manager
– Scheduler allocates resources, locality aware
§ Application masters
– Per application, track liveness
– Can reallocate resources on demand
§ Manages variety of workloads
M. Schwarzkopf et al. Omega: flexible, scalable schedulers for large compute clusters. EuroSys. April, 2013.
[Screenshot: first page of the Omega paper (Schwarzkopf et al., EuroSys 2013), which contrasts monolithic and two-level scheduler architectures and proposes a parallel, shared-state design using lock-free optimistic concurrency control, driven by Google production workloads.]
Google Omega
§ Resource status globally available to multiple schedulers
§ Schedulers compete/collaborate
– Common scale of importance
§ Subsystems operate under same umbrella
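Omega's shared-state, lock-free approach can be sketched as schedulers that read a snapshot of the full cluster state and commit claims with an optimistic version check. This is a simplified illustration of the idea, not Google's implementation:

```python
class SharedClusterState:
    """All schedulers see every node; a commit fails if the node's
    version changed since the snapshot was taken (optimistic
    concurrency control instead of locking)."""
    def __init__(self, nodes):
        self.nodes = {n: {"version": 0, "owner": None} for n in nodes}

    def snapshot(self):
        return {n: dict(s) for n, s in self.nodes.items()}

    def try_claim(self, node, seen_version, owner):
        s = self.nodes[node]
        if s["version"] != seen_version or s["owner"] is not None:
            return False          # conflict: another scheduler won the race
        s["owner"] = owner
        s["version"] += 1
        return True

state = SharedClusterState(["n0", "n1"])
snap = state.snapshot()
# Two schedulers race for the same node from the same snapshot.
ok_batch = state.try_claim("n0", snap["n0"]["version"], "batch-scheduler")
ok_svc = state.try_claim("n0", snap["n0"]["version"], "service-scheduler")
# ok_batch is True; ok_svc is False and that scheduler must retry
# from a fresh snapshot.
```

The design choice this illustrates: conflicts are detected at commit time rather than prevented by pessimistic locks, so independent schedulers can work in parallel over the whole cluster state.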
Node Fault Detection: Replicated State Machine + Heartbeat
▪ Storage servers exchange heartbeat messages to detect faults
▪ A subset of daemons use a distributed consensus algorithm (like Paxos) to maintain a consistent view of membership state
▪ Clients need not actively participate
– Retrieve state from servers or monitors when needed
– Limit the scaling requirements
Epidemic-based Node Fault Detection
▪ Similarities:
– Clients need not actively participate
– Servers exchange heartbeat messages to detect faults
▪ Differences:
– No dedicated service for distributed consensus
– Each storage server maintains its own view of the system
– Disseminate updates using epidemic principles
Semantic differences may influence the storage system design.
A. Das et al. SWIM: Scalable weakly-consistent infection-style process group membership protocol. Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN ’02). 2002.
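The epidemic style can be sketched as servers repeatedly pushing their membership views to random peers, with the freshest status (highest counter) winning on merge. This is a simplified illustration of the dissemination idea only; SWIM itself adds ping-based failure detection and suspicion timeouts:

```python
import random

def gossip_round(views, rng):
    """One epidemic round: every server pushes its view to one random
    peer; the receiver keeps the freshest status per node."""
    for server in list(views):
        peer = rng.choice([s for s in views if s != server])
        for node, (counter, status) in views[server].items():
            if node not in views[peer] or views[peer][node][0] < counter:
                views[peer][node] = (counter, status)

# Three servers; s0 has locally marked node "n7" suspect (newer counter).
views = {s: {"n7": (1, "alive")} for s in ("s0", "s1", "s2")}
views["s0"]["n7"] = (2, "suspect")

rng = random.Random(2014)
rounds = 0
while not all(v["n7"] == (2, "suspect") for v in views.values()) and rounds < 100:
    gossip_round(views, rng)
    rounds += 1
# All views converge on "suspect" after a handful of rounds, with no
# central consensus service involved.
```

Note the semantic difference from the consensus-based design above: each server's view is only eventually consistent, which the storage system must be prepared to tolerate.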
Resource Management: Next Steps
§ Revisit resource management
– Support multiple application “models”
– Separate scheduling from resource management
– Workflow support (but might be at a higher level)
§ Subsystems as applications
– Very long running
– Resource needs will change over time
§ Unified node fault detection
– Resource manager as the Oracle
Data Management Architectures
System Architecture and Nonvolatile Memory
§ Compute nodes run application processes.
§ I/O forwarding nodes (or I/O gateways) shuffle data between compute nodes and external resources, including storage.
§ Storage nodes run the parallel file system, attached to disk arrays and the external network.
§ NVM in storage nodes serves as a PFS accelerator.
§ NVM in I/O nodes provides a fast staging area and region for temporary storage.
§ NVM in compute nodes lets you add noise into your system network.
On Lakes and Cold Data
§ Companies managing large scale data repositories are moving to a “data lake” model where bulk data is stored at low cost
§ New technologies provide an opportunity for convergence of “cold store” ideas with facility-wide, low latency access
R. Miller. Facebook Builds Exabyte Data Centers for Cold Storage. Data Center Knowledge. January 18, 2013. J. Novet. First Look: Facebook’s Oregon Cold Storage Facility. Data Center Knowledge. October 16, 2013. T. Morgan. Facebook Loads Up Innovative Cold Storage Datacenter. EnterpriseTech Storage Edition. October 25, 2013.
§ Facebook’s cold store
– 62,000 square feet, up to 500 racks @ 4 PBytes/rack
– No generators or UPSes, all redundancy in software (Reed-Solomon)
– 2 kW/rack rather than standard (for them) 8 kW/rack
– Shingled magnetic recording drives
• Many drives spun down at any given moment
• Seconds to spin up and access – can be used for content delivery
– Predict facility will be filled by 2017
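The headline capacity follows from simple arithmetic; the Reed-Solomon parameters below are illustrative, not Facebook's actual configuration:

```python
racks = 500
pb_per_rack = 4
raw_capacity_pb = racks * pb_per_rack          # 2000 PB = 2 EB raw

# Reed-Solomon (k data, m parity) trades capacity for software
# redundancy; (10, 4) is used here only as an example layout.
k, m = 10, 4
usable_pb = raw_capacity_pb * k / (k + m)      # ~1428.6 PB usable
overhead = (k + m) / k                         # 1.4x raw bytes per stored byte
```

Compared with replication (3x overhead is common), erasure coding is what makes software-only redundancy affordable at this scale.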
HPC I/O Software Stack
The software used to provide data model support and to transform I/O to better perform on today’s I/O systems is often referred to as the I/O stack.
Data Model Libraries map application abstractions onto storage abstractions and provide data portability.
HDF5, Parallel netCDF, ADIOS
I/O Middleware organizes accesses from many processes, especially those using collective I/O.
MPI-IO, PLFS
I/O Forwarding transforms I/O from many clients into fewer, larger requests; reduces lock contention; and bridges between the HPC system and external storage.
IBM ciod, IOFSL, Cray DVS
Parallel file system maintains logical file model and provides efficient access to data.
PVFS, Gfarm, GPFS, Lustre
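The “fewer, larger requests” transformation performed at the forwarding (and collective I/O) layers amounts to coalescing adjacent extents. A minimal sketch:

```python
def coalesce(requests, gap=0):
    """Merge (offset, length) requests whose extents touch or overlap
    (within `gap` bytes), turning many small I/Os into fewer large
    ones, as an I/O forwarding layer might before hitting the PFS."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1] + gap:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged

# Strided 4 KiB writes from many clients collapse into two requests.
reqs = [(0, 4096), (4096, 4096), (8192, 4096), (1 << 20, 4096)]
coalesce(reqs)  # -> [(0, 12288), (1048576, 4096)]
```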
[Diagram: I/O stack layers, top to bottom: Application; Data Model Support; Transformations; Parallel File System; I/O Hardware.]
I/O Services, RPC, and the Mercury Project
Mercury
Objective: Create a reusable RPC library for use in HPC scientific libraries that can serve as a basis for services such as storage systems, I/O forwarding, analysis frameworks, and other forms of inter-application communication.
Why not reuse existing RPC frameworks?
– Do not support efficient large data transfers or asynchronous calls
– Mostly built on top of TCP/IP protocols
• Need support for native transport
• Need to be easy to port to new machines
Similar approaches with some differences indicate the need:
– I/O Forwarding Scalability Layer (IOFSL)
– NEtwork Scalable Service Interface (Nessie)
– Lustre RPC
http://www.mcs.anl.gov/projects/mercury/
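The pattern Mercury targets, a small control message with asynchronous completion (bulk data moved separately over native transports), can be illustrated with a toy queue-based RPC. This is not Mercury's real API, just the shape of an asynchronous forward/callback interaction:

```python
import queue
import threading

class ToyRpc:
    """Toy asynchronous RPC in the spirit of Mercury (NOT its real
    API): small control messages flow through a queue, large data
    would move via a separate bulk transfer, and completion is
    signaled by a callback rather than blocking the caller."""
    def __init__(self):
        self.inbox = queue.Queue()
        self.handlers = {}
        threading.Thread(target=self._serve, daemon=True).start()

    def register(self, name, fn):
        self.handlers[name] = fn

    def _serve(self):
        while True:
            name, args, on_complete = self.inbox.get()
            on_complete(self.handlers[name](*args))

    def forward(self, name, args, on_complete):
        self.inbox.put((name, args, on_complete))  # returns immediately

rpc = ToyRpc()
rpc.register("write", lambda off, nbytes: nbytes)  # pretend server-side write
done = threading.Event()
result = []
rpc.forward("write", (0, 1 << 20),
            lambda r: (result.append(r), done.set()))
done.wait(timeout=5)
# The caller was free to do other work while the "write" completed.
```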
An Alternative to File System Metadata: Provenance Graphs
What if we treat metadata management (including provenance) as a graph processing problem?
Create a Metadata Graph
• Each log file => one Job (jobid, start_time, end_time, exe)
• Each uid => one User
• All ranks => Processes (nprocs, file_access)
• File and exe => Data Objects
• Synthetically create directory structure
– Data files visited by the same execution will be placed under the same directory
– Directories accessed by the same user are placed under one directory
Whole 2013 trace from Intrepid: 42% of all core-hours consumed in 2013.
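The mapping above can be sketched as a property graph over plain dictionaries; the entity kinds and attributes follow the slide, while the specific values and the query are made up for illustration:

```python
# Nodes carry attributes; edges carry labels plus optional properties.
nodes, edges = {}, []

def add_node(nid, kind, **props):
    nodes[nid] = {"kind": kind, **props}

def add_edge(src, label, dst, **props):
    edges.append((src, label, dst, props))

# Hypothetical IDs and values, echoing the slide's example figure.
add_node("user:330862395", "User", name="john")
add_node("exec:2726768805", "Execution", params="-n 2048")
add_node("file:2111648390", "File", fs_type="gpfs")
add_edge("user:330862395", "run", "exec:2726768805")
add_edge("exec:2726768805", "write", "file:2111648390", write_size="7M")

# Metadata queries become graph traversals, e.g. "files written by
# executions run by this user":
written = [dst for src, label, dst, _ in edges
           if label == "write"
           and any(s == "user:330862395" and l == "run" and d == src
                   for s, l, d, _ in edges)]
# written == ["file:2111648390"]
```

Provenance queries ("which executions read this file?", "which user produced this dataset?") fall out of the same traversal machinery, which is the point of treating metadata management as a graph processing problem.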
[Figure: example metadata graph. User entities (name, id) connect by “run” edges to Execution entities (id, params), which connect by “exe”, “read”, and “write” edges (with properties such as timestamp and write size) to File entities (name, fs-type).]
D. Dai et al. Using Property Graphs for Rich Metadata Management in HPC Systems. PDSW Workshop. New Orleans, LA. November 2014.
[Figure: rich metadata size grows with the level of detail captured: users, applications, files, processes (I/O ranks), processes (all ranks).]
Post-Petascale Data Management Stack
[Diagram: proposed post-petascale data management stack. Application tasks, users, and analysis tasks sit atop a programming model; below are science data model services, task and data coordination with publish/subscribe, and core data services (core data model services, provenance management, metadata management, pass-through); identity and security, WAN data services, resource management and scheduling, and performance monitoring cut across; in-system storage, networking hardware, and external storage form the base.]
Input from G. Grider, S. Klasky, P. MacCormick, R. Oldfield, G. Shipman, K. van Dam, and D. Williams.
Data Management: Next Steps
§ Storage-based vs. memory-based approaches to in-system storage
§ Data lakes and tape in future storage systems?
§ Rearchitect the I/O stack!
Concluding Thoughts
One Comparison of HPC and Big Data Software
[Figure: side-by-side software stacks. High-Performance Computing: compute resources (nodes, cores, VMs) and storage resources (Lustre, GPFS) at the resource fabric, under a cluster resource manager (Slurm, Torque, SGE) with storage management (iRODS, SRM, GFFS) and data access (virtual filesystem, GridFTP, SSH); above these, workload management (pilots, Condor), orchestration (Pegasus, Taverna, Dryad, Swift), declarative languages (Swift), MapReduce frameworks (Pilot-MapReduce), and MPI frameworks for advanced analytics and machine learning (BLAS, ScaLAPACK, CompLearn, PETSc, BLAST). Apache Hadoop Big Data: compute and data resources (nodes, cores, HDFS) under a cluster resource manager (YARN, Mesos); higher-level workload management (Tez, Llama); runtimes with their own schedulers, MapReduce, in-memory (Spark), Twister, data store and processing (HBase), and SQL engines (Impala, Hive, Shark, Phoenix); orchestration (Oozie, Pig) and advanced analytics and machine learning (Mahout, R, MLbase) on top; communication via MPI, RDMA, Hadoop shuffle/reduction, and HARP collectives.]
Fig. 1. HPC and ABDS architecture and abstractions: the HPC approach historically separated data and compute; ABDS co-locates compute and data. The YARN resource manager heavily utilizes multi-level, data-aware scheduling and supports a vibrant Hadoop-based ecosystem of data processing, analytics, and machine learning frameworks. Each approach has rich, but hitherto distinct, resource management and communication capabilities.
Jha et al. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. Big Data Congress. June-July 2014.
Applications and System Services
[Diagram: an HPC application built on math/physics libraries, a programming model and task management, a data model, and communication methods, running atop the node OS and HPC system services: resource management, data management, system monitoring, identity & security, and WAN data.]
Time to revisit our model of system services in HPC systems!
Open Questions, Possible Collaboration Areas
§ Monitoring
– How do we better use predictive capabilities?
– Is there additional data that would improve our predictions?
– What more can we learn from applications without perturbing them?
§ Resource Management
– How should we perform resource management and scheduling in HPC?
– What do other system software services need from the resource manager?
§ Data Management
– What is/are the right short and long term approach(es) for managing the deep memory hierarchy?
– What algorithms/abstractions for managing data enable scalability?
– What is the right component breakdown to enable competition?
MCS Storage Team
§ Phil Carns
§ Rob Latham
§ Dries Kimpe (on leave)
§ John Jenkins
§ Shane Snyder
§ Kevin Harms
§ Dong Dai
Thank you for your time and attention!