Page 1

Rob Ross, Mathematics and Computer Science Division, Argonne National Laboratory, [email protected]

Post-Petascale System Software: Applying Lessons Learned

Page 2

System Software and HPC

§ "Stuff" that isn't part of the application:
– Operating System
– Resource Manager
– Scheduler
– RAS system
– File System

§ Production versions haven't changed a lot.

§ Will look at three areas:
– Monitoring
– Resource management
– Data management

2

Page 3

Pervasive Monitoring

3

Photo by Quevaal.

Page 4

HPC and Monitoring

§ Three major types of monitoring in HPC systems:
– RAS System Monitoring
• Constant tracking of the health of certain system components
– Application Profiling/Tracing
• Lighter-weight profiling
• Detailed logging of application behavior for debugging purposes
– Subsystem Monitoring
• Subsystems that independently monitor portions of the system for their own reasons (e.g., the file system; more on this later)

§ All have roles in the overall success of systems

4

Page 5

We’re Getting Good at Predicting Faults from Logs

5

[Figure: failure prediction precision and recall (percentage), from work on coupling failure prediction with proactive and preventive checkpointing.]

Ana Gainaru. Dealing with prediction unfriendly failures: The road to specialized predictors. JLESC Workshop, Chicago, IL. November 2014.
Ana Gainaru. Failure avoidance techniques for HPC systems based on failure prediction. SC Doctoral Showcase. New Orleans, LA. November 2014.
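Prediction quality in this line of work is reported as precision (what fraction of predicted failures were real) and recall (what fraction of real failures were predicted). A minimal sketch of that arithmetic, with made-up counts purely for illustration and not taken from the cited studies:

```c
#include <stdio.h>

/* Precision/recall for a failure predictor, computed from a confusion
 * matrix. The counts below are illustrative only. */
int main(void)
{
    double true_pos  = 40.0;  /* failures predicted and observed          */
    double false_pos = 10.0;  /* predicted failures that never happened   */
    double false_neg = 25.0;  /* failures the predictor missed            */

    double precision = true_pos / (true_pos + false_pos);
    double recall    = true_pos / (true_pos + false_neg);

    printf("precision = %.0f%%, recall = %.0f%%\n",
           100.0 * precision, 100.0 * recall);
    return 0;
}
```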

Page 6

Application Monitoring with Negligible Overhead

§ Profiling is used in a number of circumstances
§ Q: What can we observe about applications without perturbing their performance?

§ Example: Darshan, a lightweight, scalable I/O characterization tool that captures I/O access pattern information from production applications.
– Low, fixed memory consumption
– No data movement until MPI_Finalize()
– No source code or makefile changes
– Not a trace, not real time (see the wrapper sketch below)

6

P. Carns et al. 24/7 characterization of petascale I/O workloads. In Proceedings of the First Workshop on Interfaces and Abstractions for Scientific Data Storage (IASDS), New Orleans, LA, September 2009.
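Darshan gets these properties by intercepting I/O calls and deferring all aggregation to application shutdown. The following is a minimal, hypothetical sketch of that idea using the standard MPI profiling (PMPI) interface; it is not Darshan's actual implementation, just an illustration of counting bytes locally and communicating only at MPI_Finalize.

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative PMPI-style wrappers: count bytes written per process,
 * then aggregate once at shutdown. Not Darshan's real code. */
static long long local_bytes_written = 0;

int MPI_File_write(MPI_File fh, const void *buf, int count,
                   MPI_Datatype datatype, MPI_Status *status)
{
    int type_size;
    PMPI_Type_size(datatype, &type_size);
    local_bytes_written += (long long)count * type_size;
    return PMPI_File_write(fh, buf, count, datatype, status);
}

int MPI_Finalize(void)
{
    long long total = 0;
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Single reduction at shutdown: no data movement during the run. */
    PMPI_Reduce(&local_bytes_written, &total, 1, MPI_LONG_LONG,
                MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total bytes written: %lld\n", total);
    return PMPI_Finalize();
}
```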

Page 7

Job Level

7

Darshan analysis tool example:
– The darshan-job-summary.pl tool produces a 3-page PDF file summarizing various aspects of I/O performance
– This figure shows the I/O behavior of a 786,432-process turbulence simulation (production run) on the Mira system at ANL
– This particular application is write intensive and benefits greatly from collective buffering; no obvious tuning needed
– Example measurements shown: % of runtime in I/O, access size histogram

Page 8

System Level: Aggregated View of Data Volume (Example: system-wide analysis)

Job size vs. data volume for the Mira BG/Q system in 2014 (~128,000 logs as of October, ~8 PiB of traffic):
– Biggest by volume: ~300 TiB
– Biggest by scale: 768K processes
– Probably some scaling experiments
– Most jobs use power-of-2 numbers of processes on Mira

8

Page 9

System Level: I/O Mix

Top 10 data producer/consumers instrumented with Darshan over the month of July 2011 on the Intrepid BG/P system at Argonne. Surprisingly, three of the top producer/consumers almost exclusively read existing data.

Matching large-scale simulations of dense suspensions with empirical measurements to better understand properties of complex materials such as concrete.

Comparing simulations of turbulent mixing of fluids with experimental data to advance our understanding of supernovae explosions, inertial confinement fusion, and supersonic combustion.

[Figure: TiB written and read per project (log scale, 1-1000 TiB) for the top 10 projects: MaterialsScience, EarthScience1, ParticlePhysics, Combustion, Turbulence1, Chemistry, AstroPhysics, NuclearPhysics, Turbulence2, EarthScience2.]

Processing large-scale seismographic datasets to develop a 3D velocity model used in developing earthquake hazard maps.

9

Page 10

Subsystem Monitoring: A Weakness

§ Some subsystems operate independently from other HPC system software

§ File systems (in particular) do not leverage system-level monitoring
– Do their own monitoring
– And get it wrong at scale
– And then bring down huge chunks of the system

Component            2010   2011   2012
GPFS                  101     77     75
Machine                79     35     40
Myrinet HW/SW          28     32     15
Service Node (DB)      29      8     15
PVFS                   15      7      7
DDN                     6      0      2
Service Network         2      0     --

Root cause of interrupt events on the ALCF Intrepid BG/P system for 2010-2012. Thanks to C. Goletz and B. Allcock (ANL).

10

Page 11

Pervasive Monitoring: Next Steps

§ Predictions can be used more aggressively to reduce the cost of faults

§ Increased monitoring of applications can assist users and provide insight into application trends – how far can we push this?

§ Decision making by system services needs to be based on the best information, but also the same information

11

Page 12

Post-Petascale Resource Management

12

Page 13

HPC Resource Management Today

§ Focuses on the resources that may be scheduled for applications

§ Scheduling provides a queuing mechanism for "jobs" that are of fixed size (in compute resources); a minimal sketch of such a queue appears below

§ Some associated resources might also be allocated (e.g., routing or I/O nodes)

§ Other resources are managed by independent subsystems
– HPSS manages tape
– Parallel file system manages disks, storage servers
– DB software manages RAS database resources
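To make the "fixed-size jobs in a queue" model concrete, here is a minimal, hypothetical first-come-first-served allocator over a fixed pool of nodes; it sketches the classic HPC batch model, not any particular scheduler's implementation, and all job names and sizes are made up.

```c
#include <stdio.h>

/* Toy FCFS batch queue: each job asks for a fixed number of nodes for its
 * whole lifetime; a job at the head of the queue waits until enough nodes
 * are free. Illustrative only. */
struct job { const char *name; int nodes; };

int main(void)
{
    int free_nodes = 512;                       /* size of the machine */
    struct job queue[] = { {"climate", 256}, {"cfd", 384}, {"md", 128} };
    int n = sizeof(queue) / sizeof(queue[0]);

    for (int i = 0; i < n; i++) {
        if (queue[i].nodes <= free_nodes) {
            free_nodes -= queue[i].nodes;       /* allocate for job lifetime */
            printf("start %-8s on %3d nodes (%3d free)\n",
                   queue[i].name, queue[i].nodes, free_nodes);
        } else {
            printf("hold  %-8s: needs %3d nodes, only %3d free\n",
                   queue[i].name, queue[i].nodes, free_nodes);
            break;                              /* strict FCFS: no backfill */
        }
    }
    return 0;
}
```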

13

Page 14

Example: ACME – Scientific Infrastructure

[Diagram: DOE Accelerated Climate Modeling for Energy (ACME) testbed workflow. A Web UI and configuration UI (with a rule engine to guide valid configurations) drive the cycle of configuring an ESM case or ensemble, building and running the ESM, generating diagnostics, and performing exploratory and explanatory analysis (UV-CDAT and Dakota, including uncertainty quantification). Inputs include name list files, input data sets, initialization files, machine configuration, and manually provided files; outputs include model output data and diagnostics output, archived to storage and published via ESGF. The ACME database enables search/discovery, automated reproducibility, workflow status, a monitoring dashboard, and data archive and sharing, recording configuration information, build status, ESM run status, and diagnostics status. Simulation manager and provenance: AKUNA + ProvEn. Single sign-on and group management: Globus Nexus. Rapid, reliable, secure data transport and synchronization: Globus Online. Model source kept in svn/git.]

Thanks to ACME team and G. Shipman.

Page 15

Three Views of Resource Management and Scheduling

15

[Embedded image: first page of "The Application Level Placement Scheduler" (Michael Karo, Richard Lagerstrom, Marlys Kohnke, Carl Albing; Cray User Group, May 8, 2006). The abstract and introduction describe ALPS as a modular software suite that provides uniform access to computational resources by masking architecture-specific characteristics of Cray systems running Compute Node Linux, emphasizes the separation of policy and mechanism, and uses XML-RPC for communication between components plus memory-mapped files to consolidate and distribute data efficiently.]

M. Karo and R. Lagerstrom. The Application Level Placement Scheduler. Cray User Group Meeting. May, 2006.

HPC Batch
§ Single service node
– Monolithic scheduler
§ Nodes allocated for job life
§ Low rate of job scheduling
§ MPI model dominates
§ Other subsystems operate independently

[Embedded image: excerpt from the Apache Hadoop YARN paper. The text motivates requirements [R8] support for programming model diversity, [R9] a flexible resource model (fixed, typed map/reduce slots harm utilization), and [R10] backward compatibility, and introduces the architecture: a per-cluster ResourceManager tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants, while coordination of logical execution plans is left to per-application ApplicationMasters (Figure 1: YARN architecture with ResourceManager, NodeManagers, and application containers).]

V. Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. SOCC. October, 2013.

Apache YARN

§ Single resource manager
– Scheduler allocates resources, locality aware
§ Application masters
– Per application, track liveness
– Can reallocate resources on demand
§ Manages variety of workloads

M. Schwarzkopf et al. Omega: flexible, scalable schedulers for large compute clusters. EuroSys. April, 2013.

[Embedded image: first page of "Omega: flexible, scalable schedulers for large compute clusters" (Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes; EuroSys 2013). The abstract and introduction argue that monolithic cluster schedulers are hard to change and risk becoming scalability bottlenecks, that two-level schedulers (e.g., Mesos, Hadoop-on-Demand) limit flexibility through conservative resource visibility and locking, and propose a parallel scheduler architecture built around shared cluster state with lock-free optimistic concurrency control, driven by real-life Google production workloads (Figure 1: schematic overview of the scheduling architectures explored in the paper).]

Google Omega

§ Resource status globally available to multiple schedulers
§ Schedulers compete/collaborate
– Common scale of importance
§ Subsystems operate under same umbrella

Page 16

Node Fault Detection: Replicated State Machine + Heartbeat

▪ Storage servers exchange heartbeat messages to detect faults

▪ A subset of daemons use a distributed consensus algorithm (like Paxos) to maintain a consistent view of membership state

▪ Clients need not actively participate
– Retrieve state from servers or monitors when needed
– Limit the scaling requirements

(A minimal heartbeat-tracking sketch follows below.)
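As a concrete illustration of the heartbeat side of this design, here is a minimal, hypothetical monitor loop that marks a server suspect once its last heartbeat is older than a timeout. The consensus step that would turn suspicions into an agreed membership view is omitted, and all names and constants are illustrative.

```c
#include <stdio.h>
#include <time.h>

/* Toy heartbeat table kept by a monitor daemon. In a real system the
 * resulting suspicions would be fed through a consensus protocol
 * (e.g., Paxos) before the membership view changes. */
#define NSERVERS    4
#define TIMEOUT_SEC 10

struct server { const char *name; time_t last_heartbeat; };

static void record_heartbeat(struct server *s) { s->last_heartbeat = time(NULL); }

static void check_members(struct server servers[], int n)
{
    time_t now = time(NULL);
    for (int i = 0; i < n; i++) {
        if (now - servers[i].last_heartbeat > TIMEOUT_SEC)
            printf("suspect %s (silent for %lds)\n", servers[i].name,
                   (long)(now - servers[i].last_heartbeat));
        else
            printf("alive   %s\n", servers[i].name);
    }
}

int main(void)
{
    struct server servers[NSERVERS] = {
        {"oss0", 0}, {"oss1", 0}, {"oss2", 0}, {"oss3", 0}
    };
    for (int i = 0; i < NSERVERS; i++)
        record_heartbeat(&servers[i]);
    servers[2].last_heartbeat -= 30;   /* simulate a server that went quiet */
    check_members(servers, NSERVERS);
    return 0;
}
```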

16

Page 17

Epidemic-based Node Fault Detection

▪ Similarities:
– Clients need not actively participate
– Servers exchange heartbeat messages to detect faults
▪ Differences:
– No dedicated service for distributed consensus
– Each storage server maintains its own view of the system
– Disseminate updates using epidemic principles (see the gossip sketch below)

Semantic differences may influence the storage system design.

A. Das et al. SWIM: Scalable weakly-consistent infection-style process group membership protocol. Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN '02). 2002.
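The epidemic idea is that each server periodically merges its membership view with that of a few randomly chosen peers, so updates spread without any central consensus service. Below is a minimal, hypothetical sketch of one gossip round, loosely inspired by SWIM-style protocols rather than an implementation of them; all names and values are illustrative.

```c
#include <stdio.h>

/* Each server keeps its own view: a per-member status and an incarnation
 * number; merging keeps whichever entry is newer. Illustrative only. */
#define NMEMBERS 4

enum status { ALIVE, SUSPECT, DEAD };
struct entry { enum status status; int incarnation; };

/* Merge a peer's view into ours: the newer incarnation wins. */
static void merge_views(struct entry mine[], const struct entry theirs[], int n)
{
    for (int i = 0; i < n; i++)
        if (theirs[i].incarnation > mine[i].incarnation)
            mine[i] = theirs[i];
}

int main(void)
{
    struct entry a[NMEMBERS] = {{ALIVE,1},{ALIVE,1},{ALIVE,1},{ALIVE,1}};
    struct entry b[NMEMBERS] = {{ALIVE,1},{ALIVE,1},{SUSPECT,2},{ALIVE,1}};

    /* One gossip round: server A pulls the view of peer B. In a real
     * protocol the peer is chosen uniformly at random each round. */
    merge_views(a, b, NMEMBERS);

    for (int i = 0; i < NMEMBERS; i++)
        printf("member %d: %s (incarnation %d)\n", i,
               a[i].status == ALIVE ? "alive" :
               a[i].status == SUSPECT ? "suspect" : "dead",
               a[i].incarnation);
    return 0;
}
```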

17

Page 18

Resource Management: Next Steps

18

§ Revisit resource management
– Support multiple application "models"
– Separate scheduling from resource management
– Workflow support (but might be at a higher level)

§ Subsystems as applications
– Very long running
– Resource needs will change over time

§ Unified node fault detection
– Resource manager as the Oracle

Page 19

Data Management Architectures

19

Page 20

System Architecture and Nonvolatile Memory

Compute nodes run application processes.

I/O forwarding nodes (or I/O gateways) shuffle data between compute nodes and external resources, including storage.

Storage nodes run the parallel file system.

External network

Disk arrays

20

NVM in storage nodes serves as a PFS accelerator.

NVM in I/O nodes provides a fast staging area and region for temporary storage.

NVM in compute nodes lets you add noise into your system network.

Page 21

On Lakes and Cold Data

§ Companies managing large-scale data repositories are moving to a "data lake" model where bulk data is stored at low cost
§ New technologies provide an opportunity for convergence of "cold store" ideas with facility-wide, low latency access

21

R. Miller. Facebook Builds Exabyte Data Centers for Cold Storage. Data Center Knowledge. January 18, 2013. J. Novet. First Look: Facebook’s Oregon Cold Storage Facility. Data Center Knowledge. October 16, 2013. T. Morgan. Facebook Loads Up Innovative Cold Storage Datacenter. EnterpriseTech Storage Edition. October 25, 2013.

§ Facebook's cold store
– 62,000 square feet, up to 500 racks @ 4 PBytes/rack
– No generators or UPSes; all redundancy in software (Reed-Solomon)
– 2 KW/rack rather than standard (for them) 8 KW/rack
– Shingled magnetic recording drives
• Many drives spun down at any given moment
• Seconds to spin up and access – can be used for content delivery
– Predict the facility will be filled by 2017

Page 22

HPC I/O Software Stack

The software used to provide data model support and to transform I/O to better perform on today's I/O systems is often referred to as the I/O stack.

Data Model Libraries map application abstractions onto storage abstractions and provide data portability. (HDF5, Parallel netCDF, ADIOS)

I/O Middleware organizes accesses from many processes, especially those using collective I/O. (MPI-IO, PLFS; see the minimal MPI-IO example below)

I/O Forwarding transforms I/O from many clients into fewer, larger requests; reduces lock contention; and bridges between the HPC system and external storage. (IBM ciod, IOFSL, Cray DVS)

Parallel file system maintains the logical file model and provides efficient access to data. (PVFS, Gfarm, GPFS, Lustre)

[Diagram: stack layers from top to bottom – Application, Data Model Support, Transformations, Parallel File System, I/O Hardware.]
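To illustrate the middleware layer of this stack, here is a minimal MPI-IO example in which every process writes its own contiguous block of a shared file with a collective call; the file name and sizes are arbitrary placeholders.

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank writes COUNT ints at its own offset in a shared file using
 * collective I/O, letting the MPI-IO layer optimize the access pattern. */
#define COUNT 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *buf = malloc(COUNT * sizeof(int));
    for (int i = 0; i < COUNT; i++)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: offsets are disjoint, one block per rank. */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```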

22

Page 23

I/O Services, RPC, and the Mercury Project

23

Mercury

Objective: Create a reusable RPC library for use in HPC scientific libraries that can serve as a basis for services such as storage systems, I/O forwarding, analysis frameworks, and other forms of inter-application communication. (A hedged sketch of such an interface appears below.)

Why not reuse existing RPC frameworks?
– Do not support efficient large data transfers or asynchronous calls
– Mostly built on top of TCP/IP protocols
• Need support for native transport
• Need to be easy to port to new machines

Similar approaches with some differences indicate the need:
– I/O Forwarding Scalability Layer (IOFSL)
– NEtwork Scalable Service Interface (Nessie)
– Lustre RPC

http://www.mcs.anl.gov/projects/mercury/
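The requirements above — asynchronous calls and separate handling of bulk data — shape what an HPC RPC interface tends to look like. The sketch below is a hypothetical, Mercury-inspired interface: the names bulk_handle, rpc_register, and rpc_forward_async are invented for illustration and are not Mercury's actual API, and the stub implementations exist only so the sketch compiles and runs.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical HPC-style RPC interface (names invented for illustration;
 * this is not Mercury's actual API). Small arguments travel with the RPC
 * message; large buffers are described by a bulk handle that the server
 * can pull separately (e.g., via RDMA) when it is ready. */
typedef struct { const void *addr; size_t len; } bulk_handle;
typedef void (*rpc_done_cb)(void *user_arg, int ret);

struct write_in {
    unsigned long file_id;   /* small, serialized with the request     */
    bulk_handle   data;      /* large payload, transferred separately  */
};

/* Stand-in implementations: a real library would register the RPC with
 * the native transport and complete it asynchronously. */
static int rpc_register(const char *name) { (void)name; return 1; }
static int rpc_forward_async(int rpc_id, const void *in,
                             rpc_done_cb cb, void *arg)
{
    (void)rpc_id; (void)in;
    cb(arg, 0);              /* pretend the server finished immediately */
    return 0;
}

static void on_write_done(void *arg, int ret)
{
    (void)arg;
    printf("remote write completed with status %d\n", ret);
}

int main(void)
{
    static const char buf[4096] = "payload";
    int id = rpc_register("storage_write");
    struct write_in in = { 42, { buf, sizeof(buf) } };
    rpc_forward_async(id, &in, on_write_done, NULL);
    /* a real caller would overlap communication with useful work here */
    return 0;
}
```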

Page 24

An Alternative to File System Metadata: Provenance Graphs

What if we treat metadata management (including provenance) as a graph processing problem? (See the sketch at the end of this slide.)

24

Create a Metadata Graph
• Each log file => one Job (jobid, start_time, end_time, exe, nprocs, file_access)
• Each uid => one User
• All ranks => Processes
• File and exe => Data Object
• Synthetically create directory structure
– data files visited by the same execution are placed under the same directory
– directories accessed by the same user are placed under one directory

Whole 2013 trace from Intrepid: 42% of all core-hours consumed in 2013

[Figure: example property graph with User entities (name:John, id:330862395; name:sam, id:430823375), an Execution entity (id:2726768805, params:-n 2048, ...), and File entities (name:203863..., fs-type:gpfs, ...; name:2111648390, ...), connected by run, exe, read, and write edges carrying properties such as ts:20130101 and writeSize:7M.]

D. Dai et al. Using Property Graphs for Rich Metadata Management in HPC Systems. PDSW Workshop. New Orleans, LA. November 2014.

[Figure: rich metadata size grows with the level of detail captured – Users, Applications, Processes (I/O ranks), Processes (all ranks), Files.]
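As a sketch of what "metadata as a graph" means in practice, the following hypothetical C structures model users, executions, and files as vertices and run/read/write relationships as edges carrying properties. This only illustrates the data model on the slide; it is not the PDSW paper's implementation, and all identifiers and values are made up.

```c
#include <stdio.h>

/* Minimal property-graph shapes for rich metadata: vertices for users,
 * executions (jobs), and files; edges for run/exe/read/write relations.
 * Field names and values are illustrative only. */
enum vtype { V_USER, V_EXECUTION, V_FILE };
enum etype { E_RUN, E_EXE, E_READ, E_WRITE };

struct vertex {
    enum vtype  type;
    long long   id;
    const char *name;       /* user name, executable, or file path */
};

struct edge {
    enum etype     type;
    struct vertex *src, *dst;
    const char    *timestamp;
    long long      bytes;   /* meaningful for read/write edges */
};

int main(void)
{
    struct vertex user = { V_USER,      330862395LL,  "john" };
    struct vertex job  = { V_EXECUTION, 2726768805LL, "turbulence.exe" };
    struct vertex file = { V_FILE,      2111648390LL, "/gpfs/run01/out.h5" };

    struct edge edges[] = {
        { E_RUN,   &user, &job,  "20130101", 0 },
        { E_WRITE, &job,  &file, "20130101", 7LL * 1024 * 1024 },
    };

    /* Provenance query by traversal: which files did this user's job write? */
    for (size_t i = 0; i < sizeof(edges) / sizeof(edges[0]); i++)
        if (edges[i].type == E_WRITE && edges[i].src == &job)
            printf("%s wrote %s (%lld bytes)\n",
                   user.name, edges[i].dst->name, edges[i].bytes);
    return 0;
}
```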

Page 25

Post-Petascale Data Management Stack

25

[Diagram: proposed post-petascale data management stack. Users drive application tasks and analysis tasks through a programming model; science data model services sit above core data services (provenance management, core data model services, metadata management, plus a pass-through path); task and data coordination uses publish/subscribe; the stack rests on in-system storage, external storage, and networking hardware, with cross-cutting identity and security, WAN data services, resource management and scheduling, and performance monitoring.]

Input from G. Grider, S. Klasky, P. MacCormick, R. Oldfield, G. Shipman, K. van Dam, and D. Williams.

Page 26

Data Management: Next Steps

§ Storage-based vs. memory-based approaches to in-system storage

§ Data lakes and tape in future storage systems?
§ Rearchitect the I/O stack!

26

Page 27

Concluding Thoughts

27

Page 28

One Comparison of HPC and Big Data Software

28

[Figure: side-by-side architecture comparison. High-Performance Computing: applications over orchestration (Pegasus, Taverna, Dryad, Swift), declarative languages (Swift), MPI frameworks for advanced analytics and machine learning (BLAS, ScaLAPACK, CompLearn, PETSc, BLAST), MapReduce frameworks (Pilot-MapReduce), and advanced analytics and machine learning (Pilot-KMeans, Replica Exchange); higher-level runtime environment with workload management (pilots, Condor); resource management with cluster resource managers (Slurm, Torque, SGE), storage management (iRODS, SRM, GFFS), data access (virtual filesystem, GridFTP, SSH), and storage resources (Lustre, GPFS); resource fabric of compute resources (nodes, cores, VMs). Apache Hadoop Big Data: applications over orchestration (Oozie, Pig), advanced analytics and machine learning (Mahout, R, MLbase), SQL engines (Impala, Hive, Shark, Phoenix), MapReduce, in-memory processing (Spark), Twister MapReduce, data store and processing (HBase), each with its own scheduler, and higher-level workload management (Tez, Llama); cluster resource managers (YARN, Mesos); communication via MPI, RDMA, Hadoop shuffle/reduction, and HARP collectives; compute and data resources (nodes, cores, HDFS).]

Fig. 1. HPC and ABDS architecture and abstractions: The HPC approach historically separated data and compute; ABDS co-locates compute and data. The YARN resource manager heavily utilizes multi-level, data-aware scheduling and supports a vibrant Hadoop-based ecosystem of data processing, analytics and machine learning frameworks. Each approach has rich, but hitherto distinct, resource management and communication capabilities.

[Embedded text: body of the Jha et al. paper. It describes runtime environments for heterogeneous, loosely coupled tasks in HPC (Pilot-Jobs, many-task computing, workflows, and data staging via systems such as Condor-G/Glide-in and DIRAC), then the ABDS ecosystem: Hadoop's origin at Yahoo! as an integrated compute and data infrastructure, HDFS and MapReduce as the two primary components of Hadoop-1, and the move in Hadoop-2 to YARN as a central resource manager whose multi-level scheduling lets higher-level runtimes (HBase, Spark, TEZ, Twister, Spark Streaming, Apache Giraph) run their own application-level schedulers on top of Hadoop-managed storage and compute.]

Jha et al. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. Big Data Congress. June-July 2014.

Page 29

Applications and System Services

29

[Diagram:]
HPC Application
– Prog. Model and Task Mgmt.
– Data Model
– Comm. Methods
– Math/Physics Libraries

HPC System Services (Node OS)
– Resource Mgmt
– Data Mgmt
– System Monitoring
– Identity & Security
– WAN Data

Time to revisit our model of system services in HPC systems!

Page 30

Open Questions, Possible Collaboration Areas

§ Monitoring
– How do we better use predictive capabilities?
– Is there additional data that would improve our predictions?
– What more can we learn from applications without perturbing them?

§ Resource Management
– How should we perform resource management and scheduling in HPC?
– What do other system software services need from the resource manager?

§ Data Management
– What is/are the right short and long term approach(es) for managing the deep memory hierarchy?
– What algorithms/abstractions for managing data enable scalability?
– What is the right component breakdown to enable competition?

30

Page 31

MCS Storage Team

§ Phil Carns
§ Rob Latham
§ Dries Kimpe (on leave)
§ John Jenkins
§ Shane Snyder
§ Kevin Harms
§ Dong Dai

31

Page 32

Thank you for your time and attention!

32