Post-Petascale System Software: Applying Lessons Learned
Rob Ross
Mathematics and Computer Science Division, Argonne National Laboratory
[email protected]
System Software and HPC
§ “Stuff” that isn’t part of the application
– Operating System
– Resource Manager
– Scheduler
– RAS system
– File System
§ Production versions haven’t changed a lot.
§ Will look at three areas:
– Monitoring
– Resource management
– Data management
Pervasive Monitoring
Photo by Quevaal.
HPC and Monitoring
§ Three major types of monitoring in HPC systems:
– RAS System Monitoring
• Constant tracking of health of certain system components
– Application Profiling/Tracing
• Lighter-weight profiling
• Detailed logging of application behavior for debugging purposes
– Subsystem Monitoring
• Subsystems that independently monitor portions of the system for their own reasons (e.g., the file system; more on this later)
§ All have roles in overall success of systems
We’re Getting Good at Predicting Faults from Logs
[Figure: precision and recall (percentage) of failure prediction.]
Ana Gainaru. Dealing with prediction unfriendly failures: The road to specialized predictors. JLESC Workshop, Chicago, IL. November 2014. Ana Gainaru. Failure avoidance techniques for HPC systems based on failure prediction. SC Doctoral Showcase. New Orleans, LA. November 2014.
[Figure: motivation and impact of coupling failure prediction with proactive and preventive checkpointing.]
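Precision and recall, as plotted in these results, reduce to simple set arithmetic over predicted and observed failure events. A minimal sketch (the event IDs below are hypothetical, not Gainaru's data):

```python
def precision_recall(predicted, actual):
    """Precision: fraction of predictions that matched a real failure.
    Recall: fraction of real failures that were predicted."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical failure-event IDs, for illustration only.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
# p = 0.75 (3 of 4 predictions correct), r = 0.6 (3 of 5 failures caught)
```

The tension the figure illustrates is exactly this trade-off: tuning a predictor to raise recall (catch more failures) typically lowers precision (more false alarms), which matters when each prediction triggers a costly preventive action.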
Application Monitoring with Negligible Overhead
§ Profiling is used in a number of circumstances
§ Q: What can we observe about applications without perturbing their performance?
§ Example: Darshan: a lightweight, scalable I/O characterization tool that captures I/O access pattern information from production applications.
– Low, fixed memory consumption
– No data movement until MPI_Finalize()
– No source code or makefile changes
– Not a trace, not real time
P. Carns et al. 24/7 characterization of petascale I/O workloads. In Proceedings of the First Workshop on Interfaces and Abstractions for Scientific Data Storage (IASDS), New Orleans, LA, September 2009.
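Darshan's design can be illustrated with a toy counter set that performs no I/O or data movement of its own until finalize time. This is only a sketch of the idea, not Darshan's actual implementation (which hooks MPI and POSIX I/O routines and reduces counters across ranks at MPI_Finalize()):

```python
from collections import defaultdict

class IOCounters:
    """Toy Darshan-style instrumentation: small in-memory counters per
    file, and no output of its own until finalize() is called."""
    def __init__(self):
        self.stats = defaultdict(lambda: {"writes": 0, "bytes_written": 0,
                                          "reads": 0, "bytes_read": 0})

    def record_write(self, path, nbytes):
        s = self.stats[path]
        s["writes"] += 1
        s["bytes_written"] += nbytes

    def record_read(self, path, nbytes):
        s = self.stats[path]
        s["reads"] += 1
        s["bytes_read"] += nbytes

    def finalize(self):
        # Only now is a summary materialized; Darshan would reduce
        # counters across ranks and write one compressed log here.
        return {path: dict(s) for path, s in self.stats.items()}

ctr = IOCounters()
ctr.record_write("/scratch/out.h5", 1 << 20)   # hypothetical file path
ctr.record_write("/scratch/out.h5", 1 << 20)
summary = ctr.finalize()
```

Because only fixed-size counters are updated on each operation, the perturbation of the running application stays negligible.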
Job Level
§ Example: the darshan-job-summary.pl tool produces a 3-page PDF file summarizing various aspects of I/O performance
– Example measurements: % of runtime in I/O, access size histogram
§ The figure shows the I/O behavior of a 786,432-process turbulence simulation (production run) on the Mira system at ANL
§ This particular application is write intensive and benefits greatly from collective buffering; no obvious tuning needed
System Level: Aggregated View of Data Volume
§ Example: system-wide analysis of job size vs. data volume for the Mira BG/Q system in 2014 (~128,000 logs as of October, ~8 PiB of traffic)
§ Biggest by volume: ~300 TiB
§ Biggest by scale: 768K processes
– Probably some scaling experiments
§ Most jobs use power-of-2 numbers of processes on Mira
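Analyses like these amount to simple aggregations over per-job log records. A sketch with made-up (nprocs, bytes) tuples standing in for parsed Darshan logs:

```python
# Hypothetical per-job records (nprocs, bytes_moved); the values are
# invented for illustration, not taken from the Mira logs.
jobs = [
    (786432, 50 << 30),   # full-machine run, modest I/O
    (2048, 300 << 40),    # mid-size job moving ~300 TiB
    (16384, 1 << 40),
    (1000, 1 << 30),      # non-power-of-2 process count
]

biggest_by_volume = max(jobs, key=lambda j: j[1])
biggest_by_scale = max(jobs, key=lambda j: j[0])
# Power-of-2 check: n has a single set bit iff n & (n - 1) == 0.
power_of_two = sum(1 for n, _ in jobs if (n & (n - 1)) == 0)
# biggest_by_volume is the 2048-process job; 786,432 (= 3 * 2^18) is
# the biggest by scale but is not itself a power of two.
```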
System Level: I/O Mix
Top 10 data producer/consumers instrumented with Darshan over the month of July, 2011 on the Intrepid BG/P system at Argonne. Surprisingly, three of the top producer/consumers almost exclusively read existing data.
Matching large scale simulations of dense suspensions with empirical measurements to better understand properties of complex materials such as concrete.
Comparing simulations of turbulent mixing of fluids with experimental data to advance our understanding of supernovae explosions, inertial confinement fusion, and supersonic combustion.
[Figure: number of TiB written and read per project (log scale): MaterialsScience, EarthScience1, ParticlePhysics, Combustion, Turbulence1, Chemistry, AstroPhysics, NuclearPhysics, Turbulence2, EarthScience2.]
Processing large-scale seismographic datasets to develop a 3D velocity model used in developing earthquake hazard maps.
Subsystem Monitoring: A Weakness
§ Some subsystems operate independently from other HPC system software
§ File systems (in particular) do not leverage system-level monitoring
– Do their own monitoring
– And get it wrong at scale
– And then bring down huge chunks of the system
Component            2010  2011  2012
GPFS                  101    77    75
Machine                79    35    40
Myrinet HW/SW          28    32    15
Service Node (DB)      29     8    15
PVFS                   15     7     7
DDN                     6     0     2
Service Network         2     0    --
Root cause of interrupt events on the ALCF Intrepid BG/P system for 2010-2012. Thanks to C. Goletz and B. Allcock (ANL).
Pervasive Monitoring: Next Steps
§ Predictions can be used more aggressively to reduce the cost of faults
§ Increased monitoring of applications can assist users and provide insight into application trends – how far can we push this?
§ Decision making by system services needs to be based on the best information, but also the same information
Post-Petascale Resource Management
HPC Resource Management Today
§ Focuses on the resources that may be scheduled for applications
§ Scheduling provides a queuing mechanism for “jobs” that are of fixed size (in compute resources)
§ Some associated resources might also be allocated (e.g., routing or I/O nodes)
§ Other resources are managed by independent subsystems
– HPSS manages tape
– Parallel file system manages disks, storage servers
– DB software manages RAS database resources
Example: ACME – Scientific Infrastructure
[Diagram: DOE Accelerated Climate Modeling for Energy (ACME) testbed workflow. A configuration UI with a rule engine guides valid ESM configurations (machine config, namelist files, input data sets, initialization files); the workflow builds and runs the ESM (model source in svn/git), generates diagnostics, and feeds exploratory and explanatory analysis (UV-CDAT & Dakota, uncertainty quantification). An ACME database enables search/discovery, automated reproducibility, workflow status, a monitoring dashboard, and data archive and sharing (ESGF); a simulation manager (AKUNA + ProvEn) tracks monitoring and provenance dataflow. Globus Nexus provides single sign-on and group management; Globus Online provides rapid, reliable, secure data transport and synchronization.]
Thanks to ACME team and G. Shipman.
Three Views of Resource Management and Scheduling
[Screenshot: first page of the ALPS paper (Karo et al., Cray User Group, May 2006), describing a modular application placement scheduler for Cray systems running Compute Node Linux, with components communicating via XML-RPC and sharing data through memory-mapped files.]
M. Karo and R. Lagerstrom. The Application Level Placement Scheduler. Cray User Group Meeting. May, 2006.
HPC Batch
§ Single service node
– Monolithic scheduler
§ Nodes allocated for job life
§ Low rate of job scheduling
§ MPI model dominates
§ Other subsystems operate independently
[Screenshot: excerpt from the YARN paper, including Figure 1 (YARN architecture: ResourceManager, NodeManagers, and per-application ApplicationMasters with containers), discussing requirements R8 (support for programming model diversity), R9 (flexible resource model), and R10 (backward compatibility).]
V. Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. SOCC. October, 2013.
Apache YARN
§ Single resource manager
– Scheduler allocates resources, locality aware
§ Application masters
– Per application, track liveness
– Can reallocate resources on demand
§ Manages variety of workloads
M. Schwarzkopf et al. Omega: flexible, scalable schedulers for large compute clusters. EuroSys. April, 2013.
[Screenshot: first page of the Omega paper (Schwarzkopf et al., EuroSys 2013), which contrasts monolithic and two-level scheduler architectures and proposes a parallel, shared-state design using lock-free optimistic concurrency control, driven by Google production workloads.]
Google Omega
§ Resource status globally available to multiple schedulers
§ Schedulers compete/collaborate
– Common scale of importance
§ Subsystems operate under same umbrella
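Omega's shared-state, lock-free approach can be sketched as schedulers that read a snapshot of the full cluster state and commit claims with an optimistic version check. This is a simplified illustration of the idea, not Google's implementation:

```python
class SharedClusterState:
    """All schedulers see every node; a commit fails if the node's
    version changed since the snapshot was taken (optimistic
    concurrency control instead of locking)."""
    def __init__(self, nodes):
        self.nodes = {n: {"version": 0, "owner": None} for n in nodes}

    def snapshot(self):
        return {n: dict(s) for n, s in self.nodes.items()}

    def try_claim(self, node, seen_version, owner):
        s = self.nodes[node]
        if s["version"] != seen_version or s["owner"] is not None:
            return False          # conflict: another scheduler won the race
        s["owner"] = owner
        s["version"] += 1
        return True

state = SharedClusterState(["n0", "n1"])
snap = state.snapshot()
# Two schedulers race for the same node from the same snapshot.
ok_batch = state.try_claim("n0", snap["n0"]["version"], "batch-scheduler")
ok_svc = state.try_claim("n0", snap["n0"]["version"], "service-scheduler")
# ok_batch is True; ok_svc is False and that scheduler must retry
# from a fresh snapshot.
```

The design choice this illustrates: conflicts are detected at commit time rather than prevented by pessimistic locks, so independent schedulers can work in parallel over the whole cluster state.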
Node Fault Detection: Replicated State Machine + Heartbeat
▪ Storage servers exchange heartbeat messages to detect faults
▪ A subset of daemons use a distributed consensus algorithm (like Paxos) to maintain a consistent view of membership state
▪ Clients need not actively participate
– Retrieve state from servers or monitors when needed
– Limit the scaling requirements
Epidemic-based Node Fault Detection
▪ Similarities:
– Clients need not actively participate
– Servers exchange heartbeat messages to detect faults
▪ Differences:
– No dedicated service for distributed consensus
– Each storage server maintains its own view of the system
– Disseminate updates using epidemic principles
Semantic differences may influence the storage system design.
A. Das et al. SWIM: Scalable weakly-consistent infection-style process group membership protocol. Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN ’02). 2002.
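The epidemic style can be sketched as servers repeatedly pushing their membership views to random peers, with the freshest status (highest counter) winning on merge. This is a simplified illustration of the dissemination idea only; SWIM itself adds ping-based failure detection and suspicion timeouts:

```python
import random

def gossip_round(views, rng):
    """One epidemic round: every server pushes its view to one random
    peer; the receiver keeps the freshest status per node."""
    for server in list(views):
        peer = rng.choice([s for s in views if s != server])
        for node, (counter, status) in views[server].items():
            if node not in views[peer] or views[peer][node][0] < counter:
                views[peer][node] = (counter, status)

# Three servers; s0 has locally marked node "n7" suspect (newer counter).
views = {s: {"n7": (1, "alive")} for s in ("s0", "s1", "s2")}
views["s0"]["n7"] = (2, "suspect")

rng = random.Random(2014)
rounds = 0
while not all(v["n7"] == (2, "suspect") for v in views.values()) and rounds < 100:
    gossip_round(views, rng)
    rounds += 1
# All views converge on "suspect" after a handful of rounds, with no
# central consensus service involved.
```

Note the semantic difference from the consensus-based design above: each server's view is only eventually consistent, which the storage system must be prepared to tolerate.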
Resource Management: Next Steps
§ Revisit resource management
– Support multiple application “models”
– Separate scheduling from resource management
– Workflow support (but might be at a higher level)
§ Subsystems as applications
– Very long running
– Resource needs will change over time
§ Unified node fault detection
– Resource manager as the Oracle
Data Management Architectures
System Architecture and Nonvolatile Memory
§ Compute nodes run application processes.
§ I/O forwarding nodes (or I/O gateways) shuffle data between compute nodes and external resources, including storage.
§ Storage nodes run the parallel file system, attached to disk arrays and the external network.
§ NVM in storage nodes serves as a PFS accelerator.
§ NVM in I/O nodes provides a fast staging area and region for temporary storage.
§ NVM in compute nodes lets you add noise into your system network.
On Lakes and Cold Data
§ Companies managing large scale data repositories are moving to a “data lake” model where bulk data is stored at low cost
§ New technologies provide an opportunity for convergence of “cold store” ideas with facility-wide, low latency access
R. Miller. Facebook Builds Exabyte Data Centers for Cold Storage. Data Center Knowledge. January 18, 2013. J. Novet. First Look: Facebook’s Oregon Cold Storage Facility. Data Center Knowledge. October 16, 2013. T. Morgan. Facebook Loads Up Innovative Cold Storage Datacenter. EnterpriseTech Storage Edition. October 25, 2013.
§ Facebook’s cold store
– 62,000 square feet, up to 500 racks @ 4 PBytes/rack
– No generators or UPSes, all redundancy in software (Reed-Solomon)
– 2 kW/rack rather than standard (for them) 8 kW/rack
– Shingled magnetic recording drives
• Many drives spun down at any given moment
• Seconds to spin up and access – can be used for content delivery
– Predict facility will be filled by 2017
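The headline capacity follows from simple arithmetic; the Reed-Solomon parameters below are illustrative, not Facebook's actual configuration:

```python
racks = 500
pb_per_rack = 4
raw_capacity_pb = racks * pb_per_rack          # 2000 PB = 2 EB raw

# Reed-Solomon (k data, m parity) trades capacity for software
# redundancy; (10, 4) is used here only as an example layout.
k, m = 10, 4
usable_pb = raw_capacity_pb * k / (k + m)      # ~1428.6 PB usable
overhead = (k + m) / k                         # 1.4x raw bytes per stored byte
```

Compared with replication (3x overhead is common), erasure coding is what makes software-only redundancy affordable at this scale.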
HPC I/O Software Stack
The software used to provide data model support and to transform I/O to better perform on today’s I/O systems is often referred to as the I/O stack.
Data Model Libraries map application abstractions onto storage abstractions and provide data portability.
HDF5, Parallel netCDF, ADIOS
I/O Middleware organizes accesses from many processes, especially those using collective I/O.
MPI-IO, PLFS
I/O Forwarding transforms I/O from many clients into fewer, larger requests; reduces lock contention; and bridges between the HPC system and external storage.
IBM ciod, IOFSL, Cray DVS
Parallel file system maintains logical file model and provides efficient access to data.
PVFS, Gfarm, GPFS, Lustre
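The “fewer, larger requests” transformation performed at the forwarding (and collective I/O) layers amounts to coalescing adjacent extents. A minimal sketch:

```python
def coalesce(requests, gap=0):
    """Merge (offset, length) requests whose extents touch or overlap
    (within `gap` bytes), turning many small I/Os into fewer large
    ones, as an I/O forwarding layer might before hitting the PFS."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1] + gap:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged

# Strided 4 KiB writes from many clients collapse into two requests.
reqs = [(0, 4096), (4096, 4096), (8192, 4096), (1 << 20, 4096)]
coalesce(reqs)  # -> [(0, 12288), (1048576, 4096)]
```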
[Diagram: I/O stack layers, top to bottom: Application; Data Model Support; Transformations; Parallel File System; I/O Hardware.]
I/O Services, RPC, and the Mercury Project
Mercury
Objective: Create a reusable RPC library for use in HPC scientific libraries that can serve as a basis for services such as storage systems, I/O forwarding, analysis frameworks, and other forms of inter-application communication.
Why not reuse existing RPC frameworks?
– Do not support efficient large data transfers or asynchronous calls
– Mostly built on top of TCP/IP protocols
• Need support for native transport
• Need to be easy to port to new machines
Similar approaches with some differences indicate the need:
– I/O Forwarding Scalability Layer (IOFSL)
– NEtwork Scalable Service Interface (Nessie)
– Lustre RPC
http://www.mcs.anl.gov/projects/mercury/
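The pattern Mercury targets, a small control message with asynchronous completion (bulk data moved separately over native transports), can be illustrated with a toy queue-based RPC. This is not Mercury's real API, just the shape of an asynchronous forward/callback interaction:

```python
import queue
import threading

class ToyRpc:
    """Toy asynchronous RPC in the spirit of Mercury (NOT its real
    API): small control messages flow through a queue, large data
    would move via a separate bulk transfer, and completion is
    signaled by a callback rather than blocking the caller."""
    def __init__(self):
        self.inbox = queue.Queue()
        self.handlers = {}
        threading.Thread(target=self._serve, daemon=True).start()

    def register(self, name, fn):
        self.handlers[name] = fn

    def _serve(self):
        while True:
            name, args, on_complete = self.inbox.get()
            on_complete(self.handlers[name](*args))

    def forward(self, name, args, on_complete):
        self.inbox.put((name, args, on_complete))  # returns immediately

rpc = ToyRpc()
rpc.register("write", lambda off, nbytes: nbytes)  # pretend server-side write
done = threading.Event()
result = []
rpc.forward("write", (0, 1 << 20),
            lambda r: (result.append(r), done.set()))
done.wait(timeout=5)
# The caller was free to do other work while the "write" completed.
```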
An Alternative to File System Metadata: Provenance Graphs
What if we treat metadata management (including provenance) as a graph processing problem?
Create a Metadata Graph
• Each log file => one Job (jobid, start_time, end_time, exe)
• Each uid => one User
• All ranks => Processes (nprocs, file_access)
• File and exe => Data Objects
• Synthetically create directory structure
– Data files visited by the same execution will be placed under the same directory
– Directories accessed by the same user are placed under one directory
Whole 2013 trace from Intrepid: 42% of all core-hours consumed in 2013.
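The mapping above can be sketched as a property graph over plain dictionaries; the entity kinds and attributes follow the slide, while the specific values and the query are made up for illustration:

```python
# Nodes carry attributes; edges carry labels plus optional properties.
nodes, edges = {}, []

def add_node(nid, kind, **props):
    nodes[nid] = {"kind": kind, **props}

def add_edge(src, label, dst, **props):
    edges.append((src, label, dst, props))

# Hypothetical IDs and values, echoing the slide's example figure.
add_node("user:330862395", "User", name="john")
add_node("exec:2726768805", "Execution", params="-n 2048")
add_node("file:2111648390", "File", fs_type="gpfs")
add_edge("user:330862395", "run", "exec:2726768805")
add_edge("exec:2726768805", "write", "file:2111648390", write_size="7M")

# Metadata queries become graph traversals, e.g. "files written by
# executions run by this user":
written = [dst for src, label, dst, _ in edges
           if label == "write"
           and any(s == "user:330862395" and l == "run" and d == src
                   for s, l, d, _ in edges)]
# written == ["file:2111648390"]
```

Provenance queries ("which executions read this file?", "which user produced this dataset?") fall out of the same traversal machinery, which is the point of treating metadata management as a graph processing problem.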
[Figure: example metadata graph. User entities (name, id) connect by “run” edges to Execution entities (id, params), which connect by “exe”, “read”, and “write” edges (with properties such as timestamp and write size) to File entities (name, fs-type).]
D. Dai et al. Using Property Graphs for Rich Metadata Management in HPC Systems. PDSW Workshop. New Orleans, LA. November 2014.
[Figure: rich metadata size grows with the level of detail captured: users, applications, files, processes (I/O ranks), processes (all ranks).]
Post-Petascale Data Management Stack
[Diagram: proposed post-petascale data management stack. Application tasks, users, and analysis tasks sit atop a programming model; below are science data model services, task and data coordination with publish/subscribe, and core data services (core data model services, provenance management, metadata management, pass-through); identity and security, WAN data services, resource management and scheduling, and performance monitoring cut across; in-system storage, networking hardware, and external storage form the base.]
Input from G. Grider, S. Klasky, P. MacCormick, R. Oldfield, G. Shipman, K. van Dam, and D. Williams.
Data Management: Next Steps
§ Storage-based vs. memory-based approaches to in-system storage
§ Data lakes and tape in future storage systems?
§ Rearchitect the I/O stack!
Concluding Thoughts
One Comparison of HPC and Big Data Software
[Figure: side-by-side software stacks. High-Performance Computing: compute resources (nodes, cores, VMs) and storage resources (Lustre, GPFS) at the resource fabric, under a cluster resource manager (Slurm, Torque, SGE) with storage management (iRODS, SRM, GFFS) and data access (virtual filesystem, GridFTP, SSH); above these, workload management (pilots, Condor), orchestration (Pegasus, Taverna, Dryad, Swift), declarative languages (Swift), MapReduce frameworks (Pilot-MapReduce), and MPI frameworks for advanced analytics and machine learning (BLAS, ScaLAPACK, CompLearn, PETSc, BLAST). Apache Hadoop Big Data: compute and data resources (nodes, cores, HDFS) under a cluster resource manager (YARN, Mesos); higher-level workload management (Tez, Llama); runtimes with their own schedulers, MapReduce, in-memory (Spark), Twister, data store and processing (HBase), and SQL engines (Impala, Hive, Shark, Phoenix); orchestration (Oozie, Pig) and advanced analytics and machine learning (Mahout, R, MLbase) on top; communication via MPI, RDMA, Hadoop shuffle/reduction, and HARP collectives.]
Fig. 1. HPC and ABDS architecture and abstractions: the HPC approach historically separated data and compute; ABDS co-locates compute and data. The YARN resource manager heavily utilizes multi-level, data-aware scheduling and supports a vibrant Hadoop-based ecosystem of data processing, analytics, and machine learning frameworks. Each approach has rich, but hitherto distinct, resource management and communication capabilities.
Jha et al. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. Big Data Congress. June-July 2014.
Applications and System Services
[Diagram: an HPC application built on math/physics libraries, a programming model and task management, a data model, and communication methods, running atop the node OS and HPC system services: resource management, data management, system monitoring, identity & security, and WAN data.]
Time to revisit our model of system services in HPC systems!
Open Questions, Possible Collaboration Areas
§ Monitoring
– How do we better use predictive capabilities?
– Is there additional data that would improve our predictions?
– What more can we learn from applications without perturbing them?
§ Resource Management
– How should we perform resource management and scheduling in HPC?
– What do other system software services need from the resource manager?
§ Data Management
– What is/are the right short and long term approach(es) for managing the deep memory hierarchy?
– What algorithms/abstractions for managing data enable scalability?
– What is the right component breakdown to enable competition?
MCS Storage Team
§ Phil Carns
§ Rob Latham
§ Dries Kimpe (on leave)
§ John Jenkins
§ Shane Snyder
§ Kevin Harms
§ Dong Dai
Thank you for your time and attention!