A New Approach in the System Software Design for Large-Scale Parallel Computers
GACOP JACCA Meeting - February 27, 2004

Juan Fernández 1,2, Eitan Frachtenberg 1, Fabrizio Petrini 1, Salvador Coll 1 and José C. Sancho 1

1 Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
2 Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es

email: {juanf,eitanf,fabrizio,scoll,jcsancho}@lanl.gov

System software is a key factor to maximize usability, performance and scalability on large-scale systems!

Hardware / OSs are glued together by System Software:
– Resource Management
– Communications
– Parallel Development and Debugging Tools
– Parallel File System
– Fault Tolerance

Motivation
System software complexity is due to multiple factors:
– Extremely complex global state
– Non-deterministic behavior inherent to computing systems and parallel apps
– Local OSs lack global awareness of parallel apps
– Independent design of different components
User-level applications rely on system software

Outline
Motivation
Goals
Core Primitives
Resource Management
Communication Libraries
Ongoing and future work

Goals
Target
– Simplifying design and implementation of the system software for large-scale parallel computers
– Simplicity, performance, scalability, determinism
Approach
– Built atop a basic set of three primitives
– Global synchronization/scheduling
Vision
– SIMD system running MIMD applications (variable granularity in the order of hundreds of μs)

Core Primitives
System software built atop three primitives:
Xfer-And-Signal
– Transfer block of data to a set of nodes
– Optionally signal local/remote event upon completion
Compare-And-Write
– Compare global variable on a set of nodes
– Optionally write global variable on the same set of nodes
Test-Event
– Poll local event
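
The semantics of the three primitives can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the `Node` class and the function names `xfer_and_signal`, `compare_and_write` and `poll_event` (standing in for Test-Event) are hypothetical, the comparison is limited to equality, and ordinary Python loops stand in for what the slides describe as network-level operations.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One simulated node (hypothetical model): global variables, buffers, events."""
    variables: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    events: set = field(default_factory=set)


def xfer_and_signal(nodes, dest_ids, key, data, remote_event=None):
    """Copy `data` into memory[key] on every destination node and,
    if requested, signal `remote_event` on each of them afterwards."""
    for i in dest_ids:
        nodes[i].memory[key] = data
        if remote_event is not None:
            nodes[i].events.add(remote_event)


def compare_and_write(nodes, node_ids, var, expected,
                      write_var=None, write_value=None):
    """Return True only if `var == expected` on every node in the set.
    When the comparison holds and a write is requested, update
    `write_var` on that same set of nodes. Equality only (assumption)."""
    node_ids = list(node_ids)
    ok = all(nodes[i].variables.get(var) == expected for i in node_ids)
    if ok and write_var is not None:
        for i in node_ids:
            nodes[i].variables[write_var] = write_value
    return ok


def poll_event(node, event):
    """Test-Event: non-blocking check of a local event."""
    return event in node.events
```

For instance, after `xfer_and_signal(nodes, [1, 2, 3], "binary", b"...", remote_event="arrived")`, each destination can call `poll_event(nodes[i], "arrived")` to learn that its copy is complete, without any further message exchange.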

Core Primitives

Characteristic   Requirement                     Solution
Job Launching    Data dissemination              Xfer-And-Signal
                 Flow Control                    Compare-And-Write
                 Termination Detection           Compare-And-Write
Job Scheduling   Heartbeat                       Xfer-And-Signal
                 Context switch responsiveness   Prioritized messages / Multiple Rails
Communication    PUT                             Xfer-And-Signal
                 GET                             Xfer-And-Signal
                 Barrier                         Compare-And-Write
                 Broadcast                       Compare-And-Write+Xfer-And-Signal
                 Reduce                          Xfer-And-Signal / “Smart” NIC

The proposed mechanisms simplify design and implementation!!!
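
To make two of the mappings concrete, the sketch below builds a barrier from Compare-And-Write alone and a flow-controlled broadcast from Compare-And-Write plus Xfer-And-Signal. It is a toy model, not the authors' implementation: nodes are plain dictionaries, and the variable names `arrived`, `released`, `buffer_free` and the event name `bcast_done` are invented for illustration.

```python
def compare_and_write(nodes, var, expected, write_var=None, write_value=None):
    """True only if every node has var == expected; optionally write on all."""
    ok = all(n.get(var) == expected for n in nodes)
    if ok and write_var is not None:
        for n in nodes:
            n[write_var] = write_value
    return ok


def xfer_and_signal(nodes, key, data, event):
    """Deliver `data` to every node and raise `event` on each of them."""
    for n in nodes:
        n[key] = data
        n.setdefault("events", set()).add(event)


def barrier_poll(nodes, epoch):
    """One polling step of a barrier: succeeds once all nodes have reached
    `epoch`, and in that case releases them by writing `released = epoch`."""
    return compare_and_write(nodes, "arrived", epoch, "released", epoch)


def broadcast(nodes, data):
    """Broadcast with flow control: send only if every receive buffer is free."""
    if not compare_and_write(nodes, "buffer_free", True, "buffer_free", False):
        return False  # some receiver is still busy; the caller retries later
    xfer_and_signal(nodes, "buffer", data, "bcast_done")
    return True
```

In this model a single global query answers "has everyone arrived?" or "is everyone ready to receive?", so neither operation needs per-node acknowledgement messages.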

Core Primitives
Implementation
– Global, virtually addressable shared memory
– Remote Direct Memory Access (RDMA)
– Hardware-supported multicast
– Hardware-supported global query
– Computing capability in the NIC
Portability
– InfiniBand, BlueGene/L, QsNet

Resource Management
STORM: Scalable TOol for Resource Management [1,2]
Job launching
– binary and data dissemination
– actual launching of a parallel job
– reporting of job termination
Job scheduling
– FCFS, gang scheduling, ... [3]
– new scheduling algorithms can be “plugged in”
Heartbeat/strobe at regular intervals (time slices)
Monitoring
Built atop the three core primitives

[1] “Scalable Resource Management in High Performance Computers.” E. Frachtenberg, J. Fernández, F. Petrini, and S. Coll. Cluster'02.
[2] “STORM: Lightning-Fast Resource Management.” E. Frachtenberg, J. Fernández, F. Petrini, S. Pakin and S. Coll. SC'02.
[3] “Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources.” E. Frachtenberg, D. G. Feitelson, F. Petrini and J. Fernández. IPDPS'03.
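
The idea of a heartbeat that drives pluggable scheduling policies can be sketched as follows. This is not STORM code: `run_strobes`, `fcfs` and `round_robin_gang` are hypothetical names, jobs are simple labels, and each loop iteration stands for one strobe that reaches every node at the start of a time slice.

```python
from collections import deque


def fcfs(queue, current):
    """First-come first-served: always run the job at the head of the queue."""
    return queue[0] if queue else None


def round_robin_gang(queue, current):
    """Gang scheduling: give each job one whole time slice in turn."""
    if not queue:
        return None
    if current is not None and queue[0] == current:
        queue.rotate(-1)  # move the job that just ran to the back
    return queue[0]


def run_strobes(jobs, num_nodes, num_slices, policy):
    """Simulate `num_slices` heartbeats and return, per slice,
    the job that each node runs during that slice."""
    queue = deque(jobs)
    current = None
    timeline = []
    for _ in range(num_slices):
        current = policy(queue, current)
        # The strobe reaches all nodes, so they all switch to the same job.
        timeline.append([current] * num_nodes)
    return timeline
```

Calling `run_strobes(["A", "B"], 2, 4, round_robin_gang)` alternates the two jobs across slices, with both nodes always running the same job. Swapping in `fcfs` changes the policy without touching the heartbeat loop, which is the sense in which a policy is "plugged in" here.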

Communication Libraries
BCS-MPI: Buffered Coscheduled MPI [4]
Global synchronization [5]
– Heartbeat/strobe sent at regular intervals (time slices)
– All system activities are tightly coupled
Global Scheduling
– Exchange of communication requirements
– Communication scheduling
– Perform real transmission and reduce computations [6]
Implementation on the NIC (Elan3 - QsNet)
Built atop the three core primitives

[4] “BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers.” J. Fernández, E. Frachtenberg, and F. Petrini. SC'03.
[5] “Scalable Collective Communication on the ASCI Q Machine.” J. Fernández, E. Frachtenberg, and F. Petrini. HOTi'11.
[6] “Scalable NIC-based Reduction on Large-scale Clusters.” A. Moody, J. Fernández, F. Petrini and D. K. Panda. SC'03.

Communication Libraries
BCS-MPI: real-time communication scheduling
[Figure: timeline of one time slice (hundreds of μs). Labels: Global Strobe (time slice starts); Global Strobe (time slice ends); Exchange of comm requirements; Communication scheduling; Real transmission; Global Synchronization (shown twice).]
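
The three phases named in the figure (exchange of requirements, scheduling, real transmission) can be mimicked in a few lines. The sketch below is a simplification under assumptions of my own: `run_time_slice`, the tuple layout of a message and the per-slice byte budget `max_bytes_per_slice` are all hypothetical, and the exchange phase is trivial because both queues are already visible to the single simulated scheduler.

```python
def run_time_slice(pending_sends, pending_recvs, max_bytes_per_slice):
    """Run one simulated time slice and return the messages delivered in it.

    pending_sends: list of (src, dst, payload) tuples posted by senders
    pending_recvs: set of (src, dst) pairs for which a receive was posted
    Both containers are updated in place; unscheduled work stays buffered.
    """
    # Phase 1: exchange of communication requirements (a snapshot here).
    sends = list(pending_sends)

    # Phase 2: communication scheduling. Match sends to posted receives,
    # in posting order, while the byte budget of this slice allows it.
    scheduled = []
    budget = max_bytes_per_slice
    for msg in sends:
        src, dst, payload = msg
        if (src, dst) in pending_recvs and len(payload) <= budget:
            scheduled.append(msg)
            budget -= len(payload)
            pending_recvs.discard((src, dst))

    # Phase 3: real transmission of the scheduled messages only.
    delivered = []
    for msg in scheduled:
        pending_sends.remove(msg)
        delivered.append(msg)
    return delivered
```

Messages without a matching receive, or that do not fit in the budget, simply wait for a later slice. Because the schedule depends only on the posted requests and their order, replaying the same requests yields the same sequence of deliveries in this model.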

Ongoing and future work
Improved system utilization
– Scheduling multiple jobs
QoS for different types of traffic
– Scheduling messages may provide traffic segregation
Transparent fault tolerance [7]
– BCS-MPI simplifies the state of the machine
Kernel-level implementation of BCS-MPI
– User-level solution is already working
Deterministic replay of MPI programs
– Ordered resource scheduling may enforce reproducibility

[7] “On the Feasibility of Incremental Checkpointing for Scientific Computing.” J. C. Sancho, F. Petrini, G. Johnson, J. Fernández and E. Frachtenberg. IPDPS'04.