Selective Recovery From Failures In A Task Parallel Programming Model
James Dinan*, Sriram Krishnamoorthy#, Arjun Singri*, P. Sadayappan*
*The Ohio State University   #Pacific Northwest National Laboratory
Faults at Scale
Future systems built with a large number of components
MTBF inversely proportional to the number of components
Faults will be frequent
Checkpoint-restart too expensive with numerous faults
Strain on system components, notably the file system
Assumption of fault-free operation infeasible
Applications need to think about faults
Programming Models
SPMD ties computation to a process
Fixed machine model
Applications need to change with major architectural shifts
Fault handling involves non-local design changes
Rely on p processes: what if one goes away?
Message-passing makes it harder
Consistent cuts are challenging
Message logging, etc. expensive
Fault management requires a lot of user involvement
Problem Statement
Fault management framework
Minimize user effort
Components
Data state
Application data
Communication operations
Control state
What work is each process doing?
Continue to completion despite faults
Approach
One-sided communication model
Easy to derive consistent cuts
Task parallel control model
Computation decoupled from processes
User specifies computation
Collection of tasks on global data
Runtime schedules computation
Load balancing
Fault management
Global Arrays (GA)
PGAS family: UPC (C), CAF (Fortran), Titanium (Java), GA (library)
Aggregate memory from multiple nodes into global address space
Data access via one-sided get(..), put(..), acc(..) operations (see the sketch after this list)
Programmer controls data distribution and locality
Fully inter-operable with MPI and ARMCI
Support for higher-level collectives – DGEMM, etc.
Widely used – chemistry, sub-surface transport, bioinformatics, CFD
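As a rough illustration of the one-sided access model, here is a minimal sketch against the GA C API; the array dimensions, MA buffer sizes, and patch bounds are illustrative and not taken from the talk.

```c
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);        /* local buffer space for GA   */

    int dims[2]  = {1000, 1000};
    int chunk[2] = {-1, -1};                 /* let GA pick the distribution */
    int g_x = NGA_Create(C_DBL, 2, dims, "X", chunk);
    GA_Zero(g_x);

    /* One-sided get: fetch a 10x10 patch into a local buffer.  The owner
       posts no receive and does no tag matching. */
    double buf[10][10];
    int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
    NGA_Get(g_x, lo, hi, buf, ld);

    /* ... compute locally on buf ... */

    double one = 1.0;
    NGA_Acc(g_x, lo, hi, buf, ld, &one);     /* one-sided accumulate back    */
    GA_Sync();                               /* collective: complete all RMA */

    GA_Destroy(g_x);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```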
[Figure: GA memory model – a shared global address space spanning Proc0 … Procn with per-process private memory; a global array X[M][M][N] is accessed by section, e.g. X[1..9][1..9][1..9]]
GA Memory Model
Remote memory access
Dominant communication in GA programs
Destination known in advance
No receive operation or tag matching
Remote progress
Ensure overlap
Atomics and collectives
Blocking
Few outstanding at any time
Saving Data State
Data state = communication state + memory state
Communication state
“Flush” pending RMA operations (single call; see the sketch below)
Save atomic and collective ops (small state)
Memory state
Force other processes to flush their pending ops
Used in virtualized execution of GA apps (Comp. Frontiers’09)
Also enables pre-emptive migration
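A minimal sketch of those two steps, assuming the snapshot is taken collectively between task-parallel phases and the global array is 2-D; write_blocks_to_stable_storage() is a hypothetical helper, not part of GA.

```c
#include "ga.h"
#include "armci.h"

void write_blocks_to_stable_storage(int g_a, int lo[], int hi[]);  /* hypothetical */

/* Called collectively on every process between task-parallel phases. */
void checkpoint_data_state(int g_a)
{
    /* Communication state: complete every one-sided operation this process
       has issued ("flush pending RMA operations" in a single call). */
    ARMCI_AllFence();

    /* Memory state: GA_Sync() (fence + barrier) guarantees all other
       processes have also flushed, so no in-flight put/acc can still land
       in the data we are about to copy. */
    GA_Sync();

    /* Copy the patch of the global array owned by this process. */
    int lo[2], hi[2];
    NGA_Distribution(g_a, GA_Nodeid(), lo, hi);
    write_blocks_to_stable_storage(g_a, lo, hi);
}
```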
The Asynchronous Gap
The PGAS memory model simplifies managing data
Computation model is still regular, process-centric SPMD
Irregularity in the data can lead to load imbalance
Extend PGAS model to bridge asynchronous gap
Dynamic, irregular view of the computation
Runtime system should perform load balancing
Allow for computation movement to exploit locality
[Figure: one-sided get(…) of a section X[1..9][1..9][1..9] from the distributed array X[M][M][N]]
Control State – Task Model
Express computation as a collection of tasks
Tasks operate on data stored in Global Arrays
Executed in collective task parallel phases
Runtime system manages task execution
[Figure: execution alternates between SPMD phases and task parallel phases; each task parallel phase ends with termination]
Task Model
• Inputs: Global data, Immediates, CLOs
• Outputs: Global data, CLOs, Child tasks
• Strict dependence: Only parent → child (for now)
(A minimal task sketch follows the figure below.)
[Figure: a task with function f(...), inputs 5, Y[0], ..., output X[1], and common local object CLO1 on each process; arrays X[0..N] and Y[0..N] live in the shared partitioned global address space across Proc0 … Procn, above per-process private memory]
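A hypothetical task descriptor and execution routine in the spirit of this model; the real Scioto types and calls differ, and the block size, field names, and add_child_task hook are assumptions.

```c
#include "ga.h"

#define NDIM 2
#define BLK  64

/* Inputs: a global-array patch, an immediate, and a CLO handle.
   Outputs: a global-array patch (and, optionally, child tasks). */
typedef struct {
    int    in_ga,  in_lo[NDIM],  in_hi[NDIM];    /* input patch, e.g. Y[0]  */
    int    out_ga, out_lo[NDIM], out_hi[NDIM];   /* output patch, e.g. X[1] */
    double immediate;                            /* e.g. the constant 5     */
    int    clo_id;                               /* common local object     */
} task_t;

void add_child_task(task_t *child);              /* hypothetical spawn hook */

/* Run one task: one-sided get of the input, local compute, one-sided
   accumulate of the output. */
void run_task(task_t *t)
{
    double in[BLK * BLK], out[BLK * BLK], one = 1.0;
    int ld[1] = { BLK };

    NGA_Get(t->in_ga, t->in_lo, t->in_hi, in, ld);

    for (int i = 0; i < BLK * BLK; i++)          /* placeholder for f(...)  */
        out[i] = t->immediate * in[i];

    NGA_Acc(t->out_ga, t->out_lo, t->out_hi, out, ld, &one);
    /* add_child_task(...) here would create the strict parent -> child
       dependences the model permits. */
}
```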
Scioto Programming Interface
High level interface: shared global task collection
Low level interface: set of distributed task queues
Queues are prioritized by affinity
Use work first principle (LIFO)
Load balancing via work stealing (FIFO)
Work Stealing Runtime System
ARMCI task queue on each processor
Steals don’t interrupt remote process
When a process runs out of work
Select a victim at random and steal work from them
Scaled to 8192 cores (SC’09)
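The queue policies above might be organized as in the following sketch; queue_pop_tail, queue_steal_head, global_termination_detected, me, and nproc are assumptions standing in for the real Scioto/ARMCI machinery.

```c
#include <stdlib.h>

typedef struct { int id; /* remaining fields elided */ } task_t;

/* Hypothetical helpers (not the real Scioto API): */
int  queue_pop_tail(int proc, task_t *t);     /* LIFO pop from own queue      */
int  queue_steal_head(int victim, task_t *t); /* one-sided FIFO steal, so the
                                                 victim is never interrupted  */
int  global_termination_detected(void);
void run_task(task_t *t);
extern int me, nproc;

void task_parallel_phase(void)
{
    task_t t;
    for (;;) {
        /* Work first: run locally generated work in LIFO order. */
        while (queue_pop_tail(me, &t))
            run_task(&t);

        /* Out of work: pick a victim at random and steal from the head
           (FIFO end) of its queue. */
        int victim = rand() % nproc;
        if (victim == me)
            continue;
        if (queue_steal_head(victim, &t))
            run_task(&t);
        else if (global_termination_detected())
            break;                            /* no work anywhere: phase ends */
    }
}
```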
Communication Markers
Communication initiated by a failed process
Handling partial completions
Get(), Put() are idempotent – ignore
Acc() non-idempotent
Mark beginning and end of acc() ops (sketched below)
Overhead
Memory usage – proportional to # tasks
Communication – additional small messages
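The acc() marker protocol might look like the sketch below; the marker array g_marks (one integer slot per task, homed with the data block the task writes) and the state encoding are assumptions.

```c
#include "ga.h"

enum { NOT_STARTED = 0, STARTED = 1, CONTRIBUTED = 2 };

/* Write one small marker value into the global marker array g_marks. */
static void marker_put(int g_marks, int task_id, int state)
{
    int lo[1] = { task_id }, hi[1] = { task_id }, ld[1] = { 1 };
    NGA_Put(g_marks, lo, hi, &state, ld);          /* extra small message */
}

/* acc() wrapped with begin/end markers.  Get/put need no markers because
   they are idempotent; a half-finished acc is detectable because its
   marker is left in STARTED. */
void marked_acc(int g_marks, int task_id,
                int g_a, int lo[], int hi[], double *buf, int ld[])
{
    double one = 1.0;

    GA_Init_fence();
    marker_put(g_marks, task_id, STARTED);
    GA_Fence();                 /* marker visible before the acc begins      */

    GA_Init_fence();
    NGA_Acc(g_a, lo, hi, buf, ld, &one);
    GA_Fence();                 /* acc fully delivered before we mark done   */

    marker_put(g_marks, task_id, CONTRIBUTED);
}
```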
Detecting Incomplete Communication
Data with ‘started’ set but not ‘contributed’
Approach 1: “Naïve” scheme
Check all markers for any that remain ‘started’
Not scalable
Approach 2: “Home-based” scheme
Invert the task-to-data mapping
Distributed meta-data check + all-to-all
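A sketch of the home-based check, assuming the marker array g_marks from the previous sketch: each process scans only the markers it homes, and the small per-process results are then combined collectively (an MPI reduction stands in here for the all-to-all exchange).

```c
#include <stdlib.h>
#include <mpi.h>
#include "ga.h"

enum { NOT_STARTED = 0, STARTED = 1, CONTRIBUTED = 2 };

/* Returns the global number of tasks whose acc began but never completed;
   the ids homed on this process are left in `incomplete`. */
int find_incomplete(int g_marks, int *incomplete, int max_out)
{
    int lo[1], hi[1], ld[1] = { 1 };
    NGA_Distribution(g_marks, GA_Nodeid(), lo, hi);   /* markers homed here */

    int n_local = hi[0] - lo[0] + 1;
    int n_bad   = 0;
    if (n_local > 0) {
        int *marks = malloc(n_local * sizeof *marks);
        NGA_Get(g_marks, lo, hi, marks, ld);          /* purely local read  */
        for (int i = 0; i < n_local && n_bad < max_out; i++)
            if (marks[i] == STARTED)                  /* started, not contributed */
                incomplete[n_bad++] = lo[0] + i;
        free(marks);
    }

    /* Exchange the small per-process results (the ids would travel in an
       all-to-all / allgather; only the count is combined here). */
    int total = 0;
    MPI_Allreduce(&n_bad, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    return total;
}
```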
Algorithm Characteristics
Tolerance to arbitrary number of failures
Low overhead in absence of failures
Small messages for markers
Can be optimized through pre-issue/speculation
Space overhead proportional to task pool size
Storage for markers
Recovery cost proportional to #failures
Redo work to produce data in failed processes
Bounding Cascading Failures
A process with “corrupted” data
Incomplete comm. from failed process
Marking it as failed → cascading failures
A process with “corrupted” data
Flushes its communication; then recovers its data
Each task computes only a few data blocks
Each process: pending comm. to few blocks at a time
Total recovery cost
Data in failed processes + a small additional number
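A sketch of what that bounded recovery might look like on a surviving process whose memory may hold a partial acc; block_of, zero_block, and reenqueue_producers are hypothetical helpers, and the incomplete list is assumed to come from the home-based check above.

```c
#include "ga.h"
#include "armci.h"

int  block_of(int task_id);                 /* hypothetical: task -> data block */
void zero_block(int g_a, int block);        /* hypothetical: discard the block  */
void reenqueue_producers(int block);        /* hypothetical: redo its tasks     */

/* Instead of being marked failed (which would cascade), a process with a
   possibly corrupted block flushes and repairs just the affected blocks. */
void recover_corrupted(int g_a, const int *incomplete, int n_incomplete)
{
    /* Flush this process's own outstanding one-sided operations. */
    ARMCI_AllFence();

    /* Each process only has pending communication to a few blocks at a
       time, so this loop, plus re-running the tasks that produce data
       owned by the failed processes, bounds the total recovery cost. */
    for (int i = 0; i < n_incomplete; i++) {
        int blk = block_of(incomplete[i]);
        zero_block(g_a, blk);
        reenqueue_producers(blk);
    }
}
```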
Experimental Setup
Linux cluster
Each node
Dual quad-core 2.5GHz Opterons
24GB RAM
Infiniband interconnection network
Self-Consistent Field (SCF) kernel – 48 Be atoms
Worst case fault – at the end of a task pool
Related Work
Checkpoint restart
Continues to handle the SPMD portion of an app
Finer-grain recoverability using our approach
BOINC – client-server
CilkNOW – single assignment form
Linda – requires transactions
CHARM++
Processor virtualization based
Needs message logging
Efforts on fault tolerant runtimes
Complement this work
Conclusions
Fault tolerance through
PGAS memory model
Task parallel computation model
Fine-grain recoverability through markers
Cost of failure proportional to #failures
Demonstrated low cost recovery for an SCF kernel