Selective Recovery From Failures In A Task Parallel Programming Model

James Dinan*, Sriram Krishnamoorthy#, Arjun Singri*, P. Sadayappan*

*The Ohio State University
#Pacific Northwest National Laboratory

Dec 29, 2015



Faults at Scale

Future systems built with large number of components
  MTBF inversely proportional to #components

Faults will be frequent

Checkpoint-restart too expensive with numerous faults
  Strain on system components, notably file system

Assumption of fault-free operation infeasible

Applications need to think about faults
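
The inverse relationship between MTBF and component count can be made concrete under a simplifying assumption that is not stated on the slide: the N components fail independently, each with the same constant failure rate λ. In that case the failure rates add:

```latex
\[
\lambda_{\text{system}} = N\lambda
\qquad\Longrightarrow\qquad
\mathrm{MTBF}_{\text{system}} = \frac{1}{N\lambda} = \frac{\mathrm{MTBF}_{\text{component}}}{N}
\]
```

As a purely hypothetical illustration, a component MTBF of 5 years (43,800 hours) and N = 100,000 components would give a system MTBF of about 0.44 hours, roughly 26 minutes.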

Programming Models

SPMD ties computation to a process
  Fixed machine model
  Applications need to change with major architectural shifts

Fault handling involves non-local design changes
  Rely on p processes: what if one goes away?

Message-passing makes it harder
  Consistent cuts are challenging
  Message logging, etc. expensive

Fault management requires a lot of user involvement

Problem Statement

Fault management framework

Minimize user effort

Components
  Data state
    Application data
    Communication operations
  Control state
    What work is each process doing?

Continue to completion despite faults

Approach

One-sided communication model
  Easy to derive consistent cuts

Task parallel control model
  Computation decoupled from processes

User specifies computation
  Collection of tasks on global data

Runtime schedules computation
  Load balancing
  Fault management

Global Arrays (GA)

PGAS Family: UPC (C), CAF (Fortran), Titanium (Java), GA (library)

Aggregate memory from multiple nodes into global address space

Data access via one-sided get(..), put(..), acc(..) operations

Programmer controls data distribution and locality

Fully inter-operable with MPI and ARMCI

Support for higher-level collectives: DGEMM, etc.

Widely used: chemistry, sub-surface transport, bioinformatics, CFD

[Figure: a global address space with Shared and Private regions across Proc0, Proc1, ..., Procn; labels: X[M][M][N], X[1..9][1..9][1..9], X]
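
The one-sided access style can be illustrated with a toy, single-process model. This is only a sketch: `ToyGlobalArray` and its methods are hypothetical names chosen for illustration and are not the Global Arrays API, and the block distribution is an assumption. It shows that the caller names a global index range and never involves the owning process in a matching receive.

```python
# Toy single-process model of a block-distributed global array.
# Hypothetical names for illustration; NOT the Global Arrays API.
class ToyGlobalArray:
    def __init__(self, length, nprocs):
        # Assumed block distribution: each process owns one contiguous chunk.
        self.block = (length + nprocs - 1) // nprocs
        self.chunks = [
            [0.0] * min(self.block, max(0, length - p * self.block))
            for p in range(nprocs)
        ]

    def owner(self, i):
        """Rank of the process whose chunk holds global index i."""
        return i // self.block

    def get(self, lo, hi):
        """Copy global elements [lo, hi) into a local list."""
        return [self.chunks[self.owner(i)][i % self.block] for i in range(lo, hi)]

    def put(self, lo, values):
        """Overwrite global elements starting at lo."""
        for k, v in enumerate(values):
            i = lo + k
            self.chunks[self.owner(i)][i % self.block] = v

    def acc(self, lo, values):
        """Add values into global elements starting at lo."""
        for k, v in enumerate(values):
            i = lo + k
            self.chunks[self.owner(i)][i % self.block] += v


x = ToyGlobalArray(length=9, nprocs=3)
x.put(2, [1.0, 2.0, 3.0])       # spans the chunks of process 0 and process 1
x.acc(2, [10.0, 10.0, 10.0])    # accumulate into the same range
patch = x.get(2, 5)
```

In a real PGAS runtime each chunk would live in a different process's memory and the three operations would be remote memory accesses; here they are plain list updates.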

GA Memory Model

Remote memory access
  Dominant communication in GA programs
  Destination known in advance
  No receive operation or tag matching
  Remote progress
    Ensure overlap

Atomics and collectives
  Blocking
  Few outstanding at any time

Saving Data State

Data state = communication state + memory state

Communication state
  “Flush” pending RMA operations (single call)
  Save atomic and collective ops (small state)

Memory state
  Force other processes to flush their pending ops

Used in virtualized execution of GA apps (Comp. Frontiers’09)

Also enables pre-emptive migration

The Asynchronous Gap

The PGAS memory model simplifies managing data

Computation model is still regular, process-centric SPMD

Irregularity in the data can lead to load imbalance

Extend PGAS model to bridge asynchronous gap
  Dynamic, irregular view of the computation
  Runtime system should perform load balancing
  Allow for computation movement to exploit locality

[Figure: global array X[M][M][N] with a patch X[1..9][1..9][1..9] and a get(…) operation]

Control State – Task Model

Express computation as collection of tasks
  Tasks operate on data stored in Global Arrays
  Executed in collective task parallel phases

Runtime system manages task execution

[Figure: execution timeline with labels SPMD, Task Parallel, Termination, SPMD]

Task Model

• Inputs: Global data, Immediates, CLOs
• Outputs: Global data, CLOs, Child tasks
• Strict dependence: Only parent → child (for now)

[Figure: a task f(...) with In: 5, Y[0], ... and Out: X[1]; arrays X[0], X[1], ..., X[N] and Y[0], Y[1], ..., Y[N] in the shared Partitioned Global Address Space; a private CLO1 on each of Proc0, Proc1, ..., Procn]

Scioto Programming Interface

High level interface: shared global task collection

Low level interface: set of distributed task queues
  Queues are prioritized by affinity
  Use work first principle (LIFO)
  Load balancing via work stealing (FIFO)
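
The LIFO/FIFO split can be sketched with a double-ended queue. The class and method names below are hypothetical and are not Scioto's interface; the sketch only shows that the owner takes the newest task while a thief takes the oldest one from the opposite end.

```python
from collections import deque


class TaskQueue:
    """Toy double-ended task queue (illustrative names, not Scioto's API)."""

    def __init__(self):
        self._q = deque()

    def push(self, task):
        # The owning process adds new work at the tail.
        self._q.append(task)

    def pop_local(self):
        # Work-first principle: the owner runs the newest task (LIFO).
        return self._q.pop() if self._q else None

    def steal(self):
        # A thief removes the oldest task from the head (FIFO).
        return self._q.popleft() if self._q else None


q = TaskQueue()
for t in ["t0", "t1", "t2"]:
    q.push(t)

local = q.pop_local()   # newest task
stolen = q.steal()      # oldest task
```

Taking from opposite ends keeps the owner and the thief away from each other's end of the queue for as long as more than one task remains.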

Work Stealing Runtime System

ARMCI task queue on each processor
  Steals don’t interrupt remote process

When a process runs out of work
  Select a victim at random and steal work from them

Scaled to 8192 cores (SC’09)
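
Random victim selection can be illustrated with a small sequential simulation. This is a sketch under stated assumptions: `run_work_stealing` is a hypothetical helper, processes take turns in a loop rather than running concurrently, all work starts on process 0, and a steal moves exactly one task. In the real runtime the steal is a one-sided operation that does not interrupt the victim.

```python
import random
from collections import deque


def run_work_stealing(num_procs, tasks, seed=0):
    """Simulate idle processes stealing from randomly chosen victims."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_procs)]
    queues[0].extend(tasks)                  # assumption: all work starts on rank 0
    executed = [[] for _ in range(num_procs)]
    remaining = len(tasks)
    while remaining > 0:
        for p in range(num_procs):
            if queues[p]:
                # Local work available: run the newest task.
                executed[p].append(queues[p].pop())
                remaining -= 1
            else:
                # Out of work: pick a random victim and take its oldest task.
                victim = rng.choice([v for v in range(num_procs) if v != p])
                if queues[victim]:
                    queues[p].append(queues[victim].popleft())
    return executed


executed = run_work_stealing(num_procs=4, tasks=list(range(20)))
all_done = sorted(t for per_proc in executed for t in per_proc)
```

Every task is executed exactly once regardless of which process ends up running it, which is the property that lets the runtime move work freely.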

Communication Markers

Communication initiated by a failed process

Handling partial completions
  Get(), Put() are idempotent: ignore
  Acc() non-idempotent
    Mark beginning and end of acc() ops

Overhead
  Memory usage: proportional to # tasks
  Communication: additional small messages
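
A small sketch of why only acc() needs markers. The helper names and the dictionary used to hold markers are hypothetical; the 'started' and 'contributed' states are the ones named later in the slides. Replaying a put leaves the data unchanged, while replaying an acc double-counts the contribution.

```python
# Marker states; 'started' and 'contributed' come from the slides,
# the storage layout (one dict entry per task) is an assumption.
STARTED, CONTRIBUTED = "started", "contributed"


def put(dst, values):
    """Overwrite: applying it twice gives the same result (idempotent)."""
    dst[:] = values


def acc(dst, values):
    """Accumulate: applying it twice adds the contribution twice."""
    for i, v in enumerate(values):
        dst[i] += v


def marked_acc(dst, values, markers, task_id):
    """Bracket the accumulate with two small marker updates."""
    markers[task_id] = STARTED
    acc(dst, values)
    markers[task_id] = CONTRIBUTED


contribution = [1.5, 2.5]

block = [0.0, 0.0]
put(block, contribution)
put(block, contribution)          # replay is harmless
after_two_puts = list(block)

block = [0.0, 0.0]
acc(block, contribution)
acc(block, contribution)          # replay corrupts the result
after_two_accs = list(block)

block = [0.0, 0.0]
markers = {}
marked_acc(block, contribution, markers, task_id=7)
```

A marker left in the 'started' state after a failure identifies data that may hold a partial contribution; the cost is one marker per task plus two small messages per accumulate.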

Fault Tolerant Task Pool

Re-execute incomplete tasks till a round without failures
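
The round structure can be sketched as a loop that repeats until a round completes with no failure. This is a simplified model, not the authors' implementation: `run_task_pool` and the `failures_by_round` fault-injection list are hypothetical, tasks are placed round-robin instead of being stolen, and only lost task executions are modelled (not lost data blocks).

```python
def run_task_pool(tasks, num_procs, failures_by_round):
    """Run every task to completion despite injected process failures.

    tasks: dict mapping task id -> zero-argument callable.
    failures_by_round: list of sets of ranks that fail in each round
    (hypothetical fault injection; later rounds have no failures).
    """
    alive = list(range(num_procs))
    results = {}
    pending = sorted(tasks)
    rounds = 0
    while True:
        if not alive:
            raise RuntimeError("no surviving processes")
        injected = failures_by_round[rounds] if rounds < len(failures_by_round) else set()
        failed = set(alive) & set(injected)
        for i, tid in enumerate(pending):
            proc = alive[i % len(alive)]      # assumed round-robin placement
            if proc not in failed:
                results[tid] = tasks[tid]()   # completed on a surviving process
        alive = [p for p in alive if p not in failed]
        pending = [t for t in pending if t not in results]
        rounds += 1
        if not failed:                        # a round without failures: done
            return results, rounds


squares = {i: (lambda i=i: i * i) for i in range(6)}
results, rounds = run_task_pool(squares, num_procs=3, failures_by_round=[{1}])
```

In this example rank 1 fails in the first round, so only the two tasks assigned to it are re-executed on the survivors in the second round.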

Task Execution

Update result only if it has not already been modified

Detecting Incomplete Communication

Data with ‘started’ set but not ‘contributed’

Approach 1: “Naïve” scheme
  Check all markers for any that remain ‘started’
  Not scalable

Approach 2: “Home-based” scheme
  Invert the task-to-data mapping
  Distributed meta-data check + all-to-all
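
The two approaches can be contrasted in a short sketch. The slides give no further detail, so this is one plausible reading rather than the authors' algorithm: the function names, the marker dictionary, and the `task_to_block` and `block_owner` mappings are all assumptions. The home-based variant groups tasks by the process that owns the data they update, lets each owner check only its own group, and models the all-to-all exchange as a merge of the per-owner results.

```python
from collections import defaultdict


def naive_scan(markers):
    """Examine every marker in one place: work grows with the task count."""
    return sorted(t for t, state in markers.items() if state == "started")


def home_based_scan(markers, task_to_block, block_owner, num_procs):
    """Each data owner checks only the markers of tasks that update its blocks."""
    # Invert the task -> block mapping into owner -> tasks.
    by_home = defaultdict(list)
    for task, blk in task_to_block.items():
        by_home[block_owner[blk]].append(task)
    # Distributed check: one small local scan per owner.
    local = [
        [t for t in by_home[p] if markers[t] == "started"]
        for p in range(num_procs)
    ]
    # All-to-all exchange, modelled here as a simple merge.
    return sorted(t for found in local for t in found)


markers = {0: "contributed", 1: "started", 2: "contributed", 3: "started"}
task_to_block = {0: "A", 1: "B", 2: "A", 3: "C"}
block_owner = {"A": 0, "B": 1, "C": 1}

naive = naive_scan(markers)
home = home_based_scan(markers, task_to_block, block_owner, num_procs=2)
```

Both scans report the same incomplete tasks; the difference is that the home-based check splits the scan across the data owners instead of concentrating it.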

Algorithm Characteristics

Tolerance to arbitrary number of failures

Low overhead in absence of failures
  Small messages for markers
  Can be optimized through pre-issue/speculation

Space overhead proportional to task pool size
  Storage for markers

Recovery cost proportional to #failures
  Redo work to produce data in failed processes

Bounding Cascading Failures

A process with “corrupted” data
  Incomplete comm. from failed process
  Marking it as failed -> cascade failures

A process with “corrupted” data
  Flushes its communication; then recovers its data

Each task computes only a few data blocks
  Each process: pending comm. to few blocks at a time

Total recovery cost
  Data in failed processes + a small additional number

Experimental Setup

Linux cluster

Each node
  Dual quad-core 2.5GHz Opterons
  24GB RAM

InfiniBand interconnection network

Self-Consistent Field (SCF) kernel – 48 Be atoms

Worst case fault – at the end of a task pool

Cost of Failure – Strong Scaling

#tasks re-executed goes down with increase in process count
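
A back-of-the-envelope reading of this trend, under assumptions that are not taken from the measurements: if T tasks of similar size are spread evenly over P processes, a single failed process holds the results of roughly T/P tasks, so the work to redo shrinks as P grows.

```latex
\[
W_{\text{redo}} \approx \frac{T}{P}
\qquad\text{(one failure, } T \text{ tasks evenly distributed over } P \text{ processes)}
\]
```

For example, with a hypothetical T = 10,000 tasks, doubling P from 64 to 128 would halve the estimate from about 156 to about 78 re-executed tasks.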

Worst Case Failure Cost

Relative Performance

Less than 10% cost for one worst case fault

Related Work

Checkpoint restart
  Continues to handle the SPMD portion of an app
  Finer-grain recoverability using our approach

BOINC: client-server

CilkNOW: single assignment form

Linda: requires transactions

CHARM++
  Processor virtualization based
  Needs message logging

Efforts on fault tolerant runtimes
  Complements this work

Conclusions

Fault tolerance through
  PGAS memory model
  Task parallel computation model

Fine-grain recoverability through markers

Cost of failure proportional to #failures

Demonstrated low cost recovery for an SCF kernel

Thank You!
