
Rebound: Scalable Checkpointing for Coherent Shared Memory

Rishi Agarwal, Pranav Garg, and Josep Torrellas

Department of Computer Science, University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu


Checkpointing in Shared-Memory MPs

• HW-based schemes for small CMPs use global checkpointing
  – All processors participate in system-wide checkpoints

• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…

[Figure: processors P1 to P4 save global checkpoints together; when a fault occurs, all of them roll back to the last checkpoint.]


Alternative: Coordinated Local Checkpointing

• Idea: threads coordinate their checkpointing in groups

• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant

+ Scalable: checkpoint and rollback in processor groups
– Complexity: inter-thread dependences must be recorded dynamically

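[Figure: global checkpointing vs. local checkpointing across P1 to P5; a global checkpoint spans all processors, while local checkpoints cover only groups of processors.]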


Contributions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages the directory protocol to track inter-thread dependences

• Optimizations to boost checkpointing efficiency:
  – Delaying write-back of data to safe memory at checkpoints
  – Supporting multiple checkpoints
  – Optimizing checkpointing at barrier synchronization

• Avg. performance overhead for 64 processors: 2%
  – Compared to 15% for global checkpointing


Background: In-Memory Checkpt with ReVive

[Figure: ReVive in-memory checkpointing [Prvulovic-02] for P1 to P3 with caches, memory, and log. During execution, writebacks (W) of displaced dirty cache lines reach memory and the old values are logged. At a checkpoint (CHK), the application stalls while registers are dumped and the remaining dirty cache lines are written back (WB).]


Background: In-Memory Checkpt with ReVive

[Prvulovic-02]

• On a fault:
  – Old registers are restored
  – Caches are invalidated
  – Memory lines are reverted from the log

• Global checkpointing: broadcast protocol
• Local coordinated checkpointing: scalable protocol
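The logging and rollback steps above can be sketched in a few lines of Python. This is a minimal illustration, not ReVive's hardware design: the class name `LoggedMemory`, its methods, and the rule "log the old value on the first writeback of a line in each interval" are assumptions made for the example, and `None` is used as a stand-in for "line had no previous value".

```python
class LoggedMemory:
    """Hypothetical model: main memory plus an undo log of old line values."""

    def __init__(self):
        self.memory = {}      # line address -> value held in main memory
        self.log = []         # (address, old value) pairs for the current interval
        self.logged = set()   # addresses already logged in this interval

    def write_back(self, addr, value):
        # Assumption: only the first writeback per interval saves the old value.
        if addr not in self.logged:
            self.log.append((addr, self.memory.get(addr)))
            self.logged.add(addr)
        self.memory[addr] = value

    def checkpoint(self):
        # A new checkpoint makes the current memory state the recovery point.
        self.log.clear()
        self.logged.clear()

    def rollback(self):
        # Revert memory lines by replaying the log from newest to oldest.
        for addr, old in reversed(self.log):
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.checkpoint()


mem = LoggedMemory()
mem.write_back(0x40, 1)
mem.checkpoint()
mem.write_back(0x40, 2)
mem.write_back(0x80, 7)
mem.write_back(0x40, 3)
before_rollback = dict(mem.memory)
mem.rollback()
after_rollback = dict(mem.memory)
```

After the rollback, memory holds only what it held at the last checkpoint; register restoration and cache invalidation are left out of the sketch.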


Coordinated Local Checkpointing Rules

• Banatre et al. used Coordinated Local checkpointing for bus-based machines [Banatre96]

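[Figure: P1 writes x (wr x) and P2 reads x (rd x), making P1 the producer and P2 the consumer. If the consumer checkpoints, the producer checkpoints too; if the producer rolls back, the consumer rolls back too.]

• P checkpoints ⇒ P's producers checkpoint
• P rolls back ⇒ P's consumers roll back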
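Applied repeatedly, the two rules define which processors must act together. The sketch below is an illustration under assumptions: communication is given as a list of (producer, consumer) pairs, and the helper names `build_dep_maps` and `closure` are hypothetical.

```python
def build_dep_maps(communications):
    """Turn (producer, consumer) pairs into producer and consumer maps."""
    producers, consumers = {}, {}
    for producer, consumer in communications:
        producers.setdefault(consumer, set()).add(producer)
        consumers.setdefault(producer, set()).add(consumer)
    return producers, consumers


def closure(start, edges):
    """Return start plus every processor reachable by following edges."""
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for q in edges.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


# P1 produced data for P2, and P2 produced data for P3.
producers, consumers = build_dep_maps([("P1", "P2"), ("P2", "P3")])
checkpoint_group = closure("P2", producers)  # P2 checkpoints with its producers
rollback_group = closure("P2", consumers)    # P2 rolls back with its consumers
```

In this example P2 checkpoints together with P1, while a rollback of P2 drags in P3 only.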


Rebound Fault Model

• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults of their own (e.g. NVM)
• Fault detection is outside our scope:
  – Fault-detection latency has an upper bound of L cycles

[Figure: chip multiprocessor connected to main memory, which holds the log (in SW).]


Rebound Architecture

[Figure: chip multiprocessor and main memory. Each node has P+L1 and an L2 with Dep registers (MyProducers, MyConsumers); the directory cache holds an LW-ID field per entry.]

• Dependence (Dep) registers in the L2 cache controller:
  – MyProducers: bitmap of processors that produced data consumed by the local processor
  – MyConsumers: bitmap of processors that consumed data produced by the local processor

• Processor ID in each directory entry:
  – LW-ID: last writer to the line in the current checkpoint interval


Recording Inter-Thread Dependences

Assume MESI protocol

[Figure sequence: P1 and P2 with their Dep registers (MyProducers, MyConsumers), the directory LW-ID field, memory, and log.]

• P1 writes: the line becomes Dirty (D) in P1's cache and the directory sets LW-ID = P1
• P2 reads: P1 writes the line back and the old memory value is logged; the line becomes Shared (S); P2 is added to P1's MyConsumers and P1 is added to P2's MyProducers
• P1 writes again: the line returns to Dirty in P1's cache; LW-ID stays P1
• P1 checkpoints: dirty lines are written back and logged, P1's Dep registers are cleared, and the LW-ID is cleared
  – LW-ID should remain set until the line is checkpointed
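A minimal sketch of this bookkeeping, under assumptions: Python sets stand in for the hardware bitmaps, the class names `Processor` and `Directory` are hypothetical, only read-after-write dependences are modeled, and LW-IDs are cleared eagerly at a checkpoint.

```python
class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.my_producers = set()  # who produced data this processor consumed
        self.my_consumers = set()  # who consumed data this processor produced


class Directory:
    def __init__(self, procs):
        self.procs = {p.pid: p for p in procs}
        self.lw_id = {}  # line address -> last writer in the current interval

    def write(self, pid, addr):
        self.lw_id[addr] = pid

    def read(self, pid, addr):
        # A read of a line last written by another processor is a dependence.
        writer = self.lw_id.get(addr)
        if writer is not None and writer != pid:
            self.procs[writer].my_consumers.add(pid)
            self.procs[pid].my_producers.add(writer)

    def checkpoint(self, pid):
        # Clear the Dep registers and every LW-ID that names this processor.
        proc = self.procs[pid]
        proc.my_producers.clear()
        proc.my_consumers.clear()
        for addr in [a for a, w in self.lw_id.items() if w == pid]:
            del self.lw_id[addr]


p1, p2 = Processor("P1"), Processor("P2")
directory = Directory([p1, p2])
directory.write("P1", 0x100)   # P1 writes
directory.read("P2", 0x100)    # P2 reads
deps_after_read = (set(p1.my_consumers), set(p2.my_producers))
directory.write("P1", 0x100)   # P1 writes again
directory.checkpoint("P1")     # P1 checkpoints
```

The scan over all LW-IDs in `checkpoint` is the part that a real directory cannot do cheaply.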


Lazily clearing Last Writers

• Clearing LW-IDs at checkpoints is an expensive process!

• Write Signature: encodes all line addresses that the processor has written to (or read exclusively) in the current interval

• At a checkpoint, processors clear their Write Signature
  – LW-IDs in the directory are left in place and may be stale

[Figure: P2 reads a line whose LW-ID still names P1. The address is checked against P1's Write Signature (WSig); the answer is NO, so the LW-ID is stale, it is cleared, and no dependence is recorded.]
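The check in the figure can be sketched with a small Bloom-filter-style signature. Everything here is an assumption for illustration: the 256-bit size, the two hash functions, and the names `WriteSignature` and `resolve_read` do not come from the slides.

```python
class WriteSignature:
    """Hypothetical Bloom-filter-style signature over line addresses."""

    def __init__(self, bits=256):
        self.bits = bits
        self.word = 0

    def _positions(self, addr):
        # Two illustrative hash functions; real hardware would use others.
        return (addr % self.bits, ((addr * 2654435761) >> 7) % self.bits)

    def insert(self, addr):
        for pos in self._positions(addr):
            self.word |= 1 << pos

    def may_contain(self, addr):
        return all((self.word >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.word = 0


def resolve_read(lw_id, addr, reader, signatures):
    """Return the producer for this read, or None if the LW-ID was stale."""
    writer = lw_id.get(addr)
    if writer is None or writer == reader:
        return None
    if signatures[writer].may_contain(addr):
        return writer      # written in the current interval: real dependence
    del lw_id[addr]        # stale entry: clear it lazily
    return None


signatures = {"P1": WriteSignature()}
lw_id = {}
signatures["P1"].insert(0x40)   # P1 writes the line
lw_id[0x40] = "P1"
fresh = resolve_read(lw_id, 0x40, "P2", signatures)
signatures["P1"].clear()        # P1 checkpoints; LW-ID is left behind
stale = resolve_read(lw_id, 0x40, "P2", signatures)
```

A signature of this kind can report false positives, which would only add an unnecessary dependence; it never misses an address that was inserted.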


Distributed Checkpointing Protocol in SW

• Interaction Set [Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers

[Figure sequence: P1 initiates a checkpoint with InteractionSet = {P1} and sends Ck? to P2 and P3, which reply Accept, growing the set to {P1, P2, P3}. A further Ck? is answered with Decline and acknowledged with Ack, leaving P4 outside the set. The processors in the set then checkpoint (chk).]

• Checkpointing is a 2-phase commit protocol.
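The message exchange can be sketched as a sequential simulation. This is an illustration under stated assumptions, not the protocol as specified by the authors: the slides do not give the Decline condition, so the sketch assumes a processor declines when the requester is not in its MyConsumers; the `Commit` message name and the function `run_checkpoint` are hypothetical.

```python
def run_checkpoint(initiator, my_producers, my_consumers):
    """Return (interaction_set, trace) for a checkpoint started by initiator."""
    interaction_set = {initiator}
    trace = []
    pending = [initiator]
    # Phase 1: propagate Ck? requests along MyProducers.
    while pending:
        sender = pending.pop(0)
        for target in sorted(my_producers.get(sender, ())):
            if target in interaction_set:
                continue
            trace.append((sender, target, "Ck?"))
            # Assumed rule: accept only if the sender is a known consumer.
            if sender in my_consumers.get(target, ()):
                trace.append((target, sender, "Accept"))
                interaction_set.add(target)
                pending.append(target)
            else:
                trace.append((target, sender, "Decline"))
    # Phase 2: every member of the interaction set commits its checkpoint.
    for member in sorted(interaction_set):
        trace.append((initiator, member, "Commit"))
    return interaction_set, trace


# P1 consumed data from P2 and P3; P3 still lists P4 as a producer,
# but P4 no longer lists P3 as a consumer.
my_producers = {"P1": {"P2", "P3"}, "P3": {"P4"}}
my_consumers = {"P2": {"P1"}, "P3": {"P1"}, "P4": set()}
members, trace = run_checkpoint("P1", my_producers, my_consumers)
```

With these inputs the interaction set ends as P1, P2, P3, matching the figure, and P4 stays out of the checkpoint.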


Distributed Rollback Protocol in SW

• Rollback is handled similarly to the checkpointing protocol:
  – The interaction set is built transitively using MyConsumers

• Rollback involves:
  – Clearing the Dep registers and Write Signature
  – Invalidating the processor caches
  – Restoring the data and register context from the logs up to the latest checkpoint

• No domino effect


Optimization 1: Delayed Writebacks

• Checkpointing overhead is dominated by data writebacks

• Delayed Writeback optimization:
  – Processors synchronize and resume execution
  – Hardware automatically writes back dirty lines in the background
  – The checkpoint completes only when all delayed data has been written back
  – Inter-thread dependences on delayed data must still be recorded

[Figure: two timelines of Interval I1, Checkpoint, Interval I2. Without the optimization, the processor stalls for sync, WB dirty lines, sync. With delayed writebacks, the stall is shorter and the dirty lines are written back while Interval I2 executes.]


Delayed Writeback Pros/Cons

+ Significant reduction in checkpoint overhead

- Additional support:
  – Each processor has two sets of Dep registers and Write Signatures
  – Each cache line has a delayed bit

- Increased vulnerability:
  – A rollback event forces both intervals to roll back


Delayed Writeback protocol

[Figure: P2 reads a line that is Dirty in P1's cache; the line is written back, logged, and becomes Shared. P1 checks the address (Addr?) against both Write Signatures, WSig0 and WSig1; one answers YES and the other NO, which selects the Dep register set that records the dependence. In the example, P2 is added to P1's MyConsumers0 and P1 is added to P2's MyProducers1.]
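A minimal sketch of the set selection, under assumptions: exact Python sets stand in for the two Write Signatures, the producer checks its current set first, the consumer records the dependence in its own current set, and the names `DelayedProc` and `record_read` are hypothetical.

```python
class DelayedProc:
    def __init__(self, pid):
        self.pid = pid
        self.current = 0                      # set used by the running interval
        self.wsig = [set(), set()]            # stand-ins for WSig0 and WSig1
        self.my_producers = [set(), set()]
        self.my_consumers = [set(), set()]

    def write(self, addr):
        self.wsig[self.current].add(addr)

    def start_checkpoint(self):
        # Resume execution on the other set; the old set stays live
        # until the delayed writebacks of the old interval finish.
        self.current ^= 1

    def finish_checkpoint(self):
        old = self.current ^ 1
        self.wsig[old].clear()
        self.my_producers[old].clear()
        self.my_consumers[old].clear()


def record_read(producer, consumer, addr):
    """Record the dependence in the producer set whose signature hits."""
    for i in (producer.current, producer.current ^ 1):
        if addr in producer.wsig[i]:
            producer.my_consumers[i].add(consumer.pid)
            consumer.my_producers[consumer.current].add(producer.pid)
            return i
    return None


p1, p2 = DelayedProc("P1"), DelayedProc("P2")
p1.write(0x200)          # written while P1 uses set 0
p1.start_checkpoint()    # writeback of 0x200 is now delayed; P1 uses set 1
p2.start_checkpoint()    # P2 happens to be running on set 1
hit_set = record_read(p1, p2, 0x200)
```

Here the read hits WSig0 at P1, so the dependence lands in P1's MyConsumers0 and P2's MyProducers1, as in the figure.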


Optimization 2: Multiple Checkpoints

• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the maximum fault-detection latency (L)

• Solution: keep multiple checkpoints
  – On a fault, roll back the interacting processors to safe checkpoints

• No domino effect

[Figure: timeline with Ckpt 1 and Ckpt 2, each with its own Dep registers (Dep registers 1, Dep registers 2); a fault at tf is detected after the detection latency and triggers rollback.]
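The safety condition can be written as a small selection function. This is a sketch under assumptions: times are in cycles, the function name `safe_checkpoint` and its parameters are hypothetical, and a checkpoint is treated as safe only if it was taken strictly more than L cycles before the fault was detected.

```python
def safe_checkpoint(checkpoint_times, detect_time, latency):
    """Return the latest checkpoint older than detect_time - latency, or None."""
    # The fault cannot have occurred earlier than detect_time - latency.
    earliest_fault = detect_time - latency
    safe = [t for t in checkpoint_times if t < earliest_fault]
    return max(safe) if safe else None


# Checkpoints at cycles 100, 200, 300; detection at 320 with L = 50.
target = safe_checkpoint([100, 200, 300], detect_time=320, latency=50)
none_safe = safe_checkpoint([100], detect_time=120, latency=50)
```

In the first call the checkpoint at cycle 300 may already contain the fault, so the rollback target is the one at cycle 200.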


Multiple Checkpoints: Pros/Cons

+ Realistic system: supports non-instantaneous fault detection

- Additional support:
  – Each checkpoint has its own Dep registers
  – Dep registers can be recycled only after the fault-detection latency

- Need to track communication across checkpoints

- Combination with Delayed Writebacks: one more Dep register set


Optimization 3: Hiding Checkpoint behind Global Barrier

• Global barriers require that all processors communicate
  – Leads to global checkpoints

• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide checkpoint overhead behind barrier imbalance spins




Hiding Checkpoint behind Global Barrier

• First arriving processor initiates the checkpoint
• Others: HW writes back data as execution proceeds to the barrier
• Checkpoint commits as the last processor arrives
• After the barrier: few interacting processors

    Lock
    count++
    if (count == numProc)
        I_am_last = TRUE    /* local var */
    Unlock
    if (I_am_last) {
        count = 0
        flag = TRUE …
    }
    else
        while (!flag) {}

[Figure: Processors P1, P2, and P3 at the barrier. Updates of count trigger BarCK? messages and Notify replies; each processor tracks its checkpoint set ICHK (shown as {P3}, {P2, P3}, and {P1, P3}) while the others spin on while(!flag) until flag = TRUE.]
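The barrier above, with the checkpoint hooks from the bullets, can be sketched with Python threads. This is an illustration under assumptions: the class `CheckpointingBarrier` is hypothetical, the barrier is single-use, `threading.Event` replaces the spin on `flag`, and "initiate" and "commit" are just recorded in a list rather than driving real hardware.

```python
import threading


class CheckpointingBarrier:
    """Single-use barrier that logs when a checkpoint starts and commits."""

    def __init__(self, num_proc):
        self.num_proc = num_proc
        self.count = 0
        self.lock = threading.Lock()
        self.flag = threading.Event()
        self.events = []  # ("initiate" | "commit", processor id)

    def wait(self, pid):
        with self.lock:
            self.count += 1
            if self.count == 1:
                # First arriving processor initiates the checkpoint.
                self.events.append(("initiate", pid))
            i_am_last = self.count == self.num_proc
            if i_am_last:
                # Last arriving processor commits it.
                self.events.append(("commit", pid))
        if i_am_last:
            self.count = 0
            self.flag.set()
        else:
            self.flag.wait()


barrier = CheckpointingBarrier(num_proc=4)
threads = [threading.Thread(target=barrier.wait, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
event_kinds = [kind for kind, _ in barrier.events]
```

Which thread initiates and which commits depends on scheduling; only the order of the two events is fixed.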


Evaluation Setup

• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled several environments:
  – Global: baseline global checkpointing
  – Rebound: local checkpointing scheme with delayed writebacks
  – Rebound_NoDWB: Rebound without delayed writebacks


Avg. Interaction Set: Set of Producer Processors

• Most apps: the interaction set is small
  – Justifies coordinated local checkpointing
  – Averages are brought up by global barriers

[Figure: average interaction set size per application; chart annotations: 64 and 38.]


Checkpoint Execution Overhead

• Rebound's avg. checkpoint execution overhead is 2%
  – Compared to 15% for Global

• Delayed Writebacks complement local checkpointing

[Figure: % checkpoint overhead (y-axis, 0 to 40) for Global, Rebound_NoDWB, and Rebound on Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and SP2; the values 15 and 2 are marked.]


Rebound Scalability

• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability

[Figure: scalability of checkpoint overhead; constant problem size.]


Also in the Paper

• Delayed writebacks are also useful in Global
• Barrier optimization is effective but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound leads to only a 4% increase in coherence traffic


Conclusions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages the directory protocol
• Boosts checkpointing efficiency:
  – Delayed write-backs
  – Multiple checkpoints
  – Barrier optimization
• Avg. execution overhead for 64 processors: 2%

• Future work:
  – Apply Rebound to non-hardware-coherent machines
  – Scalability to hierarchical directories
