Rebound: Scalable Checkpointing for Coherent Shared Memory. Rishi Agarwal, Pranav Garg, and Josep Torrellas, Department of Computer Science, University of Illinois at Urbana-Champaign. http://iacoma.cs.uiuc.edu

Dec 24, 2015

Page 1: Title

Rebound: Scalable Checkpointing for Coherent Shared Memory

Rishi Agarwal, Pranav Garg, and Josep Torrellas

Department of Computer Science, University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

Page 2: Checkpointing in Shared-Memory MPs

• HW-based schemes for small CMPs use global checkpointing
  – All processors participate in system-wide checkpoints
• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…

[Figure: timelines for processors P1–P4 with the labels "checkpoint", "save chkpt", "Fault", and "rollback"]

Page 3: Alternative: Coordinated Local Checkpointing

• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant

+ Scalable: checkpoint and rollback in processor groups
– Complexity: record inter-thread dependences dynamically

[Figure: processors P1–P5 under a single global checkpoint ("Global Chkpt") next to the same processors under separate local checkpoints ("Local Chkpt")]

Page 4: Contributions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  – Delaying writeback of data to safe memory at checkpoints
  – Supporting multiple checkpoints
  – Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 processors: 2%
  – Compared to 15% for global checkpointing

Page 5: Background: In-Memory Checkpointing with ReVive [Prvulovic-02]

[Figure: processors P1–P3 with caches, main memory, and a log. During execution, writes (W) create dirty cache lines; when a dirty line is displaced and written back (WB), the old memory value is logged. At a checkpoint (CHK), the application stalls while registers are dumped and dirty cache lines are written back, with the old values logged.]

Page 6: Background: In-Memory Checkpointing with ReVive [Prvulovic-02]

[Figure: processors P1–P3 with caches, memory, and the log after a fault. Old registers are restored, caches are invalidated, and memory lines are reverted from the log. Annotation: Global, broadcast protocol → Local, coordinated, scalable protocol]
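The logging and rollback steps shown in the two ReVive background slides can be sketched as a small undo log. This is a minimal Python illustration under stated assumptions, not ReVive's implementation: the class `UndoLoggedMemory` and its method names are hypothetical, memory is a dictionary, and the log is simply discarded once a checkpoint completes.

```python
class UndoLoggedMemory:
    """Toy model of in-memory checkpointing with an undo log (hypothetical names)."""

    def __init__(self):
        self.memory = {}           # safe main memory: address -> value
        self.cache = {}            # dirty cache lines: address -> value
        self.log = {}              # undo log: address -> value at the last checkpoint
        self.registers = {}        # live register context
        self.saved_registers = {}  # register dump taken at the last checkpoint

    def write(self, addr, value):
        self.cache[addr] = value   # writes stay in the cache until written back

    def write_back(self, addr):
        # Log the old memory value the first time a line is overwritten
        # in this interval, then update memory.
        if addr not in self.log:
            self.log[addr] = self.memory.get(addr)
        self.memory[addr] = self.cache.pop(addr)

    def checkpoint(self):
        self.saved_registers = dict(self.registers)  # register dump
        for addr in list(self.cache):                # write back dirty lines
            self.write_back(addr)
        self.log.clear()  # assumption: older undo records are no longer needed

    def rollback(self):
        self.registers = dict(self.saved_registers)  # old registers restored
        self.cache.clear()                            # caches invalidated
        for addr, old in self.log.items():            # memory lines reverted
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.log.clear()


mem = UndoLoggedMemory()
mem.registers["pc"] = 100
mem.write(0x0, 1)
mem.checkpoint()        # memory now holds {0x0: 1}

mem.registers["pc"] = 200
mem.write(0x0, 2)
mem.write_back(0x0)     # displacement: the old value 1 is logged
mem.write(0x8, 5)       # still dirty in the cache
mem.rollback()          # a fault is detected
print(mem.memory, mem.registers)
```

The rollback undoes both the displaced line and the line that never left the cache, which is why the caches are invalidated in addition to reverting memory.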

Page 7: Coordinated Local Checkpointing Rules

• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
• P checkpoints → P’s producers checkpoint
• P rolls back → P’s consumers roll back

[Figure: P1 writes x (producer) and P2 reads x (consumer). When the consumer checkpoints, the producer checkpoints; when the producer rolls back, the consumer rolls back.]

Page 8: Rebound Fault Model

• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope:
  – Fault detection latency has an upper bound of L cycles

[Figure: chip multiprocessor connected to main memory and the log (in SW)]


Page 11: Rebound Architecture

• Dependence (Dep) registers in the L2 cache controller:
  – MyProducers: bitmap of processors that produced data consumed by the local processor
  – MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  – LW-ID: last writer to the line in the current checkpoint interval

[Figure: chip multiprocessor tile with P+L1, an L2 with the Dep registers (MyProducers, MyConsumers), and a directory cache with an LW-ID field; main memory is off chip]
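The Dep registers can be pictured as one bit per processor. The sketch below is a minimal Python illustration, assuming a 64-processor chip; the class `DepRegisters` and its methods are hypothetical names chosen for this example and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List

NUM_PROCS = 64  # assumption: matches the 64-processor configuration in the evaluation


@dataclass
class DepRegisters:
    """Per-processor dependence registers, one bit per processor (hypothetical layout)."""
    my_producers: int = 0  # bit i set: processor i produced data this processor consumed
    my_consumers: int = 0  # bit i set: processor i consumed data this processor produced

    def add_producer(self, pid: int) -> None:
        self.my_producers |= 1 << pid

    def add_consumer(self, pid: int) -> None:
        self.my_consumers |= 1 << pid

    def producers(self) -> List[int]:
        return [i for i in range(NUM_PROCS) if (self.my_producers >> i) & 1]

    def consumers(self) -> List[int]:
        return [i for i in range(NUM_PROCS) if (self.my_consumers >> i) & 1]

    def clear(self) -> None:
        # Both bitmaps are reset when the processor checkpoints.
        self.my_producers = 0
        self.my_consumers = 0


regs = DepRegisters()
regs.add_producer(3)
regs.add_producer(17)
regs.add_consumer(5)
print(regs.producers(), regs.consumers())
```

A bitmap keeps the per-processor state at a fixed size (here 2 x 64 bits) regardless of how many lines are shared.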

Page 12: Recording Inter-Thread Dependences

Assume MESI protocol

P1 writes:
• The directory entry for the line becomes dirty (D), owned by P1, and LW-ID records P1
• MyProducers and MyConsumers are still empty in both processors

[Figure: P1 and P2 with their Dep registers, the directory entry (state D, owner P1, LW-ID), memory, and the log]

Page 13: Recording Inter-Thread Dependences

Assume MESI protocol

P2 reads:
• The line is written back to memory, with logging, and becomes shared (S)
• LW-ID is P1, so P2 is recorded in P1's MyConsumers and P1 is recorded in P2's MyProducers

Page 14: Recording Inter-Thread Dependences

Assume MESI protocol

P1 writes:
• The line goes from shared (S) back to dirty (D), owned by P1; LW-ID remains P1
• The Dep registers keep the recorded dependence (MyConsumers: P2 in P1, MyProducers: P1 in P2)

Page 15: Recording Inter-Thread Dependences

Assume MESI protocol

P1 checkpoints:
• Dirty lines are written back to memory, with logging
• P1 clears its Dep registers
• LW-ID is cleared, but it should remain set until the line is checkpointed
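The four steps above (P1 writes, P2 reads, P1 writes, P1 checkpoints) can be replayed in a few lines. The following is a minimal Python sketch, not the hardware protocol: the class `DependenceTracker` is a hypothetical name, coherence states are not modeled, only read-after-write dependences are recorded, and LW-IDs are cleared eagerly at the checkpoint.

```python
from collections import defaultdict


class DependenceTracker:
    """Toy model of LW-ID plus the MyProducers/MyConsumers registers (hypothetical names)."""

    def __init__(self):
        self.lw_id = {}                        # line address -> last writer in this interval
        self.my_producers = defaultdict(set)   # processor -> processors it consumed from
        self.my_consumers = defaultdict(set)   # processor -> processors that consumed from it

    def write(self, proc, addr):
        self.lw_id[addr] = proc                # the directory remembers the last writer

    def read(self, proc, addr):
        writer = self.lw_id.get(addr)
        if writer is not None and writer != proc:
            self.my_consumers[writer].add(proc)   # the writer learns about its consumer
            self.my_producers[proc].add(writer)   # the reader learns about its producer

    def checkpoint(self, proc):
        self.my_producers[proc].clear()        # clear the Dep registers
        self.my_consumers[proc].clear()
        # Assumption: eager clearing of this processor's LW-IDs.
        for addr in [a for a, w in self.lw_id.items() if w == proc]:
            del self.lw_id[addr]


tracker = DependenceTracker()
tracker.write("P1", 0x40)      # P1 writes
tracker.read("P2", 0x40)       # P2 reads: dependence P1 -> P2 is recorded
after_read = (set(tracker.my_consumers["P1"]), set(tracker.my_producers["P2"]))
tracker.write("P1", 0x40)      # P1 writes again
tracker.checkpoint("P1")       # P1 checkpoints
print(after_read, dict(tracker.lw_id))
```

Note that P2 still lists P1 in its MyProducers after P1's checkpoint; only the checkpointing processor's own registers are cleared in this sketch.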

Page 16: Lazily Clearing Last Writers

• Clearing LW-IDs at a checkpoint is an expensive process
• Write Signature: encodes all line addresses that the processor has written (or read exclusively) in the current interval
• At a checkpoint, the processors clear their Write Signature
  – LW-IDs are left in place and are potentially stale

Page 17: Lazily Clearing Last Writers

P2 reads:
• The directory entry (state S, LW-ID P1) holds a stale LW-ID
• The address is checked against the Write Signature (WSig): "Addr?" → "NO!"
• The stale LW-ID is cleared, and MyProducers and MyConsumers stay empty

[Figure: P1 and P2 with their Dep registers and WSig, the directory entry, memory, and the log]
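The signature check can be illustrated with a small bit-vector filter. This is a hedged Python sketch: the slides do not say how the Write Signature is encoded, so the Bloom-filter-style `WriteSignature` class, its two hash functions, and the helper `producer_for_read` are all assumptions made for illustration.

```python
class WriteSignature:
    """Bloom-filter-style address set (assumed encoding, hypothetical names)."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.vector = 0

    def _positions(self, addr):
        # Two simple hash functions chosen only for this example.
        return (addr % self.bits, ((addr * 2654435761) >> 7) % self.bits)

    def insert(self, addr):
        for pos in self._positions(addr):
            self.vector |= 1 << pos

    def may_contain(self, addr):
        # False positives are possible, false negatives are not.
        return all((self.vector >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.vector = 0


def producer_for_read(lw_id, signatures, addr, reader):
    """Return the producer to record for this read, or None if the LW-ID is stale."""
    writer = lw_id.get(addr)
    if writer is None or writer == reader:
        return None
    if signatures[writer].may_contain(addr):
        return writer        # written in the current interval: record the dependence
    del lw_id[addr]          # stale LW-ID: clear it lazily, record nothing
    return None


lw_id = {}
signatures = {"P1": WriteSignature(), "P2": WriteSignature()}

lw_id[0x80] = "P1"                  # P1 writes line 0x80
signatures["P1"].insert(0x80)
fresh = producer_for_read(lw_id, signatures, 0x80, "P2")

signatures["P1"].clear()            # P1 checkpoints: only the signature is cleared
stale = producer_for_read(lw_id, signatures, 0x80, "P2")
print(fresh, stale, lw_id)
```

With such an encoding, a false positive would only cause an unnecessary dependence to be recorded, which is conservative rather than incorrect.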


Page 22: Distributed Checkpointing Protocol in SW

• Interaction Set [Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers
• Checkpointing is a 2-phase commit protocol

[Figure: P1 initiates a checkpoint and sends "Ck?" requests; "Accept", "Decline", and "Ack" messages are exchanged among P1–P4. InteractionSet grows from {P1} to {P1, P2, P3}; P4 is not included.]
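Building the interaction set amounts to a transitive walk over MyProducers. The sketch below is a minimal Python illustration under stated assumptions: the function names are hypothetical, message passing is replaced by direct dictionary lookups, and the rule that a producer declines when it no longer lists the requester in its MyConsumers is an assumption, since the slides only show that "Accept" and "Decline" replies exist.

```python
from collections import deque


def build_interaction_set(initiator, my_producers, my_consumers):
    """Phase 1: send "Ck?" to producers transitively and collect the ones that accept."""
    members = {initiator}
    declined = []
    pending = deque([initiator])
    while pending:
        requester = pending.popleft()
        for producer in sorted(my_producers.get(requester, ())):
            if producer in members:
                continue
            if requester in my_consumers.get(producer, ()):
                members.add(producer)          # "Accept"
                pending.append(producer)       # its own producers are asked next
            else:
                declined.append((requester, producer))  # "Decline" (assumed rule)
    return members, declined


def commit_checkpoint(members, my_producers, my_consumers):
    """Phase 2: every member checkpoints and clears its Dep registers."""
    for proc in members:
        my_producers[proc] = set()
        my_consumers[proc] = set()


# Hypothetical state loosely mirroring the slide: P2 and P3 produced data for P1;
# P3 lists P4 as a producer, but P4 no longer lists P3 as a consumer.
my_producers = {"P1": {"P2", "P3"}, "P2": set(), "P3": {"P4"}, "P4": set()}
my_consumers = {"P1": set(), "P2": {"P1"}, "P3": {"P1"}, "P4": set()}

members, declined = build_interaction_set("P1", my_producers, my_consumers)
commit_checkpoint(members, my_producers, my_consumers)
print(sorted(members), declined)
```

Separating the set-building phase from the commit phase is what lets the group agree on its membership before any processor discards its dependence information.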

Page 23: Distributed Rollback Protocol in SW

• Rollback is handled similarly to the checkpointing protocol:
  – Interaction set is built transitively using MyConsumers
• Rollback involves:
  – Clearing the Dep registers and Write Signature
  – Invalidating the processor caches
  – Restoring the data and register context from the logs up to the latest checkpoint
• No domino effect
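The rollback steps listed for this protocol can be sketched as follows. This is a minimal Python illustration with hypothetical names (`rollback_set`, `roll_back`, and the dictionary layout of a processor's state); restoring memory from the log is left out, and only the per-processor steps are shown.

```python
def rollback_set(faulty, my_consumers):
    """Collect the faulty processor and, transitively, everything that consumed its data."""
    members = {faulty}
    stack = [faulty]
    while stack:
        proc = stack.pop()
        for consumer in my_consumers.get(proc, ()):
            if consumer not in members:
                members.add(consumer)
                stack.append(consumer)
    return members


def roll_back(state):
    """Apply the per-processor rollback steps to a toy state dictionary."""
    state["my_producers"].clear()        # clear the Dep registers
    state["my_consumers"].clear()
    state["write_signature"].clear()     # clear the Write Signature
    state["cache"].clear()               # invalidate the processor cache
    state["registers"] = dict(state["checkpointed_registers"])  # restore register context


# Hypothetical dependences: P1 -> P2 -> P3, with P4 independent.
my_consumers = {"P1": {"P2"}, "P2": {"P3"}, "P3": set(), "P4": set()}
to_roll_back = rollback_set("P1", my_consumers)

p2_state = {
    "my_producers": {"P1"},
    "my_consumers": {"P3"},
    "write_signature": {0x40},
    "cache": {0x40: 7},
    "registers": {"pc": 500},
    "checkpointed_registers": {"pc": 100},
}
roll_back(p2_state)
print(sorted(to_roll_back), p2_state["registers"])
```

P4 never communicated with the faulty processor, so it keeps running, which is the scalability argument behind coordinated local rollback.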

Page 24: Optimization 1: Delayed Writebacks

• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  – Processors synchronize and resume execution
  – Hardware automatically writes back dirty lines in the background
  – Checkpoint is completed only when all delayed data is written back
  – Still need to record inter-thread dependences on delayed data

[Figure: two timelines from Interval I1 to Interval I2. In the first, the checkpoint between the two syncs includes the writeback (WB) of dirty lines while the processor stalls. In the second, the WB of dirty lines is shown alongside Interval I2.]

Page 25: Delayed Writeback Pros/Cons

+ Significant reduction in checkpoint overhead
– Additional support:
  • Each processor has two sets of Dep registers and Write Signatures
  • Each cache line has a Delayed bit
– Increased vulnerability:
  • A rollback event forces both intervals to roll back

Page 26: Delayed Writeback Protocol

P2 reads:

[Figure: P1 and P2 each hold two sets of Dep registers (MyProducers0/MyConsumers0 and MyProducers1/MyConsumers1) and two Write Signatures (WSig0, WSig1). The address is checked against both signatures ("Addr?"); one check returns "NO!" and the other "YES!". The line is written back with logging and becomes shared (S). The recorded dependence is shown as MyConsumers0: P2 and MyProducers1: P1, with LW-ID in the directory entry.]

Page 27: Optimization 2: Multiple Checkpoints

• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the maximum fault-detection latency (L)
• Solution: keep multiple checkpoints
  – On a fault, roll back interacting processors to safe checkpoints
• No domino effect

[Figure: timeline with Ckpt 1 and Ckpt 2, each with its own Dep registers (Dep registers 1, Dep registers 2), a fault at time tf, the detection latency, and the rollback]
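The safety condition can be written as a short calculation. The sketch below is a minimal Python illustration with hypothetical names and made-up time values: a checkpoint taken at time t is treated as safe at detection time d only if t + L < d, because the fault may have occurred as early as d - L.

```python
def safe_checkpoint(checkpoint_times, detect_time, max_latency):
    """Return the latest checkpoint that cannot contain the fault, or None."""
    safe = [t for t in checkpoint_times if t + max_latency < detect_time]
    return max(safe) if safe else None


def can_recycle_dep_registers(checkpoint_time, now, max_latency):
    """Dep registers of a checkpoint are reusable once it has been safe for L cycles."""
    return now - checkpoint_time > max_latency


L = 50                                   # assumed maximum detection latency, in cycles
checkpoints = [100, 200]                 # hypothetical times of Ckpt 1 and Ckpt 2

target_a = safe_checkpoint(checkpoints, detect_time=230, max_latency=L)
target_b = safe_checkpoint(checkpoints, detect_time=260, max_latency=L)
print(target_a, target_b, can_recycle_dep_registers(100, 230, L))
```

In the first case the fault may predate Ckpt 2 (230 - 50 = 180 < 200), so the rollback target is Ckpt 1; in the second case Ckpt 2 is already safe.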

Page 28: Multiple Checkpoints: Pros/Cons

+ Realistic system: supports non-instantaneous fault detection
– Additional support:
  • Each checkpoint has its own Dep registers
  • Dep registers can be recycled only after the fault-detection latency
– Need to track communication across checkpoints
– Combination with Delayed Writebacks: one more Dep register set

Page 29: Optimization 3: Hiding Checkpoint behind Global Barrier

• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide checkpoint overhead behind barrier imbalance spins


Page 31: Hiding Checkpoint behind Global Barrier

• First arriving processor initiates the checkpoint
• Others: HW writes back data as execution proceeds to the barrier
• Commit checkpoint as the last processor arrives
• After the barrier: few interacting processors

Barrier code:

    Lock
    count++
    if (count == numProc)
        i_am_last = TRUE    /* local var */
    Unlock
    if (i_am_last) {
        count = 0
        flag = TRUE
        …
    } else
        while (!flag) {}

[Figure: processors P1, P2, and P3 executing the barrier. Messages labeled "Update", "BarCK?", and "Notify" are exchanged, and the sets ICHK = {P3}, ICHK = {P2, P3}, and ICHK = {P1, P3} are shown next to the processors.]
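The barrier above can be turned into a small runnable experiment. This is a hedged Python sketch using threads in place of processors: the class `CheckpointingBarrier` and the `events` list are hypothetical, the barrier is single-use as on the slide (the flag is never reset), and the checkpoint actions are reduced to log entries.

```python
import threading
import time


class CheckpointingBarrier:
    """Single-use flag barrier with hypothetical checkpoint hooks."""

    def __init__(self, num_proc):
        self.num_proc = num_proc
        self.count = 0
        self.flag = False
        self.lock = threading.Lock()
        self.events = []  # checkpoint actions, in the order they happen

    def wait(self, name):
        with self.lock:                              # Lock ... Unlock
            self.count += 1                          # count++
            i_am_first = self.count == 1
            i_am_last = self.count == self.num_proc  # local variable
            if i_am_first:
                self.events.append(("initiate checkpoint", name))
            if i_am_last:
                self.events.append(("commit checkpoint", name))
        if i_am_last:
            self.count = 0
            self.flag = True                         # release the waiting threads
        else:
            while not self.flag:                     # spin until the last arrival
                time.sleep(0.001)


barrier = CheckpointingBarrier(num_proc=3)
threads = [
    threading.Thread(target=barrier.wait, args=("P%d" % i,)) for i in (1, 2, 3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(barrier.events)
```

Which thread arrives first varies from run to run, but the first arrival always records the initiation and the last arrival always records the commit.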

Page 32: Evaluation Setup

• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled several environments:
  – Global: baseline global checkpointing
  – Rebound: local checkpointing scheme with delayed writebacks
  – Rebound_NoDWB: Rebound without the delayed writebacks

Page 33: Avg. Interaction Set: Set of Producer Processors

• Most apps: interaction set is a small set
  – Justifies coordinated local checkpointing
  – Averages brought up by global barriers

[Figure: average interaction set size per application; the chart carries the labels 64 and 38]


Page 35: Checkpoint Execution Overhead

• Rebound’s avg. checkpoint execution overhead is 2%
  – Compared to 15% for Global
• Delayed Writebacks complement local checkpointing

[Figure: bar chart of % checkpoint overhead (y-axis 0 to 40) for Global, Rebound_NoDWB, and Rebound. Applications: Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, SP2]

Page 36: Rebound Scalability

• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability

[Figure: scalability chart, constant problem size]

Page 37: Also in the Paper

• Delayed writebacks are also useful in Global
• Barrier optimization is effective but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound leads to only a 4% increase in coherence traffic

Page 38: Conclusions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages the directory protocol
• Boosts checkpointing efficiency:
  – Delayed writebacks
  – Multiple checkpoints
  – Barrier optimization
• Avg. execution overhead for 64 processors: 2%
• Future work:
  – Apply Rebound to non-hardware-coherent machines
  – Scalability to hierarchical directories
