PACT-18 :: Sep 15, 2009 CPROB: Checkpoint Processing with Opportunistic Minimal Recovery Andrew Hilton, Neeraj Eswaran, Amir Roth University of Pennsylvania.

PACT-18 :: Sep 15, 2009

CPROB: Checkpoint Processing with Opportunistic Minimal Recovery

Andrew Hilton, Neeraj Eswaran, Amir Roth
University of Pennsylvania

{adhilton,neeraj,amir}@cis.upenn.edu

[ 2 ]

CPROB in a Nutshell (Sorry, O’Reilly)

Physical register file constrains out-of-order window
• Area and power intensive, latency complicates the scheduler

CPR (Checkpoint Processing and Recovery) [Akkary+03]
+ Aggressive, execution-driven register reclamation
– Checkpoint overhead: recovery only to pre-created checkpoints

CPROB: hybrid register reclamation scheme
• CPR + opportunistic checkpoint overhead elimination
• Opportunistic = dynamically adapts to register demands
+ Outperforms both CPR and conventional reclamation
+ Simple low-overhead implementation

[ 3 ]

Outline

Introduction

CPR review
• The “checkpoint overhead” problem

CPROB

Evaluation

Related Work

Conclusion

[ 4 ]

Conventional Register Reclamation

Running example
• 7 instructions (A–G), 2 branches (C & E), 3 arch regs (r1–r3)

Conventional register reclamation (i.e., ROB)
• Commit-driven reclamation: over-written register freed at commit
• Needs 8 physical registers for this “window”
• RenameMap + OverWritten

PC  Raw Insn          Renamed Insn      RenameMap (r1 r2 r3)  OW
A:  ld [r3] => r3     ld [p3] => p4     p1 p2 p3              p3
B:  sub r1, 4 => r2   sub p1, 4 => p5   p1 p2 p4              p2
C:  brz r3, Q         brz p4, Q         p1 p5 p4              -
D:  ld [r2] => r2     ld [p5] => p6     p1 p5 p4              p5
E:  brz r2, T         brz p6, T         p1 p6 p4              -
F:  add r1, 8 => r3   add p1, 8 => p7   p1 p6 p4              p4
G:  ld [r3] => r2     ld [p7] => p8     p1 p6 p7              p6
                                        p1 p8 p7

(RenameMap column: map before the instruction is renamed; the last row is the map after G.)
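The renaming in the running example can be replayed with a few lines of Python. This is only a sketch under stated assumptions: the tuple encoding of instructions, the `rename` helper, and the simple "next register number" allocator are hypothetical names chosen for illustration, not anything from the talk. It records the over-written (OW) register per instruction and counts how many physical registers the window touches.

```python
# Hypothetical encoding of the running example:
# (label, source arch regs, destination arch reg or None for branches).
PROGRAM = [
    ("A", ["r3"], "r3"),  # ld  [r3]  => r3
    ("B", ["r1"], "r2"),  # sub r1, 4 => r2
    ("C", ["r3"], None),  # brz r3, Q
    ("D", ["r2"], "r2"),  # ld  [r2]  => r2
    ("E", ["r2"], None),  # brz r2, T
    ("F", ["r1"], "r3"),  # add r1, 8 => r3
    ("G", ["r3"], "r2"),  # ld  [r3]  => r2
]


def rename(program):
    """Rename in program order, recording the over-written register of each insn."""
    rename_map = {"r1": "p1", "r2": "p2", "r3": "p3"}
    next_preg = 4  # assumption: registers are handed out in numeric order
    rows = []
    for label, srcs, dest in program:
        phys_srcs = [rename_map[s] for s in srcs]
        phys_dest = overwritten = None
        if dest is not None:
            overwritten = rename_map[dest]  # freed only at commit in a ROB design
            phys_dest = f"p{next_preg}"
            next_preg += 1
            rename_map[dest] = phys_dest
        rows.append((label, phys_srcs, phys_dest, overwritten))
    return rows, rename_map, next_preg - 1


rows, final_map, regs_used = rename(PROGRAM)
overwritten = [row[3] for row in rows]
```

With commit-driven reclamation none of the over-written registers can be reused until its instruction commits, which is why the whole window needs all eight registers at once.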

[ 5 ]

CPR Register Reclamation

CPR (Checkpoint Processing & Recovery)
• Execution-driven reclamation: sources + dest “freed” at execute
• Needs only 7 physical registers for this window, i.e., those held by:
• Sources + dests of un-executed insns
• RenameMap
• Pre-created checkpoints


[Figure: the running example under CPR. RenameMap = (p1 p8 p7), checkpoints Chk0 = (p1 p2 p3) and Chk1 = (p1 p6 p4). Instructions A (ld [r3] => r3, renamed ld [p3] => p4) and C (brz r3, Q, renamed brz p4, Q) are called out; p5 is marked free.]
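The "7 registers" count can be checked by taking the union of everything that still holds a register. The sketch below assumes, from the figure residue, that A and C are the only un-executed instructions; `live_registers` and its argument names are hypothetical.

```python
def live_registers(rename_map, checkpoints, unexecuted):
    """Union of registers held by the RenameMap, checkpoints, and un-executed insns."""
    live = set(rename_map)
    for checkpoint in checkpoints:
        live |= set(checkpoint)
    for regs in unexecuted:
        live |= set(regs)
    return live


ALL_PREGS = {f"p{i}" for i in range(1, 9)}

live = live_registers(
    rename_map=["p1", "p8", "p7"],
    checkpoints=[["p1", "p2", "p3"], ["p1", "p6", "p4"]],  # Chk0, Chk1
    # Assumption: only A (ld [p3] => p4) and C (brz p4, Q) have not executed.
    unexecuted=[["p3", "p4"], ["p4"]],
)
free = ALL_PREGS - live
```

Under these assumptions `free` contains only p5, matching the "p5 is free" annotation on the slide.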

[ 6 ]

CPR Checkpoint Overhead

What if branch C mis-predicts?
• Can’t recover to D … p5 (appears in D’s RenameMap) already freed!
– Must recover to A (checkpoint) and re-execute A–C
• This penalty is called checkpoint overhead
• Squash & re-execute insns older than un-checkpointed mis-spec
• No such penalty in ROB, which performs minimal recovery


[Figure: the same example with branch C mis-predicted. p5 has already been freed, so recovery restarts from Chk0 = (p1 p2 p3) and A, B, and C re-execute.]
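A rough way to quantify checkpoint overhead is to count the instructions between the nearest older checkpoint and the mis-predicted branch. The helper below is a simplified sketch, not the paper's model: `cpr_recovery` is a hypothetical name, checkpoints are represented by the index of the first instruction they cover, and placing Chk1 before E is an assumption (the slide only gives its map).

```python
def cpr_recovery(checkpoint_starts, mispredicted):
    """Return (restart index, instructions re-executed through the branch).

    checkpoint_starts: index of the first instruction covered by each checkpoint.
    mispredicted: index of the mis-predicted branch.
    """
    restart = max(pos for pos in checkpoint_starts if pos <= mispredicted)
    return restart, mispredicted - restart + 1


INSNS = "ABCDEFG"
# Chk0 precedes A (from the slide); Chk1 before E is an assumption.
restart, redo = cpr_recovery([0, 4], INSNS.index("C"))
restart_label = INSNS[restart]
```

For a mis-prediction at C this gives a restart at A and three re-executed instructions (A–C); a ROB with minimal recovery would re-execute none of them.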

[ 7 ]

The Two Faces of CPR

• SpecFP: high bpred accuracy + need large window
• Reclamation trumps overhead → average speedups
• Some pathologies, e.g., galgel

• SpecINT: low bpred accuracy
• Overhead dominates → average slowdown

[ 8 ]

Answer != More Checkpoints

• More checkpoints reduce overhead … but only a little
– Sometimes hurt performance (tie up more registers)
– Also, checkpoints are not cheap

[ 9 ]

But CPR is Great for SMT … Right?

+ SMT needs more registers …
+ And reduces branch mis-prediction penalty …
– But actually makes checkpoint overhead worse!
• Distance from mis-predicted branch to older checkpoint has nothing to do with speculation depth
• Threads share checkpoints (more un-checkpointed branches)

[ 10 ]

Outline

Introduction

CPR

CPROB
• Basic idea (very simple)
• Some policies
• Implementation

Evaluation

Related Work

Conclusion

[ 11 ]

CPROB: The Key Idea

CPR + hold recovery (OW) registers opportunistically
• Recovery registers (p5) available → no checkpoint overhead
• Recover to younger checkpoint, then walk backwards serially

• Recovery registers (p5) not available → overhead, but still correct
• Recover to older checkpoint a la CPR

• Opportunistically = can release recovery registers at any time!


[Figure: the running example under CPROB, with p5 held as a recovery register alongside Chk0 = (p1 p2 p3) and Chk1 = (p1 p6 p4).]
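The two recovery paths can be sketched as a single function: start from the younger checkpoint and undo renames youngest-first, giving up if any needed recovery register has been released. The function name `walk_back`, the undo-log format, and the `held` set are hypothetical; the values come from the running example (D over-writes p5 as the mapping of r2).

```python
def walk_back(checkpoint_map, undo_log, held):
    """Undo renames youngest-first, starting from a younger checkpoint.

    undo_log: (arch_reg, overwritten_preg) pairs for the instructions between
    the mis-predicted branch and the younger checkpoint, in program order.
    Returns the restored map, or None if a recovery register was released
    (the caller then recovers to the older checkpoint, as in plain CPR).
    """
    restored = dict(checkpoint_map)
    for arch_reg, old_preg in reversed(undo_log):
        if old_preg not in held:
            return None
        restored[arch_reg] = old_preg
    return restored


chk1 = {"r1": "p1", "r2": "p6", "r3": "p4"}
undo_d = [("r2", "p5")]  # D: ld [p5] => p6 over-wrote p5

minimal = walk_back(chk1, undo_d, held={"p5"})   # p5 still held
fallback = walk_back(chk1, undo_d, held=set())   # p5 already released
```

When p5 is held, the walk restores (p1 p5 p4), which is the RenameMap right after branch C; when it is not, the `None` result stands for the correct-but-slower recovery to Chk0.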

[ 12 ]

Good Time Part I

When is a good time to release recovery registers?

Don’t grab in first place: no branches since older checkpoint
• “Tail” checkpoint doesn’t grab p4 & p6, Chk1 didn’t grab p3 & p2

Spontaneously: all branches in a checkpoint have executed
• Chk1: branch C executes → release p5


[Figure: the running example again, showing Chk0 = (p1 p2 p3), Chk1 = (p1 p6 p4), and recovery register p5.]
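Both rules can be modeled with a small per-checkpoint bookkeeping object. This is a sketch, not the hardware: the class `RecoveryInterval` and its methods are hypothetical, and it folds the two rules into one test on un-executed branches (a register is grabbed only while some branch in the interval is still pending), which is a simplification of the wording on the slide.

```python
class RecoveryInterval:
    """Recovery registers held for one checkpoint interval (hypothetical model)."""

    def __init__(self):
        self.pending_branches = set()  # renamed but not yet executed
        self.held = set()              # over-written registers being kept

    def rename_branch(self, label):
        self.pending_branches.add(label)

    def rename_dest(self, overwritten):
        # Grab only when an un-executed branch could still need minimal recovery.
        if self.pending_branches:
            self.held.add(overwritten)

    def execute_branch(self, label):
        # Release everything once the last pending branch has executed.
        self.pending_branches.discard(label)
        if self.pending_branches:
            return set()
        released, self.held = self.held, set()
        return released


interval = RecoveryInterval()
interval.rename_dest("p3")     # A: no branch yet, not grabbed
interval.rename_dest("p2")     # B: no branch yet, not grabbed
interval.rename_branch("C")
interval.rename_dest("p5")     # D: follows un-executed branch C, grabbed
held_before = set(interval.held)
released = interval.execute_branch("C")
```

Replaying A–D this way holds only p5, and executing C releases it, in line with the two bullets above.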

[ 13 ]

Good Time Part II

Also victimize when rename needs registers to continue
• Chances are good un-executed branches are right
• Otherwise they would have been assigned checkpoints

CPROB reclamation policy adapts dynamically
• Branch mis-predictions tend to cluster [Heil+98]

• Recent mis-prediction → window empty, no need to victimize
• Hold recovery registers to “protect” upcoming branches

• No recent mis-prediction → window full, need to victimize
• Probably in a region of high-confidence branches

• Most mis-predicted branches resolve quickly after dispatch
• Chance of victimization in this “window” is small
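Victimization on demand might look like the following. It is a minimal sketch under assumptions: `allocate` and the deque of per-checkpoint recovery sets are hypothetical, registers in a recovery set are assumed to be held by nothing else, the oldest-first order is borrowed from the “victimize oldest checkpoint” policy named later in the talk, and p9 is a made-up register used only to show a second set.

```python
from collections import deque


def allocate(free_list, recovery_sets):
    """Pop a free register, victimizing the oldest recovery set if needed.

    free_list: list of free physical registers.
    recovery_sets: deque of sets of held recovery registers, oldest first.
    """
    while not free_list and recovery_sets:
        # Rename would stall: give up the oldest checkpoint's recovery registers.
        free_list.extend(sorted(recovery_sets.popleft()))
    return free_list.pop(0) if free_list else None


free_list = []
recovery_sets = deque([{"p5"}, {"p9"}])  # p9 is hypothetical
first = allocate(free_list, recovery_sets)
remaining = list(recovery_sets)
```

Nothing is victimized while the free list is non-empty, so recovery registers survive exactly as long as rename does not need them.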

[ 14 ]

Does CPROB Need a Giant ROB?

CPROB tries to support a large window
• Needs a large ROB to hold all insns, right? No

CPROB uses ROB for opportunistic recovery, not commit
• Only insns whose recovery registers are held need ROB entries
• Can victimize ROB space & recovery registers together
• Policy “victimize oldest checkpoint” meshes well with this

[ 15 ]

Implementation

How is CPROB register reclamation implemented?
• When/how are registers added to the free list?

First: how is CPR register reclamation implemented?
• Not using a circular queue free list enqueued at commit …
• Using register reference counting [Roth08]

[ 16 ]

CPR Register Reference Counting

Reference counts implemented as bit-matrix
• One column per physical register
• One row per entity that can hold physical register
• Issue queue entry, checkpoint, RenameMap

• Columns OR’ed together to form bitvector-style free list
• Registers allocated using encoders

[Figure: the running example next to a reference-count bit-matrix with columns 1–8 (p1–p8), rows IQ0, IQ1, IQ2, Chk0, Chk1, and RMap, and a Free vector below.]
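The bit-matrix can be modeled with one integer bitmask per row. The sketch below is an illustration, not the circuit: `bit`, `free_vector`, and `allocate` are hypothetical helpers, the row contents are filled in from the running example on the assumption that A and C occupy two issue queue entries, and taking the lowest set bit stands in for the encoder.

```python
NUM_PREGS = 8
ALL_BITS = (1 << NUM_PREGS) - 1


def bit(preg):
    """Column mask for physical register p<preg> (p1 is bit 0)."""
    return 1 << (preg - 1)


def free_vector(rows):
    """A register is free when its column is zero in every row."""
    used = 0
    for mask in rows.values():
        used |= mask
    return ~used & ALL_BITS


def allocate(rows, holder):
    """Pick the lowest-numbered free register and mark it in the holder's row."""
    free = free_vector(rows)
    if free == 0:
        return None
    lowest = free & -free          # isolate the lowest set bit
    rows[holder] |= lowest
    return lowest.bit_length()     # bit index + 1 == register number


rows = {
    "IQ0": bit(3) | bit(4),            # A: ld [p3] => p4 (assumed un-executed)
    "IQ1": bit(4),                     # C: brz p4, Q     (assumed un-executed)
    "IQ2": 0,
    "Chk0": bit(1) | bit(2) | bit(3),
    "Chk1": bit(1) | bit(6) | bit(4),
    "RMap": bit(1) | bit(8) | bit(7),
}
free_before = free_vector(rows)
got = allocate(rows, "RMap")
free_after = free_vector(rows)
```

Clearing a row (for example when an issue queue entry executes) frees a register only if no other row still has that column set, which is what makes the scheme a reference count.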

[ 17 ]

CPROB Extensions

Add recovery-register matrix rows
• One for each checkpoint
• One for RenameMap (“tail” checkpoint)
• CPROB rows can be cleared at any time
• CPR rows cleared according to strict CPR rules (for correctness)

[Figure: the bit-matrix from the previous slide extended with recovery rows Rec0, Rec1, and RRec; p5 appears as a held recovery register.]
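Adding the recovery rows to a bitmask model of the matrix shows why they can be dropped at any time: they only ever add bits to the "used" set. This is a sketch with hypothetical helper names; the CPR row contents assume A and C are un-executed, and putting p5 in Rec1 follows the "Chk1: branch C executes, release p5" bullet from the policy slide.

```python
ALL_BITS = 0xFF  # eight physical registers, p1..p8


def bit(preg):
    """Column mask for physical register p<preg> (p1 is bit 0)."""
    return 1 << (preg - 1)


def free_vector(cpr_rows, rec_rows):
    """Free = column is zero in every CPR row and every recovery row."""
    used = 0
    for mask in list(cpr_rows.values()) + list(rec_rows.values()):
        used |= mask
    return ~used & ALL_BITS


def victimize(rec_rows, name):
    """Clear one recovery row; always safe, costs only minimal recovery."""
    rec_rows[name] = 0


cpr_rows = {
    "IQ0": bit(3) | bit(4),            # A (assumed un-executed)
    "IQ1": bit(4),                     # C (assumed un-executed)
    "Chk0": bit(1) | bit(2) | bit(3),
    "Chk1": bit(1) | bit(6) | bit(4),
    "RMap": bit(1) | bit(8) | bit(7),
}
rec_rows = {"Rec0": 0, "Rec1": bit(5), "RRec": 0}

free_held = free_vector(cpr_rows, rec_rows)
victimize(rec_rows, "Rec1")
free_released = free_vector(cpr_rows, rec_rows)
```

While Rec1 holds p5 nothing is free; after the row is cleared, p5 shows up in the free vector and the CPR rows are untouched, so correctness still rests on the CPR rules alone.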

[ 18 ]

Outline

Introduction

CPR

CPROB

Evaluation
• CPROB
• CPROB-SMT

Related Work

Conclusion

[ 19 ]

Methodology

Benchmarks
• SPEC2000 compiled using -O4

For SMT
• Characterized as ILP, Branch-, Latency-, or bandWidth-bound
• 2-thread workloads using FIESTA methodology [Hilton+09]

Cycle-level simulation
• 4-way superscalar out-of-order, 17-stage pipeline, 1 or 2 threads
• 256-entry ROB, 32/32 INT/FP issue queue, 128/128 INT/FP phys-regs
• 8 checkpoints for CPR
• 48 Kbyte 3-table PPM branch predictor, 16K confidence pred
• 32 Kbyte I$/D$, 2 Mbyte 20-cycle L2, 400-cycle memory

[ 20 ]

CPROB vs. CPR vs. ROB

• Reduces checkpoint overhead significantly (4% → 1%)
• Remaining: miss-dependent mis-predicted branches

• Fixes CPR’s performance pathologies relative to ROB
• Outperforms both CPR and ROB in (almost) every case

[ 21 ]

CPROB is Energy Efficient

Rough argument (see paper for details) but here goes …
• Energy efficient = relative-to-ROB ED² < 1 [Martin+01]

• Dynamic energy consumption ~ dynamic instruction execution count
• CPR: FP: 1.031 / 1.041² = 0.95, INT: 1.035 / 0.996² = 1.04
• CPROB: FP: 1.001 / 1.055² = 0.90, INT: 1.013 / 1.014² = 0.98
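The ratios on this slide appear to follow from the definition of energy-delay-squared. As a worked step, assuming the first number in each ratio is the dynamic instruction count relative to ROB (the energy proxy, written N here) and the second is the speedup over ROB (written S); these symbols and that reading are assumptions, not notation from the talk:

```latex
\[
\frac{ED^2_{X}}{ED^2_{\mathrm{ROB}}}
  = \frac{E_X}{E_{\mathrm{ROB}}}\left(\frac{D_X}{D_{\mathrm{ROB}}}\right)^{2}
  \approx \frac{N_X / N_{\mathrm{ROB}}}{S_X^{2}},
\qquad S_X = \frac{D_{\mathrm{ROB}}}{D_X}
\]
\[
\text{CPR, FP:}\quad \frac{1.031}{1.041^{2}} = \frac{1.031}{1.0837} \approx 0.95
\]
```

Because delay enters squared, a few percent of speedup outweighs a comparable increase in re-executed instructions, and a slowdown (S < 1) pushes the ratio above 1.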

[ 22 ]

Register Usage: Spec Average

Physical registers are expensive: vary from 256 to 2K
• ROB: steady benefits to more registers
• CPR: roughly constant performance
+ Better than ROB at low registers (reclamation dominates)
– Worse with more registers (checkpoint overhead dominates)

• CPROB: few registers → does CPR, many registers → does ROB
• Adaptive → better than CPR and ROB at all points

[ 23 ]

Register Usage: SpecINT Gap

Same behavior in individual benchmarks
• Some phases need many registers
• Some phases need minimal recovery

[ 24 ]

Checkpoint Usage: Spec Average

Checkpoints are also expensive: vary from 2 to 16
• CPR: quite sensitive (needs 4 to break even with ROB)
• CPROB: removes CPR’s sensitivity to checkpoint count
• Makes CPR viable with 2 checkpoints

[ 25 ]

CPROB-SMT

+ CPROB fixes Bx pairings in SMT
• Branch-bound program paired with something else
• Remaining pathologies (LW & WW) due to D$ thrashing

+ Also relieves checkpoint pressure

See paper for other results
• Sensitivity, energy model details, area analysis, etc.

[ 26 ]

Related Work

Other aggressive register schemes
• Early register release [Ergin+04], Cherry [Martinez+02]

ROB-based large window [Cristal+04, Pericas+06]
• CPROB not relevant here

Control Independence [Cher+01, Chou+99, Gandhi+04, Rotenberg+99]
• Orthogonal, CPROB potentially synergistic with TCI [AlZawawi+07]

TurboROB [Akl+08]
• Accelerates serial recovery
• Compatible (maybe synergistic) with CPROB

[ 27 ]

Also Related: FIESTA

FIESTA: workloads for multi-program experiments
• Fixed Instruction with Equal STAndalone runtimes
• Pre-select application samples for equal standalone runtimes
• Run same samples consistently in every experiment
+ Fixed workloads → direct comparison with no result skew
• Plain, unambiguous speedup metrics

+ Minimal load imbalance by construction
• Remaining load imbalance is “un-fairness”

• Hilton et al. “FIESTA”, MoBS workshop, 2009.
• Consider using it in your multi-program experiments

[ 28 ]

Conclusions

Physical register file: critical out-of-order core resource
• Limits window size (especially for SMT)

CPR: execution-driven reclamation scheme
+ Much better scalability (good for SMT)
– Checkpoint overhead (surprise, even worse in SMT)
• Some pathologies relative to ROB commit-driven reclamation

CPROB: opportunistic hybrid register reclamation
• Holds recovery registers to eliminate checkpoint overhead
• Adaptively victimizes them when rename needs more
+ Eliminates CPR’s pathologies, outperforms both CPR and ROB

[ 29 ]

[ 30 ]

CPROB: ROB size?

Vary ROB size from 32 to 256 entries
• CPROB only needs 64–96 entries for full performance
• Degrades gracefully to 32
• ROB needs at least 128 entries

[ 31 ]

CFPROB

CPR base for CFP (Continual Flow Pipelines) [Srinivasan+04]
• Unblocks issue queue & registers under LLC misses

CFPROB: CFP on top of CPROB
• CPROB baseline fixes performance pathologies
• Small ROB = minimal recovery for miss-independent branches
• LLC-miss-dependent branch mis-predictions are rare
