PACT-18 :: Sep 15, 2009
CPROB: Checkpoint Processing with Opportunistic Minimal Recovery
Andrew Hilton, Neeraj Eswaran, Amir Roth
University of Pennsylvania
{adhilton,neeraj,amir}@cis.upenn.edu
[ 2 ]
CPROB in a Nutshell (Sorry, O’Reilly)
Physical register file constrains out-of-order window
  • Area and power intensive, latency complicates the scheduler
CPR (Checkpoint Processing and Recovery) [Akkary+03]
  + Aggressive, execution-driven register reclamation
  – Checkpoint overhead: recovery only to pre-created checkpoints
CPROB: hybrid register reclamation scheme
  • CPR + opportunistic checkpoint overhead elimination
  • Opportunistic = dynamically adapts to register demands
  + Outperforms both CPR and conventional reclamation
  + Simple low-overhead implementation
[ 3 ]
Outline
Introduction
CPR review
  • The “checkpoint overhead” problem
CPROB
Evaluation
Related Work
Conclusion
[ 4 ]
Conventional Register Reclamation
Running example
  • 7 instructions (A–G), 2 branches (C & E), 3 arch regs (r1–r3)
Conventional register reclamation (i.e., ROB)
  • Commit-driven reclamation: over-written register freed at commit
  • Needs 8 physical registers for this “window”
  • RenameMap + OverWritten (OW)
PC  Raw Insn           Renamed Insn       RenameMap (r1 r2 r3)  OW
                                          p1 p2 p3
A:  ld [r3] => r3      ld [p3] => p4      p1 p2 p4              p3
B:  sub r1, 4 => r2    sub p1, 4 => p5    p1 p5 p4              p2
C:  brz r3, Q          brz p4, Q          p1 p5 p4              -
D:  ld [r2] => r2      ld [p5] => p6      p1 p6 p4              p5
E:  brz r2, T          brz p6, T          p1 p6 p4              -
F:  add r1, 8 => r3    add p1, 8 => p7    p1 p6 p7              p4
G:  ld [r3] => r2      ld [p7] => p8      p1 p8 p7              p6
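The renaming in the running example can be replayed with a few lines of Python. This is only a sketch: the tuple encoding of the instructions and the names `PROGRAM`, `rename`, and `commit_free_order` are illustrative assumptions, not part of the talk. It reproduces the RenameMap and OW columns and lists the registers a commit-driven scheme would free, in commit order.

```python
# Running example: (pc, destination arch reg or None, source arch regs).
# The encoding is an assumption made for this sketch.
PROGRAM = [
    ("A", "r3", ["r3"]),  # ld  [r3]  => r3
    ("B", "r2", ["r1"]),  # sub r1, 4 => r2
    ("C", None, ["r3"]),  # brz r3, Q
    ("D", "r2", ["r2"]),  # ld  [r2]  => r2
    ("E", None, ["r2"]),  # brz r2, T
    ("F", "r3", ["r1"]),  # add r1, 8 => r3
    ("G", "r2", ["r3"]),  # ld  [r3]  => r2
]


def rename(program):
    """Rename in program order; return (pc, srcs, dest, map, overwritten) rows."""
    rename_map = {"r1": "p1", "r2": "p2", "r3": "p3"}
    next_preg = 4
    rows = []
    for pc, dest, srcs in program:
        renamed_srcs = [rename_map[s] for s in srcs]
        new_preg = overwritten = None
        if dest is not None:
            overwritten = rename_map[dest]      # previous mapping (OW)
            new_preg = f"p{next_preg}"
            next_preg += 1
            rename_map[dest] = new_preg
        snapshot = tuple(rename_map[r] for r in ("r1", "r2", "r3"))
        rows.append((pc, renamed_srcs, new_preg, snapshot, overwritten))
    return rows


def commit_free_order(rows):
    """Commit-driven reclamation: each OW register is freed as its insn commits."""
    return [ow for *_, ow in rows if ow is not None]


rows = rename(PROGRAM)
for row in rows:
    print(row)
print("freed at commit, in order:", commit_free_order(rows))
```

Because nothing is freed until commit, all eight registers p1–p8 are live while the seven instructions are in flight.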
[ 5 ]
CPR Register Reclamation
CPR (Checkpoint Processing & Recovery)
  • Execution-driven reclamation: sources + dest “freed” at execute
  • Needs only 7 physical registers for this window
    • Sources + dests of un-executed insns
    • RenameMap
    • Pre-created checkpoints
[Figure: running example under CPR. Un-executed insns A (ld [p3] => p4) and C (brz p4, Q) hold p3 and p4; RenameMap = p1 p8 p7; Chk0 = p1 p2 p3; Chk1 = p1 p6 p4; p5 is free.]
[ 6 ]
CPR Checkpoint Overhead
What if branch C mis-predicts?
  • Can’t recover to D … p5 (appears in D’s RenameMap) already freed!
  – Must recover to A (checkpoint) and re-execute A–C
  • This penalty is called checkpoint overhead
  • Squash & re-execute insns older than un-checkpointed mis-spec
  • No such penalty in ROB, which performs minimal recovery
[Figure: running example under CPR with branch C mis-predicted. p5 is already free, so recovery restores Chk0 = p1 p2 p3 and re-executes A, B and C; Chk1 = p1 p6 p4.]
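The penalty can be made concrete with a small sketch that lists which instructions CPR re-executes after a mis-prediction. The placement of Chk1 at branch E is an assumption inferred from its map (p1 p6 p4), and `cpr_reexecuted` is a hypothetical helper, not something from the talk.

```python
ORDER = list("ABCDEFG")

# Assumption: Chk0 captures the state before A; Chk1 is taken at branch E.
CHECKPOINT_AT = {"A", "E"}


def cpr_reexecuted(order, checkpoint_at, mispredicted):
    """Insns CPR squashes and re-executes when `mispredicted` resolves wrong."""
    idx = order.index(mispredicted)
    if mispredicted in checkpoint_at:
        return []  # the branch has its own checkpoint: no overhead
    # Otherwise roll back to the nearest older checkpoint and redo everything
    # from there up to and including the branch. Assumes one exists.
    start = max(order.index(pc) for pc in checkpoint_at if order.index(pc) < idx)
    return order[start:idx + 1]


print(cpr_reexecuted(ORDER, CHECKPOINT_AT, "C"))  # un-checkpointed branch
print(cpr_reexecuted(ORDER, CHECKPOINT_AT, "E"))  # checkpointed branch
```

A ROB-based design would return an empty list in both cases, since it can always restore the state at the branch itself.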
[ 7 ]
The Two Faces of CPR
• SpecFP: high bpred accuracy + need large window
  • Reclamation trumps overhead → average speedups
  • Some pathologies, e.g., galgel
• SpecINT: low bpred accuracy
  • Overhead dominates → average slowdown
[ 8 ]
Answer != More Checkpoints
• More checkpoints reduce overhead … but only a little
  – Sometimes hurt performance (tie up more registers)
  – Also, checkpoints are not cheap
[ 9 ]
But CPR is Great for SMT … Right?
+ SMT needs more registers …
+ And reduces branch mis-prediction penalty …
– But actually makes checkpoint overhead worse!
  • Distance from mis-predicted branch to older checkpoint has nothing to do with speculation depth
  • Threads share checkpoints (more un-checkpointed branches)
[ 10 ]
Outline
Introduction
CPR
CPROB
  • Basic idea (very simple)
  • Some policies
  • Implementation
Evaluation
Related Work
Conclusion
[ 11 ]
CPROB: The Key Idea
CPR + hold recovery (OW) registers opportunistically
  • Recovery registers (p5) available → no checkpoint overhead
    • Recover to younger checkpoint, then walk backwards serially
  • Recovery registers (p5) not available → overhead, but still correct
    • Recover to older checkpoint a la CPR
  • Opportunistically = can release recovery registers at any time!
[Figure: running example under CPROB. Same state as under CPR (Chk0 = p1 p2 p3, Chk1 = p1 p6 p4, RenameMap = p1 p8 p7), except that p5 is held as a recovery register.]
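A minimal sketch of the two recovery paths, assuming the overwritten registers of the instructions between the mis-predicted branch and the younger checkpoint are available as an undo log. The function `cprob_recover` and its arguments are hypothetical names chosen for illustration.

```python
CHK0 = {"r1": "p1", "r2": "p2", "r3": "p3"}  # older checkpoint
CHK1 = {"r1": "p1", "r2": "p6", "r3": "p4"}  # younger checkpoint

# Insns between mis-predicted branch C and Chk1, oldest first, as
# (arch reg written, physical register it overwrote). Here: only D.
UNDO_LOG = [("r2", "p5")]


def cprob_recover(younger_map, undo_log, held, older_map):
    """Return (restored RenameMap, kind of recovery performed)."""
    if all(ow in held for _, ow in undo_log):
        # Recovery registers still held: start from the younger checkpoint
        # and walk backwards, undoing one overwrite at a time.
        rmap = dict(younger_map)
        for arch, ow in reversed(undo_log):
            rmap[arch] = ow
        return rmap, "minimal"
    # Some recovery register was released: fall back to the older
    # checkpoint, as plain CPR would (still correct, but with overhead).
    return dict(older_map), "checkpoint"


print(cprob_recover(CHK1, UNDO_LOG, held={"p5"}, older_map=CHK0))
print(cprob_recover(CHK1, UNDO_LOG, held=set(), older_map=CHK0))
```

With p5 held, the restored map is p1 p5 p4, which is the RenameMap right after branch C; without it, recovery lands on Chk0 and A–C are re-executed.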
[ 12 ]
Good Time Part I
When is a good time to release recovery registers?
Don’t grab in first place: no branches since older checkpoint
  • “Tail” checkpoint doesn’t grab p4 & p6, Chk1 didn’t grab p3 & p2
Spontaneously: all branches in a checkpoint have executed
  • Chk1: branch C executes → release p5
[Figure: running example under CPROB, with p5 held as Chk1’s recovery register until branch C executes.]
[ 13 ]
Good Time Part II
Also victimize when rename needs registers to continue
  • Chances are good un-executed branches are right
  • Otherwise they would have been assigned checkpoints
CPROB reclamation policy adapts dynamically
  • Branch mis-predictions tend to cluster [Heil+98]
  • Recent mis-prediction → window empty, no need to victimize
    • Hold recovery registers to “protect” upcoming branches
  • No recent mis-prediction → window full, need to victimize
    • Probably in a region of high-confidence branches
  • Most mis-predicted branches resolve quickly after dispatch
    • Chance of victimization in this “window” is small
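The three release rules (don't grab, release spontaneously, victimize on demand) can be sketched as simple bookkeeping. The class `RecoveryRegisters` and its methods are hypothetical, and choosing the oldest checkpoint as the victim is an assumption for this sketch; the trace follows the running example.

```python
from collections import OrderedDict


class RecoveryRegisters:
    """Hypothetical bookkeeping for opportunistically held recovery registers."""

    def __init__(self):
        # checkpoint name -> [held registers, count of un-executed branches]
        self.intervals = OrderedDict()

    def open_interval(self, chk):
        self.intervals[chk] = [set(), 0]

    def branch_renamed(self, chk):
        self.intervals[chk][1] += 1

    def overwrite(self, chk, preg):
        """Return True if preg is held for recovery, False if never grabbed."""
        held, pending = self.intervals[chk]
        if pending == 0:
            return False  # no un-executed branch to protect
        held.add(preg)
        return True

    def branch_executed(self, chk):
        """Release everything once all branches in the interval have executed."""
        entry = self.intervals[chk]
        entry[1] -= 1
        if entry[1] == 0:
            released, entry[0] = entry[0], set()
            return released
        return set()

    def victimize_oldest(self):
        """Rename needs registers: give up the oldest non-empty held set."""
        for entry in self.intervals.values():
            if entry[0]:
                released, entry[0] = entry[0], set()
                return released
        return set()


def trace():
    rec = RecoveryRegisters()
    rec.open_interval("Chk1")
    grabbed = [rec.overwrite("Chk1", "p3"), rec.overwrite("Chk1", "p2")]  # A, B
    rec.branch_renamed("Chk1")                                            # C
    grabbed.append(rec.overwrite("Chk1", "p5"))                           # D
    return rec, grabbed


rec, grabbed = trace()
spontaneous = rec.branch_executed("Chk1")  # C executes
rec2, _ = trace()
victim = rec2.victimize_oldest()           # rename stalls before C executes
print(grabbed, spontaneous, victim)
```

Either way p5 returns to the free list; the difference is whether C was still protected when it did.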
[ 14 ]
Does CPROB Need a Giant ROB?
CPROB tries to support a large window
  • Needs a large ROB to hold all insns, right? No
CPROB uses ROB for opportunistic recovery, not commit
  • Only insns whose recovery registers are held need ROB entries
  • Can victimize ROB space & recovery registers together
  • Policy “victimize oldest checkpoint” meshes well with this
[ 15 ]
Implementation
How is CPROB register reclamation implemented?
  • When/how are registers added to the free list?
First: how is CPR register reclamation implemented?
  • Not using a circular queue free list enqueued at commit …
  • Using register reference counting [Roth08]
[ 16 ]
CPR Register Reference Counting
Reference counts implemented as bit-matrix
  • One column per physical register
  • One row per entity that can hold physical register
    • Issue queue entry, checkpoint, RenameMap
  • Columns OR’ed together to form bitvector-style free list
  • Registers allocated using encoders
[Figure: bit-matrix for the running example. Columns p1–p8; rows IQ0–IQ2, Chk0, Chk1, RMap, plus the OR’ed Free row. Un-executed insns ld [p3] => p4 and brz p4, Q set p3 and p4; Chk0 sets p1 p2 p3; Chk1 sets p1 p6 p4; RMap sets p1 p8 p7.]
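A rough software model of the bit-matrix, with each row stored as an integer bitmask. The assignment of instructions A and C to rows IQ0 and IQ1 is an assumption, as are the helper names; a register is treated as free when the OR of its column is 0, and the lowest free bit stands in for the hardware encoder.

```python
NUM_PREGS = 8
ALL = (1 << NUM_PREGS) - 1


def row(*pregs):
    """Bitmask with one bit per physical register (p1 is bit 0)."""
    bits = 0
    for p in pregs:
        bits |= 1 << (p - 1)
    return bits


# State of the running example; which IQ entry holds which insn is assumed.
matrix = {
    "IQ0": row(3, 4),      # A: ld [p3] => p4 (un-executed)
    "IQ1": row(4),         # C: brz p4, Q     (un-executed)
    "IQ2": 0,
    "Chk0": row(1, 2, 3),
    "Chk1": row(1, 6, 4),
    "RMap": row(1, 8, 7),
}


def free_vector(matrix):
    """OR all rows column-wise; registers whose column is all-zero are free."""
    in_use = 0
    for bits in matrix.values():
        in_use |= bits
    return ~in_use & ALL


def allocate(matrix, owner):
    """Pick the lowest-numbered free register and mark it held by `owner`."""
    free = free_vector(matrix)
    if free == 0:
        return None
    lowest = free & -free          # isolate the lowest set bit
    matrix[owner] |= lowest
    return lowest.bit_length()     # register number, 1-based


free_before = free_vector(matrix)
first = allocate(matrix, "IQ2")
second = allocate(matrix, "IQ2")
print(bin(free_before), first, second)
```

Clearing a row's bit when its holder executes, commits, or is discarded is all it takes to release a reference; no explicit counters are needed.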
[ 17 ]
CPROB Extensions
Add recovery-register matrix rows
  • One for each checkpoint
  • One for RenameMap (“tail” checkpoint)
  • CPROB rows can be cleared at any time
  • CPR rows cleared according to strict CPR rules (for correctness)
[Figure: the reference-counting bit-matrix extended with recovery-register rows Rec0, Rec1 and RRec; p5 is set in a recovery row.]
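The extension can be modelled by keeping the recovery rows in a separate group that may be cleared at will. This sketch assumes p5 sits in Rec1 (the row paired with Chk1) and that rows IQ0 and IQ1 hold instructions A and C; all names are illustrative.

```python
ALL = (1 << 8) - 1


def row(*pregs):
    """Bitmask with one bit per physical register (p1 is bit 0)."""
    bits = 0
    for p in pregs:
        bits |= 1 << (p - 1)
    return bits


# Strict CPR rows: cleared only according to CPR's rules.
cpr_rows = {
    "IQ0": row(3, 4), "IQ1": row(4), "IQ2": 0,
    "Chk0": row(1, 2, 3), "Chk1": row(1, 6, 4), "RMap": row(1, 8, 7),
}
# Opportunistic recovery rows: may be cleared at any time.
rec_rows = {"Rec0": 0, "Rec1": row(5), "RRec": 0}


def free_vector(cpr_rows, rec_rows):
    """A register is free only if no row in either group holds it."""
    in_use = 0
    for bits in list(cpr_rows.values()) + list(rec_rows.values()):
        in_use |= bits
    return ~in_use & ALL


before = free_vector(cpr_rows, rec_rows)  # p5 held for recovery: nothing free
rec_rows["Rec1"] = 0                      # victimize Chk1's recovery registers
after = free_vector(cpr_rows, rec_rows)   # p5 is free again
print(bin(before), bin(after))
```

Correctness never depends on the recovery rows, which is why clearing them needs no coordination with the rest of the pipeline.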
[ 18 ]
Outline
Introduction
CPR
CPROB
Evaluation
  • CPROB
  • CPROB-SMT
Related Work
Conclusion
[ 19 ]
Methodology
Benchmarks
  • SPEC2000 compiled using -O4
For SMT
  • Characterized as ILP, Branch-, Latency-, or bandWidth-bound
  • 2-thread workloads using FIESTA methodology [Hilton+09]
Cycle-level simulation
  • 4-way superscalar out-of-order, 17-stage pipeline, 1 or 2 threads
  • 256 ROB, 32/32 INT/FP issue queue, 128/128 INT/FP phys-regs
  • 8 checkpoints for CPR
  • 48 Kbyte 3-table PPM branch predictor, 16K confidence pred
  • 32 Kbyte I$/D$, 2 Mbyte 20-cycle L2, 400-cycle memory
[ 20 ]
CPROB vs. CPR vs. ROB
• Reduces checkpoint overhead significantly (4% → 1%)
  • Remaining: miss-dependent mis-predicted branches
• Fixes CPR’s performance pathologies relative to ROB
  • Outperforms both CPR and ROB in (almost) every case
[ 21 ]
CPROB is Energy Efficient
Rough argument (see paper for details) but here goes …
  • Energy efficient = relative-to-ROB ED² < 1 [Martin+01]
  • Dynamic energy consumption ~ dynamic instruction execution count
  • CPR: FP: 1.031 / 1.041² = 0.95, INT: 1.035 / 0.996² = 1.04
  • CPROB: FP: 1.001 / 1.055² = 0.90, INT: 1.013 / 1.014² = 0.98
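The arithmetic can be checked with a short sketch. It assumes each slide entry is relative energy (dynamic instruction count relative to ROB) divided by the square of relative performance, since delay is the inverse of speedup; the function name `relative_ed2` is illustrative.

```python
def relative_ed2(rel_energy, speedup):
    """Relative ED^2 = E * D^2, with relative delay D = 1 / speedup."""
    return rel_energy / speedup ** 2


# (relative energy, speedup over ROB), read off the slide.
configs = {
    ("CPR", "FP"): (1.031, 1.041),
    ("CPR", "INT"): (1.035, 0.996),
    ("CPROB", "FP"): (1.001, 1.055),
    ("CPROB", "INT"): (1.013, 1.014),
}

ed2 = {key: relative_ed2(e, s) for key, (e, s) in configs.items()}
for (scheme, suite), value in ed2.items():
    verdict = "efficient" if value < 1 else "not efficient"
    print(f"{scheme:5s} {suite:3s} ED^2 = {value:.3f} ({verdict})")
```

The computed values agree with the slide's figures to within rounding; only CPR on SpecINT ends up above 1.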
[ 22 ]
Register Usage: Spec Average
Physical registers are expensive: vary from 256 to 2K
  • ROB: steady benefits to more registers
  • CPR: roughly constant performance
    + Better than ROB at low registers (reclamation dominates)
    – Worse with more registers (checkpoint overhead dominates)
  • CPROB: few registers → does CPR, many registers → does ROB
    • Adaptive → better than CPR and ROB at all points
[ 23 ]
Register Usage: SpecINT Gap
Same behavior in individual benchmarks
  • Some phases need many registers
  • Some phases need minimal recovery
[ 24 ]
Checkpoint Usage: Spec Average
Checkpoints are also expensive: vary from 2 to 16
  • CPR: quite sensitive (needs 4 to break even with ROB)
  • CPROB: removes CPR’s sensitivity to checkpoint count
  • Makes CPR viable with 2 checkpoints
[ 25 ]
CPROB-SMT
+ CPROB fixes Bx pairings in SMT
  • Branch-bound program paired with something else
  • Remaining pathologies (LW & WW) due to D$ thrashing
+ Also relieves checkpoint pressure
See paper for other results
  • Sensitivity, energy model details, area analysis, etc.
[ 26 ]
Related Work
Other aggressive register schemes
  • Early register release [Ergin+04], Cherry [Martinez+02]
ROB based large window [Cristal+04, Pericas+06]
  • CPROB not relevant here
Control Independence [Cher+01, Chou+99, Gandhi+04, Rotenberg+99]
  • Orthogonal, CPROB potentially synergistic with TCI [AlZawawi+07]
TurboROB [Akl+08]
  • Accelerates serial recovery
  • Compatible (maybe synergistic) with CPROB
[ 27 ]
Also Related: FIESTA
FIESTA: workloads for multi-program experiments
  • Fixed Instruction with Equal STAndalone runtimes
  • Pre-select application samples for equal standalone runtimes
  • Run same samples consistently in every experiment
  + Fixed workloads → direct comparison with no result skew
    • Plain, unambiguous speedup metrics
  + Minimal load imbalance by construction
    • Remaining load imbalance is “un-fairness”
  • Hilton et al. “FIESTA”, MoBS workshop, 2009.
  • Consider using it in your multi-program experiments
[ 28 ]
Conclusions
Physical register file: critical out-of-order core resource
  • Limits window size (especially for SMT)
CPR: execution-driven reclamation scheme
  + Much better scalability (good for SMT)
  – Checkpoint overhead (surprise, even worse in SMT)
  • Some pathologies relative to ROB commit-driven reclamation
CPROB: opportunistic hybrid register reclamation
  • Holds recovery registers to eliminate checkpoint overhead
  • Adaptively victimizes them when rename needs more
  + Eliminates CPR’s pathologies, outperforms both CPR and ROB
[ 30 ]
CPROB: ROB size?
Vary ROB size from 32 to 256 entries
  • CPROB only needs 64-96 entries for full performance
  • Degrades gracefully to 32
  • ROB needs at least 128 entries
[ 31 ]
CFPROB
CPR base for CFP (Continual Flow Pipelines) [Srinivasan+04]
  • Unblocks issue queue & registers under LLC misses
CFPROB: CFP on top of CPROB
  • CPROB baseline fixes performance pathologies
  • Small ROB = minimal recovery for miss-independent branches
  • LLC-miss-dependent branch mis-predictions are rare