PACT-18 :: Sep 15, 2009 CPROB: Checkpoint Processing with Opportunistic Minimal Recovery Andrew Hilton, Neeraj Eswaran, Amir Roth University of Pennsylvania {adhilton,neeraj,amir}@cis.upenn.edu

Dec 14, 2015


PACT-18 :: Sep 15, 2009

CPROB: Checkpoint Processing with Opportunistic Minimal Recovery

Andrew Hilton, Neeraj Eswaran, Amir Roth
University of Pennsylvania

{adhilton,neeraj,amir}@cis.upenn.edu


CPROB in a Nutshell (Sorry, O’Reilly)

Physical register file constrains out-of-order window
• Area and power intensive, latency complicates the scheduler

CPR (Checkpoint Processing and Recovery) [Akkary+03]
+ Aggressive, execution-driven register reclamation
– Checkpoint overhead: recovery only to pre-created checkpoints

CPROB: hybrid register reclamation scheme
• CPR + opportunistic checkpoint overhead elimination
• Opportunistic = dynamically adapts to register demands
+ Outperforms both CPR and conventional reclamation
+ Simple low-overhead implementation


Outline

Introduction

CPR review
• The “checkpoint overhead” problem

CPROB

Evaluation

Related Work

Conclusion


Conventional Register Reclamation

Running example
• 7 instructions (A–G), 2 branches (C & E), 3 arch regs (r1–r3)

Conventional register reclamation (i.e., ROB)
• Commit-driven reclamation: over-written register freed at commit
• Needs 8 physical registers for this “window”
• RenameMap + OverWritten

PC  Raw Insn           Renamed Insn       RenameMap (r1 r2 r3)  OW
                                          p1 p2 p3
A:  ld [r3] => r3      ld [p3] => p4      p1 p2 p4              p3
B:  sub r1, 4 => r2    sub p1, 4 => p5    p1 p5 p4              p2
C:  brz r3, Q          brz p4, Q          p1 p5 p4              -
D:  ld [r2] => r2      ld [p5] => p6      p1 p6 p4              p5
E:  brz r2, T          brz p6, T          p1 p6 p4              -
F:  add r1, 8 => r3    add p1, 8 => p7    p1 p6 p7              p4
G:  ld [r3] => r2      ld [p7] => p8      p1 p8 p7              p6
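The renaming in the running example can be reproduced with a few lines of Python. This is only an illustrative sketch: the tuple encoding of instructions, the `rename` helper and its parameters are assumptions made for this example, not something taken from the talk.

```python
# Running example: (pc, destination arch reg or None, source arch regs).
PROGRAM = [
    ("A", "r3", ["r3"]),   # ld  [r3]  => r3
    ("B", "r2", ["r1"]),   # sub r1, 4 => r2
    ("C", None, ["r3"]),   # brz r3, Q
    ("D", "r2", ["r2"]),   # ld  [r2]  => r2
    ("E", None, ["r2"]),   # brz r2, T
    ("F", "r3", ["r1"]),   # add r1, 8 => r3
    ("G", "r2", ["r3"]),   # ld  [r3]  => r2
]

def rename(program, initial_map, first_free):
    """Rename in program order (hypothetical helper).

    Returns one row per instruction:
    (pc, physical sources, physical dest, RenameMap after the insn, overwritten reg).
    """
    rmap = dict(initial_map)
    next_preg = first_free
    rows = []
    for pc, dest, srcs in program:
        phys_srcs = [rmap[s] for s in srcs]   # read sources before remapping dest
        phys_dest = overwritten = None
        if dest is not None:
            overwritten = rmap[dest]          # old mapping: a ROB frees it at commit
            phys_dest = f"p{next_preg}"
            next_preg += 1
            rmap[dest] = phys_dest
        rows.append((pc, phys_srcs, phys_dest, dict(rmap), overwritten))
    return rows

ROWS = rename(PROGRAM, {"r1": "p1", "r2": "p2", "r3": "p3"}, first_free=4)
for pc, srcs, dest, rmap, ow in ROWS:
    print(pc, srcs, dest, [rmap[r] for r in ("r1", "r2", "r3")], ow or "-")

# Registers touched by this window: 3 initial mappings + 5 new destinations.
REGS_USED = {"p1", "p2", "p3"} | {row[2] for row in ROWS if row[2]}
```

With commit-driven reclamation none of the overwritten registers is freed until its instruction commits, which is why the whole window occupies all the registers collected in `REGS_USED`.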


CPR Register Reclamation

CPR (Checkpoint Processing & Recovery)
• Execution-driven reclamation: sources + dest “freed” at execute
• Needs only 7 physical registers for this window, namely those held by:
  • Sources + dests of un-executed insns
  • RenameMap
  • Pre-created checkpoints

[Figure: running example under CPR. Chk0 = p1 p2 p3, Chk1 = p1 p6 p4, RenameMap = p1 p8 p7; un-executed insns A (ld [p3] => p4) and C (brz p4, Q) hold p3 and p4; p5 is free.]
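The count of 7 registers can be checked by taking the union of everything that still holds a register. A minimal sketch, assuming the figure's state (A and C un-executed, checkpoints Chk0 and Chk1 as shown); the function name `live_registers` and the set encoding are made up for illustration.

```python
ALL_REGS = {f"p{i}" for i in range(1, 9)}        # p1..p8

CHECKPOINTS = {
    "Chk0": {"p1", "p2", "p3"},
    "Chk1": {"p1", "p6", "p4"},
}
RENAME_MAP = {"p1", "p8", "p7"}
# Sources + dests of un-executed insns (assumed from the figure: A and C).
UNEXECUTED = {
    "A": {"p3", "p4"},   # ld  [p3] => p4
    "C": {"p4"},         # brz p4, Q
}

def live_registers(checkpoints, rename_map, unexecuted):
    """Union of registers held by checkpoints, the RenameMap and un-executed insns."""
    live = set(rename_map)
    for regs in checkpoints.values():
        live |= regs
    for regs in unexecuted.values():
        live |= regs
    return live

LIVE = live_registers(CHECKPOINTS, RENAME_MAP, UNEXECUTED)
FREE = ALL_REGS - LIVE
print(sorted(LIVE), sorted(FREE))
```

Under these assumptions p5 is the only register nobody holds, because B and D, its producer and only consumer, have both executed and no checkpoint or RenameMap entry names it.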


CPR Checkpoint Overhead

What if branch C mis-predicts?
• Can’t recover to D … p5 (appears in D’s RenameMap) already freed!
– Must recover to A (checkpoint) and re-execute A–C
• This penalty is called checkpoint overhead
• Squash & re-execute insns older than un-checkpointed mis-spec
• No such penalty in ROB which performs minimal recovery

[Figure: running example under CPR with Chk0 = p1 p2 p3 and Chk1 = p1 p6 p4. p5 is already free, so recovery goes to Chk0 and A, B, C re-execute.]
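The penalty can be made concrete by finding the nearest checkpoint at or before the mis-predicted branch. This is a sketch under stated assumptions: checkpoints are taken to sit at A (Chk0) and E (Chk1), as read off the figure, and `cpr_recovery` is a hypothetical helper name.

```python
ORDER = ["A", "B", "C", "D", "E", "F", "G"]
# Assumed checkpoint placement: Chk0 taken at A, Chk1 taken at E.
CHECKPOINTED = {"A": "Chk0", "E": "Chk1"}

def cpr_recovery(mispredicted, order=ORDER, checkpointed=CHECKPOINTED):
    """Return (checkpoint restored, instructions re-executed as checkpoint overhead)."""
    idx = order.index(mispredicted)
    for i in range(idx, -1, -1):                 # search backwards for a checkpoint
        if order[i] in checkpointed:
            # A checkpointed branch recovers in place; otherwise everything from
            # the checkpoint up to and including the branch is squashed and redone.
            redo = [] if i == idx else order[i:idx + 1]
            return checkpointed[order[i]], redo
    raise ValueError(f"no checkpoint at or before {mispredicted}")

print(cpr_recovery("C"))
print(cpr_recovery("E"))
```

In this model a mis-prediction at C costs three re-executed instructions, while a mis-prediction at the checkpointed branch E costs none.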


The Two Faces of CPR

• SpecFP: high bpred accuracy + need large window
• Reclamation trumps overhead → average speedups
• Some pathologies, e.g., galgel

• SpecINT: low bpred accuracy
• Overhead dominates → average slowdown


Answer != More Checkpoints

• More checkpoints reduce overhead … but only a little
– Sometimes hurt performance (tie up more registers)
– Also, checkpoints are not cheap


But CPR is Great for SMT … Right?

+ SMT needs more registers …
+ And reduces branch mis-prediction penalty …
– But actually makes checkpoint overhead worse!
• Distance from mis-predicted branch to older checkpoint has nothing to do with speculation depth
• Threads share checkpoints (more un-checkpointed branches)


Outline

Introduction

CPR

CPROB
• Basic idea (very simple)
• Some policies
• Implementation

Evaluation

Related Work

Conclusion


CPROB: The Key Idea

CPR + hold recovery (OW) registers opportunistically
• Recovery registers (p5) available → no checkpoint overhead
• Recover to younger checkpoint, then walk backwards serially

• Recovery registers (p5) not available → overhead, but still correct
• Recover to older checkpoint a la CPR

• Opportunistically = can release recovery registers at any time!

[Figure: running example under CPROB with Chk0 = p1 p2 p3 and Chk1 = p1 p6 p4; p5 is held as a recovery register.]
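The two recovery paths can be sketched as a single function. The undo-log encoding, the `held` set and the name `cprob_recover` are assumptions for illustration; the talk does not spell out the hardware mechanism at this level.

```python
def cprob_recover(younger_chk_map, undo_log, held):
    """Try minimal recovery for a mis-predicted, un-checkpointed branch.

    younger_chk_map: RenameMap saved in the nearest younger checkpoint.
    undo_log: (arch reg, overwritten phys reg) for each insn between the branch
              and that checkpoint, oldest first.
    held: recovery registers that have not been released yet.
    Returns the RenameMap just after the branch, or None when a needed recovery
    register is gone and the core must fall back to the older checkpoint.
    """
    needed = {ow for _, ow in undo_log}
    if not needed <= held:
        return None                      # CPR-style recovery: slower, still correct
    rmap = dict(younger_chk_map)
    for arch, ow in reversed(undo_log):  # walk backwards serially, youngest first
        rmap[arch] = ow
    return rmap

CHK1 = {"r1": "p1", "r2": "p6", "r3": "p4"}
UNDO_AFTER_C = [("r2", "p5")]            # D: ld [p5] => p6 overwrote r2's mapping p5

MINIMAL = cprob_recover(CHK1, UNDO_AFTER_C, held={"p5"})
FALLBACK = cprob_recover(CHK1, UNDO_AFTER_C, held=set())
print(MINIMAL, FALLBACK)
```

When p5 is still held, undoing D on top of Chk1 yields p1 p5 p4, the RenameMap right after C, so nothing older than the branch is re-executed.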


Good Time Part I

When is a good time to release recovery registers?

Don’t grab in first place: no branches since older checkpoint
• “Tail” checkpoint doesn’t grab p4 & p6, Chk1 didn’t grab p3 & p2

Spontaneously: all branches in a checkpoint have executed
• Chk1: branch C executes → release p5

[Figure: running example under CPROB with Chk0 = p1 p2 p3 and Chk1 = p1 p6 p4; p5 is the recovery register held for Chk1.]
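Both rules can be modelled with a small per-checkpoint tracker. This is a simplified sketch: it folds the two rules into one test (hold an overwritten register only while the interval has an un-executed branch), which is an assumption of this example, and the class and method names are hypothetical.

```python
class CheckpointInterval:
    """Tracks the recovery registers held on behalf of one checkpoint."""

    def __init__(self, name):
        self.name = name
        self.pending_branches = set()   # renamed but not yet executed
        self.recovery_regs = []

    def rename_branch(self, pc):
        self.pending_branches.add(pc)

    def rename_overwrite(self, preg):
        """Return True if the overwritten register is held for recovery."""
        if not self.pending_branches:
            return False                # no branch to protect: don't grab it
        self.recovery_regs.append(preg)
        return True

    def execute_branch(self, pc):
        """Return the registers released because this branch executed."""
        self.pending_branches.discard(pc)
        if self.pending_branches:
            return []
        released, self.recovery_regs = self.recovery_regs, []
        return released

chk1 = CheckpointInterval("Chk1")
GRABBED = [chk1.rename_overwrite("p3"),   # A overwrites p3: no branch yet
           chk1.rename_overwrite("p2")]   # B overwrites p2: no branch yet
chk1.rename_branch("C")
GRABBED.append(chk1.rename_overwrite("p5"))   # D overwrites p5 while C is pending
RELEASED = chk1.execute_branch("C")
print(GRABBED, RELEASED)
```

The trace follows the slide's example: p3 and p2 are never grabbed, p5 is held while C is outstanding, and it is released as soon as C executes.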


Good Time Part II

Also victimize when rename needs registers to continue
• Chances are good un-executed branches are right
• Otherwise they would have been assigned checkpoints

CPROB reclamation policy adapts dynamically
• Branch mis-predictions tend to cluster [Heil+98]

• Recent mis-prediction → window empty, no need to victimize
• Hold recovery registers to “protect” upcoming branches

• No recent mis-prediction → window full, need to victimize
• Probably in a region of high-confidence branches

• Most mis-predicted branches resolve quickly after dispatch
• Chance of victimization in this “window” is small
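Victimization on demand might look like the allocator below. It is a sketch only: it assumes that releasing a recovery register makes it immediately allocatable (in a real design other holders could still pin it), it uses the "victimize oldest checkpoint" policy, and `RecoveryPool` is a made-up name.

```python
from collections import OrderedDict

class RecoveryPool:
    """Free list plus opportunistically held recovery registers."""

    def __init__(self, free):
        self.free = list(free)
        self.held = OrderedDict()        # checkpoint -> recovery regs, oldest first

    def hold(self, checkpoint, preg):
        self.held.setdefault(checkpoint, []).append(preg)

    def victimize(self, checkpoint):
        """Give up minimal recovery for this checkpoint and free its registers."""
        self.free.extend(self.held.pop(checkpoint, []))

    def allocate(self):
        """Allocate for rename, victimizing the oldest checkpoint if needed."""
        while not self.free and self.held:
            self.victimize(next(iter(self.held)))
        return self.free.pop(0) if self.free else None   # None: rename stalls

pool = RecoveryPool(free=["p7"])
pool.hold("Chk1", "p5")
FIRST = pool.allocate()     # free list still has p7, nothing is victimized
SECOND = pool.allocate()    # free list empty: Chk1's recovery register is taken
THIRD = pool.allocate()     # nothing left to victimize
print(FIRST, SECOND, THIRD)
```

As long as the free list is non-empty the recovery registers stay put, which is the adaptive behaviour described above: they are only given up when rename would otherwise stall.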


Does CPROB Need a Giant ROB?

CPROB tries to support a large window
• Needs a large ROB to hold all insns, right? No

CPROB uses ROB for opportunistic recovery, not commit
• Only insns whose recovery registers are held need ROB entries
• Can victimize ROB space & recovery registers together
• Policy “victimize oldest checkpoint” meshes well with this


Implementation

How is CPROB register reclamation implemented?
• When/how are registers added to the free list?

First: how is CPR register reclamation implemented?
• Not using a circular queue free list enqueued at commit …
• Using register reference counting [Roth08]


CPR Register Reference Counting

Reference counts implemented as bit-matrix
• One column per physical register
• One row per entity that can hold physical register
• Issue queue entry, checkpoint, RenameMap

• Columns OR’ed together to form bitvector-style free list
• Registers allocated using encoders

[Figure: reference-count bit-matrix for the running example. Columns: physical registers 1–8. Rows: IQ0, IQ1, IQ2, Chk0, Chk1, RMap, plus the derived Free vector.]
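The matrix can be mimicked with one integer bitvector per row. This is a minimal sketch; the row contents are filled in from the running example as an assumption (IQ0 = insn A, IQ1 = insn C), and the helper names `bits`, `in_use`, `free_vector` and `allocate` are invented for this illustration.

```python
NUM_PREGS = 8
MASK = (1 << NUM_PREGS) - 1

def bits(*pregs):
    """Bitvector with bit (n - 1) set for each physical register pn."""
    vec = 0
    for p in pregs:
        vec |= 1 << (p - 1)
    return vec

# One row per entity that can hold a physical register.
MATRIX = {
    "IQ0":  bits(3, 4),      # ld  [p3] => p4
    "IQ1":  bits(4),         # brz p4, Q
    "IQ2":  0,
    "Chk0": bits(1, 2, 3),
    "Chk1": bits(1, 4, 6),
    "RMap": bits(1, 7, 8),
}

def in_use(matrix):
    """OR every column down the rows: a register is in use if any row holds it."""
    used = 0
    for row in matrix.values():
        used |= row
    return used

def free_vector(matrix):
    return ~in_use(matrix) & MASK

def allocate(matrix):
    """Priority-encode the lowest free register number, or None if none is free."""
    free = free_vector(matrix)
    return (free & -free).bit_length() if free else None

print(format(in_use(MATRIX), "08b"), format(free_vector(MATRIX), "08b"), allocate(MATRIX))
```

A register becomes free the moment the last row holding it clears its bit, so no commit-ordered free queue is needed.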


CPROB Extensions

Add recovery-register matrix rows
• One for each checkpoint
• One for RenameMap (“tail” checkpoint)
• CPROB rows can be cleared at any time
• CPR rows cleared according to strict CPR rules (for correctness)

[Figure: the reference-count bit-matrix extended with recovery rows Rec0, Rec1 and RRec; p5 is held as a recovery register.]
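The extension amounts to extra rows that take part in the same OR but may be zeroed whenever convenient. A sketch under the assumption that p5 sits in row Rec1 (the recovery row paired with Chk1); all helper names are hypothetical.

```python
MASK = (1 << 8) - 1

def bits(*pregs):
    """Bitvector with bit (n - 1) set for each physical register pn."""
    vec = 0
    for p in pregs:
        vec |= 1 << (p - 1)
    return vec

# CPR rows: cleared only according to strict CPR rules.
CPR_ROWS = {"IQ0": bits(3, 4), "IQ1": bits(4), "Chk0": bits(1, 2, 3),
            "Chk1": bits(1, 4, 6), "RMap": bits(1, 7, 8)}
# CPROB recovery rows: may be cleared at any time.
REC_ROWS = {"Rec0": 0, "Rec1": bits(5), "RRec": 0}

def free_vector(cpr_rows, rec_rows):
    """A register is free only if no CPR row and no recovery row holds it."""
    used = 0
    for row in list(cpr_rows.values()) + list(rec_rows.values()):
        used |= row
    return ~used & MASK

def victimize(rec_rows, name):
    """Clear one recovery row; only minimal recovery is lost, not correctness."""
    rec_rows[name] = 0

BEFORE = free_vector(CPR_ROWS, REC_ROWS)
victimize(REC_ROWS, "Rec1")
AFTER = free_vector(CPR_ROWS, REC_ROWS)
print(format(BEFORE, "08b"), format(AFTER, "08b"))
```

Clearing Rec1 changes nothing in the CPR rows, so the design falls back to plain CPR behaviour for that checkpoint while p5 returns to the free list.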


Outline

Introduction

CPR

CPROB

Evaluation
• CPROB
• CPROB-SMT

Related Work

Conclusion


Methodology

Benchmarks
• SPEC2000 compiled using -O4

For SMT
• Characterized as ILP, Branch-, Latency-, or bandWidth-bound
• 2-thread workloads using FIESTA methodology [Hilton+09]

Cycle-level simulation
• 4-way superscalar out-of-order, 17-stage pipeline, 1 or 2 threads
• 256 ROB, 32/32 INT/FP issue queue, 128/128 INT/FP phys-regs
• 8 checkpoints for CPR
• 48 Kbyte 3-table PPM branch predictor, 16K confidence pred
• 32 Kbyte I$/D$, 2 Mbyte 20-cycle L2, 400-cycle memory


CPROB vs. CPR vs. ROB

• Reduces checkpoint overhead significantly (4% → 1%)
• Remaining: miss-dependent mis-predicted branches

• Fixes CPR’s performance pathologies relative to ROB
• Outperforms both CPR and ROB in (almost) every case


CPROB is Energy Efficient

Rough argument (see paper for details) but here goes …
• Energy efficient = relative-to-ROB ED² < 1 [Martin+01]

• Dynamic energy consumption ~ dynamic instruction execution count
• CPR: FP: 1.031 / 1.041² = 0.95, INT: 1.035 / 0.996² = 1.04
• CPROB: FP: 1.001 / 1.055² = 0.90, INT: 1.013 / 1.014² = 0.98
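The arithmetic follows from ED² = E · D²: relative delay is the inverse of speedup, so relative ED² is relative energy divided by speedup squared. The sketch below recomputes the four figures, reading each numerator as relative dynamic instruction count (the energy proxy) and each denominator as speedup over ROB; that reading and the function name are assumptions of this example.

```python
def relative_ed2(rel_energy, speedup):
    """Relative ED^2 = E_rel * D_rel^2, with D_rel = 1 / speedup."""
    return rel_energy / speedup ** 2

# (relative instruction count, speedup over ROB) as quoted on the slide.
INPUTS = {
    ("CPR", "FP"):    (1.031, 1.041),
    ("CPR", "INT"):   (1.035, 0.996),
    ("CPROB", "FP"):  (1.001, 1.055),
    ("CPROB", "INT"): (1.013, 1.014),
}

RESULTS = {key: relative_ed2(e, s) for key, (e, s) in INPUTS.items()}
for (scheme, suite), value in RESULTS.items():
    verdict = "energy efficient" if value < 1 else "not energy efficient"
    print(f"{scheme:5s} {suite:3s} ED2 = {value:.3f} ({verdict})")
```

Recomputing from the rounded inputs gives roughly 0.985 for CPROB INT, which the slide reports as 0.98; the small difference presumably comes from rounding of the inputs. By this measure CPR on SpecINT is the only case above 1.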


Register Usage: Spec Average

Physical registers are expensive: vary from 256 to 2K
• ROB: steady benefits to more registers
• CPR: roughly constant performance
+ Better than ROB at low registers (reclamation dominates)
– Worse with more registers (checkpoint overhead dominates)

• CPROB: few registers → does CPR, many registers → does ROB
• Adaptive → better than CPR and ROB at all points


Register Usage: SpecINT Gap

Same behavior in individual benchmarks
• Some phases need many registers
• Some phases need minimal recovery


Checkpoint Usage: Spec Average

Checkpoints are also expensive: vary from 2 to 16
• CPR: quite sensitive (needs 4 to break even with ROB)
• CPROB: removes CPR’s sensitivity to checkpoint count
• Makes CPR viable with 2 checkpoints


CPROB-SMT

+ CPROB fixes Bx pairings in SMT
• Branch-bound program paired with something else
• Remaining pathologies (LW & WW) due to D$ thrashing

+ Also relieves checkpoint pressure

See paper for other results
• Sensitivity, energy model details, area analysis, etc.


Related Work

Other aggressive register schemes
• Early register release [Ergin+04], Cherry [Martinez+02]

ROB based large window [Cristal+04, Pericas+06]
• CPROB not relevant here

Control Independence [Cher+01, Chou+99, Gandhi+04, Rotenberg+99]
• Orthogonal, CPROB potentially synergistic with TCI [AlZawawi+07]

TurboROB [Akl+08]
• Accelerates serial recovery
• Compatible (maybe synergistic) with CPROB


Also Related: FIESTA

FIESTA: workloads for multi-program experiments
• Fixed Instruction with Equal STAndalone runtimes
• Pre-select application samples for equal standalone runtimes
• Run same samples consistently in every experiment
+ Fixed workloads → direct comparison with no result skew
• Plain, unambiguous speedup metrics

+ Minimal load imbalance by construction
• Remaining load imbalance is “un-fairness”

• Hilton et al. “FIESTA”, MoBS workshop, 2009.
• Consider using it in your multi-program experiments


Conclusions

Physical register file: critical out-of-order core resource
• Limits window size (especially for SMT)

CPR: execution-driven reclamation scheme
+ Much better scalability (good for SMT)
– Checkpoint overhead (surprise, even worse in SMT)
• Some pathologies relative to ROB commit-driven reclamation

CPROB: opportunistic hybrid register reclamation
• Holds recovery registers to eliminate checkpoint overhead
• Adaptively victimizes them when rename needs more
+ Eliminates CPR’s pathologies, outperforms both CPR and ROB


CPROB: ROB size?

Vary ROB size from 32 to 256 entries
• CPROB only needs 64-96 entries for full performance
• Degrades gracefully to 32
• ROB needs at least 128 entries


CFPROB

CPR base for CFP (Continual Flow Pipelines) [Srinivasan+04]
• Unblocks issue queue & registers under LLC misses

CFPROB: CFP on top of CPROB
• CPROB baseline fixes performance pathologies
• Small ROB = minimal recovery for miss-independent branches
• LLC-miss-dependent branch mis-predictions are rare