
ICCD’03


Distributed Reorder Buffer Schemes for Low Power *

*supported in part by DARPA through the PAC-C program and NSF

Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose

Department of Computer Science

State University of New York

Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

21st International Conference on Computer Design (ICCD’03), October 14th 2003

Outline

– Reorder Buffer (ROB) complexities

– Motivation for the low-complexity ROB

– Low-complexity ROB designs
    Fully Distributed ROB
    Retention Latches (RLs) revisited (ICS’02)
    Combined Scheme

– Results

– Concluding remarks

P6-style Superscalar Datapath

[Figure: pipeline with Fetch (F1, F2) and Decode/Dispatch (D1, D2) stages, instruction dispatch into the issue queue (IQ), instruction issue to the function units FU1, FU2 ... FUm (EX), the ROB, the Architectural Register File (ARF), and the result/status forwarding buses.]

PPC 620-style Superscalar Datapath

[Figure: the same pipeline (F1, F2, D1, D2, IQ, FU1, FU2 ... FUm, EX, ARF) with a ROB and a separate RB.]

ROB Port Requirements for a W-way CPU

– Decode/Dispatch: W write ports to set up entries

– Dispatch/Issue: 2W read ports to read the source operands

– Writeback: W write ports to write results

– Commit: W read ports for instruction commitment
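
The port counts on this slide can be tallied for any machine width W. The snippet below is a minimal sketch of that arithmetic only; the function name and dictionary keys are illustrative and do not come from the slides.

```python
def rob_port_counts(width: int) -> dict:
    """Port counts of a centralized ROB in a W-way machine, as listed on the slide."""
    return {
        "decode_dispatch_write": width,    # W write ports to set up entries
        "dispatch_issue_read": 2 * width,  # 2W read ports for source operands
        "writeback_write": width,          # W write ports to write results
        "commit_read": width,              # W read ports for commitment
    }


counts = rob_port_counts(4)           # 4-way machine, as in the simulated system
total_ports = sum(counts.values())    # W + 2W + W + W = 5W
print(counts)
print("total ports:", total_ports)
```

The total grows as 5W, which is why removing the 2W source read ports is the largest single reduction available.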

What This Work is All About

– ROB complexity reduction is important for reducing power and improving performance
    ROB dissipates a non-trivial fraction of the total chip power
    ROB accesses stretch over several cycles

– Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance

Comparison of ROB Bitcells (0.18µ, TSMC)

[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]

Area reduction: 71%

Shorter bitlines and wordlines


Reorder Buffer Distribution

[Figure: the P6-style datapath (F1, F2, D1, D2, IQ, FU1, FU2 ... FUm, EX, ARF) with the result storage moved into ROB Components (ROBCs): ROBC 1, ROBC 2 ... ROBC m. The ROB itself holds pointers to entries within the ROBCs.]

Impact of Distributing the ROB

– Each ROBC is effectively a small Rename Buffer
    Smaller read/write access energy
    Faster access time

– Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs
    Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)

– Fits in naturally with a multi-clustered datapath design

Problems with the earlier Multi-banked RF Schemes

– Port conflicts result in performance penalty

– Interconnection network is more complex

Problems with the earlier Multi-banked RF Schemes, and some good news!

– Port conflicts result in performance penalty
    Totally avoid write port conflicts
    Minimize read port conflicts at commitment
    Totally avoid source read port conflicts

– Interconnection network is more complex
    Completely remove source read ports

ROBCs Assigned to Each Function Unit

[Figure: the centralized ROB (entries 1 ... n) holds an (FU_id, offset) pair per instruction. Each pair points to an entry in one of the distributed ROBCs, ROBC #1 ... ROBC #m, which are attached to FU #1 ... FU #m.]

Good News: Write port conflicts are avoided

[Figure: FU #1 ... FU #m each write results into their own ROBC #1 ... ROBC #m through 1 write port. The centralized ROB (entries 1 ... n) holds an (FU_id, offset) pair per instruction.]

Round Robin Scheduling at Dispatch Time

[Figure sequence: instructions ADD, SUB and AND are dispatched in order. There are four Int ADD ROBCs (#1 to #4), each shown with entries 1 and 2. Each instruction reserves an entry in the next Int ADD ROBC in round-robin order, and the centralized ROB (entries 1 ... n) records the resulting (FU_id, offset) pair:]

instruction   FU_id   offset
ADD           1       1
SUB           2       1
AND           3       1
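
The dispatch sequence in this animation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name is hypothetical, offsets are handed out in increasing order and never freed, and a full ROBC is simply skipped (the slides only show the case where every ROBC has room).

```python
class RoundRobinDispatcher:
    """Assign instructions of one class to a group of ROBCs in round-robin order."""

    def __init__(self, fu_ids, entries_per_robc):
        self.fu_ids = list(fu_ids)
        self.capacity = entries_per_robc
        self.used = {fu: 0 for fu in self.fu_ids}  # entries reserved per ROBC
        self.next_idx = 0                          # round-robin pointer
        self.central_rob = []                      # (instruction, FU_id, offset)

    def dispatch(self, instruction):
        n = len(self.fu_ids)
        for step in range(n):
            idx = (self.next_idx + step) % n
            fu = self.fu_ids[idx]
            if self.used[fu] < self.capacity:      # assumption: skip full ROBCs
                self.used[fu] += 1
                self.next_idx = (idx + 1) % n
                entry = (instruction, fu, self.used[fu])
                self.central_rob.append(entry)     # ROB keeps only the pointer
                return entry
        return None                                # every ROBC is full: dispatch blocks


dispatcher = RoundRobinDispatcher(fu_ids=[1, 2, 3, 4], entries_per_robc=2)
results = [dispatcher.dispatch(op) for op in ["ADD", "SUB", "AND"]]
print(results)
```

With four Int ADD ROBCs this reproduces the pairs shown in the animation: ADD at (1, 1), SUB at (2, 1) and AND at (3, 1).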

Good News: Avoiding Read Port Conflicts

[Figure: ADD (1, 1), SUB (2, 1) and AND (3, 1) occupy entries in three different Int ADD ROBCs. Each ROBC has 1 read port to commitment.]

[Figure sequence: after ADD (1, 1), SUB (2, 1) and AND (3, 1), the instructions MUL and DIV are dispatched. Both reserve entries in the single IntMUL/DIV ROBC #5, so the centralized ROB records MUL as (FU_id 5, offset 1) and DIV as (FU_id 5, offset 2).]

Read Port Conflicts at Commitment

[Figure: the centralized ROB holds ADD (1, 1), SUB (2, 1), AND (3, 1), MUL (5, 1) and DIV (5, 2). MUL and DIV both reside in IntMUL/DIV ROBC #5, which has 1 read port to commitment.]

CONFLICT: if MUL and DIV want to commit in the same cycle
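
The commitment conflict can be illustrated with a short check over the head of the centralized ROB. This is a sketch under assumptions that are not stated on the slides: commitment is in order, a conflicting instruction simply waits for the next cycle, and the function name and parameters are hypothetical.

```python
def committable_this_cycle(rob_head, width=4, read_ports_per_robc=1):
    """Return the instructions that can commit in one cycle.

    rob_head is a list of (instruction, FU_id, offset) tuples in program order.
    Commitment stops at the first instruction whose ROBC has no free read port.
    """
    reads_used = {}
    committed = []
    for instruction, fu_id, _offset in rob_head[:width]:
        if reads_used.get(fu_id, 0) >= read_ports_per_robc:
            break  # assumption: in-order commit, the rest waits a cycle
        reads_used[fu_id] = reads_used.get(fu_id, 0) + 1
        committed.append(instruction)
    return committed


no_conflict = committable_this_cycle([("ADD", 1, 1), ("SUB", 2, 1), ("AND", 3, 1)])
conflict = committable_this_cycle([("MUL", 5, 1), ("DIV", 5, 2), ("ADD", 1, 1)])
print(no_conflict)
print(conflict)
```

ADD, SUB and AND sit in different ROBCs and commit together, while MUL and DIV share ROBC #5, so only MUL commits in that cycle.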

Distributed ROB Design 1: with source read ports

Port requirements of each ROBC:

– Writeback: 1 write port to write results

– Dispatch/Issue: 1 read port to read the source operands

– Commit: 1 read port for instruction commitment


Experimental Setup: the AccuPower (DATE’02)

[Figure: AccuPower tool flow.
    Compiled SPEC benchmarks and datapath specs -> Microarchitectural Simulator (rooted in SimpleScalar) -> Performance stats
    Microarchitectural Simulator -> transition counts, context information -> Energy/Power Estimator
    VLSI layout data -> SPICE deck -> SPICE -> SPICE measures of energy per transition -> Energy/Power Estimator
    Energy/Power Estimator -> Power/energy stats]

Configuration of the Simulated System

Machine width        4-way
Issue Queue          32 entries
Reorder Buffer       96 entries
Load/Store Queue     32 entries

Simulated the execution of SPEC2000 benchmarks

Peak/Average demands on the number of ROBC entries

ROBC type                   IntADD #1-#4   IntMUL/DIV     FPADD #1-#4    FPMUL/DIV      Load
                            peak   avg.    peak   avg.    peak   avg.    peak   avg.    peak   avg.
SPEC 2000 Integer Average   16.9   4.4     4.1    0.1     1.6    0.04    3.8    0.04    28.6   9.3
SPEC 2000 FP Average        14.2   4.9     3.2    0.8     3.8    0.6     6.7    1.1     23.5   7.5
SPEC 2000 Average           15.7   4.6     3.7    0.4     2.6    0.3     5.0    0.5     26.4   8.5
Entries assigned per ROBC   8              4              4              4              16

Number of entries assigned to each ROBC: 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries (the 8_4_4_4_16 configuration)

Percentage of cycles when dispatch blocks for 8_4_4_4_16

ROBC type                   IntADD #1-#4   IntMUL/DIV   FPADD #1-#4   FPMUL/DIV   Load
SPEC 2000 Integer Average   0.9            0.1          0             0           5.2
SPEC 2000 FP Average        1.5            1.0          0.1           0.8         1.9
SPEC 2000 Average           1.2            0.5          0             0.4         3.8

Average IPC drop with the 8_4_4_4_16 configuration (72 entries) = 4.8%


Reducing performance penalty: 12_6_4_6_20 Configuration

ROBC type                   IntADD #1-#4   IntMUL/DIV   FPADD #1-#4   FPMUL/DIV   Load
Entries assigned per ROBC   12             6            4             6           20

12 + 12 + 12 + 12 + 6 + 4 + 4 + 4 + 4 + 6 + 20 = 96 entries (the 12_6_4_6_20 configuration)
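
The two configuration names encode the entries per ROBC type, and the totals follow from the number of ROBCs of each type. The sketch below only reproduces that arithmetic; the function name is illustrative, and the ROBC counts (four IntADD, one IntMUL/DIV, four FPADD, one FPMUL/DIV, one Load) are read off the per-ROBC entry lists on the slides.

```python
# Number of ROBCs of each type, in the order used by the configuration name:
# IntADD, IntMUL/DIV, FPADD, FPMUL/DIV, Load
ROBC_COUNTS = (4, 1, 4, 1, 1)


def total_robc_entries(config: str, counts=ROBC_COUNTS) -> int:
    """Total ROBC entries for a configuration name such as '8_4_4_4_16'."""
    sizes = [int(field) for field in config.split("_")]
    return sum(size * count for size, count in zip(sizes, counts))


small = total_robc_entries("8_4_4_4_16")
large = total_robc_entries("12_6_4_6_20")
print(small, large)
```

The larger configuration matches the 96-entry ROB of the simulated baseline, while the smaller one uses 72 entries in total.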


Performance Results for 12_6_4_6_20 Configuration

[Figure: IPC per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series:
    Base, 2-cycle ROB access and full bypass
    2 read ports, 12_6_4_6_20]


Eliminating All Source Read Ports

Starting from Distributed ROB Design 1 (with source read ports), the source read port is removed. Each ROBC keeps:

– Writeback: 1 write port to write results

– Commit: 1 read port


Where are the Source Values Coming From?

[Figure: the P6-style datapath (F1, F2, D1, D2, IQ, FU1, FU2 ... FUm, EX, ROB, ARF) with three sources of operand values marked 1, 2 and 3.]

Where are the Source Values Coming From?

[Figure: bar chart of source operand origin, 96-entry ROB, 4-way processor, SPEC2K benchmarks.]

Forwarding   62%
ARF          32%
ROB          6%

How Efficiently are the Ports Used?

[Figure: the ROB and its ports, annotated with 6%.]

Our Solution: Elimination of Read Ports

[Figure sequence: the P6-style datapath with operand sources marked 1, 2 and 3. In the final step only sources 1 and 3 remain.]

Distributed Reorder Buffer Scheme

[Figure: the datapath with ROBC 1, ROBC 2 ... ROBC m alongside the ROB.]

Elimination of Source Read Ports

[Figure sequence: the same datapath, drawn with the source read paths from the ROBCs and then without them.]

Completely Eliminating the Source Read Ports on the ROBCs

– The Problem: Issue of instructions that require a value stored in a ROBC will stall

– Solution: Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING
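
Late forwarding can be pictured as a second broadcast of the result at commit time. The snippet below is a minimal sketch of that idea, not the authors' hardware design: the class and function names are hypothetical, tags are assumed to be (FU_id, offset) pairs, and the forwarding bus is modelled as a simple loop over waiting instructions.

```python
class WaitingInstruction:
    """An issue queue entry waiting for source operands identified by tags."""

    def __init__(self, name, source_tags):
        self.name = name
        self.pending = set(source_tags)  # tags still awaited
        self.operands = {}               # tag -> captured value

    @property
    def ready(self):
        return not self.pending


def drive_forwarding_bus(issue_queue, tag, value):
    """Broadcast (tag, value); return the names of instructions that become ready."""
    woken = []
    for instr in issue_queue:
        if tag in instr.pending:
            instr.pending.remove(tag)
            instr.operands[tag] = value
            if instr.ready:
                woken.append(instr.name)
    return woken


def commit(robc_entry, arf, issue_queue):
    """Commit one ROBC entry: update the ARF and forward the value late."""
    tag, dest_reg, value = robc_entry
    arf[dest_reg] = value
    return drive_forwarding_bus(issue_queue, tag, value)


arf = {}
consumer = WaitingInstruction("SUB", source_tags=[(1, 1)])
woken = commit(((1, 1), "r3", 42), arf, [consumer])
print(woken, arf, consumer.operands)
```

A consumer that missed the original forwarding of tag (1, 1) receives the value only when the producer commits, which is the stall the slide describes.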

Late Forwarding: Use the Normal Forwarding Buses!

[Figure sequence: the datapath with ROBC 1, ROBC 2 ... ROBC m; the final step marks the Late Forwarding path.]

Performance Drop of Simplified ROBC Design

[Figure: performance drop (%) per benchmark with no ROBC source read ports, with Late Forwarding. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Two bars are annotated 37% and 17%.]

Average IPC Drop: 9.6%

IPC Penalty: Source Value Not Accessible within the ROBC

Lifetime of a Result Value (time runs left to right):

Result Generation -> Forwarding -> value within a ROBC -> Late Forwarding / Commitment -> value within ARF

Improving IPC with No Read Ports

– Cache recently generated values in a set of RETENTION LATCHES (RL)

– Retention Latches are SMALL and FAST
    Only 8 to 16 latches needed in the set
    Entire set has 1 or 2 read ports
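
A small FIFO-managed set of latches can be sketched as follows. This is an illustration only: the class name is hypothetical, lookup by result tag is an assumption about how the latches are addressed, and port limits are not modelled. The default size of 8 follows the "eight FIFO RLs" design point used later in the talk.

```python
from collections import OrderedDict


class RetentionLatches:
    """A small FIFO cache of recently generated result values."""

    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()  # tag -> value, oldest entry first

    def insert(self, tag, value):
        """Record a newly generated result, evicting the oldest one if full."""
        if tag not in self.latches and len(self.latches) >= self.size:
            self.latches.popitem(last=False)  # FIFO replacement
        self.latches[tag] = value

    def lookup(self, tag):
        """Return the cached value, or None if it has already been evicted."""
        return self.latches.get(tag)


rl = RetentionLatches(size=8)
for i in range(1, 10):          # nine results go into eight latches
    rl.insert(i, i * 10)        # tags 1..9 are hypothetical result tags

print(rl.lookup(1), rl.lookup(9), len(rl.latches))
```

A value that has been evicted from the latches and has not yet committed is the case that still depends on late forwarding.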

Adding Retention Latches into the Picture

[Figure sequence: the datapath with ROBC 1, ROBC 2 ... ROBC m and the Late Forwarding path; the final step adds a set of RETENTION LATCHES.]

Distributed ROB Design 2: with Retention Latches

Port requirements of each ROBC:

– Writeback: 1 write port to write results

– Commit: 1 read port

Eight 2-ported FIFO RLs

[Figure sequence: IPC per benchmark (gap, gcc, gzip, parser, perl, twolf, vortex, vpr), plotted first against Base with 2-cycle ROB access and full bypass, then against Base with 1-cycle ROB access and full bypass. Series:
    Design 1: 2 read ports, 12_6_4_6_20
    Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port]


Power Results for 12_6_4_6_20 Configuration

[Figure: power savings (%) per benchmark. Series:
    Eight 2-ported FIFO latches
    Design 1: 2 read ports, 12_6_4_6_20
    Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port]

Power savings %: 49%, 47%, 23%

Power Results for 12_6_4_6_20 Configuration (compared to the baseline case with 64-entry Rename Buffers)

Power savings %: 39%, 37%, 20%

Summary of Results

– Low performance degradation:
    1.7% IPC drop on the average (compared to 2-cycle ROB)
    3.8% IPC drop on the average (compared to 1-cycle ROB)

– ROB power savings:
    as high as 49% (compared to P6-style datapath: 96-entry ROB)
    as high as 39% (compared to Rename Buffer design: 96-entry ROB, 64-entry RB)

Conclusions

– We introduced a conflict-free distributed Reorder Buffer design

– ROB power savings of as high as 49% are realized with only a small (1.7%) performance penalty

– ROB complexity is drastically reduced by
    Distributing the ROB into multiple banks
    Reducing the port requirements to no more than 2 ports for each ROB component


~ Thank You~


Related Work

– Replicated (Kessler, IEEE Micro) and distributed (Canal et al., HPCA’00 and Farkas et al., MICRO’97) RFs in a clustered organization

– Multiple Register Banks (Cruz et al., ISCA’00 and Balasubramonian et al., MICRO’01)

– Multiple Register Banks with an additional pipeline stage to avoid complex arbitration logic (Tseng et al., ISCA’03)

– Multiple Register Banks without write port conflicts (Wallace et al., PACT’96)

ROB

– Writeback: W write ports to write results

– Decode/Dispatch: 1 W-wide write port to set up entries

– Commit: 1 W-wide read port


Fully Distributed Reorder Buffer Scheme

– Distributed ROB Components (ROBCs) are assigned to each Function Unit
    No write port conflicts at the writeback stage, and minimal read port conflicts at commitment: negligible performance penalty
    Each ROBC can be tailored to the needs of its FU: no overcommitment of resources, less complexity

– The FIFO structure that maintains pointers to the ROBCs remains centralized


Results for the Scheme with Retention Latches

[Figure: power savings (%) per benchmark for a centralized ROB with eight 2-ported FIFO Retention Latches.]

Power savings %: 23%
