ICCD’03
Distributed Reorder Buffer Schemes for Low Power *
*supported in part by DARPA through the PAC-C program and NSF
Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
21st International Conference on Computer Design (ICCD’03), October 14th 2003
Outline
– Reorder Buffer (ROB) complexities
– Motivation for the low-complexity ROB
– Low-complexity ROB designs
    Fully Distributed ROB
    Retention Latches (RLs) revisited (ICS’02)
    Combined Scheme
– Results
– Concluding remarks
P6-style Superscalar Datapath
[Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to the Function Units FU1 to FUm (EX), the ROB, the Architectural Register File (ARF), and the result/status forwarding buses]
PPC 620-style Superscalar Datapath
[Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to the Function Units FU1 to FUm (EX), the ROB together with a separate RB (rename buffers), the Architectural Register File (ARF), and the result/status forwarding buses]
ROB Port Requirements for a W-way CPU
– Decode/Dispatch: W write ports to set up entries
– Dispatch/Issue: 2W read ports to read the source operands
– Writeback: W write ports to write results
– Commit: W read ports for instruction commitment
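The port requirements listed on this slide can be tallied with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name `rob_port_counts` and the dictionary keys are made up for illustration and do not come from the talk.

```python
def rob_port_counts(width: int) -> dict:
    """Port counts for a centralized ROB in a W-way machine, as listed on the slide."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    ports = {
        "dispatch_write": width,     # W write ports to set up entries
        "source_read": 2 * width,    # 2W read ports to read the source operands
        "writeback_write": width,    # W write ports to write results
        "commit_read": width,        # W read ports for instruction commitment
    }
    # The total is computed before the "total" key is added to the dictionary.
    ports["total"] = sum(ports.values())
    return ports


counts = rob_port_counts(4)
print(counts)
```

For the 4-way machine simulated later in the talk, this tally gives 8 source read ports out of 20 ports in total, which is why removing the source read ports is an attractive target.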
What This Work is All About
– ROB complexity reduction is important for reducing power and improving performance
    ROB dissipates a non-trivial fraction of the total chip power
    ROB accesses stretch over several cycles
– Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance
Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
– Area reduction: 71%
– Shorter bit and wordlines
P6-style Superscalar Datapath
[Figure: the P6-style datapath again, with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, Function Units FU1 to FUm (EX), a single ROB, the ARF, and the result/status forwarding buses]
Reorder Buffer Distribution
[Figure: the same datapath with the ROB storage split into ROB Components (ROBCs), ROBC 1 to ROBC m, placed alongside the function units FU1 to FUm. The ROB itself now holds pointers to entries within the ROBCs.]
Impact of Distributing the ROB
– Each ROBC is effectively a small Rename Buffer
    Smaller read/write access energy
    Faster access time
– Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs
    Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)
– Fits in naturally with a multi-clustered datapath design
Problems with the earlier Multi-banked RF Schemes (and some good news!)
– Port conflicts result in performance penalty
    Good news: totally avoid write port conflicts
    Good news: minimize read port conflicts at commitment
    Good news: totally avoid source read port conflicts
– Interconnection network is more complex
    Good news: completely remove source read ports
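ROBCs Assigned to Each Function Unit
[Figure: a centralized ROB with entries 1 to n, each holding an (FU_id, offset) pair, next to the distributed ROBCs #1 to #m. ROBC #1 belongs to FU #1, ROBC #2 to FU #2, and ROBC #m to FU #m; the FU_id selects the ROBC and the offset selects the entry within it.]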
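The (FU_id, offset) indirection in this figure can be sketched as a small data structure. This is an illustration under assumptions, not the authors' design: the class names `Robc` and `CentralEntry`, the helper functions, and the use of plain integers as results are all hypothetical, and the indices are 1-based only to match the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Robc:
    """One ROB component: a small buffer holding results for one function unit."""
    size: int
    slots: List[Optional[int]] = field(init=False)

    def __post_init__(self) -> None:
        self.slots = [None] * self.size


@dataclass
class CentralEntry:
    """Centralized ROB entry: a pointer into a ROBC rather than the result itself."""
    fu_id: int   # which ROBC (1-based, as on the slide)
    offset: int  # which entry inside that ROBC (1-based)


def write_result(robcs: List[Robc], entry: CentralEntry, value: int) -> None:
    """Writeback: the function unit writes into its own ROBC."""
    robcs[entry.fu_id - 1].slots[entry.offset - 1] = value


def read_result(robcs: List[Robc], entry: CentralEntry) -> Optional[int]:
    """Follow the pointer from the centralized ROB to the stored result."""
    return robcs[entry.fu_id - 1].slots[entry.offset - 1]


robcs = [Robc(size=2) for _ in range(3)]
central_rob = [CentralEntry(1, 1), CentralEntry(2, 1), CentralEntry(3, 1)]
write_result(robcs, central_rob[1], 42)
print(read_result(robcs, central_rob[1]))
```

Because each function unit only ever writes into its own ROBC, a single write port per ROBC is enough in this sketch.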
Good News: Write port conflicts are avoided
[Figure: the centralized ROB (entries 1 to n, each an (FU_id, offset) pair) and the distributed ROBCs #1 to #m. Each function unit FU #1 to FU #m writes its results into its own ROBC through 1 write port.]
Round Robin Scheduling at Dispatch Time
[Figure: animation over several slides. A centralized ROB with entries 1 to n, each holding an (FU_id, offset) pair, and four distributed ROBCs for the integer adders (Int ADD ROBC #1 to #4), each shown with entries 1 and 2. The frames show three instructions being dispatched in turn:]
– ADD arrives: an entry is reserved in ROBC #1 and the centralized ROB records (FU_id 1, offset 1)
– SUB arrives: an entry is reserved in ROBC #2 and the centralized ROB records (FU_id 2, offset 1)
– AND arrives: an entry is reserved in ROBC #3 and the centralized ROB records (FU_id 3, offset 1)
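The dispatch-time assignment shown in these frames can be sketched as follows. The class `RoundRobinAllocator` is hypothetical, and two details are assumptions that the slides do not spell out: a full ROBC is skipped in favor of the next one in round-robin order, and dispatch stalls (returns `None`) when every ROBC of the required type is full.

```python
from typing import Dict, List, Optional, Tuple


class RoundRobinAllocator:
    """Hands out ROBC entries for one class of function units in round-robin order."""

    def __init__(self, robc_ids: List[int], entries_per_robc: int) -> None:
        self.robc_ids = robc_ids
        # Free offsets per ROBC, 1-based to match the slides.
        self.free: Dict[int, List[int]] = {
            i: list(range(1, entries_per_robc + 1)) for i in robc_ids
        }
        self.next_index = 0

    def dispatch(self) -> Optional[Tuple[int, int]]:
        """Reserve an entry and return (FU_id, offset), or None if all ROBCs are full."""
        n = len(self.robc_ids)
        for step in range(n):
            idx = (self.next_index + step) % n
            fu_id = self.robc_ids[idx]
            if self.free[fu_id]:
                offset = self.free[fu_id].pop(0)
                self.next_index = (idx + 1) % n  # the next dispatch starts at the next ROBC
                return (fu_id, offset)
        return None  # assumption: dispatch stalls when no entry is free


alloc = RoundRobinAllocator(robc_ids=[1, 2, 3, 4], entries_per_robc=2)
central_rob = [(op, alloc.dispatch()) for op in ["ADD", "SUB", "AND"]]
print(central_rob)
```

With four integer-add ROBCs of two entries each, the three dispatches land in ROBCs #1, #2 and #3 at offset 1, matching the (FU_id, offset) pairs in the animation.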
Good News: Avoiding Read Port Conflicts
[Figure: the centralized ROB holds ADD (1, 1), SUB (2, 1) and AND (3, 1), so the three instructions occupy entries in three different ROBCs. Each ROBC has 1 read port to commitment.]
Round Robin Scheduling at Dispatch Time (continued)
[Figure: animation over several slides. The centralized ROB already holds ADD (1, 1), SUB (2, 1) and AND (3, 1). The only ROBC for integer multiply/divide is Int MUL/DIV ROBC #5, shown with entries 1 and 2. The frames show two more instructions being dispatched:]
– MUL arrives: an entry is reserved in ROBC #5 and the centralized ROB records (FU_id 5, offset 1)
– DIV arrives: a second entry is reserved in ROBC #5 and the centralized ROB records (FU_id 5, offset 2)
Read Port Conflicts at Commitment
[Figure: the centralized ROB holds ADD (1, 1), SUB (2, 1), AND (3, 1), MUL (5, 1) and DIV (5, 2). MUL and DIV both sit in Int MUL/DIV ROBC #5, which has 1 read port to commitment.]
CONFLICT: if MUL and DIV want to commit in the same cycle
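The commit-time conflict can be sketched with a small in-order commit loop. This is a minimal model under stated assumptions: every entry at the head of the centralized ROB has already completed, each ROBC has one read port for commitment, and a conflict simply ends commitment for that cycle. The function `commit_cycle` and the tuple layout are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Entry = Tuple[str, int, int]  # (opcode, FU_id, offset)


def commit_cycle(central_rob: Deque[Entry], commit_width: int,
                 read_ports_per_robc: int = 1) -> List[str]:
    """Commit in program order until the width is reached or a ROBC runs out of read ports."""
    ports_used: Dict[int, int] = {}
    committed: List[str] = []
    while central_rob and len(committed) < commit_width:
        op, fu_id, _offset = central_rob[0]
        if ports_used.get(fu_id, 0) >= read_ports_per_robc:
            break  # read port conflict: younger instructions cannot bypass the head
        ports_used[fu_id] = ports_used.get(fu_id, 0) + 1
        central_rob.popleft()
        committed.append(op)
    return committed


rob: Deque[Entry] = deque([("MUL", 5, 1), ("DIV", 5, 2), ("ADD", 1, 1)])
first = commit_cycle(rob, commit_width=4)
second = commit_cycle(rob, commit_width=4)
print(first, second)
```

In this model MUL commits alone in the first cycle because DIV needs the same read port of ROBC #5; DIV and ADD follow in the next cycle. Instructions spread across different ROBCs, as round-robin dispatch tends to produce, commit together without a conflict.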
Distributed ROB Design 1
Ports of each ROBC:
– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment
Distributed ROB Design 1: with source read ports
Ports of each ROBC:
– Writeback: 1 write port to write results
– Dispatch/Issue: 1 read port to read the source operands
– Commit: 1 read port for instruction commitment
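Experimental Setup: the AccuPower (DATE’02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed the microarchitectural simulator (rooted in SimpleScalar), which produces performance stats as well as transition counts and context information. VLSI layout data is turned into a SPICE deck, and SPICE yields measures of energy per transition. The energy/power estimator takes the transition counts and the SPICE energy measures and produces power/energy stats.]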
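The last step of this flow, combining transition counts with per-transition energies, can be sketched as a weighted sum. This is a simplification for illustration and not AccuPower's actual estimator: the function `estimate_energy`, the event names, and every number below are invented placeholders, and the context information mentioned on the slide is not modeled.

```python
from typing import Dict


def estimate_energy(transition_counts: Dict[str, int],
                    energy_per_transition: Dict[str, float]) -> float:
    """Total energy = sum over event types of (count x energy per transition)."""
    missing = set(transition_counts) - set(energy_per_transition)
    if missing:
        raise KeyError(f"no energy figure for events: {sorted(missing)}")
    return sum(count * energy_per_transition[event]
               for event, count in transition_counts.items())


# Placeholder values only; these are NOT measurements from the talk.
counts = {"robc_write": 1000, "robc_read": 800}   # would come from the simulator
energy_pj = {"robc_write": 2.0, "robc_read": 1.5}  # would come from SPICE
print(estimate_energy(counts, energy_pj))
```

Dividing the total by the simulated execution time would give an average power figure in the same spirit.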
Configuration of the Simulated System
Machine width       4-way
Issue Queue         32 entries
Reorder Buffer      96 entries
Load/Store Queue    32 entries
Simulated the execution of SPEC2000 benchmarks
Peak/Average demands on the number of ROBC entries