George Michelogiannakis, James Balfour, William J. Dally
Computer Systems Laboratory, Stanford University
Elastic-Buffer Flow-Control for On-Chip Networks
Edited by: Abhay Bhopat
Background
Buffer
Elastic Buffer
Elastic Buffer design
Introduction
Elastic-buffer (EB) flow-control uses the channels as distributed FIFOs
• Input buffers at routers are not needed
Can provide 12% more throughput per unit power
Reduces router cycle time by 18%
• Compared to virtual-channel (VC) routers
Outline
Building elastic-buffered channels
• By using what is already there
Router microarchitecture
Deadlock avoidance
Load-sensing for adaptive routing
Evaluation
The Idea
Use the network channels as distributed FIFOs
Use that storage instead of input buffers at routers
• To remove input buffer area and power costs
Pipelined channel
Channel as FIFO
Building an Elastic Buffer
To build an EB in a pipelined channel with master-slave flip-flops (FFs):
Use latches for storage by driving their enables independently
Master-slave FF
Elastic buffer
Expanded view of EB control logic
How Elastic Buffer Channels Work
Ready/valid handshake between elastic buffers
• Ready: At least one free storage slot
• Valid: Non-empty (driving valid data)
(Figure: channel operation shown over cycles 1 to 6)
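The handshake can be sketched as a small cycle-level model. This is an illustrative simulation, not the authors' RTL: the names `ElasticBuffer`, `step`, and `sink_ready` are hypothetical, and the two-slot capacity is an assumption based on using both latches of a master-slave FF as independent storage.

```python
from collections import deque


class ElasticBuffer:
    """Two-slot FIFO standing in for the two latches of one master-slave FF (assumed capacity)."""

    CAPACITY = 2

    def __init__(self):
        self.slots = deque()

    @property
    def ready(self):
        # Ready: at least one free storage slot.
        return len(self.slots) < self.CAPACITY

    @property
    def valid(self):
        # Valid: non-empty, i.e. driving valid data.
        return len(self.slots) > 0


def step(chain, source, sink_ready):
    """Advance the channel by one cycle; return the flit delivered to the sink, or None."""
    # Sample every handshake signal first so all stages see start-of-cycle state.
    valid = [eb.valid for eb in chain]
    ready = [eb.ready for eb in chain]
    delivered = None
    # The last stage hands a flit to the sink only when the sink is ready.
    if valid[-1] and sink_ready:
        delivered = chain[-1].slots.popleft()
    # Walk upstream: stage i forwards to stage i + 1 when valid and ready are both high.
    for i in range(len(chain) - 2, -1, -1):
        if valid[i] and ready[i + 1]:
            chain[i + 1].slots.append(chain[i].slots.popleft())
    # The source injects into the first stage under the same rule.
    if source and ready[0]:
        chain[0].slots.append(source.popleft())
    return delivered
```

In this model a stalled sink makes the stages fill up from the downstream end, so a chain of three EBs holds six flits before the source is back-pressured, and flits still leave in order once the sink is ready again.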
Control Logic Area Overhead
Control logic is implemented as a four-state FSM with 10 gates and 2 FFs
• Cost is amortized over channel width
Example: control logic increases area of a 64-bit channel by 5%
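The amortization can be made concrete with a rough scaling rule. The sketch below assumes, beyond what the slide states, that the control block has a fixed size and that datapath area grows linearly with channel width; the function name `control_overhead` is hypothetical.

```python
def control_overhead(width_bits, ref_width_bits=64, ref_overhead=0.05):
    """Estimated fractional area increase from EB control logic at a given channel width.

    Anchored on the slide's data point (5% at 64 bits). Assumes a fixed-size
    control block and datapath area proportional to channel width.
    """
    return ref_overhead * ref_width_bits / width_bits
```

Under these assumptions, doubling the channel width to 128 bits would halve the relative overhead to about 2.5%, and halving it to 32 bits would double the overhead to about 10%.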
Outline
Building elastic-buffered channels
Router microarchitecture
• Use EB flow-control through the router
Deadlock avoidance
Load-sensing for adaptive routing
Evaluation
Use EB Flow-Control Through the Router
VC input-buffered router
EB router
Input buffer replaced by input EB
VC & SW allocators removed. Per-output arbiters instead.
Three-slot output EB to cover for arbitration done one cycle in advance.
Look-ahead (LA) routing also applicable to EB networks.
Topology
2D 4x4 FBFly
Separate routers for networks
Outline
Building elastic-buffered channels
Router microarchitecture
Deadlock avoidance
• How to provide isolation without VCs
Load-sensing for adaptive routing
Evaluation
Deadlock Avoidance: Duplicate Channels
No input buffers means no virtual channels
Three types of possible deadlocks:
1. Protocol deadlock
2. Cyclic flit dependency in network
Solution: Duplicate physical channels
Deadlock Avoidance: No Interleaving
3. Interleaving deadlock
• New head flits require destination registers
• Occupied destination registers depend on tail flits
• Tail flits cannot bypass the new head flit
Solution: Disallow packet interleaving
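One way to disallow interleaving is for each per-output arbiter to hold its grant from a packet's head flit until its tail flit. The sketch below is an assumption about how that could look, not the authors' design: the `OutputArbiter` class, the round-robin choice between packets, and the `is_tail` flit field are all hypothetical.

```python
class OutputArbiter:
    """Per-output arbiter that keeps its grant from head flit to tail flit."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.owner = None    # input currently holding this output, if any
        self.next_start = 0  # round-robin pointer used only between packets (assumed policy)

    def grant(self, requests):
        """requests[i] is None or a dict with an 'is_tail' flag (hypothetical flit format).

        Returns the index of the input granted this cycle, or None.
        """
        if self.owner is None:
            # Between packets: pick the next requesting input in round-robin order.
            for offset in range(self.num_inputs):
                i = (self.next_start + offset) % self.num_inputs
                if requests[i] is not None:
                    self.owner = i
                    self.next_start = (i + 1) % self.num_inputs
                    break
            else:
                return None  # nobody is requesting this output
        flit = requests[self.owner]
        if flit is None:
            # The owner has a bubble: the output idles rather than interleave another packet.
            return None
        winner = self.owner
        if flit["is_tail"]:
            self.owner = None  # the tail flit releases the output
        return winner
```

The cost of this policy is visible in the bubble case: the output stays idle while its owner has nothing to send, even if another input is requesting it.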
Duplicating Channels Between Routers
Duplicate channels with neckdown
• Small improvement (still one switch port), large cost
Duplicate channels with duplicate switch ports
• Excessive cost (switch quadratic cost)
Dividing Into Sub-Networks More Efficient
Divide into sub-networks
• Double bandwidth, double the cost
• However, more beneficial once the datapath is narrowed to normalize for throughput or power
• Again, due to switch quadratic cost
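The quadratic-cost argument can be checked with a first-order model. The sketch below assumes crossbar area grows with the square of (ports × width), a common approximation that the slides do not spell out; the function names and the default of 5 ports and 64 bits are illustrative only.

```python
def crossbar_cost(ports, width_bits):
    """First-order crossbar area model (assumed): grows with (ports * width) squared."""
    return (ports * width_bits) ** 2


def relative_costs(ports=5, width_bits=64):
    """Switch cost of each option relative to one network with `ports` ports."""
    base = crossbar_cost(ports, width_bits)
    return {
        "single network": 1.0,
        # Doubling the switch ports doubles one side of the product.
        "duplicate switch ports": crossbar_cost(2 * ports, width_bits) / base,
        # Two full-width sub-networks: two separate switches.
        "two sub-networks, full width": 2 * crossbar_cost(ports, width_bits) / base,
        # Two sub-networks with the datapath narrowed to half width.
        "two sub-networks, half width": 2 * crossbar_cost(ports, width_bits // 2) / base,
    }
```

Under this model duplicate switch ports cost 4x the baseline switch, two full-width sub-networks cost 2x, and two half-width sub-networks cost 0.5x, which is consistent with the slide's point that narrowing the datapath favors sub-networks.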
Outline
Building elastic-buffered channels
Router microarchitecture
Deadlock avoidance
Load-sensing for adaptive routing
• Propose a load metric for EB networks
Evaluation
Congestion metrics
Blocked Cycles
Blocked Ratio
Output Occupancy
Channel Occupancy
Channel Delay
Output Channel Occupancy Load Metric
Flit-buffered networks use credit count
EB networks measure output channel occupancy
• At a certain segment of the output channel (shown in red)
• Occupancy decremented when flits leave that segment
• Incremented by a packet’s length when routing decision is made. Packets see other decisions in same cycle
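The counter updates described above can be sketched as follows. This is a minimal model under stated assumptions: the class and method names are hypothetical, and choosing the least-occupied candidate output is one plausible way an adaptive router could use the metric, not a policy confirmed by the slides.

```python
class OutputOccupancy:
    """Occupancy counters for the monitored segment of each output channel."""

    def __init__(self, num_outputs):
        self.count = [0] * num_outputs

    def route(self, candidate_outputs, packet_length_flits):
        """Pick the least-occupied candidate (assumed policy) and charge it the whole packet."""
        best = min(candidate_outputs, key=lambda o: self.count[o])
        # Incremented at decision time, so later routing decisions in the
        # same cycle already see this packet.
        self.count[best] += packet_length_flits
        return best

    def flit_left_segment(self, output):
        """Call once for each flit that leaves the monitored segment of `output`."""
        self.count[output] -= 1
```

Charging the full packet length up front means two packets routed in the same cycle do not both pick the same lightly loaded output.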
Outline
Building elastic-buffered channels
Router microarchitecture
Deadlock avoidance
Load-sensing for adaptive routing
Evaluation
• Compare throughput, power, area, latency, cycle time
Evaluation Methodology
Used a modified version
Area/power estimations from a 65nm library
• Input buffers modeled as SRAM cells
• Throughput/power optimal # of VCs and buffer depth
• Two sub-networks: request and reply
Averaged over a set of 6 traffic patterns
Constant packet size (512 bits)
Swept channel width from 28 to 192 bits
Throughput-Power Gains in 2D Mesh
EB network improvement:
Same power: 10% increased throughput
Same throughput: 12% reduced power
Throughput gain
Throughput-Area Gains in 2D Mesh
2% improvement for EB networks
Latency-Throughput in 2D Mesh
Zero-load latency equal
Power Breakdown: No Input Buffer Power
(Figure: Mesh low-swing power breakdown in W, 2% packet injection rate, VC-Buff vs. EBN. Components: output clock, output FF, crossbar control, crossbar power, input buffer write, input buffer read, channel FF, channel clock, channel traversal.)
Area Breakdown: No Input Buffer Area
(Figure: Low-swing mesh area breakdown in mm2, VC-Buff vs. EBN. Components: channel, switch, input, output.)
Router RTL Implementation
No buffers, VCs, allocators, credits
• VC router had look-ahead routing
• Buffers: FF arrays. 2 VCs, 8 slots each

Aspect        VC router   EB router   Savings
Area (μm2)    63,515      14,730      77%
Clock (ns)    3.3         2.7         18%
Power (mW)    2.59        0.12        95%

45nm, LP-CMOS, worst-case. Mesh 5x5 routers. DOR. 64-bit datapath
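The savings figures follow from the VC and EB router values as a relative reduction. The short check below recomputes them; the helper name `savings_percent` is hypothetical, and the formula (VC minus EB, divided by VC) is an assumption about how the percentages were derived.

```python
def savings_percent(vc_value, eb_value):
    """Relative reduction of the EB router versus the VC router, in percent."""
    return 100.0 * (vc_value - eb_value) / vc_value


# (VC router, EB router) values taken from the RTL implementation results.
rows = {
    "Area (um^2)": (63515, 14730),
    "Clock (ns)": (3.3, 2.7),
    "Power (mW)": (2.59, 0.12),
}

# Rounded to whole percent for comparison with the reported savings.
savings = {name: round(savings_percent(vc, eb)) for name, (vc, eb) in rows.items()}
```

With this formula the rounded results are 77% for area, 18% for clock period, and 95% for power, matching the reported savings.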
Conclusions
EB flow-control uses channels as distributed FIFOs
• Removes input buffers from routers
• Uses duplicate physical channels instead of VCs
Increases throughput per unit power up to 12% for low-swing
• Depends on what fraction of the overall cost input buffers constitute
Reduces router cycle time by 18%
Flow-control choice depends on design parameters and priorities
Questions?
Thanks for your attention