1 Near-Optimal Oblivious Routing for 3D-Mesh Networks ICCD 2008 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering Department University of California, San Diego
Dec 22, 2015
2
Motivation: Networks-on-Chip
• Chip-multiprocessors (CMPs) increasingly popular
• 2D-mesh networks often used as on-chip fabric
[Figure: die photos of the Tilera Tile64 and the Intel 80-core chip. Labels: I/O areas; single tile, 1.5mm x 2.0mm; die, 21.72mm x 12.64mm]
Technology: 65nm, 1 poly, 8 metal (Cu)
Transistors: 100 Million (full-chip), 1.2 Million (tile)
Die Area: 275mm2 (full-chip), 3mm2 (tile)
C4 bumps #: 8390
3
Motivation: 3D Integrated Circuits
• 3D Benefits
– Reduced wire delays
– Enormous bandwidth
– Heterogeneous system integration
• Natural progression
– 3D-mesh for 3D CMPs
[Figure: 2D to 3D]
4
Routing Algorithm Objectives
• Maximize throughput
– How much load the network can handle
• Minimize hop count
– Minimize routing delay between source and destination
5
Challenges
• For the 2D case, a near-optimal throughput routing algorithm with minimal hop count, called O1TURN, is known [Seo’05].
• Surprisingly, the optimality of O1TURN does not extend to the 3D case: its actual throughput performance degrades severely.
• The only known optimal-throughput routing algorithm is Valiant (VAL) load-balancing, but VAL performs poorly on hop count (latency), at twice that of minimal routing.
6
Main Contribution
• Developed a new oblivious routing algorithm called “Randomized Partially Minimal” (RPM) routing.
• RPM provably guarantees near-optimal worst-case throughput in the 3D case.
– Optimal for even radix k (e.g. 8 x 8 x 8 mesh).
– Within a factor of 1/k^2 of optimal for odd radix (e.g. 7 x 7 x 7 mesh).
• Good latency performance.
– Hop count only a factor of 1.33 of minimal routing (much better than the 2x cost of VAL, the only known routing algorithm with optimal throughput).
– In practice, 3D-meshes are asymmetric because the number of device layers is less than the number of tiles per edge.
– e.g., for a 16 x 16 x 4 mesh (4 layers), RPM’s hop count is just a factor of 1.1 of minimal routing.
7
Outline
• Motivation for our work
Existing 2D routing algorithms don’t extend well into 3D
• RPM routing algorithm
• Simulation results
• Extensions and future work
8
Existing Routing Algorithms
The 2D case
• Dimension-Ordered Routing (DOR)
– Route minimal XY
• Valiant load-balancing (VAL)
– Route source → randomly chosen intermediate node → destination
– Route minimal XY in both phases
• ROMM
– Same as VAL, but intermediate node restricted to minimal direction
• Orthogonal 1-TURN (O1TURN)
– Route minimal XY and YX with equal probability
Extending to the 3D case …
• Dimension-Ordered Routing (DOR)
– Route minimal XYZ
• Valiant load-balancing (VAL)
– Route source → randomly chosen intermediate node → destination
– Route minimal XYZ in both phases
• ROMM
– Same as VAL, but intermediate node restricted to minimal direction
• Orthogonal 1-TURN (O1TURN)
– Route along one of 6 minimal orthogonal paths (XYZ, XZY, YXZ, YZX, ZXY, ZYX) with equal probability
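The four 3D schemes listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names (`dor_path`, `val_path`, `romm_path`, `o1turn_3d_path`) are hypothetical, nodes are assumed to be `(x, y, z)` tuples in a k x k x k mesh, and ROMM's "minimal direction" restriction is assumed to mean an intermediate node inside the bounding box spanned by source and destination.

```python
import random
from itertools import permutations

def dor_path(src, dst, order=(0, 1, 2)):
    """Minimal dimension-ordered route: finish one dimension before the next."""
    path = [tuple(src)]
    cur = list(src)
    for dim in order:
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path

def val_path(src, dst, k, rng=random):
    """VAL: minimal XYZ to a uniformly random intermediate node, then on to dst."""
    mid = tuple(rng.randrange(k) for _ in range(3))
    return dor_path(src, mid) + dor_path(mid, dst)[1:]

def romm_path(src, dst, rng=random):
    """ROMM (assumed form): intermediate node drawn from the src/dst bounding box."""
    mid = tuple(rng.randint(min(s, d), max(s, d)) for s, d in zip(src, dst))
    return dor_path(src, mid) + dor_path(mid, dst)[1:]

def o1turn_3d_path(src, dst, rng=random):
    """3D O1TURN: one of the 6 dimension orders, chosen with equal probability."""
    order = rng.choice(list(permutations(range(3))))
    return dor_path(src, dst, order)
```

DOR, ROMM, and O1TURN always return a minimal-length path here, whereas `val_path` can detour through any node in the mesh, which is where its extra hop count comes from.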
9
Worst-Case Throughput
• Best theoretical normalized worst-case throughput known to be 50% (well-known result).
• Worst-case throughput analysis can be reduced to a maximal weighted matching problem [Towles’02].
• VAL achieves this optimal throughput, but has poor latency.
• As shown next, DOR, ROMM, and O1TURN are all far from optimal in 3D.
11
How do 2D mesh algorithms fare in 3D?
• Worst-case throughput of DOR, ROMM, O1TURN far from optimal
• Average hop count of VAL far from minimal
• Need a routing algorithm that can trade latency for worst-case throughput

8 x 8 x 8 Network
                                      VAL    DOR    ROMM   O1TURN
Hop Count (normalized to minimal)     2      1      1      1
Normalized Worst-Case Throughput      0.5    0.063  0.132  0.15
Normalized Average-Case Throughput    0.5    0.316  0.454  0.513
12
Why does O1TURN perform poorly in 3D?
• O1TURN: worst-case throughput is optimal for 2D but more than 3 times worse than optimal for 3D
• The difference
– 2D traffic matrix is “admissible” for 2D mesh
– In 3D, projected traffic on each 2D plane is no longer admissible!
• Can we transform the 3D routing problem into routing admissible traffic on each 2D plane?
13
Outline
• Motivation for our work
• Existing 2D algorithms don’t extend well into 3D
RPM routing algorithm
• Simulation results
• Extensions and future work
14
Randomized Partially-Minimal Routing (RPM)
[Figure: RPM route from source to destination in a 3D mesh with X, Y, and Z axes]
• Phase-1 Z: source to random intermediate layer
• XY or YX routing on the intermediate layer
• Phase-2 Z: intermediate layer to destination
15
Main Idea
• Load-balance uniformly across the vertical layers
• Min XY/YX used on each layer
• Main Result: RPM has near-optimal worst-case throughput
– Achieves optimal worst-case throughput when network radix k is even
– Within a factor of 1/k^2 of optimal when k is odd.
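As a rough illustration of the idea, the route selection can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `rpm_path` and its parameters are hypothetical, the intermediate layer is assumed to be drawn uniformly from all `layers` Z-levels, and XY versus YX is assumed to be chosen with equal probability.

```python
import random

def rpm_path(src, dst, layers, rng=random):
    """Sketch of an RPM route: Z to a random layer, XY or YX there, Z to dst."""
    mid_z = rng.randrange(layers)      # assumed: uniform over all vertical layers
    xy_first = rng.random() < 0.5      # assumed: XY and YX equally likely
    path = [tuple(src)]

    def walk(dim, target):
        # Move one hop at a time along `dim` until coordinate `target` is reached.
        cur = list(path[-1])
        step = 1 if target > cur[dim] else -1
        while cur[dim] != target:
            cur[dim] += step
            path.append(tuple(cur))

    walk(2, mid_z)                     # Phase-1 Z: source to intermediate layer
    for dim in ((0, 1) if xy_first else (1, 0)):
        walk(dim, dst[dim])            # minimal XY or YX on the intermediate layer
    walk(2, dst[2])                    # Phase-2 Z: intermediate layer to destination
    return path
```

Only the Z portion of the route can be non-minimal, which is why the scheme is called partially minimal: the X and Y hops are always the minimal number.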
17
Average-Case Throughput
• RPM outperforms VAL, DOR, ROMM and O1TURN in average-throughput on randomly generated traffic.
[Figure: bar chart of average-case throughput (axis from 0 to 0.8) for VAL, DOR, ROMM, O1TURN, and RPM on 4x4x4, 8x8x8, and 16x16x4 meshes]
18
Average Hop Count
• Normalized hop count of RPM
– Symmetric meshes: 1.33 times minimal, compared to 2x for VAL
– Asymmetric 16x16x4 mesh: 1.1 times minimal
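The 1.33 and 1.1 factors can be sanity-checked with a short calculation. The sketch below assumes uniform random traffic (source and destination drawn independently and uniformly) and an intermediate layer drawn uniformly from all layers, so that the expected Z distance is simply doubled; both are assumptions of this example, and the function names are hypothetical.

```python
from fractions import Fraction

def mean_abs_diff(k):
    """Mean |a - b| for a, b independent and uniform on {0, ..., k-1}."""
    total = sum(abs(a - b) for a in range(k) for b in range(k))
    return Fraction(total, k * k)

def normalized_rpm_hops(kx, ky, kz):
    """Expected RPM hop count divided by expected minimal hop count."""
    minimal = mean_abs_diff(kx) + mean_abs_diff(ky) + mean_abs_diff(kz)
    # RPM travels Z twice: source layer -> random layer -> destination layer.
    rpm = mean_abs_diff(kx) + mean_abs_diff(ky) + 2 * mean_abs_diff(kz)
    return rpm / minimal

# normalized_rpm_hops(8, 8, 8)   -> Fraction(4, 3), about 1.33
# normalized_rpm_hops(16, 16, 4) -> Fraction(21, 19), about 1.105
```

Under these assumptions a symmetric mesh always gives 4/3, since one of three equal dimension terms is doubled, and the ratio shrinks as the number of layers becomes small relative to the tiles per edge.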
19
Outline
• Motivation for our work
• Existing 2D routing algorithms don’t extend well into 3D
• RPM routing algorithm
Simulation results
• Extensions and future work
20
Flit-Level Simulation
• Ideal throughput evaluation assumes
– Ideal single-cycle router
– Infinite buffers
– No contention in switches, no flow control
• Flit-level simulation
– PopNet network simulator
– 4-stage router pipeline: route computation, VC allocation, switch arbitration, link traversal
– Credit-based flow control
– 8 virtual channels, each 5 flits deep
– Multi-flit packets injected into the network (5 flits/packet)
21
Flit-Level Simulation (cont’d)
• Network configurations simulated
– 4 x 4 x 4 Mesh
– 8 x 8 x 8 Mesh
– 16 x 16 x 4 Mesh
• Routing algorithms compared: DOR, VAL, ROMM, O1TURN, DUATO, RPM
– DUATO is a minimal adaptive routing algorithm implemented for comparison
• Four different traffic traces used
– Transpose traffic: (x,y,z) → (y,z,x)
– Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1)
– Uniform traffic
– Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1)
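The three deterministic traffic patterns above map directly to small functions. The sketch below assumes a symmetric k x k x k mesh with coordinates in 0..k-1 (how the patterns are adapted to the 16 x 16 x 4 mesh is not stated on the slide and is not covered here); the function names and the `is_permutation` helper are hypothetical.

```python
from itertools import product

def transpose(node, k):
    """Transpose traffic: (x, y, z) -> (y, z, x). k is unused, kept for a uniform signature."""
    x, y, z = node
    return (y, z, x)

def complement(node, k):
    """Complement traffic: (x, y, z) -> (k-x-1, k-y-1, k-z-1)."""
    x, y, z = node
    return (k - x - 1, k - y - 1, k - z - 1)

def dor_worst_case(node, k):
    """DOR-WC traffic: (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

def is_permutation(pattern, k):
    """Check that every node receives traffic from exactly one source."""
    nodes = list(product(range(k), repeat=3))   # already in sorted order
    return sorted(pattern(n, k) for n in nodes) == nodes
```

All three are permutations of the node set, so each node sends to and receives from exactly one other node (or itself), which can be confirmed with `is_permutation`.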
26
To sum it up …
• 3D IC technology is emerging.
• Stacking cores in 3 dimensions offers several advantages over 2D placement of cores.
• 2D minimal mesh routing algorithms have poor worst-case throughput in 3D, while VAL has a high latency penalty.
• RPM trades off latency (partially-minimal) for better worst case performance (near-optimal).