Design of Scalable Network Considering Diameter and Cable Delay
Kentaro Sano
Tohoku University, JAPAN
Design of Scalable Network, K Sano
Agenda
Introduction
Assumption
Preliminary evaluation & candidate networks
Cable length and delay
Simulator & emulator
Summary
Introduction
Feasibility study : 2012-2014
3 teams working for next-gen supercomputers
Tohoku-NEC-JAMSTEC team
Working group for interconnection network subsystem
[Figure labels: Osaka University; Tohoku University; W.G. for interconnection subsystem]
Background and Objective
More nodes with higher performance
Requiring high performance and scalable network
Application demands
Global/collective communication
Local communication (ex: p2p w/ 3D decomposition)
Usability, performance robustness
Scalability
Goal: find NW for next-gen supercomputers
Exploring design space with
application demands and technology constraints
Small-diameter NW using high-radix SWs,
which is also good at local p2p communication
Performance, cost, power, usability, reliability
Assumption for Design Space Exploration
System scale
~65536 SMP nodes
Technology
~64x64 full crossbar switch
10~ GB/s per link
[Figure: 64 x 64 switch (virtual cut-through with virtual channels): full crossbar with input queues 1 to 64 and output buffers]
[Figure: System overview: node 1 to node 65536 connected to a network of n planes (for SMP), each plane a fat tree or hybrid NW]
[Figure: IB technology roadmap]
Preliminary Evaluation
Typical topologies
Full fat tree
3D / 5D torus
Dragonfly
[Figure: typical topologies (full fat tree, n-D torus, dragonfly), drawn with nodes (N) and switches (SW)]
Comparison of Topologies
Too large diameter for low-D torus
Too many links for high-D torus / dragonfly
Fat tree looks good, but long cables?
Topology | Full fat-tree | 3D Torus | 5D Torus | Dragonfly
Nodes | 65,536 | 65,536 | 65,536 | 65,536
Organization | 3 stages, 64 x 32 x 32 | 64x32x32 | 16x8x8x8x8 | all-to-all (1D 16, 2D 16x16)
Node injection BW [GB/s] | 10 | 10 | 10 | 10
Bisection BW [TB/s] | 320 | 20 | 80 | 160
Min to max hops | 2 ~ 6 | 1 ~ 63 | 1 ~ 23 | 2 ~ 5
Min to max delay [ns] | 100 ~ 500 | 100 ~ 6300 | 100 ~ 2300 | 100 ~ 400
Links | 196,608 | 196,608 | 1,310,720 | 468,736
Switches | 5120 | within nodes | within nodes | 4096
Note: no cable delay considered.
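The bisection bandwidth figures for the fat tree and the two tori can be reproduced with a short calculation. The sketch below is not from the slides: it assumes 10 GB/s per link, a torus cut across its longest dimension (two links severed per ring because of the wraparound), a full fat tree whose bisection equals half the node count, and 1 TB = 1024 GB, which is the conversion that makes the results land on the listed values. The function names are illustrative only.

```python
import math

LINK_BW_GB_S = 10    # GB/s per link, from the design assumptions
GB_PER_TB = 1024     # assumption: conversion that matches the listed values
NODES = 65536

def torus_bisection_links(dims):
    """Links cut when a torus is halved across its longest dimension."""
    longest = max(dims)
    rings = math.prod(dims) // longest  # rings running along that dimension
    return 2 * rings                    # wraparound: 2 cut links per ring

def fat_tree_bisection_links(nodes):
    """Full fat tree: half of the nodes can send to the other half at full rate."""
    return nodes // 2

def bisection_tb_s(links):
    return links * LINK_BW_GB_S / GB_PER_TB

fat_tree_bw = bisection_tb_s(fat_tree_bisection_links(NODES))
torus_3d_bw = bisection_tb_s(torus_bisection_links((64, 32, 32)))
torus_5d_bw = bisection_tb_s(torus_bisection_links((16, 8, 8, 8, 8)))

print(fat_tree_bw, torus_3d_bw, torus_5d_bw)
```

The dragonfly entry depends on how its all-to-all groups are wired, which the slides do not spell out, so it is left out of the sketch.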
Full Fat Tree
Small diameter, but big latency via spine SWs
Max # of hops is limited especially with high-radix SWs.
Cable length grows with # of nodes.
[Figure: full fat tree with spine SWs. Labels: 32 nodes; 32 SWs; 32 SWs; 1024 nodes / island; 65536 nodes / 64 islands; 64 links, 10 GB/s/link; max 6 hops]
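The sizes of the full fat tree follow from the switch radix. The sketch below is a rough check rather than the authors' method: it assumes each 64-port switch in stages 1 and 2 uses 32 ports down and 32 ports up, that spine switches use all 64 ports downward, and that a full fat tree carries one link per node at each of the three levels. All variable names are illustrative.

```python
RADIX = 64              # ports per switch
DOWN_PORTS = RADIX // 2 # assumption: 32 down / 32 up in stages 1 and 2

NODES_PER_LEAF_SW = 32
LEAF_SWS_PER_ISLAND = 32
ISLANDS = 64

nodes_per_island = NODES_PER_LEAF_SW * LEAF_SWS_PER_ISLAND
total_nodes = nodes_per_island * ISLANDS

stage1_sws = total_nodes // DOWN_PORTS          # one leaf switch per 32 nodes
stage2_sws = stage1_sws                         # same up/down split, same count
stage3_sws = stage2_sws * DOWN_PORTS // RADIX   # spine: all 64 ports face down
total_sws = stage1_sws + stage2_sws + stage3_sws

# One link per node at each level: node-stage 1, stage 1-stage 2, stage 2-stage 3.
total_links = total_nodes * 3

print(nodes_per_island, total_nodes, total_sws, total_links)
```

Under these assumptions the counts agree with the 1024 nodes per island, 65536 nodes, 5120 switches, and 196,608 links quoted for the full fat tree.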
Another Candidate: FTT Hybrid
Hierarchical network
Local fat tree (group)
256 nodes
2-stage fat tree
Only short cables
in a small fat tree
Global 2D torus
16x16 of 256-node groups
Short cables to connect
adjacent groups
512 links between groups
Expected advantages
Shorter cables
Expandable & flexible
[Figure: FTT (Fat Tree & Torus) hybrid. Local fat tree: group (G) with labels 128, 128, 128, and 256 nodes. Global NW: 2D torus of 16x16 groups]
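The FTT-hybrid dimensions can be checked in a few lines. The sketch below is illustrative: the node count and torus hop range follow directly from the 256-node groups and the 16x16 torus, while the even split of the 512 inter-group links over the four torus directions is an assumption that is not stated on the slide.

```python
import math

GROUP_NODES = 256            # nodes in one local 2-stage fat tree
TORUS_DIMS = (16, 16)        # global 2D torus of groups
LINKS_BETWEEN_GROUPS = 512   # inter-group links, as listed on the slide

def torus_max_hops(dims):
    """Longest shortest path in a torus: half way around each ring."""
    return sum(k // 2 for k in dims)

groups = math.prod(TORUS_DIMS)
total_nodes = groups * GROUP_NODES
max_torus_hops = torus_max_hops(TORUS_DIMS)

# Assumption: the links are split evenly over +x, -x, +y, -y.
links_per_direction = LINKS_BETWEEN_GROUPS // (2 * len(TORUS_DIMS))

print(total_nodes, max_torus_hops, links_per_direction)
```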
Comparison Summary
Detailed & quantitative evaluation
Full fat tree and FTT hybrid
Consider more details about implementation & apps
Topology | Features | Diameter | # of Links | Note
Fat tree | General-purpose, high usability | ◎ | ○ | High cable delay?
Low-D torus | Good cost performance, extendability | × | ○ | -
High-D torus | Good cost performance, extendability | ○ | × | -
Dragonfly | Pseudo high-radix NW | ◎ | × | -
FTT-hybrid | Combination of fat tree and torus | ○ | ○ | Low cable delay?
Cable Length and Delay
Preliminary estimation
based on expected implementation
Boards (node, switch)
Cabinets (node, switch)
Floor layout
Cabling
[Figure: FTT-hybrid layout example, with cabinets C0, C1, C2, C3]
Preliminary Result
[Figure: Fat tree (max 6 hops), nodes A, B, D, E under stage 1, stage 2, and stage 3 (spine) switches. Cables: node to stage 1, 2 m, 10 ns; stage 1 to stage 2, 20 m, 100 ns; stage 2 to stage 3, 80 m, 400 ns. Labels: 0.05 %, 1.5 %, 98.4 %]
[Figure: FTT-hybrid (max 20 hops), nodes A, B, D, E under two switch stages. Cables: node to SW, 2 m, 10 ns; SW to SW, 10 m, 50 ns; between groups, 15 m, 75 ns, with 1~16 hops in 2D torus. Labels: 0.05 %, 0.33 %, 99.6 %]
No big difference in max cable delay
Fat tree = 1020 ns + (5 SW-delay)
Hybrid = 1395 ns + (19 SW-delay)
Hybrid can have shorter delay for local p2p communication.
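The cable figures on this slide pair every length with a delay at 5 ns per meter (2 m and 10 ns, 20 m and 100 ns, 80 m and 400 ns). The sketch below uses that implied rate to rebuild the fat-tree maximum; the assignment of the three cable lengths to the node-to-stage-1, stage-1-to-stage-2, and stage-2-to-spine links is an assumption read off the figure, and the names are illustrative.

```python
NS_PER_M = 5  # implied by the slide's length/delay pairs

def cable_delay_ns(length_m):
    return length_m * NS_PER_M

# Assumed cable lengths on the way up a 3-stage fat tree [m]:
# node -> stage 1, stage 1 -> stage 2, stage 2 -> stage 3 (spine).
FAT_TREE_UP_M = [2, 20, 80]

# The longest path climbs to the spine and descends again: 6 cables, 5 switches.
fat_tree_max_cable_ns = 2 * sum(cable_delay_ns(m) for m in FAT_TREE_UP_M)
fat_tree_max_switches = 2 * len(FAT_TREE_UP_M) - 1

print(fat_tree_max_cable_ns, fat_tree_max_switches)
```

This gives 1020 ns of cable delay plus 5 switch traversals for the worst-case fat-tree path, matching the slide's fat-tree figure.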
Example of 3D Mesh Communication
3D decomposition and
adjacent communication
[Figure: data exchange among 3D subgrids along x, y, z]
[Figure: FTT-hybrid, global NW as a 2D torus of 16x16 groups. Axis labels: z within a group (x & y can be assigned); x and y on the torus]
Latency (4 hops)
= 195ns + (4 SW-delay) : x, y
= 120ns + (3 SW-delay) : z
Much shorter than Fat tree
= 1020ns + (5 SW-delay) : x, y, z
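The latency figures above can be rebuilt by summing the cable delays along each path. The sketch below is one reading of the slides, not a confirmed model: it assumes an x or y neighbor sits in the adjacent group (up two switch stages, one torus cable, down two stages), a z neighbor sits in the same group under a different stage-1 switch, and the fat-tree case crosses the spine. Per-cable delays are the ones from the preliminary result; the path lists are illustrative.

```python
# Per-cable delays in ns, taken from the preliminary cable estimate.
HYBRID_NODE_SW = 10     # node <-> stage-1 switch (2 m)
HYBRID_SW_SW = 50       # stage 1 <-> stage 2 inside a group (10 m)
HYBRID_GROUP_LINK = 75  # torus cable between adjacent groups (15 m)

# Assumed paths, listed cable by cable.
xy_path = [HYBRID_NODE_SW, HYBRID_SW_SW, HYBRID_GROUP_LINK,
           HYBRID_SW_SW, HYBRID_NODE_SW]
z_path = [HYBRID_NODE_SW, HYBRID_SW_SW, HYBRID_SW_SW, HYBRID_NODE_SW]
fat_tree_path = [10, 100, 400, 400, 100, 10]  # up to the spine and back

def path_cost(cables_ns):
    """Return (total cable delay in ns, number of switches traversed)."""
    return sum(cables_ns), len(cables_ns) - 1

print(path_cost(xy_path), path_cost(z_path), path_cost(fat_tree_path))
```

Under these assumptions the sums come out to 195 ns with 4 switches, 120 ns with 3 switches, and 1020 ns with 5 switches.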
Quantitative Evaluation (On-going)
Software simulator (OPNET-based)
Purpose
Get rough results quickly
Validate collective comm.
Rough SW model
Simple arbitration
No back pressure
Limited NW size
~8192 nodes
Hardware emulator
FPGA-based emulator
Obtain detailed results
Cycle accurate model
Real arbitration, flit-level transmission, back pressure
Large NW : ~65536 nodes
[Figure: switch structure and delay model. Input ports 0 to 63 and output ports 0 to 63, with routing and switching. Delays shown: Tx & Rx delay (Rx delay given by send SW), switch delay, routing delay, switching delay, transferring delay, buffering delay]
Hardware Emulator Overview
FPGA cluster
4 x host PCs
4 x FPGAs / PC
4 x 10G SFP+ ports / FPGA
Implementation
SW for nodes (on Linux)
HW for switches (on FPGA)
[Figure: node of FPGA cluster with FPGA board (Stratix V) x 4; other nodes not installed yet]
[Figure: DE5-NET board
FPGA: ALTERA Stratix V 5SGXEA7N2F45C2
QDR II+ SRAM A to D: 18 Mbits each (20-bit addressing for 18-bit data), x18 @ 500 MHz, 1 GB/s for read/write
DDR3 DRAM A and B: PC3-12800 (DDR3-1600), 12.8 GB/s each, 2 GB as default (up to 8 GB), x64 @ 800 MHz (DDR), up to 1066 MHz
10G SFP+ A to D (10G Ether): 10 Gbps+ each (Tx, Rx)
PCI-Express: PCIe 3.0 x8, 8 GB/s (Tx, Rx)]
Hardware Emulator Overview
[Figure: FPGA cluster with 64-port 10GbE switch; node of FPGA cluster with 4 x FPGA boards and SFP+ 10GbE ports; other nodes not installed yet]
Summary
Design space exploration for
small diameter NWs with high-radix switches
Technology constraint
Application demands
global and local-p2p communication
Two candidates after topology comparison
Full fat tree & FTT-hybrid
Preliminary evaluation for cable length & delay
Future (on-going) work
Quantitative evaluation with simulation & emulation
Application performance estimation