Design of Scalable Network Considering Diameter and Cable Delay
Kentaro Sano, Tohoku University, JAPAN

Apr 27, 2018

Page 1: Design of Scalable Network Considering Diameter and … · Simulator & emulator ... Requiring high performance and scalable network ... Small diameter, but big latency via spine SWs

Design of Scalable Network Considering Diameter and Cable Delay

Kentaro Sano

Tohoku University, JAPAN

Page 2

Design of Scalable Network, K Sano

Agenda

Introduction

Assumption

Preliminary evaluation & candidate networks

Cable length and delay

Simulator & emulator

Summary

Page 3

Introduction

Feasibility study : 2012-2014

3 teams working for next-gen supercomputers

Tohoku-NEC-JAMSTEC team

Working group for interconnection network subsystem

[Figure: Osaka University, Tohoku University: W.G. for interconnection subsystem]

Page 4

Background and Objective

More nodes with higher performance

Requiring high performance and scalable network

Application demands

Global/collective communication

Local communication (ex: p2p w/ 3D decomposition)

Usability, performance robustness

Scalability

Goal: find NW for next-gen supercomputers

Exploring design space with application demands and technology constraints

Small-diameter NW using high-radix SWs, which is also good at local p2p communication

Performance, cost, power, usability, reliability

Page 5

Assumption for Design Space Exploration

System scale

~65536 SMP nodes

Technology

~64x64 full crossbar switch

10 GB/s or more per link

[Figure: 64 x 64 switch (virtual cut-through with virtual channels): input queues 1 to 64, full crossbar, outputs (out b)]

[Figure: System overview: nodes 1 to 65536 connected by a network of n planes (for SMP), each plane a fat tree or hybrid NW]

[Figure: IB technology roadmap]

Page 6

Preliminary Evaluation

Typical topologies

Full fat tree

3D / 5D torus

Dragonfly

[Figure: Full fat tree, n-D torus, and Dragonfly topologies (N: node, SW: switch)]

Page 7

Comparison of Topologies

Too large diameter for low-D torus

Too many links for high-D torus / dragonfly

Fat tree looks good, but long cables?

Topology | Full fat-tree | 3D Torus | 5D Torus | Dragonfly
Nodes | 65,536 | 65,536 | 65,536 | 65,536
Organization | 3 stages, 64 x 32 x 32 | 64x32x32 | 16x8x8x8x8 | all-to-all (1D 16, 2D 16x16)
Node injection BW [GB/s] | 10 (all topologies)
Bisection BW [TB/s] | 320 | 20 | 80 | 160
min to Max hops | 2 ~ 6 | 1 ~ 63 | 1 ~ 23 | 2 ~ 5
min to Max delay [ns] | 100 ~ 500 | 100 ~ 6300 | 100 ~ 2300 | 100 ~ 400
Links | 196,608 | 196,608 | 1,310,720 | 468,736
Switches | 5120 | within nodes | within nodes | 4096

Note: no cable delay considered
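
The bisection bandwidths of the fat tree and the two tori can be roughly reproduced with a short calculation. The sketch below assumes 10 GB/s per link, 1 TB = 1024 GB, a full-bisection fat tree (half of the nodes send across the cut), and a torus cut across its largest dimension with wraparound links counted. These counting conventions are assumptions chosen because they match the slide's numbers; the slide does not state them. The Dragonfly value is not derived here.

```python
from math import prod

LINK_BW_GB_S = 10   # GB/s per link (slide assumption: 10 GB/s or more)
GB_PER_TB = 1024    # assumed unit convention; it reproduces the table values


def fat_tree_bisection_tb(nodes: int) -> float:
    """Full-bisection fat tree: half of the nodes can send across the cut."""
    return (nodes // 2) * LINK_BW_GB_S / GB_PER_TB


def torus_bisection_tb(dims: tuple) -> float:
    """Torus cut across its largest dimension; wraparound doubles the cut links."""
    cut_links = 2 * prod(dims) // max(dims)
    return cut_links * LINK_BW_GB_S / GB_PER_TB


print(fat_tree_bisection_tb(65536))          # 320.0
print(torus_bisection_tb((64, 32, 32)))      # 20.0
print(torus_bisection_tb((16, 8, 8, 8, 8)))  # 80.0
```

Under these assumptions the low-D torus has one sixteenth of the fat tree's bisection bandwidth, which is consistent with the slide's concern about global communication on low-D tori.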

Page 8

Full Fat Tree

Small diameter, but big latency via spine SWs

Max # of hops is limited especially with high-radix SWs.

Cable length grows with # of nodes.

[Figure: Full fat tree with spine SWs. Labels: 32 nodes; 32 SWs; 32 SWs; 1024 nodes / island; 65536 nodes / 64 islands; 64 links, 10 GB/s/link; Max 6 hops; Spine SW]
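
The fat tree sizes quoted in the slides (65536 nodes, 5120 switches, 196,608 links, 64 islands, max 6 hops) follow from the switch radix. The sketch below assumes a common 3-stage construction in which stage-1 and stage-2 switches use half of their ports downward and half upward, and spine switches use all ports downward; the function name `fat_tree_3stage` is hypothetical.

```python
def fat_tree_3stage(radix: int) -> dict:
    """Sizes of a 3-stage fat tree built from `radix`-port switches.

    Assumed construction: stages 1 and 2 split ports half down / half up,
    spine switches use all ports downward (one port per island).
    """
    half = radix // 2
    nodes_per_island = half * half        # 32 nodes per stage-1 SW, 32 SWs per island
    islands = radix                       # each spine SW reaches every island once
    nodes = nodes_per_island * islands
    stage1 = nodes // half                # each stage-1 SW serves `half` nodes
    stage2 = stage1                       # same number of up-links to terminate
    spine = stage2 * half // radix        # spine SWs terminate all stage-2 up-links
    return {
        "nodes": nodes,
        "islands": islands,
        "switches": stage1 + stage2 + spine,
        "links": 3 * nodes,               # node links plus two inter-stage levels
        "max_hops": 2 * 3,                # up three stages, then down three
    }


sizes = fat_tree_3stage(64)
print(sizes["nodes"], sizes["switches"], sizes["links"])  # 65536 5120 196608
```

With 64-port switches this gives 1024 nodes per island and 64 islands, matching the figure labels.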

Page 9

Another Candidate: FTT Hybrid

Hierarchical network

Local fat tree (group)

256 nodes

2-stage fat tree

Only short cables in a small fat tree

Global 2D torus

16x16 of 256-node groups

Short cables to connect adjacent groups

512 links between groups

Expected advantages

Shorter cables

Expandable & flexible

[Figure: FTT (Fat Tree & Torus) hybrid. Local fat tree per group (G); global NW: 2D torus of 16 x 16 groups. Labels: 128, 128, 128, 256 Nodes]
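
As a quick check of the hybrid's scale, the sketch below computes the total node count and the worst-case number of group-to-group hops on the global 2D torus. It assumes minimal routing with wraparound links in both dimensions; the helper name `torus_hops` is hypothetical.

```python
GROUP_NODES = 256       # nodes per local fat tree (from the slide)
TORUS_DIMS = (16, 16)   # global 2D torus of groups (from the slide)


def torus_hops(src, dst, dims=TORUS_DIMS):
    """Minimal hop count between two groups on a torus with wraparound."""
    return sum(min(abs(s - d), n - abs(s - d)) for s, d, n in zip(src, dst, dims))


total_nodes = GROUP_NODES * TORUS_DIMS[0] * TORUS_DIMS[1]
diameter = max(
    torus_hops((0, 0), (x, y))
    for x in range(TORUS_DIMS[0])
    for y in range(TORUS_DIMS[1])
)

print(total_nodes)                    # 65536
print(diameter)                       # 16
print(torus_hops((0, 0), (15, 0)))    # 1 (wraparound neighbour)
```

The hybrid therefore reaches the same 65536 nodes as the full fat tree, at the cost of up to 16 torus hops between distant groups.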

Page 10

Comparison Summary

Detailed & quantitative evaluation

Full fat tree and FTT hybrid

Consider more details about implementation & apps

Topology | Features | Diameter | # of Links | Note
Fat tree | General-purpose, High usability | ◎ | ○ | High cable delay?
Low-D torus | | × | ○ | -
High-D torus | | ○ | × | -
Dragonfly | Pseudo high-radix NW | ◎ | × | -
FTT-hybrid | Combination of Fat tree and torus | ○ | ○ | Low cable delay? Good cost performance, Extendability

(◎: excellent, ○: good, ×: poor)

Page 11

Cable Length and Delay

Preliminary estimation based on expected implementation

Boards (node, switch)

Cabinets (node, switch)

Floor layout

Cabling

[Figure: FTT-hybrid layout example: cabinets C0, C1, C2, C3]

Page 12

Preliminary Result

[Figure: Fat tree (Max 6 hops). Nodes A, B, D, E under Stage 1, Stage 2, and Stage 3 (spine switch) SWs; labels 0.05 %, 1.5 %, 98.4 %. Cables: node to Stage 1: 2 m, 10 ns; Stage 1 to Stage 2: 20 m, 100 ns; Stage 2 to Stage 3: 80 m, 400 ns]

[Figure: FTT-hybrid (Max 20 hops). Nodes A, B, D, E under two SW stages; labels 0.05 %, 0.33 %, 99.6 %. Cables: node to SW: 2 m, 10 ns; SW to SW in the local fat tree: 10 m, 50 ns; between groups: 15 m, 75 ns, 1~16 hops in 2D torus]

No big difference in Max cable delay

Fat tree = 1020ns + (5 SW-delay)

Hybrid = 1395ns + (19 SW-delay)

Hybrid can have shorter delay for local p2p communication.
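
The fat tree's maximum cable delay can be reconstructed from the cable lengths in the figure. The sketch below assumes a propagation delay of 5 ns per metre, which is implied by every length/delay pair on the slide, and a worst-case path that climbs to the spine switch and descends again. The names are hypothetical, and the hybrid's 1395 ns is not decomposed here.

```python
NS_PER_M = 5  # implied by the slide: 2 m -> 10 ns, 20 m -> 100 ns, 80 m -> 400 ns


def cable_delay_ns(length_m: float) -> float:
    """Propagation delay of one cable under the assumed 5 ns/m."""
    return length_m * NS_PER_M


# Every (length in m, delay in ns) pair printed on the slide.
slide_pairs = [(2, 10), (20, 100), (80, 400), (10, 50), (15, 75)]
assert all(cable_delay_ns(m) == ns for m, ns in slide_pairs)

# Worst-case fat tree path: node -> stage 1 -> stage 2 -> spine, then back down.
up_path_m = [2, 20, 80]
worst_path_m = up_path_m + up_path_m[::-1]

fat_tree_cable_ns = sum(cable_delay_ns(m) for m in worst_path_m)
hops = len(worst_path_m)   # links traversed
switches = hops - 1        # switches traversed between the two nodes

print(fat_tree_cable_ns, hops, switches)  # 1020 6 5
```

The two 80 m spine cables account for 800 ns of the 1020 ns, so the worst-case cable delay of the fat tree is dominated by the spine stage.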


Page 13

Example of 3D Mesh Communication

3D decomposition and adjacent communication

[Figure: Data exchange among 3D subgrids (x, y, z), mapped onto the FTT hybrid: global NW is a 2D torus of 16 x 16 groups (G). Labels: z (x & y can be assigned), x, y]

Latency (4 hops)

= 195ns + (4 SW-delay) : x, y

= 120ns + (3 SW-delay) : z

Much shorter than Fat tree

= 1020ns + (5 SW-delay) : x, y, z
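
To compare these latencies end to end, a switch delay has to be plugged in. The sketch below uses 100 ns per switch, which is only an assumption suggested by the topology comparison table (100 ~ 500 ns for 2 ~ 6 hops without cable delay); the real switch delay is what the simulator and emulator are meant to determine. The cable figures are taken from the slide: 120 ns equals two 2 m and two 10 m cables, and 195 ns adds one 15 m inter-group cable.

```python
SW_DELAY_NS = 100  # hypothetical per-switch delay; not stated on this slide


def latency_ns(cable_ns: int, switches: int, sw_delay_ns: int = SW_DELAY_NS) -> int:
    """Total latency = cable delay + number of switches * per-switch delay."""
    return cable_ns + switches * sw_delay_ns


# (cable delay in ns, switches traversed), as given on the slide.
paths = {
    "hybrid x, y": (195, 4),
    "hybrid z": (120, 3),
    "fat tree x, y, z": (1020, 5),
}

# Consistency check of the cable figures against the cable delays (5 ns/m).
assert 120 == 2 * (10 + 50)
assert 195 == 120 + 75

totals = {name: latency_ns(cable, sw) for name, (cable, sw) in paths.items()}
for name, total in totals.items():
    print(f"{name}: {total} ns")
# hybrid x, y: 595 ns
# hybrid z: 420 ns
# fat tree x, y, z: 1520 ns
```

Under this assumed switch delay, adjacent communication on the hybrid takes well under half the fat tree's worst-case latency; a different switch delay changes the ratio but not the ordering, since the hybrid paths also traverse fewer switches.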

Page 14

Quantitative Evaluation (On-going)

Software simulator (OPNET-based)

Purpose

Get rough results quickly

Validate collective comm.

Rough SW model

Simple arbitration

No back pressure

Limited NW size: ~8192 nodes

Hardware emulator

FPGA-based emulator

Obtain detailed results

Cycle accurate model

Real arbitration, flit-level transmission, back pressure

Large NW : ~65536 nodes

[Figure: Switch structure and delay model. Input ports 0 to 63 and output ports 0 to 63; routing, switching; Tx & Rx delay (Rx delay given by send SW); switch delay: routing delay, switching delay, transferring delay, buffering delay]

Page 15

Hardware Emulator Overview

FPGA cluster

4 x host PCs

4 x FPGAs / PC

4 x 10G SFP+ ports / FPGA

Implementation

SW for nodes (on Linux)

HW for switches (on FPGA)

[Figure: Node of FPGA cluster with 4 x FPGA boards (DE5-NET, ALTERA Stratix V FPGA 5SGXEA7N2F45C2). Per board: QDR II+ SRAM A to D, 18 Mbits each (20-bit addressing for 18-bit data), x18 @ 500 MHz, 1 GB/s for read/write; DDR3 DRAM A and B, PC3-12800 (DDR3-1600), 2 GB as default (up to 8 GB), x64 @ 800 MHz (DDR) up to 1066 MHz, 12.8 GB/s each; 10G SFP+ A to D, 10 Gbps+ each (Tx, Rx); PCIe 3.0 x8: 8 GB/s (Tx, Rx). Other nodes not installed yet]

Page 16

Hardware Emulator Overview

[Photo: Node of FPGA cluster with 4 x FPGA boards and SFP+ 10GbE ports; 64 port 10GbE switch; other nodes not installed yet]

Page 17

Summary

Design space exploration for small diameter NWs with high-radix switches

Technology constraint

Application demands

global and local-p2p communication

Two candidates after topology comparison

Full fat tree & FTT-hybrid

Preliminary evaluation for cable length & delay

Future (on-going) work

Quantitative evaluation with simulation & emulation

Application performance estimation

Page 18

Thank you!