Vendor Tutorial
InfiniBand-over-distance transport using low-latency WDM transponders and IB credit buffering
Christian Illmer, ADVA Optical Networking
October 2008
Connectivity performance
[Figure: fiber link capacity (100 Mbit/s to 10 Tbit/s, 1980-2010) for WDM, Ethernet, Fibre Channel and InfiniBand (4x/12x, 4x DDR, 12x QDR), compared against Moore's Law (doubling every 18 months). Source: Ishida, O., "Toward Terabit LAN/WAN" panel, iGRID 2005.]
Bandwidth requirements follow Moore’s Law (# transistors on a chip)
So far, InfiniBand has outperformed both Fibre Channel and Ethernet in bandwidth and has kept pace with Moore's growth rate
InfiniBand data rates
InfiniBand              IBx1         IBx4        IBx12
Single Data Rate, SDR   2.5 Gbit/s   10 Gbit/s   30 Gbit/s
Double Data Rate, DDR   5 Gbit/s     20 Gbit/s   60 Gbit/s
Quad Data Rate, QDR     10 Gbit/s    40 Gbit/s   120 Gbit/s
IB uses 8B/10B line coding, so, e.g., IBx1 DDR delivers 4 Gbit/s of usable throughput (see the sketch after this list)
Copper
Defined for all data rates and lane multipliers
Serial for SDR x1, DDR x1, QDR x1
Parallel copper cables (x4 or x12)
Fiber optic
Defined for all data rates, up to x4
Serial for SDR x1, DDR x1, QDR x1 and SDR x4 LX (serialized I/F)
Parallel for SDR x4 SX
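A quick way to reproduce the data-rate table above: effective throughput is lane count x per-lane signaling rate x 8/10 for the 8B/10B line code. A minimal sketch of that arithmetic (function and table names are illustrative, not from the tutorial):

```python
# Effective InfiniBand throughput = lanes x per-lane signaling rate x 8/10 (8B/10B).

LANE_RATE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}  # per-lane signaling rates

def effective_throughput_gbps(rate: str, lanes: int) -> float:
    """Usable data rate after stripping the 8B/10B coding overhead."""
    return LANE_RATE_GBPS[rate] * lanes * 8 / 10

print(effective_throughput_gbps("DDR", 1))   # 4.0  -> the IBx1 DDR example above
print(effective_throughput_gbps("QDR", 4))   # 32.0 -> payload of a 40 Gbit/s IBx4 QDR link
print(effective_throughput_gbps("QDR", 12))  # 96.0 -> payload of a 120 Gbit/s IBx12 QDR link
```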
Protocols and bit rates
[Figure: bit rates from 10 Mbit/s to 100 Gbit/s of HDD, synchronous, Fibre Channel, Ethernet and InfiniBand protocols: 10bT, ETR/CLO, FE, ESCON, STM-1/4/16/64, 1G/2G/4G/8G/10G-FC, FICON/FICON2/FICON4, ISC/ISC3, GbE, 10GbE, 40GbE, 100GbE, Ultra160/Ultra320 SCSI, OTU3, and InfiniBand from IBx1 SDR to IBx12 QDR.]
CPU connectivity market
[Figure: market penetration of different CPU interconnect technologies (GbE, InfiniBand, Myrinet, SP Switch, NUMAlink, Quadrics, Crossbar, Cray Interconnect, proprietary, mixed), 2006 vs. 2007, 0-50%.]
InfiniBand clearly dominates new high-end data center implementations
TOP 100 supercomputers: 37% in '07, 50% in '08
HPC networks today
Typical HPC DC today
Dedicated networks / technologies for LAN, SAN, CPU (server) interconnect
Consolidation required
FC and GbE HBAs and IB HCAs
[Diagram: typical HPC data center with separate Ethernet LAN, FC SAN and IB cluster fabrics; each server in the cluster carries Ethernet, FC and IB adapters.]
Relevant parameters
LAN HBAs based on GbE/10GbE
SAN HBAs based on 4G/8G-FC
HCAs based on IB(x4) DDR/QDR
Unified InfiniBand architecture
API: Application Programming Interface
VAPI: Verbs API
SDP: Sockets Direct Protocol
TS API: Terminal Server API
SRP: SCSI RDMA Protocol
uDAPL: User-level Direct-Access Programming Library
DAT: Direct Access Transport
BSD Socket: Berkeley Socket API
[Diagram: unified InfiniBand architecture. MPI, SDP, SRP, uDAPL/DAT, NFS-RDMA and IPoIB/TCP/IP/BSD sockets run over the Verbs API (VAPI) and the InfiniBand HCA; the unified IB fabric connects through an Ethernet gateway to the LAN/WAN and through an FC gateway (FCP) to the SAN.]
HPC networks tomorrow?
Consolidation step: unified IB switch fabric (IB SF)
IB SF used for the CPU cluster, the LAN and storage (using IPoIB, SRP and gateways)
LAN now based on IPoIB and an Ethernet gateway
Storage attached via the SCSI RDMA protocol (SRP) through an FC gateway
Unlikely to be deployed on a broad scale
[Diagram: server cluster with IB HCAs attached to a single IB switch fabric; Ethernet and FC gateways provide LAN (IPoIB) and SAN (SRP) connectivity.]
InfiniBand connections over distance
Why is it relevant? Data centers are dispersing geographically (GRID computing, virtualization, disaster recovery, …)
Native, low-latency IB-over-distance transport was still the missing piece
Cluster connectivity via IB-over-WDM:
WAN protocol is IB, no conversion needed
No additional latency
Fully transparent transport
[Diagram: IB server cluster A and IB server cluster B, each behind its own IB switch fabric, connected over more than 50 km of dark fiber via IB-over-DWDM.]
InfiniBand throughput over distance
What is the solution?
IB range extender: credit buffering, low latency, conversion to 10G optics
WDM: lowest latency, transparency, capacity, reach, fiber relief
What are the commercial requirements?
Solution must be based on commercial products
Interworking capabilities must be demonstrated
Throughput drops significantly after several meters
Only buffer credits (B2B credits) ensure maximum InfiniBand performance over distance
Buffer credit size is directly related to distance: the buffer must cover the link's bandwidth-delay product (see the sketch below)
[Figure: throughput vs. distance with and without B2B credits; without credit buffering, throughput falls off steeply with distance.]
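The credit buffer needed to sustain full rate is simply the bandwidth-delay product of the link. A minimal sketch of that estimate, assuming ~5 µs/km of fiber propagation delay and a hypothetical 64-byte credit unit (illustrative numbers, not from the tutorial):

```python
# Credit buffer needed to keep an IB link full over distance (bandwidth-delay product).
# Assumptions (illustrative): ~5 us/km one-way fiber delay, 64-byte credit units.

def required_credit_buffer(distance_km: float, data_rate_gbps: float,
                           credit_unit_bytes: int = 64):
    one_way_delay_s = distance_km * 5e-6        # ~5 us per km in fiber
    rtt_s = 2 * one_way_delay_s                 # credits are returned after a round trip
    bytes_in_flight = data_rate_gbps * 1e9 / 8 * rtt_s
    credits = -(-int(bytes_in_flight) // credit_unit_bytes)   # ceiling division
    return bytes_in_flight, credits

# Example: IBx4 SDR payload rate (8 Gbit/s) over 50 km
in_flight, credits = required_credit_buffer(50, 8)
print(f"{in_flight / 1e6:.2f} MB in flight -> {credits} credits of 64 B")
# ~0.5 MB of buffering is needed to sustain full throughput at 50 km
```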
Demonstrator setup at HLRS
[Diagram: Cell cluster at HLRS Nobelstrasse and IBM cluster at HLRS Allmandring, connected via DWDM over 0.4 to 100.4 km of G.652 SSMF.]
Voltaire ISR2012 Grid Director: 288 x DDR IBx4 ports, 11.5 Tb/s backplane, <450 ns latency
ADVA FSP 2000 DWDM: 4 x 10 Gbit/s transponders, <100 ns link latency
Obsidian Longbow Campus: 4 x SDR copper to 10G optical, 2-port switch architecture, 840 ns port-to-port latency, 10/40 km reach (buffer credits) (a rough latency budget is sketched below)
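For orientation, a rough one-way latency budget for this path, combining the device latencies above with ~5 µs/km of fiber propagation delay (the per-km figure and the path composition are assumptions, not from the tutorial):

```python
# Rough one-way latency budget for the demonstrator link (illustrative).
# Device latencies are the figures quoted above; fiber assumes ~5 us/km in G.652 SSMF.

FIBER_DELAY_NS_PER_KM = 5_000      # ~5 us per km

def one_way_latency_ns(distance_km: float) -> float:
    switch_ns      = 450           # Voltaire ISR2012 Grid Director (per slide, <450 ns)
    longbow_ns     = 840           # Obsidian Longbow Campus, port-to-port
    transponder_ns = 100           # ADVA FSP 2000 DWDM link latency (<100 ns)
    fiber_ns = distance_km * FIBER_DELAY_NS_PER_KM
    # assume one IB switch, one range extender and one transponder at each end
    return 2 * (switch_ns + longbow_ns + transponder_ns) + fiber_ns

for km in (0.4, 50.4, 100.4):
    print(f"{km:6.1f} km: {one_way_latency_ns(km) / 1000:8.1f} us")
# At 100.4 km the ~502 us of fiber delay dwarfs the ~2.8 us of equipment latency.
```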
Demonstrator results
[Figure: Intel MPI SendRecV throughput (0 to 1 GB/s). Left: throughput vs. message length (0 to 4000 kB) at 0.4, 25.4, 50.4, 75.4 and 100.4 km. Right: throughput vs. distance (0 to 100 km) for 32 kB, 128 kB, 512 kB and 4096 kB messages.]
The Intel MPI benchmark SendRecV was used
Constant performance up to 50 km
Decreasing performance beyond 50 km
Full InfiniBand throughput over more than 50 km
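One way to read the knee in the measured curves: as long as the credit buffer covers the bandwidth-delay product, throughput stays at line rate; beyond that it falls roughly as buffer/RTT. A minimal model with assumed numbers (0.5 MB buffer, ~5 µs/km fiber delay; not taken from the measurement):

```python
# Throughput of a credit-limited link: min(line rate, credit buffer / round-trip time).
# Assumed numbers: ~1 GB/s effective line rate, 0.5 MB credit buffer, ~5 us/km fiber delay.

def throughput_gb_per_s(distance_km: float,
                        buffer_bytes: float = 0.5e6,
                        rate_bytes_per_s: float = 1e9) -> float:
    rtt_s = 2 * distance_km * 5e-6
    if rtt_s == 0:
        return rate_bytes_per_s / 1e9
    return min(rate_bytes_per_s, buffer_bytes / rtt_s) / 1e9

for km in (25, 50, 75, 100):
    print(f"{km:3d} km: {throughput_gb_per_s(km):.2f} GB/s")
# Flat at ~1 GB/s up to ~50 km, then falling roughly as 1/distance -
# the same shape as the measured SendRecV curves.
```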
Thank you