To Infiniband and Beyond: High Speed Interconnects in Commodity
HPC Clusters
Teresa Kaltz, PhD
Research Computing
December 3, 2009
Interconnect Types on Top 500
On the latest TOP500 list, there is exactly one 10 GigE deployment, compared to 181 InfiniBand-connected systems.
Michael Feldman, HPCwire Editor
Top 500 Interconnects 2002-2009
[Chart: number of Top 500 systems by interconnect type (Ethernet, Infiniband, Other), 2002-2009]
What is Infiniband Anyway?
• Open, standard interconnect architecture
– http://www.infinibandta.org/index.php
– Complete specification available for download
• Complete "ecosystem"
– Both hardware and software
• High bandwidth, low latency, switch-based
• Allows remote direct memory access (RDMA)
Why Remote DMA?
• TCP offload engines reduce overhead by offloading protocol processing such as checksums
• 2 copies on receive: NIC -> kernel -> user
• Solution is Remote DMA (RDMA)

Per byte                   Percent overhead
  User-system copy             16.5 %
  TCP Checksum                 15.2 %
  Network-memory copy          31.8 %
Per packet
  Driver                        8.2 %
  TCP+IP+ARP protocols          8.2 %
  OS overhead                  19.8 %
What is RDMA?
Infiniband Signalling Rate
• Each link is a point-to-point serial connection
• Usually aggregated into groups of four
• Unidirectional effective bandwidth (worked numbers below)
– SDR 4X: 1 GB/s
– DDR 4X: 2 GB/s
– QDR 4X: 4 GB/s
• Bidirectional bandwidth twice unidirectional
• Many factors impact measured performance!
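Not from the original slides, but a quick worked example of where the 1/2/4 GB/s figures come from, assuming 4 lanes per link, per-lane signalling rates of 2.5/5/10 Gb/s for SDR/DDR/QDR, and the 8b/10b line encoding that costs 20% of the raw rate:

/* Worked example (assumptions: 8b/10b encoding, per-lane signalling
 * rates of 2.5/5/10 Gb/s, 4 lanes per link) showing how the 1/2/4 GB/s
 * unidirectional figures on this slide are derived. */
#include <stdio.h>

int main(void)
{
    const char  *gen[]       = { "SDR", "DDR", "QDR" };
    const double lane_gbps[] = { 2.5, 5.0, 10.0 };   /* signalling rate per lane */
    const int    lanes       = 4;                    /* 4X link                  */
    const double encoding    = 8.0 / 10.0;           /* 8b/10b overhead          */

    for (int i = 0; i < 3; i++) {
        double data_gbps = lane_gbps[i] * lanes * encoding;  /* usable Gb/s          */
        double gbytes    = data_gbps / 8.0;                  /* GB/s, unidirectional */
        printf("%s 4X: %.1f Gb/s signalling -> %.1f Gb/s data -> %.0f GB/s\n",
               gen[i], lane_gbps[i] * lanes, data_gbps, gbytes);
    }
    return 0;
}

Bidirectional figures are simply double these, and measured application bandwidth is lower still because of host and protocol overheads.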
Infiniband Roadmap from IBTA
DDR 4X Unidirectional Bandwidth
• Achieved bandwidth limited by PCIe 8x Gen 1
• Current platforms mostly ship with PCIe Gen 2
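A rough sketch (illustrative numbers, not a measurement) of why a PCIe 8x Gen 1 slot caps measured DDR 4X bandwidth: after 8b/10b encoding the slot moves about 2 GB/s per direction, and PCIe packet overhead, assumed here at roughly 80% efficiency (it varies with payload size), reduces that further; Gen 2 doubles the per-lane rate.

/* Why a PCIe Gen1 x8 slot bottlenecks a 2 GB/s DDR 4X link.
 * Assumptions: 8b/10b encoding and an illustrative ~80% packet/protocol
 * efficiency; the exact efficiency depends on payload size. */
#include <stdio.h>

static double pcie_gbytes(double gt_per_s, int lanes, double proto_eff)
{
    double data_gbps = gt_per_s * 0.8 * lanes;   /* 8b/10b -> usable Gb/s */
    return data_gbps / 8.0 * proto_eff;          /* GB/s per direction    */
}

int main(void)
{
    printf("PCIe Gen1 x8: ~%.1f GB/s usable (vs 2 GB/s DDR 4X link rate)\n",
           pcie_gbytes(2.5, 8, 0.8));
    printf("PCIe Gen2 x8: ~%.1f GB/s usable\n", pcie_gbytes(5.0, 8, 0.8));
    return 0;
}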
QDR 4X Unidirectional Bandwidth
http://mvapich.cse.ohio-state.edu/performance/interNode.shtml
• Still seem to have a bottleneck at the host if using QDR
Latency Measurements: IB vs GbE
Infiniband Latency Measurements
Infiniband Silicon Vendors
• Both switch and HCA parts
– Mellanox: InfiniScale, InfiniHost
– QLogic: TrueScale, InfiniPath
• Many OEMs use their silicon
• Large switches
– Parts arranged in fat tree topology
Infiniband Switch Hardware
[Photos: switch product line built from 24-port silicon: 24, 48, 96, 144 and 288 port models]
• 24-port silicon product line at right
• Scales to thousands of ports
• Host-based and hardware-based subnet management
• Current generation (QDR) based on 36-port silicon
• Up to 864 ports in a single switch!
Infiniband Topology
• Infiniband uses credit-based flow control
– Need to avoid loops in topology that may produce deadlock
• Common topology for small and medium size networks is a tree (Clos); see the sizing sketch below
• Mesh/torus more cost effective for large clusters (>2500 hosts)
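As a back-of-the-envelope sizing aid (an addition here, not from the talk): a non-blocking fat tree built from radix-k switch chips supports at most 2*(k/2)^levels end ports, which is where port counts such as 288 from 24-port silicon come from.

/* Max hosts in a non-blocking fat tree built from radix-k crossbar chips:
 * 2*(k/2)^levels end ports. Assumes full bisection (no oversubscription). */
#include <stdio.h>
#include <math.h>

static long fat_tree_hosts(int radix, int levels)
{
    return (long)(2.0 * pow(radix / 2.0, levels));
}

int main(void)
{
    /* 24-port DDR-era silicon and 36-port QDR-era silicon */
    printf("24-port, 2 levels: %ld hosts\n", fat_tree_hosts(24, 2)); /* 288   */
    printf("36-port, 2 levels: %ld hosts\n", fat_tree_hosts(36, 2)); /* 648   */
    printf("36-port, 3 levels: %ld hosts\n", fat_tree_hosts(36, 3)); /* 11664 */
    return 0;
}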
Infiniband Routing
• Infiniband is statically routed (conceptual sketch below)
• Subnet management software discovers fabric and generates set of routing tables
– Most subnet managers support multiple routing algorithms
• Tables updated with changes in topology only
• Often cannot achieve theoretical bisection bandwidth with static routing
• QDR silicon introduces adaptive routing
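A conceptual sketch of what static routing means in practice; this is hypothetical code, not OpenSM internals. Each switch holds a linear forwarding table (LFT), programmed by the subnet manager, that maps a destination LID to exactly one output port, so traffic to a given destination always takes the same path regardless of load.

/* Conceptual sketch (hypothetical, not OpenSM code) of Infiniband static
 * routing: a switch's linear forwarding table (LFT) is indexed by
 * destination LID, so every packet to a given LID leaves through the same
 * port. This is why static routing can miss full bisection bandwidth for
 * unlucky traffic patterns. */
#include <stdint.h>
#include <stdio.h>

#define MAX_LID 49152   /* unicast LID space */

struct ib_switch {
    uint8_t lft[MAX_LID];   /* dest LID -> output port, filled in by the SM */
};

static int route_packet(const struct ib_switch *sw, uint16_t dest_lid)
{
    return sw->lft[dest_lid];   /* same answer every time: no adaptivity */
}

int main(void)
{
    struct ib_switch sw = {0};
    sw.lft[42] = 3;             /* the SM decided LID 42 exits via port 3 */
    printf("packet for LID 42 -> port %d\n", route_packet(&sw, 42));
    return 0;
}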
HPCC Random Ring Benchmark
[Chart: HPCC Random Ring average bandwidth (MB/s) vs. number of enclosures, comparing four routing algorithms, "Routing 1" through "Routing 4"]
Infiniband Specification for Software
• IB specification does not define API
• Actions are known as "verbs" (see the fragment below)
– Services provided to upper layer protocols
– Send verb, receive verb, etc
• Community has standardized around open source distribution called OFED to provide verbs
• Some Infiniband software is also available from vendors
– Subnet management
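For flavour, a minimal fragment against the libibverbs API shipped with OFED. It is only a sketch: error handling, queue pair creation, connection setup and posting of work requests are all omitted. It opens an HCA, allocates a protection domain, and registers a buffer so the adapter can DMA into it.

/* Minimal libibverbs fragment (sketch only; QP creation, connection setup
 * and work requests omitted, as is most error handling). */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);   /* open HCA          */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                    /* protection domain */

    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,              /* pin + register    */
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);  /* allow RDMA writes */

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    free(buf);
    return 0;
}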
Application Support of Infiniband
• All major MPI implementations support native IB (minimal ping-pong sketch below)
– OpenMPI, MVAPICH, Intel MPI
• Existing socket applications
– IP over IB
– Sockets direct protocol (SDP)
• Does NOT require re-link of application
• Oracle uses RDS (Reliable Datagram Sockets)
– First available in Oracle 10g R2
• Developer can program to "verbs" layer
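A minimal ping-pong sketch illustrating the MPI bullet above: the same source runs unchanged over GbE or Infiniband, and an IB-aware MPI (OpenMPI, MVAPICH, Intel MPI) uses the verbs layer underneath. The two-rank launch (e.g. mpirun -np 2) and message size are assumptions for the example, not from the talk.

/* Minimal MPI ping-pong latency sketch; assumes it is launched with
 * exactly two ranks, e.g. "mpirun -np 2 ./pingpong". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    char buf[8];                       /* small message -> latency-dominated */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}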
Infiniband Software Layers
OFED Software
• Openfabrics Enterprise Distribution software from Openfabrics Alliance
– http://www.openfabrics.org/
• Contains everything needed to run Infiniband
– HCA drivers
– verbs implementation
– subnet management
– diagnostic tools
• Versions qualified together
Openfabrics Software Components
"High Performance" Ethernet
• 1 GbE cheap and ubiquitous
– hardware acceleration
– multiple multiport NICs
– supported in kernel
• 10 GbE still used primarily as uplinks from edge switches and as backbone
• Some vendors providing 10 GbE to server
– low cost NIC on motherboard
– HCAs with performance proportional to cost
RDMA over Ethernet
• NIC capable of RDMA is called RNIC
• RDMA is primary method of reducing latency on host side
• Multiple vendors have RNICs
– Mainstream: Broadcom, Intel, etc.
– Boutique: Chelsio, Mellanox, etc.
• New Ethernet standards
– "Data Center Bridging"; "Converged Enhanced Ethernet"; "Data Center Ethernet"; etc.
What is iWarp?
• RDMA consortium (RDMAC) standardized some protocols, which are now part of the IETF Remote Direct Data Placement (RDDP) working group
• http://www.rdmaconsortium.org/home
• Also defined SRP, iSER in addition to verbs
• iWARP supported in OFED
• Most specification work complete in ~2003
RDMA over Ethernet?
The name ‘RoCEE’ (RDMA over Converged Enhanced Ethernet) is a working name.
You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or any other of a host of equally obscure names.
Tom Talpey, Microsoft Corporation Paul Grun, System Fabric Works August 2009
The Future: InfiniFibreNet
• Vendors moving towards "converged fabrics"
• Using same "fabric" for both networking and storage
• Storage protocols and IB over Ethernet
• Storage protocols over Infiniband
– NFS over RDMA, Lustre
• Gateway switches and converged adapters
– Various combinations of Ethernet, IB and FC
Any Questions?
THANK YOU!
(And no mention of The Cloud)