Page 1: Towards a Common Communication Infrastructure for Clusters and Grids

Darius Buntinas

Argonne National Laboratory

Page 2: Overview

Cluster Computing vs. Distributed Grids

InfiniBand
– IB for WAN

IP and Ethernet
– Improving performance

Other LAN/WAN Options

Summary

Page 3: Cluster Computing vs. Distributed Grids

Typical clusters

– Homogeneous architecture

– Dedicated environments

Compatibility is not a concern

– Clusters can use high-speed LAN networks
• E.g., VIA, Quadrics, Myrinet, InfiniBand

– And specific hardware accelerators
• E.g., protocol offload, RDMA

Page 4: Cluster Computing vs. Distributed Grids (cont'd)

Distributed environments

– Heterogeneous architecture

– Communication over WAN

– Multiple administrative domains

Compatibility is critical

– Most WAN stacks are IP/Ethernet

– Popular grid communication protocols
• TCP/IP/Ethernet
• UDP/IP/Ethernet

But what about performance?

– TCP/IP/Ethernet latency: tens of µs

– InfiniBand latency: single-digit µs

How do you maintain high intra-cluster performance while enabling inter-cluster communication?

Page 5: Solutions

Use one network for LAN and another for WAN

– You need to manage two networks

– Your communication library needs to be multi-network capable (a rough interface sketch follows this slide)
• May have an impact on performance or resource utilization

Maybe a better solution: a common network subsystem

– One network for both LAN and WAN

– Two popular network families
• InfiniBand
• Ethernet
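
To make "multi-network capable" concrete, here is a minimal C sketch of the usual approach: hide each network family behind a common function table and pick a module per peer. The netmod_t type and the ib_netmod/tcp_netmod/select_netmod names are invented for illustration, not taken from the talk or from any particular library.

```c
/* Sketch of a multi-network capable communication library: each
 * network family sits behind a common function table, and the library
 * picks a module per peer.  All names are illustrative; the function
 * pointers are left as stubs. */
#include <stddef.h>

typedef struct netmod {
    const char *name;
    int (*connect_peer)(const char *host);
    int (*send)(int peer, const void *buf, size_t len);
    int (*recv)(int peer, void *buf, size_t len);
} netmod_t;

/* One module per network family; real implementations would fill in
 * the function pointers (e.g., IB verbs calls vs. BSD sockets). */
static netmod_t ib_netmod  = { "ib",  NULL, NULL, NULL };
static netmod_t tcp_netmod = { "tcp", NULL, NULL, NULL };

/* Local peers use the fast LAN module; remote peers use the
 * interoperable TCP/IP module. */
static netmod_t *select_netmod(int peer_is_local)
{
    return peer_is_local ? &ib_netmod : &tcp_netmod;
}
```

The cost hinted at on the slide comes from exactly this indirection and from keeping two sets of connections and buffers alive at once.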

Page 6: InfiniBand

Initially introduced as a LAN

– Now expanding onto WAN

Issues with using IB on the WAN

– IB copper cables have limited lengths

– IB uses end-to-end credit-based flow control

Page 7: Cable Lengths

IB copper cabling

– Signal integrity decreases with length and data rate

– IB 4x QDR (32 Gbps) max cable length is < 1 m

Solution: optical cabling for IB, e.g., Intel Connects Cables

– Optical cables

– Electrical-to-optical converters at the ends
• ~50 ps conversion delay

– Plug into existing copper-based adapters

Page 8: End-to-End Flow Control

IB uses end-to-end credit-based flow control (a sender-side sketch follows this slide)

– One credit corresponds to one buffer unit at the receiver

– Sender can send one unit of data per credit

– Long one-way latencies limit achievable throughput
• WAN latencies are on the order of milliseconds

Solution: Hop-by-hop flow control

– E.g., Obsidian Networks Longbow switches

– Switches have internal buffering

– Link-level flow control is performed between node and switch
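
A minimal sketch of sender-side credit accounting, to make the throughput problem concrete. The struct and function names are illustrative; in InfiniBand this mechanism lives in hardware and firmware, not in application code.

```c
/* Illustrative sender-side credit accounting for end-to-end,
 * credit-based flow control.  One credit == one receive buffer
 * advertised by the peer. */
#include <stdbool.h>

struct credit_state {
    int credits;   /* receive buffers the peer has advertised to us */
};

/* Try to send one buffer-sized unit; returns false if we must wait
 * for the receiver to return credits.  With a WAN round trip of
 * milliseconds, the sender stalls here unless the credit/buffer pool
 * covers the bandwidth-delay product. */
static bool try_send_unit(struct credit_state *cs /*, payload ... */)
{
    if (cs->credits == 0)
        return false;          /* stalled: waiting for a credit update */
    cs->credits--;
    /* ... post the actual send here ... */
    return true;
}

/* Receiver has freed 'n' buffers and advertised them back to us. */
static void credit_update(struct credit_state *cs, int n)
{
    cs->credits += n;
}
```

Hop-by-hop schemes such as the Longbow switches shorten the loop over which credits must travel, so the sender stalls far less often.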

Page 9: Effect of Delay on Bandwidth

Distance (km)   Delay (µs)
    1               5
    2              10
   20             100
  200            1000
 2000           10000

Source: S. Narravula et al., "Performance of HPC Middleware over InfiniBand WAN," Ohio State University Technical Report OSU-CISRC-12/07-TR77, 2007.
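
To see why these delays cap bandwidth under end-to-end credit-based flow control: the sender can never have more than its advertised credits' worth of data in flight, so throughput is bounded by that amount divided by the round-trip time. A small C sketch of the arithmetic, using ~5 µs/km one-way fiber delay and an assumed (purely illustrative) pool of 64 credits of 8 KB each:

```c
/* Back-of-the-envelope bound:
 *     max_bandwidth <= credits * buffer_size / RTT
 * The credit count and buffer size are illustrative assumptions,
 * not values from the talk. */
#include <stdio.h>

int main(void)
{
    const double us_per_km   = 5.0;      /* ~5 us/km one-way in fiber  */
    const double credits     = 64.0;     /* assumed advertised buffers */
    const double buffer_size = 8192.0;   /* assumed bytes per buffer   */
    const double distances_km[] = { 1, 2, 20, 200, 2000 };

    for (int i = 0; i < 5; i++) {
        double rtt_s   = 2.0 * distances_km[i] * us_per_km * 1e-6;
        double max_gbps = credits * buffer_size * 8.0 / rtt_s / 1e9;
        printf("%6.0f km: RTT %8.1f us, bound %8.3f Gbps\n",
               distances_km[i], rtt_s * 1e6, max_gbps);
    }
    return 0;
}
```

With these assumed numbers the bound is huge at 1-2 km but drops to well under a gigabit per second at 2000 km, which is the effect the referenced report measures.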

Page 10: IP and Ethernet

Traditionally

– IP/Ethernet is used for the WAN

– and as a low-cost alternative on the LAN

– Software-based TCP/IP stack implementation
• Software overhead limits performance

Performance limitations

– Small 1500-byte maximum transmission unit (MTU)

– TCP/IP software stack overhead

Page 11: Increasing the Maximum Transmission Unit

The Ethernet standard specifies a 1500-byte MTU

– Each packet requires hardware and software processing

– This per-packet cost is considerable at gigabit speeds (see the packet-rate sketch after this slide)

MTU can be increased

– 9K jumbo frames

– Reduce per-byte processing overhead

Not compatible on the WAN
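
A quick sketch of the per-packet arithmetic: at a fixed line rate, packet rate scales inversely with the MTU, so a 9000-byte MTU cuts the number of packets (and hence per-packet processing) by roughly 6x compared with 1500 bytes. Frame and header overheads are ignored to keep the sketch simple.

```c
/* Rough packet-rate arithmetic behind the jumbo-frame argument:
 * packets/s = line_rate_bits / 8 / MTU_bytes. */
#include <stdio.h>

int main(void)
{
    const double rates_gbps[] = { 1.0, 10.0 };
    const double mtus[]       = { 1500.0, 9000.0 };

    for (int r = 0; r < 2; r++)
        for (int m = 0; m < 2; m++) {
            double pkts_per_s = rates_gbps[r] * 1e9 / 8.0 / mtus[m];
            printf("%4.0f Gbps, %4.0f-byte MTU: ~%.0f packets/s\n",
                   rates_gbps[r], mtus[m], pkts_per_s);
        }
    return 0;
}
```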

Page 12: Large Segment Offload Engine on NIC

a.k.a. virtual MTU

Introduced by Intel and Broadcom

Allows the TCP/IP software stack to use 9K or 16K MTUs

– Reducing software overhead

Fragmentation performed by the NIC (a segmentation sketch follows this slide)

Standard 1500-byte MTU on the wire

– Compatible with upstream switches and routers
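
A conceptual sketch of the slicing a large-segment-offload NIC performs: the host stack hands down one large segment, and the adapter emits standard 1500-byte frames on the wire. Header rewriting, checksums, and sequence numbers are omitted; the loop only models how the payload is carved up.

```c
/* Model of "virtual MTU" segmentation: one large host segment is
 * split into standard wire-MTU frames by the NIC. */
#include <stdio.h>
#include <stddef.h>

#define WIRE_MTU 1500

static void segment_for_wire(size_t large_segment_len)
{
    size_t offset = 0;
    while (offset < large_segment_len) {
        size_t chunk = large_segment_len - offset;
        if (chunk > WIRE_MTU)
            chunk = WIRE_MTU;
        printf("frame at offset %5zu, %4zu bytes\n", offset, chunk);
        offset += chunk;
    }
}

int main(void)
{
    segment_for_wire(9000);   /* one "virtual MTU" segment from the host */
    return 0;
}
```

The host therefore pays the software cost of one 9K segment while the wire still carries frames every switch and router understands.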

Page 13: Offload Protocol Processing to NIC

Handling packets at gigabit speeds requires considerable processing

– Even with a large MTU

– Uses CPU time that would otherwise be used by the application

Protocol Offload Engines (POE)

– Perform communication processing on the NIC

– E.g., Myrinet, Quadrics, IB

TCP Offload Engines (TOE) are a specific kind of POE (a sockets-level sketch follows this slide)

– E.g., Chelsio, NetEffect
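
One practical point about TOEs is that they sit behind the standard sockets API: application code like the generic TCP send loop below does not change, while segmentation, acknowledgment, and retransmission work can move to the NIC. This is generic POSIX code, not vendor-specific TOE code.

```c
/* Generic TCP sender using the standard sockets API.  With a TOE in
 * the path, code like this is unchanged; the protocol processing it
 * triggers simply runs on the NIC instead of the host CPU. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int send_buffer(const char *ip, unsigned short port,
                const char *buf, size_t len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n <= 0)
            break;
        sent += (size_t)n;
    }
    close(fd);
    return sent == len ? 0 : -1;
}
```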

Page 14: TOE vs Non-TOE: Latency

Source: P. Balaji, W. Feng, and D. K. Panda, "Bridging the Ethernet-Ethernot Performance Gap," IEEE Micro, Special Issue on High-Performance Interconnects, issue 3, pp. 24-40, May/June 2006.

Page 15: TOE vs Non-TOE: Bandwidth and CPU Utilization

Page 16: TOE vs Non-TOE: Bandwidth and CPU Utilization (9K MTU)

Page 17: Other LAN/WAN Options

iWARP protocol offload
– Runs over IP
– Has functionality similar to TCP
– Adds RDMA

Myricom
– Myri-10G adapter
– Uses the 10G Ethernet physical layer
– POE
– Can handle both TCP/IP and MX

Mellanox
– ConnectX adapter
– Has multiple ports that can be configured for IB or Ethernet
– POE
– Can handle both TCP/IP and IB

Convergence in the software stack: OpenFabrics (a verbs sketch follows this slide)
– Supports IB and Ethernet adapters
– Provides a common API to upper layers
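
A minimal sketch of what the "common API" looks like in practice with OpenFabrics: the libibverbs device list returns whatever RDMA-capable adapters are installed (InfiniBand HCAs, iWARP RNICs, Ethernet adapters) through the same calls, so upper layers are written once. Error handling is kept minimal; link with -libverbs.

```c
/* Enumerate RDMA-capable devices through the OpenFabrics verbs API. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}
```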

Page 18: Summary

Clusters can take advantage of high-performance LAN NICs

– E.g., InfiniBand

Grids need interoperability

– TCP/IP is ubiquitous

Performance gap

Bridging the gap

– IB over the WAN

– POE for Ethernet

Alternatives

– iWARP, Myricom's Myri-10G, Mellanox ConnectX