8/14/2019 Quadrics QsNetIII Adaptively Routed Network for HPC - Presentation
Quadrics Ltd, 28/8/2008
QsNetIII an Adaptively Routed Network for
High Performance Computing
Duncan Roweth, Quadrics Ltd
Hot Interconnects August 2008
Quadrics Background
Develops interconnect products for the HPC market
HPC Linux systems
AlphaServer SC systems
Quadrics is owned by the Finmeccanica group
Quadrics was 12 years old in July
QsNet Networks
Multi-stage switch network
Components
Adapter: Elan
Router: Elite
Switches, cables
Firmware, drivers, libraries
Diagnostics, documentation
HPC specific features
Adaptive routing
Hardware barrier & broadcast
Communication Model
(Diagram labels: Process, Virtual Address)
Quadrics Networks
Elan1 / Elite1, 1994, Meiko Computing Surface 2
Source chooses between pre-defined routes
Elan3 / Elite3, 2000, first Quadrics product, QsNet
First use of packet-by-packet adaptive routing
Crosspoint router, x8
Elan4 / Elite4, 2004, QsNetII
Reduced latency, increased bandwidth
Increased support for offloading collectives
Elan5 / Elite5, 2008, QsNetIII
General purpose crosspoint router, increased radix, x32
Highly programmable adapter
What is Adaptive Routing?
Switch networks typically provide many paths between any two points
In an adaptively routed network, routers make packet-by-packet decisions on the route to use, based on:
Queue occupancy
Channel usage
Error rates and state
Class of traffic
Why is Adaptive Routing Important?
Most HPC networks are statically routed
They use pre-determined paths between nodes
Static routing can work well
If traffic pattern is known in advance
If traffic pattern is persistent
If traffic pattern is uniform (i.e. application is load balanced)
If there are no errors
These conditions are not met by real codes on production HPC systems (see LLNL and Sandia results)
Adaptive routing solves these problems
Delivering significantly better aggregate bandwidths and worst case latencies on real systems running real codes
Benefits of Adaptive Routing
Bandwidth achieved when 1024 nodes all communicate at the same time
Plots show the distribution of measured bandwidths
System    Interconnect   Min   Max   Average
Atlas     Infiniband      95   762   263
Thunder   QsNetII        248   403   369
Data from Lawrence Livermore National Lab, published at the Sonoma OpenFabrics workshop, April 2007
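The spread in the LLNL figures can be summarised as a max/min ratio per system. A minimal sketch using only the numbers in the table; the source gives no units, so none are assumed, and the helper names are illustrative:

```python
# Min, max and average bandwidths from the LLNL table (units as in the source).
results = {
    "Atlas (Infiniband)": {"min": 95, "max": 762, "avg": 263},
    "Thunder (QsNetII)": {"min": 248, "max": 403, "avg": 369},
}

def spread(stats):
    """Ratio of the best to the worst measured bandwidth."""
    return stats["max"] / stats["min"]

def min_over_avg(stats):
    """How far the slowest measurement falls below the average."""
    return stats["min"] / stats["avg"]

for name, stats in results.items():
    print(f"{name}: max/min = {spread(stats):.2f}, "
          f"min/avg = {min_over_avg(stats):.2f}")
```

On these figures the best Atlas measurement is roughly 8 times the worst, while on Thunder the ratio is about 1.6.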
Benefits of Adaptive Routing
Classic QsNetII all-to-all bandwidth scaling graph
Adaptive Routing in QsNetIII
More flexible than QsNetII
Operates over arbitrary sets of links
More opportunities to use the technique
Higher radix switches
Select a subset of lightly loaded output ports based on:
Destination
Link state, errors etc
Number of pending acks (programmable threshold)
Programmable algorithm for selecting from this subset: First free, next free, random
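The two-stage selection described on this slide (filter to a lightly loaded subset, then pick one port with a programmable policy) might be sketched as follows. This is an illustration, not the Elite5 logic: the `OutputPort` fields, the function names, and the exact meaning given to "first free" and "next free" are assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class OutputPort:
    index: int
    link_up: bool       # link state (hypothetical field)
    pending_acks: int   # packets sent on this port and not yet acknowledged

def candidate_ports(ports, valid_for_destination, ack_threshold):
    """Ports that reach the destination, are up, and are below the ack threshold."""
    return [p for p in ports
            if p.index in valid_for_destination
            and p.link_up
            and p.pending_acks < ack_threshold]

def select_port(candidates, policy, last_index=-1, rng=random):
    """Pick one port from the candidate subset using the named policy."""
    if not candidates:
        return None  # nothing lightly loaded; the caller must wait or fall back
    if policy == "first_free":
        return candidates[0]
    if policy == "next_free":
        # Assumed meaning: first candidate after the last port used, wrapping round.
        later = [p for p in candidates if p.index > last_index]
        return later[0] if later else candidates[0]
    if policy == "random":
        return rng.choice(candidates)
    raise ValueError(f"unknown policy: {policy}")

ports = [OutputPort(0, True, 5), OutputPort(1, True, 1),
         OutputPort(2, False, 0), OutputPort(3, True, 0)]
subset = candidate_ports(ports, valid_for_destination={0, 1, 2, 3}, ack_threshold=4)
print([p.index for p in subset])  # port 0 is too busy, port 2 is down
```

Raising the threshold admits busier ports into the subset; lowering it makes the router more selective at the cost of sometimes finding no candidate.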
Adaptive Routing: standard case
All top switches are equivalent, select one
Adaptive routing selects a lightly loaded path
Implementation of Fat Tree Networks
Connect M N-way node switches by N M-way top switches
In this case M = 16, N = 4
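A small sketch of this construction, reading the slide as M node switches of N ways each, joined by N top switches of M ways each. The port numbering (uplink j of node switch i goes to downlink i of top switch j) is an assumed, conventional wiring; the slide does not specify it.

```python
def fat_tree_links(m, n):
    """Wire m n-way node switches to n m-way top switches.

    Returns ((node_switch, uplink), (top_switch, downlink)) pairs.
    """
    return [((i, j), (j, i)) for i in range(m) for j in range(n)]

M, N = 16, 4
links = fat_tree_links(M, N)
node_ports = M * N  # n node-facing links on each of m node switches

print(len(links), "inter-switch links,", node_ports, "node ports")
```

The same function with m = 8 and n = 128 gives the 1024-way network described later in the talk.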
Adaptive Routing in the Top Switch
If top switch radix ≤ router radix / 2
i.e. 16 for Elite5, 2048-way networks
Router provides multiple top switches
Select which to use based on load
Example: Traffic from A to B via routers 210 and 300 is blocked by traffic between 300 and 200.
The router providing 300, 301, 302 and 303 can select a different path
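How many top switches one router can provide follows from simple division. The sketch below reads the condition on this slide as "top switch radix at most half the router radix" and takes the Elite5 radix as 32; the function name is illustrative.

```python
ELITE5_RADIX = 32  # Elite5 router radix, as stated earlier in the talk

def top_switches_per_router(top_switch_radix, router_radix=ELITE5_RADIX):
    """Number of independent top switches a single router can implement."""
    if top_switch_radix > router_radix // 2:
        return 1  # the router is used as one top switch
    return router_radix // top_switch_radix

for radix in (4, 8, 16, 32):
    print(radix, top_switches_per_router(radix))
```

A router providing four top switches, as in the 300 to 303 example, would correspond to a top switch radix of 8 if the router is an Elite5, though the slide does not say so.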
Adaptive Routing on the Final Hop
Multiple connections to a node
Switch can select a free path
Reduces end-point contention
Simple case is not optimal
Spreading the connections
Improves fault tolerance
Reduces network contention
Routing decision is made higher in the network
Adaptive routing in the presence of errors
In a production system with 1000s of links it is not uncommon for a small number to be broken until the next maintenance slot
Adaptive routing minimises the impact
Example:
Link between routers 10 and 20 is broken
Router 10 dynamically selects paths via 21, 22, 23, spreading the load.
Reverse case: avoid sending to 10 via 20. Reset 20's links or update switches 11, 12, 13.
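The broken-link example can be sketched in a few lines. The router numbers come from the slide; the layout (every lower router 10 to 13 connected to every upper router 20 to 23) and the function names are assumptions made for illustration.

```python
LOWER = [10, 11, 12, 13]
UPPER = [20, 21, 22, 23]

def usable_uplinks(router, broken):
    """Upper-level routers this lower-level router can still reach directly."""
    return [u for u in UPPER if frozenset((router, u)) not in broken]

def uplinks_towards(src, dst, broken):
    """Upper routers src may use for traffic to dst: both hops must be intact."""
    return [u for u in usable_uplinks(src, broken)
            if frozenset((u, dst)) not in broken]

broken = {frozenset((10, 20))}

print(usable_uplinks(10, broken))       # forward case: 10 avoids 20
print(uplinks_towards(11, 10, broken))  # reverse case: 11 must not reach 10 via 20
print(uplinks_towards(11, 12, broken))  # unrelated traffic keeps all four paths
```

The reverse case is the harder one: router 11 cannot see the failure locally, which is why the slide mentions resetting 20's links or updating switches 11, 12 and 13.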
Small Packet Support
Aim to get as close to line rate as possible with small packets
For example:
Small put
32 byte packet
Adapter has multiple packet engines
Adapters support up to 64 outstanding packets per link
Doubles if we use both links
Switches provide 32 virtual channels per output link
Prioritisation buffering on input to the router
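A back-of-envelope check of what 64 outstanding 32 byte packets buys. This is a rough sketch only: it treats each packet as 32 bytes on the wire, ignores any further overhead, and uses the 20 Gbit/s per direction link rate quoted in the Elan5 device overview.

```python
LINK_RATE_BITS = 20e9      # bits per second per direction, per link
PACKET_BYTES = 32          # small put packet
OUTSTANDING_PER_LINK = 64  # packets that may be awaiting acknowledgement

def bytes_in_flight(links=1):
    """Data that can be unacknowledged at once across the given number of links."""
    return OUTSTANDING_PER_LINK * PACKET_BYTES * links

def max_round_trip_at_line_rate(links=1):
    """Longest ack round trip (seconds) the window covers without stalling."""
    rate_bytes = LINK_RATE_BITS / 8 * links
    return bytes_in_flight(links) / rate_bytes

print(bytes_in_flight(1), bytes_in_flight(2))
print(f"{max_round_trip_at_line_rate() * 1e6:.2f} microseconds")
```

Under these assumptions the window covers acknowledgement round trips of a little under a microsecond per link; using both links doubles the data in flight but leaves the per-link figure unchanged.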
Barrier & Broadcast Support
Switches broadcast over a range of output links
Combine Acks / Nacks
Contiguous in QsNetII
Sparse in QsNetIII
Barrier implementation
Network conditional
Broadcast release
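One way to picture the barrier is as a broadcast test whose responses are combined in the switches, followed by a broadcast release once every node has answered positively. The software model below is a sketch of that idea only; the tree representation and function names are assumptions, not the hardware design.

```python
def combine(responses):
    """A switch forwards an ack only if every downstream response is an ack."""
    return all(responses)

def network_conditional(tree, arrived):
    """Evaluate 'has every node arrived?' by combining acks up a switch tree.

    tree is either a node id (a leaf) or a list of subtrees (a switch).
    """
    if isinstance(tree, list):
        return combine([network_conditional(sub, arrived) for sub in tree])
    return arrived[tree]  # leaf: ack if this node has reached the barrier

def barrier_round(tree, arrived, released):
    """One poll by the root: test the condition, then broadcast the release."""
    if network_conditional(tree, arrived):
        for node in arrived:
            released[node] = True  # stands in for the hardware broadcast
        return True
    return False

tree = [[0, 1], [2, 3]]  # two leaf switches under one top switch
arrived = {0: True, 1: True, 2: False, 3: True}
released = dict.fromkeys(arrived, False)

print(barrier_round(tree, arrived, released))  # node 2 has not arrived yet
arrived[2] = True
print(barrier_round(tree, arrived, released))
```

Because the combining happens inside the switches, the root receives a single response however many nodes take part.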
Elan5 Device Overview
(Block diagram of the Elan5 adapter: PCIe host interface with 16 lanes, SERDES, TLB and command launch; fabric; bridge, x8; seven packet engines, each with 16K instruction cache and 9K data buffers; local functions; buffer manager; object cache tags; free list; local memory, 16K x 8 x 8 banks = 1MB ECC RAM; external interface and SDRAM interface to external cache and external DDRII; two CX4/QsNetIII links; PLL, EEPROM, clocks)
2 QsNetIII links, 20 Gbit/s per direction after protocol
PCIe, PCIe2 host interface
Multiple packet engines
512KB of high bandwidth on-chip local memory
SDRAM interface to optional local memory
Buffer manager, object cache
Details in ISC Dresden paper
Elite5 Device Overview
64 x 32 crosspoint router
Direct & buffered input from each link
8K of input buffering per link
32 virtual channels per link
Physical layer DDR XAUI (6.25GHz)
Adaptive routing
Hardware barrier and broadcast
Memory mapped stats & error counters accessed out-of-band
QsNetIII Device Overview
                                Elan       Elite
Semi custom ASIC
Manufacturing partners LSI / TSMC, G90 process
Clock                           500 MHz    312 MHz
High performance BGA package    672 pin    982 pin
Power                           < 17W      < 18W
QsNetIII Implementation
Node switch chassis
128 links down to the nodes
128 links up to the top switches
Backplane connects 2 sets of cards
Top switches
256 links down to the node switches
Range of system sizes:
Ports   Radix   Per Chassis
512     4       64
1024    8       32
2048    16      16
4096    32      8
(Figures: QsNetIII switch logical design; QsNetIII switch implementation)
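The system-size table on this slide can be reproduced from the two chassis figures. The sketch assumes that "Radix" means the top switch radix and "Per Chassis" means the number of top switches in one top switch chassis; that reading is an interpretation, although it matches every row.

```python
NODE_SWITCH_DOWN_LINKS = 128  # links down to the nodes per node switch chassis
TOP_CHASSIS_DOWN_LINKS = 256  # links down to the node switches per top switch chassis

def system_size(top_switch_radix):
    """Return (network ports, top switches per chassis) for a top switch radix."""
    ports = NODE_SWITCH_DOWN_LINKS * top_switch_radix
    per_chassis = TOP_CHASSIS_DOWN_LINKS // top_switch_radix
    return ports, per_chassis

for radix in (4, 8, 16, 32):
    ports, per_chassis = system_size(radix)
    print(f"{ports:5d} {radix:3d} {per_chassis:3d}")
```

Each node switch chassis connects one uplink group to every top switch, so the top switch radix equals the number of node switch chassis and the port count is 128 times that.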
QsNetIII Network: 1024-way
Fat tree, constructed from 8 128-way node switches connected by 128 8-way top switches
QsNetIII Implementation: Cables
QSFP connectors throughout
Copper cables (e.g. Gore) 1-10m
Active copper cables (e.g. Gore), 8-20m
Optical cables (e.g. Luxtera), 5-300m
PVDF Plenum rated LSZH available as an option
No longer Quadrics proprietary
Likely usage:
Short copper cables from nodes
Optical cables between switches
QsNetIII Fault Tolerance
All of the QsNetII Features
CRCs on every packet
Automatic retransmission
Redundant routes
Adaptive routing avoids failed links
Redundant, hot-pluggable PSUs and fans
+ Line rate testing of each link as it comes up
Switches generate CRPAT, CJPAT or PRBS packets
Links are only added to the route tables when they are (a) up, (b) connect to the right place, and (c) can transfer data at full line rate without error.
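The three admission conditions amount to a simple predicate. In the sketch below every field name is hypothetical; the slide states the conditions but not how the switch firmware represents them.

```python
from dataclasses import dataclass

@dataclass
class LinkStatus:
    up: bool                   # (a) link has trained and is up
    expected_peer: str         # where the cabling plan says the link should go
    observed_peer: str         # what the far end actually identifies as
    test_packet_errors: int    # errors seen during the pattern test
    reached_line_rate: bool    # test traffic ran at full line rate

def admit_to_route_table(link: LinkStatus) -> bool:
    """A link is routable only if all three conditions hold."""
    connected_correctly = link.observed_peer == link.expected_peer   # (b)
    clean_at_speed = link.reached_line_rate and link.test_packet_errors == 0  # (c)
    return link.up and connected_correctly and clean_at_speed

good = LinkStatus(True, "top-3", "top-3", 0, True)
miswired = LinkStatus(True, "top-3", "top-4", 0, True)
marginal = LinkStatus(True, "top-3", "top-3", 2, True)
print(admit_to_route_table(good), admit_to_route_table(miswired),
      admit_to_route_table(marginal))
```

Keeping marginal links out of the route tables means adaptive routing only ever chooses among links already shown to run cleanly at speed.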
Current Status
Elite5 silicon in Bristol
Elan5 at TSMC, first parts expected in 3-4 weeks
Switch PCBs, chassis, backplane, controllers are working
First adapter PCBs are ready
PCI-Express x16, HP Blade, ExpressModule (Sun Blade)
We are porting the QsNetII software
Components at SC08 in Austin
First customer shipment in Q1 of 2009
Future Work
QsNetIII hardware
Low cost 32-way switch
1024-way single chassis switch
QsNetIII software
General framework for optimised collectives
Support for multiport networks - fat nodes have multiple connections to the same rail
Ethernet firmware for the network adapter
Conclusions
Adaptive routing underwrites the scalability of HPC systems designed to run a single large application
Adaptive routing has been a feature of QsNet systems since 2000
QsNetIII offers significant enhancements over both QsNetII and competing products
Thank you for listening
Packet Format
Packet size of up to 4K, made up of a 256 byte packet segment and continuations; 8 byte ACK
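The segment arithmetic can be illustrated by splitting a payload into 256 byte pieces. This sketch assumes 4K means 4096 bytes and treats the whole packet as payload; header layout, CRCs and the ACK are not modelled, and the names are illustrative.

```python
MAX_PACKET_BYTES = 4096  # "up to 4K", assumed to mean 4096 bytes
SEGMENT_BYTES = 256      # one packet segment or continuation

def split_packet(payload: bytes):
    """Split a payload into an initial segment followed by continuations."""
    if len(payload) > MAX_PACKET_BYTES:
        raise ValueError("payload exceeds the maximum packet size")
    pieces = [payload[i:i + SEGMENT_BYTES]
              for i in range(0, len(payload), SEGMENT_BYTES)]
    return pieces or [b""]  # an empty payload still occupies one segment

full = split_packet(bytes(4096))
short = split_packet(bytes(300))
print(len(full), len(short), len(short[-1]))
```

A maximum-size packet is therefore one segment plus 15 continuations under these assumptions.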
Impact of static routing on latency
Data from Thunderbird cluster, Sandia National Lab
Big increases in worst case latency with number of nodes
Impact of static routing on latency
Data from Thunderbird cluster, Sandia National Lab
Big variation in worst case latency across a large job
Software Model: Firmware & Drivers
Base firmware in the ROMs
Firmware modules loadable with the device driver
Elan, OpenFabrics, 10GE Ethernet,
Kernel modules
elan5, elan, rms
Device dependent library (libelan5)
Device independent library (libelan)
User libraries
Software Model: Elan Libraries
Point-to-point message passing
One-sided put/get
Transparent rail striping
Optimised collectives
Locks and atomic ops
Global memory allocation
QsNetIII Performance Summary
Similar latencies to QsNetII
The 1.3 to 2 microsecs of latency is mostly in the host PCI and memory system
Higher issue rates
Improved link utilisation on small transfers
Higher bandwidths
1.5 to 2.25 GB/sec/link depending on host interface
Bi-directional host interface
2 x improvement over QsNetII
Broadcast and barrier in hardware
Continued development of adaptive routing underwrites scaling to high node counts