Quadrics Ltd 1 28/8/2008 QsNet III an Adaptively Routed Network for High Performance Computing Duncan Roweth, Quadrics Ltd Hot Interconnects August 2008
QsNetIII Adaptively Routed Network For HPC

Apr 16, 2017

Technology

Federica Pisani
Transcript
Page 1: QsNetIII Adaptively Routed Network For HPC

Quadrics Ltd 1 28/8/2008

QsNetIII an Adaptively Routed Network for High Performance Computing

Duncan Roweth, Quadrics Ltd

Hot Interconnects August 2008

Page 2: QsNetIII Adaptively Routed Network For HPC

Quadrics Background

• Develops interconnect products for the HPC market

– HPC Linux systems

– AlphaServer SC systems

• Quadrics is owned by the Finmeccanica group

• Quadrics was 12 years old in July

Page 3: QsNetIII Adaptively Routed Network For HPC

QsNet Networks

• Multi-stage switch network

• Components

– Adapter: Elan

– Router: Elite

– Switches, cables

– Firmware, drivers, libraries

– Diagnostics, documentation

• HPC specific features

– Adaptive routing

– Hardware barrier & broadcast

Page 4: QsNetIII Adaptively Routed Network For HPC

Communication Model

[Figure: communication model. Labels: Process, Virtual Address]

Page 5: QsNetIII Adaptively Routed Network For HPC

Quadrics Networks

• Elan1 / Elite1, 1994, Meiko Computing Surface 2

– Source chooses between pre-defined routes

• Elan3 / Elite3, 2000, first Quadrics product, QsNet

– First use of packet-by-packet adaptive routing

– Crosspoint router, x8

• Elan4 / Elite4, 2004, QsNetII

– Reduced latency, increased bandwidth

– Increased support for offloading collectives

• Elan5 / Elite5, 2008, QsNetIII

– General purpose crosspoint router, increased radix, x32

– Highly programmable adapter

Page 6: QsNetIII Adaptively Routed Network For HPC

What is Adaptive Routing?

• Switch networks typically provide many paths between any two points

• In an adaptively routed network routers make packet-by-packet decisions on the route to use based on:

– Queue occupancy

– Channel usage

– Error rates and state

– Class of traffic

Page 7: QsNetIII Adaptively Routed Network For HPC

Why is Adaptive Routing Important?

• Most HPC networks are statically routed

– They use pre-determined paths between nodes

• Static routing can work well

– If traffic pattern is known in advance

– If traffic pattern is persistent

– If traffic pattern is uniform (i.e. application is load balanced)

– If there are no errors

• These conditions are not met by real codes on production HPC systems (see LLNL and Sandia results)

• Adaptive routing solves these problems

– Delivering significantly better aggregate bandwidths and worst case latencies on real systems running real codes

Page 8: QsNetIII Adaptively Routed Network For HPC

Benefits of Adaptive Routing

• Bandwidth achieved when 1024 nodes all communicate at the same time

• Plots show the distribution of measured bandwidths

System   Interconnect  Min  Max  Average
Atlas    InfiniBand     95  762      263
Thunder  QsNetII       248  403      369

Data from Lawrence Livermore National Lab, published at the Sonoma OpenFabrics workshop, April 2007
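The spread of the figures in the table above can be put into numbers. The following is a minimal sketch that only reuses the Min, Max and Average values from the slide (the slide does not state units, so none are assumed); the function name `spread` is illustrative.

```python
# Figures copied from the table on this slide; units are not stated there.
results = {
    "Atlas (InfiniBand)": {"min": 95, "max": 762, "avg": 263},
    "Thunder (QsNetII)": {"min": 248, "max": 403, "avg": 369},
}

def spread(r):
    """Return (max/min ratio, min as a fraction of the average)."""
    return r["max"] / r["min"], r["min"] / r["avg"]

for name, r in results.items():
    ratio, worst = spread(r)
    print(f"{name}: max/min = {ratio:.2f}, min/avg = {worst:.2f}")
```

A smaller max/min ratio means the slowest node pair is closer to the fastest, which is what matters for codes that wait on the slowest transfer.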

Page 9: QsNetIII Adaptively Routed Network For HPC

Benefits of Adaptive Routing

• Classic QsNetII all-to-all bandwidth scaling graph

Page 10: QsNetIII Adaptively Routed Network For HPC

Ordering Considerations

• Adaptively routed packets can arrive out of order

– Problems for stream devices, e.g. multipath Ethernet

• Message ordering is required in HPC

– But within a message we are free to deliver the bulk data in arbitrary order

"Get it there as fast as possible then tell me that it is done"

• QsNet ordering

– Packets contain the destination virtual address at which to write the data

– Bulk data transfers can arrive out of order and can be replayed

– Atomic transactions are sequenced
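Why carrying the destination address in each packet makes ordering unnecessary for bulk data can be illustrated with a short sketch. This is a toy model, not the Elan protocol: the 256-byte payload, the `make_packets` and `deliver` helpers, and the flat `bytearray` standing in for a process address space are all assumptions made for illustration.

```python
import random

def make_packets(dest_addr, data, payload=256):
    """Split data into packets, each tagged with its own destination address."""
    return [(dest_addr + off, data[off:off + payload])
            for off in range(0, len(data), payload)]

def deliver(memory, packets):
    """Write each packet at the address it carries; arrival order is irrelevant."""
    for addr, chunk in packets:
        memory[addr:addr + len(chunk)] = chunk

message = bytes(range(256)) * 8          # 2 KB of bulk data
packets = make_packets(dest_addr=1024, data=message)
random.shuffle(packets)                  # adaptive routing may reorder packets
packets.append(packets[0])               # a replayed packet rewrites the same bytes

memory = bytearray(4096)
deliver(memory, packets)
print(bytes(memory[1024:1024 + len(message)]) == message)
```

Because every write is idempotent, a replayed packet does no harm; only the final "done" notification needs to be ordered after the data.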

Page 11: QsNetIII Adaptively Routed Network For HPC

Adaptive Routing in QsNetIII

• More flexible than QsNetII

– Operates over arbitrary sets of links

– More opportunities to use the technique

– Higher radix switches

• Select a subset of lightly loaded output ports based on:

– Destination

– Link state, errors etc

– Number of pending acks (programmable threshold)

• Programmable algorithm for selecting from this subset:

– First free, next free, random
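The two-stage selection described above (filter to a lightly loaded subset, then apply a policy) can be sketched as follows. This is a software illustration under stated assumptions, not the Elite5 logic: the `Port` fields, the strict `<` comparison against the ack threshold, and the wrap-around reading of "next free" are all hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Port:
    index: int
    link_up: bool = True
    pending_acks: int = 0

def candidates(ports, allowed, ack_threshold):
    """Ports that reach the destination, are up, and are lightly loaded."""
    return [p for p in ports
            if p.index in allowed and p.link_up and p.pending_acks < ack_threshold]

def select(cands, policy, last=-1, rng=random):
    """Pick one port from the candidate subset using the named policy."""
    if not cands:
        return None                      # nothing lightly loaded; caller must wait
    if policy == "first_free":
        return cands[0]
    if policy == "next_free":
        # first candidate after the previously used port, wrapping around
        later = [p for p in cands if p.index > last]
        return (later or cands)[0]
    if policy == "random":
        return rng.choice(cands)
    raise ValueError(policy)

ports = [Port(0, pending_acks=5), Port(1), Port(2, link_up=False), Port(3, pending_acks=1)]
subset = candidates(ports, allowed={0, 1, 2, 3}, ack_threshold=4)
print([p.index for p in subset], select(subset, "next_free", last=1).index)
```

Here port 0 is excluded for exceeding the threshold and port 2 because its link is down, so the policy chooses between ports 1 and 3.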

Page 12: QsNetIII Adaptively Routed Network For HPC

Adaptive Routing: standard case

– All top switches are equivalent, select one

– Adaptive routing selects a lightly loaded path

Page 13: QsNetIII Adaptively Routed Network For HPC

Implementation of Fat Tree Networks

• Connect M×N-way node switches by N×M-way top switches

• In this case M = 16, N = 4
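The wiring rule can be checked with a few lines of code. The sketch below assumes that "N-way" for a node switch means N links down to nodes matched by N links up, and that each node switch has exactly one link to each top switch; the function name `fat_tree_links` is illustrative.

```python
def fat_tree_links(m, n):
    """Links of a two-level fat tree: m node switches (n-way), n top switches (m-way).

    Node switch i uses uplink j to reach top switch j, so every node switch
    is connected to every top switch exactly once.
    """
    return [(("node_sw", i), ("top_sw", j)) for i in range(m) for j in range(n)]

M, N = 16, 4
links = fat_tree_links(M, N)
nodes = M * N                      # each N-way node switch serves N nodes
links_per_top = sum(1 for _, t in links if t == ("top_sw", 0))
links_per_node_sw = sum(1 for s, _ in links if s == ("node_sw", 0))
print(nodes, len(links), links_per_top, links_per_node_sw)   # 64 64 16 4
```

With M = 16 and N = 4 this gives a 64-way network in which each top switch uses all 16 of its links and each node switch uses all 4 of its uplinks.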

Page 14: QsNetIII Adaptively Routed Network For HPC

Adaptive Routing in the Top Switch

• If top switch radix ≤ router radix / 2

– i.e. 16 for Elite5, 2048-way networks

• Router provides multiple top switches

– Select which to use based on load

• Example:

– Traffic from A to B via routers 210 and 300 is blocked by traffic between 300 and 200.

– The router providing 300, 301, 302 and 303 can select a different path

Page 15: QsNetIII Adaptively Routed Network For HPC

Adaptive Routing on the Final Hop

• Multiple connections to a node

• Switch can select a free path

• Reduces end-point contention

• Simple case is not optimal

• Spreading the connections

– Improves fault tolerance

– Reduces network contention

• Routing decision is made higher in the network

Page 16: QsNetIII Adaptively Routed Network For HPC

Adaptive routing in the presence of errors

• In a production system with 1000s of links it is not uncommon for a small number to be broken until the next maintenance slot

• Adaptive routing minimises the impact

• Example:

– Link between routers 10 and 20 is broken

– Router 10 dynamically selects paths via 21, 22, 23, spreading the load

– Reverse case: avoid sending to 10 via 20. Reset 20’s links or update switches 11, 12, 13.

Page 17: QsNetIII Adaptively Routed Network For HPC

Small Packet Support

• Aim to get as close to line rate as possible with small packets

• For example:

– Small put

– 32 byte packet

• Adapter has multiple packet engines

• Adapters support up to 64 outstanding packets per link

– Doubles if we use both links

• Switches provide 32 virtual channels per output link

• Prioritisation – buffering on input to the router
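The limit on outstanding packets matters because, for small packets, throughput is bounded by how much data can be in flight per round trip. The sketch below is a back-of-envelope bound only: the 64-packet window and 32-byte packet size come from the slide, while the round trip time `RTT` is a purely hypothetical value and not a QsNetIII measurement.

```python
def max_rate(outstanding, packet_bytes, round_trip_s):
    """Upper bound on bytes/second when a fixed window of packets awaits acks."""
    return outstanding * packet_bytes / round_trip_s

RTT = 2e-6  # hypothetical round trip time in seconds, not a QsNetIII figure
one_link = max_rate(64, 32, RTT)     # 64 outstanding 32-byte packets on one link
two_links = max_rate(128, 32, RTT)   # window doubles when both links are used
print(f"{one_link / 1e9:.3f} GB/s, {two_links / 1e9:.3f} GB/s")
```

Under this assumption the window, rather than the link rate, would cap small-packet throughput, which is why a deeper window and multiple packet engines help.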

Page 18: QsNetIII Adaptively Routed Network For HPC

Barrier & Broadcast Support

• Switches broadcast over a range of output links

• Combine Acks / Nacks

• Contiguous in QsNetII

• Sparse in QsNetIII

• Barrier implementation

– Network conditional

– Broadcast release
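The two barrier steps listed above can be modelled in software. This is a sketch of the idea only: in QsNet the conditional test and ack combining happen in the switches, whereas here plain dictionaries stand in for per-node state, and the names `network_conditional` and `barrier_poll` are hypothetical.

```python
def network_conditional(flags, group):
    """Combine per-node replies: ack only if every node in the group has arrived."""
    return all(flags[node] for node in group)

def barrier_poll(flags, released, group):
    """One attempt at the barrier; returns True once the group is released."""
    if network_conditional(flags, group):
        for node in group:           # broadcast release to the same set of nodes
            released[node] = True
        return True
    return False                     # combined nack: caller retries later

group = [0, 2, 5, 7]                 # a sparse set of nodes
flags = {n: False for n in range(8)}
released = {n: False for n in range(8)}

flags[0] = flags[2] = flags[5] = True
first = barrier_poll(flags, released, group)    # node 7 has not arrived
flags[7] = True
second = barrier_poll(flags, released, group)
print(first, second)
```

A single nack from any node fails the combined test, so no node is released until all have arrived.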

Page 19: QsNetIII Adaptively Routed Network For HPC

Elan5 – Device Overview

[Figure: Elan5 Adapter block diagram. Labelled blocks: Fabric; Bridge; x8; PLL; EEPROM; Clocks; PCIe host I/F (16 lanes, PCIe SERDES); TLB; Cmd Launch; Local Functions; Buffer Manager; Object Cache Tags; Free List; Local Memory (16K x 8 x 8 banks = 1MB ECC RAM); Ext i/f and SDRAM i/f to external cache (external DDRII); 2 CX4/QsNetIII links; 7 packet engines, each with 16K inst cache and 9K data buffers]

• 2 × QsNetIII links

– 20Gbit/s/direction after protocol

• PCIe, PCIe2 host interface

• Multiple packet engines

• 512KB of high bandwidth on-chip local memory

• SDRAM interface to optional local memory

• Buffer manager, object cache

• Details in ISC Dresden paper

Page 20: QsNetIII Adaptively Routed Network For HPC

Elite5 – Device Overview

• 64 × 32 crosspoint router

– Direct & buffered input from each link

– 8K of input buffering per link

• 32 virtual channels per link

• Physical layer DDR XAUI (6.25GHz)

• Adaptive routing

• Hardware barrier and broadcast

• Memory mapped stats & error counters accessed out-of-band

Page 21: QsNetIII Adaptively Routed Network For HPC

QsNetIII Device Overview

                        Elan      Elite
Technology              Semi custom ASIC
Manufacturing partners  LSI / TSMC G90 process
Clock                   500 MHz   312 MHz
Package                 High performance BGA
Pins                    672 pin   982 pin
Power                   < 17W     < 18W

Page 22: QsNetIII Adaptively Routed Network For HPC

QsNetIII Implementation

• Node switch chassis

– 128 links down to the nodes

– 128 links up to the top switches

– Backplane connects 2 sets of cards

• Top switches

– 256 links down to the node switches

– Range of system sizes:

Ports  Radix  Per Chassis
  512      4           64
 1024      8           32
 2048     16           16
 4096     32            8
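The rows of the system size table follow from the two chassis figures on this slide. The sketch below assumes that a top switch of radix r connects r node switch chassis of 128 ports each, and that a top switch chassis divides its 256 down links evenly between top switches; the constant and function names are illustrative.

```python
NODE_SWITCH_PORTS = 128    # links down to the nodes per node switch chassis
TOP_CHASSIS_LINKS = 256    # links down to the node switches per top switch chassis

def system_size(radix):
    """Return (network ports, top switches per chassis) for a top switch radix."""
    return NODE_SWITCH_PORTS * radix, TOP_CHASSIS_LINKS // radix

for radix in (4, 8, 16, 32):
    ports, per_chassis = system_size(radix)
    print(f"{ports:5d} {radix:3d} {per_chassis:3d}")
```

Each printed row matches the Ports, Radix and Per Chassis columns of the table.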

[Figures: QsNetIII switch logical design; QsNetIII switch implementation]

Page 23: QsNetIII Adaptively Routed Network For HPC

QsNetIII Network 1024–way

• Fat tree, constructed from 8 × 128-way node switches connected by 128 × 8-way top switches

Page 24: QsNetIII Adaptively Routed Network For HPC

QsNetIII Implementation – Cables

• QSFP connectors throughout

• Copper cables (e.g. Gore) 1-10m

• Active copper cables (e.g. Gore), 8-20m

• Optical cables (e.g. Luxtera), 5-300m

– PVDF Plenum rated

– LSZH available as an option

• No longer Quadrics proprietary

• Likely usage:

– Short copper cables from nodes

– Optical cables between switches

Page 25: QsNetIII Adaptively Routed Network For HPC

QsNetIII Fault Tolerance

• All of the QsNetII Features

– CRCs on every packet

– Automatic retransmission

– Redundant routes

– Adaptive routing avoids failed links

– Redundant, hot-pluggable PSUs and fans

+ Line rate testing of each link as it comes up

– Switches generate CRPAT, CJPAT or PRBS packets

– Links are only added to the route tables when they (a) are up, (b) connect to the right place, and (c) can transfer data at full line rate without error.
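The three admission conditions can be expressed as a single predicate. The following is a minimal sketch under stated assumptions: the `LinkTest` record, its field names and the port and switch labels are hypothetical, and a count of zero errors during the pattern test stands in for "full line rate without error".

```python
from dataclasses import dataclass

@dataclass
class LinkTest:
    up: bool
    peer: str            # identity reported by the far end
    expected_peer: str   # where the cabling plan says this link should go
    errors: int          # errors seen during the line rate pattern test

def admit(link):
    """Apply the three conditions: up, correctly connected, error free."""
    return link.up and link.peer == link.expected_peer and link.errors == 0

route_table = set()
tests = {
    "port0": LinkTest(True, "top3", "top3", 0),
    "port1": LinkTest(True, "top5", "top4", 0),   # miscabled
    "port2": LinkTest(True, "top6", "top6", 12),  # errors at line rate
    "port3": LinkTest(False, "", "top7", 0),      # link down
}
for name, t in tests.items():
    if admit(t):
        route_table.add(name)
print(sorted(route_table))
```

Only the link that passes all three checks is added, so a marginal or miscabled link never carries application traffic.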

Page 26: QsNetIII Adaptively Routed Network For HPC

QsNetIII Implementation – HP BladeSystem

Elite5 switch module

Full bandwidth

16 links to the blades (via backplane)

16 links to back of the module

Elan5 mezzanine adapter

2 QsNet links, PCI-E x8 Gen2

128 MB of memory

Page 27: QsNetIII Adaptively Routed Network For HPC

Current Status

• Elite5 silicon in Bristol

• Elan5 at TSMC, first parts expected in 3-4 weeks

• Switch PCBs, chassis, backplane, controllers are working

• First adapter PCBs are ready

– PCI-Express x16, HP Blade, ExpressModule (Sun Blade)

• We are porting the QsNetII software

• Components at SC08 in Austin

• First customer shipment in Q1 of 2009

Page 28: QsNetIII Adaptively Routed Network For HPC

Future Work

• QsNetIII hardware

– Low cost 32-way switch

– 1024-way single chassis switch

• QsNetIII Software

– General framework for optimised collectives

– Support for “multiport” networks: “fat” nodes have multiple connections to the same rail

– Ethernet firmware for the network adapter

Page 29: QsNetIII Adaptively Routed Network For HPC

Conclusions

• Adaptive routing underwrites the scalability of HPC systems designed to run a single large application

• Adaptive routing has been a feature of QsNet systems since 2000

• QsNetIII offers significant enhancements over both QsNetII and competing products

Page 30: QsNetIII Adaptively Routed Network For HPC

Thank you for listening

Page 31: QsNetIII Adaptively Routed Network For HPC

Additional Material

Page 32: QsNetIII Adaptively Routed Network For HPC

Packet Format

• Packets of up to 4K, made up of a 256 byte packet segment and continuations; 8 byte ACK
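The segment arithmetic implied by the packet format can be worked through in code. The sketch below assumes that "4K" means 4096 bytes, that continuations are also 256 bytes, and that header overhead can be ignored; none of this is confirmed by the slide, and the function name `segments` is illustrative.

```python
import math

SEGMENT = 256      # bytes per packet segment or continuation (assumed equal)
MAX_PACKET = 4096  # "up to 4K", taken here to mean 4096 bytes

def segments(payload):
    """Number of 256-byte units (first segment plus continuations) for a payload."""
    if not 0 < payload <= MAX_PACKET:
        raise ValueError("payload must be 1..4096 bytes")
    return math.ceil(payload / SEGMENT)

for size in (32, 256, 257, 4096):
    print(size, segments(size))
```

Under these assumptions a full-size packet is one segment followed by 15 continuations, while a small put fits in a single segment.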

Page 33: QsNetIII Adaptively Routed Network For HPC

Impact of static routing on latency

Data from Thunderbird cluster, Sandia National Lab

Big increases in worst case latency with number of nodes

Page 34: QsNetIII Adaptively Routed Network For HPC

Impact of static routing on latency

Data from Thunderbird cluster, Sandia National Lab

Big variation in worst case latency across a large job

Page 35: QsNetIII Adaptively Routed Network For HPC

Software Model – Firmware & Drivers

• Base firmware in the ROMs

• Firmware modules loadable with the device driver

– Elan, OpenFabrics, 10GE Ethernet, …

• Kernel modules

– elan5, elan, rms

• Device dependent library (libelan5)

• Device independent library (libelan)

• User libraries

Page 36: QsNetIII Adaptively Routed Network For HPC

Software Model – Elan Libraries

• Point-to-point message passing

• One-sided put/get

• Transparent rail striping

• Optimised collectives

• Locks and atomic ops

• Global memory allocation

Page 37: QsNetIII Adaptively Routed Network For HPC

QsNetIII Performance Summary

• Similar latencies to QsNetII

– The 1.3 to 2 microseconds of latency is mostly in the host PCI and memory system

• Higher issue rates

– Improved link utilisation on small transfers

• Higher bandwidths

– 1.5 to 2.25 GB/sec/link depending on host interface

• Bi-directional host interface

– 2 x improvement over QsNetII

• Broadcast and barrier in hardware

• Continued development of adaptive routing underwrites scaling to high node counts