12th ANNUAL WORKSHOP 2016 INFINIBAND ROUTER PREMIER Mark Bloch, Liran Liss [ April 7th, 2016 ] Mellanox Technologies

May 09, 2022

Page 1: 12th INFINIBAND ROUTER PREMIER

12th ANNUAL WORKSHOP 2016

INFINIBAND ROUTER PREMIER Mark Bloch, Liran Liss

[ April 7th, 2016 ] Mellanox Technologies


Page 2: 12th INFINIBAND ROUTER PREMIER

OpenFabrics Alliance Workshop 2016

AGENDA

§ Why routing? Why now?
§ InfiniBand routing
§ Host stack
§ IB and IP(oIB) addressing
§ Supporting arbitrary IPoIB subnets
  • IPoIB vs. RDMACM
  • IBACM
§ IB routers and HPC
  • Preliminary results

Page 3: 12th INFINIBAND ROUTER PREMIER

WHY ROUTING?

§ What changed?
  • Complex RDMA systems and deployments
    • Interconnected appliances
    • Interconnected clusters
    • Inter-data-center connections
  • Exascale is here
    • 100Ks of nodes
§ Routing requirements
  • Isolation
    • Local failures should not affect the whole fabric
  • Consolidation
    • Interconnect resources provided by different IB "islands"
  • Scaling
    • Scale up addressable endpoints
    • Maintain the bisection bandwidth and latency characteristics of switches

[ Figure: Cluster A, Cluster B, and Cluster C, each built from hosts and switches ]

Page 4: 12th INFINIBAND ROUTER PREMIER

PACKET RELAY

§ Packets with HopLimit < 2 are discarded
§ Tclass is preserved
  • May be used to map incoming SL to outgoing SL
§ Partitions are global
  • In/Out-bound P_Key enforcement in routers is optional
§ Routers may support multiple paths for a given DGID
  • Via different next-hop routers or LMC
  • Identical GRH:FlowLabel values indicate packets for which ordering is important
§ Ordering must be maintained per <in-port, out-port, SL>

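[ Figure: Host1 (LID=5, GID=A) sends to Host2 (LID=7, GID=B) through a router whose ports have LID=3 and LID=2; the router performs a FIB lookup ]

Packet entering the router:
  LRH: SLID=5 DLID=3 SL=0
  GRH: DGID=B SGID=A Tclass=T1 HopLimit=2
  BTH+, Payload
  ICRC: X
  VCRC: Y

Packet leaving the router:
  LRH: SLID=2 DLID=7 SL=1
  GRH: DGID=B SGID=A Tclass=T1 HopLimit=1
  BTH+, Payload
  ICRC: X
  VCRC: Y'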
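
The relay rules on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not router firmware: the Packet type, the FIB layout (DGID mapped to an egress port LID and a next-hop LID), and the sl_for_tclass callback are all hypothetical names introduced here, and dropping packets with no FIB entry is an assumption. ICRC and VCRC handling is not modeled.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Packet:
    slid: int       # LRH source LID (rewritten per subnet)
    dlid: int       # LRH destination LID (rewritten per subnet)
    sl: int         # LRH service level (may be remapped)
    sgid: str       # GRH source GID (unchanged end to end)
    dgid: str       # GRH destination GID (unchanged end to end)
    tclass: str     # GRH traffic class (preserved by the router)
    hop_limit: int  # GRH HopLimit (decremented by the router)


# Hypothetical FIB layout: DGID -> (egress port LID, next-hop LID).
Fib = Dict[str, Tuple[int, int]]


def relay(pkt: Packet, fib: Fib,
          sl_for_tclass: Callable[[str], int]) -> Optional[Packet]:
    """Return the packet as it leaves the router, or None if it is dropped."""
    if pkt.hop_limit < 2:
        return None                      # rule from the slide: discard
    route = fib.get(pkt.dgid)
    if route is None:
        return None                      # assumption: no route means drop
    egress_lid, next_hop_lid = route
    return replace(
        pkt,
        slid=egress_lid,                 # new LRH for the outgoing subnet
        dlid=next_hop_lid,
        sl=sl_for_tclass(pkt.tclass),    # Tclass drives the outgoing SL
        hop_limit=pkt.hop_limit - 1,
    )


# Values taken from the figure: Host1 (LID 5, GID A) to Host2 (LID 7, GID B).
fib = {"B": (2, 7)}
incoming = Packet(slid=5, dlid=3, sl=0, sgid="A", dgid="B",
                  tclass="T1", hop_limit=2)
outgoing = relay(incoming, fib, lambda tclass: 1 if tclass == "T1" else 0)
```

Only the LRH fields, the SL, and HopLimit change across the router; the GIDs and Tclass are carried through untouched, which is why the ICRC in the figure stays X while the VCRC changes from Y to Y'.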

Page 5: 12th INFINIBAND ROUTER PREMIER

ROUTER MANAGEMENT

§ Specified
  • Router NodeType
  • Each SM manages the router ports discovered in its own subnet
  • Endpoints obtain paths to remote destinations by querying the local SA
    • SA determines next-hop router
  • Communication management
§ Unspecified
  • Router manager and agent entities
  • Routing MADs, methods, and attributes
  • Endpoint local interface selection

Page 6: 12th INFINIBAND ROUTER PREMIER

HOST STACK TODAY

§ Path queries
  • Use standard path queries to obtain paths to remote nodes
  • If PathRecord.HopLimit > 0
    • GRH is specified by AH attributes
§ Raw verbs
  • Modify QP with AH attributes specifying a GRH
  • Create an AH with AH attributes specifying a GRH
§ AF_INET / AF_INET6 address resolution
  • Local IPoIB interface selected by IP stack
  • SGID extracted from local interface HW address
  • DGID extracted from neighbor HW address
    • Assumption: a single IPoIB subnet spans the whole IB fabric

[ Figure: a single IPoIB subnet spanning two IB subnets ]
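
As an illustration of "GID extracted from HW address": an IPoIB link-layer address is 20 bytes long (one flags byte, a 3-byte QPN, then the 16-byte port GID), so the GID is simply the last 16 bytes. The sketch below shows that extraction in Python; the helper names and the sample address are made up for this example and do not come from the slides.

```python
import ipaddress

IPOIB_HW_ADDR_LEN = 20  # 1 flags byte + 3-byte QPN + 16-byte GID


def parse_hw_addr(text: str) -> bytes:
    """Convert a colon-separated hex string into raw bytes."""
    return bytes.fromhex(text.replace(":", ""))


def gid_from_ipoib_hw_addr(hw_addr: bytes) -> ipaddress.IPv6Address:
    """Return the GID embedded in a 20-byte IPoIB hardware address.

    GIDs share the 128-bit IPv6 address format, so IPv6Address is used
    here purely as a convenient container and pretty-printer.
    """
    if len(hw_addr) != IPOIB_HW_ADDR_LEN:
        raise ValueError(f"expected {IPOIB_HW_ADDR_LEN} bytes, got {len(hw_addr)}")
    return ipaddress.IPv6Address(hw_addr[4:])


# Illustrative neighbor entry (not taken from a real fabric).
neighbor = parse_hw_addr(
    "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:00:00:01"
)
dgid = gid_from_ipoib_hw_addr(neighbor)
```

This derivation only yields the peer's GID when the neighbor entry belongs to the peer itself, which is exactly the single-IPoIB-subnet assumption stated on the slide.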

Page 7: 12th INFINIBAND ROUTER PREMIER

HOST STACK TODAY (CONT.)

§ AF_IB address resolution
  • SGID must be provided by either rdma_bind_addr() or rdma_resolve_addr()
    • Used to locate local IB port
    • Choosing the local port based on the DGID subnet prefix does not apply to routers!
  • DGID provided by rdma_resolve_addr()
§ Connection Management
  • No standard way to obtain remote path attributes required in CM REQ
  • On active side: set SLID = DLID = 0xffff
  • On passive side
    • If SLID == 0xffff
      • Set SLID ← CQE.SLID (router LID)
    • If DLID == 0xffff
      • Set DLID ← CQE.path-bits
  • Otherwise, no change in 3-way handshake

Routing management is transparent to host stack
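
The passive-side rule above can be written as a small function. The sketch below is hedged: ReqPath and Cqe are hypothetical stand-ins for the CM REQ path fields and the receive completion, and combining the port's base LID with the completion's path bits to form the DLID is an assumption about how "CQE.path-bits" is turned into a full LID.

```python
from dataclasses import dataclass, replace

PERMISSIVE_LID = 0xFFFF  # placeholder value set by the active side


@dataclass(frozen=True)
class ReqPath:
    slid: int  # LID the REQ claims as source
    dlid: int  # LID the REQ claims as destination
    sl: int


@dataclass(frozen=True)
class Cqe:
    slid: int            # LID the REQ actually arrived from (router port)
    dlid_path_bits: int  # low LMC bits of the LID it was addressed to
    sl: int


def fix_up_req_path(path: ReqPath, cqe: Cqe,
                    port_base_lid: int, lmc: int = 0) -> ReqPath:
    """Fill in LIDs the active side could not know, using the completion."""
    slid, dlid = path.slid, path.dlid
    if slid == PERMISSIVE_LID:
        slid = cqe.slid
    if dlid == PERMISSIVE_LID:
        mask = (1 << lmc) - 1
        # Assumption: full DLID = local base LID with the received path bits.
        dlid = (port_base_lid & ~mask) | (cqe.dlid_path_bits & mask)
    return replace(path, slid=slid, dlid=dlid)


# Passive host with LID 7 receives a REQ relayed by a router port with LID 3.
received = ReqPath(slid=PERMISSIVE_LID, dlid=PERMISSIVE_LID, sl=0)
fixed = fix_up_req_path(received, Cqe(slid=3, dlid_path_bits=0, sl=0),
                        port_base_lid=7)

# A REQ that already carries real LIDs (same-subnet case) is left alone.
local = ReqPath(slid=5, dlid=7, sl=0)
unchanged = fix_up_req_path(local, Cqe(slid=5, dlid_path_bits=0, sl=0),
                            port_base_lid=7)
```

Because the fix-up triggers only on the 0xffff placeholder, connections within a single subnet go through the handshake exactly as before.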

Page 8: 12th INFINIBAND ROUTER PREMIER

AF_INET - PUTTING IT ALL TOGETHER (1/2)

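[ Sequence diagram between Host A, SM A, IB Router, SM B, and Host B; LID labels in the figure: LID1, LID3, LID2, LID3, LID7, LID2 ]

§ Phases: resolve_addr, resolve_route
§ Messages
  • ARP query (multicast), carried by multicast routing
  • ARP reply (unicast), carried by unicast routing
  • Path query (A) → HopLimit=2 → DLID=3
  • Path query (B) → HopLimit=2 → DLID=2
§ Figure annotations: "Issued by IPoIB", "Issued by CMA"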

Page 9: 12th INFINIBAND ROUTER PREMIER

AF_INET - PUTTING IT ALL TOGETHER (2/2)

[ Sequence diagram between Host A, SM A, IB Router, SM B, and Host B; LID labels in the figure: LID1, LID3, LID2, LID3, LID7, LID2 ]

§ connect
  • CM REQ
    • Primary/alt path: SLID = DLID = 0xffff
    • CQE: SLID=3 DLID=7 (path bits)
    • Primary/alternate fields {SLID, DLID, SL} replaced by reversing CQE values: {7, 3, cqe.SL}
  • CM REP
  • CM RTU
§ Data traffic

Page 10: 12th INFINIBAND ROUTER PREMIER

IB ROUTING AND IP(OIB) ADDRESSING

§ IP can be used to
  • Select local interface
  • Determine SGID
  • Determine next-hop DGID (for IP connectivity)
  • Resolve ServiceIDs within proper network namespace
§ This does not mandate a global IPoIB subnet
  • Additional models are possible

[ Figure: IB subnets and IPoIB subnets connected through an IB/IP router and an IP routing fabric ]

Page 11: 12th INFINIBAND ROUTER PREMIER

ARBITRARY IPOIB SUBNETS

[ Figure: Host A and Host B in two IB subnets and two IPoIB subnets, connected by an IB router (native IB traffic) and by IP routers over an IP routing fabric (IP traffic) ]

IPoIB Next-Hop != RDMACM Destination !!!
True even for paired IB/IPoIB interfaces

Page 12: 12th INFINIBAND ROUTER PREMIER

ARBITRARY IPOIB SUBNETS (CONT.)

§ Global IPoIB
  • Neighbor (ARP table) holds HW address of peer node
  • CMA may derive peer GID from HW address
§ Multiple IPoIB subnets
  • Neighbor holds HW address of the next-hop IP router
  • CMA needs to resolve remote IP to peer GID
§ Global IP→GID resolution is not a kernel task
§ Solution: use IBACM daemon

Page 13: 12th INFINIBAND ROUTER PREMIER

IBACM

§ IBACM assists in establishing IB connections
§ Implemented as user-space daemon
  • Plugin architecture for augmenting behavior and implementation
§ Provides
  • Mapping of hostname/IP → path record for rdmacm
  • Path record lookups for the kernel
§ Lookup results are cached for fast future access

[ Figure: applications linked with rdmacm, the IBACM daemon with its plugin, and the kernel ]

Page 14: 12th INFINIBAND ROUTER PREMIER

IBACM EXISTING FLOWS

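[ Sequence diagram between Application, rdmacm, IBACM, and Kernel ]

§ rdma_getaddrinfo
  • rdmacm sends ACM_OP_RESOLVE to IBACM (TCP)
  • IBACM returns path record and DGID/SGID
  • rdmacm returns path record and DGID/SGID to the application
§ rdma_create_ep
  • Kernel sends RDMA_NL_LS_OP_RESOLVE to IBACM (netlink)
  • IBACM returns path record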

Page 15: 12th INFINIBAND ROUTER PREMIER

IP→GID RESOLUTION FLOW

§ Kernel CMA
  • If destination IP is in a non-adjacent IP subnet: obtain DGID from IBACM
  • Otherwise: fall back to neighbor lookup
§ RDMACM not changed
  • Applications that obtain path via rdma_getaddrinfo() will use existing flow
  • Others will obtain remote GID and path from the kernel CMA

[ Figure: Kernel sends RDMA_NL_LS_OP_RESOLVE_IP to IBACM; IBACM returns the DGID ]
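
The kernel CMA decision described on this slide can be sketched as follows. This is a user-space illustration in Python, not kernel code: resolve_dgid, the two lookup callbacks, and the sample tables are hypothetical, and "non-adjacent" is interpreted here as "not inside any directly attached IP subnet".

```python
import ipaddress
from typing import Callable, Iterable


def resolve_dgid(dest_ip: str,
                 local_networks: Iterable[str],
                 neighbor_lookup: Callable[[str], str],
                 ibacm_lookup: Callable[[str], str]) -> str:
    """Pick the DGID source for dest_ip following the slide's two rules."""
    dest = ipaddress.ip_address(dest_ip)
    on_link = any(dest in ipaddress.ip_network(net) for net in local_networks)
    if on_link:
        # Adjacent subnet: the neighbor entry belongs to the peer itself.
        return neighbor_lookup(dest_ip)
    # Non-adjacent subnet: the neighbor would be the next-hop IP router,
    # so the peer GID has to come from the IBACM daemon instead.
    return ibacm_lookup(dest_ip)


# Illustrative tables standing in for the ARP cache and for IBACM.
neighbor_table = {"10.0.1.20": "GID-of-10.0.1.20"}
ibacm_table = {"10.0.2.30": "GID-of-10.0.2.30"}

local = resolve_dgid("10.0.1.20", ["10.0.1.0/24"],
                     neighbor_table.__getitem__, ibacm_table.__getitem__)
remote = resolve_dgid("10.0.2.30", ["10.0.1.0/24"],
                      neighbor_table.__getitem__, ibacm_table.__getitem__)
```

Keeping the neighbor lookup as the default path means hosts in a single global IPoIB subnet never need to contact the daemon.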

Page 16: 12th INFINIBAND ROUTER PREMIER

IB ROUTERS AND HPC
Preliminary results

§ Configuration
  • Mellanox SB7780 configurable, 36-port, EDR switch/router
  • Dell PowerEdge R720 16-node cluster
    • Dual-Socket 10-Core Intel E5-2680v2 @ 2.80 GHz CPUs
  • Vanilla OpenMPI 1.10.3a1
§ Test environment
  • Compare single subnet vs. splitting ports across 2 subnets

[ Figure: SB7780 with ports P0 to P35; Host0 to Host7 attach to P0 to P7 and Host8 to Host15 attach to P8 to P15; the port groups act as switches joined by a router ]

Page 17: 12th INFINIBAND ROUTER PREMIER

OSU MPI BENCHMARKS

§ 2-node MPI test
§ ~50 ns difference in latency
§ No apparent difference in bandwidth

Page 18: 12th INFINIBAND ROUTER PREMIER

GROMACS APPLICATION

§ GROningen MAchine for Chemical Simulation
  • Molecular dynamics simulation package
§ Runs on up to 16 nodes
§ No apparent differences between switch and router

Page 19: 12th INFINIBAND ROUTER PREMIER

NAMD APPLICATION

§ Parallel molecular dynamics
  • High-performance simulation of large biomolecular systems
§ Runs on up to 16 nodes
§ No apparent differences between switch and router

Page 20: 12th INFINIBAND ROUTER PREMIER

12th ANNUAL WORKSHOP 2016

THANK YOU Mark Bloch, Liran Liss

Mellanox Technologies
