Evaluating the Impact of Infiniband Routing Algorithms on Network Performance
Fabrice Mizero, Philander Smith College, SIParCS
Mentor: Dr. John Dennis
Collaborators: Prof. Malathi Veeraraghavan, Zhengyang Liu (UVA); Dr. Robert D. Russell, Patrick MacArthur (UNH)
08/01/2013
May 28, 2018



Roadmap
• Motivation
• Routing Algorithms: Emphasis on UpDn Scatter-Ports
• How routing works: Emphasis on the Subnet Manager
• Link Failures in Infiniband Networks
• Subnet Manager reaction to Infiniband link failures
• Experiments
• Results and Observations

Motivation
Application performance variability – CESM
[Figure: Execution Time for ASD on Yellowstone, annotated "Large variability in execution time"]

CAM Scalasca Analysis

Possible Explanation
• Execution time variability in CESM
• Slow communications affecting ys5456
• Routing table recalculations

Motivation
• A better understanding of Infiniband routing, routing algorithms, and subnet management.
• Need for low latency and high throughput in large-scale parallel message-passing applications such as the Community Earth System Model (CESM).

Routing Algorithms
OpenSM, the Infiniband subnet manager, currently supports:
• UpDown
• UpDown with Scatter Ports
• Others: FatTree, Minhop, LASH, DOR
The choice of routing algorithm largely depends on:
• Network topology
• Expected nature of traffic and application demands

Routing Algorithms
Yellowstone: UpDown with Scatter Ports
How it works (3 steps):
◦ Auto-detection of root nodes
◦ Ranking process
◦ Minhop table setting
Advantages:
• Randomness in port selection
• Reduces the potential for credit loops by reducing the number of routes
• Better adaptation to link failures (it is not topology-bound like FatTree)
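The ranking step and the up/down restriction can be illustrated with a minimal Python sketch. It is not OpenSM's implementation: the root switches are passed in rather than auto-detected, hops between equal-rank switches are simply treated as "down", and the names `rank_switches`, `is_legal_updown_path`, and `pick_port` are hypothetical.

```python
import random
from collections import deque


def rank_switches(adjacency, roots):
    """Breadth-first ranking: roots get rank 0, their neighbors rank 1, and so on."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def is_legal_updown_path(path, rank):
    """A path is legal if it never moves up (toward the roots) after moving down.

    Simplification (assumption): a hop to an equal-rank switch counts as "down".
    """
    gone_down = False
    for current, nxt in zip(path, path[1:]):
        going_up = rank[nxt] < rank[current]
        if going_up and gone_down:
            return False
        if not going_up:
            gone_down = True
    return True


def pick_port(equal_cost_ports, rng):
    """Scatter-ports idea: choose randomly among ports with the same hop count."""
    return rng.choice(equal_cost_ports)


# Hypothetical two-level fabric: two spine (root) switches and two leaf switches.
adjacency = {
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
}
rank = rank_switches(adjacency, ["spine1", "spine2"])
print(is_legal_updown_path(["leaf1", "spine1", "leaf2"], rank))   # up, then down
print(is_legal_updown_path(["spine1", "leaf1", "spine2"], rank))  # down, then up
print(pick_port([3, 7], random.Random(0)))
```

Forbidding down-then-up turns is what removes the cyclic dependencies that could otherwise form credit loops.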

Subnet Management in Infiniband Networks
Subnet Manager
Infiniband-compliant subnet manager: OpenSM
Tasks:
• Initialize Infiniband hardware
• Local Identifier (LID) assignment (reassign LIDs: -r)
• Routing table calculation and distribution
• Regularly sweeps for changes in the topology; if changes are found, routing is recalculated
Routing recalculation is a huge task in large-scale networks
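The sweep-and-recalculate decision can be sketched as a comparison between the links known from the previous sweep and the links found in the current one. This is a simplified illustration, not OpenSM code; the function name `sweep` and the switch names are hypothetical.

```python
def sweep(known_links, discovered_links):
    """Compare two sets of links and decide whether routing must be recalculated.

    Each link is a pair of endpoint names; direction is ignored.
    """
    known = {frozenset(link) for link in known_links}
    found = {frozenset(link) for link in discovered_links}
    lost = known - found   # links that disappeared since the last sweep
    new = found - known    # links that appeared since the last sweep
    return {"lost": lost, "new": new, "recalculate": bool(lost or new)}


# Hypothetical example: the link SW-000 <-> SW-100 has gone down.
before = [("SW-000", "SW-100"), ("SW-000", "SW-101")]
after = [("SW-101", "SW-000")]
print(sweep(before, after)["recalculate"])
```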

Link Failures in Infiniband Networks
Main cause:
• Dysfunctional cables
Impact on network performance:
• Higher latencies
• Possible packet drops due to timeouts
• Overall poor performance of parallel message-passing applications


Topo-file Example in Use

32 nodes, 3 levels, full symmetrical FatTree

Infiniband Routing on a Healthy Subnet
Destination-Based Routing & Credit-Based Flow Control
[Figure: fabric diagram with LIDs 0x0001, 0x0009, 0x0013, 0x0017, 0x0021, 0x0025; label "Packet 0x0009" at 0x0013]
Flowchart:
• Destination LID compared to current LID
• If equal (=): destination reached
• If not equal (≠): consult routing table, find port for destination LID
• Request for buffer space availability
◦ No: wait for credits
◦ Yes: send packet
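The flowchart above can be sketched as a single forwarding decision at one switch. This is an illustrative model only: real switches track credits per virtual lane in hardware, and the function name `forward_step`, the port number, and the credit counts are assumptions made for the example.

```python
def forward_step(current_lid, dest_lid, forwarding_table, credits):
    """Decide what to do with a packet at the node identified by current_lid.

    forwarding_table maps a destination LID to an output port.
    credits maps an output port to the buffer credits currently available.
    """
    if dest_lid == current_lid:
        return ("delivered", None)           # destination reached
    port = forwarding_table[dest_lid]        # destination-based lookup
    if credits.get(port, 0) > 0:
        credits[port] -= 1                   # consume one credit and send
        return ("send", port)
    return ("wait_for_credits", port)        # no buffer space downstream


# Hypothetical switch at LID 0x0013 forwarding a packet to LID 0x0009.
table = {0x0009: 3}
credits = {3: 1}
print(forward_step(0x0013, 0x0009, table, credits))  # one credit available
print(forward_step(0x0013, 0x0009, table, credits))  # credit now exhausted
```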

Subnet Manager Adaptation to Link Failures
[Figure: fabric diagram with LIDs 0x0001 and 0x0009, switches SW-000 and SW-100; label "Packet 0x0009"]
• OpenSM scheduled sweeps
• Link failure detected
• Find directly affected switches (SW-000, SW-100)
• Update routing tables in both switches
• Subnet UP
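A minimal sketch of the "update routing tables in both switches" step, assuming each switch keeps a table from destination LID to output port and that a list of alternative ports per destination is already known. The function name `reroute_after_link_failure`, the port numbers, and the `alternatives` structure are hypothetical; OpenSM's actual recomputation is more involved.

```python
def reroute_after_link_failure(tables, alternatives, failed_link):
    """Repoint table entries that used the failed port on each affected switch.

    tables:       {switch: {dest_lid: port}}
    alternatives: {switch: {dest_lid: [candidate ports]}}
    failed_link:  ((switch_a, port_a), (switch_b, port_b))
    Returns {switch: [dest_lids whose entry was changed]}.
    """
    updated = {}
    for switch, failed_port in failed_link:
        changed = []
        for dest_lid, port in list(tables[switch].items()):
            if port != failed_port:
                continue
            candidates = alternatives.get(switch, {}).get(dest_lid, [])
            backups = [p for p in candidates if p != failed_port]
            # None marks a destination that is unreachable from this switch.
            tables[switch][dest_lid] = backups[0] if backups else None
            changed.append(dest_lid)
        updated[switch] = changed
    return updated


# Hypothetical example: the link between SW-000 port 1 and SW-100 port 4 fails.
tables = {
    "SW-000": {0x0009: 1, 0x0001: 2},
    "SW-100": {0x0001: 4, 0x0009: 5},
}
alternatives = {"SW-000": {0x0009: [1, 3]}, "SW-100": {0x0001: [4]}}
print(reroute_after_link_failure(tables, alternatives, (("SW-000", 1), ("SW-100", 4))))
```

Only the two switches at the ends of the failed link are touched, which mirrors the "directly affected switches" step on the slide.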

Experiments
Tools:
• Infiniband Management Simulator (IBMgtSim)
• Subnet Manager (OpenSM)
• OpenSM logs: calculate subnet recovery times
Setup:
• Topology file → IBMgtSim: virtual Infiniband fabric
• OpenSM: boots up the subnet on the fabric
• Dumps routing tables and logs to a temp directory
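One way to compute subnet recovery times, assuming the log has already been parsed into chronological (timestamp, event) pairs. The event labels `"link_failure_detected"` and `"subnet_up"` are placeholders chosen for this sketch; they are not OpenSM's actual log strings, and no log-parsing step is shown.

```python
from datetime import datetime


def recovery_times(events):
    """Return seconds elapsed from each detected failure to the next subnet-up.

    events: chronological list of (datetime, kind) pairs, where kind is
    "link_failure_detected" or "subnet_up" (placeholder labels).
    """
    durations = []
    failure_time = None
    for timestamp, kind in events:
        if kind == "link_failure_detected" and failure_time is None:
            failure_time = timestamp
        elif kind == "subnet_up" and failure_time is not None:
            durations.append((timestamp - failure_time).total_seconds())
            failure_time = None
    return durations


# Hypothetical timestamps, for illustration only.
events = [
    (datetime(2013, 8, 1, 12, 0, 0), "subnet_up"),
    (datetime(2013, 8, 1, 12, 5, 0), "link_failure_detected"),
    (datetime(2013, 8, 1, 12, 5, 2, 500000), "subnet_up"),
]
print(recovery_times(events))
```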


Observations
• Subnet management can lower network performance.
• The more dysfunctional links there are in the network, the more overall latency is affected by subnet management.
Future Work
• Causes of link failures?
• Evaluating ways in which OpenSM can adapt faster to changes.

Acknowledgments
• John Dennis – Mentor
• Collaborators – UVA, UNH
• Zhengyang Liu – Colleague & Friend
• Babak Behzad, Sean Fisk, Joseph Usset – For all the enriching discussions.

Thank You for Your Attention!
Q&A
Fabrice Mizero
[email protected]
CS Junior, Philander Smith College