Top Banner
DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi *+ Isaac Keslassy + Avinoam Kolodny + ANCS 2012
20

DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

Dec 16, 2015

Download

Documents

Byron Byrd
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS

RUNNING ON DATA CENTER NETWORKS

* Mellanox Technologies LTD, + Technion - EE Department

Eitan Zahavi*+

Isaac Keslassy+ Avinoam Kolodny+

ANCS 2012

Page 2: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

2

Big Data – Larger Flows

Data-set sizes keep rising Web2 and Cloud Big-Data applications

Data Center Traffic changes to:

Longer, Higher BW and Fewer Flows

Google

Page 3: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

3

Static Routing of Big-Data = Low BW

Static Routing cannot balance a small number of flows

Congestion: when BW of link flows > link capacity When longer and higher-BW flows contend:

On lossy network: packet drop → BW drop On lossless network: congestion spreading → BW drop

Data flow

SR

Page 4: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

4

Traffic Aware Load Balancing Systems

Centralized Flows are routed according

to a “global” knowledge

Distributed Each flow is routed by its input

switch with “local” knowledge

Central Routing Control

SR

SR

SR

Adaptive Routing adjusts routing to network load

Self Routing

Unit

Page 5: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

5

Central vs. Distributed Adaptive Routing

Distributed Adaptive Routing is either scalable or have global knowledge

It is Reactive

Property Central Adaptive Routing

Distributed Adaptive Routing

Scalability Low High

Knowledge

Global Local (to keep scalability)

Non-Blocking

Yes Unknown

Page 6: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

6

Research Question

Can a Scalable Distributed Adaptive Routing System perform like centralized system and produce non-blocking routing assignments in reasonable time?

Page 7: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

7

Trial and ErrorIs Fundamental to Distributed AR

Randomize output port – Trial 1 Send the traffic Contention 1 Un-route contending flow

Randomize new output port – Trial 2 Send the traffic Contention 2 Un-route contending flow

Randomize new output port – Trial 3 Send the traffic

Convergence!

SR

SR

SR

Page 8: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

8

Routing Trials Cause BW Loss

Packet Simulation: R1 is delivered followed by G1 R2 is stuck behind G1 Re-route R3 arrives before R2

Out-of-Order Packets delivery!

Implications are significant drop in flow BW TCP* sees out-of-order as packet-drop and throttle the

senders See “Incast” papers…* Or any other reliable transport

R1

R1

G1

R2

SR

SR

SR

R3

Page 9: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

9

Research Plan

Given

1. Analyze Distributed Adaptive Routing systems

2. Find how many routing trials are required to converge

3. Find conditions that make the system reach a non-blocking assignment in a reasonable time

t

events

New Traffic

Trial 1Trial 2

Trial N

No Contention

Page 10: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

10

A Simple Policy for Selecting a Flow to Re-Route

At each time step Each output switch

Request re-route of a single worst contending flow

At t=0 New traffic pattern is applied Randomize output-ports

and Send flows At t=0.5 Request Re-Routes Repeat for t=t+1 until no contention

1

m

1

r

SR

SR

SR

1

n

1

n

input switch

output switch

Page 11: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

11

Evaluation

Measure average number of iterations I to convergence

I is exponential with system size !

Page 12: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

12

A Balls and Bins Representation Each output switch is a “balls and bins” system Bins are the switch input links, balls are the link

flows Assume 1 ball (=flow) is allowed on each bin

(=link) A “good” bin has ≤ 1 ball Bins are either “empty”, “good” or “bad”

SR

SR

SR

1

m

empty

bad

good

Middle Switch

Page 13: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

13

System Dynamics

Two reasons of ball moves Improvement or Induced-move

Balls are numbered by their input switch number

1 2 3

Middle Switch: 1 2 3 4

Output switch 1

3

Induced

2 1

3

Middle Switch: 1 2 3 4

Output switch 2

Improve

2

1

3

4

2

1

3

SW2

SW1

SW3 3

Page 14: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

14

The “Last” Step Governs Convergence

Estimated Markov chain models What is the probability of the required last

Improvement to not cause a bad Induced move?

Each one of the r output-switches must do that step

Therefore convergence time is exponential with r

1 0

AB

CD

GoodBad

1 0

AB

CD

GoodBad

Output switch 1

Output switch 2

1 0

AB

CD

GoodBad

Output switch r

Absorbing – 1 Absorbing

Page 15: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

15

Introducing p

Assume a symmetrical system: flows have same BW

What if the Flow_BW < Link_BW? The network load is Flow_BW/Link_BW p = how many balls are allowed in one bin

SR

p=1

p=1

p=2

p=2SR

SR

Page 16: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

16

p has Great Impact on Convergence

Measure average number of iterations I to convergence

I shows very strong dependency on p

Page 17: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

17

Implementable Distributed System

Replace congestion detection by flow-count with QCN Detected on middle switch output – not output

switch input Replace “worst flow selection” by congested flow

sampling Implement as extension to detailed InfiniBand flit

level model

1152 nodes 2 Levels Fat-Tree SNB m=2n=2rParameter Full BW: 40Gbps Full BW: 40Gbps 48% BW: 19.2GbpsAvg of Avg Throughput 3,770MB/s 2,302MB/s 1,890MB/sMin of Avg Throughput 3,650MB/s 2,070MB/s 1,890MB/sAvg of Avg Pkt Network Latency 17.03usec 32usec 0.56usecMax of Max Pkt Network Latency 877.5usec 656usec 11.68usecAvg of Avg Out-of-order/In-Order Ratio 1.87 / 1 5.2 / 1 0.0023 / 1Max of Avg Out-of-order/In-Order Ratio 3.25 / 1 7.4 / 1 0.0072 / 1Avg of Avg Out-of-order Window 6.9pkts 5.1pkts 2.28pktsMax of Avg Out-of-order Window 8.9pkts 5.7pkts 5pkts

RNB m=n=r

Page 18: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

18

52% Load on 1152 nodes Fat-Tree

No change in number of adaptations over time !

No convergence

Page 19: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

19

48% Load on 1152 nodes Fat-Tree

t [sec]

Sw

itch

Rou

ting

Ada

ptat

ions

/ 10u

sec

Page 20: DISTRIBUTED ADAPTIVE ROUTING FOR BIG-DATA APPLICATIONS RUNNING ON DATA CENTER NETWORKS * Mellanox Technologies LTD, + Technion - EE Department Eitan Zahavi.

20

Conclusions

Study: Distributed Adaptive Routing of Big-Data flows Focus on: Time to convergence to non-blocking routing Learning: The cause for the slow convergence Corollary: Half link BW flows converge in few iterations Evaluation: 1152 nodes fat-tree simulation reproduce these

results

Distributed Adaptive Routing of Half Link_BW

Flows

is both Non-Blocking and Scalable