ElasticSwitch: Practical Work-Conserving Bandwidth Guarantees for Cloud Computing
Lucian Popa (HP Labs), Praveen Yalagandula (Avi Networks), Sujata Banerjee (HP Labs), Jeffrey C. Mogul (Google), Yoshio Turner (HP Labs), Jose Renato Santos (HP Labs)
Feb 23, 2016
Goals

1. Provide Minimum Bandwidth Guarantees in Clouds
- Tenants can affect each other's traffic
- MapReduce jobs can affect the performance of user-facing applications
- Large MapReduce jobs can delay the completion of small jobs
- Bandwidth guarantees offer predictable performance
Goals

1. Provide Minimum Bandwidth Guarantees in Clouds
- Hose model: all VMs of one tenant connect to a virtual (imaginary) switch VS; each VM X has a bandwidth guarantee B_X to/from that switch
[Figure: VMs X, Y, Z attached to virtual switch VS with guarantees B_X, B_Y, B_Z]
- Other models based on the hose model, such as TAG [HotCloud'13], also apply
Goals

1. Provide Minimum Bandwidth Guarantees in Clouds
2. Work-Conserving Allocation
- Tenants can use spare bandwidth from unallocated or underutilized guarantees
- Significantly increases performance: average traffic is low [IMC'09, IMC'10] and traffic is bursty
Goals

1. Provide Minimum Bandwidth Guarantees in Clouds
2. Work-Conserving Allocation
[Figure: VM X's bandwidth over time under ElasticSwitch; the rate never falls below the guarantee B_min, uses free capacity when it is available, and returns to B_min when everything is reserved and used]
Goals

1. Provide Minimum Bandwidth Guarantees in Clouds
2. Work-Conserving Allocation
3. Be Practical
- Topology independent: work with oversubscribed topologies
- Inexpensive: per-VM or per-tenant queues are expensive, so work with commodity switches
- Scalable: a centralized controller can become a bottleneck, so use a distributed solution; the problem is hard to partition, since VMs can cause bottlenecks anywhere in the network
Goals

1. Provide Minimum Bandwidth Guarantees in Clouds
2. Work-Conserving Allocation
3. Be Practical

Prior Work

| Scheme                                                        | Guarantees       | Work-conserving | Practical                |
|---------------------------------------------------------------|------------------|-----------------|--------------------------|
| Seawall [NSDI'11], NetShare [TR], FairCloud (PS-L/N) [SIGCOMM'12] | X (fair sharing) | Yes             | Yes                      |
| SecondNet [CoNEXT'10]                                         | Yes              | X               | Yes                      |
| Oktopus [SIGCOMM'10]                                          | Yes              | X               | ~X (centralized)         |
| Gatekeeper [WIOV'11], EyeQ [NSDI'13]                          | Yes              | Yes             | X (congestion-free core) |
| FairCloud (PS-P) [SIGCOMM'12]                                 | Yes              | Yes             | X (queue/VM)             |
| Hadrian [NSDI'13]                                             | Yes              | Yes             | X (weighted RCP)         |
| ElasticSwitch                                                 | Yes              | Yes             | Yes                      |
Outline
- Motivation and Goals
- Overview
- More Details: Guarantee Partitioning, Rate Allocation
- Evaluation
ElasticSwitch Overview: Operates At Runtime

1. VM setup: the tenant selects bandwidth guarantees (models: Hose, TAG, etc.); VMs are placed, and admission control ensures all guarantees can be met (e.g., Oktopus [SIGCOMM'10], Hadrian [NSDI'13], CloudMirror [HotCloud'13])
2. Runtime (ElasticSwitch): enforce bandwidth guarantees and provide work-conservation
ElasticSwitch Overview: Runs In Hypervisors
- Resides in the hypervisor of each host
- Distributed: hypervisors communicate pairwise, following data flows
ElasticSwitch Overview: Two Layers

Both layers run in the hypervisor:
- Guarantee Partitioning: provides guarantees
- Rate Allocation: provides work-conservation
ElasticSwitch Overview: Guarantee Partitioning

1. Guarantee Partitioning turns hose-model guarantees (B_X, B_Y, B_Z) into VM-to-VM pipe guarantees (B_XY, B_XZ, ...)
- VM-to-VM control is necessary; coarser granularity is not enough
- VM-to-VM guarantees provide bandwidth as if the tenant communicated on a physical hose network
- Operates intra-tenant
ElasticSwitch Overview: Rate Allocation

2. Rate Allocation uses rate limiters in the hypervisors: Rate_XY >= B_XY, and the rate between X and Y is increased above B_XY when there is no congestion between X and Y
- Uses unreserved/unused capacity: work-conserving allocation
- Operates inter-tenant
ElasticSwitch Overview: Periodic Application
- Guarantee Partitioning: applied periodically and on new VM-to-VM pairs; supplies VM-to-VM guarantees to Rate Allocation
- Rate Allocation: applied periodically, more often; feeds demand estimates back to Guarantee Partitioning
Guarantee Partitioning – Overview

[Figure: VMs X, Y, Z, T, Q on a hose with virtual switch VS1; X's guarantee is split into B_XY and B_XZ, while Y receives shares B_XY, B_TY, B_QY]

Goals:
A. Safety: don't violate the hose model
B. Efficiency: don't waste guarantee
C. No Starvation: don't block traffic

Approach: max-min allocation
Guarantee Partitioning – Overview

Example (max-min allocation): B_X = ... = B_Q = 100 Mbps. Y divides its guarantee among its three senders (X, T, Q), 33 Mbps each; shares left unused by low-demand senders can be reallocated (e.g., one sender growing to 66 Mbps).
Guarantee Partitioning – Operation
- The hypervisor divides the guarantee of each hosted VM between that VM's VM-to-VM pairs, in each direction (e.g., X's guarantee into B_X^{XY} and B_X^{XZ}; Y's guarantee into B_Y^{XY}, B_Y^{TY}, B_Y^{QY})
- The source hypervisor uses the minimum of the source and destination shares:

  B_XY = min(B_X^{XY}, B_Y^{XY})
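The division and min rules above can be sketched in a few lines of Python. This is an illustrative sketch, not ElasticSwitch's implementation: the even split stands in for the demand-aware max-min division described later, and the function names are made up.

```python
def divide_guarantee(total_bw, pairs):
    """Divide a VM's hose guarantee evenly among its active
    VM-to-VM pairs (a stand-in for demand-aware max-min division)."""
    return {p: total_bw / len(pairs) for p in pairs}

def pair_guarantee(src_share, dst_share):
    """B_XY = min of the shares allocated by source and destination."""
    return min(src_share, dst_share)

# Slide example: all hose guarantees are 100 Mbps.
# X sends to Y and Z, so X allocates 50 Mbps per pair.
# Y receives from X, T, Q, so Y allocates ~33 Mbps per pair.
bx = divide_guarantee(100.0, ["XY", "XZ"])        # B_X^{XY} = 50
by = divide_guarantee(100.0, ["XY", "TY", "QY"])  # B_Y^{XY} ~ 33.3
b_xy = pair_guarantee(bx["XY"], by["XY"])         # min(50, 33.3)
```

Taking the minimum is what makes the result safe: neither endpoint's hose guarantee can be exceeded by the sum of its pair guarantees.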
Guarantee Partitioning – Safety

B_XY = min(B_X^{XY}, B_Y^{XY})

Safety: since each VM-to-VM guarantee is no larger than either endpoint's share, hose-model guarantees are not exceeded.
Guarantee Partitioning – Operation

Example: B_X = ... = B_Q = 100 Mbps
- X divides B_X between its two destinations: B_X^{XY} = 50
- Y divides B_Y among its three sources: B_Y^{XY} = B_Y^{TY} = B_Y^{QY} = 33
- Result: B_XY = min(50, 33) = 33 Mbps
Guarantee Partitioning – Efficiency

Question 1: What happens when flows have low demands?
- The hypervisor divides guarantees max-min based on demands (future demands are estimated from history)
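The demand-based max-min division can be sketched as water-filling. This is a generic illustration under assumed demand numbers, not code or values from the paper.

```python
def max_min_shares(capacity, demands):
    """Max-min (water-filling) division of `capacity` among pairs
    with the given demand estimates; capacity left over by
    low-demand pairs is redistributed to the others."""
    shares = {p: 0.0 for p in demands}
    active = set(demands)
    remaining = capacity
    while active and remaining > 1e-9:
        fair = remaining / len(active)
        # Pairs whose demand fits within the fair share are satisfied.
        satisfied = {p for p in active if demands[p] <= shares[p] + fair}
        if not satisfied:
            for p in active:           # everyone is bottlenecked: split evenly
                shares[p] += fair
            remaining = 0.0
        else:
            for p in satisfied:        # cap satisfied pairs at their demand
                remaining -= demands[p] - shares[p]
                shares[p] = demands[p]
            active -= satisfied
    return shares

# Y's 100 Mbps guarantee with a low-demand sender T:
# T-Y gets its 10 Mbps; X-Y and Q-Y split the rest, 45 Mbps each.
s = max_min_shares(100.0, {"XY": 1000.0, "TY": 10.0, "QY": 1000.0})
# -> {'XY': 45.0, 'TY': 10.0, 'QY': 45.0}
```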
Guarantee Partitioning – Efficiency

Question 2: How to avoid unallocated guarantees?
- When some senders have low demand, their unused shares are reallocated to active pairs (e.g., a sender's share growing from 33 to 66 Mbps)
- The source considers the destination's allocation when the destination is the bottleneck
- Guarantee Partitioning converges
Rate Allocation

- Guarantee Partitioning supplies B_XY; a rate limiter in X's hypervisor enforces rate R_XY toward Y, driven by congestion feedback from Y's hypervisor
[Figure: R_XY over time; R_XY stays at or above B_XY, rising into spare bandwidth and returning to B_XY when the link is fully used]
Rate Allocation

R_XY = max(B_XY, R_TCP-like)

The limiter never drops below the guarantee B_XY; a TCP-like control loop probes for spare bandwidth above it.
Rate Allocation: R_XY = max(B_XY, R_TCP-like)

[Figure: rate-limiter rate (Mbps) over time for pair X-Y; the rate stays at the guarantee while another tenant is active and climbs toward link capacity when that tenant's traffic subsides]
Rate Allocation

R_XY = max(B_XY, R_weighted-TCP-like), where the weight is the guarantee B_XY

Example: B_XY = 100 Mbps and B_ZT = 200 Mbps sharing a link of L = 1 Gbps converge to R_XY = 333 Mbps and R_ZT = 666 Mbps.
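The steady state of the weighted probe splits spare capacity in proportion to the guarantees. A small sketch (function name is illustrative) reproduces the slide's arithmetic:

```python
def weighted_shares(link_capacity, guarantees):
    """Split link capacity in proportion to each pair's guarantee,
    which is the weight the weighted TCP-like loop converges to."""
    total = sum(guarantees.values())
    return {p: link_capacity * g / total for p, g in guarantees.items()}

# Slide example: B_XY = 100 Mbps, B_ZT = 200 Mbps on a 1 Gbps link
# -> X-Y converges to ~333 Mbps, Z-T to ~666 Mbps.
shares = weighted_shares(1000.0, {"XY": 100.0, "ZT": 200.0})
```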
Rate Allocation – Congestion Detection
- Detect congestion through dropped packets: hypervisors add and monitor sequence numbers in packet headers
- Use ECN, if available
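The sequence-number mechanism can be sketched as follows. This is an illustrative receiver-side detector, not ElasticSwitch's wire format; it assumes in-order delivery per pair and ignores reordering for simplicity.

```python
class LossDetector:
    """Per-VM-pair congestion detector: the source hypervisor stamps
    consecutive sequence numbers on packets, and the destination
    hypervisor infers drops from gaps in what it receives."""
    def __init__(self):
        self.expected = 0   # next sequence number we expect
        self.received = 0
        self.lost = 0

    def on_packet(self, seq):
        if seq > self.expected:          # gap: packets were dropped
            self.lost += seq - self.expected
        self.received += 1
        self.expected = max(self.expected, seq) + 1

    def loss_rate(self):
        total = self.received + self.lost
        return self.lost / total if total else 0.0

d = LossDetector()
for seq in [0, 1, 2, 5, 6]:   # packets 3 and 4 were dropped
    d.on_packet(seq)
```

The destination hypervisor would periodically report the observed loss rate back to the source, which treats any loss as a congestion event.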
Rate Allocation – Adaptive Algorithm
- Use Seawall [NSDI'11] as the rate-allocation algorithm (TCP-Cubic like)
- Essential improvements are needed when using dropped packets for congestion detection: many flows probing for spare bandwidth can affect the guarantees of others
- Hold-increase: after a congestion event, hold off probing for free bandwidth; the holding time is inversely proportional to the guarantee
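The hold-increase rule can be sketched in one function. The constants below (`base`, `min_hold`, milliseconds as the unit) are illustrative assumptions, not values from the paper; only the inverse proportionality is from the slides.

```python
def holding_time_ms(guarantee_mbps, base=1000.0, min_hold=1.0):
    """Hold-increase sketch: after a congestion event, wait this long
    before probing for spare bandwidth again. The wait is inversely
    proportional to the pair's guarantee, so low-guarantee flows
    probe less aggressively and disturb others' guarantees less."""
    return max(min_hold, base / guarantee_mbps)

# A pair with a 200 Mbps guarantee resumes probing four times
# sooner than one with a 50 Mbps guarantee.
fast = holding_time_ms(200.0)   # 5.0 ms
slow = holding_time_ms(50.0)    # 20.0 ms
```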
Evaluation Setup
- Implementation in Linux: logic in user space controls rate limiters and sends control packets; modified kernel Open vSwitch (OVS)
- Testbed: ~100 servers, 1 Gbps tree network
Evaluation – Many-to-one

Setup: VM X (one tenant, behind VS1) sends TCP traffic across a shared 1 Gbps link (edge or core); VM Z (another tenant, behind VS2) receives UDP from an increasing number of senders over the same link. Both tenants have 450 Mbps guarantees.

[Figure: throughput (Mbps) of X and Z as the number of senders to Z grows from 0 to 300]
- No Protection: the senders to Z take all the bandwidth
- Static Reservation (e.g., Oktopus): guarantees hold, but bandwidth is wasted when there are few senders
- ElasticSwitch: work-conserving and provides guarantees, closely tracking the ideal behavior
Evaluation – MapReduce Setup
- 44 servers, 4x oversubscribed topology, 4 VMs/server
- Each tenant runs one job; all VMs of all tenants have the same guarantee
- Two scenarios:
  - Light: 10% of VM slots are either a mapper or a reducer, randomly placed
  - Heavy: 100% of VM slots are either a mapper or a reducer; mappers are placed in one half of the datacenter
Evaluation – MapReduce

[Figure: CDFs of worst-case shuffle completion time, normalized to static reservation]
- Light setup: work-conserving pays off; jobs finish faster than with static reservation, and the longest completion time is reduced compared to No Protection
- Heavy setup: ElasticSwitch enforces guarantees even in the worst case; guarantees reduce worst-case shuffle completion time by up to 160x
ElasticSwitch Summary

Properties:
1. Bandwidth Guarantees: hose model or derivatives
2. Work-conserving
3. Practical: oversubscribed topologies, commodity switches, decentralized

Design: two layers
- Guarantee Partitioning: provides guarantees by transforming hose-model guarantees into VM-to-VM guarantees
- Rate Allocation: enables work conservation by increasing rate limits above the guarantees when there is no congestion

HP Labs is hiring!
Backup Slides
Evaluation – Overhead

[Figure: overhead as a fraction of one CPU core (%) vs. number of VM-to-VM flows, 0 to 300]
- Overhead is non-linear in the number of rate limiters
- Splitting control per hosted VM should keep the operating point low
- Can be significantly improved
Guarantee Partitioning – Dynamic Behavior
- Optimize for a bimodal distribution of flows: most flows are short, while a few flows carry most bytes; short flows care about latency, long flows about throughput
- Start with a small guarantee for a new VM-to-VM flow; if demand is not satisfied, increase the guarantee exponentially

[Figure: long-flow throughput (Mbps) vs. short flows/s, for 1 to 50 new VM-to-VM pairs per second]
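The exponential-increase rule for new VM-to-VM flows can be sketched as follows. The growth factor, initial value, and cap below are illustrative assumptions; only "start small, grow exponentially while demand is unsatisfied" comes from the slides.

```python
def next_guarantee(current, demand, cap, factor=2.0):
    """Dynamic-behavior sketch: a new VM-to-VM pair starts with a
    small guarantee; while its measured demand exceeds the current
    guarantee, the guarantee grows geometrically, capped at the
    VM's available hose share `cap`."""
    if demand > current:                 # demand not yet satisfied
        return min(current * factor, cap)
    return current                       # satisfied: stop growing

g = 1.0                                  # small initial guarantee (Mbps)
for _ in range(6):
    g = next_guarantee(g, demand=40.0, cap=50.0)
# g grows 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 50 (capped at the share)
```

This way a burst of short flows never steals much guarantee from established long flows, while a flow that turns out to be long reaches its full share within a few update periods.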
Latency

[Figure: completion time (ms) of short flows vs. number of background VM-to-VM flows (0 to 23, the tight-guarantee point), comparing No Protection, Oktopus-like Reservation, and ElasticSwitch]
Future Work
- Reduce overhead: ElasticSwitch currently uses on average 1 core per 15 VMs, and 1 core per VM in the worst case
- Multi-path solution: single-path reservations are inefficient, and no existing solution works on multi-path networks
Guarantee Partitioning – Efficiency: Reallocation

Intuition: reallocation propagates from smaller to larger guarantee shares, so Guarantee Partitioning converges.
Rate Allocation

R_XY = max(B_XY, R_weighted-TCP-like), where the weight is the guarantee B_XY

The control loop operates on a timescale an order of magnitude longer than the RTT, so it does not affect TCP.