EyeQ: (An engineer’s approach to) Taming network performance unpredictability in the Cloud
Vimal, Mohammad Alizadeh, Balaji Prabhakar, David Mazières, Changhoon Kim, Albert Greenberg

Feb 25, 2016

Page 1

EyeQ: (An engineer’s approach to)
Taming network performance unpredictability in the Cloud

Vimal, Mohammad Alizadeh, Balaji Prabhakar, David Mazières, Changhoon Kim, Albert Greenberg

Page 2

What are we depending on?

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html

5 Lessons We’ve Learned Using AWS

… in the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency.

Overhaul apps to deal with variability

Many customers don’t even realise network issues:

Just “spin up more VMs!”
Makes app more network dependent

Page 3

Cloud: Warehouse Scale Computer
Multi-tenancy: To increase cluster utilisation

6/11/12

http://research.google.com/people/jeff/latency.html

Provisioning the Warehouse
CPU, memory, disk

Network

Page 4

Sharing the Network

• Policy
– Sharing model

• Mechanism
– Computing rates
– Enforcing rates on entities…
  • Per-VM (multi-tenant)
  • Per-service (search, map-reduce, etc.)

Can we achieve this?

2GHz VCPU, 15GB memory, 1Gbps network

Tenant X’s Virtual Switch

VM1 VM2 VM3 … VMn

Tenant Y’s Virtual Switch

VM1 VM2 VM3 … VMi

Customer X specifies the thickness of each pipe. No traffic matrix. (Hose Model)
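
One way to read the hose model: the tenant states only a per-VM pipe size, and any traffic pattern is acceptable as long as no VM sends or receives more than its pipe. The sketch below checks a given traffic pattern against that rule. The function name `fits_hose_model`, the VM names, and the rates are illustrative assumptions, not part of the slides.

```python
from typing import Dict, Tuple

def fits_hose_model(traffic: Dict[Tuple[str, str], float],
                    hose_gbps: Dict[str, float]) -> bool:
    """Return True if every VM's total send rate and total receive rate
    stay within that VM's hose (pipe) capacity."""
    sent = {vm: 0.0 for vm in hose_gbps}
    received = {vm: 0.0 for vm in hose_gbps}
    for (src, dst), rate in traffic.items():
        sent[src] += rate
        received[dst] += rate
    return all(sent[vm] <= cap and received[vm] <= cap
               for vm, cap in hose_gbps.items())

# Hypothetical tenant with three VMs, each given a 1Gbps pipe.
hoses = {"VM1": 1.0, "VM2": 1.0, "VM3": 1.0}

# VM3 receives 0.9Gbps in total: within its pipe.
ok = fits_hose_model({("VM1", "VM3"): 0.4, ("VM2", "VM3"): 0.5}, hoses)

# VM3 would receive 1.3Gbps in total: exceeds its pipe.
bad = fits_hose_model({("VM1", "VM3"): 0.7, ("VM2", "VM3"): 0.6}, hoses)
```

Note that the tenant never supplies the per-pair rates up front; the check only needs the per-VM totals, which is what makes the model simple to specify.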

Page 5

Why is it hard? (1)

• Bandwidth demands can be…
  – Random, bursty
  – Short: few millisecond requests
• Timescales matter!
  – Need guarantees on the order of few RTTs (ms)
• Default policy insufficient: 1 vs many TCP flows, UDP, etc.
• Poor scalability of traditional QoS mechanisms

10–100KB 10–100MB
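
To see why timescales matter, it helps to work out how long such transfers occupy a link. The sketch below assumes the 10–100KB and 10–100MB labels are transfer sizes and uses the 1Gbps per-VM network figure from the earlier slide; both readings are assumptions, and the helper name `transfer_time_ms` is hypothetical.

```python
def transfer_time_ms(size_bytes: float, link_gbps: float) -> float:
    """Serialization time in milliseconds for size_bytes on a link
    running at link_gbps (ignores propagation delay and protocol overhead)."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9) * 1e3

KB, MB = 1e3, 1e6
for label, size in [("10KB", 10 * KB), ("100KB", 100 * KB),
                    ("10MB", 10 * MB), ("100MB", 100 * MB)]:
    print(f"{label:>6}: {transfer_time_ms(size, 1.0):10.2f} ms at 1Gbps")
```

Under these assumptions the small requests finish in under a millisecond, while the large ones last a sizeable fraction of a second, so a guarantee that only holds when averaged over seconds does nothing for the small requests.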

Page 6

Seconds: Eternity

Switch

1 long-lived TCP flow

Bursty UDP session
ON: 5ms
OFF: 15ms

Shared 10G pipe
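
The ON/OFF pattern above shows why averaging over seconds hides contention. The sketch below measures the UDP source's rate over windows of different lengths. The 5ms ON and 15ms OFF periods come from the slide; the 10Gbps burst rate and the function names are assumptions made for illustration.

```python
def udp_rate_gbps(t_ms: int, burst_gbps: float = 10.0,
                  on_ms: int = 5, off_ms: int = 15) -> float:
    """Rate of the ON/OFF source during millisecond t_ms."""
    return burst_gbps if t_ms % (on_ms + off_ms) < on_ms else 0.0

def peak_windowed_rate(window_ms: int, total_ms: int = 1000) -> float:
    """Highest average rate over any aligned window of window_ms
    within the first total_ms milliseconds."""
    peaks = []
    for start in range(0, total_ms, window_ms):
        samples = [udp_rate_gbps(t) for t in range(start, start + window_ms)]
        peaks.append(sum(samples) / window_ms)
    return max(peaks)

# Compare what a 1ms, 5ms, 20ms and 1s measurement window would report.
for window in (1, 5, 20, 1000):
    print(f"{window:>5} ms window: peak {peak_windowed_rate(window)} Gbps")
```

With these numbers a one-second average reports a quarter of the burst rate, while a millisecond-scale measurement sees the full burst that the TCP flow actually has to compete with.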

Page 7

Under the hood

Switch

Page 8

Why is it hard? (2)

Switch

• Switch sees contention, but lacks VM state
• Receiver-host has VM state, but does not see contention

(1) Drops in network: servers don’t see true demand

(2) Elusive TCP (back-off) makes true demand detection harder

Page 9

Key Idea: Bandwidth Headroom

• Bandwidth guarantees: managing congestion
• Congestion: link util reaches 100%
  – At millisecond timescales
• Don’t allow 100% util
  – 10% headroom: Early detection at receiver
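
The headroom idea can be sketched as a receiver-side check: measure the arrival rate over short intervals and flag congestion as soon as it crosses 90% of the link rate, before the link itself saturates. The 10% headroom is from the slide; the 1ms measurement interval, the 10Gbps link rate, and the function name `congested_intervals` are assumptions.

```python
def congested_intervals(bytes_per_interval, interval_ms=1.0,
                        link_gbps=10.0, headroom=0.10):
    """Return the indices of measurement intervals whose arrival rate
    exceeds the headroom threshold (link rate minus headroom)."""
    threshold_gbps = link_gbps * (1.0 - headroom)
    flagged = []
    for i, nbytes in enumerate(bytes_per_interval):
        # bits per millisecond divided by 1e6 gives Gbps
        rate_gbps = nbytes * 8 / (interval_ms * 1e6)
        if rate_gbps > threshold_gbps:
            flagged.append(i)
    return flagged

# Bytes received in three consecutive 1ms intervals:
# 8.0Gbps, 9.6Gbps and 4.0Gbps respectively.
samples = [1_000_000, 1_200_000, 500_000]
print(congested_intervals(samples))
```

Only the middle interval is flagged here: it is above the 9Gbps limit but still below the 10Gbps line rate, so the receiver can react before packets are dropped in the network.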

N x 10G

UDP

TCP

Shared pipe
Limit to 9G

Single Switch: Headroom

What about a network?

Page 10

Network design: the old

http://bradhedlund.com/2012/04/30/network-that-doesnt-suck-for-cloud-and-big-data-interop-2012-session-teaser/

Over-subscription

Page 11

Network design: the new

http://bradhedlund.com/2012/04/30/network-that-doesnt-suck-for-cloud-and-big-data-interop-2012-session-teaser/

(1) Uniform capacity across racks

(2) Over-subscription only at Top-of-Rack

Page 12

Mitigating Congestion in a Network

Load balancing + Admissibility = Hotspot-free network core

[VL2, FatTree, Hedera, MicroTE]

Aggregate rate > 10Gbps: Fabric gets congested
(Server, VM, 10Gbps pipe, Fabric)

Aggregate rate < 10Gbps: Congestion-free Fabric
(Server, VM, 10Gbps pipe, Fabric)

Load balancing: ECMP, etc.

Admissibility: e2e congestion control (EyeQ)

Page 13

EyeQ Platform

[Figure: EyeQ platform]

TX: untrusted VMs → TX packets → Software VSwitch (Adaptive Rate Limiters)

RX: RX packets → Software VSwitch (Congestion Detectors; 3Gbps, 6Gbps) → untrusted VMs

RX component detects, TX component reacts

End-to-end flow control (VSwitch to VSwitch)

Congestion Feedback across the DataCentre Fabric
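
The detect-and-react loop can be illustrated with a toy simulation. The slides do not give EyeQ's actual control law, so the rule used here (scale senders down proportionally when the receiver sees more than its target, otherwise let each sender probe upward by a fixed step) is purely an assumption, as are the function name `simulate` and every parameter value.

```python
def simulate(sender_rates_gbps, target_gbps=9.0, step_gbps=0.1, rounds=50):
    """Run a toy receiver-detects / sender-reacts loop and return the
    final per-sender rates. Not EyeQ's real algorithm."""
    rates = list(sender_rates_gbps)
    for _ in range(rounds):
        aggregate = sum(rates)               # RX side: measure arrival rate
        if aggregate > target_gbps:          # congestion detected
            scale = target_gbps / aggregate  # feedback sent to the TX side
            rates = [r * scale for r in rates]
        else:                                # no feedback: probe for more
            rates = [r + step_gbps for r in rates]
    return rates

# Three senders that each start at 10Gbps towards one receiver.
final = simulate([10.0, 10.0, 10.0])
print([round(r, 2) for r in final], round(sum(final), 2))
```

In this toy model the aggregate settles into a narrow band around the 9Gbps target after the first round of feedback, which is the behaviour the RX/TX split is meant to achieve; the real system's convergence and fairness properties depend on its actual control law.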

Page 14

Does it work?

Without EyeQ With EyeQ

Improves utilisation

Provides protection

TCP: 6Gbps
UDP: 3Gbps

Page 15

State: only at edge

EyeQ

One Big Switch

Page 16

EyeQ
Load balancing
+ Bandwidth headroom
+ Admissibility at millisec timescales
= Network as one big switch
= Bandwidth sharing at edge

Linux, Windows implementation for 10Gbps
~1700 lines C code
http://github.com/jvimal/perfiso_10g (Linux kmod)
No documentation, yet.
