Page 1: Congestion Control

1

Congestion Control

EE122 Fall 2012

Scott Shenker
http://inst.eecs.berkeley.edu/~ee122/

Materials with thanks to Jennifer Rexford, Ion Stoica, Vern Paxson and other colleagues at Princeton and UC Berkeley

Page 2: Congestion Control

Announcements

• Project 3 is out!

2

Page 3: Congestion Control

A few words from Panda….

3

Page 4: Congestion Control

4

Congestion Control Review

Did not have slides last time

Going to review key points

Page 5: Congestion Control

Caveat: In this lecture

• Sometimes CWND is in units of MSS's
  – Because I want to count CWND in small integers
  – This is only for pedagogical purposes

• Sometimes CWND is in bytes
  – Because we actually are keeping track of real windows
  – This is how TCP code works

• Figure it out from context….

5

Page 6: Congestion Control

6

Load and Delay

[Figure: average packet delay vs. load, and average packet loss vs. load, for a typical queuing system with bursty arrivals]

Must balance utilization versus delay and loss

Page 7: Congestion Control

Not All Losses the Same

• Duplicate ACKs: isolated loss
  – Still getting ACKs

• Timeout: possible disaster
  – Not enough dupacks
  – Must have suffered several losses

7

Page 8: Congestion Control

AIMD

• Additive increase
  – On success of last window of data, increase by one MSS

• Multiplicative decrease
  – On loss of packet, divide congestion window in half
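As a minimal sketch of the two rules above (illustrative names; real TCP stacks track far more state):

```python
# Minimal AIMD sketch; cwnd is kept in bytes, MSS value is an assumption.
MSS = 1460  # bytes

class AIMDWindow:
    def __init__(self):
        self.cwnd = 1 * MSS  # congestion window in bytes

    def on_window_acked(self):
        # Additive increase: last full window delivered successfully.
        self.cwnd += MSS

    def on_loss(self):
        # Multiplicative decrease: halve the window, never below 1 MSS.
        self.cwnd = max(MSS, self.cwnd // 2)
```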

8

Page 9: Congestion Control

9

Leads to the TCP “Sawtooth”

[Figure: window vs. time; the window grows, is halved at each loss, and grows again, producing the sawtooth]

Page 10: Congestion Control

Simple geometric analysis

[Figure: cwnd (in MSS) vs. time in RTTs; the sawtooth oscillates between W_max/2 and W_max, with a loss/timeout at the peak of each cycle]
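Filling in the geometric analysis the figure sketches (a rough derivation, assuming the window is counted in MSS units and exactly one loss per cycle); it yields the throughput formula quoted later on the high-speed slide:

```latex
% One sawtooth cycle: cwnd climbs from W_max/2 to W_max, then a loss halves it.
% Cycle length: W_max/2 RTTs. Packets delivered per cycle (area under the sawtooth):
\[
\tfrac{1}{2}\!\left(\tfrac{W_{\max}}{2} + W_{\max}\right)\cdot \tfrac{W_{\max}}{2}
  \;=\; \tfrac{3}{8}W_{\max}^{2}
\quad\Rightarrow\quad
p \approx \frac{1}{\tfrac{3}{8}W_{\max}^{2}}
\quad\Rightarrow\quad
W_{\max} = \sqrt{\tfrac{8}{3p}}
\]
\[
\text{Average throughput} \approx \frac{\tfrac{3}{4}W_{\max}\cdot \mathrm{MSS}}{\mathrm{RTT}}
 \;=\; \frac{\mathrm{MSS}}{\mathrm{RTT}}\sqrt{\frac{3}{2p}}
\]
```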

Page 11: Congestion Control

11

AIMD Starts Too Slowly!

[Figure: window vs. time; pure additive increase from a small window ramps up very slowly]

It could take a long time to get started!

Need to start with a small CWND to avoid overloading the network.

Page 12: Congestion Control

“Slow-Start” Phase

• Start with a small congestion window
  – Initially, CWND is 1 MSS
  – So, initial sending rate is MSS/RTT

• But want to increase quickly
  – Rather than just use additive increase…
  – …we enter “slow-start” phase (actually “fast start”)

• Sender starts at a slow rate (hence the name)
  – but increases exponentially until first loss

12

Page 13: Congestion Control

13

Slow Start in Action

Double CWND per round-trip time

Simple implementation: on each ack, CWND += MSS

[Figure: source and destination exchanging data (D) and ACK (A) packets; the number of packets per RTT grows 1, 2, 4, 8 over successive round trips]
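A sketch of the per-ACK rule quoted above (MSS value and function name are illustrative):

```python
# Slow-start sketch: each ACK grows CWND by one MSS. A full window generates
# cwnd/MSS ACKs, so the window doubles every round-trip time until the first loss.
MSS = 1460  # bytes, assumed

def slow_start_on_ack(cwnd: int) -> int:
    return cwnd + MSS
```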

Page 14: Congestion Control

14

Slow Start and the TCP Sawtooth

[Figure: window vs. time; an exponential “slow start” ramp up to the first loss, followed by the AIMD sawtooth]

Why is it called slow-start? Because TCP originally had no congestion control mechanism. The source would just start by sending a whole window’s worth of data.

Page 15: Congestion Control

What it really looks like…

15

Page 16: Congestion Control

16

Congestion Control Details

Page 17: Congestion Control

17

Increasing CWND

• Increase by MSS for every successful window

• Increase a fraction of MSS per received ACK
• # packets (thus ACKs) per window: CWND / MSS
• Increment per ACK:

  CWND += MSS / (CWND / MSS)

• Termed: Congestion Avoidance
  – Very gentle increase
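A sketch of this congestion-avoidance increment (assumed names; CWND in bytes, as in real TCP code):

```python
# Congestion-avoidance sketch: each ACK adds MSS/(cwnd/MSS) = MSS*MSS/cwnd bytes.
# With cwnd/MSS ACKs per window, the net growth is about one MSS per RTT.
MSS = 1460  # bytes, assumed

def congestion_avoidance_on_ack(cwnd: int) -> int:
    return cwnd + (MSS * MSS) // cwnd
```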

Page 18: Congestion Control

18

Fast Retransmission

• Sender sees 3 dupACKs

• Multiplicative decrease: CWND halved

Page 19: Congestion Control

19

CWND with Fast Retransmit

[Figure: time-sequence diagram; cwnd grows 1, 2, 3, 4 as segments 1-7 are sent. Segment 4 is lost, so segments 5-7 each trigger a duplicate ACK 4. After 3 duplicate ACKs the sender retransmits segment 4 and cwnd is halved to 2.]

Page 20: Congestion Control

20

Loss Detected by Timeout

• Sender starts a timer that runs for RTO seconds
• Restart timer whenever ack for new data arrives

• If timer expires:
  – Set SSTHRESH ← CWND / 2 (“Slow-Start Threshold”)
  – Set CWND ← 1 MSS
  – Retransmit first lost packet
  – Execute Slow Start until CWND > SSTHRESH
  – After which switch to Additive Increase
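A sketch of these timeout steps, with assumed names and starting values:

```python
# Timeout handling sketch following the steps on this slide (illustrative only).
MSS = 1460  # bytes, assumed

class TcpSender:
    def __init__(self):
        self.cwnd = 10 * MSS      # assumed current window
        self.ssthresh = 64 * 1024 # assumed initial threshold

    def on_rto_expired(self):
        self.ssthresh = self.cwnd // 2   # remember half the old window
        self.cwnd = 1 * MSS              # collapse to one segment
        self.retransmit_first_unacked()  # resend the presumed-lost packet
        # The sender then slow-starts until cwnd > ssthresh,
        # after which it switches to additive increase.

    def retransmit_first_unacked(self):
        pass  # placeholder: real code resends the oldest unacked segment
```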

Page 21: Congestion Control

21

Summary of Decrease

• Cut CWND in half on loss detected by dupacks
  – “fast retransmit”

• Cut CWND all the way to 1 MSS on timeout
  – Set ssthresh to cwnd/2

• Never drop CWND below 1 MSS

Page 22: Congestion Control

Summary of Increase

• “Slow-start”: increase cwnd by MSS for each ack

• Leave slow-start regime when either:
  – cwnd > SSThresh
  – Packet drop

• Enter AIMD regime
  – Increase by MSS for each window’s worth of acked data

22

Page 23: Congestion Control

23

Repeating Slow Start After Timeout

[Figure: window vs. time; after a fast retransmission the sawtooth continues, but after a timeout CWND drops to 1 MSS and slow start runs until it reaches SSTHRESH, which was set to half the previous CWND]

Slow-start restart: Go back to CWND of 1 MSS, but take advantage of knowing the previous value of CWND.

Slow start stays in operation until it reaches half of the previous CWND, i.e., SSTHRESH.

Page 24: Congestion Control

More Advanced Fast Restart

• Set ssthresh to cwnd/2

• Set cwnd to cwnd/2 + 3
  – for the 3 dup acks already seen

• Increment cwnd by 1 MSS for each additional duplicate ACK

• After receiving new ACK, reset cwnd to ssthresh
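A sketch of this fast-restart (fast-recovery) logic, with assumed names; cwnd here is in bytes:

```python
# Fast-retransmit / fast-recovery sketch following the steps on this slide.
MSS = 1460  # bytes, assumed

class FastRecovery:
    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.ssthresh = cwnd
        self.dupacks = 0

    def on_dup_ack(self):
        self.dupacks += 1
        if self.dupacks == 3:
            self.ssthresh = self.cwnd // 2
            self.cwnd = self.ssthresh + 3 * MSS  # +3 for the dupacked segments already gone
            # retransmit the missing segment here
        elif self.dupacks > 3:
            self.cwnd += MSS                     # each extra dupack = one more packet left the network

    def on_new_ack(self):
        if self.dupacks >= 3:
            self.cwnd = self.ssthresh            # deflate back to ssthresh
        self.dupacks = 0
```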

24

Page 25: Congestion Control

Example

• Consider a TCP connection with:
  – MSS = 10 bytes
  – ISN = 100
  – CWND = 100 bytes
  – Last ACK was for seq # 110
    i.e., receiver expecting next packet to have seq. no. 110

• Packets with seq. no. 110 to 200 are in flight
  – What ACKs do they generate?
  – And how does the sender respond?

25

Page 26: Congestion Control

History

• ACK 110 (due to 120)  cwnd=100  dup#1
• ACK 110 (due to 130)  cwnd=100  dup#2
• ACK 110 (due to 140)  cwnd=100  dup#3
• RXMT 110              ssthresh=50  cwnd=80
• ACK 110 (due to 150)  cwnd=90
• ACK 110 (due to 160)  cwnd=100
• ACK 110 (due to 170)  cwnd=110  xmit 210
• ACK 110 (due to 180)  cwnd=120  xmit 220

26

Page 27: Congestion Control

History (cont’d)

• ACK 110 (due to 190)        cwnd=130  xmit 230
• ACK 110 (due to 200)        cwnd=140  xmit 240
• ACK 210 (due to 110 rxmit)  cwnd=ssthresh=50  xmit 250
• ACK 220 (due to 210)        cwnd=60
• …

27

Page 28: Congestion Control

28

Why AIMD?

Page 29: Congestion Control

Four alternatives

• AIAD: gentle increase, gentle decrease

• AIMD: gentle increase, drastic decrease

• MIAD: drastic increase, gentle decrease
  – too many losses: eliminate

• MIMD: drastic increase and decrease

29

Page 30: Congestion Control

30

AIMD Sharing Dynamics

[Figure: two flows, x1 (A to B) and x2 (D to E), sharing a bottleneck; their sending rates oscillate but converge toward the fair share over time]

No congestion: rate increases by one packet/RTT every RTT. Congestion: decrease rate by a factor of 2.

Rates equalize at the fair share.

Page 31: Congestion Control

31

AIAD Sharing Dynamics

[Figure: the same two flows, x1 and x2; with AIAD their rates oscillate in parallel and the initial gap never closes, so they do not converge to the fair share]

No congestion: x increases by one packet/RTT every RTT. Congestion: decrease x by 1.
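A toy simulation of the two sharing figures above; the capacity, starting rates, and round count are arbitrary assumptions, but the qualitative outcome matches the slides: AIMD converges toward the fair share, while AIAD preserves the initial gap.

```python
# Two flows share one link. Both grow by 1 packet/RTT when uncongested;
# on congestion, AIMD halves each rate while AIAD subtracts 1.
CAPACITY = 60  # packets/RTT, illustrative

def simulate(decrease, x1=10, x2=50, rounds=500):
    for _ in range(rounds):
        if x1 + x2 > CAPACITY:            # congestion: both flows back off
            x1, x2 = decrease(x1), decrease(x2)
        else:                             # no congestion: additive increase
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2

aimd = simulate(lambda x: x / 2)          # rates converge toward the fair share
aiad = simulate(lambda x: max(x - 1, 0))  # the 40-packet gap never closes
print("AIMD:", aimd, "AIAD:", aiad)
```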

Page 32: Congestion Control

32

Other Congestion Control Topics

Page 33: Congestion Control

TCP fills up queues

• Means that delays are large for everyone

• And when you do fill up queues, many packets have to be dropped (not really)

• Alternative: Random Early Drop (LBL)
  – Drop packets on purpose before queue is full
  – Set drop probability D as a function of queue size
  – Keep queue average small, but tolerate bursts
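A sketch of a RED-style drop decision (thresholds, maximum probability, and the averaging weight are illustrative, not tuned values):

```python
# RED-style sketch: drop probability ramps linearly between two thresholds on the
# *average* queue size, so short bursts pass but a persistently full queue is trimmed.
import random

MIN_TH, MAX_TH, MAX_P, WEIGHT = 5, 15, 0.1, 0.002  # assumed parameters
avg_q = 0.0

def should_drop(current_queue_len: int) -> bool:
    global avg_q
    avg_q = (1 - WEIGHT) * avg_q + WEIGHT * current_queue_len  # EWMA of queue size
    if avg_q < MIN_TH:
        return False
    if avg_q >= MAX_TH:
        return True
    p = MAX_P * (avg_q - MIN_TH) / (MAX_TH - MIN_TH)           # linear ramp
    return random.random() < p
```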

33

Page 34: Congestion Control

What if loss isn’t congestion-related?

• Can use Explicit Congestion Notification (ECN)

• Bit in IP packet header that is carried up to TCP

• When a RED router would drop, it sets the bit instead
  – Congestion semantics of the bit exactly like that of a drop

• Advantages:
  – Don’t confuse corruption with congestion
  – Don’t confuse recovery with rate adjustment

34

Page 35: Congestion Control

How does this work at high speed?

• Throughput = (MSS/RTT) × sqrt(3/(2p))
  – Assume that RTT = 100 ms, MSS = 1500 bytes

• What value of p is required to go 100 Gbps?
  – Roughly 2 × 10^-12

• How long between drops?
  – Roughly 16.6 hours

• How much data has been sent in this time?
  – Roughly 6 petabits

• These are not practical numbers!
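A quick back-of-the-envelope check of these numbers, assuming the throughput formula above with MSS = 1500 bytes and RTT = 100 ms:

```python
# Invert Throughput = (MSS/RTT) * sqrt(3/(2p)) to get p, then derive the rest.
MSS_BITS = 1500 * 8
RTT = 0.100                     # seconds
TARGET = 100e9                  # 100 Gbps

p = 1.5 / (TARGET * RTT / MSS_BITS) ** 2        # p = (3/2) / (T*RTT/MSS)^2
pkts_per_sec = TARGET / MSS_BITS
hours_between_drops = (1 / p) / pkts_per_sec / 3600
petabits_between_drops = TARGET * hours_between_drops * 3600 / 1e15

print(f"p ~ {p:.1e}, ~{hours_between_drops:.1f} h between drops, "
      f"~{petabits_between_drops:.0f} Pb sent per drop")
# -> p on the order of 2e-12, roughly 15-17 hours and ~6 petabits between drops
```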

Page 36: Congestion Control

Adapting TCP to High Speed

• One approach: once speed is past some threshold, change the equation to p^-0.8 rather than p^-0.5
  – Let the additive constant in AIMD depend on CWND
  – At very high speeds, increase CWND by more than MSS

• We will discuss other approaches later…

36

Page 37: Congestion Control

How “Fair” is TCP?

• Throughput depends inversely on RTT

• If you open K TCP flows, you get K times more bandwidth!

• What is fair, anyway?

37

Page 38: Congestion Control

What happens if hosts “cheat”?

• Can get more bandwidth by being more aggressive
  – Source can set CWND += 2 MSS upon success
  – Gets much more bandwidth (see forthcoming HW4)

• Currently we require all congestion-control protocols to be “TCP-Friendly”
  – To use no more than TCP does in a similar setting

• But the Internet remains vulnerable to non-friendly implementations
  – Need router support to deal with this…

38

Page 39: Congestion Control

Router-Assisted Congestion Control

• There are two different tasks:
  – Isolation/fairness
  – Adjustment

• Isolation/fairness:
  – We would like to make sure each flow gets its “fair share”
  – This protects flows from cheaters (a safety/security issue)
  – No longer requires everyone to use the same CC algorithm (an innovation issue)

• Adjustment:
  – Can routers help flows find the right sending rate?

39

Page 40: Congestion Control

Isolation: Intuition

• Treat each “flow” separately
  – For now, flows are packets between same Source/Dest.

• Each flow has its own FIFO queue in router

• Service flows in a round-robin fashion
  – When line becomes free, take packet from next flow

• Assuming all flows are sending MTU packets, all flows can get their fair share
  – But what if not all are sending at full rate?
  – And some are sending at more than their share?

40

Page 41: Congestion Control

Max-Min Fairness

• Given a set of bandwidth demands r_i and total bandwidth C, the max-min bandwidth allocations are:

  a_i = min(f, r_i)

• where f is the unique value such that Sum(a_i) = C

• This is what round-robin service gives
  – if all packets are MTUs

• Property:
  – If you don’t get full demand, no one gets more than you

41

Page 42: Congestion Control

42

Example

• C = 10; r_1 = 8, r_2 = 6, r_3 = 2; N = 3

• C/3 = 3.33
  – Can service all of r_3
  – Remove r_3 from the accounting: C = C – r_3 = 8; N = 2

• C/2 = 4
  – Can’t service all of r_1 or r_2
  – So hold them to the remaining fair share: f = 4

[Figure: a link of capacity 10 shared by demands 8, 6, 2 receiving allocations 4, 4, 2]

f = 4: min(8, 4) = 4, min(6, 4) = 4, min(2, 4) = 2
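A water-filling sketch of the max-min allocation defined above (a_i = min(f, r_i) with Sum(a_i) = C), checked against this example:

```python
# Water-filling: serve the smallest demands fully, split what remains equally.
def max_min_share(C: float, demands: list[float]) -> list[float]:
    remaining = C
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    alloc = [0.0] * len(demands)
    for pos, i in enumerate(order):
        fair = remaining / (len(demands) - pos)   # equal split of what is left
        alloc[i] = min(demands[i], fair)          # small demands get fully served
        remaining -= alloc[i]
    return alloc

print(max_min_share(10, [8, 6, 2]))  # -> [4.0, 4.0, 2.0], i.e. f = 4 as in the example
```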

Page 43: Congestion Control

43

Fair Queuing (FQ)

• Implementation of round-robin generalized to the case where not all packets are MTUs

• Weighted fair queueing (WFQ) lets you assign different flows different shares

• WFQ is implemented in almost all routers
  – Variations in how it is implemented: packet scheduling (here), or just packet dropping (AFD)
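The slide leaves the implementation open; one common way to approximate round-robin when packets are not all MTU-sized is Deficit Round Robin, sketched here with an assumed per-flow quantum:

```python
# Deficit Round Robin sketch: each flow banks a byte "quantum" per round and may
# send packets while its accumulated deficit covers the next packet's size.
from collections import deque

QUANTUM = 1500  # bytes per flow per round, assumed

def drr_schedule(queues: dict, send) -> None:
    """queues: flow_id -> deque of packet sizes (bytes); send(flow, size) transmits one packet."""
    deficit = {flow: 0 for flow in queues}
    while any(queues.values()):
        for flow, q in queues.items():
            if not q:
                continue
            deficit[flow] += QUANTUM
            while q and q[0] <= deficit[flow]:
                deficit[flow] -= q[0]
                send(flow, q.popleft())
            if not q:
                deficit[flow] = 0  # an idle flow does not bank credit
```

Because the quantum is in bytes, a flow sending small packets gets to send more of them per round, which is exactly the generalization of round-robin the slide describes.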

Page 44: Congestion Control

With FQ Routers

• Flows can pick whatever CC scheme they want
  – Can open up as many TCP connections as they want

• There is no such thing as a “cheater”
  – To first order…

• Bandwidth share does not depend on RTT

• Does require complexity in the router
  – Cheating not a problem, so there’s little motivation
  – But WFQ is used at larger granularities

Page 45: Congestion Control

FQ is really “processor sharing”

• Every current flow gets the same service

• When flows end, other flows pick up the extra service

• FQ realizes these rates through packet scheduling

• But we could just assign them directly
  – This is the Rate-Control Protocol (RCP) [Stanford]
  – A follow-on to XCP (MIT/ICSI)

45

Page 46: Congestion Control

RCP Algorithm

• Packets carry a “rate field”

• Routers insert “fair share” f in the packet header
  – Router inserts f only if it is smaller than the current value

• Routers calculate f by keeping the link fully utilized
  – Remember the basic equation: Sum(Min[f, r_i]) = C
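A sketch of the per-packet stamping step described above (the header layout and names are assumptions, and the periodic computation of f itself is omitted):

```python
# Each router along the path stamps its current fair-share estimate f into the
# header only if it is smaller than the value already there, so the packet ends
# up carrying the bottleneck link's rate back to the sender.
from dataclasses import dataclass

@dataclass
class RcpHeader:
    rate: float  # allowed sending rate, e.g. in bits per second

def stamp_fair_share(pkt: RcpHeader, local_f: float) -> None:
    pkt.rate = min(pkt.rate, local_f)  # the bottleneck's f wins end to end
```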

46

Page 47: Congestion Control

Fair Sharing is more than a moral issue

• By what metric should we evaluate CC?

• One metric: average flow completion time (FCT)

• Let’s compare FCT with RCP and TCP
  – Ignore XCP curve…

47

Page 48: Congestion Control

48

Flow Completion Time: TCP vs. PS (and XCP)

[Figure: two plots; flow duration (secs) vs. flow size, and number of active flows vs. time]

Page 49: Congestion Control

Why the improvement?

Page 50: Congestion Control

50

Why is Scott a Moron?

Or why does Bob Briscoe think so?

Page 51: Congestion Control

Giving equal shares to “flows” is silly

• What if you have 8 flows, and I have 4?
  – Why should you get twice the bandwidth?

• What if your flow goes over 4 congested hops, and mine only goes over 1?
  – Why shouldn’t you be penalized for using more scarce bandwidth?

• And what is a flow anyway?
  – TCP connection?
  – Source-Destination pair?
  – Source?

Page 52: Congestion Control

Charge people for congestion!

• Use ECN as congestion markers

• Whenever I get ECN bit set, I have to pay $$$

• Now, there’s no debate over what a flow is, or what fair is…

• Idea started by Frank Kelly, backed by much math
  – Great idea: simple, elegant, effective
  – Never going to happen…

52

Page 53: Congestion Control

53

Datacenter Networks

Page 54: Congestion Control

What makes them special?

• Huge scale:
  – 100,000s of servers in one location

• Limited geographic scope:
  – High bandwidth
  – Very low RTT

• Extreme latency requirements
  – With real money on the line

• Single administrative domain
  – No need to follow standards, or play nice with others

• Often “green field” deployment
  – So can “start from scratch”…

54

Page 55: Congestion Control

Deconstructing Datacenter Packet Transport

Mohammad Alizadeh, Shuang Yang, Sachin Katti, Nick McKeown, Balaji Prabhakar, Scott Shenker

Stanford University U.C. Berkeley/ICSI

HotNets 2012 55

Page 56: Congestion Control

Transport in Datacenters

• Latency is King
  – Web app response time depends on completion of 100s of small RPCs

• But, traffic also diverse
  – Mice AND Elephants
  – Often, elephants are the root cause of latency

[Figure: a large-scale web application; app-logic servers in an app tier and a data tier behind the fabric, with a single user request (“Alice: who does she know? what has she done?”) fanning out across many servers]

HotNets 2012 56

Page 57: Congestion Control

Transport in Datacenters

• Two fundamental requirements
  – High fabric utilization: good for all traffic, esp. the large flows
  – Low fabric latency (propagation + switching): critical for latency-sensitive traffic

• Active area of research
  – DCTCP [SIGCOMM’10], D3 [SIGCOMM’11], HULL [NSDI’11], D2TCP [SIGCOMM’12], PDQ [SIGCOMM’12], DeTail [SIGCOMM’12]
  – These vastly improve performance, but are fairly complex

HotNets 2012 57

Page 58: Congestion Control

pFabric in 1 Slide

HotNets 2012

Packets carry a single priority #
  • e.g., prio = remaining flow size

pFabric Switches
  • Very small buffers (e.g., 10-20KB)
  • Send highest priority / drop lowest priority pkts

pFabric Hosts
  • Send/retransmit aggressively
  • Minimal rate control: just prevent congestion collapse

58

Page 59: Congestion Control

DC Fabric: Just a Giant Switch!

HotNets 2012

[Figure: hosts H1-H9 connected by the datacenter fabric]

59

Page 60: Congestion Control

HotNets 2012

DC Fabric: Just a Giant Switch!

[Figure: the same hosts H1-H9 and fabric]

60

Page 61: Congestion Control

DC Fabric: Just a Giant Switch!

[Figure: the fabric abstracted as one big switch, with hosts H1-H9 appearing on both the TX (ingress) side and the RX (egress) side]

61

Page 62: Congestion Control

HotNets 2012

DC Fabric: Just a Giant Switch!

[Figure: the same giant-switch abstraction, with hosts H1-H9 on the TX and RX sides]

Page 63: Congestion Control

[Figure: the giant-switch abstraction, with hosts H1-H9 on the TX and RX sides]

HotNets 2012

Objective? Minimize avg FCT

DC transport = Flow scheduling on giant switch

ingress & egress capacity constraints


Page 64: Congestion Control

“Ideal” Flow Scheduling

• Problem is NP-hard [Bar-Noy et al.]
  – Simple greedy algorithm: 2-approximation

[Figure: three ingress ports and three egress ports of the giant switch, with flows to be scheduled between them]
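A sketch of one plausible greedy pass over the giant-switch model: grant line rate to the smallest remaining flow first, subject to the ingress/egress constraints. This is illustrative and not necessarily the exact algorithm analyzed by Bar-Noy et al.:

```python
# Greedy "shortest remaining flow first" round on the giant-switch model.
def greedy_round(flows):
    """flows: list of (remaining_size, ingress, egress); returns flows granted this round."""
    busy_in, busy_out, granted = set(), set(), []
    for size, src, dst in sorted(flows):                 # smallest remaining size first
        if src not in busy_in and dst not in busy_out:   # both ports must be free
            granted.append((size, src, dst))
            busy_in.add(src)
            busy_out.add(dst)
    return granted
```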

64

Page 65: Congestion Control

HotNets 2012

pFabric Design

65

Page 66: Congestion Control

pFabric Switch

HotNets 2012

[Figure: a switch port holding a small “bag” of packets per port, each tagged with a priority (e.g., 7, 1, 9, 4, 3, 5); prio = remaining flow size]

Priority Scheduling: send higher priority packets first
Priority Dropping: drop low priority packets first
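A sketch of a pFabric-style port following the two rules above (buffer size and data structures are assumptions; a real switch does this in hardware):

```python
# Tiny per-port buffer: drop the lowest-priority packet when full,
# transmit the highest-priority packet first (smaller number = higher priority,
# i.e. less remaining flow size).
class PFabricPort:
    def __init__(self, capacity_pkts: int = 12):   # "small bag" of packets, assumed size
        self.buf = []                                # list of (prio, packet)
        self.capacity = capacity_pkts

    def enqueue(self, prio: int, pkt) -> None:
        self.buf.append((prio, pkt))
        if len(self.buf) > self.capacity:
            self.buf.remove(max(self.buf, key=lambda e: e[0]))  # drop lowest priority

    def dequeue(self):
        if not self.buf:
            return None
        best = min(self.buf, key=lambda e: e[0])     # send highest priority first
        self.buf.remove(best)
        return best[1]
```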

Page 67: Congestion Control

Near-Zero Buffers

• Buffers are very small (~1 BDP)
  – e.g., C = 10 Gbps, RTT = 15 µs → BDP = 18.75 KB
  – Today’s switch buffers are 10-30x larger

Priority Scheduling/Dropping Complexity

• Worst-case: minimum-size packets (64B)
  – 51.2 ns to find the min/max of ~300 numbers
  – Binary tree implementation takes 9 clock cycles
  – Current ASICs: clock = 1-2 ns
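The arithmetic behind the buffer-size and timing numbers above:

```latex
\[
\mathrm{BDP} = C \cdot \mathrm{RTT} = 10\,\mathrm{Gbps} \times 15\,\mu\mathrm{s}
             = 150{,}000\ \mathrm{bits} = 18.75\ \mathrm{KB},
\qquad
t_{64\,\mathrm{B}} = \frac{64 \times 8\ \mathrm{bits}}{10\,\mathrm{Gbps}} = 51.2\ \mathrm{ns}
\]
```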

HotNets 2012 67

Page 68: Congestion Control

pFabric Rate Control

• Priority scheduling & dropping in the fabric also simplifies rate control
  – Queue backlog doesn’t matter

HotNets 2012

[Figure: hosts H1-H9 on the fabric; when elephants collide at the same egress port, a large fraction of packets (50% loss in the figure) is dropped]

One task: Prevent congestion collapse when elephants collide

68

Page 69: Congestion Control

pFabric Rate Control

• Minimal version of TCP:
  1. Start at line-rate
     • Initial window larger than BDP
  2. No retransmission timeout estimation
     • Fix RTO near round-trip time
  3. No fast retransmission on 3 dupacks
     • Allow packet reordering

HotNets 2012 69

Page 70: Congestion Control

Why does this work?

Key observation: we need the highest-priority packet destined for a port to be available at that port at any given time.

• Priority scheduling → high-priority packets traverse the fabric as quickly as possible

• What about dropped packets?
  – Lowest priority → not needed until all other packets depart
  – Buffer larger than BDP → more than an RTT to retransmit

HotNets 2012 70

Page 71: Congestion Control

Evaluation

HotNets 2012

• 54-port fat-tree: 10 Gbps links, RTT ≈ 12 µs
• Realistic traffic workloads
  – Web search, data mining (from Alizadeh et al. [SIGCOMM 2010])

[Figure: flow-size distribution; 55% of flows (3% of bytes) are smaller than 100KB, while 5% of flows (35% of bytes) are larger than 10MB]

71

Page 72: Congestion Control

Evaluation: Mice FCT (<100KB)

HotNets 2012

[Figure: average and 99th-percentile FCT for mice flows; pFabric is near-ideal, with almost no jitter]

Page 73: Congestion Control

Evaluation: Elephant FCT (>10MB)

HotNets 2012

[Figure: elephant FCT vs. load; congestion collapse at high load without rate control]

73

Page 74: Congestion Control

Summary

pFabric’s entire design: near-ideal flow scheduling across the DC fabric

• Switches
  – Locally schedule & drop based on priority

• Hosts
  – Aggressively send & retransmit
  – Minimal rate control to avoid congestion collapse

HotNets 2012 74