Programmable Packet Scheduling
Stephen Ibanez, Nick McKeown, Gordon Brebner, Anthony Dalleggio, Anirudh Sivaraman
2/7/2019
What is Packet Scheduling?
[Diagram: a 3×3 switch; a scheduler sits at each output port]
Is Packet Scheduling Important?
Web Search Workload
˃ Goal: Minimize Flow Completion Time
˃ Shortest Remaining Processing Time (SRPT)
[1] Alizadeh, Mohammad, et al. "pFabric: Minimal near-optimal datacenter transport." ACM SIGCOMM, 2013.
[Figure 8 from [1]: Web search workload — normalized FCT vs. load for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal. Panels: (a) (0, 100KB]: Avg; (b) (0, 100KB]: 99th percentile; (c) (10MB, ∞): Avg. TCP-DropTail does not appear in part (b) because its performance is outside the plotted range, and the y-axis for part (c) has a different range than the other plots.]
[Figure 9 from [1]: Data mining workload — normalized FCT vs. load for the same schemes and panels. TCP-DropTail does not appear in part (b) because its performance is outside the plotted range.]
˃ Takeaway: SRPT-based scheduling (pFabric) delivers 10x or greater FCT improvements
Weighted Fair Queueing (WFQ)
˃ Bandwidth guarantees
˃ Fully utilize capacity
˃ Common Implementations:
Deficit Round Robin (DRR)
Start Time Fair Queueing (STFQ)
Stochastic Fair Queueing (SFQ)
˃ Not ideal for latency-sensitive traffic
[Diagram: WFQ sharing link bandwidth among Web, Video, Big Data, and Backup traffic with weights 0.4, 0.3, 0.1, 0.2]
Strict Priority
˃ For latency-sensitive traffic
˃ Commonly supported today
˃ Problems:
Only up to 8 priority levels per output port
Starvation of low priority traffic
[Diagram: strict priority classes — Control (HI), Memcached (MED), Other (LOW)]
Windowed Strict Priority
˃ Goal: Prioritize but avoid starvation
˃ Over time window of length T, serve at most N packets from class i if there are lower priority packets to be served
˃ Need to build a new switch ASIC for this
[Diagram: windowed strict priority — Control (HI), Memcached (MED), Other (LOW); window of length T, N = 3]
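To pin down the policy, here is a minimal software sketch of windowed strict priority. It is my illustration, not a switch ASIC design; for simplicity the window T is counted in served packets rather than wall-clock time, and class 0 is the highest priority.

```python
from collections import deque

class WindowedStrictPriority:
    """Strict priority, except: within each window, serve at most N packets
    from a class while lower-priority packets are waiting."""

    def __init__(self, num_classes, N, T):
        self.queues = [deque() for _ in range(num_classes)]  # index 0 = HI
        self.N, self.T = N, T
        self.served = [0] * num_classes   # per-class packets served this window
        self.window_pkts = 0              # packets served in the current window

    def enqueue(self, cls, pkt):
        self.queues[cls].append(pkt)

    def dequeue(self):
        for cls, q in enumerate(self.queues):
            if not q:
                continue
            lower_waiting = any(self.queues[c] for c in range(cls + 1, len(self.queues)))
            if lower_waiting and self.served[cls] >= self.N:
                continue                  # quota used up: let lower classes through
            self.served[cls] += 1
            self.window_pkts += 1
            if self.window_pkts >= self.T:          # window over: reset quotas
                self.window_pkts = 0
                self.served = [0] * len(self.served)
            return q.popleft()
        return None
```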
Modern Programmable Switch
[Diagram: Programmable Parser → Programmable Ingress Pipeline → Traffic Manager (still fixed function) → Programmable Egress Pipeline → Programmable Deparser]
Motivation: be able to deploy new scheduling policies in production networks
Questions:
1. Can we find a programmable abstraction for scheduling policies?
2. Can our abstraction be efficiently implemented in hardware?
Packet Scheduler
[Diagram: scheduling logic selecting among 100s of queues on a 5 Tb/s switch]
One decision every: 64 B / 5 Tb/s ≈ 100 ps
Observations:
1. Switches are great at making per-packet decisions
2. Virtually all scheduling policies can decide a packet's scheduling priority before it is queued
Push-In-First-Out (PIFO) Model [1]
˃ Key constraint: packets can’t change rank after insert
˃ Clear separation of fixed and programmable logic
˃ Can implement virtually every scheduling policy that we care about today
[Diagram: programmable rank computation in front of a fixed PIFO holding ranks 0, 3, 6, 7, 8; a packet with rank 4 is pushed into its sorted position. Packets are always dequeued from the head.]
[1] A. Sivaraman, et al. "Programmable packet scheduling at line rate." ACM SIGCOMM 2016
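As a mental model (not the hardware design from [1]), a PIFO is just a priority queue that fixes a packet's rank at push time and only ever pops the head. A minimal sketch in Python, breaking rank ties in FIFO order; the packet field names at the bottom are illustrative:

```python
import heapq
from itertools import count

class PIFO:
    """Push-In-First-Out queue: push into sorted position, pop only the head."""

    def __init__(self, rank_fn):
        self.rank_fn = rank_fn   # the programmable part: packet -> rank
        self.heap = []
        self.seq = count()       # FIFO tie-break among equal ranks

    def push(self, pkt):
        rank = self.rank_fn(pkt)                 # rank is fixed at enqueue
        heapq.heappush(self.heap, (rank, next(self.seq), pkt))

    def pop(self):
        _, _, pkt = heapq.heappop(self.heap)     # always the smallest rank
        return pkt

# The next slide's examples, as rank functions:
fifo = PIFO(lambda p: p["arrival_time"])   # FIFO: rank = arrival time
prio = PIFO(lambda p: p["tos"])            # Strict priority: rank = ToS field
```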
PIFO Examples
˃ FIFO — rank computation: p.rank = now
  [PIFO holds ranks 0, 1, 2, 3; an arriving packet gets rank 4 and joins the tail]
˃ Strict Priority — rank computation: p.rank = p.tos
  [PIFO holds ranks 0, 0, 1, 1; an arriving rank-0 packet is pushed in ahead of the rank-1 packets]
WFQ using PIFO
˃ If rank computations can utilize state …
˃ Start time fair queueing (STFQ)
Rank computation:
f = flow(p)
p.start = max(T[f].finish, virtual_time)
T[f].finish = p.start + p.len
p.rank = p.start
[Diagram: fixed PIFO holding ranks 0, 1, 2, 3, 5; a packet with rank 4 is pushed in]
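The same pseudocode in runnable form (a sketch; per the later "Implementation Concern" slides, virtual_time is advanced to the rank of each dequeued packet):

```python
from collections import defaultdict

class STFQRank:
    """Start Time Fair Queueing rank computation for a PIFO."""

    def __init__(self):
        self.finish = defaultdict(int)   # T[f].finish, zero for new flows
        self.virtual_time = 0            # advanced on dequeue

    def rank(self, flow_id, pkt_len):
        start = max(self.finish[flow_id], self.virtual_time)  # p.start
        self.finish[flow_id] = start + pkt_len                # T[f].finish
        return start                                          # p.rank = p.start

    def on_dequeue(self, head_rank):
        self.virtual_time = head_rank    # virtual_time = p.rank
```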
Shortest Remaining Processing Time with PIFO
˃ Packet rank set by end host
Rank computation:
f = flow(p)
p.rank = f.rem_size
[Diagram: rank computation feeding a PIFO scheduler holding ranks 1, 3, 7, 8, 9]
Fine grained priorities
˃ Shortest Flow First (SFF)
˃ Least Slack Time First (LSTF)
˃ Earliest Deadline First (EDF)
˃ Shortest Remaining Processing Time (SRPT)
˃ Service Curve Earliest Deadline first (SCED)
˃ Least Attained Service (LAS)
Hierarchical Scheduling
˃ Hierarchical Packet Fair Queueing (HPFQ)
˃ Cannot be expressed with a single PIFO
[Diagram: HPFQ tree — root WFQ over Red (0.5) and Blue (0.5); Red WFQ over a (0.01) and b (0.99); Blue WFQ over x (0.5) and y (0.5)]
Slide credit: Anirudh Sivaraman
Hierarchical Scheduling (cont.)
˃ Hierarchical Packet Fair Queueing (HPFQ) with a tree of PIFOs
[Diagram: PIFO-Red (WFQ on a & b) and PIFO-Blue (WFQ on x & y) feed PIFO-root (WFQ on Red & Blue), which interleaves Red and Blue packets]
Slide credit: Anirudh Sivaraman
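A sketch of the PIFO tree in software (my construction; weights are folded in by scaling packet length, i.e. finish = start + len/weight, which is standard weighted-STFQ practice but not spelled out on the slide):

```python
import heapq
from collections import defaultdict
from itertools import count

class WFQNode:
    """STFQ-style rank computation for one node of the HPFQ tree."""
    def __init__(self, weights):
        self.finish, self.vtime, self.w = defaultdict(float), 0.0, weights
    def rank(self, child, length):
        start = max(self.finish[child], self.vtime)
        self.finish[child] = start + length / self.w[child]
        return start

class HPFQ:
    """Two-level HPFQ: one leaf PIFO per class, one root PIFO of class refs."""
    def __init__(self):
        self.leaf_wfq = {"Red": WFQNode({"a": 0.01, "b": 0.99}),
                         "Blue": WFQNode({"x": 0.5, "y": 0.5})}
        self.root_wfq = WFQNode({"Red": 0.5, "Blue": 0.5})
        self.leaf_pifo = {"Red": [], "Blue": []}
        self.root_pifo, self.seq = [], count()

    def enqueue(self, cls, flow, pkt, length):
        r = self.leaf_wfq[cls].rank(flow, length)        # order within class
        heapq.heappush(self.leaf_pifo[cls], (r, next(self.seq), pkt))
        R = self.root_wfq.rank(cls, length)              # order among classes
        heapq.heappush(self.root_pifo, (R, next(self.seq), cls))

    def dequeue(self):
        R, _, cls = heapq.heappop(self.root_pifo)        # pick the class first
        self.root_wfq.vtime = R
        r, _, pkt = heapq.heappop(self.leaf_pifo[cls])   # then its head packet
        self.leaf_wfq[cls].vtime = r
        return pkt
```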
Getting Creative
˃ Punish heavy hitters
˃ Prioritize flows that have experienced the most queueing delay at previous hops
˃ WFQ with weights determined by buffer occupancy
Question 2:
Can our scheduling abstraction be efficiently implemented in hardware?
(My focus in this area)
Can we actually implement a PIFO?
˃ Observation: don't need to perfectly sort all packets
  Head packets are the most important – they will be scheduled soon
  It's ok if there is some churn in the tail packets
[Diagram: PIFO split into a sorted Head (ranks 0, 1, 3, 3, 4) and a Tail (ranks 5, 6, 6); arriving packets are compared against the head/tail boundary]
˃ Key features:
  Head is a small & fast sorting element
  Tail passes sorted packets to the head at line rate (the head can never go empty)
Can we actually implement a PIFO? (cont.)
[Diagram: insertions pass through a Load Balancer into parallel Deterministic Skip Lists (the Tail), each feeding a Register Head; on removal, a Selector picks the minimum-rank head packet]
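A toy model of the head/tail split (illustrative only; the real tail is a set of parallel deterministic skip lists in hardware, not a sorted Python list):

```python
import bisect
from itertools import count

class HeadTailPIFO:
    """Approximate PIFO: a small perfectly sorted head, plus a tail that
    absorbs large-rank inserts and refills the head at line rate."""

    def __init__(self, head_size=16):
        self.head, self.tail = [], []   # lists of (rank, seq, pkt)
        self.head_size = head_size
        self.seq = count()

    def push(self, rank, pkt):
        entry = (rank, next(self.seq), pkt)
        if len(self.head) < self.head_size or rank < self.head[-1][0]:
            bisect.insort(self.head, entry)                # sort into the head
            if len(self.head) > self.head_size:
                bisect.insort(self.tail, self.head.pop())  # spill largest rank
        else:
            bisect.insort(self.tail, entry)

    def pop(self):
        if not self.head:
            return None
        rank, _, pkt = self.head.pop(0)                    # dequeue the head
        if self.tail:
            bisect.insort(self.head, self.tail.pop(0))     # tail refills head
        return pkt
```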
NetFPGA Implementation
˃ NetFPGA SUME (Virtex-7)
˃ 200 MHz
˃ 10G PIFO
˃ Head: register based – 16 packets
˃ Tail: deterministic skip lists – 2K packets
˃ BRAM packet storage
[Diagram: Classification & Policing & Drop Policy → Rank Computation → PIFO (ranks 0, 3, 6, 7, 8); packet storage in Buffers 1…N]
Main Takeaways
˃ Programmable abstraction for scheduling policies
˃ Can be efficiently implemented in hardware
˃ Expose the scheduling abstraction to data-plane programmers
Questions?
Extra Slides
References
[1] Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, Nick McKeown. "Programmable packet scheduling at line rate." Proceedings of the 2016 ACM SIGCOMM Conference https://cs.nyu.edu/~anirudh/pifo-sigcomm.pdf
Dealing with rank wrap
˃ Use larger keys (e.g. 56 bits, wraps every 2 years @ 1 GHz)
˃ Restart at rank=0 once PIFO is empty
˃ Use parallel PIFOs
  Once close to wrap, switch to the alternate PIFO & use rank=0
  Give the old PIFO strict priority until empty
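A sketch of the parallel-PIFO strategy (my rendering; the "close to wrap" threshold is an assumption, the rank computation is assumed to restart at 0 after the switch, and the PIFO factory is assumed to provide push/pop/empty, e.g. a thin wrapper around the earlier PIFO sketch):

```python
class ParallelPIFOs:
    """Two PIFOs: when ranks get close to wrapping, switch enqueues to the
    alternate PIFO (whose ranks restart at 0) and drain the old one first."""

    NEAR_WRAP = (1 << 56) - (1 << 40)   # assumed margin before a 56-bit wrap

    def __init__(self, pifo_factory):
        self.factory = pifo_factory
        self.active = pifo_factory()
        self.draining = None

    def push(self, rank, pkt):
        if rank >= self.NEAR_WRAP and self.draining is None:
            # Switch to the alternate PIFO; the rank computation restarts
            # at rank = 0, so new packets carry small ranks again.
            self.draining, self.active = self.active, self.factory()
            rank = 0
        self.active.push(rank, pkt)

    def pop(self):
        if self.draining is not None:    # old PIFO has strict priority
            pkt = self.draining.pop()
            if self.draining.empty():
                self.draining = None
            return pkt
        return self.active.pop()
```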
Traffic Manager
[Diagram: Classification emits a packet descriptor to the Scheduler, which emits the descriptor when scheduled; payloads reside in a Shared Packet Buffer]
One decision every clock cycle (1 ns)
Pipelined Dequeue Logic
[Diagram: Classification Logic feeds FIFOs 1–3, each holding packets a then b; Dequeue Logic Stages 1 and 2 operate on the pipeline state (1a, 2a, 3a) — the head packet of each FIFO]
Pipelined Dequeue Logic (cont.)
[Diagram: from state (1a, 2a, 3a), the pipeline speculatively computes every possible next state — (1b, 2a, 3a), (1a, 2b, 3a), (1a, 2a, 3b) — before knowing which head packet will actually be dequeued: speculative execution]
Pipelined Dequeue Logic (cont.)
˃ Pros:
  Provides a way to pipeline dequeue logic, so dequeue logic can be more complicated and hence more programmable
˃ Cons:
  Complicated implementation
  Speculative execution requires extra resources and power, which increase dramatically with dequeue pipeline depth
[Diagram: once packet 1a is actually removed, the speculative state (1b, 2a, 3a) is committed and the other candidates are discarded]
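A purely conceptual illustration of the speculation (a real traffic manager does this in hardware registers, not Python lists):

```python
def speculative_next_states(fifos):
    """From the current pipeline state (the head of each FIFO), precompute
    one candidate next state per FIFO that might be dequeued from."""
    state = [q[0] if q else None for q in fifos]
    candidates = {}
    for i, q in enumerate(fifos):
        nxt = list(state)
        nxt[i] = q[1] if len(q) > 1 else None   # FIFO i loses its head
        candidates[i] = tuple(nxt)
    return tuple(state), candidates

def commit(candidates, dequeued_fifo):
    """Keep the candidate matching the actual dequeue; discard the rest."""
    return candidates[dequeued_fifo]

# Example: three FIFOs, each holding packets 'a' then 'b'
fifos = [["1a", "1b"], ["2a", "2b"], ["3a", "3b"]]
state, cands = speculative_next_states(fifos)
# state == ('1a', '2a', '3a')
# cands == {0: ('1b','2a','3a'), 1: ('1a','2b','3a'), 2: ('1a','2a','3b')}
new_state = commit(cands, 0)   # packet 1a removed -> ('1b', '2a', '3a')
```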
Line Rate Sorting
˃ Don't need to sort all packets!
˃ Observation 1: ranks increase within a flow
[Diagram: per-flow FIFO queues (Flows A–D) hold packets whose ranks increase within each flow; a PIFO need only sort the head packet of each flow]
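A sketch of this idea — per-flow FIFOs plus a small sorter over only the flow heads. The naming is mine, not the paper's; correctness relies on ranks being non-decreasing within each flow:

```python
import heapq
from collections import deque
from itertools import count

class FlowScheduler:
    """Exploits 'ranks increase within a flow': keep a FIFO per flow
    and sort only the head packet of each flow in a small PIFO."""

    def __init__(self):
        self.fifos = {}         # flow_id -> deque of (rank, pkt)
        self.heads = []         # heap of (rank, seq, flow_id)
        self.seq = count()

    def enqueue(self, flow_id, rank, pkt):
        q = self.fifos.setdefault(flow_id, deque())
        q.append((rank, pkt))
        if len(q) == 1:         # flow was empty: its head enters the sorter
            heapq.heappush(self.heads, (rank, next(self.seq), flow_id))

    def dequeue(self):
        rank, _, flow_id = heapq.heappop(self.heads)
        _, pkt = self.fifos[flow_id].popleft()
        if self.fifos[flow_id]:                 # promote the flow's next packet
            nrank, _ = self.fifos[flow_id][0]
            heapq.heappush(self.heads, (nrank, next(self.seq), flow_id))
        return pkt
```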
Implementation Concern: Rank Computations
Rank computation (STFQ):
f = flow(p)
p.start = max(T[f].finish, virtual_time)
T[f].finish = p.start + p.len
p.rank = p.start
[Diagram: fixed PIFO (ranks 0, 1, 2, 3, 5) with a rank-4 packet pushed in; on dequeue, virtual_time = p.rank]
˃ Some rank computations require state (e.g. virtual_time) shared between PIFO enqueue and dequeue
˃ Problem: state is local to a PISA pipeline stage and cannot be shared
Implementation Concern: Rank Computations (cont.)
Solution: trigger the pipeline on both enqueue and dequeue events (metadata: bool deq_trigger, bit<16> deq_rank)
f = flow(p)
if (deq_trigger):
    virtual_time = deq_rank
if (enq_trigger):
    p.start = max(T[f].finish, virtual_time)
    T[f].finish = p.start + p.len
    p.rank = p.start
[Diagram: Enqueue Event and Dequeue Event both invoke the rank-computation pipeline in front of the fixed PIFO]
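A software rendering of the enqueue/dequeue-triggered pipeline above (the event plumbing is simplified; enq_trigger/deq_trigger/deq_rank mirror the slide's metadata fields):

```python
from collections import defaultdict

class TriggeredSTFQ:
    """STFQ rank pipeline invoked on both enqueue and dequeue events,
    so virtual_time can live in a single pipeline stage's state."""

    def __init__(self):
        self.finish = defaultdict(int)   # T[f].finish
        self.virtual_time = 0

    def on_event(self, enq_trigger=False, deq_trigger=False,
                 flow=None, pkt_len=0, deq_rank=0):
        if deq_trigger:                  # dequeue event updates shared state
            self.virtual_time = deq_rank
        if enq_trigger:                  # enqueue event computes the rank
            start = max(self.finish[flow], self.virtual_time)
            self.finish[flow] = start + pkt_len
            return start                 # p.rank = p.start
        return None
```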
Beyond Packet Scheduling
˃ Event-driven packet processing
˃ Events in today's architectures: Ingress, Egress, Recirculation
˃ New events: Timer, Enqueue, Dequeue, Drop, Loop, Ingress-to-Egress, Control Plane, Link Status Change
˃ What can you do with events?
  Derive congestion signals for AQM
  Time-based state updates
  Event-triggered network telemetry
  Improved load balancing
  Offload control-plane functionality to the data plane
Why should we care about packet scheduling?
˃ Lots of different types of traffic w/ different characteristics and requirements
˃ Network operators have a wide range of objectives
˃ Network devices are picking up more functionality
˃ WAN links are expensive → want to make the best use of them by prioritizing traffic
˃ Performance isolation for thousands of VMs per server
Benefits of programmable packet scheduling
˃ Benefits of an unambiguous way to define scheduling algorithms:
  Portability
  Formal verification and static analysis
  Precise way to express customer needs
˃ Benefits of having a programmable scheduler:
  Innovation
  Differentiation
  Reliability
  Network operators can fine-tune for performance
  Today there is only a small menu of algorithms to choose from; programmability lets many more algorithms be expressed
Enqueue / Dequeue Walkthrough
[Animation sequence: packets p1, p2, p3 are enqueued and then dequeued in rank order (p3, p2, p1). On enqueue, each packet passes through Classification / Policing / Drop Policy; per-queue rank logic computes its rank and path over clock cycles t1…t8 while updating shared State; its descriptor (&p1, &p2, &p3, joining resident descriptors such as &p13, &p27, &p32) is inserted into one of Queues 1…N. On dequeue, the scheduler pops the appropriate descriptor and the packet exits via the Output Port.]
PIFO Paper ASIC Design [1]
˃ Flow scheduler: choose among the head packets of each flow
˃ Rank Store: store the computed ranks of each flow in FIFO order
˃ PIFO blocks connected in a full mesh
˃ 64-port 10 Gb/s shared-memory switch, 1 GHz
˃ 1000 flows, 64K packets
[PIFO block diagram: a Rank Store (SRAM) holds each flow's ranks in increasing order; a Flow Scheduler (flip-flops) sorts only the head entry of each flow, e.g. A 0, B 1, C 3]
NetFPGA Prototype (P4 Workshop Demo)
˃ Parallel deterministic skip lists and a register-based sorting cache
˃ BRAM-based packet buffer
[Top-level PIFO block diagram: input packet → Classification (descriptor & metadata) → Rank Computation (descriptor & rank) → PIFO Scheduler → descriptor → output packet; payloads wait in Queues 1…N of the Packet Buffer. Inside the PIFO Scheduler, a Load Balancer spreads insertions across parallel Skip List + Register Cache pairs, and a Selector picks the minimum on removal.]
Approximate pFabric
[Diagram: a root SRPT scheduler over per-flow FCFS queues (flows L and R). Packets p0–p3 carry flow-level ranks 6–9; the resulting final scheduling order matches the true pFabric scheduling order.]
Next Steps
˃ Support traffic shaping
˃ Formally understand what PIFOs can and can't express
˃ Language abstractions to program PIFO trees