Programmable Packet Scheduling
Stephen Ibanez, Nick McKeown, Gordon Brebner, Anthony Dalleggio, Anirudh Sivaraman
2/7/2019
What is Packet Scheduling?
[Diagram: a 3×3 switch; a scheduler sits at each output port]
Is Packet Scheduling Important?
Web Search Workload
˃ Goal: Minimize Flow Completion Time
˃ Shortest Remaining Processing Time (SRPT)
[1] Alizadeh, Mohammad, et al. "pFabric: Minimal near-optimal datacenter transport." ACM SIGCOMM, 2013.
[Figure 8 from [1]: Web search workload — normalized FCT vs. load for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal. Panels: (a) (0, 100KB]: Avg; (b) (0, 100KB]: 99th percentile; (c) (10MB, ∞): Avg. TCP-DropTail does not appear in part (b) because its performance is outside the plotted range, and the y-axis for part (c) has a different range than the other plots.]
[Figure 9 from [1]: Data mining workload — normalized FCT vs. load for the same schemes and panels. TCP-DropTail does not appear in part (b) because its performance is outside the plotted range.]
˃ Takeaway: SRPT-based scheduling (pFabric) delivers 10x or greater FCT improvements
Weighted Fair Queueing (WFQ)
˃ Bandwidth guarantees
˃ Fully utilize capacity
˃ Common Implementations:
Deficit Round Robin (DRR)
Start Time Fair Queueing (STFQ)
Stochastic Fair Queueing (SFQ)
˃ Not ideal for latency-sensitive traffic
[Diagram: WFQ sharing link bandwidth among Web, Video, Big Data, and Backup traffic with weights 0.4, 0.3, 0.1, 0.2]
Strict Priority
˃ For latency-sensitive traffic
˃ Commonly supported today
˃ Problems:
Only up to 8 priority levels per output port
Starvation of low priority traffic
[Diagram: strict priority classes — Control (HI), Memcached (MED), Other (LOW)]
Windowed Strict Priority
˃ Goal: Prioritize but avoid starvation
˃ Over time window of length T, serve at most N packets from class i if there are lower priority packets to be served
˃ Need to build a new switch ASIC for this
[Diagram: windowed strict priority — Control (HI), Memcached (MED), Other (LOW); window of length T, N = 3]
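To pin down the policy, here is a minimal software sketch of windowed strict priority. It is my illustration, not a switch ASIC design; for simplicity the window T is counted in served packets rather than wall-clock time, and class 0 is the highest priority.

```python
from collections import deque

class WindowedStrictPriority:
    """Strict priority, except: within each window, serve at most N packets
    from a class while lower-priority packets are waiting."""

    def __init__(self, num_classes, N, T):
        self.queues = [deque() for _ in range(num_classes)]  # index 0 = HI
        self.N, self.T = N, T
        self.served = [0] * num_classes   # per-class packets served this window
        self.window_pkts = 0              # packets served in the current window

    def enqueue(self, cls, pkt):
        self.queues[cls].append(pkt)

    def dequeue(self):
        for cls, q in enumerate(self.queues):
            if not q:
                continue
            lower_waiting = any(self.queues[c] for c in range(cls + 1, len(self.queues)))
            if lower_waiting and self.served[cls] >= self.N:
                continue                  # quota used up: let lower classes through
            self.served[cls] += 1
            self.window_pkts += 1
            if self.window_pkts >= self.T:          # window over: reset quotas
                self.window_pkts = 0
                self.served = [0] * len(self.served)
            return q.popleft()
        return None
```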
Modern Programmable Switch
[Diagram: Programmable Parser → Programmable Ingress Pipeline → Traffic Manager (still fixed function) → Programmable Egress Pipeline → Programmable Deparser]
Motivation: be able to deploy new scheduling policies in production networks
Questions:
1. Can we find a programmable abstraction for scheduling policies?
2. Can our abstraction be efficiently implemented in hardware?
Packet Scheduler
[Diagram: scheduling logic selecting among 100s of queues on a 5 Tb/s switch]
One decision every: 64 B / 5 Tb/s ≈ 100 ps
Observations:
1. Switches are great at making per-packet decisions
2. Virtually all scheduling policies can decide a packet's scheduling priority before it is queued
Push-In-First-Out (PIFO) Model [1]
˃ Key constraint: packets can’t change rank after insert
˃ Clear separation of fixed and programmable logic
˃ Can implement virtually every scheduling policy that we care about today
[Diagram: programmable rank computation in front of a fixed PIFO holding ranks 0, 3, 6, 7, 8; a packet with rank 4 is pushed into its sorted position. Packets are always dequeued from the head.]
[1] A. Sivaraman, et al. "Programmable packet scheduling at line rate." ACM SIGCOMM 2016
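As a mental model (not the hardware design from [1]), a PIFO is just a priority queue that fixes a packet's rank at push time and only ever pops the head. A minimal sketch in Python, breaking rank ties in FIFO order; the packet field names at the bottom are illustrative:

```python
import heapq
from itertools import count

class PIFO:
    """Push-In-First-Out queue: push into sorted position, pop only the head."""

    def __init__(self, rank_fn):
        self.rank_fn = rank_fn   # the programmable part: packet -> rank
        self.heap = []
        self.seq = count()       # FIFO tie-break among equal ranks

    def push(self, pkt):
        rank = self.rank_fn(pkt)                 # rank is fixed at enqueue
        heapq.heappush(self.heap, (rank, next(self.seq), pkt))

    def pop(self):
        _, _, pkt = heapq.heappop(self.heap)     # always the smallest rank
        return pkt

# The next slide's examples, as rank functions:
fifo = PIFO(lambda p: p["arrival_time"])   # FIFO: rank = arrival time
prio = PIFO(lambda p: p["tos"])            # Strict priority: rank = ToS field
```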
PIFO Examples
˃ FIFO — rank computation: p.rank = now
  [PIFO holds ranks 0, 1, 2, 3; an arriving packet gets rank 4 and joins the tail]
˃ Strict Priority — rank computation: p.rank = p.tos
  [PIFO holds ranks 0, 0, 1, 1; an arriving rank-0 packet is pushed in ahead of the rank-1 packets]
WFQ using PIFO
˃ If rank computations can utilize state …
˃ Start time fair queueing (STFQ)
Rank computation:
f = flow(p)
p.start = max(T[f].finish, virtual_time)
T[f].finish = p.start + p.len
p.rank = p.start
[Diagram: fixed PIFO holding ranks 0, 1, 2, 3, 5; a packet with rank 4 is pushed in]
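The same pseudocode in runnable form (a sketch; per the later "Implementation Concern" slides, virtual_time is advanced to the rank of each dequeued packet):

```python
from collections import defaultdict

class STFQRank:
    """Start Time Fair Queueing rank computation for a PIFO."""

    def __init__(self):
        self.finish = defaultdict(int)   # T[f].finish, zero for new flows
        self.virtual_time = 0            # advanced on dequeue

    def rank(self, flow_id, pkt_len):
        start = max(self.finish[flow_id], self.virtual_time)  # p.start
        self.finish[flow_id] = start + pkt_len                # T[f].finish
        return start                                          # p.rank = p.start

    def on_dequeue(self, head_rank):
        self.virtual_time = head_rank    # virtual_time = p.rank
```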
Shortest Remaining Processing Time with PIFO
˃ Packet rank set by end host
Rank computation:
f = flow(p)
p.rank = f.rem_size
[Diagram: rank computation feeding a PIFO scheduler holding ranks 1, 3, 7, 8, 9]
Fine grained priorities
˃ Shortest Flow First (SFF)
˃ Least Slack Time First (LSTF)
˃ Earliest Deadline First (EDF)
˃ Shortest Remaining Processing Time (SRPT)
˃ Service Curve Earliest Deadline first (SCED)
˃ Least Attained Service (LAS)
Hierarchical Scheduling
˃ Hierarchical Packet Fair Queueing (HPFQ)
˃ Cannot be expressed with a single PIFO
[Diagram: HPFQ tree — root WFQ over Red (0.5) and Blue (0.5); Red WFQ over a (0.01) and b (0.99); Blue WFQ over x (0.5) and y (0.5)]
Slide credit: Anirudh Sivaraman
Hierarchical Scheduling (cont.)
˃ Hierarchical Packet Fair Queueing (HPFQ) with a tree of PIFOs
[Diagram: PIFO-Red (WFQ on a & b) and PIFO-Blue (WFQ on x & y) feed PIFO-root (WFQ on Red & Blue), which interleaves Red and Blue packets]
Slide credit: Anirudh Sivaraman
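A sketch of the PIFO tree in software (my construction; weights are folded in by scaling packet length, i.e. finish = start + len/weight, which is standard weighted-STFQ practice but not spelled out on the slide):

```python
import heapq
from collections import defaultdict
from itertools import count

class WFQNode:
    """STFQ-style rank computation for one node of the HPFQ tree."""
    def __init__(self, weights):
        self.finish, self.vtime, self.w = defaultdict(float), 0.0, weights
    def rank(self, child, length):
        start = max(self.finish[child], self.vtime)
        self.finish[child] = start + length / self.w[child]
        return start

class HPFQ:
    """Two-level HPFQ: one leaf PIFO per class, one root PIFO of class refs."""
    def __init__(self):
        self.leaf_wfq = {"Red": WFQNode({"a": 0.01, "b": 0.99}),
                         "Blue": WFQNode({"x": 0.5, "y": 0.5})}
        self.root_wfq = WFQNode({"Red": 0.5, "Blue": 0.5})
        self.leaf_pifo = {"Red": [], "Blue": []}
        self.root_pifo, self.seq = [], count()

    def enqueue(self, cls, flow, pkt, length):
        r = self.leaf_wfq[cls].rank(flow, length)        # order within class
        heapq.heappush(self.leaf_pifo[cls], (r, next(self.seq), pkt))
        R = self.root_wfq.rank(cls, length)              # order among classes
        heapq.heappush(self.root_pifo, (R, next(self.seq), cls))

    def dequeue(self):
        R, _, cls = heapq.heappop(self.root_pifo)        # pick the class first
        self.root_wfq.vtime = R
        r, _, pkt = heapq.heappop(self.leaf_pifo[cls])   # then its head packet
        self.leaf_wfq[cls].vtime = r
        return pkt
```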
Getting Creative
˃ Punish heavy hitters
˃ Prioritize flows that have experienced the most queueing delay at previous hops
˃ WFQ with weights determined by buffer occupancy
Question 2:
Can our scheduling abstraction be efficiently implemented in hardware?
(My focus in this area)
Can we actually implement a PIFO?
˃ Observation: don't need to perfectly sort all packets
  Head packets are the most important – they will be scheduled soon
  It's ok if there is some churn in the tail packets
[Diagram: PIFO split into a sorted Head (ranks 0, 1, 3, 3, 4) and a Tail (ranks 5, 6, 6); arriving packets are compared against the head/tail boundary]
˃ Key features:
  Head is a small & fast sorting element
  Tail passes sorted packets to the head at line rate (the head can never go empty)
Can we actually implement a PIFO? (cont.)
[Diagram: insertions pass through a Load Balancer into parallel Deterministic Skip Lists (the Tail), each feeding a Register Head; on removal, a Selector picks the minimum-rank head packet]
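A toy model of the head/tail split (illustrative only; the real tail is a set of parallel deterministic skip lists in hardware, not a sorted Python list):

```python
import bisect
from itertools import count

class HeadTailPIFO:
    """Approximate PIFO: a small perfectly sorted head, plus a tail that
    absorbs large-rank inserts and refills the head at line rate."""

    def __init__(self, head_size=16):
        self.head, self.tail = [], []   # lists of (rank, seq, pkt)
        self.head_size = head_size
        self.seq = count()

    def push(self, rank, pkt):
        entry = (rank, next(self.seq), pkt)
        if len(self.head) < self.head_size or rank < self.head[-1][0]:
            bisect.insort(self.head, entry)                # sort into the head
            if len(self.head) > self.head_size:
                bisect.insort(self.tail, self.head.pop())  # spill largest rank
        else:
            bisect.insort(self.tail, entry)

    def pop(self):
        if not self.head:
            return None
        rank, _, pkt = self.head.pop(0)                    # dequeue the head
        if self.tail:
            bisect.insort(self.head, self.tail.pop(0))     # tail refills head
        return pkt
```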
NetFPGA Implementation
˃ NetFPGA SUME (Virtex-7)
˃ 200 MHz
˃ 10G PIFO
˃ Head: register based – 16 packets
˃ Tail: deterministic skip lists – 2K packets
˃ BRAM packet storage
[Diagram: Classification & Policing & Drop Policy → Rank Computation → PIFO (ranks 0, 3, 6, 7, 8); packet storage in Buffers 1…N]
Main Takeaways
˃ Programmable abstraction for scheduling policies
˃ Can be efficiently implemented in hardware
˃ Expose the scheduling abstraction to data-plane programmers
Questions?
Extra Slides
References
[1] Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, Nick McKeown. "Programmable packet scheduling at line rate." Proceedings of the 2016 ACM SIGCOMM Conference https://cs.nyu.edu/~anirudh/pifo-sigcomm.pdf
Dealing with rank wrap
˃ Use larger keys (e.g. 56 bits, wraps every 2 years @ 1 GHz)
˃ Restart at rank=0 once PIFO is empty
˃ Use parallel PIFOs
  Once close to wrap, switch to the alternate PIFO & use rank=0
  Give the old PIFO strict priority until empty
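A sketch of the parallel-PIFO strategy (my rendering; the "close to wrap" threshold is an assumption, the rank computation is assumed to restart at 0 after the switch, and the PIFO factory is assumed to provide push/pop/empty, e.g. a thin wrapper around the earlier PIFO sketch):

```python
class ParallelPIFOs:
    """Two PIFOs: when ranks get close to wrapping, switch enqueues to the
    alternate PIFO (whose ranks restart at 0) and drain the old one first."""

    NEAR_WRAP = (1 << 56) - (1 << 40)   # assumed margin before a 56-bit wrap

    def __init__(self, pifo_factory):
        self.factory = pifo_factory
        self.active = pifo_factory()
        self.draining = None

    def push(self, rank, pkt):
        if rank >= self.NEAR_WRAP and self.draining is None:
            # Switch to the alternate PIFO; the rank computation restarts
            # at rank = 0, so new packets carry small ranks again.
            self.draining, self.active = self.active, self.factory()
            rank = 0
        self.active.push(rank, pkt)

    def pop(self):
        if self.draining is not None:    # old PIFO has strict priority
            pkt = self.draining.pop()
            if self.draining.empty():
                self.draining = None
            return pkt
        return self.active.pop()
```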
Traffic Manager
[Diagram: Classification emits a packet descriptor to the Scheduler, which emits the descriptor when scheduled; payloads reside in a Shared Packet Buffer]
One decision every clock cycle (1 ns)
Pipelined Dequeue Logic
[Diagram: Classification Logic feeds FIFOs 1–3, each holding packets a then b; Dequeue Logic Stages 1 and 2 operate on the pipeline state (1a, 2a, 3a) — the head packet of each FIFO]
Pipelined Dequeue Logic (cont.)
[Diagram: from state (1a, 2a, 3a), the pipeline speculatively computes every possible next state — (1b, 2a, 3a), (1a, 2b, 3a), (1a, 2a, 3b) — before knowing which head packet will actually be dequeued: speculative execution]
Pipelined Dequeue Logic (cont.)
˃ Pros:
  Provides a way to pipeline dequeue logic, so dequeue logic can be more complicated and hence more programmable
˃ Cons:
  Complicated implementation
  Speculative execution requires extra resources and power, which increase dramatically with dequeue pipeline depth
[Diagram: once packet 1a is actually removed, the speculative state (1b, 2a, 3a) is committed and the other candidates are discarded]
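A purely conceptual illustration of the speculation (a real traffic manager does this in hardware registers, not Python lists):

```python
def speculative_next_states(fifos):
    """From the current pipeline state (the head of each FIFO), precompute
    one candidate next state per FIFO that might be dequeued from."""
    state = [q[0] if q else None for q in fifos]
    candidates = {}
    for i, q in enumerate(fifos):
        nxt = list(state)
        nxt[i] = q[1] if len(q) > 1 else None   # FIFO i loses its head
        candidates[i] = tuple(nxt)
    return tuple(state), candidates

def commit(candidates, dequeued_fifo):
    """Keep the candidate matching the actual dequeue; discard the rest."""
    return candidates[dequeued_fifo]

# Example: three FIFOs, each holding packets 'a' then 'b'
fifos = [["1a", "1b"], ["2a", "2b"], ["3a", "3b"]]
state, cands = speculative_next_states(fifos)
# state == ('1a', '2a', '3a')
# cands == {0: ('1b','2a','3a'), 1: ('1a','2b','3a'), 2: ('1a','2a','3b')}
new_state = commit(cands, 0)   # packet 1a removed -> ('1b', '2a', '3a')
```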
Line Rate Sorting
˃ Don't need to sort all packets!
˃ Observation 1: ranks increase within a flow
[Diagram: per-flow FIFO queues (Flows A–D) hold packets whose ranks increase within each flow; a PIFO need only sort the head packet of each flow]
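A sketch of this idea — per-flow FIFOs plus a small sorter over only the flow heads. The naming is mine, not the paper's; correctness relies on ranks being non-decreasing within each flow:

```python
import heapq
from collections import deque
from itertools import count

class FlowScheduler:
    """Exploits 'ranks increase within a flow': keep a FIFO per flow
    and sort only the head packet of each flow in a small PIFO."""

    def __init__(self):
        self.fifos = {}         # flow_id -> deque of (rank, pkt)
        self.heads = []         # heap of (rank, seq, flow_id)
        self.seq = count()

    def enqueue(self, flow_id, rank, pkt):
        q = self.fifos.setdefault(flow_id, deque())
        q.append((rank, pkt))
        if len(q) == 1:         # flow was empty: its head enters the sorter
            heapq.heappush(self.heads, (rank, next(self.seq), flow_id))

    def dequeue(self):
        rank, _, flow_id = heapq.heappop(self.heads)
        _, pkt = self.fifos[flow_id].popleft()
        if self.fifos[flow_id]:                 # promote the flow's next packet
            nrank, _ = self.fifos[flow_id][0]
            heapq.heappush(self.heads, (nrank, next(self.seq), flow_id))
        return pkt
```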
Implementation Concern: Rank Computations
Rank computation (STFQ):
f = flow(p)
p.start = max(T[f].finish, virtual_time)
T[f].finish = p.start + p.len
p.rank = p.start
[Diagram: fixed PIFO (ranks 0, 1, 2, 3, 5) with a rank-4 packet pushed in; on dequeue, virtual_time = p.rank]
˃ Some rank computations require state (e.g. virtual_time) shared between PIFO enqueue and dequeue
˃ Problem: state is local to a PISA pipeline stage and cannot be shared
Implementation Concern: Rank Computations (cont.)
Solution: trigger the pipeline on both enqueue and dequeue events (metadata: bool deq_trigger, bit<16> deq_rank)
f = flow(p)
if (deq_trigger):
    virtual_time = deq_rank
if (enq_trigger):
    p.start = max(T[f].finish, virtual_time)
    T[f].finish = p.start + p.len
    p.rank = p.start
[Diagram: Enqueue Event and Dequeue Event both invoke the rank-computation pipeline in front of the fixed PIFO]
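A software rendering of the enqueue/dequeue-triggered pipeline above (the event plumbing is simplified; enq_trigger/deq_trigger/deq_rank mirror the slide's metadata fields):

```python
from collections import defaultdict

class TriggeredSTFQ:
    """STFQ rank pipeline invoked on both enqueue and dequeue events,
    so virtual_time can live in a single pipeline stage's state."""

    def __init__(self):
        self.finish = defaultdict(int)   # T[f].finish
        self.virtual_time = 0

    def on_event(self, enq_trigger=False, deq_trigger=False,
                 flow=None, pkt_len=0, deq_rank=0):
        if deq_trigger:                  # dequeue event updates shared state
            self.virtual_time = deq_rank
        if enq_trigger:                  # enqueue event computes the rank
            start = max(self.finish[flow], self.virtual_time)
            self.finish[flow] = start + pkt_len
            return start                 # p.rank = p.start
        return None
```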
Beyond Packet Scheduling
˃ Event-driven packet processing
˃ Events in today's architectures: Ingress, Egress, Recirculation
˃ New events: Timer, Enqueue, Dequeue, Drop, Loop, Ingress-to-Egress, Control Plane, Link Status Change
˃ What can you do with events?
  Derive congestion signals for AQM
  Time-based state updates
  Event-triggered network telemetry
  Improved load balancing
  Offload control-plane functionality to the data plane
Why should we care about packet scheduling?
˃ Lots of different types of traffic w/ different characteristics and requirements
˃ Network operators have a wide range of objectives
˃ Network devices are picking up more functionality
˃ WAN links are expensive → want to make the best use of them by prioritizing traffic
˃ Performance isolation for thousands of VMs per server
Benefits of programmable packet scheduling
˃ Benefits of an unambiguous way to define scheduling algorithms:
  Portability
  Formal verification and static analysis
  Precise way to express customer needs
˃ Benefits of having a programmable scheduler:
  Innovation
  Differentiation
  Reliability
  Network operators can fine-tune for performance
  Today there is only a small menu of algorithms to choose from; programmability lets many more algorithms be expressed
Enqueue / Dequeue Walkthrough
[Animation sequence: packets p1, p2, p3 are enqueued and then dequeued in rank order (p3, p2, p1). On enqueue, each packet passes through Classification / Policing / Drop Policy; per-queue rank logic computes its rank and path over clock cycles t1…t8 while updating shared State; its descriptor (&p1, &p2, &p3, joining resident descriptors such as &p13, &p27, &p32) is inserted into one of Queues 1…N. On dequeue, the scheduler pops the appropriate descriptor and the packet exits via the Output Port.]
PIFO Paper ASIC Design [1]
˃ Flow scheduler: choose among the head packets of each flow
˃ Rank Store: store the computed ranks of each flow in FIFO order
˃ PIFO blocks connected in a full mesh
˃ 64-port 10 Gb/s shared-memory switch, 1 GHz
˃ 1000 flows, 64K packets
[PIFO block diagram: a Rank Store (SRAM) holds each flow's ranks in increasing order; a Flow Scheduler (flip-flops) sorts only the head entry of each flow, e.g. A 0, B 1, C 3]
NetFPGA Prototype (P4 Workshop Demo)
˃ Parallel deterministic skip lists and a register-based sorting cache
˃ BRAM-based packet buffer
[Top-level PIFO block diagram: input packet → Classification (descriptor & metadata) → Rank Computation (descriptor & rank) → PIFO Scheduler → descriptor → output packet; payloads wait in Queues 1…N of the Packet Buffer. Inside the PIFO Scheduler, a Load Balancer spreads insertions across parallel Skip List + Register Cache pairs, and a Selector picks the minimum on removal.]
Approximate pFabric
[Diagram: a root SRPT scheduler over per-flow FCFS queues (flows L and R). Packets p0–p3 carry flow-level ranks 6–9; the resulting final scheduling order matches the true pFabric scheduling order.]
Next Steps
˃ Support traffic shaping
˃ Formally understand what PIFOs can and can't express
˃ Language abstractions to program PIFO trees