
  • Titan: Fair Packet Scheduling for Commodity Multiqueue NICs Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift

    July 13th, 2017

  • Ethernet line-rates are increasing!


  • Servers need: low CPU utilization networking to drive increasing line-rates

  • Underlying mechanisms: segmentation offload and multiqueue NICs

  • Using large segments (64KB) instead of packets can reduce CPU load

    TCP Segmentation Offload (TSO)

    [Figure: without TSO, the OS hands the NIC individual F1 and F2 packets; with TSO, it hands the NIC one large segment per flow]

    • Many operations performed by the OS are per-packet, not per-byte
    • TSO allows the OS to send large segments to the NIC
    • TSO NIC hardware generates packets from segments
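    As a rough illustration of why this saves CPU, here is a standalone C sketch (not from the paper; the segment size and MSS are illustrative values) counting how many wire packets a TSO NIC generates from a single 64KB segment that the OS handled only once:

    /* Sketch: one large OS segment becomes many MSS-sized wire packets. */
    #include <stdio.h>

    #define SEGMENT_BYTES (64 * 1024)
    #define MSS           1448        /* typical TCP payload bytes per packet */

    int main(void)
    {
        int packets = 0;
        for (int off = 0; off < SEGMENT_BYTES; off += MSS)
            packets++;                /* NIC replicates headers for each packet */

        printf("1 segment of %d bytes -> %d wire packets\n",
               SEGMENT_BYTES, packets);
        return 0;
    }

    The per-segment work in the OS is paid once, while the NIC performs the per-packet header work for all of the resulting packets.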

  • Multiqueue NICs

    Multiqueue NICs enable parallelism

    [Figure: with a single TX queue, cores 1 and 2 contend on locking/polling; with a multiqueue NIC, each core sends on its own queue (TXQ-1, TXQ-2) and the NIC's packet scheduler multiplexes the queues onto the wire]

  • Fairness Problems

    TSO and multiqueue cause pervasive unfairness

    [Figure: flows F1-F3 share TXQ-1 and TXQ-2; the fair packet schedule would be F1 F2 F1 F3 F2 F1 F3 F2, but the actual schedule is F1 F3 F2 F1 F2 F2 F2 F3, showing both TSO unfairness and multiqueue unfairness]

  • Fairness is important

    • Fairness is needed so competing applications can share the network

    • Fairness is needed for predictability
      • Unfairness leads to unpredictable completion times across runs
      • Perfect fairness → perfect predictability

    • Fairness can improve application performance
      • Ex: Weighted Coflow Scheduling [Chowdhury SIGCOMM11, Chowdhury SIGCOMM14]

  • Titan Goals:

    • Drive increasing line-rates
    • Low CPU utilization
    • Per-flow fairness
    • Work on commodity NICs

  • Multiqueue Fairness in Linux:

    • Flow arrivals to each transmit queue are dynamic, but:
    • The OS statically uses a per-flow hash to assign flows to queues (sketched below)
    • The NIC scheduler statically uses deficit round-robin (DRR) to provide per-queue fairness
    • In the datacenter, the OS statically chooses a TSO size
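    For reference, a minimal user-space sketch of this static assignment; the hash function and flow fields are stand-ins for illustration, not the kernel's actual skb hashing code:

    /* Static hash-based TX queue assignment (stand-in for the kernel's
     * per-flow hash): the queue is fixed for the flow's lifetime, so two
     * flows can collide on one queue while another queue sits idle. */
    #include <stdint.h>
    #include <stdio.h>

    struct flow { uint32_t saddr, daddr; uint16_t sport, dport; };

    static uint32_t flow_hash(const struct flow *f)
    {
        return (f->saddr ^ f->daddr) * 2654435761u
               ^ ((uint32_t)f->sport << 16 | f->dport);
    }

    static unsigned pick_txq_static(const struct flow *f, unsigned nqueues)
    {
        return flow_hash(f) % nqueues;
    }

    int main(void)
    {
        struct flow f1 = { 0x0a000001, 0x0a000002, 10001, 80 };
        struct flow f2 = { 0x0a000001, 0x0a000003, 10002, 80 };
        printf("F1 -> TXQ-%u, F2 -> TXQ-%u (of 3 queues)\n",
               pick_txq_static(&f1, 3) + 1, pick_txq_static(&f2, 3) + 1);
        return 0;
    }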

  • Titan Design: As flows dynamically arrive and complete, in Titan:

    The OS dynamically:
    • Assigns weights to flows
    • Tracks the flow occupancy of queues
    • Picks queues for flows
    • Updates the NIC with queue weights

    The NIC dynamically:
    • Applies queue weights from the OS (the per-flow and per-queue state this implies is sketched below)
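    A compact sketch of the state this implies on the OS side; the field names are assumptions for illustration, not Titan's actual source:

    /* Per-flow and per-queue state the OS-side logic maintains (assumed names). */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_TXQ 8

    struct titan_flow {
        uint32_t weight;     /* assigned by the OS when the flow starts */
        int      txq;        /* queue picked for the flow, -1 while unassigned */
    };

    struct titan_txq {
        uint32_t occupancy;  /* total weight of flows with bytes enqueued */
        uint32_t nic_weight; /* last DRR weight pushed down to the NIC */
    };

    static struct titan_txq txqs[NUM_TXQ];

    int main(void)
    {
        struct titan_flow f1 = { .weight = 1, .txq = 0 };
        txqs[f1.txq].occupancy += f1.weight;               /* flow starts sending */
        txqs[f1.txq].nic_weight = txqs[f1.txq].occupancy;  /* later pushed to the NIC */
        printf("TXQ-1 occupancy=%u weight=%u\n",
               txqs[0].occupancy, txqs[0].nic_weight);
        return 0;
    }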

  • Causes of Unfairness:

    • Multiqueue unfairness
    • TSO unfairness

  • Problem: Hash collisions

    [Figure: a hash collision places F1 and F3 on the same TX queue while another queue sits empty; F2 has a queue to itself, so the wire schedule (F1 F3 F2 F1 F2 F2 F2 F3) gives the colliding flows half of F2's throughput (multiqueue unfairness)]

  • Problem: Hash collisions

    Solution: Dynamic Queue Assignment (DQA)
    • OS assigns a weight to each flow
    • DQA picks the queue with the lowest occupancy when a flow starts (see the sketch after this slide)
    • Queue occupancies are updated:
      • Any time a flow starts enqueuing data
      • Any time a flow has no enqueued bytes (at most each TX interrupt)

    [Figure: with DQA, F1, F2, and F3 are spread across TXQ-1, TXQ-2, and TXQ-3]
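    A minimal sketch of the DQA queue pick described above, assuming a simple per-queue occupancy counter; this is a user-space stand-in, not the actual kernel patch:

    /* Dynamic Queue Assignment: place a starting flow on the least-occupied queue. */
    #include <stdio.h>

    #define NUM_TXQ 3

    static unsigned occupancy[NUM_TXQ];   /* weighted count of active flows per queue */

    /* Called when a flow starts enqueuing data: pick the least-occupied queue. */
    static int dqa_pick_txq(unsigned flow_weight)
    {
        int best = 0;
        for (int q = 1; q < NUM_TXQ; q++)
            if (occupancy[q] < occupancy[best])
                best = q;
        occupancy[best] += flow_weight;
        return best;
    }

    /* Called when a flow has no enqueued bytes left (at most once per TX interrupt). */
    static void dqa_flow_drained(int txq, unsigned flow_weight)
    {
        occupancy[txq] -= flow_weight;
    }

    int main(void)
    {
        int q1 = dqa_pick_txq(1);   /* F1 */
        int q2 = dqa_pick_txq(1);   /* F2 */
        int q3 = dqa_pick_txq(1);   /* F3 lands on the remaining empty queue */
        printf("F1->TXQ-%d F2->TXQ-%d F3->TXQ-%d\n", q1 + 1, q2 + 1, q3 + 1);
        dqa_flow_drained(q2, 1);    /* F2 finishes; its queue becomes free again */
        return 0;
    }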

  • Problem: Hash collisions

    Solution: Dynamic Queue Assignment (DQA)

    [Figure: with F1, F2, and F3 on separate TX queues, the wire schedule interleaves them evenly: F1 F3 F2 F1 F3 F2 ...]

  • Problem: Asymmetric Oversubscription

    [Figure: four flows on three TX queues with DRR weights 1:1:1; F1 and F2 collide on one queue while F3 and F4 each get their own, so the wire schedule is F1 F3 F4 F1 F3 F4 F2 F3 F4 F2 F3 F4 and F1 and F2 receive half throughput]

  • Problem: Asymmetric Oversubscription

    Solution: Dynamic Queue Weight Assignment (DQWA)

    [Figure: F1 and F2 share one TX queue while F3 and F4 each have their own; the OS uses ndo_set_tx_weight to set the shared queue's DRR weight to 2 and the others to 1]

    • OS assigns weights to flows
    • OS updates the NIC scheduler with queue occupancies as flows start and stop (at most each TX interrupt)
    • NIC updates DRR weights (a sketch of the hook follows this slide)

    This is implementable on existing commodity NICs because it only needs to update DRR weights!
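    A sketch of the DQWA control path, assuming the ndo_set_tx_weight hook named on the slide; the function body and the register write are hypothetical stand-ins, not the real ixgbe driver code:

    /* DQWA sketch: the OS pushes per-queue occupancies; the driver turns them
     * into DRR weights for the NIC scheduler. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_TXQ 3

    /* Hypothetical stand-in for programming a queue's DRR quantum/credit. */
    static void nic_write_drr_weight(unsigned txq, uint32_t weight)
    {
        printf("TXQ-%u: DRR weight <- %u\n", txq + 1, weight);
    }

    /* Hypothetical netdev op: called when a queue's flow occupancy changes
     * (at most once per TX interrupt). */
    static int ndo_set_tx_weight(unsigned txq, uint32_t occupancy)
    {
        /* Weight each queue by the flow weight it carries, so a queue with
         * two flows gets twice the bandwidth of a queue with one. */
        nic_write_drr_weight(txq, occupancy ? occupancy : 1);
        return 0;
    }

    int main(void)
    {
        uint32_t occupancy[NUM_TXQ] = { 2, 1, 1 };  /* F1+F2 share TXQ-1 */
        for (unsigned q = 0; q < NUM_TXQ; q++)
            ndo_set_tx_weight(q, occupancy[q]);
        return 0;
    }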

  • Problem: Asymmetric Oversubscription

    Solution: Dynamic Queue Weight Assignment (DQWA)

    [Figure: with queue weights 2:1:1, the wire schedule becomes F1 F3 F4 F1 F2 F3 F4 F2 and all four flows receive equal throughput]

    DQA and DQWA provide long-term fairness

  • Problem: TSO Unfairness

    [Figure: with queue weights 2:1:1 the long-term schedule F1 F3 F4 F1 F2 F3 F4 F2 is fair, but each flow's large TSO segments occupy the wire in bursts (short-term unfairness)]

    • Short-term unfairness can cause bursts of congestion in the network
    • Short-term unfairness can increase latency

  • Problem: TSO Unfairness

    Solution: Dynamic Segmentation Offload Sizing (DSOS)

    [Figure: with smaller segments, the wire schedule interleaves F1-F4 at a finer granularity]

    • DSOS dynamically changes the segment size during oversubscription (a sizing sketch follows this slide)
      • Same implementation as GSO
    • CPU vs. fairness tradeoff
      • Segmenting after the TCP/IP stack reduces CPU costs
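    A sketch of the DSOS sizing decision, under the assumption that oversubscription is detected as "more active flows than TX queues"; the 16KB value matches the evaluation configuration, the rest is illustrative:

    /* DSOS policy sketch: shrink the software (GSO-style) segment size when
     * the queues are oversubscribed, trading CPU for short-term fairness. */
    #include <stdio.h>

    #define FULL_TSO_BYTES (64 * 1024)
    #define DSOS_BYTES     (16 * 1024)

    static unsigned pick_segment_size(unsigned active_flows, unsigned num_txqs)
    {
        /* Oversubscribed: flows share queues, so the wire interleaves at
         * segment granularity; smaller segments reduce burstiness. */
        if (active_flows > num_txqs)
            return DSOS_BYTES;
        return FULL_TSO_BYTES;
    }

    int main(void)
    {
        printf("4 flows, 3 queues -> %u-byte segments\n", pick_segment_size(4, 3));
        printf("2 flows, 3 queues -> %u-byte segments\n", pick_segment_size(2, 3));
        return 0;
    }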

  • Implementation

    • DQA, DQWA, and DSOS are implemented in Linux 4.4.6

    • Support for ndo_set_tx_weight is implemented in the Intel ixgbe driver for the Intel 82599 10Gbps NIC

    • Titan is open source!

    https://github.com/bestephe/titan

  • Evaluation
    • Microbenchmarks
      • 2 servers, 1 switch
      • 8-queue NICs
      • Vary number of flows (level of oversubscription)

    • Incremental fairness benefits of DQA, DQWA, and DSOS
      • DQA and DQWA: expected to improve long-term fairness
      • DSOS: expected to improve short-term fairness

  • Evaluation – Fairness Metric

    Metrics:
    • Normalized fairness metric (NFM) inspired by Shreedhar and Varghese:
      • NFM = 0 is fair
      • NFM > 1 is very unfair

    NFM = (Bytes(MaxFlow) – Bytes(MinFlow)) / Bytes(FairShare)

    [Figure: the ideal packet schedule F1 F2 F1 F3 F2 F1 F3 F2 has NFM = 0; the unfair schedule F1 F3 F2 F1 F2 F2 F2 F3 has NFM = 1]
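    A worked example of the metric, using illustrative byte counts rather than measured data:

    /* NFM = (Bytes(MaxFlow) - Bytes(MinFlow)) / Bytes(FairShare). */
    #include <stdio.h>

    static double nfm(double max_flow_bytes, double min_flow_bytes,
                      double fair_share_bytes)
    {
        return (max_flow_bytes - min_flow_bytes) / fair_share_bytes;
    }

    int main(void)
    {
        /* Perfectly fair: every flow sent exactly its fair share. */
        printf("fair:   NFM = %.1f\n", nfm(100.0, 100.0, 100.0));
        /* Very unfair: one flow sent double its share while another sent nothing. */
        printf("unfair: NFM = %.1f\n", nfm(200.0, 0.0, 100.0));
        return 0;
    }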

  • Microbenchmarks – 1s Timescale

    [Figure: NFM over 1s intervals vs. number of flows (6, 12, 24, 48) for Linux, DQA, DQA + DQWA, and DQA + DQWA + DSOS (16KB)]

    • Linux is unfair at all subscription levels
    • DQA often significantly improves fairness
      • At 48 flows, flow churn prevents DQA from evenly spreading flows
    • DQWA improves fairness when DQA cannot evenly spread flows across queues
    • DSOS does not have a significant impact on long-term fairness

  • Microbenchmarks – 1ms Timescale

    [Figure: NFM over 1ms intervals vs. number of flows (6, 12, 24, 48) for Linux, DQA, DQA + DQWA, and DQA + DQWA + DSOS (16KB)]

    • At short timescales and under oversubscription, DQA and DQWA do not significantly improve fairness
      • TSO is the primary cause of unfairness
    • DSOS (16KB) often reduces unfairness by >2x

  • Cluster Experiments

    CDF of completion times in a 1GB all-to-all shuffle (24 servers)

    [Figure: CDFs of flow completion time for Vanilla, Vanilla (Cmax), and Titan at (a) 6 servers, (b) 12 servers, and (c) 24 servers]