TCP Incast in Data Center Networks: A study of the problem and proposed solutions

Transcript

Page 1:

TCP Incast in Data Center Networks

A study of the problem and proposed solutions

Page 2:

Outline

• TCP Incast - Problem Description
• Motivation and challenges
• Proposed Solutions
• Evaluation of proposed solutions
• Conclusion
• References

Page 4:

TCP Incast – Problem Description

• Incast terminology:
  – Barrier Synchronized Workload
  – SRU (Server Request Unit)
  – Goodput, Throughput
  – MTU
  – BDP
  – and TCP acronyms like RTT, RTO, CA, AIMD, etc.

Page 5:

TCP Incast – Problem

A typical deployment scenario in data centers

Page 6:

TCP Incast - Problem

Many-to-one barrier-synchronized workload:
• The receiver requests k blocks of data from S storage servers
• Each block of data is striped across the S storage servers
• Each server responds with a "fixed" amount of data (fixed-fragment workload)
• The client won't request block k+1 until all the fragments of block k have been received

Datacenter scenario: k = 100, S = 1-48, fragment size: 256KB (the request loop is sketched below)
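To make the barrier semantics concrete, here is a minimal sketch of the client-side request loop. It is an illustration only, not code from any of the cited papers; the helper fetch_fragment and the thread-pool structure are assumptions standing in for real TCP exchanges.

```python
# Sketch of the client side of a barrier-synchronized, fixed-fragment
# workload. fetch_fragment() is a hypothetical stand-in for one TCP
# request/response exchange with a single storage server.
from concurrent.futures import ThreadPoolExecutor

K_BLOCKS = 100               # k: number of blocks requested in total
FRAGMENT_SIZE = 256 * 1024   # bytes each server returns per block (SRU)

def fetch_fragment(server: int, block: int) -> bytes:
    """Placeholder for 'send request to server, read 256KB response'."""
    return b"\x00" * FRAGMENT_SIZE

def run_workload(num_servers: int) -> None:
    with ThreadPoolExecutor(max_workers=num_servers) as pool:
        for block in range(K_BLOCKS):
            # Request the current block from every server in parallel...
            futures = [pool.submit(fetch_fragment, s, block)
                       for s in range(num_servers)]
            # ...and block until ALL fragments arrive (the barrier):
            # block k+1 is not requested until block k is complete.
            fragments = [f.result() for f in futures]
            assert len(fragments) == num_servers

run_workload(num_servers=8)
```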

Page 7:

TCP Incast - Problem

Goodput Collapse

Page 8:

TCP Incast - Problem

• Switch buffers are inherently small, i.e., 32KB-128KB per port

• The bottleneck switch buffer is overwhelmed by the servers' synchronized sending of data, and consequently the switch drops packets

• RTT is typically 1-2ms in datacenters while RTOmin is 200ms; this gap means dropped packets are not retransmitted soon

• All the other senders that have already sent their data have to wait until the dropped packet is retransmitted

• But the large RTO means retransmission is delayed, resulting in a decrease in goodput
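A back-of-the-envelope calculation shows why one timeout dominates everything else. The 1 Gbps link speed below is an assumption consistent with the Gigabit testbeds cited later in this deck:

```python
# Rough arithmetic for the cost of a single RTO stall, assuming a
# 1 Gbps link and the deck's 256KB fragment size.
LINK_BPS = 1e9                # 1 Gbps bottleneck link
FRAGMENT_BYTES = 256 * 1024   # one server's response per block
RTO_MIN = 0.200               # 200 ms minimum retransmission timeout

transfer_time = FRAGMENT_BYTES * 8 / LINK_BPS   # ~2.1 ms on the wire
stalled_time = transfer_time + RTO_MIN          # one drop adds 200 ms

print(f"fragment transfer time: {transfer_time * 1e3:.1f} ms")
print(f"with one RTO stall:     {stalled_time * 1e3:.1f} ms")
print(f"slowdown factor:        {stalled_time / transfer_time:.0f}x")
# One dropped packet makes the whole barrier wait ~100x longer than
# the useful transfer itself -- the goodput collapse in a nutshell.
```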

Page 9:

Outline

• TCP Incast - Problem Description
• Motivation and challenges
• Proposed Solutions
• Evaluation of proposed solutions
• Conclusion
• References

Page 10:

Motivation

• Internet datacenters support a myriad of services and applications (Google, Microsoft, Yahoo, Amazon)

• The vast majority of datacenters use TCP for communication between nodes

• Companies like Facebook have adopted UDP as their transport layer protocol to avoid TCP incast, entrusting the responsibility for flow control to application layer protocols

• The unique workloads (such as MapReduce and Hadoop), scale, and environment of internet datacenters violate the WAN assumptions on which TCP was originally designed
  – Ex: In a web search application, many workers respond near-simultaneously to search queries; key-value pairs from many Mappers are transferred to the appropriate Reducers during the shuffle stage

Page 11:

Incast in Bing (Microsoft)

Ref: Slide from Albert Greenberg's (Microsoft) presentation at SIGCOMM'10

Page 12:

Challenges

• Only minimal changes to the TCP implementation should be needed

• Cannot decrease RTOmin below 1ms, as operating systems do not support the high-resolution timers RTO would need

• Both internal and external flows have to be addressed

• Cannot afford large buffers at the switch because they are costly

• The solution needs to be easily deployable and cost effective

Page 13:

Outline

• TCP Incast - Problem Description
• Motivation and challenges
• Proposed Solutions
• Evaluation of proposed solutions
• Conclusion
• References

Page 14:

Proposed Solutions

Solutions can be divided into:
• Application level solutions
• Transport layer solutions
• Transport layer solutions aided by the switch's ECN and QCN capabilities

An alternative way to categorize the solutions:
• Avoiding timeouts in TCP
• Reducing RTOmin
• Replacing TCP
• Calling on lower layer functionality, such as Ethernet flow control, for help

Page 15:

Understanding the problem…

• Collaborative study by EECS Berkeley and Intel Labs [1]

• Their study focused on:
  – proving that this problem is general,
  – deriving an analytical model, and
  – studying the impact of various modifications to TCP on incast behavior

Page 16:

Different RTO Timers

Observations:
– The initial goodput minimum occurs at the same number of servers
– A smaller minimum RTO timer value gives a faster goodput "recovery" rate
– The rate of decrease after the local maximum is the same across different minimum RTO settings

Page 17:

• Decreasing the RTO gives a proportional increase in goodput

• Surprisingly, a 1ms RTO with delayed ACK enabled was the better performer

• Disabling delayed ACK at a 1ms RTO forces frequent overriding of the TCP congestion window on the sender side, due to the high volume of ACKs, resulting in fluctuations in the smoothed RTT

Page 18:

QUANTITATIVE MODEL

Net goodput = (S × D) / (L + R × r)

• D: the amount of data to be sent by each server (100 blocks of 256KB)
• L: the total transfer time of the workload without any RTO events
• R: the number of RTO events during the transfer
• S: the number of servers
• r: the value of the minimum RTO timer
• I: the inter-packet wait time

Modeling of R and I was done based on empirically observed behavior
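The formula translates directly into code. Note that this sketch follows the reconstruction above, which reads D as per-server data; that interpretation, and the example numbers, are assumptions:

```python
# Sketch of the quantitative goodput model as reconstructed above.
# Reading D as per-server data is an interpretation of the slide.
def modeled_goodput(S: int, D_bytes: float, L_sec: float,
                    R_events: int, r_sec: float) -> float:
    """Net goodput in bytes/sec: (S * D) / (L + R * r)."""
    return (S * D_bytes) / (L_sec + R_events * r_sec)

# Example: 10 servers, 100 blocks of 256KB each, a 2-second ideal
# transfer time, and 5 RTO events at the default 200 ms minimum RTO.
gp = modeled_goodput(S=10, D_bytes=100 * 256 * 1024,
                     L_sec=2.0, R_events=5, r_sec=0.200)
print(f"modeled goodput: {gp * 8 / 1e6:.0f} Mbps")
# Every RTO event adds a full r seconds of dead time to the transfer,
# which is why goodput falls so quickly as R grows.
```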

Page 19:

Key Observations

• A smaller minimum RTO timer value means a larger goodput value at the initial minimum

• The initial goodput minimum occurs at the same number of senders, regardless of the value of the minimum RTO timer

• The second-order goodput peak occurs at a higher number of senders for a larger RTO timer value

• The smaller the RTO timer value, the faster the rate of recovery between the goodput minimum and the second-order goodput maximum

• After the second-order goodput maximum, the slope of the goodput decrease is the same for different RTO timer values

Page 20:

Application level solution [5]

• No changes required to the TCP stack or network switches
• Based on scheduling server responses to the same data block so that no data loss occurs
• Caveats:
  – Genuine retransmissions can still cause cascading timeouts and congestion
  – Scheduling at the application level cannot be easily synchronized
  – Limited control over the transport layer

Page 21:

Application level solution

Page 22:

Application level solution

Page 23:

ICTCP - Incast Congestion Control for TCP in Data Center Networks [8]

• Features
  – Solution based on dynamically modifying the TCP receive window
  – Can be implemented on the receiver side only
  – Focuses on avoiding packet losses before incast congestion occurs
  – Test implementation on Windows NDIS
  – Novelties in the solution:
    • Uses the available bandwidth to coordinate the receive window increases across all incoming connections
    • Per-flow congestion control is performed independently in slotted time on the scale of the RTT
    • Receive window adjustment is based on the ratio of the difference between measured and expected throughput over the expected throughput

Page 24:

• Design considerations
  – The receiver knows how much throughput is achieved and how much bandwidth is available
  – While an overly controlled window mechanism may constrain TCP performance, a loosely controlled one does not prevent incast congestion
  – Only low-latency flows (under 2ms) are considered
  – The receive window increase is determined by the available bandwidth
  – Receive window based congestion control should operate per flow
  – A receive window based scheme should adjust the window according to both link congestion and application requirements

Page 25:

ICTCP Algorithm

• Control trigger: available bandwidth
  – Calculate the available bandwidth
  – Estimate the potential throughput increase per flow before increasing a receive window
  – Each time slot is divided into two sub-slots
  – For each network interface, measure the available bandwidth in the first sub-slot and compute the quota for window increases in the second sub-slot
  – Ensure the total increase across all receive windows is less than the total available bandwidth measured in the first sub-slot (see the sketch below)
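A minimal sketch of this trigger on one interface. The definition BWA = max(0, α·C − BWT) with α = 0.9 follows the ICTCP paper; the class structure and names are assumptions made for illustration:

```python
# Sketch of ICTCP's control trigger on one network interface, assuming
# the paper's available-bandwidth definition BWA = max(0, a*C - BWT).
ALPHA = 0.9  # fraction of link capacity ICTCP is willing to fill

class InterfaceQuota:
    def __init__(self, capacity_bps: float):
        self.capacity = capacity_bps
        self.quota = 0.0  # bandwidth left for receive window increases

    def end_first_subslot(self, measured_incoming_bps: float) -> None:
        # After measuring traffic in the first sub-slot, compute how
        # much headroom window increases may consume in the second.
        self.quota = max(0.0, ALPHA * self.capacity - measured_incoming_bps)

    def try_consume(self, throughput_increase_bps: float) -> bool:
        # A connection may only grow its window if the sum of all
        # increases stays within the headroom measured above.
        if throughput_increase_bps <= self.quota:
            self.quota -= throughput_increase_bps
            return True
        return False
```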

Page 26:

ICTCP Algorithm

• Per-connection control interval: 2*RTT
  – To estimate the throughput of a TCP connection for receive window adjustment, the shortest meaningful time scale is one RTT for that connection
  – The control interval for a TCP connection in ICTCP is 2*RTT:
    • one RTT of latency for the adjusted window to take effect
    • one additional RTT for measuring throughput with the newly adjusted window
  – For any TCP connection, if the current time is in the second global sub-slot and more than 2*RTT has passed since its last receive window adjustment, it may increase its window based on the newly observed TCP throughput and the current available bandwidth

Page 27:

ICTCP Algorithm

• Window adjustment on a single connection
  – The receive window is adjusted based on the connection's measured incoming throughput
    • Measured throughput reflects the current requirement of the application over that TCP connection
    • Expected throughput is the throughput expected on that TCP connection if it were constrained only by the receive window
  – Define the ratio of throughput difference d_b = (b_expected - b_measured) / b_expected, with thresholds γ1 < γ2
  – Make receive window adjustments based on the following conditions (sketched below):
    • d_b ≤ γ1: increase the receive window if it is now in the global second sub-slot and there is enough quota of available bandwidth on the network interface; decrease the quota correspondingly if the receive window is increased
    • d_b > γ2: decrease the receive window by one MSS if this condition holds for three continuous RTTs; the minimal receive window is 2*MSS
    • Otherwise, keep the current receive window
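A simplified sketch of that per-connection rule, reusing the InterfaceQuota sketch from the previous page. The thresholds γ1 = 0.1 and γ2 = 0.5 are the values reported in the ICTCP paper; the function shape and state handling are assumptions:

```python
# Simplified per-connection receive window update in the style of
# ICTCP. gamma1/gamma2 follow the paper; the rest is illustrative.
GAMMA1, GAMMA2 = 0.1, 0.5

def adjust_rwnd(rwnd: int, mss: int, rtt: float, measured_bps: float,
                quota: "InterfaceQuota", in_second_subslot: bool,
                over_gamma2_rtts: int) -> tuple:
    """Returns (new_rwnd, updated count of RTTs with d_b > gamma2)."""
    # Expected throughput: what the flow could do if only the receive
    # window limited it (never less than what was actually measured).
    expected_bps = max(measured_bps, rwnd * 8 / rtt)
    d_b = (expected_bps - measured_bps) / expected_bps

    if d_b <= GAMMA1:
        # Window is the bottleneck: grow by one MSS if the interface
        # quota allows it, charging the implied throughput increase.
        if in_second_subslot and quota.try_consume(mss * 8 / rtt):
            return rwnd + mss, 0
        return rwnd, 0
    if d_b > GAMMA2:
        # Window is far larger than needed; after three consecutive
        # RTTs, shrink by one MSS, but never below 2*MSS.
        if over_gamma2_rtts + 1 >= 3:
            return max(rwnd - mss, 2 * mss), 0
        return rwnd, over_gamma2_rtts + 1
    return rwnd, 0  # otherwise keep the current receive window
```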

Page 28:

ICTCP Algorithm

• Fairness controller for multiple connections
  – Fairness is considered only for low-latency flows
  – Decrease windows for fairness only when BWA < 0.2C
  – For a window decrease, cut the receive window by one MSS for some selected TCP connections:
    • select those connections whose receive window is larger than the average window value of all connections
  – For window increase, fairness is achieved automatically by the window adjustment above
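The fairness step fits in a few lines; this sketch (names and list-based structure are assumptions) trims one MSS only from above-average windows, and only when headroom drops below 0.2C:

```python
# Sketch of ICTCP's fairness controller across connections that share
# an interface of capacity C, triggered only when headroom is scarce.
def fairness_decrease(rwnds: list, mss: int,
                      available_bw: float, capacity: float) -> list:
    if available_bw >= 0.2 * capacity:
        return rwnds  # enough headroom: leave all windows alone
    avg = sum(rwnds) / len(rwnds)
    # Cut one MSS only from connections above the average window,
    # nudging the low-latency flows toward an equal share.
    return [max(w - mss, 2 * mss) if w > avg else w for w in rwnds]

# Example: only the 12000-byte window is trimmed.
print(fairness_decrease([12000, 3000, 3000], mss=1460,
                        available_bw=0.1e9, capacity=1e9))
```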

Page 29:

ICTCP Experimental Results

• Testbed
  – 47 servers
  – 1 LB4G 48-port Gigabit Ethernet switch
  – Gigabit Ethernet Broadcom NICs at the hosts
  – Windows Server 2008 R2 Enterprise 64-bit

Page 30: [Figure slide: ICTCP experimental results]

Page 31:

Issues with ICTCP

• ICTCP's scalability to a large number of TCP connections is an issue, because the receive window can decrease below 1 MSS, degrading TCP performance

• Extending ICTCP to handle congestion in general cases, where sender and receiver are not under the same switch and the bottleneck link is not the last hop to the receiver, remains open

• ICTCP's applicability to future high-bandwidth, low-latency networks is unclear

Page 32:

DCTCP

• Features
  – A TCP-like protocol for data centers
  – Uses ECN (Explicit Congestion Notification) to provide multi-bit feedback to the end hosts
  – The claim is that DCTCP provides better throughput than TCP while using 90% less buffer space
  – Provides high burst tolerance and low latency for short flows
  – Can also handle a 10X increase in foreground and background traffic without a significant performance hit

Page 33:

DCTCP

• Overview
  – Applications in data centers largely require:
    • low latency for short flows
    • high burst tolerance
    • high utilization for long flows
  – Short flows have real-time deadlines of approximately 10-100ms
  – High utilization for long flows is essential for continuously updating internal data structures
  – The study analyzed production traffic from approximately 6000 servers, about 150 TB of traffic, over a period of 1 month
  – Query traffic (of 2KB to 20KB) experiences the incast impairment

Page 34:

DCTCP

• Overview (contd.)
  – The proposed DCTCP uses the ECN capability available in most modern switches
  – Derives multi-bit feedback on congestion from the single-bit stream of ECN marks
  – The essence of the proposal is to keep switch buffer occupancies persistently low while maintaining high throughput
  – To control queue length at switches, it uses an Active Queue Management (AQM) approach with explicit feedback from congested switches
  – The claim is that only about 30 lines of code added to TCP and the setting of a single parameter on switches are needed
  – DCTCP focuses on 3 problems:
    • Incast (our area)
    • Queue buildup
    • Buffer pressure

Page 35:

DCTCP

• Algorithm
  – Concentrates mainly on the extent of congestion rather than just its presence
  – Derives multi-bit feedback from a single-bit sequence of marks
  – Three components of the algorithm:
    • Simple marking at the switch
    • ECN-Echo at the receiver
    • Controller at the sender

Page 36:

DCTCP

• Simple marking at the switch
  – An arriving packet is marked with the CE (Congestion Experienced) codepoint if the queue occupancy is greater than K (the marking threshold)
  – Marking is based on the instantaneous queue length, not the average

• ECN-Echo at the receiver
  – Normally in TCP, ECN-Echo is set on all packets until the receiver gets a CWR from the sender
  – A DCTCP receiver sends an ECN-Echo only if a CE codepoint is seen on the packet
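Both steps are simple enough to state in a few lines. This schematic assumes one decision per packet and deliberately ignores DCTCP's delayed-ACK state machine, which real receivers need to keep the echo accurate while ACKing every other packet:

```python
# Schematic of DCTCP's first two components (switch and receiver).
# Deliberately ignores the delayed-ACK state machine.
K = 20  # marking threshold, in packets (an assumed example value)

def switch_mark(queue_len_pkts: int) -> bool:
    """Switch: set CE iff the *instantaneous* queue exceeds K."""
    return queue_len_pkts > K

def receiver_echo(ce_marked: bool) -> bool:
    """DCTCP receiver: echo ECN only on packets that carry CE, rather
    than latching ECE until a CWR arrives, as standard ECN does."""
    return ce_marked

# A burst crossing the threshold is marked packet by packet, so the
# sender can later count exactly how many packets saw congestion:
for qlen in (5, 18, 21, 30):
    print(qlen, receiver_echo(switch_mark(qlen)))
```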

Page 37:

DCTCP

• Controller at the sender
  – The sender maintains an estimate α of the fraction of packets that are marked, updated once per window of data:
    • α ← (1 - g) × α + g × F, where F is the fraction of packets marked in the last window and g is a fixed weight
  – α close to 0 indicates low congestion; α close to 1 indicates high congestion
  – While TCP cuts its window in half, DCTCP uses α to determine the sender's window size:
    • cwnd ← cwnd × (1 - α/2)
Page 38:

DCTCP

• Modeling of when the window reaches W* (the point where the queue hits the marking threshold K)

• The maximum queue size Q_max depends on the number of synchronously sending servers N

• A lower bound for the marking threshold can be derived: K > (C × RTT) / 7, where C is the link capacity in packets per second
Page 39:

DCTCP

• How does DCTCP solve incast?
  – TCP suffers from timeouts when N > 10
  – DCTCP senders receive ECN marks and slow their rate
  – DCTCP still suffers timeouts when N is large enough to overwhelm a static buffer
  – The solution is dynamic buffering

Page 40:

Outline

• TCP Incast - Problem Description
• Motivation and challenges
• Proposed Solutions
• Evaluation of proposed solutions
• Conclusion
• References

Page 41:

Evaluation of proposed solutions

• Application level solution
  – Genuine retransmissions can still cause cascading timeouts and congestion
  – Scheduling at the application level cannot be easily synchronized
  – Limited control over the transport layer

• ICTCP - a solution that needs minimal change and is cost effective
  – Scalability to a large number of TCP connections is an issue
  – Extending ICTCP to handle congestion in general cases offers only a limited solution
  – ICTCP for future high-bandwidth, low-latency networks will need extra support from link layer technologies

• DCTCP - a solution that needs minimal change but requires switch support
  – DCTCP requires dynamic buffering for larger numbers of senders

Page 42:

Conclusion

• No solution completely solves the problem, beyond configuring a smaller RTO

• The solutions pay little attention to foreground and background traffic together

• We need solutions that are cost effective, require minimal changes to the environment, and, of course, solve incast!

Page 43:

References

1. Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, "Understanding TCP Incast Throughput Collapse in Datacenter Networks", in Proc. of ACM WREN, 2009.
2. S. Kulkarni and P. Agrawal, "A Probabilistic Approach to Address TCP Incast in Data Center Networks", Distributed Computing Systems Workshops (ICDCSW), 2011.
3. Peng Zhang, Hongbo Wang, and Shiduan Cheng, "Shrinking MTU to Mitigate TCP Incast Throughput Collapse in Data Center Networks", Communications and Mobile Computing (CMC), 2011.
4. Yan Zhang and N. Ansari, "On Mitigating TCP Incast in Data Center Networks", IEEE INFOCOM Proceedings, 2011.
5. Maxim Podlesny and Carey Williamson, "An Application-Level Solution for the TCP-Incast Problem in Data Center Networks", IWQoS '11: Proceedings of the 19th International Workshop on Quality of Service, IEEE, June 2011.
6. Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan, "Data Center TCP (DCTCP)", SIGCOMM '10: Proceedings of the ACM SIGCOMM, August 2010.
7. Hongyun Zheng, Changjia Chen, and Chunming Qiao, "Understanding the Impact of Removing TCP Binary Exponential Backoff in Data Centers", Communications and Mobile Computing (CMC), 2011.
8. Haitao Wu, Zhenqian Feng, Chuanxiong Guo, and Yongguang Zhang, "ICTCP: Incast Congestion Control for TCP in Data Center Networks", Co-NEXT '10: Proceedings of the 6th International Conference, ACM, November 2010.