TCP for Data center networks


Deepti Surjyendu Ray

What is a Datacenter?

• A facility used for housing a large amount of computer and communications equipment, maintained by an organization for the purpose of handling the data necessary for its operations. (MSDN Glossary)

• A data center (sometimes spelled datacenter) is a centralized repository, either physical or virtual, for the storage, management, and dissemination of data and information organized around a particular body of knowledge or pertaining to a particular business.

What is a Datacenter network?

• Data centers consist of:
  – server racks with servers (compute nodes/storage),
  – switches,
  – connecting links, along with their topology.

• The network architecture is typically a tree of routing and switching elements, with progressively more specialized and expensive equipment moving up the network hierarchy.

Properties of a typical Datacenter network

• Characteristics of a datacenter network:
  – High fan-in of the tree.
  – High-bandwidth, low-latency workloads.
  – Clients that issue barrier-synchronized requests in parallel.
  – A relatively small amount of data per request.
  – Network constraint: small switch buffers.

The TCP Incast problem

• Incast: TCP throughput collapse, i.e. a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP.

• It leads to gross under-utilization of link capacity in many-to-one communication networks, such as data center networks.

The root of TCP Incast

• Highly bursty, fast data transmissions overfill Ethernet switch buffers.

• This causes intense packet loss, which results in TCP timeouts.

• TCP timeouts last hundreds of milliseconds.
  – TCP timeout ≈ 100s of ms

• But the round-trip time of a data center network is around hundreds of microseconds.
  – RTT ≈ 100s of µs

Round trips!

• RTT << TCP timeout.
• The sender must wait for the TCP timeout, i.e. the retransmission timeout (RTO), before retransmitting.
• Coarse-grained RTOs reduce application throughput by 90%.

Link Idle Time Due To Timeouts


Induced timeouts due to barrier synchronization

• The client cannot make forward progress until the responses from every server for the current request have been received.
• Barrier-synchronized workloads are becoming increasingly common in today's commodity clusters, e.g.:
  – parallel reads/writes in cluster file systems such as Lustre and Panasas;
  – search queries sent to dozens of nodes, with the results returned to be sorted.
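The request pattern can be sketched in a few lines of C. This is only an illustrative stand-in, not the paper's evaluation harness: the server count and the usleep() "work" are made up, and a real client would receive blocksize/n bytes from each server over the network.

    /*
     * Minimal sketch of a barrier-synchronized read: the client asks
     * n "servers" for one fragment each and cannot issue the next
     * request until every fragment has arrived. Server work is faked
     * with usleep(); pthread_join acts as the barrier.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NSERVERS 4                       /* illustrative value */

    static void *serve_fragment(void *arg)
    {
        int id = *(int *)arg;
        usleep(100 + 50 * id);               /* stand-in for sending blocksize/n bytes */
        printf("fragment %d received\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t srv[NSERVERS];
        int ids[NSERVERS];

        for (int req = 0; req < 3; req++) {  /* three consecutive block requests */
            for (int i = 0; i < NSERVERS; i++) {
                ids[i] = i;
                pthread_create(&srv[i], NULL, serve_fragment, &ids[i]);
            }
            /* barrier: the client blocks until ALL fragments are in */
            for (int i = 0; i < NSERVERS; i++)
                pthread_join(srv[i], NULL);
            printf("request %d complete, issuing next request\n", req);
        }
        return 0;
    }

If even one fragment is delayed by a 200 ms timeout, the whole loop iteration stalls, which is exactly the idle-link effect discussed above.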

Barrier Synchronization: a typical request pattern in Data Centers

Idle Link issue

200 ms timeouts → Throughput Collapse

• Adding more servers to the network causes the switch buffer to overflow.
• This overflow causes severe packet loss.
• Under packet loss, TCP experiences a timeout that lasts a minimum of 200 ms.

Proposed solution to TCP Incast

A two-pronged attack on the problem entails:

• System extensions to enable microsecond-granularity retransmissions:
  – fine-grained TCP retransmissions through high-resolution Linux kernel timers;
  – reducing RTOmin improves system throughput.

• Removing acknowledgement delay:
  – with delayed ACKs, the client acknowledges only every other packet to reduce network load; this delay is removed.
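As a user-space illustration (this is not the kernel-side change described in the paper), Linux lets an application suppress delayed ACKs on a socket with the TCP_QUICKACK option. The option is not sticky, so the hypothetical helper below re-arms it before each read:

    /* Illustrative only: disable delayed ACKs on a connected TCP socket.
     * TCP_QUICKACK is Linux-specific and is not permanent, so it is
     * re-enabled before every read on the socket. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <stdio.h>

    static ssize_t read_with_quickack(int fd, void *buf, size_t len)
    {
        int one = 1;

        /* Ask the kernel to ACK immediately instead of delaying. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) < 0)
            perror("setsockopt(TCP_QUICKACK)");

        return read(fd, buf, len);
    }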

Motivation to resolve Incast using TCP

• TCP is well understood and mature, which makes it attractive as a transport protocol in data centers.

• Commodity Ethernet switches are cost-competitive with specialized technologies such as InfiniBand.

• Because TCP is well understood, we can harness the TCP stack and modify it to overcome the limitation imposed by the small buffers in the switches.

Solution Domain

Insight into fine-grained TCP

• Premise: the timers must operate on a granularity close to the RTT of the network, i.e. hundreds of microseconds or less.

• Commodity Ethernet switches are cost-competitive with specialized technologies such as InfiniBand.

• Because TCP is well understood, we can harness the TCP stack and modify it to overcome the limitation imposed by the small buffers in the switches.

RTO Estimation and Minimum Bound

• Jacobson's TCP RTO estimator:
  – RTO_Estimated = SRTT + (4 × RTTVAR)

• Actual RTO = max(minRTO, RTO_Estimated)

• Minimum RTO bound (minRTO) = 200 ms
  – TCP timer granularity
  – Safety (Allman99)
  – minRTO (200 ms) >> datacenter RTT (100 µs)
  – 1 TCP timeout lasts ~1000 datacenter RTTs!
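A compact sketch of this estimator, using the standard RFC 6298 gains (α = 1/8, β = 1/4) and first-sample initialization; the MIN_RTO_US constant and the sample RTT values are placeholders for illustration:

    /* Sketch of the Jacobson/RFC 6298-style RTO computation.
     * All values are in microseconds. */
    #include <math.h>
    #include <stdio.h>

    #define MIN_RTO_US 200000.0              /* minRTO = 200 ms, in microseconds */

    struct rto_state {
        double srtt;                         /* smoothed RTT */
        double rttvar;                       /* RTT variation */
        int    initialized;
    };

    static double rto_update(struct rto_state *s, double rtt_sample)
    {
        const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0;

        if (!s->initialized) {               /* first sample */
            s->srtt = rtt_sample;
            s->rttvar = rtt_sample / 2.0;
            s->initialized = 1;
        } else {
            s->rttvar = (1 - beta) * s->rttvar + beta * fabs(s->srtt - rtt_sample);
            s->srtt   = (1 - alpha) * s->srtt + alpha * rtt_sample;
        }

        double rto_estimated = s->srtt + 4.0 * s->rttvar;
        /* Actual RTO = max(minRTO, RTO_Estimated) */
        return rto_estimated > MIN_RTO_US ? rto_estimated : MIN_RTO_US;
    }

    int main(void)
    {
        struct rto_state s = {0};
        double samples[] = {120.0, 90.0, 150.0, 110.0};   /* RTTs in µs */

        for (int i = 0; i < 4; i++)
            printf("RTT = %6.1f us  ->  RTO = %8.1f us\n",
                   samples[i], rto_update(&s, samples[i]));
        return 0;
    }

With 100 µs-scale RTTs the estimated RTO stays tiny, so the 200 ms floor dominates: every timeout costs roughly a thousand datacenter round trips.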


Evaluation workload

• The test client requests a block of data striped across "n" servers.

• Each server therefore responds with blocksize/n bytes of data.

µsecond Retransmission Timeouts (RTO)

RTO = max( minRTO, f(RTT) )

Does eliminating RTOmin help avoid TCP incast collapse?

Simulation result

Reducing RTOmin in the simulation from the current default of 200 ms down to microseconds improves goodput.

Real world cluster

Experiments on a real cluster validate the simulation result: reducing RTOmin improves goodput.


TCP Requirements for µsecond RTO

• TCP must track RTT in microseconds.
  – This requires efficient high-resolution kernel timers.

• Use the HPET (High Precision Event Timer) for efficient interrupt signaling.

• The HPET is a programmable hardware timer consisting of a free-running up-counter and several comparators and registers, which modern operating systems can set.

Modifications to the TCP stack

• The minimal modifications required of the TCP stack to support hrtimers are:
  – microsecond-resolution time accounting, to track RTTs with greater precision;
  – redefinition of TCP constants;
  – replacement of low-resolution timers with hrtimers.
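The paper's actual TCP patch is not reproduced here; as a rough illustration of the hrtimer API involved, a minimal out-of-tree kernel module can arm a microsecond-granularity timer as sketched below. The 200 µs timeout and the logging callback are purely illustrative.

    /* Illustrative out-of-tree module, NOT the paper's TCP patch: arms one
     * hrtimer with a 200 us "retransmission timeout" and logs when it fires. */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/hrtimer.h>
    #include <linux/ktime.h>

    static struct hrtimer rto_timer;

    static enum hrtimer_restart rto_expired(struct hrtimer *t)
    {
        pr_info("hrtimer_demo: 200 us RTO fired, would retransmit here\n");
        return HRTIMER_NORESTART;            /* one-shot timer */
    }

    static int __init hrtimer_demo_init(void)
    {
        hrtimer_init(&rto_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        rto_timer.function = rto_expired;
        /* microsecond-granularity timeout: 200 us instead of 200 ms */
        hrtimer_start(&rto_timer, ktime_set(0, 200 * NSEC_PER_USEC),
                      HRTIMER_MODE_REL);
        return 0;
    }

    static void __exit hrtimer_demo_exit(void)
    {
        hrtimer_cancel(&rto_timer);
    }

    module_init(hrtimer_demo_init);
    module_exit(hrtimer_demo_exit);
    MODULE_LICENSE("GPL");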

µsecond TCP + no minRTO

For a 48-node cluster, providing TCP retransmissions at microsecond granularity eliminates incast collapse for up to 47 servers.

Simulation: Scaling to thousands

In simulation, introducing a randomized component to the RTO desynchronizes retransmissions following timeouts and avoids goodput degradation for large numbers of flows.
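The exact randomization used in the paper is not reproduced here; one plausible form, shown only as a sketch, scatters each computed timeout uniformly over [RTO, 1.5 × RTO] so that flows whose packets were dropped together do not all retransmit at the same instant (the 0.5 spread factor is an assumption):

    /* Sketch: add a random component to the RTO to desynchronize
     * retransmissions after a shared loss event. */
    #include <stdio.h>
    #include <stdlib.h>

    static double randomized_rto(double rto_us)
    {
        double jitter = (double)rand() / RAND_MAX;   /* uniform in [0, 1] */
        return rto_us * (1.0 + 0.5 * jitter);        /* uniform in [RTO, 1.5*RTO] */
    }

    int main(void)
    {
        srand(42);                                   /* fixed seed for repeatability */
        for (int flow = 0; flow < 5; flow++)
            printf("flow %d: base RTO 200.0 us -> randomized %.1f us\n",
                   flow, randomized_rto(200.0));
        return 0;
    }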

Conclusion

• Microsecond-granularity TCP timeouts (with no minRTO) improve datacenter application response time and throughput.

• The change is safe for wide-area communication.

• This paper presented a practical, effective, and safe solution to eliminate TCP incast in data center environments:
  – microsecond-granularity TCP timeouts;
  – randomized retransmissions.

Future Work

• The practical implementation of the proposed work has been demonstrated on about 48 servers in a data center.

• Its practical implementation still needs to be evaluated on thousands of machines.

• Narrow down the TCP variables of interest for introducing microsecond granularity, to reduce the problem space.
