TCP for Data center networks

Deepti Surjyendu Ray

Nov 01, 2014

Transcript
Page 1: TCP for Data center networks

TCP for Data center networks

Deepti Surjyendu Ray

Page 2: TCP for Data center networks

What is a Datacenter?

• A facility used for housing a large amount of computer and communications equipment, maintained by an organization for the purpose of handling the data necessary for its operations. (MSDN Glossary)

• A data center (sometimes spelled datacenter) is a centralized repository, either physical or virtual, for the storage, management, and dissemination of data and information organized around a particular body of knowledge or pertaining to a particular business.

Page 3: TCP for Data center networks

What is a Datacenter network?

• Data centers consist of:
  – server racks with servers (compute nodes/storage),
  – switches,
  – connecting links, along with their topology.

• The network architecture is typically a tree of routing and switching elements, with progressively more specialized and expensive equipment moving up the network hierarchy.

Page 4: TCP for Data center networks

Properties of a typical Datacenter network

• Characteristics of a datacenter network:
  – High fan-in of the tree.
  – High-bandwidth, low-latency workloads.
  – Clients that issue barrier-synchronized requests in parallel.
  – A relatively small amount of data per request.
  – Network constraint: small switch buffers.

Page 5: TCP for Data center networks

The TCP Incast problem

• Incast: TCP throughput collapse, i.e.
  – a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP.

• Leading to:
  – gross underutilization of link capacity in many-to-one communication patterns, as in data center networks.

Page 6: TCP for Data center networks

The root of TCP Incast

• Highly bursty, fast data transmissions overfill Ethernet switch buffers.

• This causes:
  – intense packet loss, which results in TCP timeouts.

• TCP timeouts last hundreds of milliseconds:
  – TCP timeout ≈ 100s of ms

• But the round-trip time of a data center network is around hundreds of microseconds:
  – RTT ≈ 100s of µs

Page 7: TCP for Data center networks

Round trips

• RTT << TCP timeout.
• The sender must wait for the TCP timeout, i.e. the retransmission timeout (RTO), before retransmitting.
• Coarse-grained RTOs reduce application throughput by up to 90%.
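The mismatch above can be made concrete with a back-of-the-envelope sketch. The 100 µs RTT and 200 ms RTO figures come from these slides; the ten-RTT transfer is an illustrative assumption, not a number from the deck:

```python
# Rough illustration of why a coarse RTO starves a datacenter link.
RTT_S = 100e-6   # typical datacenter round-trip time (~100 us, per the slides)
RTO_S = 200e-3   # default TCP minimum retransmission timeout (200 ms)

rtts_per_timeout = RTO_S / RTT_S
print(f"One timeout spans {rtts_per_timeout:.0f} RTTs")

# If a transfer needs 10 RTTs of useful work but suffers a single timeout,
# the link sits idle for almost the entire transfer:
useful_s = 10 * RTT_S
idle_fraction = RTO_S / (RTO_S + useful_s)
print(f"Link idle fraction: {idle_fraction:.1%}")   # well above the 90% quoted
```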

Page 8: TCP for Data center networks

Link Idle Time Due To Timeouts


Page 9: TCP for Data center networks

Induced timeouts due to barrier synchronization

• The client cannot make forward progress until the responses from every server for the current request have been received.
• Barrier-synchronized workloads are becoming increasingly common in today's commodity clusters, e.g.:
  – parallel reads/writes in cluster file systems like Lustre and Panasas;
  – search queries sent to dozens of nodes, with the results returned and sorted.

Page 10: TCP for Data center networks

Barrier Synchronization: a typical request pattern in Data Centers

Page 11: TCP for Data center networks

Idle Link issue !!

Page 12: TCP for Data center networks

Idle Link issue !!

Page 13: TCP for Data center networks

200 ms timeouts => Throughput Collapse

• Adding more servers to the network overflows the switch buffer.
• This overflow causes severe packet loss.
• Under such loss, TCP experiences a timeout that lasts a minimum of 200 ms.

Page 14: TCP for Data center networks

Proposed solution to TCP Incast

A bi-pronged attack on the problem entails:• System extensions to enable microsecond granularity retransmission

Fine grained TCP retransmission through high resolution Linux kernel timers.

Reducing RTOmin improves system throughput.• Removing acknowledgement delay

The client acknowledges every other packet, thus reducing network load.

Page 15: TCP for Data center networks

Motivation to resolve Incast using TCP

• TCP is well understood and mature, which facilitates its use as a transport protocol in data centers.

• Commodity Ethernet switches are cost-competitive with specialized technology such as InfiniBand.

• Because TCP is well understood, the TCP stack can be modified to overcome the limitation imposed by small switch buffers.

Page 16: TCP for Data center networks

Solution Domain

Page 17: TCP for Data center networks

Insight into fine-grained TCP

• Premise:
  – The timers must operate on a granularity close to the RTT of the network: hundreds of µs or less.

Page 18: TCP for Data center networks

RTO Estimation and Minimum Bound

• Jacobson's TCP RTO estimator:
  – RTO_estimated = SRTT + (4 × RTTVAR)

• Actual RTO = max(minRTO, RTO_estimated)

• Minimum RTO bound (minRTO) = 200 ms
  – TCP timer granularity
  – Safety (Allman99)
  – minRTO (200 ms) >> datacenter RTT (100 µs)
  – One TCP timeout lasts on the order of 1000 datacenter RTTs!
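The estimator on this slide can be sketched in a few lines. The slide only shows the final formulas; the smoothing gains (alpha = 1/8, beta = 1/4) and the update order are the standard ones from RFC 6298, assumed here:

```python
# Sketch of Jacobson's RTO estimator as summarized on the slide.
def update_rto(srtt, rttvar, rtt_sample, min_rto):
    """Update smoothed RTT and variance from one sample; return (srtt, rttvar, rto)."""
    alpha, beta = 1 / 8, 1 / 4
    # RFC 6298 order: update RTTVAR with the old SRTT, then update SRTT.
    rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
    srtt = (1 - alpha) * srtt + alpha * rtt_sample
    rto_estimated = srtt + 4 * rttvar
    rto = max(min_rto, rto_estimated)      # Actual RTO = max(minRTO, estimate)
    return srtt, rttvar, rto

# With a ~100 us RTT and the default 200 ms floor, the floor dominates:
srtt, rttvar, rto = update_rto(srtt=100e-6, rttvar=20e-6,
                               rtt_sample=100e-6, min_rto=200e-3)
print(rto)   # 0.2 -- the 200 ms minimum, vastly larger than the estimate
```

This makes the slide's point directly: the estimate (here ~160 µs) is irrelevant because minRTO clamps the result to 200 ms.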

Page 20: TCP for Data center networks

Evaluation workload

• The test client requests a block of data striped across n servers.

• Each server therefore responds with blocksize/n bytes of data.
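The striping above can be sketched as follows. The slides give only the blocksize/n relationship; the helper name and the remainder-handling policy are illustrative assumptions:

```python
# Sketch of the evaluation workload: a client requests a fixed block,
# striped evenly across n servers; each returns roughly blocksize/n bytes.
# The barrier: the request completes only when every fragment has arrived.

def fragment_sizes(blocksize: int, n_servers: int) -> list[int]:
    """Split blocksize bytes across n servers (remainder goes to the first few)."""
    base, extra = divmod(blocksize, n_servers)
    return [base + (1 if i < extra else 0) for i in range(n_servers)]

sizes = fragment_sizes(blocksize=1_000_000, n_servers=48)
print(sizes[0], sum(sizes))   # 20834 1000000
```

Note the consequence the deck emphasizes: all 48 fragments arrive nearly simultaneously, which is exactly the burst that overfills a small switch buffer.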

Page 21: TCP for Data center networks

µsecond Retransmission Timeouts (RTO)

RTO = max(minRTO, f(RTT))

Does eliminating RTOmin help avoid TCP incast collapse?

Page 22: TCP for Data center networks

Simulation result

Reducing the RTOmin in the simulation from the current default of 200 ms to microseconds improves goodput.

Page 23: TCP for Data center networks

Real world cluster

Experiments on a real cluster validate the simulation result: reducing the RTOmin improves goodput.


Page 25: TCP for Data center networks

TCP Requirements for µsecond RTO

• TCP must track RTT in microseconds:
  – requires efficient high-resolution kernel timers.

• Use the HPET (High Precision Event Timer) for efficient interrupt signaling.

• The HPET is a programmable hardware timer consisting of a free-running up-counter and several comparators and registers that modern operating systems can set.
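The requirement is microsecond-resolution time accounting. As a small user-space analogue (this is not the kernel's hrtimer/HPET interface, just an illustration of the resolution needed), a monotonic nanosecond clock can resolve RTT-scale intervals that a millisecond-granularity clock would round away:

```python
import time

def measure_us(fn) -> float:
    """Time fn() with a nanosecond monotonic clock; return elapsed microseconds."""
    t0 = time.monotonic_ns()
    fn()
    return (time.monotonic_ns() - t0) / 1_000

# Sub-millisecond work is invisible at jiffy granularity but measurable here:
elapsed = measure_us(lambda: sum(range(10_000)))
print(f"busy loop took {elapsed:.1f} us")
```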

Page 26: TCP for Data center networks

Modifications to the TCP stack

• The minimal modifications required of the TCP stack to support hrtimers are:
  – microsecond-resolution time accounting to track RTTs with greater precision;
  – redefinition of TCP constants;
  – replacement of low-resolution timers with hrtimers.

Page 27: TCP for Data center networks

µsecond TCP + no minRTO

For a 48-node cluster, providing TCP retransmissions at microsecond granularity eliminates incast collapse for up to 47 servers.

Page 28: TCP for Data center networks

Simulation: Scaling to thousands

In simulation, introducing a randomized component to the RTO desynchronizes retransmissions following timeouts and avoids goodput degradation for a large number of flows.
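The desynchronizing effect can be sketched as follows. The slide does not specify the exact randomization the authors used; the jitter range (one RTT) and the function name here are illustrative assumptions:

```python
import random

# Sketch of the randomized-RTO idea: add a random component to each flow's
# timeout so that many flows do not all retransmit in the same instant.

def randomized_rto(base_rto: float, jitter: float, rng: random.Random) -> float:
    """Spread a flow's retransmission time across [base_rto, base_rto + jitter)."""
    return base_rto + rng.random() * jitter

rng = random.Random(42)
timeouts = [randomized_rto(base_rto=1e-3, jitter=100e-6, rng=rng)
            for _ in range(1000)]

# Without jitter, all 1000 flows would fire at exactly 1 ms and collide
# again at the switch; with it, their retries spread out:
print(min(timeouts), max(timeouts))
```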

Page 29: TCP for Data center networks

Conclusion

• Microsecond-granularity TCP timeouts (with no minRTO) improve datacenter application response time and throughput.

• The change is safe for wide-area communication.

• This paper presented a practical, effective, and safe solution to eliminate TCP incast in data center environments:
  – microsecond-granularity TCP timeouts;
  – randomized retransmissions.

Page 30: TCP for Data center networks

Future Work

• The practical implementation of the proposed work has been shown on about 48 servers in a data center.

• Its practical implementation still needs to be evaluated on thousands of machines.

• Narrow down the TCP variables of interest for introducing microsecond granularity, to decrease the problem space.