High-Performance Transport Protocols for Data-Intensive World-Wide Grids


NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

High-Performance Transport Protocols for Data-Intensive World-Wide Grids

S. Ravot, Caltech, USA
T. Kelly, University of Cambridge, UK
J.P. Martin-Flatin, CERN, Switzerland


Outline

Overview of DataTAG project
Problems with TCP in data-intensive Grids:
problem statement
analysis and characterization
Solutions: Scalable TCP, GridDT
Future Work


Overview of DataTAG Project


Member Organizations

http://www.datatag.org/


Project Objectives

Build a testbed to experiment with massive file transfers (TBytes) across the Atlantic
Provide high-performance protocols for gigabit networks underlying data-intensive Grids
Guarantee interoperability between major HEP Grid projects in Europe and the USA


DataTAG Testbed

[Diagram: testbed topology between Geneva and Chicago. Each side has a Cisco 76xx router (7606 in Geneva, 7609 in Chicago), a Juniper M10, an Alcatel 7770 and an Extreme switch (S1i in Geneva, S5i in Chicago) feeding farms of hosts; transatlantic connectivity via STM-16 circuits (DTag; Colt for backup + projects) and an STM-64 (GC) over ONS 15454 and Alcatel 1670 SDH/SONET equipment; peerings with SURFNET, CESNET, VTHD/INRIA, GEANT and CNAF; link types: 1000baseSX, 1000baseT, 10GbaseLX, SDH/SONET, CCC tunnel]

Diagram: Edoardo Martelli


Records Beaten Using DataTAG Testbed

Internet2 IPv4 land speed record (February 27, 2003): 10,037 km at 2.38 Gbit/s for 3,700 s; MTU: 9,000 Bytes

Internet2 IPv6 land speed record (May 6, 2003): 7,067 km at 983 Mbit/s for 3,600 s; MTU: 9,000 Bytes

http://lsr.internet2.edu/


Network Research Activities

Enhance performance of network protocols for massive file transfers:
Data-transport layer: TCP, UDP, SCTP (rest of this talk)
QoS: LBE (Scavenger), Equivalent DiffServ (EDS)
Bandwidth reservation: AAA-based bandwidth on demand; lightpaths managed as Grid resources
Monitoring


Problems with TCP in Data-Intensive Grids


Problem Statement

End-user’s perspective: using TCP as the data-transport protocol for Grids leads to poor bandwidth utilization in fast WANs

Network protocol designer’s perspective: TCP is inefficient in high bandwidth*delay networks because:
few TCP implementations have been tuned for gigabit WANs
TCP was not designed with gigabit WANs in mind


Design Problems (1/2)

TCP’s congestion control algorithm (AIMD) is not suited to gigabit networks

Due to TCP’s limited feedback mechanisms, line errors are interpreted as congestion: bandwidth utilization is reduced when it shouldn’t be

RFC 2581 (which gives the formula for increasing cwnd) “forgot” delayed ACKs: loss recovery time is twice as long as it should be


Design Problems (2/2)

TCP requires that ACKs be sent at most every second segment:
Causes ACK bursts
Bursts are difficult for the kernel and NIC to handle


AIMD (1/2)

Van Jacobson, SIGCOMM 1988

Congestion avoidance algorithm:
For each ACK in an RTT without loss, increase:
cwnd_{i+1} = cwnd_i + 1/cwnd_i
For each window experiencing loss, decrease:
cwnd_{i+1} = cwnd_i / 2

Slow-start algorithm: increase cwnd by one MSS per ACK until ssthresh is reached
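The two update rules can be sketched in a few lines of Python (an illustrative model with cwnd counted in segments, not a kernel implementation):

```python
def aimd_on_ack(cwnd):
    """Additive increase: each ACK adds 1/cwnd, so a full window of
    ACKs (one loss-free RTT) grows cwnd by about one segment."""
    return cwnd + 1.0 / cwnd

def aimd_on_loss(cwnd):
    """Multiplicative decrease: halve cwnd for a window with loss."""
    return cwnd / 2.0

# One loss-free RTT delivers ~cwnd ACKs:
cwnd = 100.0
for _ in range(100):
    cwnd = aimd_on_ack(cwnd)
print(round(cwnd, 1))       # ~101 segments after one RTT

print(aimd_on_loss(cwnd))   # roughly halved after a single loss
```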


AIMD (2/2)

Additive Increase:
A TCP connection slowly increases its bandwidth utilization in the absence of loss, forever, unless we run out of send/receive buffers or detect a packet loss
TCP is greedy: no attempt to reach a stationary state

Multiplicative Decrease:
A TCP connection reduces its bandwidth utilization drastically whenever a packet loss is detected
Assumption: line errors are negligible, hence packet loss means congestion


Congestion Window (cwnd)

[Plot: cwnd over time, showing the slow-start phase followed by congestion avoidance]


Disastrous Effect of Packet Loss on TCP in Fast WANs (1/2)

[Plot: AIMD; C = 1 Gbit/s, MSS = 1,460 Bytes]


Disastrous Effect of Packet Loss on TCP in Fast WANs (2/2)

Long time to recover from a single loss:
TCP should react to congestion rather than packet loss: line errors and transient faults in equipment are no longer negligible in fast WANs
TCP should recover more quickly from a loss

TCP is particularly sensitive to packet loss in fast WANs (i.e., when both cwnd and RTT are large)


Characterization of the Problem (1/2)

The responsiveness measures how quickly we go back to using the network link at full capacity after experiencing a loss (i.e., loss recovery time if loss occurs when bandwidth utilization = network link capacity):

responsiveness = C . RTT^2 / (2 . inc)

[Plot: TCP responsiveness, time (s) vs. RTT (ms), for C = 622 Mbit/s, C = 2.5 Gbit/s and C = 10 Gbit/s]


Characterization of the Problem (2/2)

Capacity                         RTT          # inc      Responsiveness
9.6 kbit/s (typ. WAN in 1988)    max: 40 ms   1          0.6 ms
10 Mbit/s (typ. LAN in 1988)     max: 20 ms   8          ~150 ms
100 Mbit/s (typ. LAN in 2003)    max: 5 ms    20         ~100 ms
622 Mbit/s                       120 ms       ~2,900     ~6 min
2.5 Gbit/s                       120 ms       ~11,600    ~23 min
10 Gbit/s                        120 ms       ~46,200    ~1h 30min

inc size = MSS = 1,460 Bytes
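These figures can be reproduced approximately from responsiveness = C . RTT^2 / (2 . inc). A sketch in Python, with inc = one MSS expressed in bits; the slide presumably used slightly different rounding or packet-size accounting, so expect agreement to within roughly 15%:

```python
def responsiveness(capacity_bps, rtt_s, mss_bytes=1460):
    """Loss recovery time rho = C * RTT^2 / (2 * inc), where the
    per-RTT increment inc is one MSS, converted to bits."""
    inc_bits = mss_bytes * 8
    return capacity_bps * rtt_s ** 2 / (2 * inc_bits)

for capacity, label in [(622e6, "622 Mbit/s"),
                        (2.5e9, "2.5 Gbit/s"),
                        (10e9, "10 Gbit/s")]:
    minutes = responsiveness(capacity, 0.120) / 60
    print(f"{label}: ~{minutes:.0f} min")
```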


Congestion vs. Line Errors

Throughput    Required Bit Loss Rate    Required Packet Loss Rate
10 Mbit/s     2 x 10^-8                 2 x 10^-4
100 Mbit/s    2 x 10^-10                2 x 10^-6
2.5 Gbit/s    3 x 10^-13                3 x 10^-9
10 Gbit/s     2 x 10^-14                2 x 10^-10

RTT = 120 ms, MTU = 1,500 Bytes, AIMD

At gigabit speed, the loss rate required for packet loss to be ascribed only to congestion is unrealistic with AIMD
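These thresholds follow from inverting the Mathis et al. SIGCOMM ’97 throughput model, BW ≈ (MSS/RTT) . sqrt(3/2) / sqrt(p). A sketch in Python; the exact constant the slide assumed is not stated, so the results match the table only to within a small factor:

```python
def required_packet_loss_rate(throughput_bps, rtt_s, mss_bytes=1460):
    """Invert BW = (MSS / RTT) * sqrt(3/2) / sqrt(p) for the packet
    loss rate p that still sustains the given throughput."""
    mss_bits = mss_bytes * 8
    return 1.5 * (mss_bits / (rtt_s * throughput_bps)) ** 2

for bw, label in [(10e6, "10 Mbit/s"), (100e6, "100 Mbit/s"),
                  (2.5e9, "2.5 Gbit/s"), (10e9, "10 Gbit/s")]:
    print(f"{label}: p = {required_packet_loss_rate(bw, 0.120):.1e}")
```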


Single TCP Stream Performance under Periodic Losses

[Plots: bandwidth utilization (%) and throughput (Mbit/s) vs. packet loss frequency (%), for the Mathis SIGCOMM ’97 model, a WAN (RTT = 120 ms) and a LAN (RTT = 0.04 ms); MSS = 1,460 Bytes]

At a loss rate of 0.01%: LAN bandwidth utilization = 99%, WAN bandwidth utilization = 1.2%


Solutions


What Can We Do?

To achieve higher throughputs over high bandwidth*delay networks, we can:
Fix AIMD
Change the congestion avoidance algorithm: Kelly’s Scalable TCP, Ravot’s GridDT
Use larger MTUs
Change the initial setting of ssthresh
Avoid losses in end hosts


Delayed ACKs with AIMD

RFC 2581 (the spec defining TCP’s AIMD congestion control algorithm) erred:
Implicit assumption: one ACK per packet
In reality: one ACK every second packet with delayed ACKs
cwnd_{i+1} = cwnd_i + SMSS x SMSS / cwnd_i

Responsiveness multiplied by two: makes a bad situation worse in fast WANs

Problem fixed by ABC in RFC 3465 (Feb 2003); not implemented in Linux 2.4.21
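A toy model (illustrative only, not the RFC's pseudocode) makes the factor of two visible: the congestion avoidance gain per RTT is proportional to the number of ACKs received, so halving the ACK rate halves the gain and doubles the recovery time:

```python
def rtts_to_recover(cwnd, target, acks_per_segment):
    """Count RTTs until cwnd regrows to `target`, when each delivered
    segment generates `acks_per_segment` ACKs (1.0 normally, 0.5 with
    delayed ACKs) and each ACK adds 1/cwnd per RFC 2581."""
    rtts = 0
    while cwnd < target:
        for _ in range(int(cwnd * acks_per_segment)):
            cwnd += 1.0 / cwnd
        rtts += 1
    return rtts

plain = rtts_to_recover(500.0, 1000.0, 1.0)    # one ACK per segment
delayed = rtts_to_recover(500.0, 1000.0, 0.5)  # delayed ACKs
print(plain, delayed, delayed / plain)          # ratio is ~2
```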


Delayed ACKs with AIMD and ABC


Scalable TCP: Algorithm

For cwnd > lwnd, replace AIMD with the new algorithm:
For each ACK in an RTT without loss:
cwnd_{i+1} = cwnd_i + a
For each window experiencing loss:
cwnd_{i+1} = cwnd_i - (b x cwnd_i)

Kelly’s proposal during his internship at CERN:
(lwnd, a, b) = (16, 0.01, 0.125)
Trade-off between fairness, stability, variance and convergence
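Kelly’s two rules translate directly into code. A sketch with the proposed constants (lwnd, a, b) = (16, 0.01, 0.125); below lwnd the connection falls back to standard AIMD behaviour:

```python
LWND, A, B = 16, 0.01, 0.125

def scalable_on_ack(cwnd):
    """Above lwnd, each ACK adds the constant a, so a full window of
    ACKs grows cwnd by a * cwnd: multiplicative increase per RTT."""
    return cwnd + (A if cwnd > LWND else 1.0 / cwnd)

def scalable_on_loss(cwnd):
    """Above lwnd, back off by the fraction b instead of one half."""
    return cwnd - B * cwnd if cwnd > LWND else cwnd / 2.0

cwnd = 100.0
for _ in range(100):            # one loss-free RTT: ~cwnd ACKs
    cwnd = scalable_on_ack(cwnd)
print(cwnd)                     # ~101: +1% per RTT at any scale
print(scalable_on_loss(cwnd))   # only a 12.5% backoff on loss
```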


Scalable TCP: lwnd


Scalable TCP: Responsiveness Independent of Capacity


Scalable TCP: Improved Responsiveness

Responsiveness for RTT = 200 ms and MSS = 1,460 Bytes:
Scalable TCP: ~3 s
AIMD:
~3 min at 100 Mbit/s
~1h 10min at 2.5 Gbit/s
~4h 45min at 10 Gbit/s

Patch against Linux kernel 2.4.19: http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
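The ~3 s figure is easy to check: after a backoff of fraction b, cwnd regrows by the factor (1 + a) per RTT, so recovery takes ln(1/(1 - b)) / ln(1 + a) RTTs, independent of cwnd and hence of link capacity (a back-of-the-envelope sketch, not from the patch itself):

```python
import math

def scalable_recovery_rtts(a=0.01, b=0.125):
    """RTTs for cwnd to grow from (1-b)*W back to W at factor (1+a)
    per RTT: (1+a)^n = 1/(1-b)  =>  n = ln(1/(1-b)) / ln(1+a)."""
    return math.log(1 / (1 - b)) / math.log(1 + a)

n = scalable_recovery_rtts()
print(f"~{n:.1f} RTTs, i.e. {n * 0.200:.1f} s at RTT = 200 ms")
```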


Scalable TCP vs. AIMD: Benchmarking

Number of flows   2.4.19 TCP   2.4.19 TCP + new dev driver   Scalable TCP
1                 7            16                            44
2                 14           39                            93
4                 27           60                            135
8                 47           86                            140
16                66           106                           142

Bulk throughput tests with C = 2.5 Gbit/s. Flows transfer 2 GBytes and start again for 20 min.


GridDT: Algorithm

Congestion avoidance algorithm:
For each ACK in an RTT without loss, increase:
cwnd_{i+1} = cwnd_i + A/cwnd_i

By modifying A dynamically according to RTT, GridDT guarantees fairness among TCP connections:
A1/A2 = (RTT1/RTT2)^2
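The fairness rule can be sketched as follows (illustrative Python, not the actual GridDT patch): scaling the additive increment with RTT^2 compensates for the fact that a long-RTT flow gets fewer window updates per unit of wall-clock time:

```python
def griddt_on_ack(cwnd, a):
    """Congestion avoidance: each ACK adds A/cwnd, i.e. A segments
    per RTT instead of AIMD's fixed one segment per RTT."""
    return cwnd + a / cwnd

def fair_increment(a_ref, rtt_ref, rtt):
    """A1/A2 = (RTT1/RTT2)^2: give each flow the same bandwidth
    growth rate per unit of wall-clock time."""
    return a_ref * (rtt / rtt_ref) ** 2

# With CERN-StarLight (RTT = 117 ms, A = 3) as the reference, the
# CERN-Sunnyvale flow (RTT = 181 ms) should use:
print(round(fair_increment(3, 0.117, 0.181), 2))
# ~7.2, close to the A1 = 7 used in the experiment
```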


AIMD: RTT Bias

[Diagram: two TCP streams from hosts at CERN (1 GE each) through a GE switch into a shared 1 Gbit/s bottleneck, then over a POS 2.5 Gbit/s link, a POS 10 Gbit/s link and 10GE to hosts at StarLight and Sunnyvale]

Two TCP streams share a 1 Gbit/s bottleneck
CERN-Sunnyvale: RTT = 181 ms. Avg. throughput over a period of 7,000 s = 202 Mbit/s
CERN-StarLight: RTT = 117 ms. Avg. throughput over a period of 7,000 s = 514 Mbit/s
MTU = 9,000 Bytes. Link utilization = 72%

[Plot: throughput (Mbit/s) of the two streams with different RTTs sharing the 1 Gbit/s bottleneck, over 7,000 s, with the average over the life of each connection; RTT = 181 ms vs. RTT = 117 ms]


GridDT Fairer than AIMD

[Plot: throughput (Mbit/s) of two GridDT streams with different RTTs sharing a 1 Gbit/s bottleneck, over 6,000 s, with the average over the life of each connection; A = 7 for RTT = 181 ms, A = 3 for RTT = 117 ms]

[Diagram: same CERN-StarLight/Sunnyvale setup as on the previous slide, with A1 = 7 on the RTT = 181 ms stream and A2 = 3 on the RTT = 117 ms stream]

CERN-Sunnyvale: RTT = 181 ms. Additive inc. A1 = 7. Avg. throughput = 330 Mbit/s
CERN-StarLight: RTT = 117 ms. Additive inc. A2 = 3. Avg. throughput = 388 Mbit/s
MTU = 9,000 Bytes. Link utilization = 72%

(RTT1/RTT2)^2 = (181/117)^2 = 2.39
A1/A2 = 7/3 = 2.33


Larger MTUs (1/2)

Advocated by Mathis

Experimental environment:
Linux 2.4.21
SysKonnect device driver 6.12
Traffic generated by iperf: average throughput over the last 5 seconds
Single TCP stream
RTT = 119 ms
Duration of each test: 2 hours
Transfers from Chicago to Geneva
MTUs: POS MTU: 9,180 Bytes; MTU on the NIC: 9,000 Bytes


Larger MTUs (2/2)

[Plot: TCP max 990 Mbit/s with MTU = 9,000 vs. TCP max 940 Mbit/s with MTU = 1,500]


Related Work

Floyd: HighSpeed TCP
Low: FAST TCP
Katabi: XCP
Web100 and Net100 projects
PFLDnet 2003 workshop: http://www.datatag.org/pfldnet2003/


Research Directions

Compare performance of TCP variants
Investigate the proposal by Shorten, Leith, Foy and Kilduff
More stringent definition of congestion: lose more than 1 packet per RTT
ACK more than two packets in one go: decrease ACK bursts
SCTP vs. TCP
