1
High performance Throughput
Les Cottrell – SLAC
Lecture # 5a presented at the 26th International Nathiagali Summer College on Physics and Contemporary Needs, 25th June – 14th July, Nathiagali, Pakistan
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
2
How to measure
• Selected about a dozen major collaborator sites in California, Colorado, Illinois, FR, CH, UK over the last 9 months
  – Of interest to SLAC
  – Can get logon accounts
• Use iperf (see the sketch below)
  – Choose window size and # of parallel streams
  – Run for 10 seconds together with ping (loaded)
  – Stop iperf, run ping (unloaded) for 10 seconds
  – Change window or number of streams & repeat
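A minimal sketch of this measurement loop, assuming the standard iperf and ping command-line tools; the host name, window sizes and stream counts are illustrative placeholders, not values from the talk:

```python
import itertools
import subprocess

HOST = "remote.collaborator.example"      # placeholder, not a real site
WINDOWS = ["8K", "16K", "32K", "64K", "100K", "1M"]
STREAMS = [1, 2, 4, 8, 16]

def ping(host, count=10):
    """Roughly `count` seconds of ping; returns the raw output."""
    return subprocess.run(["ping", "-c", str(count), host],
                          capture_output=True, text=True).stdout

for window, n in itertools.product(WINDOWS, STREAMS):
    # iperf client: -w window size, -P parallel streams, -t 10 seconds.
    iperf = subprocess.Popen(
        ["iperf", "-c", HOST, "-w", window, "-P", str(n), "-t", "10"],
        stdout=subprocess.PIPE, text=True)
    loaded = ping(HOST)                   # ping while iperf loads the path
    report, _ = iperf.communicate()       # collect the iperf report
    unloaded = ping(HOST)                 # ping again, path unloaded
    print(window, n, report, loaded, unloaded)
```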
3
Hi-perf = big windows & multiple streams
[Chart: throughput vs. number of streams for window sizes 8kB, 16kB, 32kB, 64kB, 100kB and 1MB; throughput improves ~linearly with streams for small windows]
4
E.g. thruput vs windows & streams
[Charts: throughput (Mbits/s) vs. window size and number of streams for ANL (Colorado), Caltech, IN2P3 (FR), CERN (CH), INFN (IT) and Daresbury (UK)]
5
Progress towards goal: 100 Mbytes/s Site-to-Site
• Focus on SLAC – Caltech over NTON
• Using NTON wavelength division fibers up & down the West Coast of the US
• Replaced Exemplar with 8*OC3 & Suns with Pentium IIIs & OC12 (622Mbps)
• SLAC Cisco 12000 with OC48 (2.4Gbps) and 2 × OC12
• Caltech Juniper M160 & OC48
• ~500 Mbits/s single stream achieved recently over OC12
6
SC2000 WAN Challenge
• SC2000, Dallas to SLAC RTT ~ 48 msec
  – SLAC/FNAL booth: Dell PowerEdge PIII 2 * 550MHz with 64-bit PCI + Dell 850MHz, both running Linux, each with GigE, connected to a Cat 6009 with 2 GigE bonded to the Extreme SC2000 floor switch
  – NTON: OC48 to GSR to Cat 5500 GigE to Sun E4500 4*460MHz and Sun E4500 6*336MHz
Iperf throughput conclusions 1/2
• Can saturate bottleneck links
• For a given iperf measurement, streams share throughput equally
• For small window sizes throughput increases linearly with the number of streams
• Predicted optimum window sizes can be large (> 1 Mbyte)
• Need > 1 stream to get optimum performance
• Can get close to max thruput with small (<=32kByte) windows given sufficient (5-10) streams
• Improvements of 5 to 60 times in thruput by using multiple streams & larger windows
• Loss is not sensitive to throughput
8
Iperf thruput conclusions 2/2
• For a fixed streams*window product, streams are more effective than window size
• There is an optimum number of streams above which performance flattens out
• Compare with measured results
  – Agrees well
  – Confirms observations (e.g. linear growth in throughput for small window sizes as the number of flows increases); see the sketch below
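The linear growth and the streams-vs-window tradeoff both follow from a simple aggregate-window model; a minimal sketch (my model, not from the slides): n streams with window W behave roughly like one stream with window n*W, so throughput grows linearly with n until the bottleneck is reached.

```python
def predicted_throughput_mbps(streams, window_bytes, rtt_s, bottleneck_mbps):
    """Throughput ~ min(n*W/RTT, bottleneck capacity); illustrative model."""
    window_limited = streams * window_bytes * 8 / rtt_s / 1e6  # Mbits/s
    return min(window_limited, bottleneck_mbps)

# Example: 64 kByte windows over a 48 ms RTT path with a 622 Mbps bottleneck.
for n in (1, 2, 4, 8, 16):
    print(n, round(predicted_throughput_mbps(n, 64 * 1024, 0.048, 622), 1))
```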
10
Agreement of ns2 with observed
11
Ns-2 thruput & loss predictions
• Indicates that on an unloaded link one can get 70% of the available bandwidth without causing noticeable packet loss
• Can get over 80-90% of the available bandwidth
• Can overdrive: no extra throughput BUT extra loss
12
Simulator benefits
• No traffic on the network (n.b. throughput can use 90%)
• Can do "what if" experiments
• No need to install iperf servers or have accounts
• No need to configure hosts to allow large windows
• BUT:
  – Need to estimate simulator parameters, e.g.:
    • RTT: use ping or synack
    • Bandwidth: use pchar, pipechar etc. (moderately accurate)
• AND it's not the real thing:
  – Need to validate vs. observed data
  – Need to simulate cross-traffic etc.
13
Impact of cross-traffic on Iperf between SLAC & GSFC/Maryland
[Charts: iperf throughput to SLAC and from SLAC, shown against cross-traffic from SCP, HTTP, bbftp, all TCP traffic, and iperf port traffic]
14
Impact on Others
• Make ping measurements with & without iperf loading
  – Loss loaded (unloaded)
  – RTT
15
Impact of applying QoS
• Defined 3 classes of service, application-marked packets:
  – Scavenger service (1%), Best effort, & Priority service (30%)
  – Used DiffServ features in a Cisco 7507 with a DS3 link
• Appears to work as expected
Measurements made by Dave Hartzell of the Great Plains Network, May '01
16
Improvements for major International BaBar sites
• Throughput improvements of 1 to 16 times in a year
• Links are being improved: ESnet, PHYnet, GARR, Janet, TEN-155
• Improvements to come: IN2P3 => 155Mbps, RAL => 622Mbps
17
Gigabit/second networking
• The start of a new era:
  – Very rapid progress towards 10Gbps networking in both the Local Area (LAN) and Wide Area (WAN) networking environments is being made.
  – 40Gbps is in sight on WANs, but what comes after?
  – The success of the LHC Computing Grid critically depends on the availability of Gbps links between CERN and the LHC regional centers.
• What does it mean? (see the worked arithmetic below)
  – In theory:
    • a 1GB file transferred in 11 seconds over a 1Gbps circuit (*)
    • a 1TB file transfer would still require 3 hours
    • a 1PB file transfer would require 4 months
  – In practice:
    • major transmission protocol issues will need to be addressed
(*) according to the 75% empirical rule
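As a cross-check, the quoted times follow directly from the 75% empirical rule; a minimal sketch of the arithmetic (my code, not from the slides):

```python
# Assume a transfer achieves ~75% of the nominal 1 Gbps circuit rate.
NOMINAL_BPS = 1e9
EFFECTIVE_BPS = 0.75 * NOMINAL_BPS

for label, size_bytes in [("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
    seconds = size_bytes * 8 / EFFECTIVE_BPS
    print(f"{label}: {seconds:,.0f} s "
          f"(~{seconds/3600:.1f} h, ~{seconds/86400/30:.1f} months)")

# ~11 s for 1 GB, ~3 hours for 1 TB, ~4 months for 1 PB,
# matching the figures on the slide.
```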
CERN
18
Very high speed file transfer (1)
• High performance switched LAN assumed:
  – requires time & money.
• High performance WAN also assumed:
  – also requires money but is becoming possible.
  – very careful engineering is mandatory.
• Will remain very problematic, especially over high bandwidth*delay paths:
  – Might force the use of Jumbo Frames because of interactions between TCP/IP and link error rates.
• Could possibly conflict with strong security requirements
CERN
19
CERN
Very high speed file transfer (2)
• The following formula was proposed by Matt Mathis/PSC ("The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm") to approximate the maximum TCP throughput under periodic packet loss:
    throughput ~ (MSS/RTT)*(1/sqrt(p))
  where MSS is the maximum segment size (1460 bytes in practice) and "p" is the packet loss rate; see the sketch below.
• Are TCP's "congestion avoidance" algorithms compatible with high speed, long distance networks?
  – The "cut transmit rate in half on a single packet loss, then increase the rate additively (1 MSS per RTT)" algorithm may simply not work.
  – New TCP/IP adaptations may be needed in order to better cope with LFNs (long fat networks), e.g. TCP Vegas
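The Mathis approximation translates directly into a few lines of Python; a minimal sketch, where the 170 ms RTT and 1e-4 loss rate in the example are illustrative values, not measurements from the talk:

```python
from math import sqrt

def mathis_throughput_mbps(mss_bytes, rtt_s, loss_rate):
    """Approximate max TCP throughput in Mbits/s: (MSS/RTT) * 1/sqrt(p)."""
    return (mss_bytes * 8 / rtt_s) * (1 / sqrt(loss_rate)) / 1e6

# Example: MSS 1460 bytes, transatlantic RTT 170 ms, loss rate 1e-4.
print(mathis_throughput_mbps(1460, 0.170, 1e-4))  # ~6.9 Mbits/s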
20
Acceptable link error rates
70 ms Round trip latency (U.S. transcon: LA - Boston)

Throughput (Mbps)  Throughput (bits/s)  Loss rate (MTU 1500)  Loss rate (MTU 9128)
10                 1E+07                4E-04                 2E-02
30                 3E+07                5E-05                 2E-03
100                1E+08                4E-06                 2E-04
300                3E+08                5E-07                 2E-05
1,000              1E+09                4E-08                 2E-06
3,000              3E+09                5E-09                 2E-07
10,000             1E+10                4E-10                 2E-08

170 ms Round trip latency (Transatlantic: LA - CERN)

Throughput (Mbps)  Throughput (bits/s)  Loss rate (MTU 1500)  Loss rate (MTU 9128)
10                 1E+07                7E-05                 3E-03
30                 3E+07                8E-06                 3E-04
100                1E+08                7E-07                 3E-05
300                3E+08                8E-08                 3E-06
1,000              1E+09                7E-09                 3E-07
3,000              3E+09                8E-10                 3E-08
10,000             1E+10                7E-11                 3E-09
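These entries can be reproduced by inverting the Mathis formula from the previous slide. A sketch under two assumptions of mine (not stated in the slides): the table used the constant sqrt(3/2) from the Mathis paper, and MSS = MTU - 40 bytes of IP/TCP headers.

```python
C = (3.0 / 2.0) ** 0.5   # assumed Mathis constant

def acceptable_loss(throughput_bps, rtt_s, mtu_bytes):
    """Loss rate p at which Mathis throughput equals the target rate."""
    mss_bits = (mtu_bytes - 40) * 8          # assumed 40-byte headers
    return (C * mss_bits / (rtt_s * throughput_bps)) ** 2

for rtt in (0.070, 0.170):
    print(f"RTT {rtt*1000:.0f} ms")
    for mbps in (10, 30, 100, 300, 1000, 3000, 10000):
        p1500 = acceptable_loss(mbps * 1e6, rtt, 1500)
        p9128 = acceptable_loss(mbps * 1e6, rtt, 9128)
        print(f"  {mbps:>6} Mbps: {p1500:.0E} (MTU 1500), {p9128:.0E} (MTU 9128)")
```

Under these assumptions the output matches the table, e.g. 4E-04 and 2E-02 for 10 Mbps at 70 ms RTT.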
CERN
21
Very high speed file transfer (tentative conclusions)
• TCP/IP fairness only exists between similar flows, i.e.:
  – similar duration,
  – similar RTTs.
• TCP/IP congestion avoidance algorithms need to be revisited (e.g. Vegas rather than Reno/NewReno)
  – faster recovery after loss, selective acknowledgment.
• Current ways of circumventing the problem, e.g.:
  – multi-stream & parallel sockets
  – just bandages, or the practical solution to the problem?
• Web100, a 3 MUSD NSF project, might help enormously!
  – better TCP/IP instrumentation (MIB) will allow reads/writes of internal TCP parameters
  – self-tuning
  – tools for measuring performance
  – improved FTP implementation
  – applications can tune the stack
• Non-TCP/IP based transport solutions: use of Forward Error Correction (FEC), Explicit Congestion Notification (ECN), rather than active queue management techniques (RED/WRED)?
CERN
22
Optimizing streams
• Choose # of streams to optimize throughput/impact
  – Measure RTT from Web100
  – App controls # of streams (see the sketch below)
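A hypothetical sketch of how an application might pick the stream count, based on the aggregate-window model used earlier: given the path RTT (e.g. obtained via Web100 instrumentation), a target rate, and the per-stream window the OS allows, open roughly target_rate * RTT / window streams. The function and its parameters are illustrative, not from the talk.

```python
import math

def choose_streams(target_bps, rtt_s, window_bytes, max_streams=16):
    """Streams needed so that n * window / RTT ~ target rate."""
    bdp_bytes = target_bps * rtt_s / 8        # bandwidth*delay product
    return min(max_streams, max(1, math.ceil(bdp_bytes / window_bytes)))

# Example: 100 Mbps target, 48 ms RTT, 64 kByte windows -> 10 streams.
print(choose_streams(100e6, 0.048, 64 * 1024))
```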
23
WAN thruput conclusions
• High FTP performance across WAN links is possible
  – Even with a 20-30Mbps bottleneck, one can do > 100 Gbytes/day
• OS must support big windows selectable by the application
• Need multiple parallel streams
• Loss is important, in particular the interval between losses
• Compression looks promising, but needs CPU power
• Can get close to max thruput with small (<=32kByte) windows given sufficient (5-10) streams
• Improvements of 5 to 60 times in thruput by using multiple streams & larger windows
• Impacts other users; need a Less than Best Effort QoS service