TCP/IP and Other Transports for High Bandwidth Applications
TCP/IP on High Performance Networks
Richard Hughes-Jones, University of Manchester
Summer School, Brasov, Romania, July 2005
The Bandwidth Challenge at SC2003
Peak aggregate bandwidth from the 3 booths: 23.21 Gbit/s
1-way link utilisations of >90%
6.6 TBytes transferred in 48 minutes
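As a quick sanity check on the figures above, the sustained average implied by 6.6 TBytes in 48 minutes can be computed directly (a sketch, not from the slides):

```python
# 6.6 TBytes moved in 48 minutes -> average rate in Gbit/s.
avg_gbits = 6.6e12 * 8 / (48 * 60) / 1e9
print(round(avg_gbits, 1))  # 18.3 Gbit/s average, vs the 23.21 Gbit/s peak
```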
Multi-Gigabit flows at SC2003 BW Challenge
Three server systems with 10 Gigabit Ethernet NICs
Used the DataTAG altAIMD stack, 9000 byte MTU
Sent memory-to-memory iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
Palo Alto PAIX: rtt 17 ms, window 30 MB; shared with the Caltech booth
  4.37 Gbit/s HighSpeed TCP, I=5%; then 2.87 Gbit/s, I=16% - fell when 10 Gbit/s was on the link
  3.3 Gbit/s Scalable TCP, I=8%; tested 2 flows, sum 1.9 Gbit/s, I=39%
Chicago Starlight: rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz
  3.1 Gbit/s HighSpeed TCP, I=1.6%
Amsterdam SARA: rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz
  4.35 Gbit/s HighSpeed TCP, I=6.9% - very stable
Both used Abilene to Chicago
[Chart: 10 Gbit/s throughput from SC2003 to PAIX - throughput (Gbit/s, 0-10) vs date & time, 19 Nov 2003 15:59-17:25; traces: Router to LA/PAIX, Phoenix-PAIX HS-TCP, Phoenix-PAIX Scalable-TCP, Phoenix-PAIX Scalable-TCP #2]
[Chart: 10 Gbit/s throughput from SC2003 to Chicago & Amsterdam - throughput (Gbit/s, 0-10) vs date & time, 19 Nov 2003 15:59-17:25; traces: Router traffic to Abilene, Phoenix-Chicago, Phoenix-Amsterdam]
SCINet: Collaboration at SC2004
Setting up the BW Bunker
The BW Challenge at the SLAC Booth
Working with S2io, Sun, Chelsio
The Bandwidth Challenge - SC2004
The peak aggregate bandwidth from the booths was 101.13 Gbit/s
That is 3 full-length DVDs per second!
4 times greater than SC2003!
Saturated TEN 10 Gigabit Ethernet waves
SLAC Booth: Sunnyvale to Pittsburgh, LA to Pittsburgh, and Chicago to Pittsburgh (with UKLight)
Just a Well Engineered End-to-End Connection
End-to-End “no loss” environment
NO contention, NO sharing on the end-to-end path
Processor speed and system bus characteristics
TCP Configuration – window size and frame size (MTU)
Tuned PCI-X bus
Tuned Network Interface Card driver
A single TCP connection on the end-to-end path
Memory-to-Memory transfer
no disk system involved
No real user application (but did file transfers!!)
Not a typical User or Campus situation BUT …
So what’s the matter with TCP – Did we cheat?
[Diagram: end-to-end path - Client and Server connected via Campus and Regional networks and the Internet; with UKLight the Client and Server connect via their Campus networks directly over the UKLight circuit. From Robin Tasker]
TCP (Reno) - What's the problem?
TCP has 2 phases:
Slowstart - probe the network to estimate the available bandwidth; exponential growth
Congestion Avoidance - main data transfer phase; transfer rate grows "slowly"
AIMD and High Bandwidth-Long Distance networks
Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.
For each ACK in a RTT without loss:
  cwnd -> cwnd + a/cwnd  (Additive Increase, a=1)
For each window experiencing loss:
  cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b=1/2)
Packet loss is a killer!!
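The two AIMD rules above can be written down in a few lines; this is a minimal sketch (the function names are illustrative, cwnd is in segments):

```python
def on_ack(cwnd, a=1.0):
    """Additive increase: each ACK adds a/cwnd, so cwnd grows by ~a segment per RTT."""
    return cwnd + a / cwnd

def on_loss(cwnd, b=0.5):
    """Multiplicative decrease: cwnd -> cwnd - b*cwnd, i.e. halved for Reno."""
    return cwnd * (1.0 - b)

# On a long fat pipe cwnd is huge, so one loss throws away half the rate:
cwnd = 10000.0          # segments (~a multi-gigabit transatlantic window)
cwnd = on_loss(cwnd)    # -> 5000.0, then climbs back only ~1 segment per RTT
```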
TCP (Reno) - Details
Time for TCP to recover its throughput from 1 lost packet:

  tau = C * RTT^2 / (2 * MSS)

where C is the link capacity; for an rtt of ~200 ms this is ~2 min.
[Chart: time to recover (s, log scale 0.0001-100000) vs rtt (0-200 ms); curves for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links; markers at UK 6 ms, Europe 20 ms, USA 150 ms]
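The recovery-time formula is easy to evaluate for the paths marked on the chart; a sketch (the function name and the 1460-byte MSS are illustrative choices, not from the slides):

```python
def recovery_time(capacity_bps, rtt_s, mss_bytes=1460):
    """Time for standard TCP to regain full rate after one loss:
    tau = C * RTT^2 / (2 * MSS). Capacity in bit/s, MSS in bytes."""
    return capacity_bps * rtt_s**2 / (2 * mss_bytes * 8)

# A transatlantic path: rtt ~150 ms at 1 Gbit/s with 1500-byte frames:
print(recovery_time(1e9, 0.150))  # ~963 s, i.e. ~16 minutes per loss
```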
Investigation of new TCP Stacks
The AIMD Algorithm - Standard TCP (Reno)
  For each ACK in a RTT without loss: cwnd -> cwnd + a/cwnd  (Additive Increase, a=1)
  For each window experiencing loss: cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b=1/2)
High Speed TCP
  a and b vary depending on the current cwnd, using a table
  a increases more rapidly with larger cwnd - returns to the 'optimal' cwnd size sooner for the network path
  b decreases less aggressively and, as a consequence, so does the cwnd; the effect is that there is not such a decrease in throughput
Scalable TCP
  a and b are fixed adjustments for the increase and decrease of cwnd
  a = 1/100 - the increase is greater than TCP Reno
  b = 1/8 - the decrease on loss is less than TCP Reno
  Scalable over any link speed
Fast TCP
  Uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
Others: HSTCP-LP, H-TCP, BiC-TCP
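The difference between Reno's additive growth and Scalable TCP's fixed multiplicative growth can be seen in a toy RTT count; a sketch under the parameters quoted above (a=1/100 per ACK, i.e. ~1% growth per RTT, b=1/8):

```python
def rtts_to_recover(cwnd_target, stack="reno"):
    """Count RTTs for cwnd to climb back to cwnd_target (segments) after
    one loss. Sketch only: 'reno' adds 1 segment/RTT from cwnd/2;
    'scalable' grows ~1% per RTT from (1 - 1/8) * cwnd."""
    if stack == "reno":
        cwnd, rtts = cwnd_target / 2, 0
        while cwnd < cwnd_target:
            cwnd += 1
            rtts += 1
    else:
        cwnd, rtts = cwnd_target * (1 - 1/8), 0
        while cwnd < cwnd_target:
            cwnd *= 1.01
            rtts += 1
    return rtts

# For a 10000-segment window (a multi-gigabit, long-RTT path):
print(rtts_to_recover(10000, "reno"))      # 5000 RTTs
print(rtts_to_recover(10000, "scalable"))  # 14 RTTs
```

This is why the slides describe Scalable TCP as "scalable over any link speed": its recovery time in RTTs is independent of cwnd, while Reno's grows linearly with it.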
Packet Loss with new TCP Stacks
TCP Response Function
Throughput vs loss rate - the further to the right, the faster the recovery
Packets dropped in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
Packet Loss and new TCP Stacks
TCP Response Function
UKLight London-Chicago-London, rtt 177 ms, 2.6.6 kernel
Agreement with theory is good
[Charts: sculcc1-chi-2 iperf 13Jan05 - TCP achievable throughput (Mbit/s, log and linear scales up to 1000) vs packet drop rate 1 in n (100 to 100000000); series: A0 1500, A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood, A7 Vegas, A0 Theory, Scalable Theory]
Topology of the MB-NG Network
[Diagram: Manchester, UCL and RAL domains with hosts man01-man03, lon01-lon03, ral01-ral02 (HW RAID), edge and boundary Cisco 7609 routers, connected across the UKERNA Development Network; key: Gigabit Ethernet, 2.5 Gbit POS access, MPLS admin domains]
SC2004 UKLIGHT Overview
[Diagram: MB-NG 7600 OSR Manchester - ULCC UKLight - UCL HEP / UCL network; UKLight 10G (four 1GE channels) to Chicago Starlight; Surfnet/EuroLink 10G (two 1GE channels) to Amsterdam; NLR lambda NLR-PITT-STAR-10GE-16 to SC2004: Caltech booth (UltraLight IP, Caltech 7600) and SLAC booth (Cisco 6509)]
High Throughput Demonstrations
[Diagram: Manchester (Geneva) man03 - Cisco 7609 - 2.5 Gbit SDH MB-NG core (Cisco GSRs) - Cisco 7609 - lon01 London (Chicago); 1 GEth at each end; dual Xeon 2.2 GHz hosts]
Send data with TCP; drop packets; monitor TCP with Web100
High Performance TCP - MB-NG
Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
Stacks tested: Standard, HighSpeed, Scalable
High Performance TCP - DataTAG
Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
HighSpeed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
FAST demo via OMNInet and DataTAG
FAST demo: Cheng Jin, David Wei (Caltech); A. Adriaanse, C. Jin, D. Wei (Caltech); S. Ravot (Caltech/CERN); J. Mambretti, F. Yeh (Northwestern)
[Diagram: San Diego FAST display; NU-E (Leverone) workstations - 2 x GE - Nortel Passport 8600s and photonic switches on OMNInet - 10GE - StarLight Chicago (Caltech Cisco 7609) - OC-48 DataTAG - Alcatel 1670s - 2 x GE - workstations at CERN, Geneva (CERN Cisco 7609); 7,000 km; layer 2 and layer 2/3 paths]
FAST TCP vs newReno
Channel #1: newReno - utilization 70%
Channel #2: FAST - utilization 90%
Is TCP fair?
A look at Round Trip Times & Max Transfer Unit
MTU and Fairness
Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms
MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
Link utilization: 70.7%
[Diagram: Host #1 and Host #2 at CERN (GVA), 1 GE each into a GbE switch, POS 2.5 Gbps to the Starlight (Chi) router - the 1 GE bottleneck - then to Host #1 and Host #2]
[Chart: throughput of two streams with different MTU sizes sharing a 1 Gbps bottleneck - throughput (Mbps, 0-1000) vs time (s, 0-6000), with lifetime averages for MTU = 3000 and 9000 bytes]
Sylvain Ravot DataTag 2003
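The unfairness above follows directly from AIMD: each RTT adds one MSS to the window, so the rate climbs by MSS*8/RTT bits per second per RTT. A sketch (the function name is illustrative):

```python
def aimd_rate_gain_per_rtt(mss_bytes, rtt_s):
    """Standard TCP adds one MSS per RTT, so after each RTT the sending
    rate has climbed by mss*8/rtt bit/s - larger MTU, faster ramp."""
    return mss_bytes * 8 / rtt_s

r3k = aimd_rate_gain_per_rtt(3000, 0.117)
r9k = aimd_rate_gain_per_rtt(9000, 0.117)
print(round(r9k / r3k, 2))  # 3.0 - the 9000-byte stream ramps 3x faster
```

After every shared loss the jumbo-frame stream reclaims bandwidth three times as fast, which is why it ends up with roughly twice the average throughput in the measurement above.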
RTT and Fairness
Two TCP streams share a 1 Gb/s bottleneck, MTU = 9000 bytes
CERN <-> Sunnyvale, RTT = 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s
CERN <-> Starlight, RTT = 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s
Link utilization = 71.6%
[Diagram: hosts at CERN (GVA) via a GbE switch, POS 2.5 Gb/s to Starlight (Chi) and POS 10 Gb/s / 10GE to Sunnyvale; both streams share the 1 GE bottleneck]
[Chart: throughput of two streams with different RTT sharing a 1 Gbps bottleneck - throughput (Mbps, 0-1000) vs time (s, 0-7000), with lifetime averages for RTT = 181 ms and 117 ms]
Sylvain Ravot DataTag 2003
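The standard macroscopic TCP model (throughput ~ MSS / (RTT * sqrt(p)), a well-known result not stated on the slides) predicts the short-RTT flow wins by the ratio of the RTTs; the measured gap is larger, since the short-RTT flow also recovers from every shared loss faster:

```python
# Simple-model prediction vs the measured CERN numbers above.
rtt_short, rtt_long = 0.117, 0.181
print(round(rtt_long / rtt_short, 2))  # 1.55 - predicted rate ratio (same p, same MSS)
print(round(514 / 202, 2))             # 2.54 - measured ratio, noticeably worse
```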
Is TCP fair?
Do TCP Flows Share the Bandwidth ?
Test of TCP Sharing: Methodology (1 Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic
Each run was 16 minutes, in 7 regions (2 min and 4 min)
[Diagram: ping at 1/s (ICMP traffic) alongside iperf or UDT flows from SLAC to Caltech/UFL/CERN across the TCP/UDP bottleneck]
Les Cottrell PFLDnet 2005
TCP Reno single stream
Low performance on fast long-distance paths
AIMD (add a=1 packet to cwnd per RTT; decrease cwnd by factor b=0.5 on congestion)
Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput; unequal sharing
SLAC to CERN:
  Congestion has a dramatic effect; recovery is slow
  RTT increases when it achieves best throughput
  Remaining flows do not take up the slack when a flow is removed
Les Cottrell PFLDnet 2005
Fast
As well as packet loss, FAST uses RTT to detect congestion
RTT is very stable: σ(RTT) ~ 9 ms vs 37±0.14 ms for the others
SLAC-CERN:
  Big drops in throughput which take several seconds to recover from
  2nd flow never gets an equal share of the bandwidth
Hamilton TCP
One of the best performers: throughput is high, big effects on RTT when it achieves best throughput, flows share equally
Appears to need >1 flow to achieve best throughput
SLAC-CERN: two flows share equally; >2 flows appears less stable
SC2004 & Transfers with UKLight
A Taster for Lambda & Packet Switched Hybrid Networks
Transatlantic Ethernet: TCP Throughput Tests
Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz
1500 byte MTU, 2.6.6 Linux kernel
Memory-to-memory TCP throughput, standard TCP
Wire rate throughput of 940 Mbit/s
Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing
[Web100 charts: TCP achieved throughput (Mbit/s, 0-2000) and Cwnd vs time (ms) - the full run to 140000 ms and the first 10 s; traces: InstaneousBW, AveBW, CurCwnd]
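The quoted 940 Mbit/s is what "wire rate" works out to for TCP over Gigabit Ethernet with a 1500-byte MTU; a sketch of the per-frame overhead accounting (the function name is illustrative; the 12-byte option assumes TCP timestamps are on, as is usual for these kernels):

```python
def gige_tcp_goodput_mbps(mtu=1500, tcp_opts=12):
    """Best-case TCP goodput on Gigabit Ethernet.
    On-wire bytes per frame: preamble(8) + header(14) + MTU + FCS(4) + IFG(12).
    Payload per frame: MTU - IP(20) - TCP(20) - TCP options."""
    wire = 8 + 14 + mtu + 4 + 12
    payload = mtu - 20 - 20 - tcp_opts
    return 1000.0 * payload / wire

print(round(gige_tcp_goodput_mbps()))  # 941 - matching the ~940 Mbit/s measured
```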
SC2004 Disk-Disk bbftp (work in progress)
bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
Moved a 2 Gbyte file; Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
Disk-TCP-Disk at 1 Gbit/s is here!
[Web100 charts: TCP achieved throughput (Mbit/s, 0-2500) and Cwnd vs time (ms, 0-20000) for the two stacks; traces: InstaneousBW, AveBW, CurCwnd]
Summary, Conclusions & Thanks
The Super Computing Bandwidth Challenge gives the opportunity to make world-wide high performance tests
The Land Speed Record shows what can be achieved with state-of-the-art kit
Standard TCP is not optimum for high throughput long distance links - packet loss is a killer for TCP
Check on campus links & equipment, and access links to backbones; users need to collaborate with the campus network teams; Dante PERT
New stacks are stable and give better response & performance
  Still need to set the TCP buffer sizes!
  Check other kernel settings, e.g. window-scale maximum
  Watch for "TCP stack implementation enhancements"
The host is critical: think server quality, not supermarket PC
  Motherboards, NICs, RAID controllers and disks matter
  The NIC should use 64 bit 133 MHz PCI-X; 66 MHz PCI can be OK, but 32 bit 33 MHz is too slow for Gigabit rates
  Worry about the CPU-memory bandwidth as well as the PCI bandwidth - data crosses the memory bus at least 3 times
  Separate the data transfers - use motherboards with multiple 64 bit PCI-X buses
  Choose a modern high throughput RAID controller; consider SW RAID0 of RAID5 HW controllers
Users are now able to perform sustained 1 Gbit/s transfers
More Information - Some URLs
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004: http://www.hep.man.ac.uk/~rich/
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Any Questions?
Backup Slides
10 Gigabit Ethernet: UDP Throughput Tests
1500 byte MTU gives ~2 Gbit/s; used 16144 byte MTU, max user length 16080
DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes - wire rate throughput of 2.9 Gbit/s
CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes - wire rate of 5.7 Gbit/s
SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes - wire rate of 5.4 Gbit/s
[Chart: an-al 10GE Xsum 512kbuf MTU16114 27Oct03 - received wire rate (Mbit/s, 0-6000) vs spacing between frames (0-40 us), for packet sizes from 1472 to 16080 bytes]
10 Gigabit Ethernet: Tuning PCI-X
16080 byte packets every 200 µs; Intel PRO/10GbE LR adapter
PCI-X bus occupancy vs mmrbc (Max Memory Read Byte Count)
Measured times, and times based on PCI-X timings from the logic analyser
Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
[Charts: PCI-X transfer time (us) and transfer rate (Gbit/s) vs Max Memory Read Byte Count (0-5000); kernel 2.6.1#17, HP Itanium, Intel 10GE, Feb04; curves for mmrbc 512, 1024, 2048 and 4096 bytes (5.7 Gbit/s at 4096); measured vs expected time and max PCI-X throughput]
[Logic analyser trace: CSR access, PCI-X sequence, data transfer, interrupt & CSR update]
10 Gigabit Ethernet: SC2004 TCP Tests
Sun AMD Opteron v20z compute servers, Chelsio TOE; tests between Linux 2.6.6 hosts
10 Gbit Ethernet link from SC2004 to CENIC/NLR/Level(3) PoP in Sunnyvale
  Two 2.4 GHz AMD 64 bit Opteron processors with 4 GB of RAM at SC2004
  1500B MTU, all Linux 2.6.6: 9.43 Gbit/s in one direction (9.07 Gbit/s goodput) and 5.65 Gbit/s in the reverse direction (5.44 Gbit/s goodput) - a total of 15+ Gbit/s on the wire
10 Gbit Ethernet link from SC2004 to ESnet/QWest PoP in Sunnyvale
  One 2.4 GHz AMD 64 bit Opteron at each end; 2 MByte window, 16 streams, 1500B MTU, all Linux 2.6.6
  7.72 Gbit/s in one direction (7.42 Gbit/s goodput) over 120 mins (6.6 Tbits shipped)
S2io NICs with Solaris 10 in a 4 x 2.2 GHz Opteron v40z to one or more S2io or Chelsio NICs with Linux 2.6.5 or 2.6.6 in 2 x 2.4 GHz v20zs
  LAN 1: S2io NIC back to back: 7.46 Gbit/s
  LAN 2: S2io in v40z to 2 v20zs: each NIC ~6 Gbit/s, total 12.08 Gbit/s
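The gap between the wire rate and the goodput quoted above is just Ethernet/IP/TCP framing overhead; the numbers are consistent if "on the wire" is assumed to count the Ethernet frame (header + FCS) but not the preamble and inter-frame gap (a sketch; the function name and that assumption are mine):

```python
def goodput_gbps(wire_gbps, mtu=1500):
    """User-data fraction of a full frame: (MTU - IP 20 - TCP 20) bytes of
    payload per (MTU + 14 Ethernet header + 4 FCS) bytes counted on the wire."""
    return wire_gbps * (mtu - 40) / (mtu + 18)

print(round(goodput_gbps(9.43), 2))  # 9.07 - matching the quoted goodput
```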