Slide: 1
Using TCP/IP on High Bandwidth Long Distance Optical Networks
Real Applications on Real Networks
Richard Hughes-Jones University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Rank”
Slide: 2
Bandwidth Challenge at SC2004
SCINet
Setting up the BW Bunker
The BW Challenge at the SLAC Booth
Working with S2io, Sun, Chelsio
Slide: 3
The Bandwidth Challenge – SC2004
The peak aggregate bandwidth from the booths was 101.13 Gbit/s – that is 3 full-length DVDs per second!
Four times greater than SC2003 (with its 4.4 Gbit transatlantic flows).
Saturated TEN 10 Gigabit Ethernet waves.
SLAC Booth: Sunnyvale to Pittsburgh, LA to Pittsburgh, and Chicago to Pittsburgh (with UKLight).
Slide: 4
TCP has been around for ages
and it just works fine
So
What’s the Problem?
The users complain about the Network!
Slide: 5
TCP – provides reliability
Positive acknowledgement (ACK) of each received segment.
Sender keeps a record of each segment sent.
Sender awaits an ACK – “I am ready to receive byte 2048 and beyond”.
Sender starts a timer when it sends a segment – so it can re-transmit.
[Diagram: sender–receiver timeline – Segment n (Sequence 1024, Length 1024) is answered by an ACK (Ack 2048) one RTT later; Segment n+1 (Sequence 2048, Length 1024) is answered by an ACK (Ack 3072) one RTT later.]
Inefficient – sender has to wait
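To put a number on that inefficiency (my own back-of-envelope sketch, not from the slides): with only one unacknowledged segment allowed in flight, the achievable rate is one segment per RTT.

```python
# Rate achievable with one unacknowledged segment per round trip:
# throughput = segment size / RTT.
mss_bytes = 1460  # typical Ethernet TCP payload (an assumed value)

for rtt_ms in (6, 25, 150):  # the UK, Europe and USA RTTs quoted later
    rate_bit_s = mss_bytes * 8 / (rtt_ms / 1000.0)
    print(f"RTT {rtt_ms:>3} ms -> {rate_bit_s / 1e6:6.3f} Mbit/s")
```

Even on a gigabit path this caps the flow at a couple of Mbit/s, which is why TCP keeps a window of segments in flight, as the next slide shows.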
Slide: 6
Flow Control: Sender – Congestion Window
Uses the congestion window, cwnd, a sliding window to control the data flow.
Byte count giving the highest byte that can be sent without an ACK.
Transmit buffer size and advertised receive buffer size are important.
An ACK gives the next sequence number to receive AND the available space in the receive buffer.
A timer is kept for each packet.
[Diagram: TCP sliding window – regions left to right: data sent and ACKed; sent data buffered awaiting ACK; unsent data that may be transmitted immediately; data waiting for the window to open (the application writes here). A received ACK advances the trailing edge, the receiver's advertised window advances the leading edge, and the sending host advances the marker as data is transmitted – the cwnd slides.]
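A minimal sketch of this bookkeeping (hypothetical names, not the kernel's data structures): the trailing edge advances on ACKs, the sender's marker advances as data goes out, and new data may be sent only while it fits inside cwnd.

```python
class SlidingWindow:
    """Toy model of the TCP sender's sliding window."""

    def __init__(self, cwnd_bytes):
        self.snd_una = 0        # trailing edge: oldest unacknowledged byte
        self.snd_nxt = 0        # marker: next byte to be transmitted
        self.cwnd = cwnd_bytes  # congestion window, in bytes

    def may_send(self, nbytes):
        # unsent data may be transmitted immediately while it fits in cwnd
        return (self.snd_nxt - self.snd_una) + nbytes <= self.cwnd

    def on_send(self, nbytes):
        self.snd_nxt += nbytes  # sending host advances the marker

    def on_ack(self, ack_no):
        self.snd_una = max(self.snd_una, ack_no)  # ACK advances trailing edge
```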
Slide: 7
How it works: TCP Slowstart
Probe the network – get a rough estimate of the optimal congestion window size.
The larger the window size, the higher the throughput: Throughput = Window size / Round-trip time.
Exponentially increase the congestion window size until a packet is lost:
cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received.
Send the 1st packet, get 1 ACK – increase cwnd to 2. Send 2 packets, get 2 ACKs – increase cwnd to 4.
Time to reach cwnd size W: T = RTT × log2(W).
Rate doubles each RTT
[Diagram: cwnd vs time – slow start (exponential increase) followed by congestion avoidance (linear increase); on packet loss, retransmit; after a timeout, slow start again.]
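As a worked example of that estimate (a sketch; W is the window in segments, and the 1 Gbit/s, 150 ms path is an assumed figure):

```python
import math

def slow_start_time(w_segments, rtt_s):
    # cwnd doubles every RTT during slow start, so reaching a window of
    # W segments takes about log2(W) round trips.
    return rtt_s * math.log2(w_segments)

# Filling a 1 Gbit/s path with a 150 ms RTT using 1500-byte packets:
w = (1e9 * 0.150) / (1500 * 8)               # ~12500 segments in flight
print(f"{slow_start_time(w, 0.150):.1f} s")  # ~2 s of slow start
```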
Slide: 8
How it works: TCP AIMD Congestion Avoidance
Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth.
cwnd is increased by 1/cwnd for each ACK – a linear increase in rate: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1).
TCP takes packet loss as an indication of congestion!
Multiplicative decrease: cut the congestion window size aggressively if a packet is lost.
Standard TCP reduces cwnd by half: cwnd -> cwnd – b·cwnd (Multiplicative Decrease, b = ½).
The slow start to congestion avoidance transition is determined by ssthresh.
Packet loss is a killer.
[Diagram: cwnd vs time – slow start (exponential increase) followed by congestion avoidance (linear increase); on packet loss, retransmit; after a timeout, slow start again.]
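To make the sawtooth concrete, here is a toy trace of the AIMD rules above (an illustration, not a simulator; cwnd is counted in segments and updated once per RTT):

```python
def aimd_trace(n_rtts, losses, a=1.0, b=0.5):
    """cwnd per RTT for standard TCP AIMD (a=1, b=1/2)."""
    cwnd, trace = 1.0, []
    for t in range(n_rtts):
        if t in losses:
            cwnd -= b * cwnd  # multiplicative decrease on loss
        else:
            cwnd += a         # +a/cwnd per ACK adds up to +a per RTT
        trace.append(round(cwnd, 1))
    return trace

print(aimd_trace(12, losses={6}))  # linear climb, halved at t=6, climb again
```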
Slide: 9
TCP (Reno) – Details of the problem
The time for TCP to recover its throughput after a single lost 1500-byte packet is given by:
τ = C · RTT² / (2 · MSS), where C is the link capacity.
For an rtt of ~200 ms: ~2 min or more, depending on the line rate (see plot).
[Plot: time to recover (s, log scale) vs rtt (0–200 ms), with curves for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links.]
UK 6 ms → 1.6 s; Europe 25 ms → 26 s; USA 150 ms → 28 min.
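The formula is easy to check numerically (a sketch; assuming a 1 Gbit/s line rate, which reproduces the UK and Europe rows above):

```python
def recovery_time_s(capacity_bit_s, rtt_s, mss_bytes=1500):
    # tau = C * RTT^2 / (2 * MSS): after halving, cwnd regains one MSS
    # per RTT, so winning back C*RTT/2 bits of window takes this long.
    return capacity_bit_s * rtt_s ** 2 / (2 * mss_bytes * 8)

for name, rtt in (("UK 6 ms", 0.006), ("Europe 25 ms", 0.025)):
    print(f"{name}: {recovery_time_s(1e9, rtt):.1f} s")  # ~1.5 s and ~26 s
```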
Slide: 10
TCP: Simple Tuning – Filling the Pipe
Remember, TCP has to hold a copy of the data in flight.
The optimal (TCP buffer) window size depends on:
the end-to-end bandwidth, i.e. min(BW of the links), AKA the bottleneck bandwidth;
the Round Trip Time (RTT).
The number of bytes in flight to fill the entire path is the Bandwidth × Delay Product: BDP = RTT × BW.
Setting this correctly can increase throughput by orders of magnitude.
Windows are also used for flow control.
[Diagram: sender–receiver timeline – segment time on wire = bits in segment / BW; the ACK returns one RTT later.]
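A small sketch of the tuning rule (the 1 Gbit/s, 150 ms path is an assumed example; requesting buffers with setsockopt is the standard socket API):

```python
import socket

def bdp_bytes(bottleneck_bit_s, rtt_s):
    # bytes that must be in flight to fill the path: BDP = RTT * BW
    return int(bottleneck_bit_s * rtt_s / 8)

size = bdp_bytes(1e9, 0.150)  # 1 Gbit/s, 150 ms -> ~18.75 Mbyte

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)  # copy of data in flight
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)  # advertised window
```

Note that the kernel clamps these requests to its configured maxima (e.g. net.core.wmem_max / rmem_max on Linux), so those limits usually have to be raised as well.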
Slide: 11
Investigation of new TCP Stacks
The AIMD algorithm – Standard TCP (Reno):
For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1).
For each window experiencing loss: cwnd -> cwnd – b·cwnd (Multiplicative Decrease, b = ½).
High Speed TCP: a and b vary with the current cwnd, using a table. a increases more rapidly at larger cwnd – the stack returns sooner to the 'optimal' cwnd size for the network path. b decreases less aggressively, and so, as a consequence, does cwnd – the effect is that the drop in throughput is not as large.
Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd. a = 1/100 – the increase is greater than for TCP Reno. b = 1/8 – the decrease on loss is less than for TCP Reno. Scalable over any link speed.
Fast TCP: uses round trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput.
Others: HSTCP-LP, Hamilton-TCP, BiC-TCP. (A schematic sketch of the Reno and Scalable update rules follows below.)
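A schematic comparison of the Reno and Scalable update rules just listed (a sketch with cwnd in segments; High Speed TCP's lookup table and Fast TCP's delay term are omitted):

```python
# Per-ACK growth and per-loss cut for two of the stacks above.
def reno_on_ack(cwnd):      return cwnd + 1.0 / cwnd   # a = 1: +1 MSS per RTT
def reno_on_loss(cwnd):     return cwnd * (1 - 0.5)    # b = 1/2: halve

def scalable_on_ack(cwnd):  return cwnd + 0.01         # a = 1/100 per ACK
def scalable_on_loss(cwnd): return cwnd * (1 - 0.125)  # b = 1/8: gentle cut

# At large cwnd, Scalable's fixed per-ACK step dwarfs Reno's 1/cwnd,
# and its 12.5% loss cut is far milder than Reno's halving.
print(reno_on_ack(10000), scalable_on_ack(10000))
```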
Slide: 12
Let's check out this theory about new TCP stacks.
Does it matter? Does it work?
Slide: 13
Packet Loss with new TCP Stacks – TCP Response Function
Throughput vs loss rate – the further to the right a curve lies, the faster the recovery.
Packets were dropped in the kernel.
MB-NG rtt 6 ms; DataTAG rtt 120 ms.
Slide: 14
High Throughput Demonstration
[Diagram: test path – man03 in Manchester (Geneva) to lon01 in London (Chicago), 1 GEth into Cisco 7609s at each end, across Cisco GSRs on the 2.5 Gbit SDH MB-NG core; dual 2.2 GHz Xeon hosts at both ends.]
Send data with TCP; drop packets; monitor TCP with Web100.
Slide: 15
High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network: rtt 128 ms, dropping 1 packet in 10⁶.
High-Speed: rapid recovery.
Scalable: very fast recovery.
Standard: recovery would take ~20 mins.
Slide: 16
Throughput for real users
Transfers in the UK for BaBar using MB-NG and SuperJANET4
Slide: 17
Topology of the MB-NG Network
[Diagram: three MPLS admin domains – the Manchester domain (man01, man02, man03), the UCL domain (lon01, lon02, lon03) and the RAL domain (ral01, ral02), each behind Cisco 7609 edge/boundary routers, interconnected across the UKERNA Development Network; HW RAID at the end hosts. Key: Gigabit Ethernet, 2.5 Gbit POS access, MPLS, admin domains.]
Slide: 18
Topology of the Production Network
[Diagram: Manchester domain (man01) and RAL domain (ral01), each with HW RAID, linked via 3 routers and 2 switches. Key: Gigabit Ethernet, 2.5 Gbit POS access, 10 Gbit POS.]
Slide: 19
Slide: 27
Network & Disk Interactions (work in progress)
Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus, configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size.
Measure memory to RAID0 transfer rates with & without UDP traffic.
[Plots: throughput (0–2000 Mbit/s) vs trial number (0–100) for 'RAID0 6 disks 1 Gbyte write 64k 3w8506-8', alone, with 1500 MTU UDP, and with 9000 MTU UDP.]
Disk write: 1735 Mbit/s.
Disk write + 1500 MTU UDP: 1218 Mbit/s – a drop of 30%.
Disk write + 9000 MTU UDP: 1400 Mbit/s – a drop of 19%.
[Scatter plots: throughput vs % CPU time in system mode (L1+2) for 8k and 64k writes, with linear fits y = −1.017x + 178.32 and y = −1.0479x + 174.44.]
Slide: 39
UDP Performance: 3 Flows on GÉANT
Packet loss & re-ordering:
Jodrell: 2.0 GHz Xeon – loss 0–12%, reordering significant.
Medicina: 800 MHz PIII – loss ~6%, reordering insignificant.
Torun: 2.4 GHz Xeon – loss 6–12%, reordering insignificant.
[Plots: packets re-ordered and packets lost vs time (10 s bins) for Torun 14Jun04, jbgig1–jivegig1 14Jun05 and Medicina 14Jun05.]
Slide: 40
18 Hour Flows on UKLight: Jodrell – JIVE, 26 June 2005
Throughput: Jodrell to JIVE, 2.4 GHz dual Xeon to 2.4 GHz dual Xeon, 960–980 Mbit/s; traffic passes through SURFnet.
Packet loss: only 3 groups, of 10–150 lost packets each; no packets lost the rest of the time.
Packet re-ordering: none.
[Plots: man03–jivegig1 26Jun05 – received wire rate (Mbit/s) vs time in 10 s steps over the full run; a zoom on 900–1000 Mbit/s around steps 5000–5200; and packet loss per interval (log scale) vs time.]
Slide: 41
Summary & Conclusions
The end hosts themselves: the performance of motherboards, NICs, RAID controllers and disks matters. Plenty of CPU power is required to sustain Gigabit transfers, for the TCP/IP stack as well as the application. Packets can be lost in the IP stack through lack of processing power.
New TCP stacks are stable and give better response & performance. You still need to set the TCP buffer sizes! Check other kernel settings, e.g. window-scale. Take care over the difference between the protocol and the implementation.
Packet loss is a killer: check campus links & equipment, and access links to backbones.
Application architecture & implementation are also important.
The interaction between hardware, protocol processing and the disk sub-system is complex.
The work is applicable to other areas.
Slide: 42
More Information – Some URLs
Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
UKLight web site: http://www.uklight.ac.uk
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
“Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special Issue 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
Slide: 43
Any Questions?
Slide: 44
Backup Slides
Slide: 45
Multi-Gigabit flows at the SC2003 BW Challenge
Three server systems with 10 Gigabit Ethernet NICs. Used the DataTAG altAIMD stack with a 9000 byte MTU. Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
Palo Alto PAIX – rtt 17 ms, window 30 MB. Shared with the Caltech booth: 4.37 Gbit HighSpeed TCP, I=5%; then 2.87 Gbit, I=16% – falling when 10 Gbit was on the link. 3.3 Gbit Scalable TCP, I=8%; tested 2 flows, sum 1.9 Gbit, I=39%.
Chicago Starlight – rtt 65 ms, window 60 MB. Phoenix CPU 2.2 GHz: 3.1 Gbit HighSpeed TCP, I=1.6%.
Amsterdam SARA – rtt 175 ms, window 200 MB. Phoenix CPU 2.2 GHz: 4.35 Gbit HighSpeed TCP, I=6.9%. Very stable. Both used Abilene to Chicago.
[Plot: 10 Gbit/s throughput from SC2003 to PAIX – throughput (0–10 Gbit/s) vs date & time (19 Nov 2003, 15:59–17:25): router traffic to LA/PAIX, Phoenix–PAIX HS-TCP, Phoenix–PAIX Scalable-TCP, Phoenix–PAIX Scalable-TCP #2.]
[Plot: 10 Gbit/s throughput from SC2003 to Chicago & Amsterdam – throughput (0–10 Gbit/s) vs date & time: router traffic to Abilene, Phoenix–Chicago, Phoenix–Amsterdam.]
Slide: 46
UDP/IP packets sent between back-to-back systems are processed in a similar manner to TCP/IP, but are not subject to the flow control & congestion avoidance algorithms. Used the UDPmon test program.
Latency: round trip times measured using Request-Response UDP frames, as a function of frame size.
Slide: 54
TCP Fast Retransmit & Recovery
Duplicate ACKs are due to lost segments or segments arriving out of order.
Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the transmitting host re-sends the missing segment.
Fast Recovery: set ssthresh to 0.5*cwnd – so enter the congestion avoidance phase. Set cwnd = 0.5*cwnd + 3 – the 3 dup ACKs. Increase cwnd by 1 segment for each further duplicate ACK, and keep sending new data if allowed by cwnd. On the next new ACK, set cwnd to half its original value – no need to go into “slow start” again.
At steady state, cwnd oscillates around the optimal window size. With a retransmission timeout, slow start is triggered again.
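A schematic of those state changes (an illustrative Reno-style sketch; retransmit_segment is a hypothetical callback, and cwnd/ssthresh are in segments):

```python
def on_dup_ack(state, retransmit_segment):
    state["dupacks"] += 1
    if state["dupacks"] == 3:                  # Fast Retransmit triggers
        state["ssthresh"] = 0.5 * state["cwnd"]
        state["cwnd"] = state["ssthresh"] + 3  # inflate by the 3 dup ACKs
        retransmit_segment()                   # re-send the missing segment
    elif state["dupacks"] > 3:
        state["cwnd"] += 1                     # +1 segment per extra dup ACK

def on_new_ack(state):
    state["cwnd"] = state["ssthresh"]  # deflate to half the original value
    state["dupacks"] = 0               # continue in congestion avoidance
```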
[Diagram: cwnd vs time – slow start (exponential increase) followed by congestion avoidance (linear increase); on packet loss, retransmit; after a timeout, slow start again.]
Slide: 55
Packet Loss and new TCP Stacks – TCP Response Function
UKLight London–Chicago–London, rtt 177 ms, 2.6.6 kernel.
Agreement with theory is good.
Some of the new stacks are good at high loss rates.
[Plots: sculcc1-chi-2 iperf 13Jan05 – TCP achievable throughput (Mbit/s) vs packet drop rate (1 in n, n from 100 to 10⁸), on log and linear throughput scales, for A0 1500, A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood and A7 Vegas, with A0 Theory and Scalable Theory curves.]