MB-NG Review – 24 April 2004
Richard Hughes-Jones, The University of Manchester, UK
High Performance Network Demonstration – 21 April 2004
It works? So what's the problem with TCP?
TCP has two phases: slow start and congestion avoidance (AIMD). The poor performance of TCP in high-bandwidth, long-distance networks is due in part to the TCP congestion-control algorithm and its congestion window, cwnd:
For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (additive increase, a = 1)
For each window experiencing loss: cwnd -> cwnd - b*cwnd (multiplicative decrease, b = 1/2)
Time to recover from a single packet loss at ~100 ms RTT:
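A back-of-the-envelope sketch of why recovery is so slow (my own illustration, not from the slides; a 1 Gbit/s path with 1500-byte packets is assumed):

```python
# Sketch: time for standard TCP (AIMD, a=1, b=1/2) to recover cwnd
# after a single loss on a long fat pipe.
def aimd_recovery_time(link_bps, rtt_s, mss_bytes=1500):
    # cwnd needed to fill the pipe, in packets (bandwidth-delay product)
    cwnd = link_bps * rtt_s / (8 * mss_bytes)
    # after a loss cwnd is halved; it then grows by ~1 packet per RTT,
    # so it takes cwnd/2 RTTs to get back to full speed
    return (cwnd / 2) * rtt_s

t = aimd_recovery_time(1e9, 0.1)   # 1 Gbit/s path, 100 ms RTT
print(f"{t:.0f} s")                # roughly 7 minutes on this path
```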
Investigation of new TCP Stacks
High Speed TCP: a and b vary with the current cwnd, using a table.
a increases more rapidly at larger cwnd, so the flow returns to the 'optimal' cwnd for the network path sooner.
b decreases less aggressively, so cwnd – and hence throughput – drops less on loss.
Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd.
a = 1/100 – the increase is greater than TCP Reno's
b = 1/8 – the decrease on loss is less than TCP Reno's
Scalable over any link speed.
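A minimal sketch of the two sets of update rules as stated above (units are packets; this is my own illustration, not the stacks' actual kernel code):

```python
# Standard Reno: cwnd += 1/cwnd per ACK; cwnd halved on loss.
# Scalable TCP:  cwnd += 0.01 per ACK;   cwnd reduced by 1/8 on loss.
def reno_ack(cwnd):      return cwnd + 1.0 / cwnd
def reno_loss(cwnd):     return cwnd / 2.0
def scalable_ack(cwnd):  return cwnd + 0.01
def scalable_loss(cwnd): return cwnd * (1 - 1.0 / 8)

# One RTT delivers ~cwnd ACKs, so per RTT Reno grows by ~1 packet
# while Scalable grows by ~1% of cwnd - much faster at large cwnd.
cwnd = 8000.0
print(reno_ack(cwnd) - cwnd)      # ~0.000125 per ACK
print(scalable_ack(cwnd) - cwnd)  # ~0.01 per ACK
```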
Fast TCP: uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput.
HSTCP-LP: High Speed TCP (Low Priority) – backs off if the RTT increases.
BiC-TCP: additive increase at large cwnd; binary search at small cwnd.
H-TCP: standard behaviour after congestion, then a switch to high performance.
●●●
Comparison of TCP Stacks: the TCP Response Function
Throughput vs loss rate – a steeper curve means faster recovery. Packets were dropped in the kernel.
MB-NG: rtt 6 ms; DataTAG: rtt 120 ms
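The response function for standard TCP can be sketched with the well-known Mathis et al. approximation (a standard formula, not taken from the slides; a 1500-byte MSS is assumed):

```python
import math

# Mathis et al. approximation for standard TCP Reno throughput as a
# function of the packet loss rate p - the "response function" above.
def reno_throughput_bps(mss_bytes, rtt_s, p):
    return (mss_bytes * 8 / rtt_s) * math.sqrt(3.0 / (2.0 * p))

# The two testbeds' RTTs from the slide, at one loss per 10^6 packets:
for name, rtt in [("MB-NG", 0.006), ("DataTAG", 0.120)]:
    bps = reno_throughput_bps(1500, rtt, 1e-6)
    print(f"{name}: {bps/1e9:.2f} Gbit/s")
```

The 1/RTT factor is why the same loss rate hurts the 120 ms DataTAG path twenty times more than the 6 ms MB-NG path.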
Multi-Gigabit Flows at the SC2003 Bandwidth Challenge
Three server systems with 10 GigEthernet NICs, running the DataTAG altAIMD stack with a 9000-byte MTU, sent memory-to-memory iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
Chicago (Starlight): rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz; 3.1 Gbit/s, hstcp, I = 1.6%
Amsterdam (SARA): rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz; 4.35 Gbit/s, hstcp, I = 6.9%
The new TCP stacks are very stable; both paths used Abilene to Chicago.
10 Gbit/s aggregate throughput from SC2003 to Chicago & Amsterdam.
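The quoted window sizes follow from the bandwidth-delay product, window = rate × rtt. A quick sketch (the target rates of 7.4 and 9.2 Gbit/s are my back-calculation from the quoted windows, not figures from the slide):

```python
# The TCP window needed to sustain a given rate over a given RTT is the
# bandwidth-delay product.
def window_bytes(rate_bps, rtt_s):
    return rate_bps * rtt_s / 8

print(window_bytes(7.4e9, 0.065) / 1e6)   # ~60 MB for the Chicago path
print(window_bytes(9.2e9, 0.175) / 1e6)   # ~200 MB for the Amsterdam path
```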
[Figure: router traffic to Abilene – throughput (Gbit/s) vs date & time, 19 Nov 2003, 15:59–17:25; traces: Phoenix-Chicago and Phoenix-Amsterdam]
Transfer Applications – Throughput [1]
2 GByte file transferred, RAID0 disks, Manchester – UCL:
GridFTP: throughput alternates between 600/800 Mbit/s and zero.
Apache web server + curl-based client: a steady 720 Mbit/s.
Transfer Applications – Throughput [2]
2 GByte file transferred, RAID5 (4 disks), Manchester – RAL:
bbcp: mean 710 Mbit/s
GridFTP: many zeros seen (means ~710 and ~620 Mbit/s)
Topology of the MB-NG Network
[Diagram – key: Gigabit Ethernet; 2.5 Gbit POS access; MPLS admin. domains. The Manchester domain (man01, man02, man03), UCL domain (lon01, lon02, lon03) and RAL domain (ral01, ral02) each sit behind Cisco 7609 edge/boundary routers, interconnected across the UKERNA development network]
High Throughput Demo: Manchester – London
[Diagram: man03 (Manchester) and lon01 (London), each a dual Xeon 2.2 GHz host on 1 GEth, connected via Cisco 7609s and Cisco GSRs across the 2.5 Gbit SDH MB-NG core]
Send data with TCP, drop packets, and monitor TCP with Web100.
Standard to HS-TCP: no loss, but the output queue is filled by the sender.
HS-TCP to Scalable: no loss, but the output queue is filled by the sender.
Standard, HS-TCP, Scalable: drop 1 in 25,000.
Standard Reno TCP: drop 1 in 10^6.
Focus on Helping Real Users: Throughput CERN – SARA
[Figure: Standard TCP, txlen 100, 25 Jan 03 – interface rate (Mbit/s) and receive rate vs time; Out and In Mbit/s traces]
[Figure: HighSpeed TCP, txlen 2000, 26 Jan 03 – interface rate (Mbit/s) and receive rate vs time; Out and In Mbit/s traces]
Using the GÉANT backup link, 1 GByte disk-to-disk transfers (blue: the data; red: the TCP ACKs):
Standard TCP: average throughput 167 Mbit/s – users typically see only 5–50 Mbit/s!
High-Speed TCP: average throughput 345 Mbit/s
Scalable TCP: average throughput 340 Mbit/s
Technology link to EU projects: DataGrid, DataTAG & GÉANT.
[Figure: Scalable TCP, txlen 2000, 27 Jan 03 – interface rate (Mbit/s) and receive rate vs time; Out and In Mbit/s traces]
BaBar Case Study: Host, PCI & RAID Controller Performance
RAID0 (striped) & RAID5 (striped with redundancy). Controllers: 3Ware 7506 (parallel, 66 MHz), 3Ware 7505 (parallel, 33 MHz), 3Ware 8506 (Serial ATA, 66 MHz), ICP (Serial ATA, 33/66 MHz). Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard with Maxtor 160 GB 7200 rpm disks (8 MB cache). Read-ahead kernel tuning: /proc/sys/vm/max-readahead.
Disk – memory read speeds and memory – disk write speeds were measured.
Topology of the MB-NG Network
[Diagram: as on the earlier topology slide, with hardware RAID at the end hosts]
BaBar Data: Throughput on MB-NG Kit
RAID5 (4 disks), RAL – Manchester, including small files (~kBytes):
bbftp, 1 stream with compression
bbftp, 6 streams
bbftp, 1 stream, no compression: 10 × 2 GByte files – each peak is a 20 GByte transfer
bbftp, 1 stream, files ≥ 1 MByte, with bb diag
Helping Real Users: Radio Astronomy VLBI
Proof of concept with the NRNs & GÉANT: 1024 Mbit/s, 24 on 7, NOW.
VLBI Project: Throughput, Jitter & 1-way Delay
1472-byte packets, Manchester -> Dwingeloo (JIVE): FWHM of the jitter 22 µs (back-to-back: 3 µs).
[Figure: jitter histogram, 1472 bytes, w=50, Gnt5-DwMk5, 28 Oct 03]
[Figure: 1-way delay (µs) vs packet number, 1472 bytes, w=12, Gnt5-DwMk5, 21 Oct 03 – note the packet loss (points with zero 1-way delay)]
[Figure: receive wire rate (Mbit/s) vs spacing between frames (µs), Gnt5-DwMk5 11 Nov 03 and DwMk5-Gnt5 13 Nov 03, 1472 bytes]
Case Study: ATLAS at the LHC
Tests stream built events from the Level-3 trigger to a remote compute farm in real time, at 500 Mbit/s to 1 Gbit/s, CERN – Manchester, investigating the use of the new high-performance TCPs.
Testing concepts in the ATLAS offline computing model, which is more mesh than star: CERN Tier 0 to the Tier 1s, and the Tier 2s to all Tier 1s.
Tests are planned over production networks: Lancaster – Manchester over NNW and SuperJANET4, and Lancaster – Manchester to CERN.
Scalable TCP, DataTAG: drop 1 in 10^6.
HS-TCP, DataTAG: drop 1 in 10^6.
Standard Reno TCP, DataTAG: drop 1 in 10^6 – transition from high-speed to standard TCP at 520 s.
Summary
Multi-gigabit transfers are possible and stable, and the new TCP stacks have been demonstrated to help performance.
DataTAG has made major contributions to the understanding of high-speed networking, and there has been significant technology transfer between DataTAG and other projects.
We are now reaching out to real users.
But there is still much research to do: achieving performance (protocol vs implementation issues), stability and sharing issues, and optical transports & hybrid networks.
10 Gigabit: Tuning PCI-X
Intel PRO/10GbE LR adapter: PCI-X bus occupancy vs mmrbc (Max Memory Read Byte Count), for 16080-byte packets sent every 200 µs. Each PCI-X sequence consists of CSR access, data transfer, and interrupt & CSR update. Logic-analyser traces were taken for mmrbc = 512, 1024, 2048 and 4096 bytes; at mmrbc = 4096 bytes the rate reaches 5.7 Gbit/s. The measured times agree with times computed from the PCI-X timing, and the expected throughput is ~7 Gbit/s.
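Why mmrbc matters can be sketched with a simple occupancy model (my own illustration: the 64-bit/133 MHz bus figures are standard PCI-X parameters, but the per-burst overhead is an assumed round number, not a measured value):

```python
import math

# Illustrative model of PCI-X bus occupancy per packet vs mmrbc:
# the NIC reads each packet in bursts of at most mmrbc bytes, and
# every burst pays a fixed arbitration/address overhead.
BUS_BYTES_PER_CYCLE = 8          # 64-bit PCI-X bus
CYCLE_NS = 1e9 / 133e6           # ~7.5 ns per cycle at 133 MHz
BURST_OVERHEAD_CYCLES = 20       # assumed per-burst overhead (illustrative)

def transfer_time_us(packet_bytes, mmrbc):
    bursts = math.ceil(packet_bytes / mmrbc)
    data_cycles = packet_bytes / BUS_BYTES_PER_CYCLE
    total_cycles = data_cycles + bursts * BURST_OVERHEAD_CYCLES
    return total_cycles * CYCLE_NS / 1e3

for mmrbc in (512, 1024, 2048, 4096):
    t = transfer_time_us(16080, mmrbc)
    rate = 16080 * 8 / (t * 1e3)     # Gbit/s while the packet is on the bus
    print(f"mmrbc {mmrbc:4d}: {t:5.1f} us, {rate:.1f} Gbit/s")
```

Larger mmrbc means fewer bursts per packet, so less fixed overhead and shorter bus occupancy, which is the trend the measurements show.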
[Figure: PCI-X transfer time (µs) and transfer rate (Gbit/s) vs Max Memory Read Byte Count – measured transfer time, expected time, rate from expected time, and max PCI-X throughput]
[Figure: the same measurement with kernel 2.6.1 #17 on an HP Itanium with the Intel 10GbE NIC, Feb 04 – measured rate and rate from expected time vs Max Memory Read Byte Count]
DataTAG Testbed
BaBar Case Study: Disk Performance
BaBar disk server: Tyan Tiger S2466N motherboard with one 64-bit 66 MHz PCI bus, Athlon MP2000+ CPU, AMD-760 MPX chipset, and a 3Ware 7500-8 RAID5 controller with 8 × 200 GB Maxtor IDE 7200 rpm disks. Note the VM readahead-max parameter.
Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as RAID0]
RAID Controller Performance
[Figure grid: read and write speeds for RAID0 and RAID5]
BaBar: Serial ATA RAID Controllers – RAID5
Read and write throughput vs file size for readahead-max settings 31, 63, 127, 256, 512 and 1200:
[Figure: read throughput, RAID5, 4 disks, 3Ware 66 MHz PCI SATA – Mbit/s vs file size (MBytes)]
[Figure: write throughput, RAID5, 4 disks, 3Ware 66 MHz PCI SATA – Mbit/s vs file size (MBytes)]
[Figure: read throughput, RAID5, 4 disks, ICP 66 MHz PCI SATA – Mbit/s vs file size (MBytes)]
[Figure: write throughput, RAID5, 4 disks, ICP 66 MHz PCI SATA – Mbit/s vs file size (MBytes)]
VLBI Project: Packet Loss Distribution
Measure the time between lost packets in the time series of packets sent: 1410 packets lost in 0.6 s. Is the loss a Poisson process? Assume the Poisson process is stationary, λ(t) = λ, and use the probability density function P(t) = λ e^(-λt). The measured mean gives λ = 2360/s (a mean inter-loss time of 426 µs). On a log plot the measured slope is -0.0028 where -0.0024 is expected, so an additional process could be involved.
[Figure: packet-loss distribution, 12 µs bins – number in bin vs time between lost frames (µs), measured vs Poisson]
[Figure: the same on a log scale, with fits y = 41.832e^(-0.0028x) (measured) and y = 39.762e^(-0.0024x) (Poisson)]
BaBar Case Study: RAID BW & PCI Activity
The performance of the end host / disks: 3Ware 7500-8 RAID5, parallel EIDE; the 3Ware card forces the PCI bus to 33 MHz. Transfers from the BaBar Tyan to the MB-NG SuperMicro:
Network mem-to-mem: 619 Mbit/s
Disk-to-disk throughput with bbcp: 40–45 MBytes/s (320–360 Mbit/s) – the PCI bus is effectively full! User throughput ~250 Mbit/s.
[Figures: PCI activity while reading from and writing to the RAID5 disks]
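A rough sketch of why the shared bus saturates at these rates (the 64-bit/33 MHz figures follow from the slides; the crossing count and efficiency factor are my own assumptions for illustration only):

```python
# With the 3Ware card forcing the 64-bit PCI bus to 33 MHz, the raw bus
# bandwidth is ~2.1 Gbit/s, and every transferred byte crosses that bus
# at least twice (disk <-> memory, then memory <-> NIC).
raw_bus_gbps = 64 * 33e6 / 1e9     # ~2.1 Gbit/s raw
crossings = 2                       # disk->memory, memory->NIC (assumed)
efficiency = 0.35                   # assumed protocol/arbitration overhead

usable_gbps = raw_bus_gbps / crossings * efficiency
print(f"~{usable_gbps*1000:.0f} Mbit/s usable per stream")
```

With these assumptions the usable per-stream rate comes out in the 300–400 Mbit/s range, consistent with the observed 320–360 Mbit/s and the conclusion that the bus is effectively full.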