Investigating the interaction between high-performance network and disk sub-systems
Richard Hughes-Jones, Stephen Dallison, The University of Manchester
PFLDnet2005, Lyon, February 2005
Transcript
Slide 1: Investigating the interaction between high-performance network and disk sub-systems
Richard Hughes-Jones, Stephen Dallison, The University of Manchester
MB-NG

Slide 2: Introduction
- AIMD and high-bandwidth, long-distance networks: the assumption that packet loss means congestion is well known
- Focus:
  - data-moving applications with different TCP stacks and network environments
  - the interaction between network hardware, protocol stack and disk sub-system
  - almost a user view
- We studied:
  - different TCP stacks: standard, HSTCP, Scalable, H-TCP, BIC, Westwood
  - several applications: bbftp, bbcp, Apache, GridFTP
  - 3 networks: MB-NG, SuperJANET4, UKLight
  - RAID0 and RAID5 controllers

Slide 3: Topology of the MB-NG Network
[Diagram: Manchester, UCL and RAL domains, each with Cisco 7609 edge/boundary routers, joined by MPLS admin domains across the UKERNA development network. Hosts man01-man03, lon01-lon03, ral01-ral02, with HW RAID at the end hosts. Key: Gigabit Ethernet; 2.5 Gbit POS access.]

Slide 4: Topology of the Production Network
[Diagram: Manchester domain (man01, HW RAID) to RAL domain (ral01, HW RAID) across the production network, with 3 routers and 2 switches in the path. Key: Gigabit Ethernet; 2.5 Gbit POS access; 10 Gbit POS.]

Slide 5: SC2004 UKLight Overview
[Diagram: MB-NG 7600 OSR in Manchester to ULCC UKLight, with UCL HEP and the UCL network attached; UKLight 10G (four 1GE channels) to Chicago Starlight; SurfNet / EuroLink 10G (two 1GE channels) to Amsterdam; NLR lambda NLR-PITT-STAR-10GE-16 to the SC2004 show floor, where the Caltech booth (UltraLight IP, Caltech 7600) and the SLAC booth sit behind a Cisco 6509.]

Slide 6: Packet Loss with new TCP Stacks: TCP Response Function
- Throughput vs loss rate; the further a curve lies to the right, the faster the recovery
- Packets dropped in the kernel
- MB-NG rtt 6 ms; DataTAG rtt 120 ms
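For orientation, the theory curves plotted against these measurements have the following shape (my gloss; the slides show the curves but not the formulas). Standard TCP follows the Mathis response function, while Scalable TCP recovers in a fixed number of RTTs independent of window size, so its throughput falls off as 1/p rather than 1/sqrt(p):

    T_standard ~ (MSS / RTT) * sqrt(3/2) / sqrt(p)
    T_scalable ~ (MSS / RTT) * const / p

where p = 1/n for a drop rate of "1 in n". This is why the stacks appear as straight lines of different slope on the log-log plots of the next slide.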

Slide 7: Packet Loss and new TCP Stacks: TCP Response Function
- UKLight London-Chicago-London, rtt 177 ms, 2.6.6 kernel
- Agreement with theory is good
[Plots: sculcc1-chi-2 iperf, 13 Jan 05. TCP achievable throughput (Mbit/s), on log and linear scales, vs packet drop rate (1 in n, n = 100 to 10^8) for A0 standard 1500, A1 HSTCP, A2 Scalable, A3 H-TCP, A5 BIC-TCP, A8 Westwood and A7 Vegas, with the standard (A0) and Scalable theory curves.]

Slide 8: iperf Throughput + Web100
- SuperMicro on MB-NG network: HighSpeed TCP at line speed, 940 Mbit/s; DupACKs < 10 (expect ~400)
- BaBar on production network: standard TCP, 425 Mbit/s; DupACKs 350-400, i.e. re-transmits

Slide 9: End Systems: NICs & Disks

Slide 10: End Hosts & NICs: SuperMicro P4DP6
- Use UDP packets to characterise the host & NIC: latency, throughput, bus activity
- SuperMicro P4DP6 motherboard: dual Xeon 2.2 GHz CPUs, 400 MHz system bus, 66 MHz 64-bit PCI bus
[Plots: gig6-7 Intel PCI 66 MHz, 27 Nov 02. Receive wire rate (Mbit/s) vs transmit time per frame (us) for 50-1472 byte packets; latency histograms N(t) for 64, 512, 1024 and 1400 byte packets (170-230 us); latency (us) vs message length (bytes) with linear fits y = 0.0093x + 194.67 and y = 0.0149x + 201.75; send and receive PCI traces for 1400 bytes to the NIC and to memory, showing PCI Stop asserted.]
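A quick reading of those latency fits (my arithmetic, not on the slide): the slope is the per-byte handling cost, so its inverse bounds the rate that side of the path can sustain, while the intercept (~195 us) is the fixed per-message cost of host, NIC and wire. For the 0.0093 us/byte fit:

    rate ~ 8 bits/byte / 0.0093 us/byte ~ 860 Mbit/s

which is consistent with the ~940 Mbit/s wire rates seen elsewhere in the talk.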

Slide 11: RAID Controller Performance
- RAID5 (striped with redundancy) controllers tested: 3Ware 7506, parallel, 66 MHz; 3Ware 7505, parallel, 33 MHz; 3Ware 8506, Serial ATA, 66 MHz; ICP, Serial ATA, 33/66 MHz
- Tested on a dual 2.2 GHz Xeon SuperMicro P4DP8-G2 motherboard; disks: Maxtor 160 GB, 7200 rpm, 8 MB cache
- Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512 (see the sketch below)
- Rates for the same PC with RAID0 (striped): read 1040 Mbit/s, write 800 Mbit/s
[Plots: disk-to-memory read speeds; memory-to-disk write speeds.]
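The read-ahead setting quoted here is the 2.4-series /proc interface; a minimal sketch of applying it, using only the value from the slide:

    # 2.4-series kernels: VM read-ahead, in pages
    echo 512 > /proc/sys/vm/max-readahead
    cat /proc/sys/vm/max-readahead    # verify the new value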

Slide 12: SC2004 RAID Controller Performance
- SuperMicro X5DPE-G2 motherboards loaned from Boston Ltd.
- Dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
- 3Ware 8506-8 controller on a 133 MHz PCI-X bus, configured as RAID0 with a 64 kbyte stripe size; six 74.3 GByte Western Digital Raptor WD740 SATA disks
- 75 Mbyte/s disk-buffer, 150 Mbyte/s buffer-memory
- Scientific Linux with 2.6.6 kernel + altAIMD patch (Yee) + packet-loss patch
- Read-ahead kernel tuning: /sbin/blockdev --setra 16384 /dev/sda (see the sketch below)
- RAID0 (striped), 2 GByte file: read 1500 Mbit/s, write 1725 Mbit/s
[Plots: RAID0 6-disk read and write, 3w8506-8. Disk-to-memory read and memory-to-disk write throughput (Mbit/s) vs buffer size (kbytes) for 0.5, 1 and 2 GByte files.]
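The 2.6-series counterpart of the read-ahead tuning, exactly as quoted on the slide (units are 512-byte sectors):

    # per-device read-ahead for the RAID0 block device
    /sbin/blockdev --setra 16384 /dev/sda
    /sbin/blockdev --getra /dev/sda    # verify: prints 16384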

Slide 13: Data Transfer Applications

Slide 14: bbftp: Host & Network Effects
- 2 GByte file; RAID5 disks: 1200 Mbit/s read, 600 Mbit/s write
- Scalable TCP
- BaBar + SuperJANET: instantaneous 220-625 Mbit/s
- SuperMicro + SuperJANET: instantaneous 400-665 Mbit/s for 6 s, then 0-480 Mbit/s
- SuperMicro + MB-NG: instantaneous 880-950 Mbit/s for 1.3 s, then 215-625 Mbit/s

Slide 15: bbftp: What Else Is Going On?
- Scalable TCP; BaBar + SuperJANET; SuperMicro + SuperJANET
- Congestion window and dupACK traces
- Is the variation not TCP related? Candidates: disk speed / bus transfer, or the application itself

Slide 16: Applications: Throughput Mbit/s
- HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
- bbcp, bbftp, Apache, GridFTP
- Previous work used RAID0 (not disk limited)

Slide 17: Average Transfer Rates Mbit/s

  App      TCP stack   SuperMicro on  SuperMicro on  BaBar on      SC2004 on
                       MB-NG          SuperJANET4    SuperJANET4   UKLight
  iperf    Standard    940            350-370        425           940
           HighSpeed   940            510            570           940
           Scalable    940            580-650        605           940
  bbcp     Standard    434            290-310        290
           HighSpeed   435            385            360
           Scalable    432            400-430        380
  bbftp    Standard    400-410        325            320           825
           HighSpeed   370-390        380
           Scalable    430            345-532        380           875
  apache   Standard    425            260            300-360
           HighSpeed   430            370            315
           Scalable    428            400            317
  GridFTP  Standard    405            240
           HighSpeed                  320
           Scalable                   335

- New stacks give more throughput
- Rate decreases

Slide 18: SC2004 & Transfers with UKLight

Slide 19: SC2004 Disk-Disk bbftp (work in progress)
- bbftp file transfer program, uses TCP/IP
- UKLight path: London-Chicago-London; PCs: SuperMicro + 3Ware RAID0
- MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off (see the sysctl sketch below)
- Move a 2 GByte file; Web100 plots
- Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
- Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
- Disk-TCP-disk at 1 Gbit/s
[Plots: Web100 traces over 0-20 s for the two stacks: TCP achievable throughput (Mbit/s) as InstantaneousBW and AveBW, with CurCwnd (0 to 4.5e7) on the second axis.]
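The socket size and SACK settings above map onto standard Linux sysctls; a minimal sketch, with the 22 Mbyte figure from the slide (23068672 bytes) and everything else being standard kernel knobs rather than values quoted in the talk:

    # raise the socket buffer ceilings to ~22 Mbytes
    sysctl -w net.core.rmem_max=23068672
    sysctl -w net.core.wmem_max=23068672
    sysctl -w net.ipv4.tcp_rmem="4096 87380 23068672"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 23068672"
    # the tests were run with SACK off
    sysctl -w net.ipv4.tcp_sack=0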

Slide 20: SC2004 Disk-Disk bbftp (work in progress)
- UKLight path: London-Chicago-London; PCs: SuperMicro + 3Ware RAID0
- MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
- Move a 2 GByte file; Web100 plots
- HS TCP
- Don't believe this is a protocol problem!
[Plots: Web100 traces over 0-45 s: TCP achievable throughput (Mbit/s, InstantaneousBW) with CurCwnd; DupAcksIn (delta) with CurCwnd; Timeouts (delta) with CurCwnd; OtherReductions (delta, log scale) with CurCwnd. Cwnd axis 0 to 4.5e7.]

Slide 21: Network & Disk Interactions (work in progress)
- Hosts: SuperMicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size
- Measure memory-to-RAID0 transfer rates with & without UDP traffic (a sketch of the measurement follows below)
- Disk write: 1735 Mbit/s
- Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
- Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
[Plots: RAID0 6-disk 1 GByte 64k writes on the 3w8506-8. Throughput (Mbit/s) vs trial number, plain and with 1500/9000 MTU UDP traffic; % CPU kernel mode (system mode L3+4 vs L1+2) for 8k and 64k writes, with linear fits y = -1.017x + 178.32 and y = -1.0479x + 174.44, i.e. roughly y = 178 - 1.05x.]
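A minimal sketch of this kind of memory-to-disk measurement (dd-based; the talk's own tool is not named here, and the mount point and file name are hypothetical):

    # write 1 GByte from memory to the RAID0 array in 64k blocks,
    # bypassing the page cache so the array, not the cache, is timed
    dd if=/dev/zero of=/raid0/testfile bs=64k count=16384 oflag=direct
    # repeat the run while UDP traffic loads the NIC (e.g. with UDPmon)
    # and compare the dd-reported rate with the unloaded case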

Slide 22: Network & Disk Interactions
- Disk write, mem-disk: 1735 Mbit/s; the work tends to stay on 1 die
- Disk write + UDP 1500, mem-disk: 1218 Mbit/s; both dies at ~80%
- Disk write + CPU memory load, mem-disk: 1341 Mbit/s; 1 CPU at ~60%, the other at 20%; large user-mode usage; below the cut = high BW; high BW = die 1 used
- Disk write + CPU load, mem-disk: 1334 Mbit/s; 1 CPU at ~60%, the other at 20%; all CPUs saturated in user mode
[Plots: % CPU system mode L3+4 vs L1+2 for 8k and 64k 1 GByte writes on the 3w8506-8: plain, with UDP, with memory-bandwidth load (membw) and with CPU load (cpuload); kernel and total CPU load; linear fits y = -1.0215x + 215.63 and y = -1.0529x + 206.46 against the cut y = 178 - 1.05x; throughput (Mbit/s) vs trial number for the membw 64k write, with the L3+L4 < cut selection highlighted.]

Slide 23: Summary, Conclusions & Thanks
- The host is critical: motherboards, NICs, RAID controllers and disks matter
- The NICs should be well designed:
  - the NIC should use 64-bit 133 MHz PCI-X (66 MHz PCI can be OK)
  - NIC/drivers: CSR access / clean buffer management / good interrupt handling
- Worry about the CPU-memory bandwidth as well as the PCI bandwidth: data crosses the memory bus at least 3 times
- Separate the data transfers: use motherboards with multiple 64-bit PCI-X buses; 32-bit 33 MHz is too slow for Gigabit rates; 64-bit 33 MHz is > 80% used
- Choose a modern high-throughput RAID controller; consider SW RAID0 of RAID5 HW controllers
- Need plenty of CPU power for sustained 1 Gbit/s transfers
- Packet loss is a killer: check campus links & equipment, and the access links to the backbones
- The new stacks are stable and give better response & performance
- Still need to set the TCP buffer sizes! Check other kernel settings, e.g. window scaling (see the sketch after this list)
- Application architecture & implementation is also important
- The interaction between HW, protocol processing and the disk sub-system is complex
MB-NG
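A hedged example of the buffer-size and window-scale checks the closing bullets point at (standard Linux sysctls; the expected values are illustrative, not from the talk):

    # window scaling must be on for TCP windows larger than 64 kbytes
    sysctl net.ipv4.tcp_window_scaling      # expect 1
    # min / default / max TCP buffer sizes, in bytes
    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem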

Slide 24: More Information: Some URLs
- UKLight web site: http://www.uklight.ac.uk
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/net
- Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/
- "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue, 2004
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing, 2004

Slide 25: Backup Slides

Slide 26: High Throughput Demonstrations
[Diagram: man03 in Manchester (Geneva) to lon01 in London (Chicago), dual Xeon 2.2 GHz hosts at each end, 1 GEth into Cisco 7609s, across the MB-NG core (Cisco GSRs, 2.5 Gbit SDH).]
- Send data with TCP; drop packets; monitor TCP with Web100

Slide 27: High Performance TCP: MB-NG
- Drop 1 in 25,000; rtt 6.2 ms; recovery in 1.6 s
[Plots: Cwnd recovery traces for Standard, HighSpeed and Scalable TCP.]

Slide 28: High Performance TCP: DataTAG
- Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
- High-Speed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 mins

Slide 29: SC2004 RAID Controller Performance
- SuperMicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus
- Configured as RAID0 with a 64 kbyte stripe size; six 74.3 GByte Western Digital Raptor WD740 SATA disks
- 75 Mbyte/s disk-buffer, 150 Mbyte/s buffer-memory
- Scientific Linux with 2.4.20 kernel + altAIMD patch (Yee) + packet-loss patch
- Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512
- RAID0 (striped), 2 GByte file: read 1460 Mbit/s, write 1320 Mbit/s
[Plots: RAID0 6-disk read and write, 3w8506-8, 16 Oct 04. Disk-to-memory read and memory-to-disk write throughput (Mbit/s) vs file size (Mbytes) for 64-64, 512-512 and 2048-2048 read/write buffer pairings.]

Slide 30: The Performance of the End Host / Disks. BaBar Case Study: RAID BW & PCI Activity
- 3Ware 7500-8, RAID5, parallel EIDE; the 3Ware card forces the PCI bus to 33 MHz
- BaBar Tyan to MB-NG SuperMicro; network mem-mem 619 Mbit/s
- Disk-disk throughput with bbcp: 40-45 Mbytes/s (320-360 Mbit/s); the PCI bus is effectively full!
- User throughput ~250 Mbit/s
[Plots: PCI activity while reading from and writing to the RAID5 disks.]

Slide 31: GridFTP Throughput + Web100
- RAID0 disks: 960 Mbit/s read, 800 Mbit/s write
- Throughput Mbit/s: alternates between 600/800 Mbit/s and zero; data rate 520 Mbit/s
- Cwnd smooth; no dup ACKs / send stalls / timeouts

Slide 32: HTTP Data Transfers
- HighSpeed TCP; same hardware; RAID0 disks
- Bulk data moved by web servers: Apache web server out of the box!
- Prototype client built on the curl HTTP library
- 1 Mbyte TCP buffers; 2 GByte file
- Throughput ~720 Mbit/s
- Cwnd: some variation; no dup ACKs / send stalls / timeouts
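For flavour, a command-line stand-in for that prototype libcurl client (the URL and output path are hypothetical; curl's -w report is standard):

    # pull a 2 GByte file over HTTP and report the average download rate
    curl -o /raid0/file2G -w 'average: %{speed_download} bytes/s\n' \
        http://server.example.org/file2G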

Slide 33: bbcp & GridFTP Throughput
- RAID5, 4 disks, Manchester to RAL; 2 GByte file transferred
- bbcp: mean 710 Mbit/s
- GridFTP: many zeros seen (plot annotations: mean ~710, mean ~620)
- DataTAG altAIMD kernel in BaBar & ATLAS