National Energy Research Scientific Computing Center (NERSC)
Network performance tuning
Eli Dart, Network Engineer
NERSC Center Division, LBNL
Supercomputing Conference, Nov. 2003
What is Network Tuning?
• Get the most out of the network
– Architecture
– Hosts
– Applications
• Required for the HPC center to function
• Many HPC projects have high-performance networking requirements
• Projects span multiple sites, multiple networks
Performance Tuning is Complex
• Users and network operators each see a portion of the system
• Different parts of the system interact in complex ways
• Local optimization can often lead to global performance degradation
NERSC Center
• High-bandwidth ESnet connection (2x GigE)
• Jumbo-clean (9000-byte packets) production services soon
• Peak traffic load doubles every year
• Major computational and storage systems
– Seaborg (IBM SP, 10 TFLOPs)
– PDSF (Linux cluster – HEP, astrophysics, etc.)
– HPSS (multi-petabyte mass storage system)
NERSC data traffic profile
Bulk data matters
• Primary driver for NERSC users
• Differing transfer profiles
– Large numbers of files
– Large files
• Increasing needs, increasing difficulty
– Requirements increasing for the foreseeable future
– Tuning will only become more critical
Tuning makes a difference
Site Improvement
ORNL 20x
Fermi 7x
Washington U 5x
BNL 2.5x
PPPL 2.5x
Tuning Goals
• Make it go fast!
– Oversimplification; underestimates the problem
– Potentially complex engineering tradeoffs
• The real list of goals:
– Overcome TCP’s timidity
– Clean up the network
– Understand the application
• Instrumentation and analysis are key
– Know your traffic
– Know your infrastructure
The big one – TCP
• The problem is congestion control/avoidance
– Packet loss is interpreted as congestion
– Response to early congestion collapse
• TCP is timid
– Packet loss is not always congestion
– Slow recovery from loss events
• Exponential backoff in response to packet loss
• Linear recovery – potentially very slow
– Attempts to allow everyone to use the network
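The backoff-and-recovery behavior above can be sketched with a toy additive-increase/multiplicative-decrease model. This is a simplification of real TCP, and the window sizes and loss schedule are invented for illustration:

```python
# Toy model of TCP congestion control: one step per round trip.
# Window is measured in segments; numbers are illustrative, not a real trace.
def aimd(rounds, loss_rounds, init_ssthresh=64):
    w, ssthresh = 1, init_ssthresh
    history = []
    for r in range(rounds):
        history.append(w)
        if r in loss_rounds:
            ssthresh = max(2, w // 2)  # halve on loss (multiplicative decrease)
            w = ssthresh
        elif w < ssthresh:
            w *= 2                     # slow start: exponential growth
        else:
            w += 1                     # congestion avoidance: +1 segment per RTT
    return history

# One loss at round 8: the window collapses from 66 to 33 segments,
# then climbs back one segment per round trip.
history = aimd(rounds=13, loss_rounds={8})
```

Halving is instantaneous, but regaining the lost half of the window takes one round trip per segment, which is what makes recovery on long, fast paths so slow.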
Two TCP camps
• Abandon TCP
– Interoperability/coexistence is key
– Support issues (research-ware)
– Production-quality concerns (vendor support, etc.)
• Fix/help TCP
– Modifications/extensions to TCP
• FAST, Vegas, etc.
• Web100/Net100
– Reduce packet loss in the network
– Host tuning
– Shares some problems with the Abandon camp
Today, we work with TCP
• Exists today
• Widely deployed/supported
• Significant performance gains with tuning
• Evaluating both camps for the future
Common problems – Hosts
• TCP buffer sizing
• Disk speed
• Interrupt coalescence
• Per-socket hashing for link aggregation
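The buffer-sizing item is quantitative: TCP can never move data faster than one window per round trip, so a socket buffer smaller than the bandwidth-delay product caps throughput regardless of link speed. A quick sketch (the 70 ms RTT and buffer sizes are assumed for illustration):

```python
# Throughput ceiling imposed by the TCP window: rate <= window / RTT.
def max_throughput_mbps(buffer_bytes, rtt_seconds):
    return buffer_bytes * 8 / rtt_seconds / 1e6

RTT = 0.070  # assumed cross-country round-trip time, in seconds

# A common 64 KB default buffer caps even a GigE path at under 8 Mbps:
default_cap = max_throughput_mbps(64 * 1024, RTT)

# Sizing the buffer to the bandwidth-delay product removes the cap:
bdp_bytes = (1e9 / 8) * RTT  # ~8.75 MB for 1 Gbps at 70 ms
tuned_cap = max_throughput_mbps(bdp_bytes, RTT)
```

The same arithmetic explains why the other items on the list matter: once the window is no longer the bottleneck, the disk, the interrupt rate, or a single link of an aggregated bundle becomes the limit instead.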
Common Problems – Network
• It all comes down to packet loss
• Network architecture
– Micro-congestion affecting TCP slow start
– Misbehaving or misconfigured devices
• Old gear
– Performance dependent on CPU
– Internal bottlenecks
– Flaky interfaces, buggy software
Solutions – host tuning
• TCP buffer size
– Don’t just crank it up
– Different parameters for different purposes
– Use modern software (ncftp, grid tools, etc.)
– Use per-destination parameters if at all possible
• Think about what you’re doing
– You can’t fill a GigE with an IDE disk
– Fast host interfaces can slow you down
– Netperf/Iperf often isn’t representative of production data transfer
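On Linux hosts the relevant knobs live in sysctl. A sketch of an /etc/sysctl.conf fragment raising the buffer ceilings follows; the knob names are the current Linux ones, the values are purely illustrative, and "don't just crank it up" still applies:

```
# /etc/sysctl.conf fragment -- values illustrative, not a recommendation
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# TCP buffer limits: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
```

Raising the maxima lets well-written applications request large per-socket buffers without forcing a large default on every connection.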
Solutions – packet loss
• Avoid finger pointing
– The network operator sees a link with headroom (everything’s fine)
– The user sees poor performance (the network is broken)
– The only way to tell is to look at the traffic
• Identify and fix points of micro-congestion
– TCP and ATM don’t get along (different design goals)
– “Impedance mismatches” in the network
– Network device tuning to handle packet bursts
• Clean up error-prone links
Switch fan-in micro-congestion
• Switches have small queues
• Unable to handle line-rate bursts in the presence of background traffic
[Diagram: Source A and Source B each connect at 1 Gbps to a core switch, which feeds the destination site over a 1 Gbps link carrying a nominal 500 Kbps background traffic load. A line-rate burst from TCP session startup arrives on top of the background load, and the excess (burst minus background traffic) is dropped.]
Routers don’t have this problem
• Routers have much larger queues
• Able to absorb traffic bursts without dropping packets (within reason)
[Diagram: the same topology with a core router in place of the switch: Source A and Source B at 1 Gbps, a 1 Gbps link to the destination site with a nominal 500 Kbps background traffic load. The line-rate burst from TCP session startup queues as excess traffic and no packets are dropped.]
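The contrast between the two diagrams comes down to queue depth. In a back-of-the-envelope model (packet rates and queue depths are invented for illustration), two sources bursting at line rate into one 1 Gbps egress port make the queue grow at the difference between the arrival and drain rates; a shallow switch queue overflows while a deep router queue absorbs the burst:

```python
# Toy egress-port model: during a burst the queue grows at
# (arrival - drain) packets/sec; anything past capacity is dropped.
# All numbers are illustrative, not measurements.
def dropped_packets(burst_pkts, arrival_pps, drain_pps, queue_cap_pkts):
    burst_secs = burst_pkts / arrival_pps
    backlog = max(0.0, (arrival_pps - drain_pps) * burst_secs)
    return max(0.0, backlog - queue_cap_pkts)

LINE_RATE_PPS = 83_333  # ~1 Gbps in 1500-byte packets per second

# Two sources burst simultaneously into one 1 Gbps egress port:
switch_drops = dropped_packets(1000, 2 * LINE_RATE_PPS, LINE_RATE_PPS, 64)
router_drops = dropped_packets(1000, 2 * LINE_RATE_PPS, LINE_RATE_PPS, 10_000)
```

With a 64-packet queue the switch drops over 400 of the 1000-packet burst; the deep router queue delivers everything, at the cost of some queuing delay.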
TCP slow start packet loss consequences

Transfer speed   Recovery time   Data sent during recovery
10 Mbps          8.6 sec         5.5 MB
100 Mbps         86 sec          537 MB
200 Mbps         3 min           4.8 GB
500 Mbps         7 min           13.4 GB
1 Gbps           14 min          53.5 GB
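The table is consistent with TCP halving its window on loss and then recovering linearly at one segment per round trip. Assuming a ~140 ms round-trip time and a 1460-byte MSS (both assumptions; the slide does not state them), the figures can be roughly reproduced:

```python
# Linear-recovery cost after a single loss halves the congestion window.
# The RTT and MSS are assumed values chosen to match the table's scale.
def recovery_after_loss(rate_bps, rtt_s=0.14, mss_bytes=1460):
    window_pkts = rate_bps * rtt_s / (8 * mss_bytes)  # full window, in segments
    recovery_s = (window_pkts / 2) * rtt_s            # +1 segment per RTT
    data_bytes = (rate_bps / 8) * recovery_s / 2      # sent at ~half rate meanwhile
    return recovery_s, data_bytes

# ~84 s and ~0.52 GB, close to the table's 86 sec / 537 MB row:
secs, sent = recovery_after_loss(100e6)
```

Recovery time grows linearly with link speed at a fixed RTT, which is why a single loss that is invisible at 10 Mbps costs a quarter of an hour at 1 Gbps.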
The most important thing
• Communication, communication, communication
• A 15-minute call can solve more problems than a month of email
• Users may not understand the problem or jump to conclusions
• System administrators are often caught in the middle
• Networking staff may not know that a problem exists
Case study – BNL and NERSC
• Goal – enable large, recurring data transfers from BNL to NERSC
• Periodic transfers of 1-4 terabytes
• Per-stream throughput less than 1 MB/sec (requires large numbers of parallel streams)
• Aggregate performance boost is of primary concern (multi-stream transfers OK, within reason)
Case study – BNL and NERSC
• Initial analysis by users indicated TCP window scaling was not being used
• First step – turn on window scaling
• Isolated troubleshooting by both sides indicates that window scaling has not been enabled at the remote site
• Finger pointing as each side tells the other to fix their stupid machines
Case study – BNL and NERSC
• Coordinated troubleshooting (simultaneous tcpdumps at both sites plus conference call) reveals both sides have window scaling enabled
• Something in the network is turning off window scaling
• BNL has a Cisco PIX Firewall on their border.
• Test network verifies PIX is disabling window scaling at packet level
Case study – BNL and NERSC
• Bug filed with Cisco
• Cisco knew about this behavior but didn’t document it
• Fixed in version 6.1 and later of the PIX software
• BNL was then able to saturate their OC-3 when transferring data to NERSC
Case study – Conclusions
• Network devices misbehave in strange and undocumented ways
• Multi-site, coordinated troubleshooting is necessary
– Communication is key
– Finger pointing doesn’t help
– Impossible to see the whole problem from any one vantage point in the network
– Solving this problem required NERSC, BNL, and ESnet staff working in concert
– Leadership/ownership is required for coordination of these efforts – for NERSC users, that’s us
Main Points
• Network performance tuning is a complex, multi-dimensional problem
• Coordinated effort is required
• Performance gains have a significant impact on science – it’s worth the time and effort
• NERSC users should involve us in their troubleshooting efforts ([email protected])
Resources
• NERSC
– NERSC consultants:
• [email protected]
• http://hpcf.nersc.gov/help/
– Networking group: [email protected]
• Other resources
– ESnet performance centers: https://performance.es.net/
– PSC tuning guide: http://www.psc.edu/networking/perf_tune.html
– Jumbo frames
• http://www.abilene.iu.edu/JumboMTU.html
• http://www.uoregon.edu/~joe/jumbo-clean-gear.html
Thanks for listening
• Questions?