National Energy Research Scientific Computing Center (NERSC)
Network performance tuning
Eli Dart, Network Engineer
NERSC Center Division, LBNL
Supercomputing Conference, Nov. 2003
What is Network Tuning?
• Get the most out of the network
– Architecture
– Hosts
– Applications
• Required for the HPC center to function
• Many HPC projects have high-performance networking requirements
• Projects span multiple sites, multiple networks
Performance Tuning is Complex
• Users and network operators each see a portion of the system
• Different parts of the system interact in complex ways
• Local optimization can often lead to global performance degradation
NERSC Center
• High-bandwidth ESnet connection (2x GigE)
• Jumbo-clean (9000-byte packets) production services soon
• Peak traffic load doubles every year
• Major computational and storage systems
– Seaborg (IBM SP, 10 TFLOPs)
– PDSF (Linux cluster – HEP, astrophysics, etc.)
– HPSS (multi-petabyte mass storage system)
NERSC data traffic profile
Bulk data matters
• Primary driver for NERSC users
• Differing transfer profiles
– Large numbers of files
– Large files
• Increasing needs, increasing difficulty
– Requirements increasing for the foreseeable future
– Tuning will only become more critical
Tuning makes a difference
Site Improvement
ORNL 20x
Fermi 7x
Washington U 5x
BNL 2.5x
PPPL 2.5x
Tuning Goals
• Make it go fast!
– Oversimplification; underestimates the problem
– Potentially complex engineering tradeoffs
• The real list of goals:
– Overcome TCP’s timidity
– Clean up the network
– Understand the application
• Instrumentation and analysis are key
– Know your traffic
– Know your infrastructure
The big one – TCP
• The problem is congestion control/avoidance
– Packet loss is interpreted as congestion
– Response to early congestion collapse
• TCP is timid
– Packet loss is not always congestion
– Slow recovery from loss events
• Exponential backoff in response to packet loss
• Linear recovery – potentially very slow
– Attempts to allow everyone to use the network
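The backoff-and-recovery behavior above can be sketched with a toy additive-increase/multiplicative-decrease model. This is a simplification of real TCP, and the window sizes and loss schedule are invented for illustration:

```python
# Toy model of TCP congestion control: one step per round trip.
# Window is measured in segments; numbers are illustrative, not a real trace.
def aimd(rounds, loss_rounds, init_ssthresh=64):
    w, ssthresh = 1, init_ssthresh
    history = []
    for r in range(rounds):
        history.append(w)
        if r in loss_rounds:
            ssthresh = max(2, w // 2)  # halve on loss (multiplicative decrease)
            w = ssthresh
        elif w < ssthresh:
            w *= 2                     # slow start: exponential growth
        else:
            w += 1                     # congestion avoidance: +1 segment per RTT
    return history

# One loss at round 8: the window collapses from 66 to 33 segments,
# then climbs back one segment per round trip.
history = aimd(rounds=13, loss_rounds={8})
```

Halving is instantaneous, but regaining the lost half of the window takes one round trip per segment, which is what makes recovery on long, fast paths so slow.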
Two TCP camps
• Abandon TCP
– Interoperability/coexistence is key
– Support issues (research-ware)
– Production-quality concerns (vendor support, etc.)
• Fix/help TCP
– Modifications/extensions to TCP
• FAST, Vegas, etc.
• Web100/Net100
– Reduce packet loss in the network
– Host tuning
– Shares some problems with the Abandon camp
Today, we work with TCP
• Exists today
• Widely deployed/supported
• Significant performance gains with tuning
• Evaluating both camps for the future
Common problems – Hosts
• TCP buffer sizing
• Disk speed
• Interrupt coalescence
• Per-socket hashing for link aggregation
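The buffer-sizing item is quantitative: TCP can never move data faster than one window per round trip, so a socket buffer smaller than the bandwidth-delay product caps throughput regardless of link speed. A quick sketch (the 70 ms RTT and buffer sizes are assumed for illustration):

```python
# Throughput ceiling imposed by the TCP window: rate <= window / RTT.
def max_throughput_mbps(buffer_bytes, rtt_seconds):
    return buffer_bytes * 8 / rtt_seconds / 1e6

RTT = 0.070  # assumed cross-country round-trip time, in seconds

# A common 64 KB default buffer caps even a GigE path at under 8 Mbps:
default_cap = max_throughput_mbps(64 * 1024, RTT)

# Sizing the buffer to the bandwidth-delay product removes the cap:
bdp_bytes = (1e9 / 8) * RTT  # ~8.75 MB for 1 Gbps at 70 ms
tuned_cap = max_throughput_mbps(bdp_bytes, RTT)
```

The same arithmetic explains why the other items on the list matter: once the window is no longer the bottleneck, the disk, the interrupt rate, or a single link of an aggregated bundle becomes the limit instead.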
Common Problems – Network
• It all comes down to packet loss
• Network architecture
– Micro-congestion affecting TCP slow start
– Misbehaving or misconfigured devices
• Old gear
– Performance dependent on CPU
– Internal bottlenecks
– Flaky interfaces, buggy software
Solutions – host tuning
• TCP buffer size
– Don’t just crank it up
– Different parameters for different purposes
– Use modern software (ncftp, grid tools, etc.)
– Use per-destination parameters if at all possible
• Think about what you’re doing
– You can’t fill a GigE with an IDE disk
– Fast host interfaces can slow you down
– Netperf/Iperf often isn’t representative of production data transfer
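On Linux hosts the relevant knobs live in sysctl. A sketch of an /etc/sysctl.conf fragment raising the buffer ceilings follows; the knob names are the current Linux ones, the values are purely illustrative, and "don't just crank it up" still applies:

```
# /etc/sysctl.conf fragment -- values illustrative, not a recommendation
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# TCP buffer limits: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
```

Raising the maxima lets well-written applications request large per-socket buffers without forcing a large default on every connection.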
Solutions – packet loss
• Avoid finger pointing
– The network operator sees a link with headroom (everything’s fine)
– The user sees poor performance (the network is broken)
– The only way to tell is to look at the traffic
• Identify and fix points of micro-congestion
– TCP and ATM don’t get along (different design goals)
– “Impedance mismatches” in the network
– Network device tuning to handle packet bursts
• Clean up error-prone links
Switch fan-in micro-congestion
• Switches have small queues
• Unable to handle line-rate bursts in the presence of background traffic
[Diagram: Source A and Source B each connect at 1 Gbps to a core switch, which feeds the destination site over a 1 Gbps link carrying a nominal 500 Kbps background traffic load. A line-rate burst from TCP session startup arrives on top of the background load, and the excess (burst minus background traffic) is dropped.]
Routers don’t have this problem
• Routers have much larger queues
• Able to absorb traffic bursts without dropping packets (within reason)
[Diagram: the same topology with a core router in place of the switch: Source A and Source B at 1 Gbps, a 1 Gbps link to the destination site with a nominal 500 Kbps background traffic load. The line-rate burst from TCP session startup queues as excess traffic and no packets are dropped.]
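The contrast between the two diagrams comes down to queue depth. In a back-of-the-envelope model (packet rates and queue depths are invented for illustration), two sources bursting at line rate into one 1 Gbps egress port make the queue grow at the difference between the arrival and drain rates; a shallow switch queue overflows while a deep router queue absorbs the burst:

```python
# Toy egress-port model: during a burst the queue grows at
# (arrival - drain) packets/sec; anything past capacity is dropped.
# All numbers are illustrative, not measurements.
def dropped_packets(burst_pkts, arrival_pps, drain_pps, queue_cap_pkts):
    burst_secs = burst_pkts / arrival_pps
    backlog = max(0.0, (arrival_pps - drain_pps) * burst_secs)
    return max(0.0, backlog - queue_cap_pkts)

LINE_RATE_PPS = 83_333  # ~1 Gbps in 1500-byte packets per second

# Two sources burst simultaneously into one 1 Gbps egress port:
switch_drops = dropped_packets(1000, 2 * LINE_RATE_PPS, LINE_RATE_PPS, 64)
router_drops = dropped_packets(1000, 2 * LINE_RATE_PPS, LINE_RATE_PPS, 10_000)
```

With a 64-packet queue the switch drops over 400 of the 1000-packet burst; the deep router queue delivers everything, at the cost of some queuing delay.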
TCP slow start packet loss consequences

Transfer speed   Recovery time   Data sent during recovery
10 Mbps          8.6 sec         5.5 MB
100 Mbps         86 sec          537 MB
200 Mbps         3 min           4.8 GB
500 Mbps         7 min           13.4 GB
1 Gbps           14 min          53.5 GB
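The table is consistent with TCP halving its window on loss and then recovering linearly at one segment per round trip. Assuming a ~140 ms round-trip time and a 1460-byte MSS (both assumptions; the slide does not state them), the figures can be roughly reproduced:

```python
# Linear-recovery cost after a single loss halves the congestion window.
# The RTT and MSS are assumed values chosen to match the table's scale.
def recovery_after_loss(rate_bps, rtt_s=0.14, mss_bytes=1460):
    window_pkts = rate_bps * rtt_s / (8 * mss_bytes)  # full window, in segments
    recovery_s = (window_pkts / 2) * rtt_s            # +1 segment per RTT
    data_bytes = (rate_bps / 8) * recovery_s / 2      # sent at ~half rate meanwhile
    return recovery_s, data_bytes

# ~84 s and ~0.52 GB, close to the table's 86 sec / 537 MB row:
secs, sent = recovery_after_loss(100e6)
```

Recovery time grows linearly with link speed at a fixed RTT, which is why a single loss that is invisible at 10 Mbps costs a quarter of an hour at 1 Gbps.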
The most important thing
• Communication, communication, communication
• A 15-minute call can solve more problems than a month of email
• Users may not understand the problem or jump to conclusions
• System administrators are often caught in the middle
• Networking staff may not know that a problem exists
Case study – BNL and NERSC
• Goal – enable large, recurring data transfers from BNL to NERSC
• Periodic transfers of 1-4 terabytes
• Per-stream throughput less than 1 MB/sec (requires large numbers of parallel streams)
• Aggregate performance boost is of primary concern (multi-stream transfers OK, within reason)
Case study – BNL and NERSC
• Initial analysis by users indicated TCP window scaling was not being used
• First step – turn on window scaling
• Isolated troubleshooting by both sides indicates that window scaling has not been enabled at the remote site
• Finger pointing as each side tells the other to fix their stupid machines
Case study – BNL and NERSC
• Coordinated troubleshooting (simultaneous tcpdumps at both sites plus conference call) reveals both sides have window scaling enabled
• Something in the network is turning off window scaling
• BNL has a Cisco PIX Firewall on their border.
• Test network verifies PIX is disabling window scaling at packet level
Case study – BNL and NERSC
• Bug filed with Cisco
• Cisco knew about this behavior but didn’t document it
• Fixed in version 6.1 and later of the PIX software
• BNL was then able to saturate their OC-3 when transferring data to NERSC
Case study – Conclusions
• Network devices misbehave in strange and undocumented ways
• Multi-site, coordinated troubleshooting is necessary
– Communication is key
– Finger pointing doesn’t help
– Impossible to see the whole problem from any one vantage point in the network
– Solving this problem required NERSC, BNL, and ESnet staff working in concert
– Leadership/ownership is required for coordination of these efforts – for NERSC users, that’s us
Main Points
• Network performance tuning is a complex, multi-dimensional problem
• Coordinated effort is required
• Performance gains have a significant impact on science – it’s worth the time and effort
• NERSC users should involve us in their troubleshooting efforts ([email protected])
Resources
• NERSC
– NERSC consultants:
• [email protected]
• http://hpcf.nersc.gov/help/
– Networking group: [email protected]
• Other resources
– ESnet performance centers: https://performance.es.net/
– PSC tuning guide: http://www.psc.edu/networking/perf_tune.html
– Jumbo frames
• http://www.abilene.iu.edu/JumboMTU.html
• http://www.uoregon.edu/~joe/jumbo-clean-gear.html
Thanks for listening
• Questions?