GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks
*Sudheer Chunduri, *Taylor Groves, *Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas J. Wright
SC 19 - Denver, CO (*primary authors contributed equally)
The HPC and Data Center community needs a standard set of benchmarks for characterizing network performance under load.
1. Motivate and introduce GPCNeT, a network congestion benchmark
2. Describe the design of GPCNeT
3. Compare GPCNeT to congestion seen in production
4. Architectural/site evaluations:
   – 4 different DoE labs
   – 3 different network architectures
   – Including the Slingshot network with advanced congestion control
Summary of Contributions
Sample of work at SC 13-19 focused on network congestion:
• There Goes the Neighborhood: Performance Degradation Due to Nearby Jobs. SC13
• Network Endpoint Congestion Control for Fine-Grained Communication. SC15
• Evaluating HPC Networks via Simulation of Parallel Workloads. SC16
• Watch Out for the Bully! Job Interference Study on Dragonfly Network. SC16
• Run-to-run Variability on Xeon Phi Based Cray XC Systems. SC17
• Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing. SC18
• Understanding Congestion in High Performance Interconnection Networks Using Sampling. SC19
• Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing. SC19
• …
Despite the importance, there is no standard benchmark to measure network performance under congestion.
Network Congestion is Trending
“Tests like ping pong latency are like trying to understand your commute into NYC by driving the route alone at 4am.” – Steve Scott
Best Case Performance is Rare
Ping Pong on a quiet system
Doing an FFT with congestion
GPCNeT default is aggressive and stresses the system
7 Systems (4 DoE Production, 3 Cray Testbeds)
• Theta, Edison, Sierra, Summit
• System size from 128 to 5.5k nodes
• Aries, EDR IB and Slingshot Networks
• Fully populated with GPCNeT defaults
• Report mean and P99 normalized to baseline
GPCNeT Architectural Comparisons
[Figure: chart not preserved; surviving labels: EDR IB, 100%, 50%, 128]
Impact of Congestion on Modern Systems
Slowdown (multiplier) compared to mean baseline (log-scale)
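The normalization described on these slides (mean and P99 reported as a slowdown multiplier relative to the mean baseline) can be sketched as follows. This is a minimal illustration, not GPCNeT's own code: the function names, the nearest-rank percentile method, the sample values, and the assumption of a latency-like metric (lower is better) are all hypothetical choices made for the example.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method (an assumed choice)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest rank: smallest 1-based index covering pct percent of the samples.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slowdown_vs_mean_baseline(baseline, congested):
    """Normalize the congested mean and P99 to the mean of the baseline run.

    Assumes a latency-like metric (lower is better), so a value above 1.0
    means the congested run was slower than the isolated baseline.
    """
    base_mean = mean(baseline)
    return {
        "mean": mean(congested) / base_mean,
        "p99": percentile_nearest_rank(congested, 99) / base_mean,
    }


# Hypothetical latency samples in microseconds, for illustration only.
baseline_us = [2.0, 2.0, 2.0, 2.0]
congested_us = [4.0, 6.0, 8.0, 30.0]

result = slowdown_vs_mean_baseline(baseline_us, congested_us)
print(result)  # {'mean': 6.0, 'p99': 15.0}
```

With these made-up samples the tail (P99) slowdown is much larger than the mean slowdown, which is why reporting both is informative. For a bandwidth-like metric, where higher is better, the ratio would need to be inverted to keep "larger means worse".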