1
Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications
Jeff Diamond (1), Martin Burtscher (2), John D. McCalpin (3), Byoung-Do Kim (3), Stephen W. Keckler (1,4), James C. Browne (1)
(1) University of Texas, (2) Texas State, (3) Texas Advanced Computing Center, (4) NVIDIA
2
Trends In Supercomputers
3
Is multicore an issue?
4
The Problem: Multicore Scalability
5
The Problem: Multicore Scalability
6
Optimizations Differ in Multicore
Base code vs Multicore Optimized code
7
Paper Contributions
Studies multicore-related bottlenecks
Identifies performance measurement challenges unique to multicore systems
Presents systematic approach to multicore performance analysis
Demonstrates principles of optimization
8
Talk Outline
Introduction
Approach: An HPC Case Study
Multicore Measurement Issues
Optimization Example
Conclusion
9
Approach: An HPC Case Study
Examine a real HPC application
  Major functions add variety
What is a typical HPC application?
  Many exhibit low arithmetic intensity
  Typical of explicit / iterative solvers, stencils
  Finite volume / elements / differences
  Molecular dynamics, particle simulations, graph search, Sparse MM, etc.
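To make "low arithmetic intensity" concrete, the following is a minimal Python sketch. It assumes a hypothetical 7-point 3-D stencil on double-precision data; the stencil shape, the flop count, and the two memory-traffic cases are illustrative assumptions, not figures from the talk.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Return floating-point operations per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


# Hypothetical 7-point stencil update of one grid point (assumed, not from
# the talk): 7 multiplies + 6 adds = 13 flops.
STENCIL_FLOPS = 13

# Assumed worst case, no cache reuse: read 7 doubles and write 1 double.
STENCIL_BYTES_NO_REUSE = 8 * 8

# Assumed best case, neighbors fully reused from cache: read 1 double and
# write 1 double per grid point.
STENCIL_BYTES_PERFECT_REUSE = 2 * 8

worst = arithmetic_intensity(STENCIL_FLOPS, STENCIL_BYTES_NO_REUSE)
best = arithmetic_intensity(STENCIL_FLOPS, STENCIL_BYTES_PERFECT_REUSE)

print(f"flops/byte, no reuse:      {worst:.3f}")
print(f"flops/byte, perfect reuse: {best:.3f}")
```

Under these assumptions the kernel performs less than one flop per byte moved in both cases, which is why codes of this kind tend to be limited by the memory system rather than by the arithmetic units.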
10
Application: HOMME
High Order Method Modeling Environment
3-D Atmospheric Simulation from NCAR
Required for NSF acceptance testing
Excellent scaling, highly optimized
Arithmetic intensity typical of stencil codes
11
Talk Outline
Introduction
Approach: An HPC Case Study
Multicore Measurement Issues
Optimization Example
Conclusion
12
Multicore Performance Bottlenecks
[Diagram of one node. Single chip: four L1 and four L2 caches (private L1/L2 cache) above one shared L3 cache. Shared off-chip BW links the chip to local DRAM (single DIMM) with shared DRAM page caches.]
13
Disturbances Persist Longer
14
Measurement Implications
15
Measurements Must Be Lightweight
Action                     Cycles
Read Counter                    9
Read Four Counters             30
Call Function                  40
PAPI READ                     400
System Call                 5,000
TLB Page Initialization    25,000

Duration of major HOMME functions
Function Duration         Calls Per Second   % Exec Time
2,000 cycles or less               100,000           20%
2,000 to 10,000 cycles              20,000           10%
10K to 200K cycles                   1,600           15%
200K to 1M cycles                      200           15%
1M to 10M cycles                         -            0%
10M or more cycles                       4           35%
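The two tables on this slide can be combined into a rough estimate of measurement overhead. The Python sketch below uses the cycle costs listed on the slide; the assumption that each function call is bracketed by exactly two measurement actions (one before, one after) is mine, as is the helper name `bracket_overhead`.

```python
# Cycle costs copied from the slide's Action / Cycles table.
ACTION_CYCLES = {
    "Read Counter": 9,
    "Read Four Counters": 30,
    "Call Function": 40,
    "PAPI READ": 400,
    "System Call": 5_000,
    "TLB Page Initialization": 25_000,
}


def bracket_overhead(action: str, function_cycles: int, reads: int = 2) -> float:
    """Measurement cost for one call, as a fraction of the call's duration.

    Assumes (not stated in the talk) that the function is bracketed by
    `reads` measurement actions, one before and one after by default.
    """
    if function_cycles <= 0:
        raise ValueError("function_cycles must be positive")
    return reads * ACTION_CYCLES[action] / function_cycles


# 2,000 cycles is the upper bound of the shortest duration bucket on the slide.
SHORT_FUNCTION_CYCLES = 2_000

for action in ACTION_CYCLES:
    pct = 100 * bracket_overhead(action, SHORT_FUNCTION_CYCLES)
    print(f"{action:<24} {pct:8.1f}% of a {SHORT_FUNCTION_CYCLES}-cycle call")
```

Under these assumptions, bracketing a 2,000-cycle function with two PAPI reads costs 40% of the function's own run time, while two raw counter reads cost under 1%.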
16
Multicore Measurement Issues
Performance issues in shared memory systems
  Context sensitive
  Nondeterministic
  Highly nonlocal
Measurement disturbance is significant
  Accessing memory or delaying core
  Hard to "bracket" measurement effects
  Disturbances can last billions of cycles
  Bottlenecks can be "bursty"
Conclusion: need multiple tools
17
Talk Outline
Introduction
Approach: An HPC Case Study
Multicore Measurement Issues
Optimization Example
Conclusion
18
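Multicore Performance Bottlenecks
[Diagram of one node. Single chip: four L1 and four L2 caches above one shared L3 cache. Shared off-chip BW links the chip to local DRAM (single DIMM) with shared DRAM page caches. Only the shared resources are labeled on this slide.]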
19
Measurement Approach
Find important functions
Compare performance counters at min/max core density
Identify key multicore bottleneck:
  L3 capacity: L3 miss rates increase with density
  Off-chip BW: BW usage at min density is greater than the core's share
  DRAM contention: DRAM page miss rates increase with density
For small and medium functions, follow up with light