Dissecting On-node Memory Performance with MemAxes
Petascale Tools Workshop 2014
Madison, WI, August 4-7, 2014

Alfredo Gimenez*, Todd Gamblin†, Martin Schulz†, Peer-Timo Bremer†, Barry Rountree†, Abhinav Bhatele†, Ilir Jusufi*, and Bernd Hammann*
† LLNL, * UC Davis

Lawrence Livermore National Laboratory, LLNL-PRES-657922
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Memory Access Sampling
• Recent hardware additions allow us to precisely sample events, including memory accesses
  – Intel PEBS, AMD IBS
• Memory access samples contain:
  – The instruction pointer
  – The address accessed
  – How many core clock cycles elapsed during the access
  – Where in the memory hierarchy the address was resolved (e.g. L1 cache, Local RAM, Remote RAM)
• We need a way to meaningfully interpret these samples
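
As a rough illustration of the four fields listed above, a sample can be modeled as a small record and summed per memory-hierarchy level. This is only a sketch: the class name, field names, and sample values are hypothetical and do not reflect the actual PEBS or IBS record layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySample:
    """One sampled memory access (hypothetical field names)."""
    ip: int        # instruction pointer of the sampled access
    address: int   # data address that was accessed
    latency: int   # core clock cycles elapsed during the access
    source: str    # where the address resolved, e.g. "L1", "Local RAM"


def latency_by_source(samples):
    """Sum sampled latency (cycles) per memory-hierarchy level."""
    totals = defaultdict(int)
    for s in samples:
        totals[s.source] += s.latency
    return dict(totals)


# Synthetic samples, made up for illustration only.
samples = [
    MemorySample(ip=0x400A10, address=0x7F0000001000, latency=4, source="L1"),
    MemorySample(ip=0x400A10, address=0x7F0000002000, latency=210, source="Local RAM"),
    MemorySample(ip=0x400B44, address=0x7F0000103000, latency=340, source="Remote RAM"),
    MemorySample(ip=0x400B44, address=0x7F0000001040, latency=5, source="L1"),
]
print(latency_by_source(samples))
```

Raw totals like these say where cycles went in the hierarchy, but not which code, data structure, or core was responsible, which is the gap the following slides address.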

Adding Context
• Can better understand memory references with appropriate context
• Contexts include:
  – The code
  – The node hardware topology
  – Calling context (call path)
  – The application (e.g. fluid dynamics)
• Slide annotations on the context list: "Can get these from tools" and "Need help from app"
• Other work by Liu & Mellor-Crummey has looked at mapping latency and access patterns to particular variables and call paths.

We can already get coarse-grained application context for some codes
• Physics data is available in data structures
• Time steps are easy to mark in the code
• Per-process performance
  – easy to get
  – just turn on counters at the beginning of the run
  – read them periodically
• What if we want finer-grained attribution?
  – How to tie measurements to data structures?
  – How to slice and dice the data?
[Figure: FLOP/s per MPI process; label: Aluminum]

Node topology is easy to get, but not shown clearly.
• PEBS provides metadata for node topology
• Want to highlight connections clearly to show:
  – Load distribution
  – Bandwidth
  – Resource contention
• Existing visualization from hwloc (right)
  – Does not scale
  – Clutters connections between components

We have developed a measurement tool for collecting detailed context
• Use PEBS sampling for hardware information
• Supplement with application instrumentation for mapping addresses to physical coordinates
* SMT (Semantic Memory Tree): data structure used to map sampled instruction operands [figure label: callbacks]

Currently the developer has to instrument the application manually
• Add calls to get metadata for allocated objects:
  1. Label string
  2. Start and end addresses
  3. Size of each element
  4. Number of elements
  5. Callback to map address to physical coordinates
• Metadata must be provided by the programmer
  – Could easily be implemented in libraries
  – Lots of common mesh libraries would be interesting for this.
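
The five metadata items above can be sketched as a small registry that resolves a sampled address to a labeled object, an element index, and physical coordinates. This is not the tool's actual API: `DataObject`, `DataObjectRegistry`, and the 4x4x4 mesh are hypothetical, the registry assumes registered objects do not overlap, and for simplicity the callback here takes an element index rather than a raw address.

```python
import bisect
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class DataObject:
    label: str       # 1. label string
    start: int       # 2. start address (inclusive)
    end: int         # 2. end address (exclusive)
    elem_size: int   # 3. size of each element in bytes
    num_elems: int   # 4. number of elements
    to_coords: Callable[[int], Tuple[int, ...]]  # 5. element index -> coordinates


class DataObjectRegistry:
    """Sorted list of non-overlapping objects, searched by start address."""

    def __init__(self):
        self._starts = []    # sorted start addresses
        self._objects = []   # DataObjects in the same order

    def register(self, obj):
        i = bisect.bisect_left(self._starts, obj.start)
        self._starts.insert(i, obj.start)
        self._objects.insert(i, obj)

    def resolve(self, address):
        """Return (label, element index, coordinates), or None if unmapped."""
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        obj = self._objects[i]
        if not (obj.start <= address < obj.end):
            return None
        index = (address - obj.start) // obj.elem_size
        return obj.label, index, obj.to_coords(index)


# Hypothetical 4x4x4 mesh of 8-byte values stored in x-fastest order.
registry = DataObjectRegistry()
registry.register(DataObject(
    label="energy", start=0x10000, end=0x10000 + 64 * 8,
    elem_size=8, num_elems=64,
    to_coords=lambda i: (i % 4, (i // 4) % 4, i // 16),
))
print(registry.resolve(0x10000 + 8 * 21 + 3))
print(registry.resolve(0x20000))
```

A sorted list with binary search keeps lookups logarithmic in the number of registered objects, which matters when every sample has to be resolved.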

Instrumentation
[Code figure annotations:]
• Specify DataObjects
• Add additional semantic attributes and define attribution function (optional)

Lagrangian Hydrodynamics: LULESH
[Figure panels: 2D; 3D; 3D with mapped performance data]

We have developed MemAxes, a tool for analyzing on-node memory performance
• Measurement component samples memory instructions
• We map latency information onto A) source code, B) node topology
• C) Pie chart shows percent of total latency selected
• D) Parallel coordinates view allows exploration of correlations
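
The "percent of total latency selected" quantity shown in the pie chart can be illustrated with a short computation. This is a sketch under stated assumptions, not MemAxes code: the function name, the `(source_line, latency)` tuple layout, and the file names are hypothetical.

```python
def latency_share(samples, selected):
    """Percent of total sampled latency carried by the selected samples.

    samples:  iterable of (key, latency_cycles) pairs
    selected: predicate on the key, e.g. a source-line filter
    """
    samples = list(samples)
    total = sum(latency for _, latency in samples)
    if total == 0:
        return 0.0
    chosen = sum(latency for key, latency in samples if selected(key))
    return 100.0 * chosen / total


# Synthetic samples keyed by a made-up source location.
samples = [
    ("kernel.cc:42", 300),
    ("kernel.cc:87", 100),
    ("kernel.cc:42", 100),
]
share = latency_share(samples, lambda line: line == "kernel.cc:42")
print(f"{share:.1f}% of sampled latency is in the selection")
```

The same predicate could just as well select by core id or by data object label, which is what makes a single selection meaningful across linked views.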

Linked views clearly show on-node locality problems
• Parallel coordinates view shows correlation between array index and core id in LULESH
• Linked node topology view shows data motion for highlighted memory operations
• A contiguous chunk of an array is initially split between threads on four cores
• Using an optimized affinity scheme, we improve locality
• Performance improved by 10%
[Figure panels: Default thread affinity with poor locality; Optimized thread affinity with good locality]
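
The index-to-core correlation described above can be checked numerically by counting how many cores touch each contiguous chunk of an array. The sketch below uses synthetic `(array_index, core_id)` pairs, not LULESH measurements, and the function name and chunk size are hypothetical.

```python
from collections import defaultdict


def cores_per_chunk(samples, chunk_size):
    """Map each contiguous array chunk to the sorted core ids that accessed it."""
    touched = defaultdict(set)
    for index, core in samples:
        touched[index // chunk_size].add(core)
    return {chunk: sorted(cores) for chunk, cores in sorted(touched.items())}


# Poor locality: neighboring indices are accessed from four different cores.
poor = [(i, i % 4) for i in range(8)]
# Good locality: each chunk of four indices stays on one core.
good = [(i, i // 4) for i in range(8)]

print(cores_per_chunk(poor, chunk_size=4))
print(cores_per_chunk(good, chunk_size=4))
```

A chunk touched by many cores suggests its cache lines move between cores, which is the kind of pattern the linked topology view makes visible.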

Hyperion Thread/Core Binding
• Improved cache usage
• 44% fewer access cycles
• 10% total speedup

Future work
• Back-port perf_events API to production TOSS 2 kernel
  – Currently unable to do fine-grained memory sampling on production machines due to PMU access limits
  – Affects some Intel thread tools as well
• More detailed architecture mapping
  – Sandy Bridge LLC ring interconnect information?
  – Other node architecture features?
• Instrument AMR libraries for proper context attribution
  – Study per-patch memory behavior
  – Study blocking behavior of solvers
• How to query large instruction traces effectively?