A Static Binary Instrumentation Threading Model for Fast Memory Trace Collection
Michael Laurenzano (1), Joshua Peraza (1), Laura Carrington (1), Ananta Tiwari (1), William A. Ward (2), Roy Campbell (2)
1 Performance Modeling and Characterization (PMaC) Laboratory, San Diego Supercomputer Center
2 High Performance Computing Modernization Program (HPCMP), United States Department of Defense
Transcript
Memory-driven HPC
· Many HPC applications are memory bound
Before instrumentation:
0000c000 <foo>:
  c000: 48 89 7d f8   mov    %rdi,-0x8(%rbp)
  c004: 5e            pop    %rsi
  c005: 75 f8         jne    0xc004
  c007: c9            leaveq
  c008: c3            retq
After instrumentation (inserted snippets shown as comments):
0000c000 <foo>:
  c000: // compute -0x8(%rbp) and copy it to a buffer
  c008: 48 89 7d f8   mov    %rdi,-0x8(%rbp)
  c00c: // compute (%rsp) and copy it to a buffer
  c014: 5e            pop    %rsi
  c015: 75 f8         jne    0xc00c
  c017: c9            leaveq
  c018: c3            retq
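Conceptually, each inserted snippet computes the effective address of the upcoming access and appends it to a buffer that is drained when full. A minimal C sketch of that idea (not PEBIL's actual code; the buffer size and the drain step are assumptions):

```c
#include <stdint.h>

#define BUF_CAP 4096                 /* assumed buffer capacity */

static uint64_t addr_buf[BUF_CAP];
static unsigned buf_pos = 0;
static uint64_t drained = 0;         /* stands in for the analysis consumer */

/* Called before each memory access: record the effective address and,
 * when the buffer fills, hand the batch off and reset. */
static void record_address(uint64_t effective_addr) {
    addr_buf[buf_pos++] = effective_addr;
    if (buf_pos == BUF_CAP) {
        drained += buf_pos;          /* real code would process the batch */
        buf_pos = 0;
    }
}
```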
Enter Multithreaded Apps
· All threads use a single buffer?
  – Don't need to know which thread is executing
· A buffer for each thread?
  – Faster: no concurrency operations needed
  – More interesting: per-thread behavior != average thread behavior
· PEBIL uses the latter
  – Fast method for computing the location of thread-local data
  – Cache that location in a register if possible
Thread-local Instrumentation Data in PEBIL
· Provide a large table to each process (2 MB)
  – Each entry is a small pool of memory (32 bytes)
· Must be VERY fast
  – Get thread id (1 instruction)
  – Simple hash of thread id (2 instructions)
  – Index table with hashed id (1 instruction)
· Assume no collisions (so far so good)
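The four-instruction lookup can be sketched in C roughly as follows. The constants follow the slides (a 2 MB table of 32-byte pools gives 65536 entries); `pthread_self()` and the shift-and-mask hash are stand-ins, not PEBIL's exact choices:

```c
#include <stdint.h>
#include <pthread.h>

#define TABLE_ENTRIES 65536          /* 2 MB table / 32-byte pools */
#define POOL_BYTES 32

static unsigned char pools[TABLE_ENTRIES][POOL_BYTES];

/* Map the current thread to its private memory pool; the real code
 * does the equivalent work in about 4 x86 instructions. */
static void *thread_pool(void) {
    uintptr_t tid = (uintptr_t)pthread_self();         /* 1 insn: get id */
    uintptr_t idx = (tid >> 6) & (TABLE_ENTRIES - 1);  /* 2 insns: hash  */
    return pools[idx];                                 /* 1 insn: index  */
}
```

Collisions are simply assumed not to happen, matching the slide's "so far so good" caveat.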
[Diagram: the hash function maps each thread's id (threads 1-4) to its own entry among the thread-local memory pools]
Caching Thread-local Data
· Cache the address of thread-local data
  – Dead registers are known at instrumentation time
  – Is there one register in a function which is dead everywhere?
· Compute the thread-local data address only at function [re]entry
· Should use smaller scopes! (loops, blocks)
[Chart: significant reductions in overhead]
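The effect of caching can be sketched as hoisting the lookup out of the hot path: compute the address once at function [re]entry, keep it in a dead register, and reuse it at every instrumentation point. `lookup_pool()` below is a hypothetical stand-in for the table lookup:

```c
#include <stdint.h>

static uint64_t lookups = 0;         /* counts how often the slow path runs */

/* Hypothetical stand-in for the thread-local table lookup. */
static void *lookup_pool(void) {
    static unsigned char pool[32];
    lookups++;
    return pool;
}

/* With caching: one lookup per function entry, regardless of how many
 * instrumented accesses the function performs. */
static void instrumented_function(int accesses) {
    void *pool = lookup_pool();      /* once at entry, cached in a register */
    for (int i = 0; i < accesses; i++) {
        (void)pool;                  /* each access reuses the cached address */
    }
}
```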
Other x86/Linux Binary Instrumentation

Tool Name    | Static or Dynamic | Thread-local Data Access                                                      | Threading Overhead | Runtime Overhead
Pin [1]      | Dynamic           | Register stolen from program; program JIT-compiled around that lost register  | Very low           | Medium
Dyninst [2]  | Either            | Compute thread ID (layered function call) at every point                      | High               | Varies
PEBIL [3]    | Static            | Table + fast hash function (4 instructions); cache result in dead registers   | Low                | Low
[1] Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. Luk, C., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.
[2] An API for Runtime Code Patching. Buck, B. and Hollingsworth, J. International Journal of High Performance Computing Applications, 2000.
[3] PEBIL: Efficient Static Binary Instrumentation for Linux. Laurenzano, M., Tikir, M., Carrington, L., and Snavely, A. International Symposium on the Performance Analysis of Systems and Software, 2010.
· Basic block counting
  – Classic test in the binary instrumentation literature
  – Increment a counter each time a basic block is executed
  – Per-block, per-process, per-thread counters
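Per-thread basic block counting reduces to a single increment of a thread-local slot. A sketch, with `NUM_BLOCKS` an assumed size rather than a PEBIL constant:

```c
#include <stdint.h>

#define NUM_BLOCKS 1024                              /* assumed block count */

/* One counter array per thread: no synchronization needed on the hot path. */
static _Thread_local uint64_t block_counts[NUM_BLOCKS];

/* Inserted at each basic block entry: one add per block execution. */
static void count_block(unsigned block_id) {
    block_counts[block_id]++;
}
```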
· Memory address tracing
  – Fill a process/thread-local buffer with memory addresses, then discard those addresses
  – Interval-based sampling
    · Take the first 10% of each billion memory accesses
    · Toggle instrumentation on/off when moving between sampling and non-sampling
Methodology
· 2 quad-core Xeon X3450, 2.67 GHz
  – 32 KB L1 and 256 KB L2 cache per core, 8 MB L3 per processor
· NAS Parallel Benchmarks
  – 2 sets: OpenMP and MPI, gcc/GOMP and gcc/mpich
  – 8 threads/processes: CG, DC (OpenMP only), EP, FT, IS, LU, MG
  – 4 threads/processes: BT, SP
· Dyninst 7.0 (dynamic)
  – Timing started when the instrumented app begins running
· Pin 2.12
· PEBIL 2.0
Basic Block Counting (MPI)
· All results are the average of 3 runs
· Slowdown relative to an un-instrumented run
Interval-based Sampling
· Extract useful information from a subset of the memory address stream
  – Simple approach: the first 10% of every billion addresses
  – In practice we use a window 100x as small
· Obvious: avoid processing addresses (e.g., just collect and throw away)
· Not so obvious: avoid collecting addresses
  – Instrumentation tools can disable/re-enable instrumentation
  – PEBIL: binary on/off. Very lightweight, but limited
  – Pin and Dyninst: arbitrary removal/re-instrumentation. Heavyweight, but versatile
  – Sampling only requires on/off functionality
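The on/off sampling policy itself amounts to a counter test: instrumentation stays on for the first portion of every interval of accesses. A sketch, with the constants scaled down from the talk's first-10%-of-a-billion example so it is easy to exercise:

```c
#include <stdint.h>
#include <stdbool.h>

#define INTERVAL 1000ULL   /* scaled down from 1 billion */
#define SAMPLE    100ULL   /* first 10% of each interval */

static uint64_t accesses = 0;

/* True while inside the sampling window of the current interval;
 * the tool would toggle instrumentation on/off at the transitions. */
static bool sampling_on(void) {
    return (accesses++ % INTERVAL) < SAMPLE;
}
```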
Sampled Memory Tracing (MPI)
· PEBIL always improves, and significantly
· Pin usually, but not always, improves
  – The amount and complexity of code re-instrumented during each interval probably drives this