An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA
Allen D. Malony (1), Scott Biersdorff (2), Wyatt Spear (2), Shangkar Mayanglambam (3)
(1) Department of Computer and Information Science, University of Oregon
(2) Performance Research Laboratory, University of Oregon
(3) Qualcomm Corporation
Measurement of Heterogeneous Applications using CUDA (ICS 2010)
Motivation
Heterogeneous parallel systems are highly relevant today
  Heterogeneous hardware technology more accessible
  Performance is the main driving concern
  Heterogeneity is an important (the?) path to extreme scale
Heterogeneous software technology required for performance
  More sophisticated parallel programming environments
  Integrated parallel performance tools
    support heterogeneous performance model and perspectives
Implications for Parallel Performance Tools
Current status quo is somewhat comfortable
  Mostly homogeneous parallel systems and software
    Shared-memory multithreading – OpenMP
    Distributed-memory message passing – MPI
  Parallel computational models are relatively stable (simple)
  Corresponding performance models are relatively tractable
  Parallel performance tools can keep up and evolve
Heterogeneity creates richer computational potential
  Results in greater performance diversity and complexity
Heterogeneous systems will utilize more sophisticated programming and runtime environments
Performance tools have to support richer computation models and more versatile performance perspectives
Heterogeneous Performance Views
Want to create performance views that capture heterogeneous concurrency and execution behavior
  Reflect interactions between heterogeneous components
  Capture performance semantics relative to computation model
  Assimilate performance for all execution paths for shared view
Existing parallel performance tools are CPU (host)-centric
  Event-based sampling (not appropriate for accelerators)
  Direct measurement (through instrumentation of events)
What perspective does the host have of other components?
  Determines the semantics of the measurement data
  Determines assumptions about behavior and interactions
Performance views may have to work with reduced data
Task-based Performance View
Consider the “task” abstraction for GPU accelerator scenario
Host regards external execution as a task
  Tasks operate concurrently with respect to the host
  Requires support for tracking asynchronous execution
Host creates measurement perspective for external task
  Maintains local and remote performance data
Tasks may have limited measurement support
  May depend on host for performance data I/O
  Performance data might be received from external task
How to create a view of heterogeneous external performance?
CUDA Performance Perspective
CUDA enables programming of kernels for GPU acceleration
  GPU acceleration acts as an external task
Performance measurement appears straightforward
Execution model complicates performance measurement
  Synchronous and asynchronous operation with respect to host
  Overlapping of data transfer and kernel execution
  Multiple GPU devices and multiple streams per device
Different acceleration kernels used in parallel application
  Multiple application sections
  Multiple application threads/processes
See performance in context:
  temporal, spatial, (host) thread/process
TAU and TAUcuda
TAU performance system
  Robust, scalable integrated performance framework and toolkit
  Parallel profiling and tracing
  Shared and distributed parallel systems
  Open source and portable
TAUcuda
  Extension to support CUDA performance measurement
  Goal is to leverage TAU's infrastructure and analysis capabilities in TAUcuda development
  Deliver heterogeneous parallel performance support
TAU Architecture
TAUcuda Performance Measurement (Version 1)
Build on CUDA event interface
  Allow “events” to be placed in streams and processed
    events are timestamped by CUDA driver
  CUDA driver reports GPU timing in event structure
  Events are reported back to CPU when requested
    use begin and end events to calculate intervals
CUDA kernel invocations are asynchronous
  CPU does not see actual CUDA “end” event
Want to associate TAU event context with CUDA events
  Get top of TAU event stack at begin (TAU context)
S. Mayanglambam, A. Malony, M. Sottile, "Performance Measurement of Applications with GPU Acceleration using CUDA," ParCo 2009, Lyon, France, September 2009.
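The Version 1 scheme can be illustrated with a toy model in Python. It does not call CUDA: `ToyStream`, `record_event`, `drain`, and `measure` are hypothetical names, used only to show how begin and end events placed in a stream yield an interval once the device has processed them, and how the TAU context sampled at the begin event travels with the result.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyEvent:
    label: str
    timestamp: Optional[float] = None  # filled in by the device, not the host


@dataclass
class ToyStream:
    """Hypothetical in-order queue standing in for a CUDA stream."""
    pending: List[ToyEvent] = field(default_factory=list)

    def record_event(self, label: str) -> ToyEvent:
        # The host only enqueues the event; it carries no timestamp yet.
        ev = ToyEvent(label)
        self.pending.append(ev)
        return ev

    def drain(self, device_clock: List[float]) -> None:
        # Stand-in for the device working through the stream in order and
        # stamping each event as it is reached.
        for ev, t in zip(self.pending, device_clock):
            ev.timestamp = t
        self.pending.clear()


def measure(stream, tau_stack, kernel_name):
    """Bracket one asynchronous kernel launch with begin/end events."""
    context = tau_stack[-1]  # top of the TAU event stack at begin
    begin = stream.record_event(kernel_name + ":begin")
    end = stream.record_event(kernel_name + ":end")
    return context, begin, end


def interval(begin: ToyEvent, end: ToyEvent) -> float:
    # Only meaningful once both events have been reported back to the host.
    if begin.timestamp is None or end.timestamp is None:
        raise RuntimeError("events not yet processed by the device")
    return end.timestamp - begin.timestamp


stream = ToyStream()
ctx, b, e = measure(stream, ["main", "solve"], "kernelA")
stream.drain([10.0, 13.5])  # device-side timestamps arrive later
elapsed = interval(b, e)
```

In this sketch the host returns from `measure` before any timestamp exists, which mirrors why the CPU cannot observe the actual end of the kernel and must query the events afterwards.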
could not see memory transfer or CUDA system execution
CUDA system architecture
  Implemented by CUDA libraries
    driver and device (cuXXX) libraries
    runtime (cudaYYY) library
Tools support (Parallel Nsight (Nexus), CUDA Profiler)
  not intended to integrate with other HPC performance tools
TAUcuda (v2) built on experimental Linux CUDA driver
  Linux CUDA driver R190.86 supports a callback interface!!!
TAUcuda Architecture
TAU events
TAUcuda events
TAU and TAUcuda Performance Events
TAU measures events during execution
  Events are made visible as a result of code instrumentation
  Records event begin and end for profiling and tracing
  TAU events are measured by the CPU when they happen
TAU cannot measure events on the GPU
  TAUcuda events are measured by CUDA and the GPU device
  TAUcuda events occur asynchronously to TAU events
TAUcuda is integrated with TAU measurement infrastructure
  Must transform TAUcuda events into TAU events
  Associate TAUcuda events with application CPU operation
    samples the TAU context to link to application call site
TAUcuda Instrumentation
Normal application software composition
  No performance measurement enabled
TAUcuda Instrumentation
Includes only CPU-level instrumentation (TAU events)
TAU events
TAUcuda Instrumentation
TAUcuda events TAU events
CUDA Linux Driver Library Tools API
Experimental CUDA driver library provides callback support
Exposes all driver routines through callback interface
  subscribe to events via cuToolsApi_ETI_Core interface table
Exposes functions to retrieve GPU performance information
TAUcuda intercepts only events of interest in callback handler API routines
CUDA Kernel Launch and Memory Transfer
cuToolsApi_CBID_EnterGeneric callback occurs for cuXXX() routines that invoke GPU kernel launch and memory transfer
CUDA system manages these operations and makes measurements in association with the GPU device
  Keeps information in an internal buffer
How to associate "enter" with asynchronous future "exit"?
TAUcuda Event Handler creates a call record:
  event name, call ID, operation type, API routine name, TAU context, CUDA context, GPU device, GPU stream
TAUcuda Event Handler calls into the TAU system to retrieve current TAU event stack (TAU context) during EnterGeneric
Profile callbacks will return performance data at later time
TAUcuda then generates TAU events (profile or trace)
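The association of an "enter" with its later performance data can be sketched in Python. This is a minimal model under stated assumptions, not the TAUcuda implementation: `ToyEventHandler`, `on_enter`, and `on_profile` are hypothetical names, the record fields follow the call record listed above, and the routine names passed in are only illustrative strings. The sketch also accumulates a flat profile at the time the timing data arrives.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CallRecord:
    # Fields follow the call record listed on the slide.
    event_name: str
    call_id: int
    operation_type: str  # e.g. "kernel" or "memcpy" (illustrative values)
    api_routine: str
    tau_context: str
    cuda_context: int
    gpu_device: int
    gpu_stream: int


class ToyEventHandler:
    """Hypothetical handler; not the TAUcuda implementation."""

    def __init__(self, tau_stack):
        self.tau_stack = tau_stack
        self.open_calls: Dict[int, CallRecord] = {}
        # Flat profile keyed by (device, stream, event name, TAU context);
        # each value is [number of calls, total GPU time].
        self.profile: Dict[Tuple, list] = defaultdict(lambda: [0, 0.0])

    def on_enter(self, call_id, event_name, operation_type, api_routine,
                 cuda_context, gpu_device, gpu_stream):
        # Sample the TAU context now, while the host is still at the call site.
        rec = CallRecord(event_name, call_id, operation_type, api_routine,
                         self.tau_stack[-1], cuda_context, gpu_device,
                         gpu_stream)
        self.open_calls[call_id] = rec

    def on_profile(self, call_id, gpu_start, gpu_stop):
        # Timing arrives later; the call ID links it back to the "enter".
        rec = self.open_calls.pop(call_id)
        key = (rec.gpu_device, rec.gpu_stream, rec.event_name, rec.tau_context)
        entry = self.profile[key]
        entry[0] += 1
        entry[1] += gpu_stop - gpu_start
        return rec


tau_stack = ["main", "step"]
h = ToyEventHandler(tau_stack)
h.on_enter(1, "kernelA", "kernel", "cuLaunchGridAsync", 0, 0, 0)
tau_stack.append("exchange")  # host moves on before the GPU finishes
h.on_enter(2, "memcpy HtoD", "memcpy", "cuMemcpyHtoDAsync", 0, 0, 0)
h.on_profile(1, 100.0, 140.0)
h.on_profile(2, 140.0, 150.0)
```

Because the context is captured at enter time, the kernel's GPU time is attributed to "step" even though the host had already moved on to "exchange" when the data arrived.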
CUDA Runtime Library Instrumentation
NVIDIA does not implement callbacks for runtime library
  Only provides header files (no source) for the runtime library
Instrument with TAU's library wrapping tool, tau_wrap
  Parses header files
  Automatically generates a new library (Magic!)
  Redefines the library routines of interest
  Wrapped routines are instrumented with TAU entry/exit
  Original routines called with the appropriate arguments
CUDA runtime library performance measured by TAU
  TAU enter and exit events for all cudaYYY()
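The wrapping idea can be shown with a small Python analogy. tau_wrap itself works from C header files and generates a wrapper library; the sketch below does neither and only mirrors the pattern of redefining selected routines, recording entry and exit, and forwarding the original arguments. `wrap_library`, `fakeMalloc`, and `fakeFree` are made-up names for illustration.

```python
import functools
import types


def wrap_library(library, routine_names, log):
    """Return a new namespace whose selected routines are instrumented."""
    wrapped = types.SimpleNamespace(**vars(library))
    for name in routine_names:
        original = getattr(library, name)

        def make_wrapper(original=original, name=name):
            @functools.wraps(original)
            def wrapper(*args, **kwargs):
                log.append(("enter", name))
                try:
                    # Call the original routine with the caller's arguments.
                    return original(*args, **kwargs)
                finally:
                    log.append(("exit", name))
            return wrapper

        setattr(wrapped, name, make_wrapper())
    return wrapped


# Stand-in "runtime library" with two routines (names are made up).
fake_runtime = types.SimpleNamespace(
    fakeMalloc=lambda nbytes: bytearray(nbytes),
    fakeFree=lambda buf: None,
)

events = []
instrumented = wrap_library(fake_runtime, ["fakeMalloc"], events)
buf = instrumented.fakeMalloc(16)
instrumented.fakeFree(buf)  # not in the list of interest, so not logged
```

Only the routines of interest are redefined; everything else passes through untouched, which keeps the application code unchanged.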
TAUcuda Profiling and Tracing
Keep a profile or trace for every GPU device stream
Profiling
  Calculate flat profile for each kernel and memory transfer
  Done at time of Profile callback
Tracing
  Must use TAU clock for timestamp
  Kernel and memory timestamp reported with GPU clock
  Must synchronize CPU and GPU clocks
  Save a TAUcuda trace for every GPU device stream
    cannot insert into TAU's runtime trace buffer (Why?)
    Kernel / memory transfer start/stop are asynchronous
  Offline trace merging, clock correction, and translation
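One way to picture the offline step is the Python sketch below. It assumes, purely for illustration, that clock correction is a linear map built from two (CPU time, GPU time) synchronization points; the slides do not say which correction method TAUcuda uses. `make_gpu_to_cpu` and `merge_traces` are hypothetical names, and the trace tuples are made-up data.

```python
import heapq


def make_gpu_to_cpu(sync_a, sync_b):
    """Build a linear map from GPU time to CPU time.

    sync_a and sync_b are (cpu_time, gpu_time) pairs sampled at two
    synchronization points, for example at the start and end of a run.
    """
    (cpu0, gpu0), (cpu1, gpu1) = sync_a, sync_b
    if gpu1 == gpu0:
        raise ValueError("synchronization points must differ in GPU time")
    rate = (cpu1 - cpu0) / (gpu1 - gpu0)  # accounts for clock rate/drift
    return lambda gpu_t: cpu0 + (gpu_t - gpu0) * rate


def merge_traces(cpu_trace, gpu_stream_traces, gpu_to_cpu):
    """Merge a CPU trace with per-stream GPU traces, all in CPU time.

    Every trace is a list of (timestamp, source, event) sorted by timestamp.
    """
    corrected = [
        [(gpu_to_cpu(t), src, ev) for (t, src, ev) in trace]
        for trace in gpu_stream_traces
    ]
    return list(heapq.merge(cpu_trace, *corrected))


# Assumed sync points: the GPU clock ticks at half the CPU clock's rate here.
to_cpu = make_gpu_to_cpu((1000.0, 0.0), (2000.0, 500.0))
cpu = [(1000.0, "cpu", "launch enter"), (1010.0, "cpu", "launch exit")]
gpu0 = [(20.0, "gpu0/s0", "kernel start"), (60.0, "gpu0/s0", "kernel stop")]
merged = merge_traces(cpu, [gpu0], to_cpu)
```

Keeping one trace per stream and merging afterwards avoids inserting late-arriving GPU records into a time-ordered CPU buffer, which is the difficulty the slide points to.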
Running with TAU / TAUcuda
To run a CUDA application with TAUcuda, all of the necessary libraries must be dynamically linked
TAUcuda works with unmodified CUDA application binaries
Use scripts for different scenarios:
TAUcuda produces profiles or traces in the current working directory in sub-folders to distinguish them from TAU performance output
TAUcuda profiles are in different metric sub-folders:
CUDA Linpack Trace
MPI communication (yellow)
CUDA memory transfer (white)
NAMD and TAU / TAUcuda
Demonstrate TAUcuda with scientific application
NAMD is a molecular dynamics application
  Written using Charm++ parallel object-oriented language
  Charm++ and NAMD run on large-scale HPC clusters
  NAMD has been accelerated with CUDA
TAU integrated in Charm++ (ICPP 2009 paper)
Now apply TAUcuda to observe influence of GPU execution
  Observe the effect of CUDA acceleration
  Show scaling results for GPU cluster execution
NAMD Profile (4 processes, 4 GPUs)
NAMD GPU Scaling (4–64 GPUs)
Strong scaling experiments on Eureka cluster
Use TAU PerfExplorer to compare
SHOC Stencil2D (512 iterations, 4 CPUxGPU)
Scalable HeterOgeneous Computing (SHOC) benchmark suite
  CUDA / OpenCL kernels and microbenchmarks (ORNL)
CUDA memory transfer (white)
HMPP SGEMM (CAPS Entreprise)
Figure labels (shown twice): Host Process, Transfer Kernel, Compute Kernel
Conclusions
Heterogeneous parallel systems will require parallel performance tools that integrate performance perspectives
Need to rely on hardware and software support in heterogeneous components to access performance
Experimental Linux CUDA driver provided by NVIDIA facilitates access to CUDA / GPU performance information
  TAUcuda merges with TAU (CPU) performance data
TAU/TAUcuda provides powerful scalable heterogeneous performance measurement and analysis
NVIDIA is incorporating performance tools requirements in next-generation driver/device libraries
TAUopencl is in development (working prototype)
Support Acknowledgements
Department of Energy (DOE) Office of Science
ASC/NNSA
Department of Defense (DoD) HPC Modernization Office (HPCMO)
NSF Software Development for Cyberinfrastructure (SDCI)
Research Centre Juelich
Argonne National Laboratory
Technical University Dresden
ParaTools, Inc.
NVIDIA