An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA. Allen D. Malony (1), Scott Biersdorff (2), Wyatt Spear (2), Shangkar Mayanglambam (3). (1) Department of Computer and Information Science, University of Oregon; (2) Performance Research Laboratory, University of Oregon; (3) Qualcomm Corporation.

Jan 16, 2016


An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA

Allen D. Malony (1), Scott Biersdorff (2), Wyatt Spear (2), Shangkar Mayanglambam (3)

(1) Department of Computer and Information Science, University of Oregon
(2) Performance Research Laboratory, University of Oregon
(3) Qualcomm Corporation


Measurement of Heterogeneous Applications using CUDA (ICS 2010)

Motivation

Heterogeneous parallel systems are highly relevant today
Heterogeneous hardware technology more accessible
  Multicore processors (e.g., 4-core, 6-core, 8-core, ...)
  Manycore (throughput) accelerators (e.g., Tesla, Fermi)
  High-performance engines (e.g., Cell BE, Larrabee)
  Special purpose components (e.g., FPGAs)
Performance is the main driving concern
  Heterogeneity is an important (the?) path to extreme scale
Heterogeneous software technology required for performance
  More sophisticated parallel programming environments
  Integrated parallel performance tools
    support heterogeneous performance model and perspectives


Implications for Parallel Performance Tools

Current status quo is somewhat comfortable
  Mostly homogeneous parallel systems and software
    Shared-memory multithreading – OpenMP
    Distributed-memory message passing – MPI
  Parallel computational models are relatively stable (simple)
  Corresponding performance models are relatively tractable
  Parallel performance tools can keep up and evolve
Heterogeneity creates richer computational potential
  Results in greater performance diversity and complexity
Heterogeneous systems will utilize more sophisticated programming and runtime environments
Performance tools have to support richer computation models and more versatile performance perspectives


Heterogeneous Performance Views

Want to create performance views that capture heterogeneous concurrency and execution behavior
  Reflect interactions between heterogeneous components
  Capture performance semantics relative to computation model
  Assimilate performance for all execution paths for shared view
Existing parallel performance tools are CPU(host)-centric
  Event-based sampling (not appropriate for accelerators)
  Direct measurement (through instrumentation of events)
What perspective does the host have of other components?
  Determines the semantics of the measurement data
  Determines assumptions about behavior and interactions
Performance views may have to work with reduced data


Task-based Performance View

Consider the “task” abstraction for GPU accelerator scenario
Host regards external execution as a task
  Tasks operate concurrently with respect to the host
  Requires support for tracking asynchronous execution
Host creates measurement perspective for external task
  Maintains local and remote performance data
Tasks may have limited measurement support
  May depend on host for performance data I/O
  Performance data might be received from external task
How to create a view of heterogeneous external performance?


CUDA Performance Perspective

CUDA enables programming of kernels for GPU acceleration
  GPU acceleration acts as an external task
Performance measurement appears straightforward
Execution model complicates performance measurement
  Synchronous and asynchronous operation with respect to host
  Overlapping of data transfer and kernel execution
  Multiple GPU devices and multiple streams per device
Different acceleration kernels used in parallel application
  Multiple application sections
  Multiple application threads/processes
See performance in context: temporal, spatial, (host) thread/process


TAU and TAUcuda

TAU performance system
  Robust, scalable integrated performance framework and toolkit
  Parallel profiling and tracing
  Shared and distributed parallel systems
  Open source and portable
TAUcuda
  Extension to support CUDA performance measurement
  Goal is to leverage TAU's infrastructure and analysis capabilities in TAUcuda development
  Deliver heterogeneous parallel performance support

[Figure: TAU Architecture]


TAUcuda Performance Measurement (Version 1)

Build on CUDA event interface
  Allow “events” to be placed in streams and processed
    events are timestamped by CUDA driver
  CUDA driver reports GPU timing in event structure
  Events are reported back to CPU when requested
    use begin and end events to calculate intervals
CUDA kernel invocations are asynchronous
  CPU does not see actual CUDA “end” event
Want to associate TAU event context with CUDA events
  Get top of TAU event stack at begin (TAU context)
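
The begin/end event idea above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: it does not call CUDA and is not TAUcuda code. `FakeStream`, `record_event`, `launch`, and `interval_ms` are hypothetical names standing in for a stream, for placing a timestamped event in it, for an asynchronous kernel launch, and for the elapsed-time query between two events (in the CUDA runtime the analogous calls are `cudaEventRecord` and `cudaEventElapsedTime`).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Event:
    name: str
    timestamp_ms: Optional[float] = None  # set only when the stream processes the event


@dataclass
class FakeStream:
    """Hypothetical stand-in for a CUDA stream: items are processed in order."""
    clock_ms: float = 0.0
    queue: List[Union[Event, float]] = field(default_factory=list)

    def record_event(self, event: Event) -> None:
        self.queue.append(event)          # enqueue only; no timestamp yet

    def launch(self, duration_ms: float) -> None:
        self.queue.append(duration_ms)    # asynchronous: the host returns at once

    def synchronize(self) -> None:
        for item in self.queue:
            if isinstance(item, Event):
                item.timestamp_ms = self.clock_ms   # simulated driver timestamp
            else:
                self.clock_ms += item               # simulated kernel run time
        self.queue.clear()


def interval_ms(begin: Event, end: Event) -> float:
    if begin.timestamp_ms is None or end.timestamp_ms is None:
        raise RuntimeError("events not processed yet; synchronize first")
    return end.timestamp_ms - begin.timestamp_ms


tau_event_stack = ["main", "compute_step"]   # hypothetical TAU event stack
stream = FakeStream()
begin, end = Event("kernel_begin"), Event("kernel_end")

tau_context = tau_event_stack[-1]    # sample the top of the stack at "begin"
stream.record_event(begin)
stream.launch(2.5)                   # simulated kernel lasting 2.5 ms
stream.record_event(end)
pending = end.timestamp_ms is None   # the host has not seen the "end" yet
stream.synchronize()
elapsed = interval_ms(begin, end)
```

The point of the sketch is the ordering: the host samples its own context when it enqueues the begin event, but the interval only becomes available after the stream has processed both events.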

S. Mayanglambam, A. Malony, M. Sottile, "Performance Measurement of Applications with GPU Acceleration using CUDA," ParCo 2009, Lyon, France, September 2009.


TAUcuda Performance Measurement (Version 2)

Overcome TAUcuda (v1) deficiencies
  Required source code instrumentation
  Event-interface-only perspective
    could not see memory transfer or CUDA system execution
CUDA system architecture
  Implemented by CUDA libraries
    driver and device (cuXXX) libraries
    runtime (cudaYYY) library
  Tools support (Parallel Nsight (Nexus), CUDA Profiler)
    not intended to integrate with other HPC performance tools
TAUcuda (v2) built on experimental Linux CUDA driver
  Linux CUDA driver R190.86 supports a callback interface!


TAUcuda Architecture

[Figure: TAUcuda architecture diagram; labels: TAU events, TAUcuda events]


TAU and TAUcuda Performance Events

TAU measures events during execution
  Events are made visible as a result of code instrumentation
  Records event begin and end for profiling and tracing
  TAU events are measured by the CPU when they happen
TAU cannot measure events on the GPU
  TAUcuda events are measured by CUDA and the GPU device
  TAUcuda events occur asynchronously to TAU events
TAUcuda is integrated with TAU measurement infrastructure
  Must transform TAUcuda events into TAU events
  Associate TAUcuda events with application CPU operation
    samples the TAU context to link to application call site


TAUcuda Instrumentation

Normal application software composition
  No performance measurement enabled


TAUcuda Instrumentation

Includes only CPU-level instrumentation (TAU events)

[Figure label: TAU events]


TAUcuda Instrumentation

[Figure labels: TAUcuda events, TAU events]


CUDA Linux Driver Library Tools API

Experimental CUDA driver library provides callback support
  Exposes all driver routines through callback interface
    subscribe to events via cuToolsApi_ETI_Core interface table
  Exposes functions to retrieve GPU performance information
TAUcuda intercepts only events of interest in callback handler
  API routines
    cuToolsApi_CBID_EnterGeneric
    cuToolsApi_CBID_ExitGeneric
  Measurement (context synchronization, GPU buffer overflow)
    cuToolsApi_CBID_ProfileLaunch
    cuToolsApi_CBID_ProfileMemory
  Call TAU event creation / measurement routines (enter, exit)


CUDA Driver Library Routines Intercepted

Launch
  cuLaunch(); cuLaunchGrid(); cuLaunchGridAsync();
Memory transfer
  cuMemcpyHtoD(); cuMemcpyHtoDAsync();
  cuMemcpy2D(); cuMemcpy2DUnaligned(); cuMemcpy2DAsync();
  cuMemcpy3D(); cuMemcpy3DAsync();
  cuMemcpyAtoA(); cuMemcpyAtoD();
  cuMemcpyAtoH(); cuMemcpyAtoHAsync();
  cuMemcpyDtoA(); cuMemcpyDtoD();
  cuMemcpyDtoH(); cuMemcpyDtoHAsync();
  cuMemcpyHtoA(); cuMemcpyHtoAAsync();


CUDA Kernel Launch and Memory Transfer

cuToolsApi_CBID_EnterGeneric callback occurs for cuXXX() routines that invoke GPU kernel launch and memory transfer
CUDA system manages these operations and makes measurements in association with the GPU device
  Keeps information in an internal buffer
How to associate "enter" with asynchronous future "exit"?
TAUcuda Event Handler creates a call record:
  event name, call ID, operation type, API routine name, TAU context, CUDA context, GPU device, GPU stream
TAUcuda Event Handler calls into the TAU system to retrieve current TAU event stack (TAU context) during EnterGeneric
Profile callbacks will return performance data at a later time
  TAUcuda then generates TAU events (profile or trace)
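
One way to picture the call-record bookkeeping is a table of pending records keyed by call ID. The sketch below is a simulation with hypothetical names (`EventHandler`, `on_enter_generic`, `on_profile`); it does not use the experimental driver's interface, and the assumption that the later profile callback carries the same call ID is ours, made only so that the "enter" and the delayed measurement can be joined.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CallRecord:
    # Fields follow the call record listed on the slide.
    event_name: str
    call_id: int
    operation_type: str   # e.g. "launch" or "memcpy"
    api_routine: str
    tau_context: str
    cuda_context: int
    gpu_device: int
    gpu_stream: int


class EventHandler:
    """Hypothetical handler: pairs an 'enter' with a later profile callback."""

    def __init__(self, tau_event_stack: List[str]) -> None:
        self.tau_event_stack = tau_event_stack
        self.pending: Dict[int, CallRecord] = {}
        self.generated: List[Tuple[str, str, float]] = []
        self._next_id = 0

    def on_enter_generic(self, api_routine: str, operation_type: str,
                         event_name: str, cuda_context: int = 0,
                         gpu_device: int = 0, gpu_stream: int = 0) -> int:
        call_id = self._next_id
        self._next_id += 1
        # Sample the TAU context now; it may have changed by profile time.
        tau_context = self.tau_event_stack[-1]
        self.pending[call_id] = CallRecord(
            event_name, call_id, operation_type, api_routine,
            tau_context, cuda_context, gpu_device, gpu_stream)
        return call_id

    def on_profile(self, call_id: int, gpu_begin: float, gpu_end: float) -> None:
        record = self.pending.pop(call_id)
        # Emit a (context, event, duration) tuple in place of a real TAU event.
        self.generated.append(
            (record.tau_context, record.event_name, gpu_end - gpu_begin))


stack = ["main", "solve"]
handler = EventHandler(stack)
cid = handler.on_enter_generic("cuLaunchGrid", "launch", "my_kernel")
stack.pop()                            # host has already left "solve"
handler.on_profile(cid, 100.0, 103.5)  # measurement arrives later
```

Because the context is captured at enter time, the generated event is still attributed to `solve` even though the host had moved on before the measurement arrived.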


CUDA Runtime Library Instrumentation

NVIDIA does not implement callbacks for runtime library
  Only provides header files (no source) for the runtime library
Instrument with TAU's library wrapping tool, tau_wrap
  Parses header files
  Automatically generates a new library (Magic!)
    Redefines the library routines of interest
    Wrapped routines are instrumented with TAU entry/exit
    Original routines called with the appropriate arguments
CUDA runtime library performance measured by TAU
  TAU enter and exit events for all cudaYYY()
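
The wrapping pattern described above (redefine a routine, record entry and exit, forward the original arguments) can be sketched in a few lines. This is an analogy only: tau_wrap works on C header files and generates a wrapper library, whereas the sketch wraps Python callables. The `events` list, `wrap_routine`, `wrap_library`, and the `fake_malloc`/`fake_free` routines are all hypothetical stand-ins.

```python
import functools
import types

events = []   # stand-in for the measurement layer's enter/exit records


def wrap_routine(name, original):
    """Return a routine that records enter/exit around the original call."""
    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        events.append(("enter", name))
        try:
            return original(*args, **kwargs)   # forward the original arguments
        finally:
            events.append(("exit", name))      # recorded even if the call raises
    return wrapper


def wrap_library(library, routines_of_interest):
    """Build a new namespace in which only the selected routines are wrapped."""
    wrapped = types.SimpleNamespace(**vars(library))
    for name in routines_of_interest:
        setattr(wrapped, name, wrap_routine(name, getattr(library, name)))
    return wrapped


# Hypothetical "runtime library" with two routines.
def fake_malloc(nbytes):
    return bytearray(nbytes)


def fake_free(buf):
    return None


runtime = types.SimpleNamespace(fake_malloc=fake_malloc, fake_free=fake_free)
wrapped = wrap_library(runtime, ["fake_malloc"])

buf = wrapped.fake_malloc(16)   # instrumented
wrapped.fake_free(buf)          # not selected, so passed through untouched
```

The application calls the same routine names as before; only the selected ones produce enter and exit records.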


TAUcuda Profiling and Tracing

Keep a profile or trace for every GPU device stream
Profiling
  Calculate flat profile for each kernel and memory transfer
  Done at time of Profile callback
Tracing
  Must use TAU clock for timestamp
    Kernel and memory timestamp reported with GPU clock
    Must synchronize CPU and GPU clocks
  Save a TAUcuda trace for every GPU device stream
    cannot insert into TAU's runtime trace buffer (Why?)
    Kernel / memory transfer start/stop are asynchronous
  Offline trace merging, clock correction, and translation
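
The slide does not say how the clocks are synchronized, so the following is only an illustrative sketch, not the TAUcuda method. It assumes two synchronization points at which both the CPU and GPU clocks were read, applies a linear correction to the GPU timestamps, and then merges the corrected GPU trace with the CPU trace offline. The function names, the sample timestamps, and the event labels are hypothetical.

```python
import heapq


def make_gpu_to_cpu(sync_a, sync_b):
    """Build a linear map from GPU time to CPU time from two (cpu, gpu) pairs."""
    (cpu_a, gpu_a), (cpu_b, gpu_b) = sync_a, sync_b
    scale = (cpu_b - cpu_a) / (gpu_b - gpu_a)   # clock rate ratio (drift)
    return lambda gpu_t: cpu_a + (gpu_t - gpu_a) * scale


def merge_traces(cpu_trace, gpu_trace, gpu_to_cpu):
    """Correct GPU timestamps, then merge two time-sorted traces."""
    corrected = [(gpu_to_cpu(t), name) for t, name in gpu_trace]
    return list(heapq.merge(cpu_trace, corrected))


# Assumed sync points: CPU clock read 1000 when GPU read 0, 2000 when GPU read 500.
gpu_to_cpu = make_gpu_to_cpu((1000, 0), (2000, 500))

cpu_trace = [(1100, "launch enter"), (1110, "launch exit"), (1400, "sync enter")]
gpu_trace = [(100, "kernel begin"), (150, "kernel end")]   # GPU clock units

merged = merge_traces(cpu_trace, gpu_trace, gpu_to_cpu)
order = [name for _, name in merged]
```

Keeping a separate trace per stream and merging afterwards sidesteps the problem named on the slide: GPU start and stop times arrive asynchronously, after later CPU events may already have been written.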


Running with TAU / TAUcuda

To run a CUDA application with TAUcuda, all of the necessary libraries must be dynamically linked
  TAUcuda works with unmodified CUDA application binaries
Use scripts for different scenarios:
  taucuda profiler.sh / taucuda mpirun.sh (Profiling)
  taucuda tracer.sh / taucuda mpirun tracer.sh (Tracing)
TAUcuda produces profiles or traces in the current working directory, in sub-folders to distinguish them from TAU performance output
TAUcuda profiles are in different metric sub-folders:
  gpu_elapsed_time
  gpu_memory_transfer
  gpu_shared_memory


TAUcuda Experimentation Environments

University of Oregon
  Linux workstation
    Dual quad core Intel Xeon
    GTX 280
  GPU cluster (Mist)
    Four dual quad core Intel Xeon server nodes
    Two NVIDIA S1070 Tesla servers (4 Tesla GPUs per S1070)
Argonne National Laboratory (Eureka)
  100 dual quad core NVIDIA Quadro Plex S4
  200 Quadro FX5600 (2 per S4)
University of Illinois at Urbana-Champaign
  GPU cluster (AC cluster)
    32 nodes with one S1070 (4 GPUs per node)


CUDA SDK Transpose (256 x 4096 matrix)

[Figure: CPU profile and GPU profile; labels: cu events, cuda events, kernel]


CUDA SDK OceanFFT (profile, trace)

[Figure: profile and Jumpshot trace visualizer view; labels: CPU, GPU, kernels]


CUDA Linpack Profile (4 processes, 4 GPUs)

Measure performance of heterogeneous parallel applications
  GPU-accelerated Linpack benchmark (M. Fatica, NVIDIA)


CUDA Linpack Trace

[Figure: trace; legend: MPI communication (yellow), CUDA memory transfer (white)]


NAMD and TAU / TAUcuda

Demonstrate TAUcuda with scientific application
NAMD is a molecular dynamics application
  Written using Charm++ parallel object-oriented language
  Charm++ and NAMD run on large-scale HPC clusters
  NAMD has been accelerated with CUDA
TAU integrated in Charm++ (ICPP 2009 paper)
Now apply TAUcuda to observe influence of GPU execution
  Observe the effect of CUDA acceleration
  Show scaling results for GPU cluster execution


NAMD Profile (4 processes, 4 GPUs)


NAMD GPU Scaling (4–64 GPUs)

Strong scaling experiments on Eureka cluster
  Use TAU PerfExplorer to compare


SHOC Stencil2D (512 iterations, 4 CPUxGPU)

Scalable HeterOgeneous Computing benchmark suite
  CUDA / OpenCL kernels and microbenchmarks (ORNL)

[Figure: trace; legend: CUDA memory transfer (white)]


HMPP SGEMM (CAPS Entreprise)

[Figure: two trace panels, each with rows labeled Host Process, Transfer Kernel, Compute Kernel]


Conclusions

Heterogeneous parallel systems will require parallel performance tools that integrate performance perspectives
Need to rely on hardware and software support in heterogeneous components to access performance
Experimental Linux CUDA driver provided by NVIDIA facilitates access to CUDA / GPU performance information
  TAUcuda merges with TAU (CPU) performance data
TAU/TAUcuda provides powerful scalable heterogeneous performance measurement and analysis
NVIDIA is incorporating performance tools requirements in next-generation driver/device libraries
TAUopencl is in development (working prototype)


Support Acknowledgements

Department of Energy (DOE)
  Office of Science
  ASC/NNSA
Department of Defense (DoD)
  HPC Modernization Office (HPCMO)
NSF Software Development for Cyberinfrastructure (SDCI)
Research Centre Juelich
Argonne National Laboratory
Technical University Dresden
ParaTools, Inc.
NVIDIA