NIH BTRC for Macromolecular Modeling and Bioinformatics
http://www.ks.uiuc.edu/
Beckman Institute, U. Illinois at Urbana-Champaign
Scaling in a Heterogeneous
Environment with GPUs:
GPU Architecture, Concepts, and Strategies
John E. Stone
Theoretical and Computational Biophysics Group
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign
http://www.ks.uiuc.edu/~johns/
Scaling to Petascale Institute,
National Center for Supercomputing Applications,
University of Illinois at Urbana-Champaign
Agenda: Scaling in a Heterogeneous
Environment With GPUs
• GPU architecture, concepts, and strategies
• OpenACC
• OpenACC Hands-On Lab
• CUDA Programming 1
• CUDA Hands-On Lab
• CUDA Programming 2
• GPU Optimization and Scaling with Profiling and
Debugging
• Open Hands-on Lab
Administrativa: QwikLab Accounts
• Participants who have not created & verified their QwikLab
accounts should do so ASAP to be ready for today’s hands-on:
https://nvlabs.qwiklab.com/
• If you are a “walk-in” participant, your site handler will need to
request access by email to Justin Luitjens, including which site
you’re located at, and your email address.
• If you still don’t have access or we reach max capacity, buddy up
with another participant until you get access.
• You should have an access code to run the QwikLab courses:
– “Accelerating Applications with GPU-Accelerated Libraries in C/C++”
– “OpenACC – 2X in 4 steps”
– “Accelerating Applications with CUDA C/C++”
GPU Computing
• GPUs evolved from graphics toward general-purpose
data-parallel workloads
• GPUs are commodity devices, omnipresent in modern computers (~million sold per week)
• Massively parallel hardware, well suited to throughput-oriented workloads, streaming data far too large for CPU caches
• Programming tools allow software to be written in various dialects of familiar C/C++/Fortran and integrated into legacy software
• GPU algorithms are often multicore-friendly due to attention paid to data locality and data-parallel work decomposition
What Makes GPUs Compelling?
• Massively parallel hardware architecture:
– Tens of wide SIMD-oriented stream processing compute
units (“SMs” in NVIDIA nomenclature)
– Tens of thousands of threads running on thousands of
ALUs and special function units
– Large register files, fast on-chip and die-stacked memory
OpenACC Directives: Open, Simple, Portable
• Open Standard
• Easy, Compiler-Driven Approach
main() {
…
<serial code>
…
#pragma acc kernels
{
<compute intensive code>
}
…
}
(The “#pragma acc kernels” directive is the compiler hint.)
Case study: CAM-SE Climate, 6x faster on GPU; top kernel is 50% of runtime. Courtesy NVIDIA.
Directive-Based Parallel Programming with OpenACC
• Annotate loop nests in existing code with #pragma compiler directives:
– Annotate opportunities for parallelism
– Annotate points where host-GPU memory transfers
are best performed, indicate propagation of data
• Evolve original code structure to improve
efficacy of parallelization
– Eliminate false dependencies between loop iterations
– Revise algorithms or constructs that create excess data
movement
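To make the annotation steps above concrete, here is a minimal sketch using a simple SAXPY loop (an illustrative example, not code from the slides). The data clauses shown are one reasonable way to annotate the host-GPU transfers; a non-OpenACC compiler simply ignores the pragma and runs the loop serially:

```c
/* Minimal OpenACC annotation sketch (illustrative SAXPY loop).
 * With an OpenACC compiler (e.g. nvc -acc) the pragma offloads the loop
 * to the GPU; the copyin/copy clauses mark where host-GPU transfers are
 * performed. A plain C compiler ignores the unknown pragma and runs the
 * loop on the CPU, so the code remains portable. */
void saxpy(int n, float a, const float *x, float *y)
{
    /* iterations are independent: no false dependencies to eliminate */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Because the loop body has no cross-iteration dependencies, the compiler can parallelize it directly; loops that scatter into shared locations would first need the restructuring described above.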
Process for Writing CUDA Kernels
• Data-parallel loop nests are unrolled into a
large batch of independent work items
that can execute concurrently
• Work items are mapped onto GPU
hardware threads using multidimensional
grids and blocks of threads that execute on
stream processing units (SMs)
• Programmer manages data placement in
GPU memory systems, access patterns, and
data dependencies
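A host-side C analogue may help visualize this mapping (a sketch with assumed names, not actual CUDA): the emulated grid and block loops below stand in for the (blockIdx, threadIdx) pairs that GPU hardware would run concurrently, and the bounds check handles the ragged final block.

```c
/* Host-side C analogue of the CUDA decomposition described above: a
 * data-parallel loop over n elements is unrolled into a grid of thread
 * blocks, and each (block, thread) pair computes one global work-item
 * index. In real CUDA the two loops in launch_scale disappear -- the
 * hardware runs every (blockIdx, threadIdx) pair concurrently on the SMs. */
#define BLOCKSZ 256  /* illustrative thread-block size */

static void scale_kernel(int blockIdx, int threadIdx,
                         int n, float a, float *data)
{
    int i = blockIdx * BLOCKSZ + threadIdx;  /* global work-item index */
    if (i < n)                               /* guard the ragged last block */
        data[i] *= a;
}

void launch_scale(int n, float a, float *data)
{
    int gridsz = (n + BLOCKSZ - 1) / BLOCKSZ;  /* round up: enough blocks */
    for (int b = 0; b < gridsz; b++)           /* emulated grid */
        for (int t = 0; t < BLOCKSZ; t++)      /* emulated thread block */
            scale_kernel(b, t, n, a, data);
}
```

The rounded-up grid size and the `i < n` guard are the standard way to cover a domain whose size is not a multiple of the block size.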
CUDA Grid, Block, Thread Decomposition
[Figure: a 1-D, 2-D, or 3-D grid of thread blocks, each a 1-D, 2-D, or 3-D block of threads, mapped onto a 1-D, 2-D, or 3-D computational domain. Padding arrays can optimize global memory performance.]
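The array-padding note can be sketched as follows (an illustrative pitch calculation; the 64-element alignment is an assumption, not a fixed CUDA rule). Each row of a 2-D domain is rounded up to an aligned length so that rows start on aligned boundaries and warps of threads access global memory efficiently:

```c
#include <stddef.h>

/* Illustrative row padding ("pitch") for a 2-D array: each row is
 * rounded up to a multiple of ALIGN elements so every row starts on an
 * aligned boundary, favoring coalesced global memory accesses.
 * ALIGN = 64 is an assumed value for illustration only. */
#define ALIGN 64

size_t padded_pitch(size_t width)
{
    return (width + ALIGN - 1) / ALIGN * ALIGN;  /* round up to ALIGN */
}

/* linear index of element (x, y) in a row-padded 2-D array */
size_t padded_index(size_t x, size_t y, size_t pitch)
{
    return y * pitch + x;  /* pitch >= width; padding slots go unused */
}
```

For example, a 1000-element-wide row would be padded to a pitch of 1024 elements; the 24 padding slots per row are wasted storage traded for regular, aligned access patterns.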
Overview of Throughput-Oriented
GPU Hardware Architecture
• GPUs have small on-chip caches
• Main memory latency (several hundred clock cycles!) is
tolerated through hardware multithreading – overlap
memory transfer latency with execution of other work
• When a GPU thread stalls on a memory operation, the
hardware immediately switches context to a ready thread
• Effective latency hiding requires saturating the GPU with
lots of work – tens of thousands of independent work
items
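Little's law gives a back-of-envelope feel for where "tens of thousands" comes from (all figures below are assumed for illustration, not taken from the slides): concurrency needed = latency x throughput.

```c
/* Little's law sketch: in-flight work needed = latency x issue rate.
 * Illustrative numbers only: ~400-cycle memory latency, 80 SMs, each SM
 * sustaining one outstanding memory request per cycle. */
long requests_to_hide_latency(int latency_cycles, int num_sms,
                              int requests_per_sm_per_cycle)
{
    return (long)latency_cycles * num_sms * requests_per_sm_per_cycle;
}
/* 400 cycles x 80 SMs x 1 request/cycle = 32000 independent requests in
 * flight, i.e. tens of thousands of work items, matching the rule of
 * thumb above. */
```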
Avoid Output Conflicts,
Conversion of Scatter to Gather
• Many CPU codes contain algorithms that “scatter” outputs to memory, to reduce arithmetic
• Scattered output can create bottlenecks for GPU performance due to write conflicts among hundreds or thousands of threads
• On the GPU, it is often better to:
– do more arithmetic, in exchange for regularized output memory write patterns
– convert “scatter” algorithms to “gather” approaches
– use data “privatization” to reduce the scope of potentially conflicting outputs, and to leverage special on-chip memory systems and data reduction instructions
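The scatter-to-gather conversion can be sketched with a simple histogram (an illustrative example, not from the slides). The scatter form has many work items racing to update shared bins; the gather form assigns one work item per output bin, trading extra arithmetic for conflict-free, regular writes:

```c
#include <stddef.h>

/* Scatter form: one work item per INPUT element. Run in parallel, many
 * threads collide on the same out[] bin, forcing atomics or serialization. */
void histogram_scatter(const int *bin_of, size_t n, int *out)
{
    for (size_t i = 0; i < n; i++)
        out[bin_of[i]]++;               /* conflicting writes */
}

/* Gather form: one work item per OUTPUT bin. Each bin rescans the input
 * (more arithmetic) but owns a private accumulator, so every output
 * location is written by exactly one work item. */
void histogram_gather(const int *bin_of, size_t n, int *out, int nbins)
{
    for (int b = 0; b < nbins; b++) {   /* parallelize over outputs */
        int count = 0;                  /* privatized accumulator */
        for (size_t i = 0; i < n; i++)
            count += (bin_of[i] == b);  /* extra work, zero conflicts */
        out[b] = count;
    }
}
```

The O(nbins x n) rescan looks wasteful on a CPU, but on a throughput-oriented GPU the extra arithmetic is often cheaper than serialized atomic updates to shared bins.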
GPU Technology Conference Presentations:
See the latest announcements about GPU
hardware, libraries, and programming tools
• http://www.gputechconf.com/
• http://www.gputechconf.com/attend/sessions
Bonus Material
If Time Allows
Peak Arithmetic Performance Trend
Peak Memory Bandwidth Trend
Multi-GPU NUMA Architectures:
Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations
Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten.
Journal of Parallel Computing, 40:86-99, 2014.
http://dx.doi.org/10.1016/j.parco.2014.03.009
• Example of a “balanced”
PCIe topology
• NUMA: Host threads should
be pinned to the CPU that is
“closest” to their target GPU
• GPUs on the same PCIe I/O
Hub (IOH) can use CUDA
peer-to-peer transfer APIs
• Intel: GPUs on different
IOHs can’t use peer-to-peer
GPU PCI-Express DMA
Hallock et al., Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
Multi-GPU NUMA Architectures:
Hallock et al., Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
• Direct GPU-to-GPU peer
DMA operations are more
performant than other
approaches, particularly for
moderate sized transfers
• They perform even better
with NVLink peer-to-peer
GPU interconnections
IBM S822LC w/ NVLink 1.0, “Minsky”
Overlapping CPU Work with GPU Work
• Host CPU thread
launches GPU action,
e.g. a “kernel”, DMA
memory copy, etc. on
the GPU
• GPU action runs to
completion
• Host synchronizes with
completed GPU action
[Timeline: CPU code runs; CPU waits for the GPU (ideally doing something productive) while the GPU action executes; CPU code resumes.]
Single CUDA Execution “Stream”
• Host CPU thread
launches a CUDA
“kernel”, a memory
copy, etc. on the GPU
• GPU action runs to
completion
• Host synchronizes
with completed GPU
action
[Timeline: CPU code runs; CPU waits for the GPU (ideally doing something productive) while the GPU action executes; CPU code resumes.]
Multiple CUDA Streams:
Overlapping Compute and DMA Operations
Hallock et al., Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
Using the CPU to Optimize GPU Performance
• GPU performs best when the work evenly divides
into the number of threads/processing units
• Optimization strategy:
– Use the CPU to “regularize” the GPU workload
– Use fixed size bin data structures, with “empty” slots
skipped or producing zeroed out results
– Handle exceptional or irregular work units on the CPU;
GPU processes the bulk of the work concurrently
– On average, the GPU is kept highly occupied, attaining
a high fraction of peak performance
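A sketch of the fixed-size-bin strategy (illustrative names and capacities; the sentinel value and bin size are assumptions): the CPU packs items into uniform bins padded with a sentinel so that a GPU kernel would see perfectly regular work, while items that do not fit fall back to a CPU-side exception list.

```c
#include <stddef.h>

/* CPU-side "regularization" sketch. Items are packed into fixed-size
 * bins padded with a sentinel (EMPTY_SLOT); a (hypothetical) GPU kernel
 * then processes every bin identically, skipping sentinel slots. Bins
 * that overflow spill into cpu_overflow[], handled on the CPU. */
#define BIN_CAP 8        /* assumed fixed bin capacity */
#define EMPTY_SLOT (-1)  /* sentinel: skipped by the GPU kernel */

typedef struct {
    int slots[BIN_CAP];
} Bin;

/* Returns the number of items that overflowed to the CPU list.
 * Assumes nbins <= 1024 for the stack-allocated fill counters. */
size_t fill_bins(const int *items, const int *bin_of, size_t n,
                 Bin *bins, size_t nbins, int *cpu_overflow)
{
    size_t fill[1024] = {0};
    size_t noverflow = 0;
    for (size_t b = 0; b < nbins; b++)
        for (int s = 0; s < BIN_CAP; s++)
            bins[b].slots[s] = EMPTY_SLOT;        /* pad every bin */
    for (size_t i = 0; i < n; i++) {
        size_t b = (size_t)bin_of[i];
        if (fill[b] < BIN_CAP)
            bins[b].slots[fill[b]++] = items[i];  /* regular GPU work */
        else
            cpu_overflow[noverflow++] = items[i]; /* irregular: CPU handles */
    }
    return noverflow;
}
```

Wasting a few sentinel slots is the price paid for a uniform workload: every GPU thread block does the same amount of (bounded) work, keeping average occupancy high.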
Time-Averaged Electrostatics Analysis on
NCSA Blue Waters
Preliminary performance for VMD time-averaged electrostatics w/ Multilevel Summation Method on the NCSA Blue Waters Early Science System
NCSA Blue Waters node type                      Seconds per trajectory frame (one compute node)
Cray XE6 compute node:
  32 CPU cores (2x AMD 6200 CPUs)                9.33
Cray XK6 GPU-accelerated compute node:
  16 CPU cores + NVIDIA X2090 (Fermi) GPU        2.25

Speedup, GPU XK6 nodes vs. CPU XE6 nodes: XK6 nodes are 4.15x faster overall.
Tests on XK7 nodes indicate MSM is CPU-bound with the Kepler K20X GPU:
performance is not much faster (yet) than the Fermi X2090. Spatial hashing,
prolongation, and interpolation still need to move onto the GPU (in
progress). XK7 nodes are 4.3x faster overall.
Multilevel Summation on the GPU
Computational steps            CPU (s)   w/ GPU (s)   Speedup
Short-range cutoff              480.07      14.87       32.3
Long-range: anterpolation         0.18       0.18
            restriction           0.16       0.16
            lattice cutoff       49.47       1.36       36.4
            prolongation          0.17       0.17
            interpolation         3.47       3.47
Total                           533.52      20.21       26.4
(Anterpolation, restriction, prolongation, and interpolation run on the CPU in both cases.)
Performance profile for 0.5 Å map of potential for 1.5 M atoms.
Hardware platform is Intel QX6700 CPU and NVIDIA GTX 280.
Accelerate short-range cutoff and lattice cutoff parts
Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35:164-177, 2009.
Avoiding Shared Memory Bank Conflicts: Array of Structures (AOS) vs.