Characterization and exploitation of nested parallelism
and concurrent kernel execution to accelerate high
performance applications
A Dissertation Presented
by
Fanny Nina Paravecino
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Computer Engineering
Northeastern University
Boston, Massachusetts
March 2017
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Dissertation Signature Page
Dissertation Title: Characterization and exploitation of nested parallelism and
concurrent kernel execution to accelerate high performance applications
Author: Fanny Nina Paravecino NUID: 001160686
Department: Electrical and Computer Engineering
Approved for Dissertation Requirements of the Doctor of
Philosophy Degree
Dissertation Advisor
Dr. David Kaeli    Signature    Date
Dissertation Committee Member
Dr. Qianqian Fang    Signature    Date
Dissertation Committee Member
Dr. Ningfang Mi    Signature    Date
Dissertation Committee Member
Dr. Norm Rubin    Signature    Date
Department Chair
Dr. Miriam Leeser    Signature    Date
Associate Dean of Graduate School:
Dr. Sara Wadia-Fascetti    Signature    Date
To science and the pursuit of answers through research.
Contents
List of Figures vi
List of Tables viii
List of Programs x
List of Acronyms xi
Acknowledgments xiii
Abstract of the Dissertation xiv
1 Introduction 1
1.1 Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Advanced Parallel Features . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Characterization of Advanced Parallel Features . . . . . . . . . . . . . . . 4
1.3 Challenges in Exploiting Parallel Execution Features . . . . . . . . . . . . 5
1.3.1 Nested Parallelism Challenges . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Concurrent Kernel Execution Challenges . . . . . . . . . . . . . . . 7
1.3.3 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 12
2.1 CUDA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 GPU Computing Architecture . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Fermi Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Kepler Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Maxwell Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Pascal Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Related work 24
3.1 Characterization of GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Modern GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Multiple Levels of Concurrency . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . 27
4 Characterization of advanced parallel features 29
4.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Control Flow Instructions . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Parallel Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.3 Child Kernel Launching and Synchronization . . . . . . . . . . . . . 35
4.1.4 Memory Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Resource Contention . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Exploitation of advanced parallel features 44
5.1 Dependent Nested Loop Workloads . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 Selective Matrix Addition . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Parallel Recursive Workloads . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 Breadth-First Search Algorithm . . . . . . . . . . . . . . . . . . . . 47
5.2.2 Prim's Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Validation with real-world applications 61
6.1 Connected Component Labeling . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Level-Set Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.2 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2.3 Finalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3 Summary of Analysis for Real-world Applications . . . . . . . . . . . . . 67
7 Summary 70
7.1 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Bibliography 73
List of Figures
2.1 Layers of abstraction between software application and GPU hardware. . . . 13
2.2 The CUDA model: a kernel, grid, and threads per block. . . . 14
2.3 The CUDA memory hierarchy. . . . 15
2.4 Branch divergence in the GPU. . . . 17
2.5 Work flow of the Grid Management Unit to dispatch, pause, and hold pending and suspended grids. . . . 20
2.6 Dynamic Parallelism. . . . 21
2.7 Hyper-Q. . . . 22
4.1 Control flow graphs of Program 4.1 for the Kepler GTX Titan and Maxwell GTX Titan Ti. . . . 32
4.2 Execution time of sequential, non-nested parallelism, and nested parallelism kernels on the GTX Titan (Kepler architecture); lower is better. . . . 37
4.3 Execution time of non-nested parallelism and nested parallelism across four GPUs (2 Kepler and 2 Maxwell GPUs). . . . 38
4.4 Execution time of sequential execution of kernels versus concurrent kernel execution for two different GPUs while varying input size (lower is better). . . . 40
4.5 Execution time of sequential execution of kernels versus concurrent kernel execution for two different GPUs with persistent threads execution (lower is better). . . . 42
4.6 Resource utilization for non-persistent thread kernels using different input data sets for the Maxwell GTX Titan X (lower is better). . . . 43
4.7 Resource utilization for persistent thread kernels using different input data sets for the Maxwell GTX Titan X (lower is better). . . . 43
5.1 Speedup evaluation of the nested parallelism implementation compared to the non-nested parallelism implementation for Selective Matrix Add on the Kepler GTX Titan. . . . 47
5.2 Graph representation using an adjacency list. . . . 48
5.3 BFS operations while traversing a graph with six vertices, starting at source vertex 0. . . . 51
5.4 BFS speedup analysis of naive nested parallelism and optimized nested parallelism versus the non-nested parallelism implementation on the Kepler GTX Titan. . . . 55
5.5 MST of graph G = (V, E), where V = {0, 1, 2, 3, 4, 5}, starting at source vertex 0. . . . 55
5.6 Prim's algorithm step-by-step work flow. Given a graph G = (V, E) with an initial source vertex 0, find the Minimum Spanning Tree (MST) using Prim's algorithm, where iteration 0 is the initialization of the MST with source vertex 0. . . . 59
6.1 Speedup comparison of the nested parallelism and non-nested parallelism implementations, running CCL on a Kepler GTX Titan. . . . 63
6.2 Speedup comparison of the nested parallelism and non-nested parallelism implementations, running Level-Set segmentation on a Kepler GTX Titan. . . . 68
List of Tables
2.1 NVIDIA GPU technology evolution [25]. . . . 12
2.2 Fermi chip GF110 versus Kepler chip GK110 [41]. . . . 19
2.3 A comparison of the features available on the four generations of NVIDIA GPUs considered in this thesis [25]. . . . 23
4.1 Irregular applications from two different GPU benchmark suites which exhibit control-flow-dependent nested loops. . . . 30
4.2 Recursive applications which exhibit parallel recursion. . . . 30
5.1 Irregular and recursive applications with potential for exploiting advanced parallel features on modern GPUs. . . . 44
5.2 Dynamic metrics for non-nested parallelism Selective Matrix Addition for different input sets on the Kepler GTX Titan. . . . 45
5.3 Execution time of Selective Matrix Add with different input sets for non-nested parallelism and nested parallelism implementations on the Kepler GTX Titan. . . . 46
5.4 Runtime execution analysis of Breadth-First Search with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the naive nested parallelism implementation on the Kepler GTX Titan. . . . 54
5.5 Runtime execution analysis of Breadth-First Search with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the optimized nested parallelism implementation on the Kepler GTX Titan. . . . 54
5.6 Runtime execution analysis of Prim's algorithm with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the non-nested parallelism implementation on the Kepler GTX Titan. . . . 56
5.7 Runtime execution analysis of Prim's algorithm with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the optimized nested parallelism implementation on the Kepler GTX Titan. . . . 60
6.1 Dynamic metrics for non-nested parallelism of CCL for different input sets on a Kepler GTX Titan. . . . 62
6.2 Dynamic metrics for nested parallelism of CCL for different input sets on a Kepler GTX Titan. . . . 62
6.3 Execution time of selective Matrix Add with different input sets for non-nested parallelism and nested parallelism implementations on the Kepler GTX Titan. . . . 62
List of Programs
4.1 Micro-benchmark kernel with irregular nested loop execution. . . . 31
4.2 Fibonacci recursive scheme. . . . 34
4.3 Fibonacci parallel recursive scheme in CUDA. . . . 35
4.4 Micro-benchmark kernel with irregular nested loop execution. . . . 41
5.1 Graph input using the DIMACS Challenge structure for file storage. . . . 49
5.2 Breadth-First Search (BFS) recursive implementation on a CPU; graph is a global variable which contains the vertex array and edge array. . . . 50
5.3 BFS non-recursive implementation on a GPU. . . . 52
5.4 BFS optimized nested parallelism implementation on a GPU. . . . 53
5.5 Non-nested parallelism implementation of Prim's algorithm on a GPU. . . . 57
5.6 Optimized nested parallelism implementation of Prim's algorithm on a GPU. . . . 58
List of Acronyms
GPGPU General-Purpose computing on Graphics Processing Units. The use of graphics processing units (GPUs) to perform computation in applications traditionally handled by the central processing unit (CPU).
GPU Graphics Processing Unit. The graphics processor in the system.
CCL Connected Component Labeling. An image segmentation algorithm that labels points connected by a similarity function.
LSS Level Set Segmentation.
SC Spectral Clustering.
SIMD Single-Instruction Multiple Data.
SIMT Single Instruction Multiple Thread.
API Application Programming Interface.
ISA Instruction Set Architecture.
TLP Thread-level Parallelism.
PTX Parallel Thread Execution.
MPI Message Passing Interface.
CUDA NVIDIA’s Compute Unified Device Architecture Framework.
OpenCL Open Computing Language.
SM Streaming Multiprocessor.
ECC Error Correcting Codes.
CTA Cooperative Thread Arrays.
PT Persistent Threads.
PDE Partial Differential Equation.
PDEs Partial Differential Equations.
BFS Breadth-first Search.
MST Minimum Spanning Tree.
Acknowledgments
It would not have been possible to write this doctoral thesis
without the help and support
of the kind people around me, to only some of whom it is
possible to give particular mention
here.
First of all, I would like to thank my parents, Fani and Dante, for their endless support through every single step of this journey. I thank my brother Reykjavil and my sister Lisbeth for keeping me on the path and making me believe that everything is possible. I thank my boyfriend Jose for his unlimited love and unwavering support, for which my mere expression of thanks likewise does not suffice.
This thesis would not have been possible without the help, support, and patience of my colleagues and collaborators. A special thanks to all my colleagues in the NUCAR group, especially Leiming, Fritz, Julian, and Xiangyu, for their contributions to the concepts and ideas, and for keeping me company on the doctoral journey. I would also like to thank our collaborators Dr. Qianqian Fang, Dr. Norm Rubin (NVIDIA), and Dr. Ningfang Mi for their constructive feedback on this dissertation.
It is with my deepest gratitude and warmest affection that I dedicate this thesis to my advisor, Dr. David Kaeli, who has been a constant source of knowledge and inspiration.
Abstract of the Dissertation
Characterization and exploitation of nested parallelism and
concurrent kernel execution to accelerate high performance applications
by
Fanny Nina Paravecino
Doctor of Philosophy in Computer Engineering
Northeastern University, March 2017
Dr. David Kaeli, Adviser
Over the past decade, GPU computing has evolved from the simple task of mapping data-parallel kernels to Single Instruction Multiple Thread (SIMT) hardware to a more complex challenge: mapping multiple complex, and potentially irregular, kernels to more powerful and sophisticated many-core engines. Recent advances in GPU architectures, including support for advanced features such as nested parallelism and concurrent kernel execution, further complicate the mapping task.
Improving application performance is a central concern for
software developers. To
start with, the programmer needs to be able to identify where
opportunities for optimization
reside. Many times the right optimization is tied to the
underlying nature of the application
and the specific algorithms used. The task of tuning kernels to
exploit hardware features can
become an endless manual process. There is a growing need to
develop characterization
techniques that can help the programmer identify opportunities
to exploit new hardware
features, and to port a broader range of applications to GPUs
efficiently.
In this thesis, we present novel approaches to characterize application behavior that can exploit nested parallelism and concurrent kernel execution, features introduced on recent GPU architectures. To identify bottlenecks that can be addressed through the exploitation of nested parallelism and concurrent kernel execution, we propose a set of metrics for a range of GPU kernels.
For nested parallelism, our approach focuses on irregular and recursive kernel applications. For irregular applications we define, implement, and evaluate three main runtime components: i) control flow workload analysis, ii) child kernel launching, and iii) child kernel synchronization. For recursive kernel applications, we define, implement, and evaluate: i) degree of thread-level parallelism, ii) work efficiency, and iii) overhead of kernel launches. For concurrent kernel execution, our characterization captures a kernel's launch configuration, its resource consumption, and the degree of overlapped execution. Our proposed metrics help us to better understand when to exploit nested parallelism and concurrent kernel execution.
We demonstrate the utility of our framework of metrics by focusing on a diverse set of workloads that include both irregular and recursive program behavior. This suite of workloads includes: i) a set of microbenchmarks that specifically target the new GPU features discussed in this thesis, ii) the NUPAR suite, iii) the Lonestar suite, and iv) real-world applications. By using our framework, we are able to speed up applications by 5x-23x as compared to GPU implementations that do not use these advanced parallel features.
Chapter 1
Introduction
In 1965, Gordon Moore proposed Moore's Law, which states that the number of transistors on a microprocessor doubles roughly every 18 months [1]. Since 1965, Moore's Law has been shown to be remarkably accurate, and microprocessors have doubled their capabilities every one to two years. However, the translation of increased transistor density into improved application performance remains a challenging endeavour. There is no silver bullet that automatically optimizes software, programming frameworks, and algorithms so that they can benefit from advances in hardware.
In many areas, performance improvements have been possible only due to modifications in algorithms, providing substantial performance gains that are much higher than those enabled by increasing processor speed alone. There are still many challenges that need to be addressed through the discovery of new parallel algorithms, specifically designed to take advantage of the potential power of parallel hardware while avoiding some of the bottlenecks that can occur on these platforms.
In this thesis, we will explore different mechanisms to
understand the behavior of the
parallel code (i.e., kernels) at different stages of the
computing stack, including multiple
compilation levels, as well as runtime execution. This work will
define a characterization
process of parallel execution that will guide and inform the
programmer on how best to
exploit new parallel features. Equipped with this knowledge, the
programmer can then
exploit parallelism at different grains of concurrency. We test
our characterization process
on a broad set of parallel applications, demonstrating the
utility of this knowledge to tune
CHAPTER 1. INTRODUCTION
applications to effectively exploit two recently introduced parallelization features: 1) nested parallelism and 2) concurrent kernel execution. We will also present a tuning mechanism to further improve application throughput.
1.1 Parallel Programming
Parallel programming provides a myriad of advantages over sequential programming, such as increased application throughput, improved utilization of hardware resources, and enhanced concurrent execution [2]. Given the wide range of parallel computing hardware platforms available today, spanning from massively parallel supercomputers to multicore smartphones, parallel execution has become the most effective path to improved performance. The need for high performance has been amplified by the rate at which raw data is being generated today, a rate that will continue to grow rapidly for the foreseeable future.
Commonly, the easiest way to write parallel code is to use a framework such as OpenMP. OpenMP is a simple, directive-based interface that offers incremental parallelization, allowing loops in serial code to be executed concurrently without changing their structure [3]. However, using OpenMP does not solve the problem of load imbalance, and the resulting performance gain is limited by Amdahl's law [4], which states that the overall improvement is bounded by the portion of the code that cannot be parallelized.
When working with a distributed system, the Message Passing Interface (MPI) [5] provides an effective programming model for expressing parallelization. MPI is commonly used on distributed memory systems that leverage message passing. However, one notable trend we are witnessing in the field of parallel scientific computing is the dramatic increase in the number of applications that utilize GPUs. Based on Flynn's widely used taxonomy [6, 3], the large number of cores on the Graphics Processing Unit (GPU) enables us to launch thousands of compute threads that execute in Single-Instruction Multiple Data (SIMD) fashion. SIMD provides parallelism by operating on multiple data streams concurrently [3]. Applications for GPUs are commonly developed using programming frameworks such as Khronos's Open Computing Language (OpenCL) [7, 8, 9] and NVIDIA's Compute Unified Device Architecture (CUDA) [10]. Both OpenCL and CUDA are based on the high-level
programming constructs of the C and C++ languages. The data-parallel and computationally intensive portions of an application are offloaded to the GPU for accelerated execution. These programming frameworks offer a rich set of runtime APIs, and allow the developer to write optimized kernels for execution on GPUs.
Researchers and developers have enthusiastically adopted the
CUDA programming
model and GPU computing for a diverse range of applications [11,
12, 13, 14]. Given the
varying degrees of parallelism present in many applications, we
are motivated to explore
advanced parallel features on the GPU.
1.1.1 Advanced Parallel Features
Recent advances in GPU architectures have pushed past a number of computational barriers, enabling researchers to leverage parallel computing to improve application throughput. Graphics hardware has evolved substantially over the years to include more functionality and programmability. NVIDIA's previous generation of GPUs, the Fermi family, has been used in a number of applications, promising peak single-precision floating-point performance of up to 1.5 TFLOPS. In contrast, NVIDIA's Kepler GK110 GPU offers more than 4.29 TFLOPS of single-precision computing capability. The newest features provided on Kepler enable programmers to move a wider range of applications to the CUDA framework.
Given the new features provided on recent hardware, exploiting them to improve overall execution throughput has become paramount. Thread-level parallelism provides impressive speedups for applications ported to the GPU. Moreover, the addition of nested parallelism improves the throughput of conditional-loop execution, which requires working at a finer thread granularity. Another new feature is concurrent kernel execution, which improves the utilization and runtime of multiple kernels, removing the overhead due to context switching. There is also a performance advantage provided by performing back-to-back kernel launches. In the CUDA API, kernel invocations are asynchronous. If a developer can call a kernel (or kernels) multiple times without any intervening synchronization (i.e., memory transfers or dependency checking), then the multiple kernel calls will be batched in the CUDA driver, and the application can overlap kernel execution on the GPU.
Given the level of sophistication provided in modern GPUs, we have focused our
work on the characterization of advanced parallel features in
order to guide the improvement
of application throughput. We consider optimization of
applications for two new features
available on NVIDIA Kepler GPUs and more recent GPU
generations:
• Nested Parallelism: modern GPUs add the capability to launch child kernels within a parent kernel. A pattern commonly found in many sequential algorithms is the nested loop. Nested parallelism allows us to implement a nested loop with variable amounts of parallelism.
• Concurrent Kernel Execution: modern GPUs provide the ability to run multiple kernels, assigned to different streams, concurrently. The Kepler, Maxwell, and Pascal architectures support up to 32 concurrent streams (as compared to 16 on Fermi). Each stream is assigned to a different hardware queue.
1.2 Characterization of Advanced Parallel Features
The utilization of high performance computing resources has also been hampered by the relative dearth of system software and tools for monitoring and optimizing performance. Profilers have evolved to provide application execution insights to the developer in order to improve application throughput. However, profilers are tightly tied to specific hardware and do not support the latest advanced parallel features, which makes tuning applications targeting modern GPUs a challenge.
New approaches to profiling/instrumentation are needed to understand application interaction with the latest hardware features. Binary instrumentation can be used on a GPU for performance debugging, correctness checks, workload characterization, and runtime optimization. Such techniques typically involve inserting code at the instruction level of an application during back-end compilation; binary translation is able to gather data-dependent application behavior.
Given the presence of data-dependent behavior in an application, we can characterize different execution patterns. Our focus is to characterize dynamically available parallelism, with the aim of evaluating implementations designed to exploit these execution patterns using advanced parallel features such as nested parallelism. Our characterization approach evaluates
the potential for optimization by analysing the impact of control, memory, and synchronization behavior on a GPU. As an illustrative example, our study targets a comprehensive understanding of the overhead of nested parallelism as currently supported on GPUs, in terms of kernel launch, control flow, nested synchronization, and algorithm overhead.
We also consider another form of parallelism available on modern GPUs: concurrent kernel execution. Just as a typical CPU application can consist of multiple functions, it is also common to have multiple GPU kernels present in a single GPU application. A GPU kernel is a function executed on a GPU device. Managing efficient concurrent kernel execution using independent thread blocks is cumbersome at best. In particular, this thesis targets a detailed understanding of the run-time costs of concurrent kernel execution in terms of kernel launch configuration, resource contention, and overlapped computation.
1.3 Challenges in Exploiting Parallel Execution Features
The software implementation of a GPU application can dramatically influence the application's performance. For example, performance will suffer if kernels are stalled due to control dependence. Delays also occur when data dependencies are encountered. GPU stream processors are more difficult to utilize effectively if the targeted applications present dynamic and frequent data dependencies (commonly present in sorting, recursion, dynamic programming, and evolutionary programming).
Along with the challenges of dynamic and global dependencies, many applications involve the execution of multiple kernels. The current generation of NVIDIA GPUs already supports concurrent execution of kernels using Hyper-Q technology, allowing concurrent execution of kernels from the same application or from different applications. In this thesis, we characterize concurrent kernel execution, and explore how to improve resource utilization and minimize kernel launch overhead. Presently, it is difficult to modify an application to effectively leverage nested parallelism and concurrent kernel execution. Addressing this gap is the major focus of this thesis.
1.3.1 Nested Parallelism Challenges
Depending on the application characteristics and the parallelization strategy, a kernel can exhibit a range of dynamic behaviors. This dynamic behavior is highly correlated with data-dependent parallel execution. Data dependencies are found in parallel loops and recursive calls, both of which are forms of nested Thread-level Parallelism (TLP).
Nested TLP can present a range of control flow behaviors. Explicit control flow constructs such as if-then-else or for-loop are fundamental constructs in any high-level programming language. In kernels with complex control flow, SIMD threads can follow different paths of execution, causing thread divergence. Thread divergence would seem to cause a paradox, since all threads in a basic group (e.g., a warp) must execute the same instruction on each cycle. Instead, if the threads in a warp diverge, the warp serially executes each branch path, disabling the threads that do not take that path. Warp divergence can dramatically degrade application performance.
Understanding control flow effects is a key step towards the characterization of nested parallelism. We have faced the following challenges when trying to exploit dynamic parallelism:
• For control flow analysis, it is important to quantify the impact of thread divergence by categorizing divergent and convergent paths in order to understand how performance is impacted. Control flow divergence can severely impact our ability to leverage nested parallelism. On-the-fly analysis of the control flow workload provides a better understanding of data-dependent applications. In previous work, control flow analysis has been performed statically.
• To properly characterize child kernel launches, we need to understand kernel launch parameters and device runtime management. There presently are no tools or profilers that can properly analyze nested-kernel launch overhead.
• Nested parallelism requires that parent kernels and child kernels explicitly synchronize with each other in order to assure consistent application execution. In order to perform child kernel synchronization, the device runtime has to save the state of parent kernels when they are suspended and yield to the child kernels at the synchronization points.
To our knowledge, there are no tools available that can measure
dynamic child kernel
synchronization.
1.3.2 Concurrent Kernel Execution Challenges
Enabling multiple kernels to execute concurrently on GPUs leads to the physical sharing of compute resources. Concurrent kernel execution can increase overall application throughput and can also reduce energy consumption. In order to deliver a performance improvement, there need to be sufficient resources on the GPU to launch concurrent kernels. In other words, concurrent kernel execution provides performance improvement through overlapped kernel computation. In order to achieve overlapped kernel computation, we need to understand the sources of resource contention and the effects of the kernel launch configuration.
Resource contention is heavily dependent on the application input. For example, a small input set might not stress the memory, whereas a large input set might. At the same time, resource contention is dependent on the amount of GPU hardware resources available. An application binary compiled and optimized for one GPU may perform poorly on another GPU due to resource contention.
The kernel launch configuration can give us clues about resource contention. Each kernel is launched with a set of variables called the launch configuration. Commonly, these variables include the number of threads per block, the number of thread-blocks per grid, the usage of shared memory, and the number of registers used. Most of the time, these variables are dictated by the number of data elements the kernel operates on. Depending on the GPU architecture, the resource usage implied by these variables can change dramatically. Developing a better understanding of resource contention is a key step towards the characterization of concurrent kernel execution. We face the following challenges when trying to exploit concurrent kernel execution:
• To properly understand resource contention, we need better control of the resources utilized by the kernel. We can bring software threads closer to the actual hardware thread execution by implementing persistent threads. Persistent threads break the one-to-one mapping between software threads and data elements; instead, the mapping is
dynamically defined by the availability of resources on the GPU.
There is no general
way to map any kernel to persistent threads; persistent threads
will not always provide
the best performance for every kernel.
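A common realization of persistent threads is the grid-stride loop, sketched below for a simple element-wise operation (the kernel names are illustrative, not code from this thesis): the grid is sized to what the GPU can keep resident, and each thread loops over many data elements.

```cuda
// Conventional mapping: one software thread per data element.
__global__ void flat(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Persistent-thread mapping: launch only as many blocks as the GPU
// can keep resident; each thread processes many elements.
__global__ void persistent(float *x, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)    // stride by the total thread count
        x[i] *= 2.0f;
}
```

Because the grid size is now a free tuning parameter rather than a function of n, it gives the developer a direct handle on how much of the device a kernel occupies.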
• Resource contention varies dramatically across different GPU architectures, driver versions, and CUDA framework versions. Furthermore, the compiler and driver can have a significant impact on kernel performance. To properly exploit concurrent kernel execution, we need to understand the interaction between the hardware, driver, compiler, and CUDA framework, which unfortunately is not disclosed by hardware vendors.
1.3.3 Benchmark Suite
Many applications—both academic research and industrial products—have been accelerated using parallel frameworks to achieve significant parallel speedup. Such applications encompass a variety of problem domains, including security surveillance, numerical linear algebra, and graph theory, among others. Of these many applications, we select a set of representative real-world applications to focus our discussion.
There has been considerable growth in interest in security surveillance image segmentation problems. This interest has created an increased need for high-performance image segmentation kernels. Different image segmentation approaches have used GPU computing in a wide variety of applications [15, 14, 16, 17]. Among the different image segmentation approaches, Connected Component Labeling (CCL) and Level Set Segmentation (LSS) are the most well-known applications.
CCL is a widely used image segmentation algorithm. It connects neighboring pixels based on their similarities. The dependencies between neighboring pixels and the continuous propagation of connectivity between pixels make CCL a highly sequential application. CCL is a great candidate for characterization of nested parallelism due to its dynamic propagation of connected components.
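The dynamic, data-dependent propagation at the heart of CCL can be sketched as a single label-propagation step (a simplified illustration, not the accelerated CCL implementation of this thesis): each pixel adopts the minimum label among its 4-connected neighbors of the same intensity, and the step is repeated until no label changes—the data-dependent iteration count is what makes CCL a natural fit for nested parallelism.

```cuda
// One propagation step over a w x h image. The host (or a parent
// kernel) re-launches this until *changed stays 0.
__global__ void propagate(const unsigned char *img, int *label,
                          int w, int h, int *changed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int idx = y * w + x, best = label[idx];
    const int dx[4] = {-1, 1, 0, 0}, dy[4] = {0, 0, -1, 1};
    for (int k = 0; k < 4; ++k) {            // scan the 4-connected neighbors
        int nx = x + dx[k], ny = y + dy[k];
        if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
        int nidx = ny * w + nx;
        if (img[nidx] == img[idx] && label[nidx] < best)
            best = label[nidx];              // adopt the smaller label
    }
    if (best < label[idx]) { label[idx] = best; *changed = 1; }
}
```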
LSS is an evolutionary image segmentation algorithm. Given an initial curve C, LSS expands or contracts C based on the evolution of the function f. The expansion of the curve is an outward evolution, and the contraction of the curve is an inward evolution. Every evolution cycle depends on the previous cycle in terms of computing the curve. The
dependencies between multiple pixels make LSS a great candidate for characterization of nested parallelism and concurrent kernel execution together.
To analyze how best to accelerate recursion, we have explored graph theoretic algorithms, including BFS and Prim's algorithm. In addition, we evaluated selected Lonestar [18] and NUPAR [19] benchmarks in this thesis. In summary, we have used two real applications and four different benchmark applications as we developed the characterization schemes in this thesis. Next, we outline the contributions and describe the organization of the remainder of this thesis.
1.4 Contributions of the Thesis
In this thesis, a number of key contributions towards the deep
analysis and exploitation
of advanced parallel features are presented. The key
contributions are summarized below:
• We characterize parallel applications, identifying when we can leverage nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define three workload components that can guide the developer on how best to leverage nested parallelism. To the best of our knowledge, ours is the first work to define, implement, and evaluate these three components in combination: i) control flow workload analysis, ii) child kernel launching, and iii) child kernel synchronization.
• We develop NVIDIA SASS instrumentation handlers to characterize data-dependent application behavior. We use the NVIDIA assembly code SASS Instrumentor (SASSI) to evaluate dynamic application behavior. We provide a handler to profile and measure binary execution for control-flow-dependent loops. Our handler can collect and measure the efficiency of control-dependent loops.
• We characterize recursive parallel workloads, identifying when we can leverage nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define three workload components that can guide the developer on how best to leverage nested parallelism in the case of parallel recursion. We evaluate three components: i) the degree of thread-level parallelism, ii) the work efficiency, and iii) the overhead of
kernel launches. Furthermore, we propose a new approach to
increase thread-level
parallelism in order to increase work efficiency and reduce the
number of recursive
kernel launches.
• We characterize the execution of concurrent kernels on NVIDIA GPUs (Kepler and Maxwell families). Our characterization captures a kernel's launch configuration, its resource consumption, and the degree of overlapped execution. Our proposed metrics help us better understand when to use concurrent kernel execution.
• We propose, implement, and evaluate kernels with persistent threads as a mechanism to control resource contention for concurrent kernel execution on GPUs. Our results show that kernels with persistent threads can be beneficial for identifying peak resource contention. Unfortunately, this does not directly lead to an overall performance improvement.
• Our proposed workload metrics for irregular applications and parallel recursive kernels have been applied to a number of CUDA kernels taken from the problem domains of image processing, linear algebra, and graph theory. For these performance-hungry applications, we achieve 1.3x to more than 100x speedup, as compared to flat GPU kernels.
• We compare state-of-the-art image segmentation applications, including connected component labeling and level set segmentation, exploring both nested parallelism and concurrent kernel execution. Our accelerated connected component labeling has been presented at the International Conference on Computer Vision and Graphics (ICCVG) [15]. In addition, it has also been presented at the GPU Technology Conference (GTC) [20]. Our work on fast level set segmentation exploiting advanced parallel features has been presented at the Irregular Applications: Architectures and Algorithms Workshop (IA3) [21] and featured as a poster at the Programming and Tuning Massively Parallel Systems Summer School (PUMPS). Furthermore, our accelerated connected component labeling has been ported to OpenCL. We have analyzed the benefits of advanced parallel features on AMD cards, and this work has been presented at the
3rd International Workshop on OpenCL (IWOCL) [22]. Both of these real-world applications are part of the NUPAR benchmark suite presented at the International Conference on Performance Engineering (ICPE) [19].
1.5 Organization of Thesis
The central focus of this work is to characterize nested
parallelism and concurrent kernel
execution in a systematic way that works well for any GPU and
any application. The
remainder of the thesis is organized as follows: Chapter 2
presents background information
on GPU architecture, specifically the NVIDIA GPU architecture,
the parallel framework
CUDA, and the NVIDIA SASSI instrumentation framework. In Chapter
3, we present
related work in the area of characterization of parallel
kernels, nested parallelism, and
concurrent kernel execution on GPU devices. In Chapter 4, we discuss the characterization of nested parallelism for conditional nested loops, parallel recursion, and concurrent kernel execution on NVIDIA Kepler and Maxwell architectures. Next, in
Chapter 5 we present our
benchmark kernels that are used throughout this thesis to
leverage advanced parallel features.
In Chapter 6, we present real applications that leverage our
framework to effectively exploit
advanced parallel features. In Chapter 7, we conclude the thesis
and summarize our work.
We also suggest directions for future work.
Chapter 2
Background
As we enter the era of GPU computing, demanding applications with substantial parallelism can leverage the massive parallelism of GPUs to achieve superior performance and efficiency. Today GPU computing enables applications that were previously thought to be infeasible because of long execution times. Enjoying the benefits of Moore's Law [1, 23, 24], NVIDIA GPUs have evolved steadily since 2001. Table 2.1 shows the evolution of NVIDIA graphics cards since the first programmable GPU was released.
Date Product Transistors CUDA cores
2001 GeForce 3 60 million -
2002 GeForce FX 125 million -
2004 GeForce 6800 222 million -
2006 GeForce 8800 681 million 128 (first support for CUDA programming)
2007 Tesla T8, C870 681 million 128
2008 GeForce GTX 280 1.4 billion 240
2008 Tesla T10, S1070 1.4 billion 240
2009 Fermi 3.0 billion 512
2012 GK104 Kepler 3.5 billion 1536
2012 GK110 Kepler 7.0 billion 2688
2014 GM204 Maxwell 5.2 billion 2816
Table 2.1: NVIDIA GPU technology evolution [25]
With the rapid evolution of GPUs from a configurable graphics
processor to a general
purpose programmable parallel processor, the ubiquity of GPUs in
every PC, laptop, desktop,
and smartphone was imminent. A large community of researchers and developers has adopted the CUDA programming framework for a diverse range of applications [26, 27, 28].
The CUDA runtime on an NVIDIA GPU enables us to execute programs developed in high-level languages, including C, C++, Fortran, OpenCL, DirectCompute, and others [26, 25, 2]. The nature of CUDA is to try to preserve elements of common sequential programming and extend them to parallel thread execution. CUDA presents a Single Instruction Multiple Thread (SIMT) abstraction with a straightforward set of configurations for expressing parallelism.
2.1 CUDA Model
The CUDA model acts as a bridge between an application and its
implementation
on available hardware [29]. There are a number of different
layers that lie between the
application and the hardware. Figure 2.1 shows the different
layers of abstraction between a
software implementation and the hardware level. The programming model provides a logical view of the specific computing architecture.
(Figure content, top to bottom: software application, CUDA runtime, CUDA driver, GPU hardware.)
Figure 2.1: Layers of abstraction between software application and GPU hardware.
CUDA enables the developer to write parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores. CUDA divides execution down,
hierarchically, using parallel abstractions such as kernels,
blocks, and threads per block (see
Figure 2.2). A kernel executes a sequential program on a set of parallel threads. Each thread has its own registers and private local memory. Each block allows communication among its threads through shared memory. Blocks communicate with each other using global memory. This memory hierarchy is illustrated in Figure 2.3.
Figure 2.2: The CUDA model: a kernel, grid, and threads per
block.
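A minimal sketch of this hierarchy (a standard SAXPY-style kernel, not code from this thesis; n, d_x, and d_y are assumed host-side variables): the kernel body is the per-thread sequential program, and the launch configuration defines the grid of thread blocks.

```cuda
// A kernel: a sequential program executed by every thread in the grid.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Global thread index, composed from block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host side: choose threads per block, then enough blocks to cover n.
int threads = 256;
int blocks  = (n + threads - 1) / threads;
saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
```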
Other memories included in the CUDA memory model include:
• texture memory, specialized for 2D read-only coalesced accesses
• constant memory, designed to support read-only accesses from different threads across blocks.
As mentioned in Chapter 1, the CUDA model follows a SIMT architecture to manage and execute threads in groups of 32 called warps. Even though all threads in a warp must execute the same instructions, much as in a SIMD architecture, there are some key features that differentiate a GPU from traditional SIMD:
(Figure content: each thread has registers and local memory; threads within a block share shared memory; all blocks access global, constant, and texture memory.)
Figure 2.3: The CUDA memory hierarchy.
• Each thread in the warp has its own instruction address
counter.
• Each thread has its own register state.
• Each thread can have an independent execution path.
Although the CUDA model enables each thread in a warp to follow a different execution path, divergent behavior degrades performance, since the divergent paths within a warp are executed serially. Control flow instructions (e.g., if-then-else, for, while) are among the fundamental constructs in CUDA programming that cause this undesired behavior, called warp divergence.
2.1.1 Divergence
The use of control flow instructions is unavoidable in any application. Modern CPUs include complex hardware to perform branch prediction [30, 31]. Hardware branch predictors speculate on the direction of conditional control flow in programs [32, 33, 34]. If the predictor is correct, branch execution incurs little or no performance penalty. If the prediction is
not correct, the CPU stalls for a number of cycles as the
instruction pipeline is flushed,
and instruction fetching resumes at the correct program counter.
In comparison, GPUs are
high-throughput, but lack complex branch prediction mechanisms
[35, 36, 37]. Execution
on an NVIDIA GPU using the CUDA execution model assumes that all
threads in a warp
must execute identical instructions on the same cycle. Executing
complex control flow
typically results in divergent execution between the threads in
the same warp [38].
Recent GPUs are designed to better handle control flow. The
modern GPU hardware
supports condition codes (CC) and CC registers that contain the
4-bit state vector (sign,
carry, zero, overflow) used in integer comparisons [39]. The CC
registers can direct the flow
of execution via predication or divergence. Predication allows
(or suppresses) the execution
of instructions on a per-thread basis within a warp, while
divergence supports conditional
execution of longer instruction sequences.
Due to the additional overhead of managing divergence and
convergence, the compiler
uses predication for short instruction sequences. The effect of
most instructions can be predicated on a condition; if the condition is not true, the
instruction is suppressed. Predication
works well for small fragments of conditional code, especially
for if statements with no
corresponding else. For larger conditional code segments,
predication becomes inefficient
because every instruction is executed, regardless of whether it
will affect the computation.
When the length of the conditional code fragment is long and the
cost of predication would
exceed the benefits, the compiler will generate conditional
branches. If the threads in a warp
diverge due to a data-dependent conditional branch, the warp
serially executes each branch
path taken, disabling threads that are not on that path. Once
all paths complete, all threads
re-converge to the original execution path. Figure 2.4
illustrates how warp divergence is
handled on a GPU.
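A minimal illustration of the effect (illustrative kernels, not code from this thesis): in the first kernel, adjacent lanes of the same warp take different branches, so the two paths serialize; in the second, the branch condition is uniform across each 32-thread warp and no warp diverges.

```cuda
// Divergent: even and odd lanes of a warp take different branches,
// so the two paths execute one after the other with lanes masked off.
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) x[i] = sinf(x[i]);   // even lanes
    else            x[i] = cosf(x[i]);   // odd lanes
}

// Warp-aligned: the condition is constant within each 32-thread warp,
// so every warp executes a single path.
__global__ void uniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) x[i] = sinf(x[i]);
    else                   x[i] = cosf(x[i]);
}
```

Both kernels compute comparable work, but only the first pays the serialization cost described above.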
Although warp divergence can have a negative impact on
application throughput, this
impact varies dramatically across GPU architectures. In the
following sections we address
divergent execution for the latest GPU generations, from the
Fermi GPU architecture to the
Pascal GPU architecture.
Figure 2.4: Branch divergence in the GPU
2.2 GPU Computing Architecture
A Streaming Multiprocessor (SM) is the centerpiece of the NVIDIA GPU architecture. A thread block is scheduled on a single SM, and once it is scheduled on the SM, it remains there until execution completes. An SM can hold more than one thread block at the same time. Registers and shared memory are scarce resources in the SM. These resources have to be partitioned among all threads resident on an SM. Each SM contains hundreds of CUDA cores, and each GPU device contains tens of SMs.
Logically, all threads in a block run in parallel, but not all threads can execute physically at the same time. Therefore, different blocks may make progress at different rates. Since warps are the atomic unit of execution on the GPU, many warps can be scheduled on an SM, but depending on the SM resource availability, not all scheduled warps will be active. If a warp is idle, the SM schedules another warp from any block that is resident on the same SM. The benefit of this switching between concurrent warps is that it incurs essentially no overhead, since each warp's state is kept resident on the SM. Given the importance of determining the right warp granularity, we would like to quickly
find the best grid configuration for any application.
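One quick way to approach this is the occupancy API introduced in CUDA 6.5, sketched here for a hypothetical myKernel operating on n elements: it suggests a block size that maximizes the number of resident warps given the kernel's register and shared memory usage.

```cuda
// Ask the runtime for the block size that maximizes occupancy for
// this kernel's resource footprint, then size the grid to cover n.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   myKernel,
                                   /*dynamicSMemSize=*/0,
                                   /*blockSizeLimit=*/0);
int gridSize = (n + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_data, n);
```

Maximum occupancy does not always equal maximum performance, so the suggested configuration is a starting point for the kind of characterization developed in this thesis, not a final answer.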
2.2.1 Fermi Architecture
The NVIDIA Fermi (chip GF110) GPU was released in 2009. Fermi introduced an increased number of CUDA cores per SM, more shared memory space, configurable shared memory, and Error Correcting Codes (ECC) on main memory and caches. Each SM in Fermi has 32 CUDA processor cores, 16 load/store units, and four special function units (SFUs). Fermi has a 64-KByte register file, an instruction cache, two multi-threaded warp schedulers, and two instruction dispatch units [40].
The SIMT instructions control the execution of an individual
thread, including arithmetic,
memory access, and branch/control flow instructions. Fermi
extends SIMT to control flow
with support for indirect branches and function-call
instructions. With the improvements
introduced in the Fermi Parallel Thread Execution (PTX) 2.0
Instruction Set Architecture
(ISA), individual thread control flow can predicate
instructions.
2.2.2 Kepler Architecture
A number of new features were introduced in Kepler as compared to the earlier Fermi GPU architecture. Table 2.2 compares some of these features for Fermi (an instance of chip GF110) and Kepler (an instance of chip GK110).
Kepler GK110 comprises up to 15 Kepler SM (SMX) units, four warp schedulers, and eight instruction dispatch units. Thus, it can issue and execute four warps simultaneously. Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 load/store units, and 32 special function units, each of which can compute a sine, cosine, reciprocal, or square root per thread per clock [42]. Kepler GK110 can provide up to 4.29 TFLOPS single-precision and 1.43 TFLOPS double-precision floating point performance [43].
In addition to an increase in the number of CUDA cores per SM
and a dramatic increase
in the number of registers per thread, Kepler (compute
capability 3.5 or higher) introduced a
number of new features to further simplify parallel program
design.
Fermi (chip GF110) Kepler (chip GK110)
SPs per SM 32 192
Threads per SM 1536 2048
Thread blocks per SM 8 16
Warp schedulers per SM 2 4
Dispatch Units per SM 2 8
Shared Memory/L1 cache 16/48KB 16/32/48KB
32-bit Registers per SM 32K 64K
Registers per thread 63 255
Table 2.2: Fermi chip GF110 versus Kepler chip GK110 [41]
2.2.2.1 Dynamic Parallelism
Dynamic parallelism is an extension to the CUDA programming model, enabling CUDA kernels to create, and synchronize with, new kernels entirely on the GPU. With this feature, any kernel can launch a child kernel and manage inter-kernel dependencies [35].
To manage the execution of dynamic parallelism, Kepler added a new unit known as the Grid Management Unit (GMU) [44, 42, 45], which is able to dispatch, as well as pause the dispatch of, new grids. The GMU can also queue pending grids and suspend running grids. A grid includes all thread-blocks associated with a kernel. Grids are launched in the order that they are received.
In previous GPU generations, the host launched work through the Compute Work Distributor (CWD) unit [42, 2]; the CWD tracks the blocks issued and sends them to the SMs for execution. In Kepler and more recent GPU generations, the GPU launches work from the host or the device using the GMU. The GMU communicates with the CWD using a bidirectional link to prioritize or suspend/pause grids. The GMU also has a direct connection to the SMs to support dynamic parallelism, and through this connection device kernels can dispatch child grids.
The aim of the GMU is to effectively manage grid dispatching, in such a way that, if we need to free up resources for child kernels to execute, the GMU will suspend parent kernel grids [42, 45]. The device runtime will reschedule the grids on different SMs in order to
better manage resources. Figure 2.5 illustrates GMU interaction
with the CWD and the SM.
Figure 2.5: Work flow of the Grid Management Unit to dispatch,
pause, and hold pending
and suspended grids.
Dynamic parallelism enables threads to create work directly on the GPU. This can remove the need to transfer execution control and data between the host and the device. The child kernel launch decisions are made at runtime by threads executing on the device. The CUDA model controls the synchronization and communication between a parent kernel and its child kernels. The local memory and registers associated with a parent thread are still only accessible by the parent thread, and are not accessible by other threads or any child threads. Communication with a child thread is only through global memory.
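A minimal sketch of this pattern (the kernel names and per-row data layout are hypothetical, not taken from this thesis): each parent thread reads a data-dependent size at runtime and launches a child grid shaped accordingly, with all parent-child data exchanged through global memory. It must be compiled with `-rdc=true -lcudadevrt`:

```cuda
__global__ void process_row(float *row, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) row[i] *= 0.5f;          // per-row work in global memory
}

// Parent: one thread per row; the child grid's size is decided at
// runtime from data (the row length), not at host launch time.
__global__ void parent(float **rows, const int *lens, int nrows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nrows && lens[r] > 0) {
        int threads = 128;
        int blocks  = (lens[r] + threads - 1) / threads;
        process_row<<<blocks, threads>>>(rows[r], lens[r]);  // child launch
    }
}
```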
Using dynamic parallelism, data-dependent parallel work can be
generated inline within
a kernel at runtime. These kernels take advantage of the GPU’s
hardware scheduler and load
balancer to dynamically adapt execution to make data-driven
decisions. Figure 2.6 shows
how dynamic parallelism works on a GPU.
Figure 2.6: Dynamic parallelism on the GPU.
2.2.2.2 Hyper-Q
Hyper-Q increases the total number of work queues between the host and the device by allowing 32 simultaneous hardware-managed connections (as compared to the single connection available with Fermi). Figure 2.7 illustrates the Hyper-Q feature in Kepler.
Figure 2.7: Hyper-Q
2.2.3 Maxwell Architecture
NVIDIA's Maxwell generation provides only a few enhancements over the previous GPU generation, with a focus on energy efficiency. In addition to retaining features such as dynamic parallelism and concurrent kernel execution, the Maxwell generation delivers 2x the performance per watt of the Kepler generation [46].
The Maxwell GTX 980 Ti (chip GM200) comprises 22 Maxwell SMs (SMMs). Each SMM has 128 CUDA cores, four warp schedulers, eight instruction dispatch units, and eight texture units. Overall, a Maxwell SM looks very similar to a Kepler SM, except that Maxwell provides fewer CUDA cores per SM.
Another major change, as compared to the Kepler architecture, is in the memory hierarchy. Shared memory and the L1 cache are no longer combined. Shared memory is dedicated, and the L1 cache is combined with the texture cache. The Maxwell GTX 980 Ti ships with up to 96KB in its shared memory unit, and 48KB for the combined L1/texture cache.
2.2.4 Pascal Architecture
NVIDIA introduced the Pascal architecture in 2016. The NVIDIA GTX 1080, which includes a Pascal GP104, comprises 7.2 billion transistors and 2560 single-precision CUDA cores.
GDDR5X memory is introduced with the GP104, providing a 256-bit memory interface and delivering 43% higher memory bandwidth than NVIDIA's prior GeForce GTX 980 GPU.
The GP104 GPU consists of four Graphics Processing Clusters (GPCs), 20 Pascal SMs, and eight memory controllers. Each GPC has a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, four warp schedulers, eight instruction dispatch units, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units [47]. A comparative feature analysis of four NVIDIA GPU generations is presented in Table 2.3.
GPU GTX 590 GTX Titan GTX 980 Ti GTX 1080
Family Fermi Kepler Maxwell Pascal
Chip GF110 GK110 GM200 GP104
Compute Capability 2.0 3.5 5.2 6.1
SM 16 14 22 20
CUDA cores per SM 32 192 128 128
Total cores 512 2688 2816 2560
Global Mem. 1474 MB 6083 MB 6083 MB 8113 MB
Shared Mem. 48 KB 48 KB 48 KB 48 KB
Threads/SM 1536 2048 2048 2048
Threads/block 1024 1024 1024 1024
Clock rate 1.26 GHz 0.88 GHz 1.29 GHz 1.84 GHz
TFLOPS 1.5 4.29 6.50 9.00
Table 2.3: A comparison of the features available on the four
generations of NVIDIA GPUs
considered in this thesis [25].
Chapter 3
Related work
In this chapter, we review related work in the areas of GPU
characterization, with special
emphasis on modern GPU features. We focus our literature review
on advanced parallel
features for multiple levels of concurrency, and different
grains of parallelism.
3.1 Characterization of GPUs
There have been studies focusing on GPU characterization to better understand the improvements made during the evolution of these devices. This evolution started with GPUs as rendering tools, and spans to today, where GPUs act as advanced general purpose accelerators [48, 49, 50, 51].
An early characterization study by Jia et al. [48] in 2012 focused on characterizing cache memories on GPUs. Starting with the NVIDIA Fermi and the AMD Fusion, GPU vendors have included demand-fetching in their data caches. Earlier GPU generations were focused on graphics rendering, providing local memories instead of demand-fetched caches. With the introduction of demand-fetched caches, new challenges arrived: 1) understanding the benefits of cache memories, and 2) a lack of intuition for developers on how to use them efficiently. They addressed these two problems and provided a mechanism to efficiently utilize cache memories.
Wong et al. [49] presented a characterization of Tesla GPUs
through the execution of a
set of microbenchmarks. Their analysis provided insights about
the characteristics of the
GPUs beyond the information provided by NVIDIA. Another attempt
to characterize the
internals of a GPU was presented by Torres et al. [50]. In their
study, they focused on the
impact of the CUDA tuning techniques on the Fermi architecture.
Jiao et al. [51] presented
a characterization study of GPUs to evaluate power efficiency,
and the correlation between
application performance and power consumption.
A large body of work studies how to leverage GPUs effectively by understanding their characteristics, for both older [52, 53] and modern [54, 55, 56] generations of GPUs. While Kerr et al. [52] focused on understanding the behavior of PTX 1.4, Lee et al. [53] developed an exhaustive performance analysis to capture performance gaps between an NVIDIA GTX280 Tesla-architecture GPU and an Intel Core i7-960. In this thesis, we focus our attention primarily on the characterization of more modern GPUs.
3.1.1 Modern GPUs
Kayiran et al. [54] explored the impact of memory accesses during concurrent thread execution and the resulting application performance. They provided a thorough evaluation of 31 applications - from the CUDA SDK to Map-Reduce problems - to understand resource contention in caches, networks, and memory. Furthermore, they proposed a dynamic Cooperative Thread Array (CTA) scheduling mechanism, which regulates thread-level parallelism by allocating an optimal number of CTAs per application.
Mei et al. [55] provided microbenchmarks to dissect the device memory hierarchy and characterize the organization of the cache systems of different GPUs on the Fermi, Kepler, and Maxwell architectures. Ukidave et al. [19] provided a set of application benchmarks to analyze the latest features on modern GPUs, such as nested parallelism, concurrent kernel execution, atomic operations, and shuffling.
In the next section, we review the characterization of multiple levels of concurrency and thread granularity on modern GPUs.
3.2 Multiple Levels of Concurrency
3.2.1 Nested Parallelism
One of the earliest characterizations of nested parallelism was presented by DiMarco et al. [57] in 2013. They aimed to quantify the performance gains of the dynamic parallelism introduced by CUDA 5 and the Kepler architecture. Their exploration covered two applications: K-means and hierarchical clustering. Their results showed that a finer granularity of TLP provides a more efficient way to leverage nested parallelism than merely avoiding CPU-GPU synchronization.
In 2014, Wang et al. [58] presented an evaluation of the impact of nested parallelism in unstructured GPU applications on the Kepler architecture. Irregular applications suffer from workload imbalance, which provides a good target for optimization using fine-grained threads contained in coarse-grained blocks. Their characterization focused on control flow and memory access measurements. Two metrics were proposed in their study: i) warp execution efficiency, and ii) load/store replay overhead. Although they provided a thorough analysis of nested parallelism for control flow instructions and memory accesses, they did not take into consideration the synchronization cost between parent and child kernels when evaluating the benefits of nested parallelism. Furthermore, they did not consider a finer-grained classification of control flow divergence and its impact on application performance.
In 2015, Wang et al. [59] continued their work on characterizing
nested parallelism in
GPUs. They proposed a new mechanism called Dynamic Thread Block
Launch (DTBL), a
new execution model to support irregular applications on GPUs.
DTBL allows coalesced
allocation of child kernels and parent kernels.
Yang et al. [60] analyzed a set of optimized parallel benchmark applications that contain loops. Their analysis covers the degree of TLP, and they proposed a framework called CUDA-NP to exploit nested parallelism in CUDA. CUDA-NP is a pragma-based compiler approach that generates GPU kernels with nested parallelism. Basically, their approach reads the OpenMP-like pragma directives in the input kernels and creates the respective child kernels with a grid configuration based on the degree of parallel-loop TLP. However, they did not analyze
the implications of parent-child synchronization. Furthermore, they relied on the developer's knowledge to identify potential parallel loops that can exploit nested parallelism, without providing any insight into the behavior of the underlying architectures.
Further studies [61, 62, 63] characterized nested parallelism based on the irregularity of an application. Applications containing parallel loops and recursive calls are well suited to leverage nested parallelism. Zhang et al. [61] adapted two irregular and data-driven problems—breadth-first search and single-source shortest path—to leverage nested parallelism. Li et al. [62] proposed parallelization templates to leverage nested parallelism for tree and graph problems. These types of problems present irregular nested loops and parallel recursive computation. Wang et al. [63] provided insights on leveraging nested parallelism for general irregular applications. However, none of these approaches provided a holistic analysis of the implications of leveraging nested parallelism and its effects across different architecture/compiler versions.
3.2.2 Concurrent Kernel Execution
In early GPU architectures, concurrent kernel execution was
poorly supported. In 2011,
Wang et al. [64] proposed a mechanism to exploit concurrent
kernel execution through
manual context funnelling. They compared CUDA 4 automatic
context funnelling versus
their approach for Fermi architectures. They showed that manual
control of shared resources
might provide slight improvements in application performance.
However, they did not
discuss resource contention based on the interplay between
concurrent kernels.
In 2012, Wende et al. [65] provided a kernel reordering mechanism to exploit concurrent kernel execution on Fermi architectures. Their execution model partitions kernels into small-scale computations and, using producer-consumer principles, manages GPU kernel invocations after reordering them. Later, in 2014, Wende et al. [66] continued their work on the exploitation of concurrent kernel execution, and proposed a characterization of the NVIDIA Hyper-Q feature for the Kepler architecture, using an offloading mechanism to allow running multiple kernels simultaneously. Their analysis explored synthetic benchmarks and developed a performance evaluation, complementing their previous work on kernel reordering.
Gregg et al. [67] proposed a kernel scheduler mechanism called KernelMerge that allows two OpenCL kernels to run concurrently on AMD cards.
KernelMerge takes into
consideration kernel configuration and investigates the
interaction between concurrent
kernels to analyze interference for sharing resources.
Since the Kepler architecture, NVIDIA has provided a modern hardware design to adequately support concurrent kernel execution. In 2014, Jog et al. [68], taking the next logical step, proposed an application-aware memory system for fair and efficient execution of concurrent applications. Their approach provides memory awareness through a new scheduling mechanism that serves memory requests in a round-robin fashion. They considered four metrics based on the Instructions Per Cycle of each application.
However, they did not consider resource contention on registers, nor the grid configuration. Furthermore, they focused on memory-bound applications, and did not discuss arithmetic-bound applications.
In 2016, Luley et al. [69] proposed a framework to exploit
NVIDIA’s Hyper-Q. Their
framework oversubscribes kernels and defragments memory
transfers to effectively overlap
accesses with computations. Furthermore, they proposed multiple
mechanisms to reorder
kernels with the aim of improving application throughput. Although they studied the impact of memory transfers, they did not analyze resource contention between concurrent kernels, which can be a key bottleneck when attempting to leverage concurrent kernel execution.
Chapter 4
Characterization of advanced parallel
features
Acceleration of high performance applications that exhibit complex and irregular execution behavior is an ever-growing open problem. A naive port of an irregular application to a parallel platform often leads to underutilization of hardware resources, significantly limiting performance. In this chapter, we present a characterization of advanced parallel features on a GPU that can be effectively exploited to tune any application with a high degree of irregularity.
4.1 Nested Parallelism
Irregularity in an application can result in poor workload
balance when attempting
to exploit fine-grained thread-level parallelism. We next
consider examples of high-level
language behavior that can suffer from a lack of inherent
thread-level parallelism.
A number of irregular applications contain control-flow dependent nested loops. This kind of irregularity can inhibit thread-level parallelism, since independence can only be deduced at runtime. Because many loops tend to be data dependent, GPU hardware vendors
GPU hardware vendors
introduced support for nested parallelism, leveraging nested TLP
through the addition of a
new level of parallelism. We have studied a number of irregular
applications to identify how
frequently control flow dependent nested loops are used. Table
4.1 shows characterization
data from two different GPU benchmark suites, where control flow
dependent nested loops
occur.
Application                      Benchmark Suite    Control Flow Dependent Nested Loops
Barnes Hut                       Lonestar [18]      6
Delaunay Mesh Refinement         Lonestar [18]      7
Points-to Analysis               Lonestar [18]      31
Survey Propagation               Lonestar [18]      7
Single-Source Shortest Paths     Lonestar [18]      2
Connected Component Labeling     NUPAR [19]         1
Level Set Segmentation           NUPAR [19]         1

Table 4.1: Irregular applications from two different GPU benchmark suites which exhibit control flow dependent nested loops.
We have also explored recursive algorithm patterns that can benefit from nested parallelism. Parallel recursion is a way to efficiently execute recursive algorithms that can spawn multiple threads per recursive call. Before the introduction of nested parallelism on the GPU, recursive solutions required both GPU and CPU intervention, or an implementation of the GPU kernel devoid of recursive kernel calls. However, constant communication between the CPU and the GPU produces memory copies and results in communication overhead. In addition, most recursive solutions are data dependent, so it is challenging to anticipate the amount of overhead that will be introduced. On the other hand, we cannot always use a single GPU kernel call version for all recursive algorithms. Table 4.2 shows a list of recursive kernels that can be expressed as parallel recursion.
Application            Benchmark Suite    Control Flow Dependent    Recursive
                                          Nested Loops              Calls
Breadth-First Search   Lonestar           0                         1
Prim's Algorithm       -                  1                         1

Table 4.2: Recursive applications which exhibit parallel recursion.
__global__ void singleKernel(int *A, int *B, int *C, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (A[idx * cols] == 1)
    {
        for (int i = 0; i < cols; i++)
            C[idx * cols + i] = A[idx * cols + i] + B[idx * cols + i];
    }
}
Program Listing 4.1: Micro-benchmark Kernel with irregular
nested loop execution.
With nested parallelism, a recursive solution can be naturally ported to the GPU and can avoid CPU-GPU communication overhead. Nonetheless, the recursive spawning of threads does not always generate enough TLP to exploit the GPU, and it can lead to substantial kernel launch overhead and hardware underutilization.
Exploiting nested parallelism, either in the presence of control-flow dependent nested loops or parallel recursion, is not straightforward. A nested loop can include control flow divergence, and a recursive solution can lead to poor TLP and low warp efficiency. At the same time, nested synchronization can turn into a large number of thread stalls and global communication between parent and child kernels. Next, we explore each of these factors and present metrics to quantify their impact on kernel performance.
4.1.1 Control Flow Instructions
Mapping parallel programs exhibiting arbitrary control flow onto
parallel units can be a
difficult task. There is generally no guarantee that parallel
units will execute the same control
flow path. For instance, Program 4.1 presents a micro-benchmark kernel that executes a loop based on input parameter data. Figure 4.1 illustrates the dynamic execution of the micro-benchmark on two architectures: a Kepler GTX Titan and a Maxwell GTX Titan Ti. Both executions are run with the same input parameters, the same NVIDIA
driver, and the same CUDA version. However, the number of
instructions executed varies
along the control flow path.
Figure 4.1: Control flow graphs of Program 4.1 for Kepler GTX
Titan and Maxwell GTX
Titan Ti.
CUDA binary tools such as nvdisasm [45] and cuobjdump [45] have been widely used to produce control flow graphs (CFGs). However, nvdisasm and cuobjdump gather kernel behavior statically, and do not allow dynamic analysis of an application's irregularity. On the other hand, the SASS Instrumentation tool (SASSI) [70] allows the dynamic collection of metrics during execution. Moreover, SASSI is able to retrieve developer-specified metrics about the control flow instructions executed at runtime. SASSI, along with nvprof [45], allows us to collect the following runtime metrics:
1. instExec: Number of instructions executed. Reported by nvprof.

2. warpDivEff: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor, expressed as a percentage. Reported by nvprof.

3. cfExecuted: Number of executed control-flow instructions. Reported by nvprof.

4. cfDependentNestedLoop: Number of instructions executed inside a control flow loop instruction. Reported by our handler and injected using SASSI.
These metrics are intrinsically related to the execution of kernel control flow and capture the efficiency of warp execution. We evaluate the percentage of instructions executed inside loops to identify hotspots that present opportunities to exploit nested TLP. We compute the ratio of instructions executed inside loop bodies, as a fraction of all instructions executed, in order to measure the impact of instructions inside these common control flow structures.
loopInstExec = cfDependentNestedLoop / instExec    (4.1)
We also consider the amount of idle resources due to warp divergence. warpDivEff allows us to compute the reciprocal metric, which measures the threads left idle until a loop execution ends:

warpDivIdle = 1 − (warpDivEff / 100)

Next, we propose loop warp efficiency, which takes into account the ratio of instructions executed during loop execution (i.e., loopInstExec). Multiplying these terms gives the fraction of loop warp threads that are idle:

loopWarpThreadsIdle = warpDivIdle ∗ loopInstExec

To obtain the efficiency metric, we compute the reciprocal by subtracting this value from 1 and multiplying by 100 to express it as a percentage:

loopWarpEff = (1 − loopWarpThreadsIdle) ∗ 100    (4.2)
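To illustrate how these metrics combine (using hypothetical measured values, not results from our experiments), suppose nvprof reports warpDivEff = 60% for a kernel in which 80% of instructions execute inside loop bodies (loopInstExec = 0.8). Then:

```
warpDivIdle         = 1 − (60 / 100)    = 0.40
loopWarpThreadsIdle = 0.40 ∗ 0.80       = 0.32
loopWarpEff         = (1 − 0.32) ∗ 100  = 68%
```

In this case, roughly a third of the warp-thread cycles spent in loop bodies are idle, flagging the loop as a candidate for nested TLP.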
Our proposed metrics are specifically designed to measure the workload imbalance generated by irregular applications. These applications have data-dependent workloads and unpredictable control flow behavior, which cause severe workload imbalance and, eventually, poor GPU utilization.
4.1.2 Parallel Recursion
Recursion is a method of making self-referential calls, commonly used to solve problems by breaking them into smaller sub-problems, following a divide-and-conquer strategy. For instance, Program 4.2 illustrates a simple recursive program which implements the Fibonacci sequence [71]. In a recursive solution, the problem is broken into a base case
int fib(int n) {
    if (n < 2)
        return n;
    return fib(n - 1) + fib(n - 2);
}

Program Listing 4.2: Recursive Fibonacci implementation.
__global__ void fib_kernel_par_rec(int n, unsigned long int *vFib) {
    if (n == 0 || n == 1)
        return;

    fib_kernel_par_rec<<<1, 1>>>(n - 2, vFib);
    fib_kernel_par_rec<<<1, 1>>>(n - 1, vFib);
    cudaDeviceSynchronize();

    vFib[n] = vFib[n - 1] + vFib[n - 2];
}
Program Listing 4.3: Fibonacci parallel recursive scheme in
CUDA.
called CUDA blocks, also known as Cooperative Thread Arrays (CTAs) [72]. TLPDegree is the number of threads synchronized across the CTA.
2. workEfficiency: Ratio of the number of operations executed that contribute to solving the problem, divided by the total operations executed on the GPU. The goal of this metric is to provide a measure of the number of non-redundant (vs. redundant + non-redundant) operations executed per GPU kernel. For instance, a 100% work efficiency indicates that no redundant operations were executed.
3. depthKernelRecursion: Number of nested kernel calls.
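As a concrete illustration of these metrics (an analytical count, not a measured result), consider computing fib(5) with the naive parallel recursive scheme of Program 4.3, which launches one child kernel per subproblem:

```
fib(5) call tree:  fib(4) ×1, fib(3) ×2, fib(2) ×3, fib(1) ×5, fib(0) ×3
depthKernelRecursion = 4   (nested launch levels below the root call)
```

Since fib(3) and fib(2) are recomputed rather than reused, many of the launched operations are redundant, which directly lowers workEfficiency.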
Our proposed metrics are specifically designed to measure the efficiency of recursive execution in parallel recursive applications. These applications have data-dependent workloads, nested kernel calls, and irregular parallel recursion, which lead to unbalanced workload execution, low work efficiency, and eventually poor GPU utilization.
4.1.3 Child Kernel Launching and Synchronization
Nested parallelism in CUDA allows explicit synchronization with child kernels by calling the cudaDeviceSynchronize Application Programming Interface (API). When used, the parent thread block will wait until the child threads finish their execution. cudaDeviceSynchronize is expensive, and should be avoided whenever possible. However, for many irregular applications the parent
thread will require results of the child threads to continue
execution. We characterize the
overhead of device synchronization by measuring its impact on
the overall performance.
Once a potential nested parallelism hotspot was identified by our metrics, we implemented a nested parallelism kernel and compared it to the non-nested parallelism kernel, as well as to a sequential implementation of the kernel, in order to characterize the overhead of child kernel launching.
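As a minimal sketch, a nested parallelism variant of the micro-benchmark in Program 4.1 can launch a child grid for the inner loop whenever the control flow condition holds. The kernel names and launch configuration below are illustrative assumptions, not our exact implementation:

```cuda
#include <cuda_runtime.h>

// Child kernel: one thread per column replaces the serial inner loop.
__global__ void childAddRow(int *A, int *B, int *C, int row, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < cols)
        C[row * cols + i] = A[row * cols + i] + B[row * cols + i];
}

// Parent kernel: one thread per row; divergent rows spawn a child grid.
__global__ void parentKernel(int *A, int *B, int *C, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows && A[idx * cols] == 1)
    {
        int threads = 256;
        int blocks  = (cols + threads - 1) / threads;
        childAddRow<<<blocks, threads>>>(A, B, C, idx, cols);
        // cudaDeviceSynchronize() would be needed here only if the
        // parent thread consumed the child's results.
    }
}
```

Dynamic parallelism requires compiling with -arch=sm_35 (or newer) and -rdc=true.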
Figure 4.2 shows the execution time of the three different implementations (i.e., sequential, non-nested parallelism, and nested parallelism) for our micro-benchmark kernel. The micro-benchmark computes the addition of two matrices if the value of the first element in the row matches the condition in the first control flow instruction. We defined a set of experiments, varying the input sizes in terms of rows and columns. In addition, we controlled the level of divergence, starting from 12.5% and increasing it up to 75%. We argue that higher divergence leads to better exploitation of nested parallelism, though the divergence present is data dependent.
We expected that small input sets would lead to poor performance on a GPU due to low utilization of the available TLP. However, nested parallelism starts outperforming non-nested parallelism as the degree of TLP increases, especially in the presence of a high degree of divergence.
In order to characterize the behavior of nested parallelism
across different GPU archi-
tectures, we used two Kepler and two Maxwell architectures,
running with the same input
sets, the same NVIDIA driver, and the same CUDA version. Figure
4.3 shows the runtime
execution for different input sets, with data values generating
a degree of 75% divergence
across the four different GPUs.
Although the Kepler GT 730 has the same number of CUDA cores per SM as the Kepler GTX Titan, it has far fewer SMs: the GTX Titan has 15 SMs, while the GT 730 has only 2. The number of SMs has a high impact on our ability to exploit nested parallelism. For instance, a launched child kernel will have to allocate a number of blocks on the remaining available SMs on the device. If the device does not have enough free SMs to leverage nested parallelism, then the benefits of nested parallelism will not be realized.
We present Equation 4.3 to characterize kernel overhead across different architectures,
Figure 4.2: Execution time of Sequential, non-nested parallelism
and nested parallelism
kernels on GTX Titan - Kepler architecture (lower is
better).
based on SM usage. NumberThreadsPerBlock and NumberBlocks are application specific, and MaxThreadsPerSM is architecture specific. If SMUsage surpasses the number of available SMs on the GPU, it will prevent us from effectively leveraging nested parallelism. We have found that we also benefit from the use of persistent threads to control SMUsage.
SMUsage = (NumberThreadsPerBlock ∗ NumberBlocks) / MaxThreadsPerSM    (4.3)
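As a worked example (illustrative launch parameters, not one of our measured configurations), consider a parent kernel launched with 256 threads per block and 120 blocks on a Kepler device, where MaxThreadsPerSM = 2048:

```
SMUsage = (256 ∗ 120) / 2048 = 15 SMs
```

Such a parent grid saturates all 15 SMs of a GTX Titan, leaving no SM free for child kernels, and far exceeds the 2 SMs of a GT 730. In both cases the parent's footprint must be reduced, for example with persistent threads, before nested parallelism can pay off.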
In our analysis, we characterized cudaDeviceSynchronize API calls for kernel synchronization using CUDA counters, such as clocks and the frequency rate. We verified that synchronization can negatively impact application performance when an application launches a small number of threads per block and a reduced number of blocks per kernel (i.e.,
Figure 4.3: Execution time of non-nested parallelism and nested parallelism across four GPUs (2 Kepler and 2 Maxwell GPUs).
poor TLP). However, we found that the cost of kernel synchronization can be hidden by increasing the TLP and loopWarpEff.
4.1.4 Memory Overhead
When using nested parallelism, global memory on the GPU is the only channel of communication between the parent and child kernels, and it may also be tied up by the device runtime for child kernel launches. The device runtime keeps track of kernel launches by creating a pool for all launches. Kernels that are not able to launch due to a lack of available resources remain in the pool of pending kernels. The size of this pool is referred to as th