Characterization and exploitation of nested parallelism
and concurrent kernel execution to accelerate high
performance applications
A Dissertation Presented
by
Fanny Nina Paravecino
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Computer Engineering
Northeastern University
Boston, Massachusetts
March 2017
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Dissertation Signature Page
Dissertation Title: Characterization and exploitation of nested parallelism and
concurrent kernel execution to accelerate high performance applications
Author: Fanny Nina Paravecino NUID: 001160686
Department: Electrical and Computer Engineering
Approved for Dissertation Requirements of the Doctor of
Philosophy Degree
Dissertation Advisor
Dr. David Kaeli    Signature    Date
Dissertation Committee Member
Dr. Qianqian Fang    Signature    Date
Dissertation Committee Member
Dr. Ningfang Mi    Signature    Date
Dissertation Committee Member
Dr. Norm Rubin    Signature    Date
Department Chair
Dr. Miriam Leeser    Signature    Date
Associate Dean of Graduate School:
Dr. Sara Wadia-Fascetti    Signature    Date
To science and the pursuit of answers through research.
Contents
List of Figures vi
List of Tables viii
List of Programs x
List of Acronyms xi
Acknowledgments xiii
Abstract of the Dissertation xiv
1 Introduction 1
1.1 Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Advanced Parallel Features . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Characterization of Advanced Parallel Features . . . . . . . . . . . . . . . 4
1.3 Challenges in Exploiting Parallel Execution Features . . . . . . . . . . . . 5
1.3.1 Nested Parallelism Challenges . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Concurrent Kernel Execution Challenges . . . . . . . . . . . . . . . 7
1.3.3 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 12
2.1 CUDA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 GPU Computing Architecture . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Fermi Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Kepler Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Maxwell Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Pascal Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Related work 24
3.1 Characterization of GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Modern GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Multiple Levels of Concurrency . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . 27
4 Characterization of advanced parallel features 29
4.1 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Control Flow Instructions . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Parallel Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.3 Child Kernel Launching and Synchronization . . . . . . . . . . . . . 35
4.1.4 Memory Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Concurrent Kernel Execution . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Resource Contention . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Exploitation of advanced parallel features 44
5.1 Dependent Nested Loop Workloads . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 Selective Matrix Addition . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Parallel Recursive Workloads . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 Breadth-First Search Algorithm . . . . . . . . . . . . . . . . . . . . 47
5.2.2 Prim's Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Validation with real-world applications 61
6.1 Connected Component Labeling . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Level-Set Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.2 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2.3 Finalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3 Summary of Analysis for Real-world Applications . . . . . . . . . . . . . 67
7 Summary 70
7.1 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Bibliography 73
List of Figures
2.1 Layers of abstraction between software application and GPU hardware. . . . 13
2.2 The CUDA model: a kernel, grid, and threads per block. . . . 14
2.3 The CUDA memory hierarchy. . . . 15
2.4 Branch divergence in the GPU. . . . 17
2.5 Work flow of the Grid Management Unit to dispatch, pause, and hold pending and suspended grids. . . . 20
2.6 Dynamic Parallelism. . . . 21
2.7 Hyper-Q. . . . 22
4.1 Control flow graphs of Program 4.1 for the Kepler GTX Titan and Maxwell GTX Titan Ti. . . . 32
4.2 Execution time of sequential, non-nested parallelism, and nested parallelism kernels on the GTX Titan (Kepler architecture); lower is better. . . . 37
4.3 Execution time of non-nested parallelism and nested parallelism across four GPUs (2 Kepler and 2 Maxwell GPUs). . . . 38
4.4 Execution time of sequential execution of kernels versus concurrent kernel execution for two different GPUs while varying input size (lower is better). . . . 40
4.5 Execution time of sequential execution of kernels versus concurrent kernel execution for two different GPUs with persistent threads execution (lower is better). . . . 42
4.6 Resource utilization for non-persistent thread kernels using different input data sets for the Maxwell GTX Titan X (lower is better). . . . 43
4.7 Resource utilization for persistent thread kernels using different input data sets for the Maxwell GTX Titan X (lower is better). . . . 43
5.1 Speedup evaluation of the nested parallelism implementation compared to the non-nested parallelism implementation for Selective Matrix Add on the Kepler GTX Titan. . . . 47
5.2 Graph representation using an adjacency list. . . . 48
5.3 BFS operations while traversing a graph with six vertices, starting at source vertex 0. . . . 51
5.4 BFS speedup analysis of naive nested parallelism and optimized nested parallelism versus the non-nested parallelism implementation on the Kepler GTX Titan. . . . 55
5.5 MST of graph G = (V, E), where V = {0, 1, 2, 3, 4, 5}, starting at source vertex 0. . . . 55
5.6 Prim's algorithm step-by-step work flow. Given a graph G = (V, E) with an initial source vertex 0, find the Minimum Spanning Tree (MST) using Prim's algorithm, where iteration 0 is the initialization of the MST with source vertex 0. . . . 59
6.1 Speedup comparison of the nested parallelism and non-nested parallelism implementations, running CCL on a Kepler GTX Titan. . . . 63
6.2 Speedup comparison of the nested parallelism and non-nested parallelism implementations, running Level-Set segmentation on a Kepler GTX Titan. . . . 68
List of Tables
2.1 NVIDIA GPU technology evolution [25]. . . . 12
2.2 Fermi chip GF110 versus Kepler chip GK110 [41]. . . . 19
2.3 A comparison of the features available on the four generations of NVIDIA GPUs considered in this thesis [25]. . . . 23
4.1 Irregular applications from two different GPU benchmark suites which exhibit control-flow-dependent nested loops. . . . 30
4.2 Recursive applications which exhibit parallel recursion. . . . 30
5.1 Irregular and recursive applications with potential for exploiting advanced parallel features on modern GPUs. . . . 44
5.2 Dynamic metrics for non-nested parallelism Selective Matrix Addition for different input sets on the Kepler GTX Titan. . . . 45
5.3 Execution time of Selective Matrix Add with different input sets for non-nested parallelism and nested parallelism implementations on the Kepler GTX Titan. . . . 46
5.4 Runtime execution analysis of Breadth-First Search with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the naive nested parallelism implementation on the Kepler GTX Titan. . . . 54
5.5 Runtime execution analysis of Breadth-First Search with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the optimized nested parallelism implementation on the Kepler GTX Titan. . . . 54
5.6 Runtime execution analysis of Prim's algorithm with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the non-nested parallelism implementation on the Kepler GTX Titan. . . . 56
5.7 Runtime execution analysis of Prim's algorithm with different input sets from the Ninth [87] and Tenth [88] DIMACS Challenges for the optimized nested parallelism implementation on the Kepler GTX Titan. . . . 60
6.1 Dynamic metrics for non-nested parallelism of CCL for different input sets on a Kepler GTX Titan. . . . 62
6.2 Dynamic metrics for nested parallelism of CCL for different input sets on a Kepler GTX Titan. . . . 62
6.3 Execution time of selective Matrix Add with different input sets for non-nested parallelism and nested parallelism implementations on the Kepler GTX Titan. . . . 62
List of Programs
4.1 Micro-benchmark kernel with irregular nested loop execution. . . . 31
4.2 Fibonacci recursive scheme. . . . 34
4.3 Fibonacci parallel recursive scheme in CUDA. . . . 35
4.4 Micro-benchmark kernel with irregular nested loop execution. . . . 41
5.1 Graph input using the DIMACS Challenge structure for file storage. . . . 49
5.2 Breadth-First Search (BFS) recursive implementation on a CPU; graph is a global variable which contains the vertex array and edge array. . . . 50
5.3 BFS non-recursive implementation on a GPU. . . . 52
5.4 BFS optimized nested parallelism implementation on a GPU. . . . 53
5.5 Non-nested parallelism implementation of Prim's algorithm on a GPU. . . . 57
5.6 Optimized nested parallelism implementation of Prim's algorithm on a GPU. . . . 58
List of Acronyms
GPGPU General-Purpose computing on Graphics Processing Units. The use of graphics processing units (GPUs) to perform computation in applications traditionally handled by the central processing unit (CPU).
GPU Graphics Processing Unit. The graphics processor in the system.
CCL Connected Component Labeling. An image segmentation algorithm that labels points connected by a similarity function.
LSS Level Set Segmentation.
SC Spectral Clustering.
SIMD Single-Instruction Multiple Data.
SIMT Single Instruction Multiple Thread.
API Application Programming Interface.
ISA Instruction Set Architecture.
TLP Thread-level Parallelism.
PTX Parallel Thread Execution.
MPI Message Passing Interface.
CUDA NVIDIA’s Compute Unified Device Architecture Framework.
OpenCL Open Computing Language.
SM Streaming Multiprocessor.
ECC Error Correcting Codes.
CTA Cooperative Thread Arrays.
PT Persistent Threads.
PDE Partial Differential Equation.
PDEs Partial Differential Equations.
BFS Breadth-first Search.
MST Minimum Spanning Tree.
Acknowledgments
It would not have been possible to write this doctoral thesis
without the help and support
of the kind people around me, to only some of whom it is
possible to give particular mention
here.
First of all, I would like to thank my parents, Fani and Dante, for their endless support through every single step of this journey. I thank my brother Reykjavil and my sister Lisbeth for keeping me on the path and making me believe that everything is possible. I thank my boyfriend Jose for his unlimited love and unwavering support, for which my mere expression of thanks likewise does not suffice.
This thesis would not have been possible without the help, support, and patience of my colleagues and collaborators. A special thanks to all my colleagues in the NUCAR group, especially Leiming, Fritz, Julian, and Xiangyu, for their contributions to the concepts and ideas, and for keeping me company on the doctoral journey. I would also like to thank our collaborators Dr. Qianqian Fang, Dr. Norm Rubin (NVIDIA), and Dr. Ningfang Mi for their constructive feedback on this dissertation.
It is with my deepest gratitude and warmest affection that I dedicate this thesis to my advisor, Dr. David Kaeli, who has been a constant source of knowledge and inspiration.
Abstract of the Dissertation
Characterization and exploitation of nested parallelism and
concurrent kernel execution to accelerate high performance applications
by
Fanny Nina Paravecino
Doctor of Philosophy in Computer Engineering
Northeastern University, March 2017
Dr. David Kaeli, Adviser
Over the past decade, GPU computing has evolved from the simple task of mapping data-parallel kernels to Single Instruction Multiple Thread (SIMT) hardware to a more complex challenge: mapping multiple complex, and potentially irregular, kernels to more powerful and sophisticated many-core engines. Recent advances in GPU architectures, including support for advanced features such as nested parallelism and concurrent kernel execution, further complicate the mapping task.
Improving application performance is a central concern for
software developers. To
start with, the programmer needs to be able to identify where
opportunities for optimization
reside. Many times the right optimization is tied to the
underlying nature of the application
and the specific algorithms used. The task of tuning kernels to
exploit hardware features can
become an endless manual process. There is a growing need to
develop characterization
techniques that can help the programmer identify opportunities
to exploit new hardware
features, and to port a broader range of applications to GPUs
efficiently.
In this thesis, we present novel approaches to characterize application behavior that can exploit nested parallelism and concurrent kernel execution, features introduced on recent GPU architectures. To identify bottlenecks that can be addressed through the exploitation of nested parallelism and concurrent kernel execution, we propose a set of metrics for a range of GPU kernels.
For nested parallelism, our approach focuses on irregular and recursive kernel applications. For irregular applications we define, implement, and evaluate three main runtime components: i) control flow workload analysis, ii) child kernel launching, and iii) child kernel synchronization. For recursive kernel applications, we define, implement, and evaluate: i) degree of thread-level parallelism, ii) work efficiency, and iii) overhead of kernel launches. For concurrent kernel execution, our characterization captures a kernel's launch configuration, its resource consumption, and the degree of overlapped execution. Our proposed metrics help us to better understand when to exploit nested parallelism and concurrent kernel execution.
We demonstrate the utility of our framework of metrics by focusing on a diverse set of workloads that include both irregular and recursive program behavior. This suite of workloads includes: i) a set of microbenchmarks that specifically target the new GPU features discussed in this thesis, ii) the NUPAR suite, iii) the Lonestar suite, and iv) real-world applications. By using our framework, we are able to speed up applications by 5x-23x as compared to GPU implementations that do not use these advanced parallel features.
Chapter 1
Introduction
In 1965, Gordon Moore proposed Moore's Law, which states that the number of transistors on a microprocessor doubles roughly every 18 months [1]. Since 1965, Moore's Law has been shown to be remarkably accurate, and microprocessors have doubled their capabilities every one to two years. However, the translation of increased transistor density into improved application performance remains a challenging endeavour. There is no silver bullet that automatically optimizes software, programming frameworks, and algorithms so that they can benefit from advances in hardware.
In many areas, performance improvements have been possible only due to modifications in algorithms, providing substantial performance gains that are much higher than those enabled by increasing processor speed alone. There are still many challenges that need to be addressed through the discovery of new parallel algorithms, specifically designed to take advantage of the potential power of parallel hardware while avoiding some of the bottlenecks that can occur on these platforms.
In this thesis, we will explore different mechanisms to
understand the behavior of the
parallel code (i.e., kernels) at different stages of the
computing stack, including multiple
compilation levels, as well as runtime execution. This work will
define a characterization
process of parallel execution that will guide and inform the
programmer on how best to
exploit new parallel features. Equipped with this knowledge, the
programmer can then
exploit parallelism at different grains of concurrency. We test
our characterization process
on a broad set of parallel applications, demonstrating the
utility of this knowledge to tune
CHAPTER 1. INTRODUCTION
applications to effectively exploit two recently introduced parallelization features: 1) nested parallelism and 2) concurrent kernel execution. We will also present a tuning mechanism to further improve application throughput.
1.1 Parallel Programming
Parallel programming provides a myriad of advantages over sequential programming, such as increased application throughput, improved utilization of hardware resources, and enhanced concurrent execution [2]. Given the wide range of parallel computing hardware platforms available today, spanning from massively parallel supercomputers to multicore smartphones, parallel execution has become the most effective path to improved performance. The need for high performance has been amplified by the rate at which raw data is being generated today, a rate that will continue to grow rapidly for the foreseeable future.
Commonly, the easiest way to write parallel code is to use a framework such as OpenMP. OpenMP is a simple, directive-based interface that offers incremental parallelization, allowing loops in serial code to be executed concurrently without changing their structure [3]. However, using OpenMP does not solve the problem of load imbalance, and the resulting performance gain is limited by Amdahl's law [4], which states that the overall improvement is bounded by the portion of the code that cannot be parallelized.
When working with a distributed system, the Message Passing Interface (MPI) [5] provides an effective programming model for expressing parallelization. MPI is commonly used on distributed memory systems that leverage message passing. However, one notable trend we are witnessing in the field of parallel scientific computing is the dramatic increase in the number of applications that utilize GPUs. Based on Flynn's widely used taxonomy [6, 3], the large number of cores on the Graphics Processing Unit (GPU) enables us to launch thousands of compute threads that execute in Single-Instruction Multiple Data (SIMD) fashion. SIMD provides parallelism by operating on multiple data streams concurrently [3]. Applications for GPUs are commonly developed using programming frameworks such as Khronos's Open Computing Language (OpenCL) [7, 8, 9] and NVIDIA's Compute Unified Device Architecture (CUDA) [10]. Both OpenCL and CUDA are based on the high-level
programming constructs of the C and C++ languages. The data-parallel and computationally intensive portions of an application are offloaded to the GPU for accelerated execution. These programming frameworks offer a rich set of runtime APIs, and allow the developer to write optimized kernels for execution on GPUs.
Researchers and developers have enthusiastically adopted the
CUDA programming
model and GPU computing for a diverse range of applications [11,
12, 13, 14]. Given the
varying degrees of parallelism present in many applications, we
are motivated to explore
advanced parallel features on the GPU.
1.1.1 Advanced Parallel Features
Recent advances in GPU architectures have pushed past a number of computational barriers, enabling researchers to leverage parallel computing to improve application throughput. Graphics hardware has evolved substantially over the years to include more functionality and programmability. NVIDIA's previous generation of GPUs, the Fermi family, has been used in a number of applications, promising peak single-precision floating-point performance of up to 1.5 TFLOPS. In contrast, NVIDIA's Kepler GK110 GPU offers more than 4.29 TFLOPS of single-precision computing capability. The newest features provided on Kepler enable programmers to move a wider range of applications to the CUDA framework.
Given the new features provided on recent hardware, exploiting them to improve overall execution throughput has become paramount. Thread-level parallelism provides impressive speedups for applications ported to the GPU. Moreover, the addition of nested parallelism improves the throughput of conditional-loop execution, which requires working at a finer thread granularity. Another new feature is concurrent kernel execution, which improves the utilization and runtime of multiple kernels, removing the overhead due to context switching. There is also a performance advantage provided by performing back-to-back kernel launches. In the CUDA API, kernel invocations are asynchronous. If a developer can call a kernel (or kernels) multiple times without any intervening synchronization (i.e., memory transfers or dependency checking), then the multiple kernel calls will be batched in the CUDA driver, and the application can overlap kernel execution on the GPU.
Given the level of sophistication provided in modern GPUs, we have focused our
work on the characterization of advanced parallel features in
order to guide the improvement
of application throughput. We consider optimization of
applications for two new features
available on NVIDIA Kepler GPUs and more recent GPU
generations:
• Nested Parallelism: modern GPUs add the capability to launch child kernels within a parent kernel. A pattern commonly found in many sequential algorithms is the nested loop. Nested parallelism allows us to implement a nested loop with variable amounts of parallelism.
• Concurrent Kernel Execution: modern GPUs provide the ability to run multiple kernels, assigned to different streams, concurrently. The Kepler, Maxwell, and Pascal architectures support up to 32 concurrent streams (as compared to 16 on Fermi). Each stream is assigned to a different hardware queue.
1.2 Characterization of Advanced Parallel Features
The utilization of high performance computing resources has also been hampered by the relative dearth of system software and tools for monitoring and optimizing performance. Profilers have evolved to provide application execution insights to the developer in order to improve application throughput. However, profilers are tightly tied to specific hardware and do not support the latest advanced parallel features, which makes tuning applications targeting modern GPUs a challenge.
New approaches to profiling/instrumentation are needed to understand application interaction with the latest hardware features. Binary instrumentation can be used on a GPU for performance debugging, correctness checks, workload characterization, and runtime optimization. Such techniques typically involve inserting code at the instruction level of an application during back-end compilation; binary translation is able to gather data-dependent application behavior.
Given the presence of data-dependent behavior in an application, we can characterize different execution patterns. Our focus is to characterize dynamically available parallelism, with the aim of evaluating implementations designed to exploit these execution patterns using advanced parallel features such as nested parallelism. Our characterization approach evaluates
the potential for optimization by analysing the impact of control, memory, and synchronization behavior on a GPU. As an illustrative example, our study targets a comprehensive understanding of the overhead of nested parallelism as currently supported on GPUs, in terms of kernel launch, control flow, nested synchronization, and algorithm overhead.
We also consider another form of parallelism available on modern GPUs: concurrent kernel execution. Just as a typical CPU application can consist of multiple functions, it is also common to have multiple GPU kernels present in a single GPU application. A GPU kernel is a function executed on a GPU device. Managing efficient concurrent kernel execution using independent thread blocks is cumbersome at best. In particular, this thesis targets a detailed understanding of the run-time costs of concurrent kernel execution in terms of kernel launch configuration, resource contention, and overlapped computation.
1.3 Challenges in Exploiting Parallel Execution Features
The software implementation of a GPU application can dramatically influence the application's performance. For example, performance will suffer if kernels are stalled due to control dependence. Delays also occur when data dependencies are encountered. GPU stream processors are more difficult to utilize effectively if the targeted applications present dynamic and frequent data dependencies (commonly present in sorting, recursion, dynamic programming, and evolutionary programming).
Along with the challenges of dynamic and global dependencies, many applications involve the execution of multiple kernels. The current generation of NVIDIA GPUs already supports concurrent execution of kernels using Hyper-Q technology, allowing concurrent execution of kernels from the same application or from different applications. In this thesis, we characterize concurrent kernel execution, and explore how to improve resource utilization and minimize kernel launch overhead. Presently, it is difficult to modify an application to effectively leverage nested parallelism and concurrent kernel execution. Addressing this gap is the major focus of this thesis.
1.3.1 Nested Parallelism Challenges
Depending on the application characteristics and the parallelization strategy, a kernel can exhibit a range of dynamic behaviors. This dynamic behavior is highly correlated with data-dependent parallel execution. Data dependencies are found in parallel loops and recursive calls, both of which are forms of nested Thread-level Parallelism (TLP).
Nested TLP can present a range of control flow behaviors. Explicit control flow constructs such as if-then-else or for-loop are fundamental constructs in any high-level programming language. In kernels with complex control flow, SIMD threads can follow different paths of execution, causing thread divergence. Thread divergence would seem to cause a paradox, since all threads in a basic group (e.g., a warp) must execute the same instruction on each cycle. Instead, if the threads in a warp diverge, the warp serially executes each branch path, disabling the threads that do not take that path. Warp divergence can dramatically degrade application performance.
Understanding control flow effects is a key step towards the characterization of nested parallelism. We have faced the following challenges when trying to exploit dynamic parallelism:
• For control flow analysis, it is important to quantify the impact of thread divergence by categorizing divergent and convergent paths in order to understand how performance is impacted. Control flow divergence can severely impact our ability to leverage nested parallelism. On-the-fly analysis of the control flow workload provides a better understanding of data-dependent applications. In previous work, control flow analysis has been performed statically.
• To properly characterize child kernel launches, we need to understand kernel launch parameters and device runtime management. There presently are no tools or profilers that can properly analyze nested-kernel launch overhead.
• Nested parallelism requires that parent kernels and child kernels explicitly synchronize with each other in order to assure consistent application execution. In order to perform child kernel synchronization, the device runtime has to save the state of parent kernels when they are suspended and yield to the child kernels at the synchronization points.
To our knowledge, there are no tools available that can measure
dynamic child kernel
synchronization.
1.3.2 Concurrent Kernel Execution Challenges
Enabling multiple kernels to execute concurrently on GPUs leads to the physical sharing of compute resources. Concurrent kernel execution can increase overall application throughput and can also reduce energy consumption. In order to deliver a performance improvement, there need to be sufficient resources on the GPU to launch concurrent kernels. In other words, concurrent kernel execution provides performance improvement through overlapped kernel computation. In order to achieve overlapped kernel computation, we need to understand the sources of resource contention and the effects of the kernel launch configuration.
Resource contention is heavily dependent on the application input. For example, a small input set might not stress the memory, whereas a large input set might. At the same time, resource contention is dependent on the amount of GPU hardware resources available. An application binary compiled and optimized for one GPU may perform poorly on another GPU due to resource contention.
The kernel launch configuration can give us clues about resource contention. Each kernel is launched with a set of variables called the launch configuration. Commonly, these variables include the number of threads per block, the number of thread-blocks per grid, the usage of shared memory, and the number of registers used. Most of the time, these variables are dictated by the number of data elements the kernel operates on. Depending on the GPU architecture, the resource usage implied by these variables can change dramatically. Developing a better understanding of resource contention is a key step towards the characterization of concurrent kernel execution. We face the following challenges when trying to exploit concurrent kernel execution:
• To properly understand resource contention, we need better control of the resources utilized by the kernel. We can bring software threads closer to the actual hardware thread execution by implementing persistent threads. Persistent threads break the one-to-one mapping between software threads and data elements; instead, the mapping is
dynamically defined by the availability of resources on the GPU.
There is no general
way to map any kernel to persistent threads; persistent threads
will not always provide
the best performance for every kernel.
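A common realization of persistent threads is the grid-stride loop, sketched below for a simple element-wise operation (the kernel names are illustrative, not code from this thesis): the grid is sized to what the GPU can keep resident, and each thread loops over many data elements.

```cuda
// Conventional mapping: one software thread per data element.
__global__ void flat(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Persistent-thread mapping: launch only as many blocks as the GPU
// can keep resident; each thread processes many elements.
__global__ void persistent(float *x, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)    // stride by the total thread count
        x[i] *= 2.0f;
}
```

Because the grid size is now a free tuning parameter rather than a function of n, it gives the developer a direct handle on how much of the device a kernel occupies.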
• Resource contention varies dramatically across different GPU architectures, driver versions, and CUDA framework versions. Furthermore, the compiler and driver can have a significant impact on kernel performance. To properly exploit concurrent kernel execution, we need to understand the interaction between the hardware, driver, compiler, and CUDA framework, which unfortunately is not disclosed by hardware vendors.
1.3.3 Benchmark Suite
Many applications—both academic research and industrial products—have been accelerated using parallel frameworks to achieve significant parallel speedup. Such applications encompass a variety of problem domains, including security surveillance, numerical linear algebra, and graph theory, among others. Of these many applications, we select a set of representative real-world applications to focus our discussion.
There has been considerable growth in interest in security surveillance image segmentation problems. This interest has created an increased need for high-performance image segmentation kernels. Different image segmentation approaches have used GPU computing in a wide variety of applications [15, 14, 16, 17]. Among the different image segmentation approaches, Connected Component Labeling (CCL) and Level Set Segmentation (LSS) are the most well-known applications.
CCL is a widely used image segmentation algorithm. It connects neighboring pixels based on their similarities. The dependencies between neighboring pixels and the continuous propagation of connectivity between pixels make CCL a highly sequential application. CCL is a great candidate for characterization of nested parallelism due to its dynamic propagation of connected components.
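The dynamic, data-dependent propagation at the heart of CCL can be sketched as a single label-propagation step (a simplified illustration, not the accelerated CCL implementation of this thesis): each pixel adopts the minimum label among its 4-connected neighbors of the same intensity, and the step is repeated until no label changes—the data-dependent iteration count is what makes CCL a natural fit for nested parallelism.

```cuda
// One propagation step over a w x h image. The host (or a parent
// kernel) re-launches this until *changed stays 0.
__global__ void propagate(const unsigned char *img, int *label,
                          int w, int h, int *changed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int idx = y * w + x, best = label[idx];
    const int dx[4] = {-1, 1, 0, 0}, dy[4] = {0, 0, -1, 1};
    for (int k = 0; k < 4; ++k) {            // scan the 4-connected neighbors
        int nx = x + dx[k], ny = y + dy[k];
        if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
        int nidx = ny * w + nx;
        if (img[nidx] == img[idx] && label[nidx] < best)
            best = label[nidx];              // adopt the smaller label
    }
    if (best < label[idx]) { label[idx] = best; *changed = 1; }
}
```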
LSS is an evolutionary image segmentation algorithm. Given an initial curve C, LSS expands or contracts C based on the evolution of the function f. The expansion of the curve is an outward evolution, and the contraction of the curve is an inward evolution. Every evolution cycle depends on the previous cycle in terms of computing the curve. The
dependencies between multiple pixels make LSS a great candidate for characterization of nested parallelism and concurrent kernel execution together.
To analyze how best to accelerate recursion, we have explored graph theoretic algorithms, including BFS and Prim's algorithm. In addition, we evaluated selected Lonestar [18] and NUPAR [19] benchmarks in this thesis. In summary, we have used two real applications and four different benchmark applications as we developed the characterization schemes in this thesis. Next, we outline the contributions and describe the organization of the remainder of this thesis.
1.4 Contributions of the Thesis
In this thesis, a number of key contributions towards the deep
analysis and exploitation
of advanced parallel features are presented. The key
contributions are summarized below:
• We characterize parallel applications, identifying when we can leverage nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define three workload components that can guide the developer on how best to leverage nested parallelism. To the best of our knowledge, ours is the first work to define, implement, and evaluate these three components in combination: i) control flow workload analysis, ii) child kernel launching, and iii) child kernel synchronization.
• We develop NVIDIA SASS instrumentation handlers to characterize data-dependent application behavior. We use the NVIDIA assembly code SASS Instrumentor (SASSI) to evaluate dynamic application behavior. We provide a handler to profile and measure binary execution for control-flow-dependent loops. Our handler can collect and measure the efficiency of control-dependent loops.
• We characterize recursive parallel workloads, identifying when we can leverage nested parallelism available on NVIDIA GPUs (Kepler and Maxwell families). We define three workload components that can guide the developer on how best to leverage nested parallelism in the case of parallel recursion. We evaluate three components: i) the degree of thread-level parallelism, ii) the work efficiency, and iii) the overhead of
kernel launches. Furthermore, we propose a new approach to
increase thread-level
parallelism in order to increase work efficiency and reduce the
number of recursive
kernel launches.
• We characterize the execution of concurrent kernels on NVIDIA GPUs (Kepler and Maxwell families). Our characterization captures a kernel's launch configuration, its resource consumption, and the degree of overlapped execution. Our proposed metrics help us better understand when to use concurrent kernel execution.
• We propose, implement, and evaluate kernels with persistent threads as a mechanism to control resource contention for concurrent kernel execution on GPUs. Our results show that kernels with persistent threads can be beneficial for identifying peak resource contention. Unfortunately, this does not directly lead to an overall performance improvement.
• Our proposed workload metrics for irregular applications and parallel recursive kernels have been applied to a number of CUDA kernels taken from the problem domains of image processing, linear algebra, and graph theory. For these performance-hungry applications, we achieve 1.3x to more than 100x speedup, as compared to flat GPU kernels.
• We compare state-of-the-art image segmentation applications, including connected component labeling and level set segmentation, exploring both nested parallelism and concurrent kernel execution. Our accelerated connected component labeling has been presented at the International Conference on Computer Vision and Graphics (ICCVG) [15]. In addition, it has also been presented at the GPU Technology Conference (GTC) [20]. Our work on fast level set segmentation exploiting advanced parallel features has been presented at the Irregular Applications: Architectures and Algorithms Workshop (IA3) [21] and featured as a poster at the Programming and Tuning Massively Parallel Systems Summer School (PUMPS). Furthermore, our accelerated connected component labeling has been ported to OpenCL. We have analyzed the benefits of advanced parallel features on AMD cards, and this work has been presented at the
3rd International Workshop on OpenCL (IWOCL) [22]. Both of these real-world applications are part of the NUPAR benchmark suite presented at the International Conference on Performance Engineering (ICPE) [19].
1.5 Organization of Thesis
The central focus of this work is to characterize nested
parallelism and concurrent kernel
execution in a systematic way that works well for any GPU and
any application. The
remainder of the thesis is organized as follows: Chapter 2
presents background information
on GPU architecture, specifically the NVIDIA GPU architecture,
the parallel framework
CUDA, and the NVIDIA SASSI instrumentation framework. In Chapter
3, we present
related work in the area of characterization of parallel
kernels, nested parallelism, and
concurrent kernel execution on GPU devices. In Chapter 4, we discuss the characterization of nested parallelism for conditional nested loops, parallel recursion, and concurrent kernel execution on NVIDIA Kepler and Maxwell architectures. Next, in
Chapter 5 we present our
benchmark kernels that are used throughout this thesis to
leverage advanced parallel features.
In Chapter 6, we present real applications that leverage our
framework to effectively exploit
advanced parallel features. In Chapter 7, we conclude the thesis
and summarize our work.
We also suggest directions for future work.
Chapter 2
Background
As we enter the era of GPU computing, demanding applications with substantial parallelism can leverage the massive parallelism of GPUs to achieve superior performance and efficiency. Today GPU computing enables applications that were previously thought to be infeasible because of long execution times. Enjoying the benefits of Moore's Law [1, 23, 24], NVIDIA GPUs have evolved steadily since 2001. Table 2.1 shows the evolution of NVIDIA graphics cards since the first programmable GPU was released.
Date Product Transistors CUDA cores
2001 GeForce 3 60 million -
2002 GeForce FX 125 million -
2004 GeForce 6800 222 million -
2006 GeForce 8800 681 million 128 (first support for CUDA programming)
2007 Tesla T8, C870 681 million 128
2008 GeForce GTX 280 1.4 billion 240
2008 Tesla T10, S1070 1.4 billion 240
2009 Fermi 3.0 billion 512
2012 GK104 Kepler 3.5 billion 1536
2012 GK110 Kepler 7.0 billion 2688
2014 GM204 Maxwell 5.2 billion 2816
Table 2.1: NVIDIA GPU technology evolution [25]
With the rapid evolution of GPUs from a configurable graphics
processor to a general
purpose programmable parallel processor, the ubiquity of GPUs in
every PC, laptop, desktop,
and smartphone was imminent. A large community of researchers and developers has adopted the CUDA programming framework for a diverse range of applications [26, 27, 28].
The CUDA runtime on an NVIDIA GPU enables us to execute programs developed in high-level languages, including C, C++, Fortran, OpenCL, DirectCompute, and others [26, 25, 2]. The nature of CUDA is to try to preserve elements of common sequential programming and extend them to parallel thread execution. CUDA presents a Single Instruction Multiple Thread (SIMT) abstraction with a straightforward set of configurations for expressing parallelism.
2.1 CUDA Model
The CUDA model acts as a bridge between an application and its
implementation
on available hardware [29]. There are a number of different
layers that lie between the
application and the hardware. Figure 2.1 shows the different
layers of abstraction between a
software implementation and the hardware level. The programming model provides a logical view of the specific computing architecture.
(Figure content, top to bottom: software application, CUDA runtime, CUDA driver, GPU hardware.)
Figure 2.1: Layers of abstraction between software application and GPU hardware.
CUDA enables the developer to write parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores. CUDA divides execution down,
hierarchically, using parallel abstractions such as kernels,
blocks, and threads per block (see
Figure 2.2). A kernel executes a sequential program on a set of parallel threads. Each thread has its own registers and private local memory. Each block allows communication among its threads through shared memory. Blocks communicate with each other using global memory. This memory hierarchy is illustrated in Figure 2.3.
Figure 2.2: The CUDA model: a kernel, grid, and threads per
block.
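A minimal sketch of this hierarchy (a standard SAXPY-style kernel, not code from this thesis; n, d_x, and d_y are assumed host-side variables): the kernel body is the per-thread sequential program, and the launch configuration defines the grid of thread blocks.

```cuda
// A kernel: a sequential program executed by every thread in the grid.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Global thread index, composed from block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host side: choose threads per block, then enough blocks to cover n.
int threads = 256;
int blocks  = (n + threads - 1) / threads;
saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
```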
Other memories included in the CUDA memory model include:
• texture memory, specialized for 2D read-only coalesced accesses
• constant memory, designed to support read-only accesses from different threads across blocks.
As mentioned in Chapter 1, the CUDA model follows a SIMT architecture to manage and execute threads in groups of 32 called warps. Even though all threads in a warp must execute the same instructions, much as in a SIMD architecture, there are some key features that differentiate a GPU from traditional SIMD:
(Figure content: each thread has registers and local memory; threads within a block share shared memory; all blocks access global, constant, and texture memory.)
Figure 2.3: The CUDA memory hierarchy.
• Each thread in the warp has its own instruction address
counter.
• Each thread has its own register state.
• Each thread can have an independent execution path.
Although the CUDA model enables each thread in a warp to follow a different execution path, divergent behavior degrades performance, since the divergent paths within a warp are executed serially. Control flow instructions (e.g., if-then-else, for, while) are among the fundamental constructs in CUDA programming that cause this undesired behavior, called warp divergence.
2.1.1 Divergence
The use of control flow instructions is unavoidable in any application. Modern CPUs include complex hardware to perform branch prediction [30, 31]. Hardware branch predictors speculate on the direction of conditional control flow in programs [32, 33, 34]. If the predictor is correct, branch execution incurs little or no performance penalty. If the prediction is
not correct, the CPU stalls for a number of cycles as the
instruction pipeline is flushed,
and instruction fetching resumes at the correct program counter.
In comparison, GPUs are
high-throughput, but lack complex branch prediction mechanisms
[35, 36, 37]. Execution
on an NVIDIA GPU using the CUDA execution model assumes that all
threads in a warp
must execute identical instructions on the same cycle. Executing
complex control flow
typically results in divergent execution between the threads in
the same warp [38].
Recent GPUs are designed to better handle control flow. The
modern GPU hardware
supports condition codes (CC) and CC registers that contain the
4-bit state vector (sign,
carry, zero, overflow) used in integer comparisons [39]. The CC
registers can direct the flow
of execution via predication or divergence. Predication allows
(or suppresses) the execution
of instructions on a per-thread basis within a warp, while
divergence supports conditional
execution of longer instruction sequences.
Due to the additional overhead of managing divergence and
convergence, the compiler
uses predication for short instruction sequences. The effect of
most instructions can be predicated on a condition; if the condition is not true, the
instruction is suppressed. Predication
works well for small fragments of conditional code, especially
for if statements with no
corresponding else. For larger conditional code segments,
predication becomes inefficient
because every instruction is executed, regardless of whether it
will affect the computation.
When the length of the conditional code fragment is long and the
cost of predication would
exceed the benefits, the compiler will generate conditional
branches. If the threads in a warp
diverge due to a data-dependent conditional branch, the warp
serially executes each branch
path taken, disabling threads that are not on that path. Once
all paths complete, all threads
re-converge to the original execution path. Figure 2.4
illustrates how warp divergence is
handled on a GPU.
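A minimal illustration of the effect (illustrative kernels, not code from this thesis): in the first kernel, adjacent lanes of the same warp take different branches, so the two paths serialize; in the second, the branch condition is uniform across each 32-thread warp and no warp diverges.

```cuda
// Divergent: even and odd lanes of a warp take different branches,
// so the two paths execute one after the other with lanes masked off.
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) x[i] = sinf(x[i]);   // even lanes
    else            x[i] = cosf(x[i]);   // odd lanes
}

// Warp-aligned: the condition is constant within each 32-thread warp,
// so every warp executes a single path.
__global__ void uniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) x[i] = sinf(x[i]);
    else                   x[i] = cosf(x[i]);
}
```

Both kernels compute comparable work, but only the first pays the serialization cost described above.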
Although warp divergence can have a negative impact on
application throughput, this
impact varies dramatically across GPU architectures. In the
following sections we address
divergent execution for the latest GPU generations, from the
Fermi GPU architecture to the
Pascal GPU architecture.
Figure 2.4: Branch divergence in the GPU
2.2 GPU Computing Architecture
A Streaming Multiprocessor (SM) is the centerpiece of the NVIDIA GPU architecture. A thread block is scheduled on a single SM, and once it is scheduled on the SM, it remains there until execution completes. An SM can hold more than one thread block at the same time. Registers and shared memory are scarce resources in the SM. These resources have to be partitioned among all threads resident on an SM. Each SM contains hundreds of CUDA cores, and each GPU device contains tens of SMs.
Logically, all threads in a block run in parallel, but not all threads can execute physically at the same time. Therefore, different blocks may make progress at different rates. Since warps are the atomic unit of execution on the GPU, many warps can be scheduled on an SM, but depending on the SM resource availability, not all scheduled warps will be active. If a warp is idle, the SM schedules another warp from any block that is resident on the same SM. The benefit of this switching between concurrent warps is that it incurs essentially no overhead, since each warp's state is kept resident on the SM. Given the importance of determining the right warp granularity, we would like to quickly
find the best grid configuration for any application.
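One quick way to approach this is the occupancy API introduced in CUDA 6.5, sketched here for a hypothetical myKernel operating on n elements: it suggests a block size that maximizes the number of resident warps given the kernel's register and shared memory usage.

```cuda
// Ask the runtime for the block size that maximizes occupancy for
// this kernel's resource footprint, then size the grid to cover n.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   myKernel,
                                   /*dynamicSMemSize=*/0,
                                   /*blockSizeLimit=*/0);
int gridSize = (n + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_data, n);
```

Maximum occupancy does not always equal maximum performance, so the suggested configuration is a starting point for the kind of characterization developed in this thesis, not a final answer.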
2.2.1 Fermi Architecture
The NVIDIA Fermi (chip GF110) GPU was released in 2009. Fermi introduced an increased number of CUDA cores per SM, more shared memory space, configurable shared memory, and Error Correcting Codes (ECC) on main memory and caches. Each SM in Fermi has 32 CUDA processor cores, 16 load/store units, and four special function units (SFUs). Fermi has a 64-KByte register file, an instruction cache, two multi-threaded warp schedulers, and two instruction dispatch units [40].
The SIMT instructions control the execution of an individual
thread, including arithmetic,
memory access, and branch/control flow instructions. Fermi
extends SIMT to control flow
with support for indirect branches and function-call
instructions. With the improvements
introduced in the Fermi Parallel Thread Execution (PTX) 2.0
Instruction Set Architecture
(ISA), individual thread control flow can predicate
instructions.
2.2.2 Kepler Architecture
A number of new features were introduced in Kepler as compared to the earlier Fermi GPU architecture. Table 2.2 compares some of these features for Fermi (an instance of chip GF110) and Kepler (an instance of chip GK110).
Kepler GK110 comprises up to 15 Kepler SM (SMX) units, four warp schedulers, and eight instruction dispatch units. Thus, it can issue and execute four warps simultaneously. Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 load/store units, and 32 special function units, each of which can compute a sine, cosine, reciprocal, or square root per thread per clock [42]. Kepler GK110 can provide up to 4.29 TFLOPS single-precision and 1.43 TFLOPS double-precision floating point performance [43].
In addition to an increase in the number of CUDA cores per SM
and a dramatic increase
in the number of registers per thread, Kepler (compute
capability 3.5 or higher) introduced a
number of new features to further simplify parallel program
design.
Fermi (chip GF110) Kepler (chip GK110)
SPs per SM 32 192
Threads per SM 1536 2048
Thread blocks per SM 8 16
Warp schedulers per SM 2 4
Dispatch Units per SM 2 8
Shared Memory/L1 cache 16/48KB 16/32/48KB
32-bit Registers per SM 32K 64K
Registers per thread 63 255
Table 2.2: Fermi chip GF110 versus Kepler chip GK110 [41]
2.2.2.1 Dynamic Parallelism
Dynamic parallelism is an extension to the CUDA programming model, enabling CUDA kernels to create, and synchronize with, new kernels entirely on the GPU. With this feature, any kernel can launch a child kernel and manage inter-kernel dependencies [35].
To manage the execution of dynamic parallelism, Kepler added a new unit known as the Grid Management Unit (GMU) [44, 42, 45], which is able to dispatch, as well as pause the dispatch of, new grids. The GMU can also queue pending grids and suspend running grids. A grid includes all thread-blocks associated with a kernel. Grids are launched in the order that they are received.
In previous GPU generations, the host launched work through the Compute Work Distributor (CWD) unit [42, 2]; the CWD tracks the blocks issued and sends them to the SMs for execution. In Kepler and more recent GPU generations, the GPU launches work from the host or the device using the GMU. The GMU communicates with the CWD using a bidirectional link to prioritize or suspend/pause grids. The GMU also has a direct connection to the SMs to support dynamic parallelism, and through this connection device kernels can dispatch child grids.
The aim of the GMU is to effectively manage grid dispatching, in such a way that, if we need to free up resources for child kernels to execute, the GMU will suspend parent kernel grids [42, 45]. The device runtime will reschedule the grids on different SMs in order to
better manage resources. Figure 2.5 illustrates GMU interaction
with the CWD and the SM.
Figure 2.5: Work flow of the Grid Management Unit to dispatch,
pause, and hold pending
and suspended grids.
Dynamic parallelism enables threads to create work directly on the GPU. This can remove the need to transfer execution control and data between the host and the device. The child kernel launch decisions are made at runtime by threads executing on the device. The CUDA model controls the synchronization and communication between a parent kernel and its child kernels. The local memory and registers associated with a parent thread are still only accessible by the parent thread, and are not accessible by other threads or any child threads. Communication with a child thread is only through global memory.
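A minimal sketch of this pattern (the kernel names and per-row data layout are hypothetical, not taken from this thesis): each parent thread reads a data-dependent size at runtime and launches a child grid shaped accordingly, with all parent-child data exchanged through global memory. It must be compiled with `-rdc=true -lcudadevrt`:

```cuda
__global__ void process_row(float *row, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) row[i] *= 0.5f;          // per-row work in global memory
}

// Parent: one thread per row; the child grid's size is decided at
// runtime from data (the row length), not at host launch time.
__global__ void parent(float **rows, const int *lens, int nrows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nrows && lens[r] > 0) {
        int threads = 128;
        int blocks  = (lens[r] + threads - 1) / threads;
        process_row<<<blocks, threads>>>(rows[r], lens[r]);  // child launch
    }
}
```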
Using dynamic parallelism, data-dependent parallel work can be
generated inline within
a kernel at runtime. These kernels take advantage of the GPU’s
hardware scheduler and load
balancer to dynamically adapt execution to make data-driven
decisions. Figure 2.6 shows
how dynamic parallelism works on a GPU.
Figure 2.6: Dynamic parallelism on the GPU.
2.2.2.2 Hyper-Q
Hyper-Q increases the total number of work queues between the host and the device by allowing 32 simultaneous hardware-managed connections (as compared to the single connection available with Fermi). Figure 2.7 illustrates the Hyper-Q feature in Kepler.
Figure 2.7: Hyper-Q
2.2.3 Maxwell Architecture
NVIDIA's Maxwell generation provides only a few enhancements over the previous GPU generation, with a focus on energy efficiency. In addition to retaining features such as dynamic parallelism and concurrent kernel execution, the Maxwell generation delivers 2x the performance per watt of the Kepler generation [46].
The Maxwell GTX 980 Ti (chip GM200) comprises 22 Maxwell SMs (SMMs). Each SMM has 128 CUDA cores, four warp schedulers, eight instruction dispatch units, and eight texture units. Overall, a Maxwell SM looks very similar to a Kepler SM, except that Maxwell provides fewer CUDA cores per SM.
Another major change, as compared to the Kepler architecture, is in the memory hierarchy. Shared memory and the L1 cache are no longer combined. Shared memory is dedicated, and the L1 cache is combined with the texture cache. The Maxwell GTX 980 Ti ships with up to 96KB in its shared memory unit, and 48KB for the combined L1/texture cache.
2.2.4 Pascal Architecture
NVIDIA introduced the Pascal architecture in 2016. The NVIDIA GTX 1080, which includes a Pascal GP104, comprises 7.2 billion transistors and 2560 single-precision CUDA cores.
GDDR5X memory is introduced with the GP104, providing a 256-bit memory interface and delivering 43% higher memory bandwidth than NVIDIA's prior GeForce GTX 980 GPU.
The GP104 GPU consists of four Graphics Processing Clusters (GPCs), 20 Pascal SMs, and eight memory controllers. Each GPC has a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, four warp schedulers, eight instruction dispatch units, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units [47]. A comparative feature analysis of four NVIDIA GPU generations is presented in Table 2.3.
GPU GTX 590 GTX Titan GTX 980 Ti GTX 1080
Family Fermi Kepler Maxwell Pascal
Chip GF110 GK110 GM200 GP104
Compute Capability 2.0 3.5 5.2 6.1
SM 16 14 22 20
CUDA cores per SM 32 192 128 128
Total cores 512 2688 2816 2560
Global Mem. 1474 MB 6083 MB 6083 MB 8113 MB
Shared Mem. 48 KB 48 KB 48 KB 48 KB
Threads/SM 1536 2048 2048 2048
Threads/block 1024 1024 1024 1024
Clock rate 1.26 GHz 0.88 GHz 1.29 GHz 1.84 GHz
TFLOPS 1.5 4.29 6.50 9.00
Table 2.3: A comparison of the features available on the four
generations of NVIDIA GPUs
considered in this thesis [25].
Chapter 3
Related work
In this chapter, we review related work in the areas of GPU
characterization, with special
emphasis on modern GPU features. We focus our literature review
on advanced parallel
features for multiple levels of concurrency, and different
grains of parallelism.
3.1 Characterization of GPUs
There have been studies focusing on GPU characterization to better understand the improvements made during the evolution of these devices. This evolution started with GPUs as rendering tools, and spans to today, where GPUs act as advanced general purpose accelerators [48, 49, 50, 51].
An early characterization study by Jia et al. [48] in 2012 focused on characterizing cache memories on GPUs. Starting with the NVIDIA Fermi and the AMD Fusion, GPU vendors have included demand-fetching in their data caches. Earlier GPU generations were focused on graphics rendering, providing local memories instead of demand-fetched caches. With the introduction of demand-fetched caches, new challenges arrived: 1) understanding the benefits of cache memories, and 2) a lack of intuition for developers on how to use them efficiently. They addressed these two problems and provided a mechanism to efficiently utilize cache memories.
Wong et al. [49] presented a characterization of Tesla GPUs
through the execution of a
set of microbenchmarks. Their analysis provided insights about
the characteristics of the
GPUs beyond the information provided by NVIDIA. Another attempt
to characterize the
internals of a GPU was presented by Torres et al. [50]. In their
study, they focused on the
impact of the CUDA tuning techniques on the Fermi architecture.
Jiao et al. [51] presented
a characterization study of GPUs to evaluate power efficiency,
and the correlation between
application performance and power consumption.
A large body of work studies how to leverage GPUs effectively by understanding their characteristics, for both older [52, 53] and modern [54, 55, 56] generations of GPUs. While Kerr et al. [52] focused on understanding the behavior of PTX 1.4, Lee et al. [53] developed an exhaustive performance analysis to capture performance gaps between an NVIDIA GTX280 Tesla-architecture GPU and an Intel Core i7-960. In this thesis, we focus our attention primarily on the characterization of more modern GPUs.
3.1.1 Modern GPUs
Kayiran et al. [54] explored the impact of memory accesses during concurrent thread execution and the resulting application performance. They provided a thorough evaluation of 31 applications - from the CUDA SDK to Map-Reduce problems - to understand resource contention in caches, networks, and memory. Furthermore, they proposed a dynamic Cooperative Thread Array (CTA) scheduling mechanism, which regulates thread-level parallelism by allocating an optimal number of CTAs per application.
Mei et al. [55] provided microbenchmarks to dissect the device memory hierarchy and characterize the organization of the cache systems of different GPUs on the Fermi, Kepler, and Maxwell architectures. Ukidave et al. [19] provided a set of application benchmarks to analyze the latest features on modern GPUs, such as nested parallelism, concurrent kernel execution, atomic operations, and shuffling.
In the next section, we review the characterization of multiple levels of concurrency and thread granularity on modern GPUs.
3.2 Multiple Levels of Concurrency
3.2.1 Nested Parallelism
One of the earliest characterizations of nested parallelism was presented by DiMarco et al. [57] in 2013. They aimed to quantify the performance gains of the dynamic parallelism introduced by CUDA 5 and the Kepler architecture. Their exploration covered two applications: K-means and hierarchical clustering. Their results showed that a finer granularity of TLP provides a more efficient way to leverage nested parallelism than merely avoiding CPU-GPU synchronization.
In 2014, Wang et al. [58] presented an evaluation of the impact of nested parallelism in unstructured GPU applications on the Kepler architecture. Irregular applications suffer from workload imbalance, which provides a good target for optimization using fine-grained threads contained in coarse-grained blocks. Their characterization focused on control flow and memory access measurements. Two metrics were proposed in their study: i) warp execution efficiency, and ii) load/store replay overhead. Although they provided a thorough analysis of nested parallelism for control flow instructions and memory accesses, they did not take into consideration the synchronization cost between parent and child kernels when evaluating the benefits of nested parallelism. Furthermore, they did not consider a finer-grained classification of control flow divergence and its impact on application performance.
In 2015, Wang et al. [59] continued their work on characterizing
nested parallelism in
GPUs. They proposed a new mechanism called Dynamic Thread Block
Launch (DTBL), a
new execution model to support irregular applications on GPUs.
DTBL allows coalesced
allocation of child kernels and parent kernels.
Yang et al. [60] analyzed a set of optimized parallel benchmark applications that contain loops. Their analysis covers the degree of TLP, and they proposed a framework called CUDA-NP to exploit nested parallelism in CUDA. CUDA-NP is a pragma-based compiler approach that generates GPU kernels with nested parallelism. Basically, their approach reads the OpenMP-like pragma directives in the input kernels and creates the respective child kernels with a grid configuration based on the degree of parallel-loop TLP. However, they did not analyze
the implications of parent-child synchronization. Furthermore, they relied on the developer's knowledge to identify potential parallel loops that can exploit nested parallelism, without providing any insight into the behavior of the underlying architectures.
Further studies [61, 62, 63] characterized nested parallelism based on the irregularity of an application. Applications containing parallel loops and recursive calls are well suited to leverage nested parallelism. Zhang et al. [61] adapted two irregular and data-driven problems—breadth-first search and single-source shortest path—to leverage nested parallelism. Li et al. [62] proposed parallelization templates to leverage nested parallelism for tree and graph problems. These types of problems present irregular nested loops and parallel recursive computation. Wang et al. [63] provided insights on leveraging nested parallelism for general irregular applications. However, none of these approaches provided a holistic analysis of the implications of leveraging nested parallelism and its effects across different architecture/compiler versions.
3.2.2 Concurrent Kernel Execution
In early GPU architectures, concurrent kernel execution was
poorly supported. In 2011,
Wang et al. [64] proposed a mechanism to exploit concurrent
kernel execution through
manual context funnelling. They compared CUDA 4 automatic
context funnelling versus
their approach for Fermi architectures. They showed that manual
control of shared resources
might provide slight improvements in application performance.
However, they did not
discuss resource contention based on the interplay between
concurrent kernels.
In 2012, Wende et al. [65] provided a kernel reordering mechanism to exploit concurrent kernel execution on Fermi architectures. Their execution model partitions kernels into small-scale computations and, using producer-consumer principles, manages GPU kernel invocations after reordering them. Later, in 2014, Wende et al. [66] continued their work on the exploitation of concurrent kernel execution, and proposed a characterization of the NVIDIA Hyper-Q feature for the Kepler architecture, using an offloading mechanism to allow running multiple kernels simultaneously. Their analysis explored synthetic benchmarks and developed a performance evaluation, complementing their previous work on kernel reordering.
Gregg et al. [67] proposed a kernel scheduler mechanism called KernelMerge that allows two OpenCL kernels to run concurrently on AMD cards.
KernelMerge takes into
consideration kernel configuration and investigates the
interaction between concurrent
kernels to analyze interference for sharing resources.
Since the Kepler architecture, NVIDIA has provided a modern hardware design to adequately support concurrent kernel execution. In 2014, Jog et al. [68], taking the next logical step, proposed an application-aware memory system for fair and efficient execution of concurrent applications. Their approach provides memory awareness through a new scheduling mechanism that serves memory requests in a round-robin fashion. They considered four metrics based on the Instructions Per Cycle of each application.
However, they did not consider resource contention on registers, nor the grid configuration. Furthermore, they focused on memory-bound applications, and did not discuss arithmetic-bound applications.
In 2016, Luley et al. [69] proposed a framework to exploit
NVIDIA’s Hyper-Q. Their
framework oversubscribes kernels and defragments memory
transfers to effectively overlap
accesses with computations. Furthermore, they proposed multiple
mechanisms to reorder
kernels with the aim of improving application throughput. Although they studied the impact of memory transfers, they did not analyze resource contention between concurrent kernels, which can be a key bottleneck when attempting to leverage concurrent kernel execution.
Chapter 4
Characterization of advanced parallel
features
Acceleration of high performance applications that exhibit complex and irregular execution behavior is an ever-growing open problem. A naive port of an irregular application to a parallel platform often leads to underutilization of hardware resources, significantly limiting performance. In this chapter, we present a characterization of advanced parallel features on a GPU that can be effectively exploited to tune any application with a high degree of irregularity.
4.1 Nested Parallelism
Irregularity in an application can result in poor workload
balance when attempting
to exploit fine-grained thread-level parallelism. We next
consider examples of high-level
language behavior that can suffer from a lack of inherent
thread-level parallelism.
A number of irregular applications contain control-flow dependent nested loops. This kind of irregularity can inhibit thread-level parallelism, since independence can only be deduced at runtime. Because many loops tend to be data dependent, GPU hardware vendors
GPU hardware vendors
introduced support for nested parallelism, leveraging nested TLP
through the addition of a
new level of parallelism. We have studied a number of irregular
applications to identify how
frequently control flow dependent nested loops are used. Table
4.1 shows characterization
data from two different GPU benchmark suites, where control flow
dependent nested loops
occur.
Application                      Benchmark Suite    Control Flow Dependent Nested Loops
Barnes Hut                       Lonestar [18]      6
Delaunay Mesh Refinement         Lonestar [18]      7
Points-to Analysis               Lonestar [18]      31
Survey Propagation               Lonestar [18]      7
Single-Source Shortest Paths     Lonestar [18]      2
Connected Component Labeling     NUPAR [19]         1
Level Set Segmentation           NUPAR [19]         1

Table 4.1: Irregular applications from two different GPU benchmark suites which exhibit control flow dependent nested loops.
We have also explored recursive algorithm patterns that can benefit from nested parallelism. Parallel recursion is a way to efficiently execute recursive algorithms that can spawn multiple threads per recursive call. Before the introduction of nested parallelism on the GPU, recursive solutions required both GPU and CPU intervention, or an implementation of the GPU kernel devoid of recursive kernel calls. However, constant communication between the CPU and the GPU produces memory copies and results in communication overhead. In addition, most recursive solutions are data dependent, so it is challenging to anticipate the amount of overhead that will be introduced. On the other hand, we cannot always use a single GPU kernel call version for all recursive algorithms. Table 4.2 shows a list of recursive kernels that can be expressed as parallel recursion.
Application            Benchmark Suite    Control Flow Dependent    Recursive
                                          Nested Loops              Calls
Breadth-First Search   Lonestar           0                         1
Prim's Algorithm       -                  1                         1

Table 4.2: Recursive applications which exhibit parallel recursion.
__global__ void singleKernel(int *A, int *B, int *C, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (A[idx * cols] == 1)
    {
        for (int i = 0; i < cols; i++)
            C[idx * cols + i] = A[idx * cols + i] + B[idx * cols + i];
    }
}
Program Listing 4.1: Micro-benchmark Kernel with irregular
nested loop execution.
With nested parallelism, a recursive solution can be naturally ported to the GPU and can avoid CPU-GPU communication overhead. Nonetheless, the recursive spawning of threads does not always generate enough TLP to exploit the GPU, and it can lead to substantial kernel launch overhead and hardware underutilization.
Exploiting nested parallelism, either in the presence of control-flow dependent nested loops or parallel recursion, is not straightforward. A nested loop can include control flow divergence, and a recursive solution can lead to poor TLP and low warp efficiency. At the same time, nested synchronization can turn into a large number of thread stalls and global communication between parent and child kernels. Next, we explore each of these factors and present metrics to quantify their impact on kernel performance.
4.1.1 Control Flow Instructions
Mapping parallel programs exhibiting arbitrary control flow onto
parallel units can be a
difficult task. There is generally no guarantee that parallel
units will execute the same control
flow path. For instance, Program 4.1 presents a micro-benchmark kernel that executes a loop based on input parameter data. Figure 4.1 illustrates the dynamic execution of the micro-benchmark on two architectures: a Kepler GTX Titan and a Maxwell GTX Titan Ti. Both executions are run with the same input parameters, the same NVIDIA
driver, and the same CUDA version. However, the number of
instructions executed varies
along the control flow path.
Figure 4.1: Control flow graphs of Program 4.1 for Kepler GTX
Titan and Maxwell GTX
Titan Ti.
CUDA binary tools such as nvdisasm [45] and cuobjdump [45] have been widely used to produce control flow graphs (CFGs). However, nvdisasm and cuobjdump gather kernel behavior statically, and do not allow dynamic analysis of an application's irregularity. On the other hand, the SASS Instrumentation tool (SASSI) [70] allows the dynamic collection of metrics during execution. Moreover, SASSI is able to retrieve developer-specified metrics about the control flow instructions executed at runtime. SASSI, along with nvprof [45], allows us to collect the following runtime metrics:
1. instExec: Number of instructions executed. Reported by nvprof.

2. warpDivEff: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor, expressed as a percentage. Reported by nvprof.

3. cfExecuted: Number of executed control-flow instructions. Reported by nvprof.

4. cfDependentNestedLoop: Number of instructions executed inside a control flow loop instruction. Reported by our handler and injected using SASSI.
These metrics are intrinsically related to the execution of kernel control flow and capture the efficiency of warp execution. We evaluate the percentage of instructions executed inside loops to identify hotspots that present opportunities to exploit nested TLP. We compute the ratio of instructions executed inside loop bodies, as a fraction of all instructions executed, in order to measure the impact of instructions inside these common control flow structures.
loopInstExec = cfDependentNestedLoop / instExec    (4.1)
We also consider the amount of idle resources due to warp divergence. warpDivEff allows us to compute the reciprocal metric, which measures the threads left idle until a loop execution ends:

warpDivIdle = 1 − (warpDivEff / 100)

Next, we propose loop warp efficiency, which takes into account the ratio of instructions executed during loop execution (i.e., loopInstExec). Multiplying these terms gives the fraction of loop warp threads that are idle:

loopWarpThreadsIdle = warpDivIdle ∗ loopInstExec

To obtain the efficiency metric, we compute the reciprocal by subtracting this value from 1 and multiplying by 100 to express it as a percentage:

loopWarpEff = (1 − loopWarpThreadsIdle) ∗ 100    (4.2)
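To illustrate how these metrics combine (using hypothetical measured values, not results from our experiments), suppose nvprof reports warpDivEff = 60% for a kernel in which 80% of instructions execute inside loop bodies (loopInstExec = 0.8). Then:

```
warpDivIdle         = 1 − (60 / 100)    = 0.40
loopWarpThreadsIdle = 0.40 ∗ 0.80       = 0.32
loopWarpEff         = (1 − 0.32) ∗ 100  = 68%
```

In this case, roughly a third of the warp-thread cycles spent in loop bodies are idle, flagging the loop as a candidate for nested TLP.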
Our proposed metrics are specifically designed to measure the workload imbalance generated by irregular applications. These applications have data-dependent workloads and unpredictable control flow behavior, which cause severe workload imbalance and, eventually, poor GPU utilization.
4.1.2 Parallel Recursion
Recursion is a method of making self-referential calls, commonly used to solve problems by breaking them into smaller sub-problems, following a divide-and-conquer strategy. For instance, Program 4.2 illustrates a simple recursive program which implements the Fibonacci sequence [71]. In a recursive solution, the problem is broken into a base case
int fib(int n) {
    if (n < 2)
        return n;
    return fib(n - 1) + fib(n - 2);
}

Program Listing 4.2: Recursive Fibonacci implementation.
__global__ void fib_kernel_par_rec(int n, unsigned long int *vFib) {
    if (n == 0 || n == 1)
        return;

    fib_kernel_par_rec<<<1, 1>>>(n - 2, vFib);
    fib_kernel_par_rec<<<1, 1>>>(n - 1, vFib);
    cudaDeviceSynchronize();

    vFib[n] = vFib[n - 1] + vFib[n - 2];
}
Program Listing 4.3: Fibonacci parallel recursive scheme in
CUDA.
called CUDA blocks, also known as Cooperative Thread Arrays (CTAs) [72]. TLPDegree is the number of threads synchronized across the CTA.
2. workEfficiency: Ratio of the number of operations executed that contribute to solving the problem, divided by the total operations executed on the GPU. The goal of this metric is to provide a measure of the number of non-redundant (vs. redundant + non-redundant) operations executed per GPU kernel. For instance, a 100% work efficiency indicates that no redundant operations were executed.
3. depthKernelRecursion: Number of nested kernel calls.
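As a concrete illustration of these metrics (an analytical count, not a measured result), consider computing fib(5) with the naive parallel recursive scheme of Program 4.3, which launches one child kernel per subproblem:

```
fib(5) call tree:  fib(4) ×1, fib(3) ×2, fib(2) ×3, fib(1) ×5, fib(0) ×3
depthKernelRecursion = 4   (nested launch levels below the root call)
```

Since fib(3) and fib(2) are recomputed rather than reused, many of the launched operations are redundant, which directly lowers workEfficiency.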
Our proposed metrics are specifically designed to measure the efficiency of recursive execution in parallel recursive applications. These applications have data-dependent workloads, nested kernel calls, and irregular parallel recursion, which lead to unbalanced workload execution, low work efficiency, and eventually poor GPU utilization.
4.1.3 Child Kernel Launching and Synchronization
Nested parallelism in CUDA allows explicit synchronization with child kernels by calling the cudaDeviceSynchronize Application Programming Interface (API). When used, the parent thread block will wait until the child threads finish their execution. cudaDeviceSynchronize is expensive, and should be avoided whenever possible. However, for many irregular applications the parent
thread will require results of the child threads to continue
execution. We characterize the
overhead of device synchronization by measuring its impact on
the overall performance.
Once a potential nested parallelism hotspot was identified by our metrics, we implemented a nested parallelism kernel and compared it to the non-nested parallelism kernel, as well as to a sequential implementation of the kernel, in order to characterize the overhead of child kernel launching.
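As a minimal sketch, a nested parallelism variant of the micro-benchmark in Program 4.1 can launch a child grid for the inner loop whenever the control flow condition holds. The kernel names and launch configuration below are illustrative assumptions, not our exact implementation:

```cuda
#include <cuda_runtime.h>

// Child kernel: one thread per column replaces the serial inner loop.
__global__ void childAddRow(int *A, int *B, int *C, int row, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < cols)
        C[row * cols + i] = A[row * cols + i] + B[row * cols + i];
}

// Parent kernel: one thread per row; divergent rows spawn a child grid.
__global__ void parentKernel(int *A, int *B, int *C, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows && A[idx * cols] == 1)
    {
        int threads = 256;
        int blocks  = (cols + threads - 1) / threads;
        childAddRow<<<blocks, threads>>>(A, B, C, idx, cols);
        // cudaDeviceSynchronize() would be needed here only if the
        // parent thread consumed the child's results.
    }
}
```

Dynamic parallelism requires compiling with -arch=sm_35 (or newer) and -rdc=true.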
Figure 4.2 shows the execution time of the three different implementations (i.e., sequential, non-nested parallelism, and nested parallelism) for our micro-benchmark kernel. The micro-benchmark computes the addition of two matrices if the value of the first element in the row matches the condition in the first control flow instruction. We defined a set of experiments, varying the input sizes in terms of rows and columns. In addition, we controlled the level of divergence, starting from 12.5% and increasing it up to 75%. We argue that higher divergence leads to better exploitation of nested parallelism, though the divergence present is data dependent.
We expected that small input sets would lead to poor performance on a GPU due to low utilization of the available TLP. However, nested parallelism starts outperforming non-nested parallelism as the degree of TLP increases, especially in the presence of a high degree of divergence.
In order to characterize the behavior of nested parallelism
across different GPU archi-
tectures, we used two Kepler and two Maxwell architectures,
running with the same input
sets, the same NVIDIA driver, and the same CUDA version. Figure
4.3 shows the runtime
execution for different input sets, with data values generating
a degree of 75% divergence
across the four different GPUs.
Although the Kepler GT 730 has the same number of CUDA cores per SM as the Kepler GTX Titan, it has far fewer SMs: the GTX Titan has 15 SMs, while the GT 730 has only 2. The number of SMs has a high impact on our ability to exploit nested parallelism. For instance, a launched child kernel will have to allocate a number of blocks on the remaining available SMs on the device. If the device does not have enough free SMs to leverage nested parallelism, then the benefits of nested parallelism will not be realized.
We present Equation 4.3 to characterize kernel overhead across different architectures,
Figure 4.2: Execution time of Sequential, non-nested parallelism
and nested parallelism
kernels on GTX Titan - Kepler architecture (lower is
better).
based on SM usage. NumberThreadsPerBlock and NumberBlocks are application specific, and MaxThreadsPerSM is architecture specific. If SMUsage surpasses the number of available SMs on the GPU, it will prevent us from effectively leveraging nested parallelism. We have found that we also benefit from the use of persistent threads to control SMUsage.
SMUsage = (NumberThreadsPerBlock ∗ NumberBlocks) / MaxThreadsPerSM    (4.3)
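As a worked example (illustrative launch parameters, not one of our measured configurations), consider a parent kernel launched with 256 threads per block and 120 blocks on a Kepler device, where MaxThreadsPerSM = 2048:

```
SMUsage = (256 ∗ 120) / 2048 = 15 SMs
```

Such a parent grid saturates all 15 SMs of a GTX Titan, leaving no SM free for child kernels, and far exceeds the 2 SMs of a GT 730. In both cases the parent's footprint must be reduced, for example with persistent threads, before nested parallelism can pay off.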
In our analysis, we characterized cudaDeviceSynchronize API calls for kernel synchronization using CUDA counters, such as clocks and the frequency rate. We verified that synchronization can negatively impact application performance when an application launches a small number of threads per block and a reduced number of blocks per kernel (i.e.,
Figure 4.3: Execution time of non-nested parallelism and nested parallelism across four GPUs (2 Kepler and 2 Maxwell GPUs).
poor TLP). However, we found that the cost of kernel synchronization can be hidden by increasing the TLP and loopWarpEff.
4.1.4 Memory Overhead
When using nested parallelism, global memory on the GPU is the only channel of communication between the parent and child kernels, and it may also be tied up by the device runtime for child kernel launches. The device runtime keeps track of kernel launches by creating a pool for all launches. Kernels that are not able to launch due to a lack of available resources remain in the pool of pending kernels. The size of this pool is referred to as th