Finite-Difference Time-Domain Method Implemented on the CUDA Architecture
Wei Chern TEE
School of Information Technology and Electrical Engineering
University of Queensland
Submitted for the degree of Bachelor of Engineering (Honours)
in the division of Electrical and Electronic Engineering
June 2011
Statement of Originality
June 3, 2011
Head of School
School of Information Technology and Electrical Engineering
University of Queensland
St Lucia, Q 4072
Dear Professor Paul Strooper,
In accordance with the requirements of the degree of Bachelor of Engineering (Honours) in the division of Electrical and Electronic Engineering, I present the following thesis entitled “Finite-Difference Time-Domain Method Implemented on the CUDA Architecture”. This work was performed under the supervision of Dr. David Ireland.
I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.
Yours sincerely,
—————————
Wei Chern TEE
Abstract
The finite-difference time-domain (FDTD) method is a numerical method
that is relatively simple, robust and accurate. Moreover, it lends itself well
to a parallel implementation. Modern FDTD simulations however, are often
time-consuming and can take days or months to complete depending on the
complexity of the problem. A potential way of reducing this simulation time
is in the use of graphical processor units (GPUs). This thesis thus studies
the challenges of using GPUs for solving the FDTD algorithm.
GPUs are no longer used just to render graphics. Recent advancements in
the use of GPUs for general-purpose and scientific computing have sparked an
interest in the use of FDTD. New graphics processors such as NVIDIA CUDA
GPUs provide a cost-effective alternative to traditional supercomputers and
cluster computers. The parallel nature of the FDTD algorithm coupled with
the use of GPUs can potentially and significantly reduce simulation time
compared to the CPU.
The focus of the thesis is to utilize NVIDIA CUDA GPUs to implement
the FDTD method. A brief study is done on CUDAs architecture and how
it is capable of reducing the FDTD simulation time. The thesis examines
implementations of the FDTD in one, two and three dimensions using CUDA
and the CPU. Comparisons of code-complexity, accuracy and simulation time
are made in order to provide substantial arguments for concluding whether
the implementation of FDTD on CUDA is beneficial. In summary, speed-ups
of over 20x, 60x and 50x were achieved for one-, two- and three-dimensions
respectively. However, there are challenges involved in using CUDA which
are investigated in the thesis.
Acknowledgements
I would like to acknowledge my supervisor, Dr. David Ireland
especially for his patience and guidance.
I would also like to acknowledge Dr. Konstanty Bialkowski
for his ideas and help.
To my parents, without whom I would be sorely put.
Thank you.
Special thanks to
Ahmad Faiz, Tan Jon Wen, Christina Lim, Anne-Sofie Pederson,
Franciss Chuah, Kenny Heng and Lee Kam Heng.
For friendship.
Contents

1 Introduction
  1.1 Thesis Introduction
  1.2 Motivation
  1.3 Significance
  1.4 Thesis Outline
    1.4.1 Chapter 2
    1.4.2 Chapter 3
    1.4.3 Chapter 4
    1.4.4 Chapter 5
    1.4.5 Chapter 6
    1.4.6 Chapter 7
2 Finite-Difference Time-Domain (FDTD)
  2.1 Introduction
  2.2 One-Dimensional FDTD Equations
  2.3 Two-Dimensional FDTD Equations
  2.4 Three-Dimensional FDTD Equations
3 Compute Unified Device Architecture (CUDA)
  3.1 Introduction
  3.2 Computation Capability of Graphics Processing Units
  3.3 Memory Structure
  3.4 Conclusions
4 Literature Review
  4.1 Summary
5 1-D FDTD Results
  5.1 Introduction
  5.2 Test Platform
  5.3 Results
  5.4 Discrepancy In Results
  5.5 Conclusions
6 2-D FDTD Results
  6.1 Introduction
  6.2 Test Parameters
  6.3 Initial Run
  6.4 Updating PML Using CUDA
  6.5 Compute Profiler
  6.6 CUDA Memory Restructuring
  6.7 Discrepancy In Results
  6.8 Alternative Absorbing Boundaries
  6.9 Conclusions
7 3-D FDTD Results
  7.1 Introduction
  7.2 Test Parameters
  7.3 CUDA Block & Grid Configurations
  7.4 Three-Dimensional Arrays In CUDA
  7.5 Initial Run
  7.6 Visual Profiling
  7.7 Memory Coalescing for 3-D Arrays
  7.8 Discrepancy In Results
  7.9 Alternative Absorbing Boundaries
  7.10 Conclusions
8 Conclusions
  8.1 Thesis Conclusions
  8.2 Future Work
Bibliography
A Emory Specifications
B Uncoalesced Memory Access Test
C Coalesced Memory Access Test
List of Figures

2.1 Position of the electric and magnetic fields in Yee’s scheme. (a) Electric element. (b) Relationship between the electric and magnetic elements. Source: [4].
3.1 A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores. Source: [6].
3.2 The GPU devotes more transistors to data processing. Source: [6].
3.3 Hierarchy of threads, blocks and grid in CUDA. Source: [6].
3.4 Growth in single precision computing capability of NVIDIA’s GPUs compared to Intel’s CPUs. Source: [8].
3.5 Hierarchy of various types of memory in CUDA. Source: [6].
5.1 Speed-up for one-dimensional FDTD simulation.
5.2 Throughput for one-dimensional FDTD simulation running on CUDA.
5.3 Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30.
5.4 Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30. The plot is centered between time-steps 50 and 90.
6.1 Flowchart for two-dimensional FDTD simulation.
6.2 Screen-shot of Visual Profiler GUI from CUDA Toolkit 3.2 running on Windows 7. The data in the screen-shot is imported from the results of the memory test in Listing 6.4. These results are also available in Appendix B.
6.3 Simple program to determine advantages of cudaMallocPitch().
6.4 Throughput for two-dimensional FDTD simulation running on CUDA.
6.5 Plot of 2-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 128 × 128. The location of the probe is at cell (25, 25) as specified in Table 6.1.
6.6 Comparison of speed-up for various ABCs in two-dimensional FDTD simulation running on CUDA.
6.7 Comparison of throughput for various ABCs in two-dimensional FDTD simulation running on CUDA.
7.1 Flowchart for three-dimensional FDTD simulation.
7.2 Speed-up for three-dimensional FDTD simulation.
7.3 Throughput for three-dimensional FDTD simulation running on CUDA.
7.4 Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.
7.5 Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered between time-steps 300 and 350.
7.6 Comparison of speed-up for various ABCs in three-dimensional FDTD simulation running on CUDA.
7.7 Comparison of throughput for various ABCs in three-dimensional FDTD simulation running on CUDA.
7.8 Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.
7.9 Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered between time-steps 400 and 700.
List of Tables

5.1 Specifications of the test platform.
5.2 Results for one-dimensional FDTD simulation.
5.3 Discrepancy in results between CPU and CUDA simulation of the one-dimensional FDTD method.
6.1 Set-up for two-dimensional FDTD simulation.
6.2 Results for two-dimensional FDTD simulation on CPU.
6.3 Results for two-dimensional FDTD simulation on CUDA (initial run).
6.4 Results for two-dimensional FDTD simulation on CUDA (using CUDA to update PML).
6.5 Textual profiling results for investigation into memory coalescing.
6.6 Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() for Ex memory allocation).
6.7 Snippet of results from textual profiling on uncoalesced memory access.
6.8 Snippet of results from textual profiling on coalesced memory access.
6.9 Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() and column-major indexing).
6.10 Discrepancy in results between CPU and CUDA simulation of the two-dimensional FDTD method.
7.1 Set-up for three-dimensional FDTD simulation.
7.2 Results for three-dimensional FDTD simulation on CPU.
7.3 Limitations of the block and grid configuration for CUDA for Compute Capability 1.x [6].
7.4 Results for three-dimensional FDTD simulation on CUDA (initial run).
7.5 Results from visual profiling on three-dimensional FDTD simulation.
7.6 Results for three-dimensional FDTD simulation on CUDA (with coalesced memory access).
7.7 Results from visual profiling on three-dimensional FDTD simulation (with new indexing of arrays in kernel).
7.8 Discrepancy in results between CPU and CUDA simulation of the three-dimensional FDTD method.
B.1 CUDA Textual Profiler results from testing uncoalesced memory access.
C.1 CUDA Textual Profiler results from testing coalesced memory access.
1 Introduction
1.1 Thesis Introduction
The finite-difference time-domain (FDTD) modelling technique is used to solve Maxwell’s equations in the time domain. The FDTD method was introduced by Yee in 1966 [1] and interest in the topic has increased almost exponentially over the past 30 years [2]. The FDTD method provides a relatively simple mathematical solution to Maxwell’s equations, but simulations can take days or months to complete depending on the complexity of the problem. However, the FDTD technique is also highly parallel, which allows it to leverage parallel processing architectures to achieve speed-ups. This project uses graphics processing units (GPUs), which have parallel architectures, to implement the FDTD method in order to reduce computation time.
Historically, the GPU has been used as a co-processor to the main processor of a computer, the central processing unit (CPU). The GPU is designed with a mathematically intensive, highly parallel architecture for rendering graphics. Modern GPUs, however, are becoming increasingly popular for performing general-purpose computations instead of just graphics processing. By utilizing the GPU’s massively parallel architecture with high memory bandwidth, applications running on the GPU can achieve speed-ups of orders of magnitude compared to CPU implementations. The utilization of GPUs for applications other than graphics rendering is known as general-purpose computing on graphics processing units (GPGPU). This thesis therefore focuses on combining the parallel nature of the FDTD technique with the parallel architecture of the GPU to achieve speed-ups compared to the traditional use of the CPU for computation.
1.2 Motivation
The Microwave and Optical Communications (MOC) Group at the University of Queensland requires computer simulations of very large, realistic models of the human anatomy and tissues interacting with electromagnetic energy. The FDTD technique is used for these simulations and takes a long time to solve due to the large simulation models. Therefore, the MOC group is interested in utilizing the GPU to reduce simulation time.
The FDTD technique is gaining popularity in vast areas of study such as telecommunications, optoelectronics, biomedical engineering and geophysics. Thus, the outcome of the thesis and the understanding of GPUs will be valuable towards the reduction of computation time in scientific applications. As many engineering practices require design by repetitive simulation, such as in design optimization, a more thorough optimization procedure can be achieved with a faster simulation time.
1.3 Significance
From the results of the thesis, a conclusion will be made on the feasibility of using GPUs as an alternative to the CPU. If the outcome of the thesis is a successful implementation of the FDTD method using the CUDA architecture, the thesis could motivate further research into both the FDTD method and GPU acceleration.
While the focus of this thesis is on obtaining speed-ups for the FDTD method, the results could be used as a gauge of the benefits of GPU acceleration in other applications, which can likewise benefit from leveraging a technology that already exists in our computers today.
1.4 Thesis Outline
A review of the remaining chapters of the thesis is given here.
1.4.1 Chapter 2
This chapter serves to introduce the finite-difference time-domain method.
A short introduction is given for one-, two- and three-dimensional FDTD
equations.
1.4.2 Chapter 3
The CUDA architecture is introduced in this chapter. A short summary of
the various types of memory available on a GPU are given here. The differ-
ences between a CUDA GPU and a CPU are discussed. CUDA’s potential
in reducing computation time for the FDTD method is also explored.
1.4.3 Chapter 4
A selective review of existing literature is given in this chapter, focusing on implementations of the FDTD method on GPUs and their findings.
1.4.4 Chapter 5
The implementation of the FDTD method in one dimension is discussed in this chapter. While this chapter serves more as an introduction to programming with CUDA, significant speed-ups of over 20x were achieved. Details of the test platform used throughout the thesis are also listed in this chapter.
1.4.5 Chapter 6
The implementation of the two-dimensional FDTD method is discussed in this chapter. The chapter explores the use of CUDA’s Compute Profiler as a tool for determining the efficiency of kernel code. Memory coalescing requirements are also discussed, and this chapter begins to introduce the complexity involved in programming with the CUDA framework.
The use of various absorbing boundary conditions (ABCs) as an alternative to perfectly matched layers (PMLs) is also discussed.
1.4.6 Chapter 7
In this chapter, the FDTD method in three dimensions is explored. The chapter explains the difficulties involved in implementing three-dimensional grids and blocks of threads in CUDA. The use of the CUDA Visual Profiler as an effective tool for debugging CUDA applications is discussed.
As with Chapter 6, alternative ABCs are explored in order to produce faster execution times and better throughput.
2 Finite-Difference Time-Domain (FDTD)
2.1 Introduction
The FDTD method is a numerical method introduced by Yee in 1966 [1] to solve the differential form of Maxwell’s equations in the time domain. Although the method has existed for over four decades, enhancements to the FDTD method are continuously being published [2].
The FDTD method discretizes Maxwell’s curl equations in time and space. The electric fields are generally located at the edges of the ‘Yee cell’ and the magnetic fields are located at the centre of the Yee cell. This is shown in Figure 2.1. In three dimensions, the number of cells in one time step can easily be in the order of millions: a domain of 100 × 100 × 100 cells already yields a total of one million cells. For example, in [3], a high-resolution head model of an adult male has a total of 4,642,730 Yee cells, each cell having a dimension of 1 × 1 × 1 mm.
Figure 2.1: Position of the electric and magnetic fields in Yee’s scheme.
(a) Electric element. (b) Relationship between the electric and magnetic
elements. Source: [4].
2.2 One-Dimensional FDTD Equations
The Maxwell’s curl equations in free space for one-dimension are
δExδt
= − 1
ε0
δHy
δz
�� ��2.1
δHy
δt= − 1
µ0
δExδz
�� ��2.2
Using the finite-difference method of central approximation and rearranging
[5], the equations become
Ex
∣∣∣∣n+1/2
k
= Ex
∣∣∣∣n−1/2
k
− ∆t
ε0 · ∆x
(Hy
∣∣∣∣nk+1/2
−Hy
∣∣∣∣nk−1/2
) �� ��2.3
Hy
∣∣∣∣n+1
k+1/2
= Hy
∣∣∣∣nk+1/2
− ∆t
µ0 · ∆x
(Ex
∣∣∣∣n+1/2
k+1
− Ex
∣∣∣∣n+1/2
k
) �� ��2.4
The electric and magnetic fields are calculated alternately over the entire spatial domain at each time-step n, and this process is repeated over all the time-steps until convergence is achieved.
Depending on the size of the computation domain, millions of iterations could be required to solve the differential equations. However, the benefit of using the FDTD method is that each cell only requires an exchange of data with its neighbouring cells. The equations above show that $E_x\big|_k^{n+1/2}$ is only updated from $E_x\big|_k^{n-1/2}$, $H_y\big|_{k+1/2}^{n}$ and $H_y\big|_{k-1/2}^{n}$. Similarly, $H_y\big|_{k+1/2}^{n+1}$ is updated from $H_y\big|_{k+1/2}^{n}$, $E_x\big|_{k+1}^{n+1/2}$ and $E_x\big|_{k}^{n+1/2}$. Therefore, the FDTD algorithm is highly parallel in nature and significant speed-ups can be achieved by harnessing the power of parallel processing. Further reading on parallel FDTD can be found in [4].
2.3 Two-Dimensional FDTD Equations
For the two-dimensional FDTD method in transverse-magnetic (TM) mode, the update equations are

$$H_z\big|_{i,j}^{n+1/2} = D_a\big|_{i,j} H_z\big|_{i,j}^{n-1/2} + D_b\big|_{i,j} \left[ \left( \frac{E_x\big|_{i,j+1/2}^{n} - E_x\big|_{i,j-1/2}^{n}}{\Delta y} \right) - \left( \frac{E_y\big|_{i+1/2,j}^{n} - E_y\big|_{i-1/2,j}^{n}}{\Delta x} \right) \right] \tag{2.5}$$

$$D_a\big|_{i,j} = \frac{1 - \dfrac{\sigma^{*}_{i,j}\,\Delta t}{2\mu_{i,j}}}{1 + \dfrac{\sigma^{*}_{i,j}\,\Delta t}{2\mu_{i,j}}} \tag{2.6}$$

$$D_b\big|_{i,j} = \frac{\dfrac{\Delta t}{\mu_{i,j}}}{1 + \dfrac{\sigma^{*}_{i,j}\,\Delta t}{2\mu_{i,j}}} \tag{2.7}$$

$$E_x\big|_{i,j}^{n+1} = C_a\big|_{i,j} E_x\big|_{i,j}^{n} + C_b\big|_{i,j} \left( \frac{H_z\big|_{i,j+1/2}^{n+1/2} - H_z\big|_{i,j-1/2}^{n+1/2}}{\Delta y} \right) \tag{2.8}$$

$$E_y\big|_{i,j}^{n+1} = C_a\big|_{i,j} E_y\big|_{i,j}^{n} + C_b\big|_{i,j} \left( \frac{H_z\big|_{i+1/2,j}^{n+1/2} - H_z\big|_{i-1/2,j}^{n+1/2}}{\Delta x} \right) \tag{2.9}$$

$$C_a\big|_{i,j} = \frac{1 - \dfrac{\sigma_{i,j}\,\Delta t}{2\varepsilon_{i,j}}}{1 + \dfrac{\sigma_{i,j}\,\Delta t}{2\varepsilon_{i,j}}} \tag{2.10}$$

$$C_b\big|_{i,j} = \frac{\dfrac{\Delta t}{\varepsilon_{i,j}}}{1 + \dfrac{\sigma_{i,j}\,\Delta t}{2\varepsilon_{i,j}}} \tag{2.11}$$

where $\sigma$ and $\sigma^{*}$ are the electric and magnetic conductivities respectively, and $\varepsilon$ and $\mu$ are the permittivity and permeability respectively.
2.4 Three-Dimensional FDTD Equations
For the three-dimensional FDTD method, the update equations are

$$E_x\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k} E_x\big|_{i,j,k}^{n} + C_b\big|_{i,j,k} \left[ \left( \frac{H_z\big|_{i,j+1/2,k}^{n+1/2} - H_z\big|_{i,j-1/2,k}^{n+1/2}}{\Delta y} \right) - \left( \frac{H_y\big|_{i,j,k+1/2}^{n+1/2} - H_y\big|_{i,j,k-1/2}^{n+1/2}}{\Delta z} \right) \right] \tag{2.12}$$

$$E_y\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k} E_y\big|_{i,j,k}^{n} + C_b\big|_{i,j,k} \left[ \left( \frac{H_x\big|_{i,j,k+1/2}^{n+1/2} - H_x\big|_{i,j,k-1/2}^{n+1/2}}{\Delta z} \right) - \left( \frac{H_z\big|_{i+1/2,j,k}^{n+1/2} - H_z\big|_{i-1/2,j,k}^{n+1/2}}{\Delta x} \right) \right] \tag{2.13}$$

$$E_z\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k} E_z\big|_{i,j,k}^{n} + C_b\big|_{i,j,k} \left[ \left( \frac{H_y\big|_{i+1/2,j,k}^{n+1/2} - H_y\big|_{i-1/2,j,k}^{n+1/2}}{\Delta x} \right) - \left( \frac{H_x\big|_{i,j+1/2,k}^{n+1/2} - H_x\big|_{i,j-1/2,k}^{n+1/2}}{\Delta y} \right) \right] \tag{2.14}$$

$$C_a\big|_{i,j,k} = \frac{1 - \dfrac{\sigma_{i,j,k}\,\Delta t}{2\varepsilon_{i,j,k}}}{1 + \dfrac{\sigma_{i,j,k}\,\Delta t}{2\varepsilon_{i,j,k}}} \tag{2.15}$$

$$C_b\big|_{i,j,k} = \frac{\dfrac{\Delta t}{\varepsilon_{i,j,k}}}{1 + \dfrac{\sigma_{i,j,k}\,\Delta t}{2\varepsilon_{i,j,k}}} \tag{2.16}$$

$$H_x\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k} H_x\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k} \left[ \left( \frac{E_y\big|_{i,j,k+1/2}^{n} - E_y\big|_{i,j,k-1/2}^{n}}{\Delta z} \right) - \left( \frac{E_z\big|_{i,j+1/2,k}^{n} - E_z\big|_{i,j-1/2,k}^{n}}{\Delta y} \right) \right] \tag{2.17}$$

$$H_y\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k} H_y\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k} \left[ \left( \frac{E_z\big|_{i+1/2,j,k}^{n} - E_z\big|_{i-1/2,j,k}^{n}}{\Delta x} \right) - \left( \frac{E_x\big|_{i,j,k+1/2}^{n} - E_x\big|_{i,j,k-1/2}^{n}}{\Delta z} \right) \right] \tag{2.18}$$

$$H_z\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k} H_z\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k} \left[ \left( \frac{E_x\big|_{i,j+1/2,k}^{n} - E_x\big|_{i,j-1/2,k}^{n}}{\Delta y} \right) - \left( \frac{E_y\big|_{i+1/2,j,k}^{n} - E_y\big|_{i-1/2,j,k}^{n}}{\Delta x} \right) \right] \tag{2.19}$$

$$D_a\big|_{i,j,k} = \frac{1 - \dfrac{\sigma^{*}_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}}{1 + \dfrac{\sigma^{*}_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}} \tag{2.20}$$

$$D_b\big|_{i,j,k} = \frac{\dfrac{\Delta t}{\mu_{i,j,k}}}{1 + \dfrac{\sigma^{*}_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}} \tag{2.21}$$

where $\sigma$ and $\sigma^{*}$ are the electric and magnetic conductivities respectively, and $\varepsilon$ and $\mu$ are the permittivity and permeability respectively.
3 Compute Unified Device Architecture (CUDA)
3.1 Introduction
CUDA is the hardware and software architecture introduced by NVIDIA in November 2006 [6] to provide developers with access to the parallel computational elements of NVIDIA GPUs. The CUDA architecture enables NVIDIA GPUs to execute programs written through various high-level interfaces such as C, Fortran, OpenCL and DirectCompute. The newest generation of NVIDIA GPUs (codenamed ‘Fermi’) also fully supports programming through the C++ language [7].
Because of advancements in technology, the processing power and parallelism of GPUs are continuously increasing. CUDA’s scalable programming model provides this abstraction to software developers, allowing a program to automatically scale according to the capabilities of the GPU without any change in code, unlike traditional graphics programming languages such as OpenGL [8]. This is illustrated in Figure 3.1.
Figure 3.1: A multithreaded program is partitioned into blocks of threads
that execute independently from each other, so that a GPU with more cores
will automatically execute the program in less time than a GPU with fewer
cores. Source: [6].
Figure 3.2: The GPU devotes more transistors to data processing. Source:
[6].
Because the GPU and CPU serve different purposes in a computer, their microprocessor architectures, as shown in Figure 3.2, are very different. While CPUs currently have up to six processor cores (Intel Core i7-970), a GPU has hundreds. For example, the NVIDIA Tesla 20-series has 448 CUDA cores [7].
Compared to the CPU, the GPU devotes more transistors to data pro-
cessing rather than data caching and flow control. This allows GPUs to spe-
cialize in math-intensive, highly parallel operations compared to the CPU
which serves as a multi-purpose microprocessor. Therefore, calculations of
the FDTD algorithm are potentially much faster when executed on the GPU
instead of the CPU. This is becoming increasingly true as graphics card ven-
dors such as NVIDIA and AMD are now developing more graphics card for
high performance computing (HPC) such as the NVIDIA Tesla [9].
CUDA has a single-instruction multiple-thread (SIMT) execution model in which multiple independent threads execute concurrently using a single instruction [7]. CUDA GPUs have a hierarchy of threads, blocks and grids as shown in Figure 3.3. Each thread has its own private memory. Shared memory is available per block, and global memory is accessible by all threads. This multi-threaded architecture puts the focus on data calculations rather than data caching. Thus, it can sometimes be faster to recalculate than to cache on a GPU.
A CUDA program is called a kernel and the kernel is invoked by a CPU
program. The CUDA programming model assumes that CUDA threads ex-
ecute on a physically separate device (GPU). The device is a co-processor
to the host (CPU) which runs the program. CUDA also assumes that the
host and device both have separate memory spaces: host memory and device
memory, respectively. Because host and device both have their own separate
memory spaces, there is potentially a lot of memory allocation, deallocation
and data transfer between host and device. Thus, memory management is a
key issue in GPGPU computing. Inefficient use of memory can significantly
increase the computation time and mask the speed-ups obtained by the data
calculations.
3.2 Computation Capability of Graphics Processing Units
The number of floating-point operations per second (flops) is one measure of the computational ability of a computer. It is an important measure, especially in scientific calculations, as it indicates a computer’s arithmetic capability.
While a high-performance CPU can have a double precision computation capability of 140 Gflops (Intel Nehalem architecture) [8], an NVIDIA Tesla 20-series (NVIDIA Fermi architecture) GPU has a peak single precision performance of 1.03 Tflops and a peak double precision performance of 515 Gflops [9].
Figure 3.3: Hierarchy of threads, blocks and grid in CUDA. Source: [6].

Figure 3.4: Growth in single precision computing capability of NVIDIA’s GPUs compared to Intel’s CPUs. Source: [8].

Furthermore, Figure 3.4 shows that the computational capability of GPUs is growing at a much faster pace than that of CPUs.
Although the compute capability of a GPU is impressive compared to the CPU, it has one significant disadvantage in scientific applications: not all GPUs fully conform to the IEEE standard for floating-point operations [10]. Although the floating-point arithmetic of NVIDIA graphics cards is similar to the IEEE 754-2008 standard used by many CPU vendors, it is not quite the same, especially for double precision [6].
In computers, the natural form of number representation is binary (1s and 0s). Thus, computers cannot represent all real numbers exactly. There are standards for representing floating-point numbers in computers, and the most widely used is the IEEE 754 standard. Accurate representation of floating-point numbers is important in scientific applications.
There have been many cases where errors in floating-point representation have caused catastrophes. One example is the failure of the American Patriot missile defence system to intercept an incoming Iraqi Scud missile at Dhahran, Saudi Arabia on February 25, 1991, which resulted in the deaths of 28 Americans [11]. The cause was determined to be a loss of accuracy in the conversion of an integer to a real number in the Patriot’s computer. Other examples of catastrophes resulting from floating-point representation errors can be found in [12].
Most CPU manufacturers now use the IEEE 754 floating-point standard. As developments in GPGPU computing continue, GPU vendors will inevitably conform to the IEEE 754 standard for floating-point representation as well. This is evident in the newest Fermi architecture from NVIDIA, which implements IEEE 754-2008 floating point for both single and double precision arithmetic [7].
3.3 Memory Structure
The CUDA memory hierarchy is shown in Figure 3.5.
These different types of memory differ in size, access times and restric-
tions. Detailed descriptions of the various memory types are available in the
CUDA Programming Guide [6] and the CUDA Best Practices Guide [13].
In short, global memory is the largest in size, is located off the GPU chip, and can be accessed by any thread. Because it is off-chip, its access time is the slowest among the various types of memory. Shared memory is located on-chip, which makes access to it very fast compared to global memory. However, it is limited in size and shared memory access is only at block level: a thread cannot access shared memory that is allocated outside its block. Local memory is located off-chip and thus has a long latency. It is used to store variables when there are insufficient registers available.

Figure 3.5: Hierarchy of various types of memory in CUDA. Source: [6].
Other types of memory not shown in Figure 3.5 are registers and constant memory. Registers are located on-chip and are scarce; they are not shared between threads. Constant memory is located off-chip but is cached, which makes accesses to it fast despite its off-chip location.
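As a brief illustration of how these memory spaces appear in code (a sketch, not code from the thesis), the qualifiers below place data in constant, shared and global memory. The constant would be set from the host with cudaMemcpyToSymbol(), and the kernel assumes 256 threads per block:

__constant__ float scale_factor;          /* off-chip but cached */

__global__ void example(float *g_in, float *g_out, int n)  /* g_in, g_out live in global memory */
{
    __shared__ float tile[256];           /* on-chip, one copy per block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? g_in[i] : 0.0f;  /* stage data in fast shared memory */
    __syncthreads();                               /* the whole block now sees the tile */

    if (i < n)
        g_out[i] = tile[threadIdx.x] * scale_factor;
}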
While global memory is located off-chip and has the longest latency, there are techniques available that reduce the number of GPU clock cycles required to access large amounts of memory at one time. This can be done through memory coalescing, which refers to the alignment of threads and memory. For example, if memory access is coalesced, it takes only one memory request to read 64 bytes of data. On the other hand, if it is not coalesced, it can take up to 16 memory requests depending on the GPU’s compute capability. This is further explained in Section 3.2.1 of the CUDA Best Practices Guide [13].
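As a simple illustration (a sketch under the compute capability 1.x rules, not code from the thesis), the two kernels below differ only in their access pattern. In the first, consecutive threads touch consecutive floats, so the 16 reads of a half-warp fall into one aligned 64-byte segment; in the second, a stride between threads can force the hardware to issue a separate transaction per thread:

/* Coalesced: thread k of a half-warp reads element k of an aligned segment. */
__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

/* Strided: consecutive threads read floats 'stride' elements apart,
   so one half-warp can require up to 16 separate memory requests. */
__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}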
3.4 Conclusions
GPUs have a parallel architecture capable of executing thousands of threads simultaneously. This gives the GPU an advantage over the CPU when it comes to intensive computations on large amounts of data. With the CUDA framework, developers have access to CUDA-enabled NVIDIA GPUs, allowing them to leverage this computational capability for applications other than graphics rendering. Along with this, CUDA provides a relatively cheap alternative to supercomputing.
4 Literature Review
In the NVIDIA GPU Computing Software Development Kit (SDK), there is sample code for a three-dimensional FDTD simulation. However, the FDTD method implemented is not the conventional FDTD method discussed in Chapter 2. In [14], the code was tested on an NVIDIA Tesla S1070 and a throughput of nearly 3,000 Mcells/s was achieved. While the code is not of much use to the FDTD method explored in this thesis, the throughput it achieved provides a good indication of the NVIDIA Tesla S1070’s capability.
In [15], the use of CUDA for the FDTD method is explored and a short summary of CUDA’s architecture is given. In this paper, four NVIDIA Tesla C1060s were used for testing, resulting in a throughput of almost 2,000 Mcells/s.
In [16], a two-dimensional FDTD simulation for mobile communications systems was implemented on CUDA. In this paper, the Convolutional Perfectly Matched Layer (CPML) was used as the absorbing region. The paper discusses the use of shared memory and the configuration of block sizes for optimal performance. The simulation on an NVIDIA Tesla C870 produced a throughput of 760 Mcells/s. MATLAB was also used for comparison and was slower than both the CPU and CUDA implementations.
In the article by Garland, M. et al. [17], a detailed explanation of CUDA’s architecture is given. The article also summarises a few applications that are suited to running on CUDA, such as molecular dynamics, medical imaging and fluid dynamics.
In [8], the author discusses the three-dimensional FDTD algorithm including its implementation on CUDA. Applications of the FDTD method, such as in microwave systems and biomedicine, are discussed. The author argues that “there are an infinite number of ways in which an algorithm can be partitioned for parallel execution on a GPU”. On an NVIDIA Tesla S1070, a maximum throughput of 1,680 Mcells/s was achieved, and the optimum simulation size is said to be up to 380 Mcells. By using a cluster of Tesla S1070s, a throughput of over 15,000 Mcells/s was achieved.
4.1 Summary
In summary, while there is much existing literature on the topic of accelerating the FDTD method using CUDA, few works detail the difficulty and complexity involved in developing the program. All of the literature reviewed shows that significant speed-ups were achieved by using CUDA to accelerate computation. Thus, this thesis investigates the complexity involved in programming with the CUDA framework to achieve these speed-ups.
5 1-D FDTD Results
5.1 Introduction
The development of the code for running the FDTD method using CUDA was done incrementally, starting from one dimension (1-D), followed by 2-D and finally 3-D; the latter two are covered in later chapters. For each phase, new methods and techniques were used to provide more speed-up as the algorithm became more complex and the amount of data processed increased. For consistency, the platform used for testing was not changed throughout the development. The specifications of the test platform are listed in the following section.
Each chapter presents the various methods and techniques used to achieve speed-ups. In each phase, various simulation sizes were tested; the results of the simulations are compared and explained, and execution times and speed-ups are listed. To address concerns about the accuracy of the CUDA implementation, the variation in results between CPU and GPU is also detailed.
Throughout the thesis, speed-up and throughput are used to quantify performance. They are defined as

$$\text{Speed-up} = \frac{\text{CPU Execution Time}}{\text{GPU Execution Time}} \tag{5.1}$$

$$\text{Throughput (Mcells/s)} = \frac{\text{Number of cells} \times \text{Number of time-steps}}{10^6 \times \text{Execution Time (s)}} \tag{5.2}$$

where Mcells denotes a million cells.
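As a worked example using the largest run in Table 5.2, a 10,240,000-cell simulation over 3,000 time-steps that completes in 27,616.772 ms on CUDA has a throughput of

$$\frac{10{,}240{,}000 \times 3{,}000}{10^6 \times 27.617} \approx 1{,}112 \ \text{Mcells/s}.$$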
5.2 Test Platform
The finite-difference time-domain implementation for running on CUDA was
tested on the computer ‘Emory’. The specifications of the computer are
shown in Table 5.1.
Operating System    64-bit CentOS Linux
Memory (RAM)        32 GB
CPU                 Dual Intel Xeon E5430
                    Clock speed: 2.66 GHz
                    Number of cores: 4
GPU                 Dual NVIDIA Tesla S1070
                    Clock speed: 1.30 GHz
                    Number of processors: 4
                    Number of cores per processor: 240

Table 5.1: Specifications of the test platform.
The NVIDIA Tesla S1070 has a CUDA Compute Capability of 1.3. De-
tailed specifications of the Tesla S1070 are listed in Appendix A.
As shown in Table 5.1, there are 4 processors in the Tesla S1070. However,
for the purpose of this thesis, only one of the processors is utilized.
5.3 Results
The one-dimensional FDTD algorithm that was ported to CUDA simulates a wave travelling in free space with absorbing boundary conditions. The main update equations in code form are

ex[i] = ca[i] * ex[i] - cb[i] * (hy[i] - hy[i-1]);
hy[i] = da[i] * hy[i] - db[i] * (ex[i+1] - ex[i]);

Listing 5.1: Main update equations for the 1-D FDTD method.

where ex is the electric field and hy is the magnetic field. ca, cb, da and db are coefficients which are pre-calculated before running the main update equations of Listing 5.1. These coefficients remain constant throughout the main update loop.
As this was the first attempt at getting the FDTD method to run on CUDA, the algorithm in Listing 5.1 was made to run in a CUDA kernel with very few modifications to other segments of the code. The initialization routines and pre-calculation of coefficients (ca, cb, da and db) are still done by the CPU. After all initialization, all necessary data (ex, hy, ca, cb, da and db) are transferred from the CPU to the GPU’s global memory. Then, the CUDA kernel containing the update equations of Listing 5.1 is executed.
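The thesis does not list this set-up code; a minimal sketch of the host-side initialization it describes (the _d suffix marking device copies is an assumption consistent with Listing 5.3) might look like:

/* Sketch only: allocate device copies and upload the pre-calculated data. */
float *ex_d, *hy_d, *ca_d, *cb_d, *da_d, *db_d;
size_t bytes = Ncells * sizeof(float);

cudaMalloc((void**)&ex_d, bytes);
cudaMalloc((void**)&hy_d, bytes);
cudaMalloc((void**)&ca_d, bytes);
cudaMalloc((void**)&cb_d, bytes);
cudaMalloc((void**)&da_d, bytes);
cudaMalloc((void**)&db_d, bytes);

cudaMemcpy(ex_d, ex, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(hy_d, hy, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(ca_d, ca, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(cb_d, cb, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(da_d, da, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(db_d, db, bytes, cudaMemcpyHostToDevice);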
To compare execution time, CUDA’s timer functions are utilized. Only
the time taken to run the main loop of the FDTD update equations is
recorded.
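The exact timing code is not reproduced in the thesis; one common pattern uses the CUDA event API (a sketch, with the comment standing in for the main loop):

cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);      /* mark the beginning of the main loop */
/* ... main FDTD update loop ... */
cudaEventRecord(stop, 0);       /* mark the end */
cudaEventSynchronize(stop);     /* wait until the GPU has reached 'stop' */

cudaEventElapsedTime(&elapsed_ms, start, stop);   /* elapsed time in ms */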
In order to analyse both the accuracy and the execution time of the CPU and GPU implementations, the CUDA code has to be executed twice. This is because the GPU works with the data stored in its global memory, and the data has to be transferred to the CPU before it can be analysed. Thus, after each time-step, the data in the GPU’s global memory is transferred to the CPU for processing.
However, with these memory transfers, the execution time for the main loop cannot be accurately obtained. Thus, the CUDA code is executed again, this time without the memory transfers. This provides a more accurate and fair comparison against the CPU’s execution time.
Listing 5.2 and Listing 5.3 show the C code that utilizes the CPU and
CUDA respectively. Listing 5.4 shows the C code for the CUDA kernel.
for (int n = 0; n < Nmax; n++) {
    int m;

    pulse = exp(-0.5 * pow((no - n)/spread, 2));

    for (m = 1; m < Ncells; m++)
        ex[m] = ca[m] * ex[m] - cb[m] * (hy[m] - hy[m-1]);

    ex[Location] = ex[Location] + pulse;

    for (m = 0; m < Ncells - 1; m++)
        hy[m] = da[m] * hy[m] - db[m] * (ex[m+1] - ex[m]);
}

Listing 5.2: Main update loop for the 1-D FDTD method using the CPU.
for (int n = 0; n < Nmax; n++) {
    calcOneTimeStep_cuda<<<numberOfBlocks, threadsPerBlock>>>(ex, hy, ca_d, cb_d, da_d, db_d, Ncells, Ncells/2, n);
}

Listing 5.3: Main update loop for the 1-D FDTD method using CUDA.
__global__ void calcOneTimeStep_cuda(float *ex, float *hy,
        float *ca, float *cb, float *da, float *db,
        int Ncells, int Location, int n)   /* n is the current time-step passed from the host loop */
{
    float pulse;
    float no = 40;
    float spread = 12;

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    pulse = exp(-0.5 * pow((no - n)/spread, 2));

    if (i > 0 && i < Ncells)
        ex[i] = ca[i] * ex[i] - cb[i] * (hy[i] - hy[i-1]);

    if (i == Location)
        ex[i] = ex[i] + pulse;

    __syncthreads();

    if (i < Ncells - 1)
        hy[i] = da[i] * hy[i] - db[i] * (ex[i+1] - ex[i]);
}

Listing 5.4: CUDA kernel for the 1-D FDTD method.
Table 5.2 shows the results obtained from running the code on Emory.
The number of time steps is 3,000 and the execution time is an average of
five runs.
Simulation Size     Execution Time (ms)            Speed-up
(number of cells)   CPU            CUDA
100                 5.368          178.111         0.0301
512                 25.340         188.3199        0.1346
1,000               49.781         187.775         0.2651
1,024               51.324         183.427         0.2798
5,120               263.243        193.365         1.3614
10,240              520.865        203.430         2.5604
51,200              2,674.100      315.700         8.4704
102,400             5,125.420      456.463         11.2286
512,000             28,962.292     1,739.171       16.6529
1,024,000           57,911.306     2,938.229       19.7096
5,120,000           290,531.271    13,897.174      20.9058
10,240,000          558,871.438    27,616.772      20.2367

Table 5.2: Results for one-dimensional FDTD simulation.
There are a few interesting observations that can be made from the re-
sults. Firstly, as expected, higher speed-ups are obtained when the simulation
size is increased. At the simulation size of approximately ten million cells,
the CPU takes more than nine minutes to run while the GPU only takes
slightly more than 27 seconds.
This speed-up becomes more appreciable as the simulation size is increased and the FDTD is performed in three dimensions instead of one. A simulation that would take hours to run on a CPU could potentially take only minutes to run on a GPU.
[Figure: speed-up vs. simulation size (Mcells).]
Figure 5.1: Speed-up for one-dimensional FDTD simulation.
[Figure: throughput (Mcells/s) vs. simulation size (Mcells).]
Figure 5.2: Throughput for one-dimensional FDTD simulation running on CUDA.
Secondly, the results show that for small simulation sizes such as 100 cells, the CPU runs much faster than the GPU. This is a result of the latency of memory transfers between host (CPU) and device (GPU), as discussed in Section 3.1. Because host and device have separate memory spaces, data has to be transferred between them in order for the GPU to perform the calculations. When the simulation size is small, the speed of the GPU in arithmetic operations is obscured by the time taken to transfer the data.
Also, the runtime of the GPU is longer when the simulation size is 1,000 cells instead of 1,024. The reason for this lies in the thread and block organization of the GPU. For this simulation, the number of threads per block was held constant at 512. The number of blocks required is then obtained by dividing the simulation size by the number of threads per block. Simulation sizes of 1,000 and 1,024 cells both require two blocks, but the former does not fully use all the threads in one of the blocks.
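In code, this is the usual rounding-up grid-sizing idiom (a sketch; the thesis does not show the calculation explicitly):

int threadsPerBlock = 512;
/* Round up so that a partially filled block is still launched:
   1,000 cells -> 2 blocks (24 idle threads); 1,024 cells -> 2 full blocks. */
int numberOfBlocks = (Ncells + threadsPerBlock - 1) / threadsPerBlock;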
CUDA uses a SIMT model, as discussed in Section 3.1: multiple threads run the same instruction simultaneously. With a simulation size of 1,000 cells, not all the threads in the blocks are used; specifically, 24 threads perform no calculations. This breaks homogeneity and causes the CUDA GPU, with its SIMT model, to perform more slowly. One method of preventing this would be to program the GPU to calculate on all threads but ignore the results from the 24 unused threads.
It is also interesting to note that from Figure 5.2, the throughput appears
to saturate at around 1,100 Mcells/s.
5.4 Discrepancy In Results
The results from CUDA and the CPU were compared by taking the difference between the CPU and CUDA electric fields. The percentage change was also calculated:

$$\text{Difference} = \left|\, E_{x,\text{CPU}}\big|_k^n - E_{x,\text{CUDA}}\big|_k^n \,\right| \tag{5.3}$$

$$\text{Percentage change} = \left| \frac{E_{x,\text{CPU}}\big|_k^n - E_{x,\text{CUDA}}\big|_k^n}{E_{x,\text{CPU}}\big|_k^n} \right| \times 100\% \tag{5.4}$$

where n is the time-step and k is the spatial location. Instead of analysing the differences at every time-step and at every cell, only the largest difference and the largest percentage change over all the time-steps in the whole simulation space are recorded. This is sufficient to determine whether CUDA is performing correctly. The results are shown in Table 5.3.
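A compact way of tracking these maxima during the comparison run (a sketch, not the thesis code; the array names are illustrative):

#include <math.h>

float max_diff = 0.0f, max_pct = 0.0f;

/* After each time-step, compare the CPU and CUDA fields cell by cell. */
for (int k = 0; k < Ncells; k++) {
    float diff = fabsf(ex_cpu[k] - ex_cuda[k]);
    if (diff > max_diff)
        max_diff = diff;
    if (ex_cpu[k] != 0.0f) {                       /* avoid division by zero */
        float pct = (diff / fabsf(ex_cpu[k])) * 100.0f;
        if (pct > max_pct)
            max_pct = pct;
    }
}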
To illustrate the significance of the discrepancy, MATLAB was used to plot the results generated from both the CPU and CUDA. Only the results for the simulation size of 100 cells are plotted; it is redundant to plot the other simulation sizes because all parameters, including the type of wave used, are held constant across all simulation sizes. The only change is in the spatial size.
From the plots, it can be concluded that there are no appreciable differences in the results. The differences are small enough to ignore and insignificant considering the speed-ups obtained. Although Table 5.3 shows a discrepancy of almost 50%, the magnitude of this difference is less than $9 \times 10^{-7}$. The plots confirm that the differences are negligible.
Nevertheless, these comparisons are crucial and were made throughout the development. There are many parts of the program that could easily go wrong, and these checks are necessary in order to ensure that the changes made to the program do not cause CUDA to produce incorrect results.
Simulation Size   Difference            Percentage change
100               8.940697 × 10⁻⁷       46.901276%
512               1.184060 × 10⁻⁶       20.490625%
1,000             5.365000 × 10⁻⁶       37.937469%
1,024             5.598064 × 10⁻⁶       18.774035%
5,120             4.082790 × 10⁻⁵       2.390608%
10,240            4.082790 × 10⁻⁵       2.390608%
51,200            4.082790 × 10⁻⁵       2.390608%
102,400           4.082790 × 10⁻⁵       2.390608%
512,000           4.082790 × 10⁻⁵       2.390608%
1,024,000         4.082790 × 10⁻⁵       2.390608%
5,120,000         4.082790 × 10⁻⁵       2.390608%
10,240,000        4.082790 × 10⁻⁵       2.390608%

Table 5.3: Discrepancy in results between CPU and CUDA simulation of the one-dimensional FDTD method.
[Figure: probe magnitude vs. time-step; CPU and GPU (CUDA) traces overlaid.]
Figure 5.3: Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30.
[Figure: probe magnitude vs. time-step; CPU and GPU (CUDA) traces overlaid.]
Figure 5.4: Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30. The plot is centered between time-steps 50 and 90.
5.5 Conclusions
Porting the one-dimensional FDTD method to CUDA has demonstrated that there is a lot of potential in GPU acceleration of the FDTD method. Speed-ups of over 20x and throughputs of over 1,100 Mcells/s (compared to a CPU throughput of only 54 Mcells/s) are convincing.
The results have also shown that CUDA is best suited to repetitive processing of large amounts of data. For a small data set, CUDA does not perform well due to memory latency issues.
The concerns about discrepancies between CPU and GPU results discussed in Section 3.2 were addressed, and the simulations performed show no significant difference.
Although the results are convincing, the one-dimensional FDTD method is relatively simple compared to its two- and three-dimensional counterparts. The difficulty of extracting speed-ups from CUDA increases with the number of dimensions, and various optimizations have to be performed. This is discussed in the following chapters.
6 2-D FDTD Results
6.1 Introduction
Just as in the one-dimensional FDTD method, a wave emanating through free space from a point source is simulated. The main field update equations for the two-dimensional FDTD method in code form are shown in Listing 6.1.

ex[i][j] = caex[i][j] * ex[i][j] + cbex[i][j] * (hz[i][j] - hz[i][j-1]);
ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * (hz[i-1][j] - hz[i][j]);
hz[i][j] = dahz[i][j] * hz[i][j] + dbhz[i][j] * (ex[i][j+1] - ex[i][j] + ey[i][j] - ey[i+1][j]);

Listing 6.1: Main update equations for the 2-D FDTD method.
Perfectly matched layers (PMLs) were used as the artificial absorbing layer to prevent the travelling waves from reflecting at the boundaries of the simulation space. This causes the main update loop to be more complex than the 1-D loop, because the equations for the PML region are different from the FDTD update equations. Apart from that, the use of PMLs introduces four more regions in addition to the main field domain: the PML regions on the top, bottom, left and right of the central FDTD region.
The flowchart in Figure 6.1 illustrates the main update loop for the two-dimensional FDTD method with PMLs as the absorbing region.
As the PML update equations are long, they are not listed here. However, it is important to note that the PML update equations and the main field update equations are independent of each other. This allows the PML and field equations to be updated concurrently.
Obtaining a noticeable speed-up for the two-dimensional FDTD method proved to be more challenging than for the one-dimensional FDTD method. As was discovered, the main factors causing CUDA to run slowly were related to memory. The problems faced during development and the methods used to improve the speed-up are discussed in the following sections.
6.2 Test Parameters
The program execution set-up used for testing is shown in Table 6.1.
As in the one-dimensional FDTD simulation, there needs to be a way of
determining whether CUDA is performing correctly and producing accurate
results. To do this, the magnetic field, Hz, is monitored at a particular
location. This probe location could be anywhere in the FDTD region. For
consistency, the cell at location (25, 25) is used. The magnetic field calculated
by the CPU and by CUDA at this location is recorded throughout all the
time-steps.
[Flowchart: initialize model → update electric fields ex and ey in the main grid → update ex in all PML regions → update ey in all PML regions → update magnetic field hz in the main grid → update hzx in all PML regions → update hzy in all PML regions → loop until the time-steps have ended.]
Figure 6.1: Flowchart for two-dimensional FDTD simulation.
Number of time-steps           1000
x-location of wave source      75
y-location of wave source      75
x-location of probe            25
y-location of probe            25
CUDA x-dimension of a block    16
CUDA y-dimension of a block    16
Simulation sizes               128 × 128, 256 × 256, 512 × 512, 1024 × 1024, 2048 × 2048
Thickness of PML               8 cells

Table 6.1: Set-up for two-dimensional FDTD simulation.
The original CPU execution times are shown in Table 6.2. Throughout this chapter, the speed-ups calculated are based on the CPU execution times in this table.

Simulation Size   CPU Execution Time (ms)
128 × 128         5,177.621
256 × 256         18,342.557
512 × 512         75,437.734
1024 × 1024       288,112.225
2048 × 2048       1,129,977.275

Table 6.2: Results for two-dimensional FDTD simulation on CPU.
6.3 Initial Run
Similar to the development of the 1-D code, the 2-D code was written to execute the main update equations using CUDA. Specifically, the Ex, Ey and Hz update equations are executed using CUDA; the field updates in the PML regions were still executed by the CPU. In order to do this, the Ex and Ey fields had to be transferred from the host to the device before executing a CUDA kernel and then transferred back from the device to the host after the kernel finished executing. As expected from the discussion in Section 3.3, these memory operations were costly and resulted in a slower execution compared to the CPU. The results are shown in Table 6.3.
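The structure of this initial run is sketched below (illustrative only; the kernel name and sizes are assumptions). The point is that both cudaMemcpy() calls sit inside the time-step loop, so their cost is paid at every iteration:

for (int n = 0; n < Nmax; n++) {
    /* Upload fields last touched by the CPU (the PML regions). */
    cudaMemcpy(ex_d, ex, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(ey_d, ey, bytes, cudaMemcpyHostToDevice);

    updateMainGrid<<<grid, block>>>(ex_d, ey_d, hz_d /* , ... */);

    /* Download the results so the CPU can update the PML regions. */
    cudaMemcpy(ex, ex_d, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(ey, ey_d, bytes, cudaMemcpyDeviceToHost);

    /* ... CPU updates the four PML regions ... */
}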
The results show that as the simulation size increases, the speed-up decreases. This is because at larger simulation sizes, the amount of memory transferred per time-step is larger. To reduce the memory transactions, CUDA is used to update the PML regions as well.
Simulation Size   CUDA Execution Time (ms)   Speed-up
128 × 128         5,327.528                  0.97186181
256 × 256         19,307.889                 0.950003219
512 × 512         108,445.182                0.695630109
1024 × 1024       392,133.552                0.734729848
2048 × 2048       1,550,150.542              0.728946799

Table 6.3: Results for two-dimensional FDTD simulation on CUDA (initial run).
6.4 Updating PML Using CUDA
By using CUDA to update the PML regions, the need for any data transfers between the host and the device in the main update loop is eliminated. However, this prevents probing of the magnetic field to determine the accuracy of CUDA’s execution as discussed in Section 6.2. This is solved in a similar fashion to the 1-D FDTD simulation: the program is executed twice, first without probing and then with probing. The first execution is timed so that an accurate speed-up can be calculated. In the second execution, the magnetic field is saved so that a comparison can be made between CUDA and the CPU.
Comparing the results of Table 6.3 and Table 6.4, there is a noticeable improvement in the execution times when CUDA is used to calculate the PML. However, CUDA’s performance is still very poor, which is certainly not what would be expected given the results of the one-dimensional FDTD simulation. The reason behind this is not immediately apparent, but further investigation into the CUDA execution model reveals it; this is discussed in Section 6.6.
Simulation Size   CUDA Execution Time (ms)   Speed-up
128 × 128         2,388.178                  2.168021637
256 × 256         7,920.042                  2.315967187
512 × 512         67,085.505                 1.124501243
1024 × 1024       286,012.760                1.007340458
2048 × 2048       1,157,616.333              0.976124163

Table 6.4: Results for two-dimensional FDTD simulation on CUDA (using CUDA to update PML).
6.5 Compute Profiler
In order to understand what happens in the GPU throughout the execution of the program and to determine where the bottlenecks occur, the NVIDIA Compute Visual Profiler tool is used. This tool is bundled with the NVIDIA CUDA Toolkit. The Compute Visual Profiler is “used to measure performance and find potential opportunities for optimization in order to achieve maximum performance from NVIDIA GPUs” [18].
Since the testing of CUDA on Emory is performed through a command-line interface, a textual method of profiling the CUDA program is used instead of the graphical user interface (GUI) method shown in Figure 6.2. The article in [19] provides an excellent tutorial on textual profiling. For the purpose of profiling the FDTD program, focus is given to the following parameters:

gld_incoherent   Number of non-coalesced global memory loads
gld_coherent     Number of coalesced global memory loads
gst_incoherent   Number of non-coalesced global memory stores
gst_coherent     Number of coalesced global memory stores
Figure 6.2: Screen-shot of the Visual Profiler GUI from CUDA Toolkit 3.2 running on Windows 7. The data in the screen-shot is imported from the results of the memory test in Listing 6.4. These results are also available in Appendix B.
The results of the profiling is expected to show a high number of non-
coalesced memory loads and non-coalesced memory stores as there has been
no optimizations to the utilization of CUDA’s memory yet. However, the
results of the profiling showed that all global memory loads and stores were
coalesced. The source code for the test and the profiling results are shown
in Appendix B.
The reason behind this could be the CUDA compute capability of Emory's
GPU. As shown in Appendix A, the NVIDIA Tesla S1070's compute capa-
bility is 1.3, and the memory coalescing requirements for compute capability
1.3 are relaxed. This is explained in Section 3.2.1 of the CUDA C Best
Practices Guide 3.1 [13]. Although the profiling shows that all memory
accesses are coalesced, it does not mean memory accesses are optimised.
Improvements can be made, and a simple program based on Section 3.2.1.3
of the CUDA Best Practices Guide is tested on Emory. Table 6.5 summarises
the textual profiling results for the program.
The results in Table 6.5 show 0 incoherent memory loads (gld_incoherent)
and 0 incoherent memory stores (gst_incoherent). This is inconsistent with
what is expected from the program: it should show uncoalesced memory
accesses at offsets other than 0 and 16.
Therefore, the example above clearly shows that on Emory’s GPU, textual
profiling does not provide sufficient information on whether memory accesses
are optimized.
Offset  gld_coherent  gld_incoherent  gst_coherent  gst_incoherent
0 128 0 512 0
1 192 0 512 0
2 192 0 512 0
3 192 0 512 0
4 192 0 512 0
5 192 0 512 0
6 192 0 512 0
7 192 0 512 0
8 192 0 512 0
9 128 0 512 0
10 192 0 512 0
11 192 0 512 0
12 192 0 512 0
13 192 0 512 0
14 192 0 512 0
15 192 0 512 0
16 128 0 512 0
Table 6.5: Textual profiling results for investigation into memory coalescing.
6.6 CUDA Memory Restructuring
As explained in Section 3.3, there are different types of memory in CUDA and
they all have different advantages and disadvantages. While global memory is
the largest, it is not cached and the memory is located off-chip. This causes
access to global memory to be slow. Therefore, to improve the results of
the simulation, CUDA’s memory structure is investigated in order to obtain
better speed-ups. It is worth noting that in the CUDA C Best Practices
Guide [13], memory optimizations are recommended as high-priority.
In Section 6.4, the number of memory transfers between host and device
was reduced by using CUDA to perform the calculations in the PML
regions. This eliminated the need for any data transfers within the main
update loop. However, the data was not stored in a way that assists coalesced
memory accesses. Although the simulation is of 2-D space, memory was
allocated only as flat 1-D arrays in CUDA. An example of this is shown in
Listing 6.2.
cudaMalloc(&ex_d, ie*jb*sizeof(float));
cudaMemcpy(ex_d, ex, ie*jb*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(ex, ex_d, ie*jb*sizeof(float), cudaMemcpyDeviceToHost);
Listing 6.2: 1-D array in CUDA.
To improve on this, the cudaMallocPitch() function is used for allo-
cating memory instead of cudaMalloc(). This function is recommended by
the NVIDIA CUDA Programming Guide 3.1.1 for allocating 2D memory.
cudaMallocPitch() ensures that memory allocation is padded properly to
meet alignment requirements for coalesced memory access. Further expla-
nation on this can be found in Section 5.3.2.1.2 of the Programming Guide.
Listing 6.3 shows how cudaMallocPitch() is used in place of the code in
Listing 6.2.
cudaMallocPitch((void**)&ex_d, &ex_pitch, jb * sizeof(float), ie);
cudaMemcpy2D(ex_d, ex_pitch, ex, jb * sizeof(float), jb * sizeof(float),
             ie, cudaMemcpyHostToDevice);
cudaMemcpy2D(ex, jb * sizeof(float), ex_d, ex_pitch, jb * sizeof(float),
             ie, cudaMemcpyDeviceToHost);
Listing 6.3: 2-D array in CUDA allocated using cudaMallocPitch().
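The pitch returned by cudaMallocPitch() must also be used when indexing the array inside a kernel, since each row may be padded. The following is a minimal sketch; the kernel and its parameters are illustrative, not the thesis code:

// Illustrative sketch: scaling every element of a pitched 2-D array.
// ex_d was allocated with cudaMallocPitch(); pitch is given in bytes.
__global__ void scale_pitched(float *ex_d, size_t pitch, int ie, int jb)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous dimension
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < ie && j < jb) {
        // Each row starts pitch bytes after the previous one, so the row
        // pointer is computed in bytes before casting back to float*.
        float *row = (float*)((char*)ex_d + i * pitch);
        row[j] *= 2.0f;
    }
}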
To test this, the allocation of the electric field Ex in CUDA is changed to
use cudaMallocPitch() instead of cudaMalloc(). The result of this change is
shown in Table 6.6. However, the results are not convincing; there is little
difference compared to the previous result from using cudaMalloc().
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 1,380.316 3.751039236
256 × 256 5,523.732 3.320682159
512 × 512 49,619.559 1.520322553
1024 × 1024 211,769.401 1.360499787
2048 × 2048 879,721.688 1.284471318
Table 6.6: Results for two-dimensional FDTD simulation on CUDA (using
cudaMallocPitch() for Ex memory allocation).
To investigate the cause of this, a simple program was developed to deter-
mine whether there is any advantage in using cudaMallocPitch() for 2-D
arrays. The algorithm for the program is shown in Figure 6.3. The kernel
invocation and kernel definition are shown in Listing 6.4. A full listing of
the source code is attached in Appendix B.
1  // Kernel definition
2  __global__ void copy(float *odata, float *idata, int pitch,
                        int size_x, int size_y) {
3      unsigned int yid = blockIdx.x * blockDim.x + threadIdx.x;
4      unsigned int xid = blockIdx.y * blockDim.y + threadIdx.y;
5
6      if ( xid < size_y && yid < size_x ) {
7          int index = ( yid * pitch / sizeof(float) ) + xid;
8          odata[index] = idata[index] * 2.0;
9      }
10 }
11
12 int main() {
13     ...
14     dim3 grid(ceil((float)size_x/16), ceil((float)size_y/16), 1);
15     dim3 threads(16, 16, 1);
16
17     // Kernel invocation
18     copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
19 }
Listing 6.4: Simple program to determine advantages of using
cudaMallocPitch().
Figure 6.3: Simple program to determine advantages of cudaMallocPitch().
The program runs initialization; creates a 2-D array in the CPU and fills it
with random data; creates a 2-D array in CUDA using cudaMallocPitch();
copies the array from the CPU to CUDA; multiplies each element of the
array by 2 using CUDA; copies the array from CUDA back to the CPU; and
finally checks that CUDA has multiplied all elements correctly.
A snippet of the textual profiling results is shown in Table 6.7. For
the full results, refer to Appendix B.
Again, the results from executing this program were unconvincing. Al-
though there was no indication of uncoalesced memory reads or writes, the
execution time was still quite slow and did not improve when
cudaMallocPitch() was used.
method          Z4copyPfS_iii
gputime         7579.68
gld_coherent    104960
gld_incoherent  0
gst_coherent    209920
gst_incoherent  0
Table 6.7: Snippet of results from textual profiling on uncoalesced memory
access.
Thus, this program was profiled on another computer with Compute
Capability 1.0, and there the CUDA Compute Profiler showed that the
memory accesses were indeed uncoalesced.
This was not expected, and further testing showed that the kernel was
effectively indexing the array in column-major order: consecutive threads
accessed elements a whole row apart in memory. To fix the problem, the
code from Listing 6.4 is replaced with the code in Listing 6.5. The only
changes made to the source code are in Lines 3, 4 and 14. The full source
code is available in Appendix C.
A snippet of the test results is shown in Table 6.8. For the full results,
refer to Appendix C.
This change produced a significantly faster result: 46 times faster in this
test. It is also interesting to note that the number of global memory loads
recorded in the profiling dropped by a factor of 16 (from 104,960 in Table 6.7
to 6,560 in Table 6.8). This indicates that in the uncoalesced memory test,
although the profiler did not show incoherent memory accesses, 16 times
more memory was accessed than a coalesced kernel requires. The factor of
16 is the size of a half-warp of threads, which is consistent with the
explanation in Section 3.2.1 of the CUDA Best Practices Guide [13].
1  // Kernel definition
2  __global__ void copy(float *odata, float *idata, int pitch,
                        int size_x, int size_y) {
3      unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
4      unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;
5
6      if ( xid < size_y && yid < size_x ) {
7          int index = ( yid * pitch / sizeof(float) ) + xid;
8          odata[index] = idata[index] * 2.0;
9      }
10 }
11
12 int main() {
13     ...
14     dim3 grid(ceil((float)size_y/16), ceil((float)size_x/16), 1);
15     dim3 threads(16, 16, 1);
16
17     // Kernel invocation
18     copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
19 }
Listing 6.5: The program of Listing 6.4 modified so that consecutive threads
access consecutive memory locations.
method          Z4copyPfS_iii
gputime         161.792
gld_coherent    6560
gld_incoherent  0
gst_coherent    26240
gst_incoherent  0
Table 6.8: Snippet of results from textual profiling on coalesced memory
access.
This change to the memory access pattern is applied to the 2-D FDTD
code. The results are shown in Table 6.9.
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 855.447 6.05253435
256 × 256 1,273.917 14.39854914
512 × 512 2,514.767 29.99790577
1024 × 1024 6,739.438 42.75018872
2048 × 2048 22,020.758 51.31418658
Table 6.9: Results for two-dimensional FDTD simulation on CUDA (using
cudaMallocPitch() and column-major indexing).
A plot of the throughput for CUDA is shown in Figure 6.4. The through-
put for CUDA is over 190 Mcells/s. This is significantly faster than the
CPU’s throughput of less than 4 Mcells/s.
Figure 6.4: Throughput for two-dimensional FDTD simulation running on
CUDA, plotted in Mcells/s against simulation size (number of cells).
6.7 Discrepancy In Results
The same tests for discrepancy between the CPU and GPU as in the one-
dimensional FDTD simulation, detailed in Section 5.4, are performed here.
The results are shown in Table 6.10 below.
Simulation Size  Difference          Percentage change
128 × 128        4.507601 × 10^-7    0.299014%
256 × 256        3.613532 × 10^-7    0.304905%
512 × 512        3.613532 × 10^-7    0.304905%
1024 × 1024      3.613532 × 10^-7    0.304905%
2048 × 2048      3.613532 × 10^-7    0.304905%
Table 6.10: Discrepancy in results between CPU and CUDA simulation of
the two-dimensional FDTD method.
The conversion to utilize CUDA for computation has not caused any
significant change in the FDTD results. The discrepancy is only around
0.3% and Figure 6.5 shows no noticeable difference in the plot of CPU and
CUDA results.
6.8 Alternative Absorbing Boundaries
During the development of the code, it became obvious that the PML used
to absorb the waves at the boundaries of the simulation space was complex.
The PMLs did not appear to be good candidates for running on CUDA and
thus, other absorbing boundary conditions (ABCs) were explored. Mur's
first order ABC, Mur's second order ABC and Liao's ABC were tested as
replacements for the PMLs.
Figure 6.6 shows that both versions of Mur’s ABC have better speed-ups
Figure 6.5: Plot of 2-D FDTD simulation results (magnitude against time-
step, CPU versus GPU) to compare accuracy between CPU and GPU. The
plots are generated from the results for simulation size 128 × 128. The
location of the probe is at cell (25, 25) as specified in Table 6.1.
Figure 6.6: Comparison of speed-up against simulation size (number of cells)
for various ABCs (PML, 1st order Mur, 2nd order Mur, Liao) in the two-
dimensional FDTD simulation running on CUDA.
Figure 6.7: Comparison of throughput (Mcells/s) against simulation size
(number of cells) for various ABCs (PML, 1st order Mur, 2nd order Mur,
Liao) in the two-dimensional FDTD simulation running on CUDA.
than the PML, while Liao's ABC has the lowest speed-up. However, this
does not mean that Liao's ABC is the slowest in absolute terms: Liao's ABC
still has a much higher throughput than the PML, as shown in Figure 6.7.
The reason Liao's ABC performs slower than Mur's ABCs is the use of
doubles instead of floats for the field variables; using floats causes Liao's
ABC to become unstable.
The results confirm that the PML's complexity significantly reduces the
performance of the FDTD simulations. The PML is widely used because its
accuracy is better than that of the other ABCs. However, the performance
benefit of using other ABCs on CUDA makes the PML less attractive. At
2048 × 2048 cells, the 2-D FDTD simulation using PMLs takes 22 seconds
to complete, while using Mur's second order ABC reduces that time to less
than four seconds. Furthermore, for Mur's second order ABC and Liao's
ABC, no significant discrepancy was noticed when compared to the PML.
6.9 Conclusions
In the one-dimensional FDTD method, speed-ups of over 20x and a through-
put of over 1,100 Mcells/s for CUDA were recorded. In the two-dimensional
FDTD method, similarly convincing results were achieved, with over 60x
speed-up and a throughput of over 1,400 Mcells/s. However, achieving those
results comes with increasing complexity and difficulty.
While a simple port to CUDA with almost no optimizations yielded sig-
nificant speed-ups in 1-D, the same cannot be said for 2-D. It became
obvious that programming in CUDA's framework requires many optimiza-
tions for optimal performance. It also became clear that the question of
whether it is worth taking the time to obtain these speed-ups has to be
answered.
Alternative ABCs were explored in this chapter because the PML is not
well-suited to run on CUDA. The throughput was increased by more than
five times when Mur’s second order ABC was used. Also, Mur’s second order
and Liao’s ABC did not show significant discrepancies compared to the PML
implementation.
7 3-D FDTD Results
7.1 Introduction
The three-dimensional FDTD method has six main equations as listed in
Section 2.4. The equations are converted into computer code as shown in
Listing 7.1.
ex[i][j][k] = ca[id] * ex[i][j][k] + cby[id] * (hz[i][j][k] - hz[i][j-1][k])
            - cbz[id] * (hy[i][j][k] - hy[i][j][k-1]);
ey[i][j][k] = ca[id] * ey[i][j][k] + cbz[id] * (hx[i][j][k] - hx[i][j][k-1])
            - cbx[id] * (hz[i][j][k] - hz[i-1][j][k]);
ez[i][j][k] = ca[id] * ez[i][j][k] + cbx[id] * (hy[i][j][k] - hy[i-1][j][k])
            - cby[id] * (hx[i][j][k] - hx[i][j-1][k]);
hx[i][j][k] = da[id] * hx[i][j][k] + dbz[id] * (ey[i][j][k+1] - ey[i][j][k])
            - dby[id] * (ez[i][j+1][k] - ez[i][j][k]);
hy[i][j][k] = da[id] * hy[i][j][k] + dbx[id] * (ez[i+1][j][k] - ez[i][j][k])
            - dbz[id] * (ex[i][j][k+1] - ex[i][j][k]);
hz[i][j][k] = da[id] * hz[i][j][k] + dby[id] * (ex[i][j+1][k] - ex[i][j][k])
            - dbx[id] * (ey[i+1][j][k] - ey[i][j][k]);
Listing 7.1: Main update equations for the 3-D FDTD method.
The flowchart in Figure 7.1 illustrates the main update loop of the FDTD
method for three-dimensional space.
For development and testing purposes, a Gaussian wave propagating from
a dipole antenna into free-space is simulated. After stable and fast speed-
ups were obtained on the CUDA GPU, a head model was then simulated.
The head model was chosen because it was the motivation of the project.
Simulations of the head model were taking a long time to complete using
CPU and thus, it is only appropriate that the findings of this project be used
to improve the execution time on the head model.
While the two-dimensional FDTD method was more challenging to work
with on CUDA than the one-dimensional FDTD method, the three-dimen-
sional FDTD method proves to be significantly more difficult. As will be
discussed in this chapter, the CUDA architecture does not support running
a large number of threads in three dimensions, and thus alternative methods
had to be used to work around this.
As was found in the two-dimensional FDTD simulations, the PML ab-
sorbing regions are not well suited to run on CUDA. In fact, it was found
that updating the PMLs was taking longer than updating the main grid.
Thus, alternative absorbing boundary conditions are explored and the re-
sults are presented at the end of this chapter.
7.2 Test Parameters
Table 7.1 shows the configuration used for the three-dimensional FDTD
simulations during development.
The original CPU execution times are shown in Table 7.2; these execution
times are used as the baseline for the calculation of speed-ups.
Figure 7.1: Flowchart for three-dimensional FDTD simulation. After ini-
tializing the model, each time-step updates the electric fields ex, ey and ez
in the main grid; updates ex, ey and ez in all PML regions; updates the
magnetic fields hx, hy and hz in the main grid; and updates hx, hy and hz
in all PML regions. The loop repeats until the final time-step, after which
the simulation ends.
Number of time steps 1,000
Center of dipole antenna (60, 60, 60)
Location of probe (60, 60, 60)
CUDA x-dimension of a block 16
CUDA y-dimension of a block 16
Simulation Sizes 128 × 128 × 128
160 × 160 × 160
192 × 192 × 192
224 × 224 × 224
256 × 256 × 256
Thickness of PML 4 cells
Table 7.1: Set-up for three-dimensional FDTD simulation.
Simulation Size CPU Execution Time (ms)
128 × 128 × 128 515,913.3910
160 × 160 × 160 947,819.1530
192 × 192 × 192 1,588,263.3060
224 × 224 × 224 2,452,357.4220
256 × 256 × 256 3,609,773.9260
Table 7.2: Results for three-dimensional FDTD simulation on CPU.
7.3 CUDA Block & Grid Configurations
In the one-dimensional FDTD method, the blocks and grids of the CUDA
kernel were configured in one dimension. Similarly, in the two-dimensional
FDTD method, the CUDA kernel was configured for two-dimensional blocks
and grids, because this simplifies programming the CUDA kernel. However,
this could not be extended to the three-dimensional FDTD method. At this
point, it is worth noting the limitations of CUDA's block-grid configuration,
as shown in Table 7.3.
Maximum x- or y-dimension of a grid of thread blocks 65,535
Maximum z-dimension of a grid of thread blocks 1
Maximum x- or y-dimension of a block 512
Maximum z-dimension of a block 64
Table 7.3: Limitations of the block and grid configuration for CUDA for
Compute Capability 1.x [6].
While CUDA does support three-dimensional blocks, the number of threads
in the z-dimension of a block is limited to only 64, compared to 512 threads
for the x- and y-dimensions. More importantly, CUDA does not support
three-dimensional grids.
To circumvent this limitation, a two-dimensional configuration of blocks
and threads is used. Then, within the CUDA kernel, a for-loop is used to
cycle through the z-dimension. An example of this is shown in Listing 7.2.
// Kernel definition
__global__ void kernel(float *odata, int size_x, int size_y, int size_z) {
    unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_x && yid < size_y ) {
        for ( int zid = 0; zid < size_z; zid++ ) {
            ...
            odata[index] = ...;
        }
    }
}

int main() {
    ...
    dim3 grid(ceil((float)size_x/16), ceil((float)size_y/16), 1);
    dim3 threads(16, 16, 1);

    // Kernel invocation
    kernel<<<grid, threads>>>(d_odata, size_x, size_y, size_z);
}
Listing 7.2: Looping through a three-dimensional array using two-
dimensional blocks and grids in CUDA.
7.4 Three-Dimensional Arrays In CUDA
In the two-dimensional FDTD simulation, cudaMallocPitch() was used to
allocate memory for the two-dimensional arrays. The use of
cudaMallocPitch() ensured that the arrays were properly aligned to meet
CUDA's requirements for coalesced memory access. For three-dimensional
arrays, the cudaMalloc3D() and make_cudaExtent() functions are used.
Listing 7.3 shows how these functions are used.
int main() {
    ...
    cudaPitchedPtr ptr;

    cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
    cudaMalloc3D(&ptr, extent);

    ...
}
Listing 7.3: Allocating a three-dimensional array in CUDA.
Further details of these functions are described in the CUDA Reference
Manual [20].
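A kernel then traverses such an allocation through the pitch recorded in the returned cudaPitchedPtr. The following is a minimal sketch following the Programming Guide's conventions; the kernel itself is illustrative, not the thesis code:

// Illustrative sketch: traversing a 3-D pitched allocation from cudaMalloc3D().
// ptr.pitch is the padded row width in bytes, so one x-y slice occupies
// slicePitch = ptr.pitch * height bytes.
__global__ void scale3d(cudaPitchedPtr ptr, int width, int height, int depth)
{
    char *base = (char*)ptr.ptr;
    size_t slicePitch = ptr.pitch * height;
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous dimension
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        for (int z = 0; z < depth; z++) {
            float *row = (float*)(base + z * slicePitch + y * ptr.pitch);
            row[x] *= 2.0f;
        }
    }
}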
7.5 Initial Run
Table 7.4 shows the results obtained from converting the main update loop
of Figure 7.1 to utilize CUDA.
At 256 × 256 × 256 cells, the throughput was calculated to be less than
19 Mcells/s. This is significantly less than the approximately 222 Mcells/s
achieved for the two-dimensional FDTD at the same number of cells
(4096 × 4096).
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 × 128 107,880.5770 4.7823
160 × 160 × 160 194,591.8580 4.8708
192 × 192 × 192 568,707.0920 2.7928
224 × 224 × 224 881,243.1030 2.7828
256 × 256 × 256 890,487.4880 4.0537
Table 7.4: Results for three-dimensional FDTD simulation on CUDA (initial
run).
However, it is worth noting that although few optimizations have been
done yet, CUDA takes less than 15 minutes to complete a simulation that
takes an hour on the CPU.
7.6 Visual Profiling
To improve on the speed-ups, the CUDA textual profiler was used. The
results were similar to those obtained in Chapter 6. The textual profiler
did not show any uncoalesced memory accesses and provided insufficient
information to determine where the bottleneck was.
Later in the project, it was discovered that although Emory does not
have a GUI natively, the Visual Profiler could be executed using X11
forwarding. A secondary computer was used to connect to Emory through
an SSH client with X11 forwarding enabled, which allowed the CUDA
Visual Profiler to display on the secondary computer.
The Visual Profiler proved to be very useful and immediately the bottle-
neck for the three-dimensional FDTD method was found. The Visual Profiler
showed that the global memory load and global memory store efficiencies
were just 0.06 for most of the CUDA kernels. The profiling results for the
six main kernels are shown in Table 7.5.
On the other hand, when the Visual Profiler was executed for the two-
dimensional FDTD method developed in Chapter 6, the results showed global
memory load and global memory store efficiencies that were close to one.
Method     gld_efficiency  gst_efficiency
ex_cuda_d  0.0630682       0.0630686
ey_cuda_d  0.0613636       0.0613641
ez_cuda_d  0.0630682       0.0630686
hx_cuda_d  0.06            0.0600002
hy_cuda_d  0.065625        0.0656253
hz_cuda_d  0.061875        0.0618752
Table 7.5: Results from visual profiling on three-dimensional FDTD simula-
tion.
These results were unexpected because the arrays were allocated using
cudaMalloc3D() as recommended by the CUDA Programming Guide [6] and
the CUDA Best Practices Guide [13].
7.7 Memory Coalescing for 3-D Arrays
The CUDA documentation [6, 13, 20] was consulted, but there was no
mention of how to index three-dimensional arrays in the kernel for coalesced
memory accesses. The NVIDIA forums were not helpful in this matter
either.
However, by slowly analysing and debugging the CUDA kernel, a simple
solution was found for indexing arrays in the kernel while maintaining
aligned accesses to global memory. The only limitation of this solution is
that the dimensions of the threads in a block must be the same; specifically,
the x- and y-dimensions of a block must be equal.
This solution was a huge milestone in the development of the three-
dimensional code and the speed-ups achieved are listed in Table 7.6. Figures
7.2 and 7.3 show the plots for the speed-up and throughput respectively.
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 × 128 25,842.8480 19.9635
160 × 160 × 160 35,609.7410 26.6169
192 × 192 × 192 83,756.5080 18.9629
224 × 224 × 224 142,680.5420 17.1877
256 × 256 × 256 126,128.8680 28.6197
Table 7.6: Results for three-dimensional FDTD simulation on CUDA (with
coalesced memory access).
A throughput of over 130 Mcells/s and a speed-up of over 28x were achieved
for a simulation size of 16 Mcells. While CUDA's execution time for the
three-dimensional FDTD is significantly better than that of the CPU, the
throughput is still much lower than that of the two-dimensional FDTD
executed on CUDA. Again, the low throughput can be attributed to the
use of PMLs as absorbing regions. Thus, alternative absorbing boundary
conditions are used in place of the PMLs for comparison; this is discussed
in the following section.
The Visual Profiler was executed on the improved kernels to check the
efficiency of the global memory accesses. The results are shown in Table 7.7.
Comparing the results of Table 7.5 and Table 7.7, it is obvious that there is
Figure 7.2: Speed-up against simulation size (number of cells) for the three-
dimensional FDTD simulation.
Figure 7.3: Throughput (Mcells/s) against simulation size (number of cells)
for the three-dimensional FDTD simulation running on CUDA.
a huge improvement in the memory accesses of the new kernels.
In Table 7.7, some efficiencies exceed 1. This could be due to inaccuracies
in the Visual Profiler program; the CUDA Compute Visual Profiler User
Guide states that the efficiency should be between 0 and 1.
Method     gld_efficiency  gst_efficiency
ex_cuda_d  0.81388         1.01576
ey_cuda_d  0.785047        0.984379
ez_cuda_d  0.75882         1.00782
hx_cuda_d  0.857143        0.666667
hy_cuda_d  0.847646        0.658067
hz_cuda_d  0.941704        0.681822
Table 7.7: Results from visual profiling on three-dimensional FDTD simula-
tion (with new indexing of arrays in kernel).
Also, the plots of Figure 7.2 and Figure 7.3 show an irregularity in per-
formance with respect to simulation size. In the one- and two-dimensional
FDTD simulations, the trend is an increase in throughput and speed-up as
the simulation size increases. However, this trend did not extend to the
three-dimensional FDTD simulations. The results generated from the Visual
Profiler showed no obvious reason for this occurrence, so its cause still needs
to be investigated.
7.8 Discrepancy In Results
The same tests for discrepancy between the CPU and GPU as in the one-
dimensional FDTD simulation, detailed in Section 5.4, are performed here.
The results are shown in Table 7.8 below.
Simulation Size   Difference          Percentage change
128 × 128 × 128   2.980232 × 10^-7    0.113527%
160 × 160 × 160   2.980232 × 10^-7    0.018938%
192 × 192 × 192   2.980232 × 10^-7    0.136416%
224 × 224 × 224   2.980232 × 10^-7    0.079594%
256 × 256 × 256   2.980232 × 10^-7    0.068311%
Table 7.8: Discrepancy in results between CPU and CUDA simulation of the
three-dimensional FDTD method.
While the CUDA framework does not fully conform to the IEEE 754-
2008 standard for floating-point computation (as discussed in Section 3.2),
the results show that the discrepancy between CUDA and the CPU is
relatively small. The speed-ups and throughput that can be achieved by
CUDA are more than sufficient to justify disregarding the small discrepan-
cies. This was observed in one-, two- and three-dimensional FDTD sim-
ulations alike. Moreover, newer CUDA GPUs are designed to conform to
the IEEE 754-2008 standard.
Figure 7.4 and Figure 7.5 show plots of the data generated from the
simulation of 160 × 160 × 160 cells. The plots show that the discrepancy
between the CPU and CUDA is neither significant nor noticeable.
7.9 Alternative Absorbing Boundaries
Berenger’s PMLs have been used as the absorbing region for the three-
dimensional FDTD simulation up until this point. Just like in the case of
the two-dimensional FDTD method, it was found that the PMLs were not
particularly well-suited for running in CUDA. Thus, alternative absorbing
Figure 7.4: Plot of 3-D FDTD simulation results (magnitude against time-
step, CPU versus GPU) to compare accuracy between CPU and GPU. The
plots are generated from the results for simulation size 160 × 160 × 160
cells. The location of the probe is at cell (60, 60, 60) as specified in Table
7.1.
Figure 7.5: Plot of 3-D FDTD simulation results (magnitude against time-
step, CPU versus GPU) to compare accuracy between CPU and GPU. The
plots are generated from the results for simulation size 160 × 160 × 160
cells. The location of the probe is at cell (60, 60, 60) as specified in Table
7.1. The plot is centered between time-steps 300 and 350.
boundary conditions were used to investigate how much more speed-up and
throughput can be achieved.
The three ABCs used are Mur's first order ABC, Mur's second order ABC
and Liao's ABC, similar to what was done for the two-dimensional FDTD
simulation in Section 6.8. The equations for the ABCs are shown below.
The equations assume that the boundary is at x = 0. Although only Ez is
considered, the equations have to be applied to Ex and Ey as well.
Mur's 1st order ABC:

E_z\big|^{n+1}_{0,j,k} = E_z\big|^{n}_{1,j,k}
  + \frac{c\Delta t - \Delta x}{c\Delta t + \Delta x}
    \left( E_z\big|^{n+1}_{1,j,k} - E_z\big|^{n}_{0,j,k} \right)    (7.1)
Mur's 2nd order ABC:

E_z\big|^{n+1}_{0,j,k} = E_z\big|^{n-1}_{1,j,k} + EQ_1 + EQ_2 + EQ_3 + EQ_4    (7.2)

EQ_1 = \frac{c\Delta t - \Delta x}{c\Delta t + \Delta x}
       \left( E_z\big|^{n+1}_{1,j,k} - E_z\big|^{n-1}_{0,j,k} \right)    (7.3)

EQ_2 = \frac{2\Delta x}{c\Delta t + \Delta x}
       \left( E_z\big|^{n}_{0,j,k} - E_z\big|^{n}_{1,j,k} \right)    (7.4)

EQ_3 = \frac{\Delta x (c\Delta t)^2}{2(\Delta y)^2 (c\Delta t + \Delta x)} (C_a + C_b)    (7.5)

EQ_4 = \frac{\Delta x (c\Delta t)^2}{2(\Delta z)^2 (c\Delta t + \Delta x)} (C_c + C_d)    (7.6)

C_a = E_z\big|^{n}_{0,j+1,k} - 2E_z\big|^{n}_{0,j,k} + E_z\big|^{n}_{0,j-1,k}    (7.7)

C_b = E_z\big|^{n}_{1,j+1,k} - 2E_z\big|^{n}_{1,j,k} + E_z\big|^{n}_{1,j-1,k}    (7.8)

C_c = E_z\big|^{n}_{0,j,k+1} - 2E_z\big|^{n}_{0,j,k} + E_z\big|^{n}_{0,j,k-1}    (7.9)

C_d = E_z\big|^{n}_{1,j,k+1} - 2E_z\big|^{n}_{1,j,k} + E_z\big|^{n}_{1,j,k-1}    (7.10)
77
Page 91
7.9. ALTERNATIVE ABSORBING BOUNDARIES
Liao's ABC:

E_z\big|^{n+1}_{0,j,k} = \sum_{m=1}^{N} (-1)^{m+1} C_N^m \,
    E_z\big|^{\,n+1-m}_{\,mc\Delta t,\,j,\,k}    (7.11)

where N is the order of the boundary condition and C_N^m is the binomial
coefficient given by:

C_N^m = \frac{N!}{m!\,(N-m)!}    (7.12)

For example, a second order (N = 2) Liao boundary uses C_2^1 = 2 and
C_2^2 = 1.
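To give a sense of why these ABCs map onto CUDA more readily than the PML, the following is a minimal sketch of a kernel applying Eq. 7.1 on the x = 0 face. The names, array layout and the ez_old buffer (a copy of the Ez field at time n) are assumptions for illustration, not the thesis code:

// Illustrative sketch: Mur's first order ABC (Eq. 7.1) on the x = 0 face,
// launched after the interior E-field update. ez holds the fields at time
// n+1, ez_old holds the fields at time n, pitch_f is the padded row length
// in floats, and coef = (c*dt - dx)/(c*dt + dx) is precomputed on the host.
__global__ void mur1_x0(float *ez, const float *ez_old, float coef,
                        int pitch_f, int ny, int nz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous dimension
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (j < ny && k < nz) {
        int i0 = (0 * ny + j) * pitch_f + k;        // Ez(0, j, k)
        int i1 = (1 * ny + j) * pitch_f + k;        // Ez(1, j, k)
        ez[i0] = ez_old[i1] + coef * (ez[i1] - ez_old[i0]);
    }
}

Each boundary cell is updated independently of its neighbours, so the kernel parallelises trivially; the PML, by contrast, requires extra split-field arrays and per-region bookkeeping in every pass.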
The speed-up and throughput for the three-dimensional FDTD simulation
on CUDA are shown in Figure 7.6 and Figure 7.7 respectively.
It is apparent from the results that the irregularity in performance across
simulation sizes noted with the PMLs still exists when other ABCs are
used. However, Mur's ABCs performed better than the PMLs. It must also
be noted that Liao's ABC was implemented using doubles while the other
ABCs were implemented using floats. From the results, it can be argued
that Liao's ABC performed reasonably well, which could be attributed to
the simplicity of Liao's ABC compared to the PML and Mur's ABCs.
All three types of ABCs produced results that were reasonably similar
to those of the PML. Also, the PML width used for the simulations is only
4 cells, which should be taken into consideration when analysing the dis-
crepancy: if the width of the PML region were increased, there would be
less reflection at the boundaries, but the execution time would increase.
Figure 7.6: Comparison of speed-up against simulation size (number of cells)
for various ABCs (PML, 1st order Mur, 2nd order Mur, Liao) in the three-
dimensional FDTD simulation running on CUDA.
Figure 7.7: Comparison of throughput (Mcells/s) against simulation size
(number of cells) for various ABCs (PML, 1st order Mur, 2nd order Mur,
Liao) in the three-dimensional FDTD simulation running on CUDA.
Figure 7.8: Plot of 3-D FDTD simulation results (magnitude against time-
step for the PML, 1st order Mur, 2nd order Mur and Liao boundaries) to
compare accuracy between the various ABCs. The plots are generated from
the results for simulation size 160 × 160 × 160 cells. The location of the
probe is at cell (60, 60, 60) as specified in Table 7.1.
Figure 7.9: Plot of 3-D FDTD simulation results (magnitude against time-
step for the PML, 1st order Mur, 2nd order Mur and Liao boundaries) to
compare accuracy between the various ABCs. The plots are generated from
the results for simulation size 160 × 160 × 160 cells. The location of the
probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered
between time-steps 400 and 700.
7.10 Conclusions
Porting the FDTD method to execute on CUDA produced significant speed-
ups. However, obtaining optimal performance was significantly more difficult
than for the one- and two-dimensional FDTD simulations. One reason for
this is that CUDA does not support three-dimensional kernel configurations
with a large number of threads.
Although the throughput achieved was less than that of the one- and
two-dimensional code, the speed-up achieved is encouraging. For example,
it takes the CPU more than an hour to complete the simulation for 256 ×
256× 256 cells but CUDA takes just over two minutes to complete the same
simulation.
Also, as was discovered with the two-dimensional FDTD simulations,
there are other ABCs that perform better on CUDA compared to the PMLs.
The PMLs are not as well-suited to leverage CUDA’s parallel architecture.
8 Conclusions
8.1 Thesis Conclusions
The thesis finds that the FDTD method is very well suited to run on par-
allel architectures such as CUDA. Throughputs of over 1,000 Mcells/s can
be achieved on CUDA where the CPU only manages 4 Mcells/s. However,
achieving these speed-ups can be difficult; it is in fact significantly more
difficult to write code that runs optimally on CUDA than on the CPU.
Most of the problems faced during development were related to memory
accesses on CUDA. There is various documentation and there are tools
available to developers, but occasionally these resources are insufficient.
However, it is the author's opinion that the difficulty of producing optimal
CUDA programs is not a strong deterrent from using CUDA to execute the
FDTD method. Taking into perspective the results that have been achieved
in the thesis, the use of CUDA is without doubt a good choice. It is hard
to imagine waiting an hour for the CPU to complete a simulation of
256 × 256 × 256 cells when it can be done in just over two minutes using
CUDA.
It should also be noted that not only are the computational capabilities of
GPUs continuing to rise at a faster pace compared to CPUs [8], the CUDA
framework is also consistently being improved. With these improvements, it
is almost certain that programming on CUDA will become easier and higher
speed-ups will be achieved in the near future.
As for the progress of the thesis, the simulations for the three-dimensional
FDTD method took longer than expected due to unforeseen complexity. As
development progressed, it was found that experimenting with alternative
ABCs in place of the PMLs would be beneficial; thus, Mur's first order,
Mur's second order and Liao's ABCs were implemented as advanced work
for the thesis.
8.2 Future Work
This thesis explored the capabilities and prospects of using CUDA as an
alternative to CPUs, and the implementation of the CUDA program was
therefore kept relatively straightforward. There is still a lot of work that
can be done, such as investigating the use of shared memory and texture
memory. While the thesis focused on the higher-priority recommendations
of the CUDA Best Practices Guide [13], there are numerous other optimiza-
tions that can be explored to produce a more optimal CUDA program.
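As a pointer for that future work, a 2-D stencil update typically stages a tile of the neighbouring field in shared memory so that threads in a block re-use each other's loads. The following is a minimal sketch under assumed names (ex, hz, scalar coefficients ca and cb), not a tested FDTD implementation:

#define TILE 16

// Illustrative sketch: each block stages a TILE x (TILE+1) patch of hz,
// including one halo column, in shared memory, so the hz[i][j-1] neighbour
// is served from shared memory instead of a second global load.
__global__ void update_ex_shared(float *ex, const float *hz,
                                 float ca, float cb,
                                 int pitch_f, int ie, int jb)
{
    __shared__ float s_hz[TILE][TILE + 1];
    int j = blockIdx.x * TILE + threadIdx.x;   // contiguous dimension
    int i = blockIdx.y * TILE + threadIdx.y;

    if (i < ie && j < jb) {
        int idx = i * pitch_f + j;
        s_hz[threadIdx.y][threadIdx.x + 1] = hz[idx];
        if (threadIdx.x == 0 && j > 0)         // load the halo column j-1
            s_hz[threadIdx.y][0] = hz[idx - 1];
    }
    __syncthreads();

    if (i < ie && j > 0 && j < jb) {
        int idx = i * pitch_f + j;
        ex[idx] = ca * ex[idx]
                + cb * (s_hz[threadIdx.y][threadIdx.x + 1]
                      - s_hz[threadIdx.y][threadIdx.x]);
    }
}

Whether this pays off for the FDTD updates, which touch each field value only once or twice per time-step, is exactly the kind of question this future work would answer.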
For the FDTD method on CUDA, the use of Mur’s and Liao’s ABCs can
definitely reduce simulation times compared to using PMLs. This is another
topic that can be studied in the future.
Although the focus of this thesis is on NVIDIA's CUDA architecture,
there are other parallel architectures on the market today. GPU manu-
facturers including NVIDIA and AMD support the Open Computing Lan-
guage (OpenCL), which is a royalty-free standard for parallel computing
[21, 22]. Research into OpenCL and comparisons with CUDA can also be
done as future work.
Bibliography
[1] Kane S. Yee. Numerical solution of initial boundary value problems
involving Maxwell's equations in isotropic media. Antennas and Propa-
gation, IEEE Transactions on, 14(3):302–307, 1966.
[2] K. L. Shlager and J. B. Schneider. A selective survey of the finite-
difference time-domain literature. Antennas and Propagation Magazine,
IEEE, 37(4):39–57, 1995.
[3] L. M. Angelone, S. Tulloch, G. Wiggins, S. Iwaki, N. Makris, and
G. Bonmassar. New high resolution head model for accurate electro-
magnetic field computation. In ISMRM Thirteenth Scientific Meeting,
page 881, Miami, FL, USA, 2005.
[4] Wenhua Yu et al. Parallel Finite-Difference Time-Domain Method.
Artech House electromagnetic analysis series. Artech House, Boston,
MA, 2006.
[5] Dennis M. Sullivan. Electromagnetic Simulation Using the FDTD
Method. IEEE Press series on RF and microwave technology. IEEE
Press, New York, 2000.
[6] NVIDIA. CUDA C Programming Guide Version 3.1.1, 2010.
[7] NVIDIA. Fermi Compute Architecture White Paper, 2009.
[8] Ong Cen Yen, M. Weldon, S. Quiring, L. Maxwell, M. Hughes, C. Whe-
lan, and M. Okoniewski. Speed it up. Microwave Magazine, IEEE,
11(2):70–78, 2010.
[9] NVIDIA. Tesla C2050/C2070 GPU computing processor at 1/10th the
cost, 2010.
[10] Karl E. Hillesland and Anselmo Lastra. GPU floating-point paranoia,
2004.
[11] United States General Accounting Office. Patriot missile defense: Soft-
ware problem led to system failure at Dhahran, Saudi Arabia, February
4, 1992.
[12] Robert Sedgewick and Kevin Daniel Wayne. Floating point, 2010.
[13] NVIDIA. CUDA C Best Practices Guide Version 3.1, 2010.
[14] Paulius Micikevicius. 3D finite difference computation on GPUs using
CUDA, 2009.
[15] James F. Stack. Accelerating the finite difference time domain (FDTD)
method with CUDA, 2010.
[16] Alvaro Valcarce and Jie Zhang. Implementing a 2D FDTD scheme with
CPML on a GPU using CUDA, 2010.
[17] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Mor-
ton, E. Phillips, Zhang Yao, and V. Volkov. Parallel computing experi-
ences with CUDA. Micro, IEEE, 28(4):13–27, 2008.
[18] NVIDIA. Compute Visual Profiler, 2010.
[19] Rob Farber. CUDA, supercomputing for the masses: Part 6, 2008.
[20] NVIDIA. CUDA Reference Manual 3.1, 2010.
[21] Khronos Group. Khronos launches heterogeneous computing initiative,
2008.
[22] Khronos Group. The Khronos Group releases OpenCL 1.0 specification,
2008.
[23] S. Adams, J. Payne, and R. Boppana. Finite difference time do-
main (FDTD) simulations using graphics processors. In DoD High Per-
formance Computing Modernization Program Users Group Conference,
2007, pages 334–338, 2007.
[24] G. Cummins, R. Adams, and T. Newell. Scientific computation through
a GPU. In Southeastcon, 2008. IEEE, pages 244–246, 2008.
[25] Atef Z. Elsherbeni and Veysel Demir. The Finite-Difference Time-
Domain Method for Electromagnetics with MATLAB Simulations.
SciTech Pub., Raleigh, NC, 2009.
[26] C. D. Moss, F. L. Teixeira, and Kong Jin Au. Analysis and compensa-
tion of numerical dispersion in the FDTD method for layered, anisotropic
media. Antennas and Propagation, IEEE Transactions on, 50(9):1174–
1184, 2002.
[27] NVIDIA. CUDA Architecture Introduction & Overview, 2009.
[28] Robert Sedgewick and Kevin Daniel Wayne. Introduction to Program-
ming in Java: An Interdisciplinary Approach. Pearson Addison-Wesley,
Boston, 2008.
[29] Allen Taflove. Computational Electrodynamics: The Finite-Difference
Time-Domain Method. Artech House, Boston, 1995.
[30] F. Zheng and Z. Chen. Numerical dispersion analysis of the uncondi-
tionally stable 3-D ADI-FDTD method. Microwave Theory and Tech-
niques, IEEE Transactions on, 49(5):1006–1009, 2001.
A Emory Specifications
These specifications are obtained by executing the ‘deviceQuery’ program
bundled together with the NVIDIA GPU Computing SDK.
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294770688 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads
                                                 can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No

Device 1: "Tesla T10 Processor"
  (Identical specifications to Device 0.)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10,
CUDA Runtime Version = 3.10, NumDevs = 2, Device = Tesla T10
Processor, Device = Tesla T10 Processor
B Uncoalesced Memory Access Test
/*
 * Test cudaMallocPitch() usage.
 *
 * ./binary <size_x> <size_y>
 */

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>

#define DIM_SIZE 16

#define NUM_REPS 10

__global__ void copy(float *odata, float *idata, int pitch,
                     int size_x, int size_y) {
    unsigned int yid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int xid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_y && yid < size_x ) {
        int index = ( yid * pitch / sizeof(float) ) + xid;
        odata[index] = idata[index] * 2.0;
    }
}

// http://www.drdobbs.com/high-performance-computing/207603131
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( err != cudaSuccess )
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main (int argc, char** argv) {
    int size_x = 64;
    int size_y = 64;

    if ( argc > 2 ) {
        sscanf(argv[1], "%d", &size_x);
        sscanf(argv[2], "%d", &size_y);
    }

    printf("size_x: %d\n", size_x);
    printf("size_y: %d\n", size_y);

    // execution configuration parameters
    dim3 grid(ceil((float)size_x/DIM_SIZE), ceil((float)size_y/DIM_SIZE), 1),
         threads(DIM_SIZE, DIM_SIZE, 1);

    // size of memory required to store the matrix
    const int mem_size = sizeof(float) * size_x * size_y;

    // allocate host memory
    float *h_idata = (float*) malloc(mem_size);
    float *h_odata = (float*) malloc(mem_size);

    // allocate device memory
    float *d_idata, *d_odata;
    size_t pitch;
    cudaMallocPitch((void**)&d_idata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");
    cudaMallocPitch((void**)&d_odata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");

    printf("dpitch: %d\n", (int)pitch);

    // initialize host data
    for ( int i = 0; i < size_x*size_y; i++ ) {
        h_idata[i] = (float)i;
        h_odata[i] = 0.0f;
    }

    // copy host data to device
    cudaMemcpy2D(d_idata, pitch, h_idata, size_y * sizeof(float),
                 size_y * sizeof(float), size_x, cudaMemcpyHostToDevice);
    checkCUDAError("cudaMemcpy2D H to D");

    for ( int i=0; i < NUM_REPS; i++) {
        copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
    }

    // copy device data to host
    cudaMemcpy2D(h_odata, size_y * sizeof(float), d_odata, pitch,
                 size_y * sizeof(float), size_x, cudaMemcpyDeviceToHost);
    checkCUDAError("cudaMemcpy2D D to H");

    for ( int k = 0; k < size_x * size_y; k++ ) {
        if ( h_odata[k] != h_idata[k] * 2.0 ) {
            printf("Mismatch!\n");
            printf("h_idata[%d] = %f\n", k, h_idata[k]);
            printf("h_odata[%d] = %f\n", k, h_odata[k]);

            printf("---result---\n");
            for ( int i = 0; i < size_x; i++ ) {
                for ( int j = 0; j < size_y; j++ ) {
                    printf("%d ", (int)h_odata[i*size_y+j]);
                }
                printf("\n");
            }
            break;
        }
    }

    free(h_idata);
    free(h_odata);

    cudaFree(d_idata);
    cudaFree(d_odata);

    printf("Completed.\n");
    return 0;
}
Listing B.1: Source code for testing uncoalesced memory access.
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla T10 Processor
# CUDA_PROFILE_CSV 1

method         gputime   cputime  gld_coherent  gld_incoherent  gst_coherent  gst_incoherent
memcpyHtoD     3000.768  3602
Z4copyPfS_iii  7579.68   7635     1049600       0               2099200       0
Z4copyPfS_iii  7488.16   7508     1049600       0               2099200       0
Z4copyPfS_iii  7576.928  7597     1047040       0               2094080       0
Z4copyPfS_iii  7628.576  7647     1049600       0               2099200       0
Z4copyPfS_iii  7511.2    7530     1047040       0               2094080       0
Z4copyPfS_iii  7417.344  7436     1049600       0               2099200       0
Z4copyPfS_iii  7609.152  7629     1049600       0               2099200       0
Z4copyPfS_iii  7517.76   7536     1047040       0               2094080       0
Z4copyPfS_iii  7514.816  7530     1049600       0               2099200       0
Z4copyPfS_iii  7563.744  7580     1047040       0               2094080       0
memcpyDtoH     2666.496  3253

Table B.1: CUDA Textual Profiler results from testing uncoalesced memory
access.
C Coalesced Memory Access Test
/*
 * Test cudaMallocPitch() usage.
 *
 * ./binary <size_x> <size_y>
 */

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>

#define DIM_SIZE 16

#define NUM_REPS 10

__global__ void copy(float *odata, float *idata, int pitch,
                     int size_x, int size_y) {
    unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_y && yid < size_x ) {
        int index = ( yid * pitch / sizeof(float) ) + xid;
        odata[index] = idata[index] * 2.0;
    }
}

// http://www.drdobbs.com/high-performance-computing/207603131
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( err != cudaSuccess )
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main (int argc, char** argv) {
    int size_x = 64;
    int size_y = 64;

    if ( argc > 2 ) {
        sscanf(argv[1], "%d", &size_x);
        sscanf(argv[2], "%d", &size_y);
    }

    printf("size_x: %d\n", size_x);
    printf("size_y: %d\n", size_y);

    // execution configuration parameters
    dim3 grid(ceil((float)size_y/DIM_SIZE), ceil((float)size_x/DIM_SIZE), 1),
         threads(DIM_SIZE, DIM_SIZE, 1);

    // size of memory required to store the matrix
    const int mem_size = sizeof(float) * size_x * size_y;

    // allocate host memory
    float *h_idata = (float*) malloc(mem_size);
    float *h_odata = (float*) malloc(mem_size);

    // allocate device memory
    float *d_idata, *d_odata;
    size_t pitch;
    cudaMallocPitch((void**)&d_idata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");
    cudaMallocPitch((void**)&d_odata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");

    printf("dpitch: %d\n", (int)pitch);

    // initialize host data
    for ( int i = 0; i < size_x*size_y; i++ ) {
        h_idata[i] = (float)i;
        h_odata[i] = 0.0f;
    }

    // copy host data to device
    cudaMemcpy2D(d_idata, pitch, h_idata, size_y * sizeof(float),
                 size_y * sizeof(float), size_x, cudaMemcpyHostToDevice);
    checkCUDAError("cudaMemcpy2D H to D");

    for ( int i=0; i < NUM_REPS; i++) {
        copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
    }

    // copy device data to host
    cudaMemcpy2D(h_odata, size_y * sizeof(float), d_odata, pitch,
                 size_y * sizeof(float), size_x, cudaMemcpyDeviceToHost);
    checkCUDAError("cudaMemcpy2D D to H");

    for ( int k = 0; k < size_x * size_y; k++ ) {
        if ( h_odata[k] != h_idata[k] * 2.0 ) {
            printf("Mismatch!\n");
            printf("h_idata[%d] = %f\n", k, h_idata[k]);
            printf("h_odata[%d] = %f\n", k, h_odata[k]);

            printf("---result---\n");
            for ( int i = 0; i < size_x; i++ ) {
                for ( int j = 0; j < size_y; j++ ) {
                    printf("%d ", (int)h_odata[i*size_y+j]);
                }
                printf("\n");
            }
            break;
        }
    }

    free(h_idata);
    free(h_odata);

    cudaFree(d_idata);
    cudaFree(d_odata);

    printf("Completed.\n");
    return 0;
}
Listing C.1: Source code for testing coalesced memory access.
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla T10 Processor
# CUDA_PROFILE_CSV 1

method         gputime   cputime  gld_coherent  gld_incoherent  gst_coherent  gst_incoherent
memcpyHtoD     3002.436  15s
Z4copyPfS_iii  161.792   201      6560          0               26240         0
Z4copyPfS_iii  161.92    190      6560          0               26240         0
Z4copyPfS_iii  159.456   178      6544          0               26176         0
Z4copyPfS_iii  161.472   181      6560          0               26240         0
Z4copyPfS_iii  162.272   181      6544          0               26176         0
Z4copyPfS_iii  161.152   177      6560          0               26240         0
Z4copyPfS_iii  160.224   179      6560          0               26240         0
Z4copyPfS_iii  160.576   176      6544          0               26176         0
Z4copyPfS_iii  160.32    179      6560          0               26240         0
Z4copyPfS_iii  162.112   190      6544          0               26176         0
memcpyDtoH     2678.368  3280

Table C.1: CUDA Textual Profiler results from testing coalesced memory
access.