Finite-Difference Time-Domain Method Implemented on the CUDA Architecture
Wei Chern TEE
School of Information Technology and Electrical Engineering
University of Queensland
Submitted for the degree of Bachelor of Engineering (Honours)
in the division of Electrical and Electronic Engineering
June 2011
Statement of Originality
June 3, 2011
Head of School
School of Information Technology and Electrical Engineering
University of Queensland
St Lucia, Q 4072
Dear Professor Paul Strooper,
In accordance with the requirements of the degree of Bachelor of Engineering (Honours) in the division of Electrical and Electronic Engineering, I present the following thesis entitled “Finite-Difference Time-Domain Method Implemented on the CUDA Architecture”. This work was performed under the supervision of Dr. David Ireland.
I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.
Yours sincerely,
—————————
Wei Chern TEE
Abstract
The finite-difference time-domain (FDTD) method is a numerical method
that is relatively simple, robust and accurate. Moreover, it lends itself well
to a parallel implementation. Modern FDTD simulations however, are often
time-consuming and can take days or months to complete depending on the
complexity of the problem. A potential way of reducing this simulation time
is in the use of graphical processor units (GPUs). This thesis thus studies
the challenges of using GPUs for solving the FDTD algorithm.
GPUs are no longer used just to render graphics. Recent advancements in
the use of GPUs for general-purpose and scientific computing have sparked an
interest in the use of FDTD. New graphics processors such as NVIDIA CUDA
GPUs provide a cost-effective alternative to traditional supercomputers and
cluster computers. The parallel nature of the FDTD algorithm coupled with
the use of GPUs can potentially and significantly reduce simulation time
compared to the CPU.
The focus of the thesis is to utilize NVIDIA CUDA GPUs to implement
the FDTD method. A brief study is done on CUDAs architecture and how
it is capable of reducing the FDTD simulation time. The thesis examines
implementations of the FDTD in one, two and three dimensions using CUDA
and the CPU. Comparisons of code-complexity, accuracy and simulation time
are made in order to provide substantial arguments for concluding whether
the implementation of FDTD on CUDA is beneficial. In summary, speed-ups
of over 20x, 60x and 50x were achieved for one-, two- and three-dimensions
respectively. However, there are challenges involved in using CUDA which
are investigated in the thesis.
Acknowledgements
I would like to acknowledge my supervisor, Dr. David Ireland
especially for his patience and guidance.
I would also like to acknowledge Dr. Konstanty Bialkowski
for his ideas and help.
To my parents, without whom I would be sorely put.
Thank you.
Special thanks to
Ahmad Faiz, Tan Jon Wen, Christina Lim, Anne-Sofie Pederson,
Franciss Chuah, Kenny Heng and Lee Kam Heng.
For friendship.
Contents

1 Introduction
  1.1 Thesis Introduction
  1.2 Motivation
  1.3 Significance
  1.4 Thesis Outline
    1.4.1 Chapter 2
    1.4.2 Chapter 3
    1.4.3 Chapter 4
    1.4.4 Chapter 5
    1.4.5 Chapter 6
    1.4.6 Chapter 7
2 Finite-Difference Time-Domain (FDTD)
  2.1 Introduction
  2.2 One-Dimensional FDTD Equations
  2.3 Two-Dimensional FDTD Equations
  2.4 Three-Dimensional FDTD Equations
3 Compute Unified Device Architecture (CUDA)
  3.1 Introduction
  3.2 Computation Capability of Graphics Processing Units
  3.3 Memory Structure
  3.4 Conclusions
4 Literature Review
  4.1 Summary
5 1-D FDTD Results
  5.1 Introduction
  5.2 Test Platform
  5.3 Results
  5.4 Discrepancy In Results
  5.5 Conclusions
6 2-D FDTD Results
  6.1 Introduction
  6.2 Test Parameters
  6.3 Initial Run
  6.4 Updating PML Using CUDA
  6.5 Compute Profiler
  6.6 CUDA Memory Restructuring
  6.7 Discrepancy In Results
  6.8 Alternative Absorbing Boundaries
  6.9 Conclusions
7 3-D FDTD Results
  7.1 Introduction
  7.2 Test Parameters
  7.3 CUDA Block & Grid Configurations
  7.4 Three-Dimensional Arrays In CUDA
  7.5 Initial Run
  7.6 Visual Profiling
  7.7 Memory Coalescing for 3-D Arrays
  7.8 Discrepancy In Results
  7.9 Alternative Absorbing Boundaries
  7.10 Conclusions
8 Conclusions
  8.1 Thesis Conclusions
  8.2 Future Work
Bibliography
A Emory Specifications
B Uncoalesced Memory Access Test
C Coalesced Memory Access Test
List of Figures

2.1 Position of the electric and magnetic fields in Yee’s scheme. (a) Electric element. (b) Relationship between the electric and magnetic elements. Source: [4].
3.1 A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores. Source: [6].
3.2 The GPU devotes more transistors to data processing. Source: [6].
3.3 Hierarchy of threads, blocks and grid in CUDA. Source: [6].
3.4 Growth in single precision computing capability of NVIDIA’s GPUs compared to Intel’s CPUs. Source: [8].
3.5 Hierarchy of various types of memory in CUDA. Source: [6].
5.1 Speed-up for one-dimensional FDTD simulation.
5.2 Throughput for one-dimensional FDTD simulation running on CUDA.
5.3 Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30.
5.4 Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30. The plot is centered between time-steps 50 and 90.
6.1 Flowchart for two-dimensional FDTD simulation.
6.2 Screen-shot of Visual Profiler GUI from CUDA Toolkit 3.2 running on Windows 7. The data in the screen-shot is imported from the results of the memory test in Listing 6.4. These results are also available in Appendix B.
6.3 Simple program to determine advantages of cudaMallocPitch().
6.4 Throughput for two-dimensional FDTD simulation running on CUDA.
6.5 Plot of 2-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 128 × 128. The location of the probe is at cell (25, 25) as specified in Table 6.1.
6.6 Comparison of speed-up for various ABCs in two-dimensional FDTD simulation running on CUDA.
6.7 Comparison of throughput for various ABCs in two-dimensional FDTD simulation running on CUDA.
7.1 Flowchart for three-dimensional FDTD simulation.
7.2 Speed-up for three-dimensional FDTD simulation.
7.3 Throughput for three-dimensional FDTD simulation running on CUDA.
7.4 Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.
7.5 Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered between time-steps 300 and 350.
7.6 Comparison of speed-up for various ABCs in three-dimensional FDTD simulation running on CUDA.
7.7 Comparison of throughput for various ABCs in three-dimensional FDTD simulation running on CUDA.
7.8 Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.
7.9 Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the result from simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered between time-steps 400 and 700.
List of Tables

5.1 Specifications of the test platform.
5.2 Results for one-dimensional FDTD simulation.
5.3 Discrepancy in results between CPU and CUDA simulation of the one-dimensional FDTD method.
6.1 Set-up for two-dimensional FDTD simulation.
6.2 Results for two-dimensional FDTD simulation on CPU.
6.3 Results for two-dimensional FDTD simulation on CUDA (initial run).
6.4 Results for two-dimensional FDTD simulation on CUDA (using CUDA to update PML).
6.5 Textual profiling results for investigation into memory coalescing.
6.6 Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() for Ex memory allocation).
6.7 Snippet of results from textual profiling on uncoalesced memory access.
6.8 Snippet of results from textual profiling on coalesced memory access.
6.9 Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() and column-major indexing).
6.10 Discrepancy in results between CPU and CUDA simulation of the two-dimensional FDTD method.
7.1 Set-up for three-dimensional FDTD simulation.
7.2 Results for three-dimensional FDTD simulation on CPU.
7.3 Limitations of the block and grid configuration for CUDA for Compute Capability 1.x [6].
7.4 Results for three-dimensional FDTD simulation on CUDA (initial run).
7.5 Results from visual profiling on three-dimensional FDTD simulation.
7.6 Results for three-dimensional FDTD simulation on CUDA (with coalesced memory access).
7.7 Results from visual profiling on three-dimensional FDTD simulation (with new indexing of arrays in kernel).
7.8 Discrepancy in results between CPU and CUDA simulation of the three-dimensional FDTD method.
B.1 CUDA Textual Profiler results from testing uncoalesced memory access.
C.1 CUDA Textual Profiler results from testing coalesced memory access.
1 Introduction
1.1 Thesis Introduction
The finite-difference time-domain (FDTD) modelling technique is used to solve Maxwell’s equations in the time domain. The FDTD method was introduced by Yee in 1966 [1] and interest in the topic has increased almost exponentially over the past 30 years [2]. The FDTD method provides a relatively simple mathematical solution to Maxwell’s equations, but simulations can take days or months to complete depending on the complexity of the problem. However, the FDTD technique is also highly parallel, which allows it to leverage parallel processing architectures to achieve speed-ups. This project uses graphics processing units (GPUs), which have parallel architectures, to implement the FDTD method in order to reduce computation time.
Historically, the GPU has been used as a co-processor to the main processor of a computer, the central processing unit (CPU). The GPU is designed with a mathematically intensive, highly parallel architecture for rendering graphics. Modern GPUs, however, are becoming increasingly popular for performing general-purpose computations instead of just graphics processing. By utilizing the GPU’s massively parallel architecture with high memory bandwidth, applications running on the GPU can achieve speed-ups of orders of magnitude compared to CPU implementations. The utilization of GPUs for applications other than graphics rendering is known as general-purpose computing on graphics processing units (GPGPU). This thesis therefore focuses on combining the parallel nature of the FDTD technique with the parallel architecture of the GPU to achieve speed-ups compared to the traditional use of the CPU for computation.
1.2 Motivation
The Microwave and Optical Communications (MOC) Group at the University of Queensland requires computer simulations of very large, realistic models of the human anatomy and tissues interacting with electromagnetic energy. The FDTD technique is used for these simulations and takes a long time to solve due to the large simulation models. Therefore, the MOC group is interested in utilizing the GPU to reduce simulation time.
The FDTD technique is gaining popularity in vast areas of study such as telecommunications, optoelectronics, biomedical engineering and geophysics. Thus, the outcome of the thesis and the understanding of GPUs will be valuable towards the reduction of computation time in scientific applications. As many engineering practices require design by repetitive simulation, such as in design optimization, a more thorough optimization procedure can be achieved with a faster simulation time.
1.3 Significance
From the results of the thesis, a conclusion will be made on the feasibility of using GPUs as an alternative to the CPU. If the outcome of the thesis is a successful implementation of the FDTD method using the CUDA architecture, the thesis could motivate further research into both the FDTD method and GPU acceleration.
While the focus of this thesis is on obtaining speed-ups for the FDTD method, the results could be used as a gauge of the benefits of GPU acceleration in other applications, which can likewise benefit from leveraging a technology that already exists in our computers today.
1.4 Thesis Outline
A review of the remaining chapters of the thesis is given here.
1.4.1 Chapter 2
This chapter serves to introduce the finite-difference time-domain method.
A short introduction is given for one-, two- and three-dimensional FDTD
equations.
1.4.2 Chapter 3
The CUDA architecture is introduced in this chapter. A short summary of
the various types of memory available on a GPU are given here. The differ-
ences between a CUDA GPU and a CPU are discussed. CUDA’s potential
in reducing computation time for the FDTD method is also explored.
1.4.3 Chapter 4
A selective review of existing literature is given in this chapter, focusing on implementations of the FDTD method on GPUs and their findings.
1.4.4 Chapter 5
The implementation of the FDTD method in one dimension is discussed in this chapter. While this chapter serves more as an introduction to programming with CUDA, significant speed-ups of over 20x were achieved. Details of the test platform used throughout the thesis are also listed in this chapter.
1.4.5 Chapter 6
The implementation of the two-dimensional FDTD method is discussed in this chapter. The chapter explores the use of CUDA’s Compute Profiler as a tool for determining the efficiency of kernel code. Memory coalescing requirements are also discussed, and this chapter begins to introduce the complexity involved in programming with the CUDA framework.
The use of various absorbing boundary conditions (ABCs) as an alternative to perfectly matched layers (PMLs) is also discussed.
1.4.6 Chapter 7
In this chapter, the FDTD method in three dimensions is explored. The chapter explains the difficulties involved in implementing three-dimensional grids and blocks of threads in CUDA. The use of the CUDA Visual Profiler as an effective tool for debugging CUDA applications is discussed.
As with Chapter 6, alternative ABCs are explored in order to produce faster execution times and better throughput.
2 Finite-Difference Time-Domain (FDTD)
2.1 Introduction
The FDTD method is a numerical method introduced by Yee in 1966 [1] to solve the differential form of Maxwell’s equations in the time domain. Although the method has existed for over four decades, enhancements to the FDTD method are continuously being published [2].
The FDTD method discretizes Maxwell’s curl equations in time and space. The electric fields are generally located at the edges of the ‘Yee cell’ and the magnetic fields are located at the centre of the Yee cell. This is shown in Figure 2.1. In three dimensions, the number of cells in one time step can easily be in the order of millions: a domain of 100 × 100 × 100 cells already yields a total of one million cells. For example, in [3], a high-resolution head model of an adult male has a total of 4,642,730 Yee cells, each cell having a dimension of 1 × 1 × 1 mm.
Figure 2.1: Position of the electric and magnetic fields in Yee’s scheme.
(a) Electric element. (b) Relationship between the electric and magnetic
elements. Source: [4].
2.2 One-Dimensional FDTD Equations
The Maxwell’s curl equations in free space for one-dimension are
δExδt
= − 1
ε0
δHy
δz
�� ��2.1
δHy
δt= − 1
µ0
δExδz
�� ��2.2
Using the finite-difference method of central approximation and rearranging
[5], the equations become
Ex
∣∣∣∣n+1/2
k
= Ex
∣∣∣∣n−1/2
k
− ∆t
ε0 · ∆x
(Hy
∣∣∣∣nk+1/2
−Hy
∣∣∣∣nk−1/2
) �� ��2.3
Hy
∣∣∣∣n+1
k+1/2
= Hy
∣∣∣∣nk+1/2
− ∆t
µ0 · ∆x
(Ex
∣∣∣∣n+1/2
k+1
− Ex
∣∣∣∣n+1/2
k
) �� ��2.4
The electric and magnetic fields are calculated alternately over the entire spatial domain at each time-step n, and this process is repeated over all the time-steps until convergence is achieved.
Depending on the size of the computation domain, millions of iterations could be required to solve the differential equations. However, the benefit of using the FDTD method is that each cell only requires an exchange of data with its neighbouring cells. The equations above show that $E_x\big|_k^{n+1/2}$ is only updated from $E_x\big|_k^{n-1/2}$, $H_y\big|_{k+1/2}^{n}$ and $H_y\big|_{k-1/2}^{n}$. Similarly, $H_y\big|_{k+1/2}^{n+1}$ is updated from $H_y\big|_{k+1/2}^{n}$, $E_x\big|_{k+1}^{n+1/2}$ and $E_x\big|_{k}^{n+1/2}$. Therefore, the FDTD algorithm is highly parallel in nature and significant speed-ups can be achieved by harnessing the power of parallel processing. Further reading on parallel FDTD can be found in [4].
2.3 Two-Dimensional FDTD Equations
For the two-dimensional FDTD method in transverse-magnetic (TM) mode, the update equations are

$$H_z\big|_{i,j}^{n+1/2} = D_a\big|_{i,j} H_z\big|_{i,j}^{n-1/2} + D_b\big|_{i,j} \left[ \left( \frac{E_x\big|_{i,j+1/2}^{n} - E_x\big|_{i,j-1/2}^{n}}{\Delta y} \right) - \left( \frac{E_y\big|_{i+1/2,j}^{n} - E_y\big|_{i-1/2,j}^{n}}{\Delta x} \right) \right] \tag{2.5}$$

$$D_a\big|_{i,j} = \frac{1 - \dfrac{\sigma^{*}_{i,j}\,\Delta t}{2\mu_{i,j}}}{1 + \dfrac{\sigma^{*}_{i,j}\,\Delta t}{2\mu_{i,j}}} \tag{2.6}$$

$$D_b\big|_{i,j} = \frac{\dfrac{\Delta t}{\mu_{i,j}}}{1 + \dfrac{\sigma^{*}_{i,j}\,\Delta t}{2\mu_{i,j}}} \tag{2.7}$$

$$E_x\big|_{i,j}^{n+1} = C_a\big|_{i,j} E_x\big|_{i,j}^{n} + C_b\big|_{i,j} \left( \frac{H_z\big|_{i,j+1/2}^{n+1/2} - H_z\big|_{i,j-1/2}^{n+1/2}}{\Delta y} \right) \tag{2.8}$$

$$E_y\big|_{i,j}^{n+1} = C_a\big|_{i,j} E_y\big|_{i,j}^{n} + C_b\big|_{i,j} \left( \frac{H_z\big|_{i+1/2,j}^{n+1/2} - H_z\big|_{i-1/2,j}^{n+1/2}}{\Delta x} \right) \tag{2.9}$$

$$C_a\big|_{i,j} = \frac{1 - \dfrac{\sigma_{i,j}\,\Delta t}{2\varepsilon_{i,j}}}{1 + \dfrac{\sigma_{i,j}\,\Delta t}{2\varepsilon_{i,j}}} \tag{2.10}$$

$$C_b\big|_{i,j} = \frac{\dfrac{\Delta t}{\varepsilon_{i,j}}}{1 + \dfrac{\sigma_{i,j}\,\Delta t}{2\varepsilon_{i,j}}} \tag{2.11}$$

where $\sigma$ and $\sigma^{*}$ are the electric and magnetic conductivities respectively, and $\varepsilon$ and $\mu$ are the permittivity and permeability respectively.
2.4 Three-Dimensional FDTD Equations
For the three-dimensional FDTD method, the update equations are

$$E_x\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k} E_x\big|_{i,j,k}^{n} + C_b\big|_{i,j,k} \left[ \left( \frac{H_z\big|_{i,j+1/2,k}^{n+1/2} - H_z\big|_{i,j-1/2,k}^{n+1/2}}{\Delta y} \right) - \left( \frac{H_y\big|_{i,j,k+1/2}^{n+1/2} - H_y\big|_{i,j,k-1/2}^{n+1/2}}{\Delta z} \right) \right] \tag{2.12}$$

$$E_y\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k} E_y\big|_{i,j,k}^{n} + C_b\big|_{i,j,k} \left[ \left( \frac{H_x\big|_{i,j,k+1/2}^{n+1/2} - H_x\big|_{i,j,k-1/2}^{n+1/2}}{\Delta z} \right) - \left( \frac{H_z\big|_{i+1/2,j,k}^{n+1/2} - H_z\big|_{i-1/2,j,k}^{n+1/2}}{\Delta x} \right) \right] \tag{2.13}$$

$$E_z\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k} E_z\big|_{i,j,k}^{n} + C_b\big|_{i,j,k} \left[ \left( \frac{H_y\big|_{i+1/2,j,k}^{n+1/2} - H_y\big|_{i-1/2,j,k}^{n+1/2}}{\Delta x} \right) - \left( \frac{H_x\big|_{i,j+1/2,k}^{n+1/2} - H_x\big|_{i,j-1/2,k}^{n+1/2}}{\Delta y} \right) \right] \tag{2.14}$$

$$C_a\big|_{i,j,k} = \frac{1 - \dfrac{\sigma_{i,j,k}\,\Delta t}{2\varepsilon_{i,j,k}}}{1 + \dfrac{\sigma_{i,j,k}\,\Delta t}{2\varepsilon_{i,j,k}}} \tag{2.15}$$

$$C_b\big|_{i,j,k} = \frac{\dfrac{\Delta t}{\varepsilon_{i,j,k}}}{1 + \dfrac{\sigma_{i,j,k}\,\Delta t}{2\varepsilon_{i,j,k}}} \tag{2.16}$$

$$H_x\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k} H_x\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k} \left[ \left( \frac{E_y\big|_{i,j,k+1/2}^{n} - E_y\big|_{i,j,k-1/2}^{n}}{\Delta z} \right) - \left( \frac{E_z\big|_{i,j+1/2,k}^{n} - E_z\big|_{i,j-1/2,k}^{n}}{\Delta y} \right) \right] \tag{2.17}$$

$$H_y\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k} H_y\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k} \left[ \left( \frac{E_z\big|_{i+1/2,j,k}^{n} - E_z\big|_{i-1/2,j,k}^{n}}{\Delta x} \right) - \left( \frac{E_x\big|_{i,j,k+1/2}^{n} - E_x\big|_{i,j,k-1/2}^{n}}{\Delta z} \right) \right] \tag{2.18}$$

$$H_z\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k} H_z\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k} \left[ \left( \frac{E_x\big|_{i,j+1/2,k}^{n} - E_x\big|_{i,j-1/2,k}^{n}}{\Delta y} \right) - \left( \frac{E_y\big|_{i+1/2,j,k}^{n} - E_y\big|_{i-1/2,j,k}^{n}}{\Delta x} \right) \right] \tag{2.19}$$

$$D_a\big|_{i,j,k} = \frac{1 - \dfrac{\sigma^{*}_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}}{1 + \dfrac{\sigma^{*}_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}} \tag{2.20}$$

$$D_b\big|_{i,j,k} = \frac{\dfrac{\Delta t}{\mu_{i,j,k}}}{1 + \dfrac{\sigma^{*}_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}} \tag{2.21}$$

where $\sigma$ and $\sigma^{*}$ are the electric and magnetic conductivities respectively, and $\varepsilon$ and $\mu$ are the permittivity and permeability respectively.
3 Compute Unified Device Architecture (CUDA)
3.1 Introduction
CUDA is the hardware and software architecture introduced by NVIDIA in November 2006 [6] to provide developers with access to the parallel computational elements of NVIDIA GPUs. The CUDA architecture enables NVIDIA GPUs to execute programs written through various high-level interfaces such as C, Fortran, OpenCL and DirectCompute. The newest generation of NVIDIA GPUs (codenamed ‘Fermi’) also fully supports programming through the C++ language [7].
Because of advancements in technology, the processing power and parallelism of GPUs are continuously increasing. CUDA’s scalable programming model provides this abstraction to software developers, allowing a program to automatically scale according to the capabilities of the GPU without any change in code, unlike traditional graphics programming languages such as OpenGL [8]. This is illustrated in Figure 3.1.
Figure 3.1: A multithreaded program is partitioned into blocks of threads
that execute independently from each other, so that a GPU with more cores
will automatically execute the program in less time than a GPU with fewer
cores. Source: [6].
Figure 3.2: The GPU devotes more transistors to data processing. Source:
[6].
Because the GPU and CPU serve different purposes in a computer, their microprocessor architectures, as shown in Figure 3.2, are very different. While CPUs currently have up to six processor cores (Intel Core i7-970), a GPU has hundreds. For example, the NVIDIA Tesla 20-series has 448 CUDA cores [7].
Compared to the CPU, the GPU devotes more transistors to data pro-
cessing rather than data caching and flow control. This allows GPUs to spe-
cialize in math-intensive, highly parallel operations compared to the CPU
which serves as a multi-purpose microprocessor. Therefore, calculations of
the FDTD algorithm are potentially much faster when executed on the GPU
instead of the CPU. This is becoming increasingly true as graphics card ven-
dors such as NVIDIA and AMD are now developing more graphics card for
high performance computing (HPC) such as the NVIDIA Tesla [9].
CUDA has a single-instruction multiple-thread (SIMT) execution model in which multiple independent threads execute concurrently using a single instruction [7]. CUDA GPUs have a hierarchy of threads, blocks and grids as shown in Figure 3.3. Each thread has its own private memory. Shared memory is available per block, and global memory is accessible by all threads. This multi-threaded architecture puts the focus on data calculations rather than data caching. Thus, it can sometimes be faster to recalculate than to cache on a GPU.
A CUDA program is called a kernel and the kernel is invoked by a CPU
program. The CUDA programming model assumes that CUDA threads ex-
ecute on a physically separate device (GPU). The device is a co-processor
to the host (CPU) which runs the program. CUDA also assumes that the
host and device both have separate memory spaces: host memory and device
memory, respectively. Because host and device both have their own separate
memory spaces, there is potentially a lot of memory allocation, deallocation
and data transfer between host and device. Thus, memory management is a
key issue in GPGPU computing. Inefficient use of memory can significantly
increase the computation time and mask the speed-ups obtained by the data
calculations.
3.2 Computation Capability of Graphics Processing Units
The number of floating-point operations per second (flops) is one measure of the computational ability of a computer. It is an important measure, especially in scientific calculations, as it indicates a computer’s arithmetic capability.
While a high-performance CPU can have a double precision computation capability of 140 Gflops (Intel Nehalem architecture) [8], an NVIDIA Tesla 20-series (NVIDIA Fermi architecture) GPU has a peak single precision performance of 1.03 Tflops and a peak double precision performance of 515 Gflops [9].
Figure 3.3: Hierarchy of threads, blocks and grid in CUDA. Source: [6].

Figure 3.4: Growth in single precision computing capability of NVIDIA’s GPUs compared to Intel’s CPUs. Source: [8].

Furthermore, Figure 3.4 shows that the computational capability of GPUs is growing at a much faster pace than that of CPUs.
Although the compute capability of a GPU is impressive compared to the CPU, it has one significant disadvantage in scientific applications: not all GPUs fully conform to the IEEE standard for floating-point operations [10]. Although the floating-point arithmetic of NVIDIA graphics cards is similar to the IEEE 754-2008 standard used by many CPU vendors, it is not quite the same, especially for double precision [6].
In computers, the natural form of number representation is binary (1s and 0s). Thus, computers cannot represent all real numbers exactly. There are standards for representing floating-point numbers in computers, and the most widely used is the IEEE 754 standard. Accurate representation of floating-point numbers is important in scientific applications.
There have been many cases where errors in floating-point representation have caused catastrophes. One example is the failure of the American Patriot missile defence system to intercept an incoming Iraqi Scud missile at Dhahran, Saudi Arabia on February 25, 1991, which resulted in the deaths of 28 Americans [11]. The cause was determined to be a loss of accuracy in the conversion of an integer to a real number in the Patriot’s computer. Other examples of catastrophes resulting from floating-point representation errors can be found in [12].
Most CPU manufacturers now use the IEEE 754 floating-point standard. As developments in GPGPU computing continue, GPU vendors will inevitably conform to the IEEE 754 standard for floating-point representation as well. This is evident in the newest Fermi architecture from NVIDIA, which implements IEEE 754-2008 floating point for both single and double precision arithmetic [7].
3.3 Memory Structure
The CUDA memory hierarchy is shown in Figure 3.5.
These different types of memory differ in size, access times and restric-
tions. Detailed descriptions of the various memory types are available in the
CUDA Programming Guide [6] and the CUDA Best Practices Guide [13].
In short, global memory is the largest in size, is located off the GPU chip, and can be accessed by any thread. Because it is off-chip, its access time is the slowest among the various types of memory. Shared memory is located on-chip, which makes access to it very fast compared to global memory. However, it is limited in size and shared memory access is only at block level: a thread cannot access shared memory that is allocated outside its block. Local memory is located off-chip and thus has a long latency. It is used to store variables when there are insufficient registers available.

Figure 3.5: Hierarchy of various types of memory in CUDA. Source: [6].
Other types of memory not shown in Figure 3.5 are registers and constant memory. Registers are located on-chip and are scarce; they are not shared between threads. Constant memory is located off-chip but is cached, which makes accesses to it fast despite its off-chip location.
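As a brief illustration of how these memory spaces appear in code (a sketch, not code from the thesis), the qualifiers below place data in constant, shared and global memory. The constant would be set from the host with cudaMemcpyToSymbol(), and the kernel assumes 256 threads per block:

__constant__ float scale_factor;          /* off-chip but cached */

__global__ void example(float *g_in, float *g_out, int n)  /* g_in, g_out live in global memory */
{
    __shared__ float tile[256];           /* on-chip, one copy per block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? g_in[i] : 0.0f;  /* stage data in fast shared memory */
    __syncthreads();                               /* the whole block now sees the tile */

    if (i < n)
        g_out[i] = tile[threadIdx.x] * scale_factor;
}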
While global memory is located off-chip and has the longest latency, there are techniques available that reduce the number of GPU clock cycles required to access large amounts of memory at one time. This can be done through memory coalescing, which refers to the alignment of threads and memory. For example, if memory access is coalesced, it takes only one memory request to read 64 bytes of data. On the other hand, if it is not coalesced, it can take up to 16 memory requests depending on the GPU’s compute capability. This is further explained in Section 3.2.1 of the CUDA Best Practices Guide [13].
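As a simple illustration (a sketch under the compute capability 1.x rules, not code from the thesis), the two kernels below differ only in their access pattern. In the first, consecutive threads touch consecutive floats, so the 16 reads of a half-warp fall into one aligned 64-byte segment; in the second, a stride between threads can force the hardware to issue a separate transaction per thread:

/* Coalesced: thread k of a half-warp reads element k of an aligned segment. */
__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

/* Strided: consecutive threads read floats 'stride' elements apart,
   so one half-warp can require up to 16 separate memory requests. */
__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}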
3.4 Conclusions
GPUs have a parallel architecture capable of executing thousands of threads simultaneously. This gives the GPU an advantage over the CPU when it comes to intensive computations on large amounts of data. With the CUDA framework, developers have access to CUDA-enabled NVIDIA GPUs, allowing them to leverage this computational capability for applications other than graphics rendering. Along with this, CUDA provides a relatively cheap alternative to supercomputing.
4 Literature Review
In the NVIDIA GPU Computing Software Development Kit (SDK), there is sample code for a three-dimensional FDTD simulation. However, the FDTD method implemented is not the conventional FDTD method discussed in Chapter 2. In [14], the code was tested on an NVIDIA Tesla S1070 and a throughput of nearly 3,000 Mcells/s was achieved. While the code is not of much use to the FDTD method explored in this thesis, the throughput it achieved provides a good indication of the NVIDIA Tesla S1070’s capability.
In [15], the use of CUDA for the FDTD method is explored and a short summary of CUDA’s architecture is given. In this paper, four NVIDIA Tesla C1060s were used for testing, resulting in a throughput of almost 2,000 Mcells/s.
In [16], a two-dimensional FDTD simulation for mobile communications systems was implemented on CUDA. In this paper, the Convolutional Perfectly Matched Layer (CPML) was used as the absorbing region. The paper discusses the use of shared memory and the configuration of block sizes for optimal performance. The simulation on an NVIDIA Tesla C870 produced a throughput of 760 Mcells/s. MATLAB was also used for comparison and was slower than both the CPU and CUDA implementations.
In the article by Garland, M. et al. [17], a detailed explanation of CUDA’s architecture is given. The article also summarises a few applications that are suited to running on CUDA, such as molecular dynamics, medical imaging and fluid dynamics.
In [8], the author discusses the three-dimensional FDTD algorithm including its implementation on CUDA. Applications of the FDTD method, such as in microwave systems and biomedicine, are discussed. The author argues that “there are an infinite number of ways in which an algorithm can be partitioned for parallel execution on a GPU”. On an NVIDIA Tesla S1070, a maximum throughput of 1,680 Mcells/s was achieved, and the optimum simulation size is said to be up to 380 Mcells. By using a cluster of Tesla S1070s, a throughput of over 15,000 Mcells/s was achieved.
4.1 Summary
In summary, while there is much existing literature on the topic of accelerating the FDTD method using CUDA, few works detail the difficulty and complexity involved in developing the program. All of the literature reviewed shows that significant speed-ups were achieved by using CUDA to accelerate computation. Thus, this thesis investigates the complexity involved in programming with the CUDA framework to achieve these speed-ups.
5 1-D FDTD Results
5.1 Introduction
The development of the code for running the FDTD method using CUDA was done incrementally, starting from one dimension (1-D), followed by 2-D and finally 3-D; the latter two are covered in later chapters. For each phase, new methods and techniques were used to provide more speed-up as the algorithm became more complex and the amount of data processed increased. For consistency, the platform used for testing was not changed throughout the development. The specifications of the test platform are listed in the following section.
Each chapter presents the various methods and techniques used to achieve speed-ups. In each phase, various simulation sizes were tested; the results of the simulations are compared and explained, and execution times and speed-ups are listed. To address concerns about the accuracy of the CUDA implementation, the variation in results between CPU and GPU is also detailed.
Throughout the thesis, speed-up and throughput are used to quantify performance. They are defined as

$$\text{Speed-up} = \frac{\text{CPU Execution Time}}{\text{GPU Execution Time}} \tag{5.1}$$

$$\text{Throughput (Mcells/s)} = \frac{\text{Number of cells} \times \text{Number of time-steps}}{10^6 \times \text{Execution Time (s)}} \tag{5.2}$$

where Mcells denotes a million cells.
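As a worked example using the largest run in Table 5.2, a 10,240,000-cell simulation over 3,000 time-steps that completes in 27,616.772 ms on CUDA has a throughput of

$$\frac{10{,}240{,}000 \times 3{,}000}{10^6 \times 27.617} \approx 1{,}112 \ \text{Mcells/s}.$$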
5.2 Test Platform
The finite-difference time-domain implementation for running on CUDA was
tested on the computer ‘Emory’. The specifications of the computer are
shown in Table 5.1.
Operating System    64-bit CentOS Linux
Memory (RAM)        32 GB
CPU                 Dual Intel Xeon E5430
                    Clock speed: 2.66 GHz
                    Number of cores: 4
GPU                 Dual NVIDIA Tesla S1070
                    Clock speed: 1.30 GHz
                    Number of processors: 4
                    Number of cores per processor: 240

Table 5.1: Specifications of the test platform.
The NVIDIA Tesla S1070 has a CUDA Compute Capability of 1.3. De-
tailed specifications of the Tesla S1070 are listed in Appendix A.
As shown in Table 5.1, there are 4 processors in the Tesla S1070. However,
for the purpose of this thesis, only one of the processors is utilized.
5.3 Results
The one-dimensional FDTD algorithm that was ported to CUDA simulates a wave travelling in free space with absorbing boundary conditions. The main update equations in code form are

ex[i] = ca[i] * ex[i] - cb[i] * (hy[i] - hy[i-1]);
hy[i] = da[i] * hy[i] - db[i] * (ex[i+1] - ex[i]);

Listing 5.1: Main update equations for the 1-D FDTD method.

where ex is the electric field and hy is the magnetic field. ca, cb, da and db are coefficients which are pre-calculated before running the main update equations of Listing 5.1. These coefficients remain constant throughout the main update loop.
As this was the first attempt at getting the FDTD method to run on CUDA, the algorithm in Listing 5.1 was made to run in a CUDA kernel with very few modifications to other segments of the code. The initialization routines and pre-calculation of coefficients (ca, cb, da and db) are still done by the CPU. After all initialization, all necessary data (ex, hy, ca, cb, da and db) are transferred from the CPU to the GPU’s global memory. Then, the CUDA kernel containing the update equations of Listing 5.1 is executed.
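The thesis does not list this set-up code; a minimal sketch of the host-side initialization it describes (the _d suffix marking device copies is an assumption consistent with Listing 5.3) might look like:

/* Sketch only: allocate device copies and upload the pre-calculated data. */
float *ex_d, *hy_d, *ca_d, *cb_d, *da_d, *db_d;
size_t bytes = Ncells * sizeof(float);

cudaMalloc((void**)&ex_d, bytes);
cudaMalloc((void**)&hy_d, bytes);
cudaMalloc((void**)&ca_d, bytes);
cudaMalloc((void**)&cb_d, bytes);
cudaMalloc((void**)&da_d, bytes);
cudaMalloc((void**)&db_d, bytes);

cudaMemcpy(ex_d, ex, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(hy_d, hy, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(ca_d, ca, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(cb_d, cb, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(da_d, da, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(db_d, db, bytes, cudaMemcpyHostToDevice);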
To compare execution time, CUDA’s timer functions are utilized. Only
the time taken to run the main loop of the FDTD update equations is
recorded.
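The exact timing code is not reproduced in the thesis; one common pattern uses the CUDA event API (a sketch, with the comment standing in for the main loop):

cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);      /* mark the beginning of the main loop */
/* ... main FDTD update loop ... */
cudaEventRecord(stop, 0);       /* mark the end */
cudaEventSynchronize(stop);     /* wait until the GPU has reached 'stop' */

cudaEventElapsedTime(&elapsed_ms, start, stop);   /* elapsed time in ms */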
In order to analyse both the accuracy and the execution time of the CPU and GPU implementations, the CUDA code has to be executed twice. This is because the GPU works with the data stored in its global memory, and the data has to be transferred to the CPU before it can be analysed. Thus, after each time-step, the data in the GPU’s global memory is transferred to the CPU for processing.
However, with these memory transfers, the execution time for the main loop cannot be accurately obtained. Thus, the CUDA code is executed again, this time without the memory transfers. This provides a more accurate and fair comparison against the CPU’s execution time.
Listing 5.2 and Listing 5.3 show the C code that utilizes the CPU and
CUDA respectively. Listing 5.4 shows the C code for the CUDA kernel.
for (int n = 0; n < Nmax; n++) {
    int m;

    pulse = exp(-0.5 * pow((no - n)/spread, 2));

    for (m = 1; m < Ncells; m++)
        ex[m] = ca[m] * ex[m] - cb[m] * (hy[m] - hy[m-1]);

    ex[Location] = ex[Location] + pulse;

    for (m = 0; m < Ncells - 1; m++)
        hy[m] = da[m] * hy[m] - db[m] * (ex[m+1] - ex[m]);
}

Listing 5.2: Main update loop for the 1-D FDTD method using the CPU.
for (int n = 0; n < Nmax; n++) {
    calcOneTimeStep_cuda<<<numberOfBlocks, threadsPerBlock>>>(ex, hy, ca_d, cb_d, da_d, db_d, Ncells, Ncells/2, n);
}

Listing 5.3: Main update loop for the 1-D FDTD method using CUDA.
__global__ void calcOneTimeStep_cuda(float *ex, float *hy,
        float *ca, float *cb, float *da, float *db,
        int Ncells, int Location, int n)   /* n is the current time-step passed from the host loop */
{
    float pulse;
    float no = 40;
    float spread = 12;

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    pulse = exp(-0.5 * pow((no - n)/spread, 2));

    if (i > 0 && i < Ncells)
        ex[i] = ca[i] * ex[i] - cb[i] * (hy[i] - hy[i-1]);

    if (i == Location)
        ex[i] = ex[i] + pulse;

    __syncthreads();

    if (i < Ncells - 1)
        hy[i] = da[i] * hy[i] - db[i] * (ex[i+1] - ex[i]);
}

Listing 5.4: CUDA kernel for the 1-D FDTD method.
Table 5.2 shows the results obtained from running the code on Emory.
The number of time steps is 3,000 and the execution time is an average of
five runs.
Simulation Size     Execution Time (ms)            Speed-up
(number of cells)   CPU            CUDA
100                 5.368          178.111         0.0301
512                 25.340         188.3199        0.1346
1,000               49.781         187.775         0.2651
1,024               51.324         183.427         0.2798
5,120               263.243        193.365         1.3614
10,240              520.865        203.430         2.5604
51,200              2,674.100      315.700         8.4704
102,400             5,125.420      456.463         11.2286
512,000             28,962.292     1,739.171       16.6529
1,024,000           57,911.306     2,938.229       19.7096
5,120,000           290,531.271    13,897.174      20.9058
10,240,000          558,871.438    27,616.772      20.2367

Table 5.2: Results for one-dimensional FDTD simulation.
There are a few interesting observations that can be made from the re-
sults. Firstly, as expected, higher speed-ups are obtained when the simulation
size is increased. At the simulation size of approximately ten million cells,
the CPU takes more than nine minutes to run while the GPU only takes
slightly more than 27 seconds.
This speed-up becomes more appreciable as the simulation size is increased and the FDTD is performed in three dimensions instead of one. A simulation that would take hours to run on a CPU could potentially take only minutes to run on a GPU.
[Figure: speed-up vs. simulation size (Mcells).]
Figure 5.1: Speed-up for one-dimensional FDTD simulation.
[Figure: throughput (Mcells/s) vs. simulation size (Mcells).]
Figure 5.2: Throughput for one-dimensional FDTD simulation running on CUDA.
Secondly, the results show that for small simulation sizes such as 100 cells, the CPU runs much faster than the GPU. This is a result of the latency of memory transfers between host (CPU) and device (GPU), as discussed in Section 3.1. Because host and device have separate memory spaces, data has to be transferred between them in order for the GPU to perform the calculations. When the simulation size is small, the speed of the GPU in arithmetic operations is obscured by the time taken to transfer the data.
Also, the runtime of the GPU is longer when the simulation size is 1,000 cells instead of 1,024. The reason for this lies in the thread and block organization of the GPU. For this simulation, the number of threads per block was held constant at 512. The number of blocks required is then obtained by dividing the simulation size by the number of threads per block. Simulation sizes of 1,000 and 1,024 cells both require two blocks, but the former does not fully use all the threads in one of the blocks.
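In code, this is the usual rounding-up grid-sizing idiom (a sketch; the thesis does not show the calculation explicitly):

int threadsPerBlock = 512;
/* Round up so that a partially filled block is still launched:
   1,000 cells -> 2 blocks (24 idle threads); 1,024 cells -> 2 full blocks. */
int numberOfBlocks = (Ncells + threadsPerBlock - 1) / threadsPerBlock;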
CUDA uses a SIMT model, as discussed in Section 3.1: multiple threads run the same instruction simultaneously. With a simulation size of 1,000 cells, not all the threads in the blocks are used; specifically, 24 threads perform no calculations. This breaks homogeneity and causes the CUDA GPU, with its SIMT model, to perform more slowly. One method of preventing this would be to program the GPU to calculate on all threads but ignore the results from the 24 unused threads.
It is also interesting to note that from Figure 5.2, the throughput appears
to saturate at around 1,100 Mcells/s.
5.4 Discrepancy In Results
The results from CUDA and the CPU were compared by taking the difference between the CPU and CUDA electric fields. The percentage change was also calculated:

$$\text{Difference} = \left|\, E_{x,\text{CPU}}\big|_k^n - E_{x,\text{CUDA}}\big|_k^n \,\right| \tag{5.3}$$

$$\text{Percentage change} = \left| \frac{E_{x,\text{CPU}}\big|_k^n - E_{x,\text{CUDA}}\big|_k^n}{E_{x,\text{CPU}}\big|_k^n} \right| \times 100\% \tag{5.4}$$

where n is the time-step and k is the spatial location. Instead of analysing the differences at every time-step and at every cell, only the largest difference and the largest percentage change over all the time-steps in the whole simulation space are recorded. This is sufficient to determine whether CUDA is performing correctly. The results are shown in Table 5.3.
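A compact way of tracking these maxima during the comparison run (a sketch, not the thesis code; the array names are illustrative):

#include <math.h>

float max_diff = 0.0f, max_pct = 0.0f;

/* After each time-step, compare the CPU and CUDA fields cell by cell. */
for (int k = 0; k < Ncells; k++) {
    float diff = fabsf(ex_cpu[k] - ex_cuda[k]);
    if (diff > max_diff)
        max_diff = diff;
    if (ex_cpu[k] != 0.0f) {                       /* avoid division by zero */
        float pct = (diff / fabsf(ex_cpu[k])) * 100.0f;
        if (pct > max_pct)
            max_pct = pct;
    }
}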
To illustrate the significance of the discrepancy, MATLAB was used to plot the results generated from both the CPU and CUDA. Only the results for the simulation size of 100 cells are plotted; it is redundant to plot the other simulation sizes because all parameters, including the type of wave used, are held constant across all simulation sizes. The only change is in the spatial size.
From the plots, it can be concluded that there are no appreciable differences in the results. The differences are small enough to ignore and insignificant considering the speed-ups obtained. Although Table 5.3 shows a discrepancy of almost 50%, the magnitude of this difference is less than $9 \times 10^{-7}$. The plots confirm that the differences are negligible.
Nevertheless, these comparisons are crucial and were made throughout the development. There are many parts of the program that could easily go wrong, and these checks are necessary in order to ensure that the changes made to the program do not cause CUDA to produce incorrect results.
Simulation Size   Difference            Percentage change
100               8.940697 × 10⁻⁷       46.901276%
512               1.184060 × 10⁻⁶       20.490625%
1,000             5.365000 × 10⁻⁶       37.937469%
1,024             5.598064 × 10⁻⁶       18.774035%
5,120             4.082790 × 10⁻⁵       2.390608%
10,240            4.082790 × 10⁻⁵       2.390608%
51,200            4.082790 × 10⁻⁵       2.390608%
102,400           4.082790 × 10⁻⁵       2.390608%
512,000           4.082790 × 10⁻⁵       2.390608%
1,024,000         4.082790 × 10⁻⁵       2.390608%
5,120,000         4.082790 × 10⁻⁵       2.390608%
10,240,000        4.082790 × 10⁻⁵       2.390608%

Table 5.3: Discrepancy in results between CPU and CUDA simulation of the one-dimensional FDTD method.
[Figure: probe magnitude vs. time-step; CPU and GPU (CUDA) traces overlaid.]
Figure 5.3: Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30.
[Figure: probe magnitude vs. time-step; CPU and GPU (CUDA) traces overlaid.]
Figure 5.4: Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30. The plot is centered between time-steps 50 and 90.
5.5 Conclusions
Porting the one-dimensional FDTD method to CUDA has demonstrated that there is a lot of potential in GPU acceleration of the FDTD method. Speed-ups of over 20x and throughputs of over 1,100 Mcells/s (compared to a CPU throughput of only 54 Mcells/s) are convincing.
The results have also shown that CUDA is best suited to repetitive processing of large amounts of data. For a small data set, CUDA does not perform well due to memory latency issues.
The concerns about discrepancies between CPU and GPU results discussed in Section 3.2 were addressed, and the simulations performed show no significant difference.
Although the results are convincing, the one-dimensional FDTD method is relatively simple compared to its two- and three-dimensional counterparts. The difficulty of extracting speed-ups from CUDA increases with the number of dimensions, and various optimizations have to be performed. This is discussed in the following chapters.
6 2-D FDTD Results
6.1 Introduction
Just as in the one-dimensional FDTD method, a wave emanating through free space from a point source is simulated. The main field update equations for the two-dimensional FDTD method in code form are shown in Listing 6.1.

ex[i][j] = caex[i][j] * ex[i][j] + cbex[i][j] * (hz[i][j] - hz[i][j-1]);
ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * (hz[i-1][j] - hz[i][j]);
hz[i][j] = dahz[i][j] * hz[i][j] + dbhz[i][j] * (ex[i][j+1] - ex[i][j] + ey[i][j] - ey[i+1][j]);

Listing 6.1: Main update equations for the 2-D FDTD method.
Perfectly matched layers (PMLs) were used as the artificial absorbing layer to prevent the travelling waves from reflecting at the boundaries of the simulation space. This causes the main update loop to be more complex than the 1-D loop, because the equations for the PML region are different from the FDTD update equations. Apart from that, the use of PMLs introduces four more regions in addition to the main field domain: the PML regions on the top, bottom, left and right of the central FDTD region.
The flowchart in Figure 6.1 illustrates the main update loop for the two-dimensional FDTD method with PMLs as the absorbing region.
As the PML update equations are long, they are not listed here. However, it is important to note that the PML update equations and the main field update equations are independent of each other. This allows the PML and field equations to be updated concurrently.
Obtaining a noticeable speed-up for the two-dimensional FDTD method proved to be more challenging than for the one-dimensional FDTD method. As was discovered, the main factors causing CUDA to run slowly were related to memory. The problems faced during development and the methods used to improve the speed-up are discussed in the following sections.
6.2 Test Parameters
The program execution set-up used for testing is shown in Table 6.1.
As in the one-dimensional FDTD simulation, there needs to be a way of
determining whether CUDA is performing correctly and producing accurate
results. To do this, the magnetic field, Hz, is monitored at a particular
location. This probe location could be anywhere in the FDTD region. For
consistency, the cell at location (25, 25) is used. The magnetic field calculated
by the CPU and by CUDA at this location is recorded throughout all the
time-steps.
[Flowchart: initialize model → update electric fields ex and ey in the main grid → update ex in all PML regions → update ey in all PML regions → update magnetic field hz in the main grid → update hzx in all PML regions → update hzy in all PML regions → loop until the time-steps have ended.]
Figure 6.1: Flowchart for two-dimensional FDTD simulation.
Number of time-steps           1000
x-location of wave source      75
y-location of wave source      75
x-location of probe            25
y-location of probe            25
CUDA x-dimension of a block    16
CUDA y-dimension of a block    16
Simulation sizes               128 × 128, 256 × 256, 512 × 512, 1024 × 1024, 2048 × 2048
Thickness of PML               8 cells

Table 6.1: Set-up for two-dimensional FDTD simulation.
The original CPU execution times are shown in Table 6.2. Throughout this chapter, the speed-ups calculated are based on the CPU execution times in this table.

Simulation Size   CPU Execution Time (ms)
128 × 128         5,177.621
256 × 256         18,342.557
512 × 512         75,437.734
1024 × 1024       288,112.225
2048 × 2048       1,129,977.275

Table 6.2: Results for two-dimensional FDTD simulation on CPU.
6.3 Initial Run
Similar to the development of the 1-D code, the 2-D code was written to execute the main update equations using CUDA. Specifically, the Ex, Ey and Hz update equations are executed using CUDA; the field updates in the PML regions were still executed by the CPU. In order to do this, the Ex and Ey fields had to be transferred from the host to the device before executing a CUDA kernel and then transferred back from the device to the host after the kernel finished executing. As expected from the discussion in Section 3.3, these memory operations were costly and resulted in a slower execution compared to the CPU. The results are shown in Table 6.3.
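The structure of this initial run is sketched below (illustrative only; the kernel name and sizes are assumptions). The point is that both cudaMemcpy() calls sit inside the time-step loop, so their cost is paid at every iteration:

for (int n = 0; n < Nmax; n++) {
    /* Upload fields last touched by the CPU (the PML regions). */
    cudaMemcpy(ex_d, ex, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(ey_d, ey, bytes, cudaMemcpyHostToDevice);

    updateMainGrid<<<grid, block>>>(ex_d, ey_d, hz_d /* , ... */);

    /* Download the results so the CPU can update the PML regions. */
    cudaMemcpy(ex, ex_d, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(ey, ey_d, bytes, cudaMemcpyDeviceToHost);

    /* ... CPU updates the four PML regions ... */
}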
The results show that as the simulation size increases, the speed-up decreases. This is because at larger simulation sizes, the amount of memory transferred per time-step is larger. To reduce the memory transactions, CUDA is used to update the PML regions as well.
Simulation Size   CUDA Execution Time (ms)   Speed-up
128 × 128         5,327.528                  0.97186181
256 × 256         19,307.889                 0.950003219
512 × 512         108,445.182                0.695630109
1024 × 1024       392,133.552                0.734729848
2048 × 2048       1,550,150.542              0.728946799

Table 6.3: Results for two-dimensional FDTD simulation on CUDA (initial run).
6.4 Updating PML Using CUDA
By using CUDA to update the PML regions, the need for any data transfers between the host and the device in the main update loop is eliminated. However, this prevents probing of the magnetic field to determine the accuracy of CUDA’s execution as discussed in Section 6.2. This is solved in a similar fashion to the 1-D FDTD simulation: the program is executed twice, first without probing and then with probing. The first execution is timed so that an accurate speed-up can be calculated. In the second execution, the magnetic field is saved so that a comparison can be made between CUDA and the CPU.
Comparing the results of Table 6.3 and Table 6.4, there is a noticeable improvement in the execution times when CUDA is used to calculate the PML. However, CUDA’s performance is still very poor, which is certainly not what would be expected given the results of the one-dimensional FDTD simulation. The reason behind this is not immediately apparent, but further investigation into the CUDA execution model reveals it; this is discussed in Section 6.6.
Simulation Size   CUDA Execution Time (ms)   Speed-up
128 × 128         2,388.178                  2.168021637
256 × 256         7,920.042                  2.315967187
512 × 512         67,085.505                 1.124501243
1024 × 1024       286,012.760                1.007340458
2048 × 2048       1,157,616.333              0.976124163

Table 6.4: Results for two-dimensional FDTD simulation on CUDA (using CUDA to update PML).
6.5 Compute Profiler
In order to understand what happens in the GPU throughout the execution of the program and to determine where the bottlenecks occur, the NVIDIA Compute Visual Profiler tool is used. This tool is bundled with the NVIDIA CUDA Toolkit. The Compute Visual Profiler is “used to measure performance and find potential opportunities for optimization in order to achieve maximum performance from NVIDIA GPUs” [18].
Since the testing of CUDA on Emory is performed through a command-line interface, a textual method of profiling the CUDA program is used instead of the graphical user interface (GUI) method shown in Figure 6.2. The article in [19] provides an excellent tutorial on textual profiling. For the purpose of profiling the FDTD program, focus is given to the following parameters:

gld_incoherent   Number of non-coalesced global memory loads
gld_coherent     Number of coalesced global memory loads
gst_incoherent   Number of non-coalesced global memory stores
gst_coherent     Number of coalesced global memory stores
Figure 6.2: Screen-shot of the Visual Profiler GUI from CUDA Toolkit 3.2 running on Windows 7. The data in the screen-shot is imported from the results of the memory test in Listing 6.4. These results are also available in Appendix B.
The results of the profiling is expected to show a high number of non-
coalesced memory loads and non-coalesced memory stores as there has been
no optimizations to the utilization of CUDA’s memory yet. However, the
results of the profiling showed that all global memory loads and stores were
coalesced. The source code for the test and the profiling results are shown
in Appendix B.
The reason behind this could be the CUDA compute capability of Emory's
GPU. As shown in Appendix A, the NVIDIA Tesla S1070's compute capa-
bility is 1.3, and the memory coalescing requirements for compute capability
1.3 are relaxed. This is explained in Section 3.2.1 of the CUDA C Best
Practices Guide 3.1 [13]. Although the profiling shows that all memory
accesses are coalesced, it does not mean memory accesses are optimised.
Improvements can be made, and a simple program based on Section 3.2.1.3
of the CUDA Best Practices Guide is tested on Emory. Table 6.5 summarises
the textual profiling results for the program.
The results in Table 6.5 show 0 incoherent memory loads (gld_incoherent)
and 0 incoherent memory stores (gst_incoherent). This is inconsistent with
what is expected from the program: it should show uncoalesced memory
accesses at offsets other than 0 and 16.
Therefore, the example above clearly shows that on Emory’s GPU, textual
profiling does not provide sufficient information on whether memory accesses
are optimized.
Offset  gld_coherent  gld_incoherent  gst_coherent  gst_incoherent
0 128 0 512 0
1 192 0 512 0
2 192 0 512 0
3 192 0 512 0
4 192 0 512 0
5 192 0 512 0
6 192 0 512 0
7 192 0 512 0
8 192 0 512 0
9 128 0 512 0
10 192 0 512 0
11 192 0 512 0
12 192 0 512 0
13 192 0 512 0
14 192 0 512 0
15 192 0 512 0
16 128 0 512 0
Table 6.5: Textual profiling results for investigation into memory coalescing.
6.6 CUDA Memory Restructuring
As explained in Section 3.3, there are different types of memory in CUDA and
they all have different advantages and disadvantages. While global memory is
the largest, it is not cached and the memory is located off-chip. This causes
access to global memory to be slow. Therefore, to improve the results of
the simulation, CUDA’s memory structure is investigated in order to obtain
better speed-ups. It is worth noting that in the CUDA C Best Practices
Guide [13], memory optimizations are recommended as high-priority.
In Section 6.4, the number of memory transfers between host and device
was reduced by using CUDA to perform the calculations in the PML
regions. This eliminated the need for any data transfers within the main
update loop. However, the data was not stored in a way that assists coalesced
memory accesses. Although the simulation is of 2-D space, memory was
allocated only as flat 1-D arrays in CUDA. An example of this is shown in
Listing 6.2.
cudaMalloc(&ex_d, ie*jb*sizeof(float));
cudaMemcpy(ex_d, ex, ie*jb*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(ex, ex_d, ie*jb*sizeof(float), cudaMemcpyDeviceToHost);
Listing 6.2: 1-D array in CUDA.
To improve on this, the cudaMallocPitch() function is used for allo-
cating memory instead of cudaMalloc(). This function is recommended by
the NVIDIA CUDA Programming Guide 3.1.1 for allocating 2D memory.
cudaMallocPitch() ensures that memory allocation is padded properly to
meet alignment requirements for coalesced memory access. Further expla-
nation on this can be found in Section 5.3.2.1.2 of the Programming Guide.
Listing 6.3 shows how cudaMallocPitch() is used in place of the code in
Listing 6.2.
cudaMallocPitch((void**)&ex_d, &ex_pitch, jb * sizeof(float), ie);
cudaMemcpy2D(ex_d, ex_pitch, ex, jb * sizeof(float), jb * sizeof(float),
             ie, cudaMemcpyHostToDevice);
cudaMemcpy2D(ex, jb * sizeof(float), ex_d, ex_pitch, jb * sizeof(float),
             ie, cudaMemcpyDeviceToHost);
Listing 6.3: 2-D array in CUDA allocated using cudaMallocPitch().
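The pitch returned by cudaMallocPitch() must also be used when indexing the array inside a kernel, since each row may be padded. The following is a minimal sketch; the kernel and its parameters are illustrative, not the thesis code:

// Illustrative sketch: scaling every element of a pitched 2-D array.
// ex_d was allocated with cudaMallocPitch(); pitch is given in bytes.
__global__ void scale_pitched(float *ex_d, size_t pitch, int ie, int jb)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous dimension
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < ie && j < jb) {
        // Each row starts pitch bytes after the previous one, so the row
        // pointer is computed in bytes before casting back to float*.
        float *row = (float*)((char*)ex_d + i * pitch);
        row[j] *= 2.0f;
    }
}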
To test this, the allocation of the electric field Ex in CUDA is changed to
use cudaMallocPitch() instead of cudaMalloc(). The result of this change is
shown in Table 6.6. However, the results are not convincing; there is little
difference compared to the previous result from using cudaMalloc().
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 1,380.316 3.751039236
256 × 256 5,523.732 3.320682159
512 × 512 49,619.559 1.520322553
1024 × 1024 211,769.401 1.360499787
2048 × 2048 879,721.688 1.284471318
Table 6.6: Results for two-dimensional FDTD simulation on CUDA (using
cudaMallocPitch() for Ex memory allocation).
To investigate the cause of this, a simple program was developed to deter-
mine whether there is any advantage in using cudaMallocPitch() for 2-D
arrays. The algorithm for the program is shown in Figure 6.3. The kernel
invocation and kernel definition are shown in Listing 6.4. A full listing of
the source code is attached in Appendix B.
1  // Kernel definition
2  __global__ void copy(float *odata, float *idata, int pitch,
                        int size_x, int size_y) {
3      unsigned int yid = blockIdx.x * blockDim.x + threadIdx.x;
4      unsigned int xid = blockIdx.y * blockDim.y + threadIdx.y;
5
6      if ( xid < size_y && yid < size_x ) {
7          int index = ( yid * pitch / sizeof(float) ) + xid;
8          odata[index] = idata[index] * 2.0;
9      }
10 }
11
12 int main() {
13     ...
14     dim3 grid(ceil((float)size_x/16), ceil((float)size_y/16), 1);
15     dim3 threads(16, 16, 1);
16
17     // Kernel invocation
18     copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
19 }
Listing 6.4: Simple program to determine advantages of using
cudaMallocPitch().
Figure 6.3: Simple program to determine advantages of cudaMallocPitch().
The program runs initialization; creates a 2-D array in the CPU and fills it
with random data; creates a 2-D array in CUDA using cudaMallocPitch();
copies the array from the CPU to CUDA; multiplies each element of the
array by 2 using CUDA; copies the array from CUDA back to the CPU; and
finally checks that CUDA has multiplied all elements correctly.
A snippet of the textual profiling results is shown in Table 6.7. For
the full results, refer to Appendix B.
Again, the results from executing this program were unconvincing. Al-
though there was no indication of uncoalesced memory reads or writes, the
execution time was still quite slow and did not improve when
cudaMallocPitch() was used.
method          Z4copyPfS_iii
gputime         7579.68
gld_coherent    104960
gld_incoherent  0
gst_coherent    209920
gst_incoherent  0
Table 6.7: Snippet of results from textual profiling on uncoalesced memory
access.
Thus, this program was profiled on another computer with Compute
Capability 1.0, and there the CUDA Compute Profiler showed that the
memory accesses were indeed uncoalesced.
This was not expected, and further testing showed that the kernel was
effectively indexing the array in column-major order: consecutive threads
accessed elements a whole row apart in memory. To fix the problem, the
code from Listing 6.4 is replaced with the code in Listing 6.5. The only
changes made to the source code are in Lines 3, 4 and 14. The full source
code is available in Appendix C.
A snippet of the test results is shown in Table 6.8. For the full results,
refer to Appendix C.
This change produced a significantly faster result: 46 times faster in this
test. It is also interesting to note that the number of global memory loads
recorded in the profiling dropped by a factor of 16 (from 104,960 in Table 6.7
to 6,560 in Table 6.8). This indicates that in the uncoalesced memory test,
although the profiler did not show incoherent memory accesses, 16 times
more memory was accessed than a coalesced kernel requires. The factor of
16 is the size of a half-warp of threads, which is consistent with the
explanation in Section 3.2.1 of the CUDA Best Practices Guide [13].
1  // Kernel definition
2  __global__ void copy(float *odata, float *idata, int pitch,
                        int size_x, int size_y) {
3      unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
4      unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;
5
6      if ( xid < size_y && yid < size_x ) {
7          int index = ( yid * pitch / sizeof(float) ) + xid;
8          odata[index] = idata[index] * 2.0;
9      }
10 }
11
12 int main() {
13     ...
14     dim3 grid(ceil((float)size_y/16), ceil((float)size_x/16), 1);
15     dim3 threads(16, 16, 1);
16
17     // Kernel invocation
18     copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
19 }
Listing 6.5: The program of Listing 6.4 modified so that consecutive threads
access consecutive memory locations.
method          Z4copyPfS_iii
gputime         161.792
gld_coherent    6560
gld_incoherent  0
gst_coherent    26240
gst_incoherent  0
Table 6.8: Snippet of results from textual profiling on coalesced memory
access.
This change to the memory access pattern is applied to the 2-D FDTD
code. The results are shown in Table 6.9.
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 855.447 6.05253435
256 × 256 1,273.917 14.39854914
512 × 512 2,514.767 29.99790577
1024 × 1024 6,739.438 42.75018872
2048 × 2048 22,020.758 51.31418658
Table 6.9: Results for two-dimensional FDTD simulation on CUDA (using
cudaMallocPitch() and column-major indexing).
A plot of the throughput for CUDA is shown in Figure 6.4. The through-
put for CUDA is over 190 Mcells/s. This is significantly faster than the
CPU’s throughput of less than 4 Mcells/s.
Figure 6.4: Throughput for two-dimensional FDTD simulation running on
CUDA, plotted in Mcells/s against simulation size (number of cells).
6.7 Discrepancy In Results
The same tests for discrepancy between the CPU and GPU as in the one-
dimensional FDTD simulation, detailed in Section 5.4, are performed here.
The results are shown in Table 6.10 below.
Simulation Size  Difference          Percentage change
128 × 128        4.507601 × 10^-7    0.299014%
256 × 256        3.613532 × 10^-7    0.304905%
512 × 512        3.613532 × 10^-7    0.304905%
1024 × 1024      3.613532 × 10^-7    0.304905%
2048 × 2048      3.613532 × 10^-7    0.304905%
Table 6.10: Discrepancy in results between CPU and CUDA simulation of
the two-dimensional FDTD method.
The conversion to utilize CUDA for computation has not caused any
significant change in the FDTD results. The discrepancy is only around
0.3% and Figure 6.5 shows no noticeable difference in the plot of CPU and
CUDA results.
6.8 Alternative Absorbing Boundaries
During the development of the code, it became obvious that the PML used
to absorb the waves at the boundaries of the simulation space was complex.
The PMLs did not appear to be good candidates for running on CUDA and
thus, other absorbing boundary conditions (ABCs) were explored. Mur's
first order ABC, Mur's second order ABC and Liao's ABC were tested as
replacements for the PMLs.
Figure 6.6 shows that both versions of Mur’s ABC have better speed-ups
Figure 6.5: Plot of 2-D FDTD simulation results (magnitude against time-
step, CPU versus GPU) to compare accuracy between CPU and GPU. The
plots are generated from the results for simulation size 128 × 128. The
location of the probe is at cell (25, 25) as specified in Table 6.1.
Figure 6.6: Comparison of speed-up against simulation size (number of cells)
for various ABCs (PML, 1st order Mur, 2nd order Mur, Liao) in the two-
dimensional FDTD simulation running on CUDA.
Figure 6.7: Comparison of throughput (Mcells/s) against simulation size
(number of cells) for various ABCs (PML, 1st order Mur, 2nd order Mur,
Liao) in the two-dimensional FDTD simulation running on CUDA.
than the PML, while Liao's ABC has the lowest speed-up. However, this
does not mean that Liao's ABC is the slowest in absolute terms: Liao's ABC
still has a much higher throughput than the PML, as shown in Figure 6.7.
The reason Liao's ABC performs slower than Mur's ABCs is the use of
doubles instead of floats for the field variables; using floats causes Liao's
ABC to become unstable.
The results confirm that the PML's complexity significantly reduces the
performance of the FDTD simulations. The PML is widely used because its
accuracy is better than that of the other ABCs. However, the performance
benefit of using other ABCs on CUDA makes the PML less attractive. At
2048 × 2048 cells, the 2-D FDTD simulation using PMLs takes 22 seconds
to complete, while using Mur's second order ABC reduces that time to less
than four seconds. Furthermore, for Mur's second order ABC and Liao's
ABC, no significant discrepancy was noticed when compared to the PML.
6.9 Conclusions
In the one-dimensional FDTD method, speed-ups of over 20x and a through-
put of over 1,100 Mcells/s for CUDA were recorded. In the two-dimensional
FDTD method, similarly convincing results were achieved, with over 60x
speed-up and a throughput of over 1,400 Mcells/s. However, achieving those
results comes with increasing complexity and difficulty.
While a simple port to CUDA with almost no optimizations yielded sig-
nificant speed-ups in 1-D, the same cannot be said for 2-D. It became
obvious that programming in CUDA's framework requires many optimiza-
tions for optimal performance. It also became clear that the question of
whether it is worth taking the time to obtain these speed-ups has to be
answered.
Alternative ABCs were explored in this chapter because the PML is not
well-suited to run on CUDA. The throughput was increased by more than
five times when Mur’s second order ABC was used. Also, Mur’s second order
and Liao’s ABC did not show significant discrepancies compared to the PML
implementation.
7 3-D FDTD Results
7.1 Introduction
The three-dimensional FDTD method has six main equations as listed in
Section 2.4. The equations are converted into computer code as shown in
Listing 7.1.
ex[i][j][k] = ca[id] * ex[i][j][k] + cby[id] * (hz[i][j][k] - hz[i][j-1][k])
            - cbz[id] * (hy[i][j][k] - hy[i][j][k-1]);
ey[i][j][k] = ca[id] * ey[i][j][k] + cbz[id] * (hx[i][j][k] - hx[i][j][k-1])
            - cbx[id] * (hz[i][j][k] - hz[i-1][j][k]);
ez[i][j][k] = ca[id] * ez[i][j][k] + cbx[id] * (hy[i][j][k] - hy[i-1][j][k])
            - cby[id] * (hx[i][j][k] - hx[i][j-1][k]);
hx[i][j][k] = da[id] * hx[i][j][k] + dbz[id] * (ey[i][j][k+1] - ey[i][j][k])
            - dby[id] * (ez[i][j+1][k] - ez[i][j][k]);
hy[i][j][k] = da[id] * hy[i][j][k] + dbx[id] * (ez[i+1][j][k] - ez[i][j][k])
            - dbz[id] * (ex[i][j][k+1] - ex[i][j][k]);
hz[i][j][k] = da[id] * hz[i][j][k] + dby[id] * (ex[i][j+1][k] - ex[i][j][k])
            - dbx[id] * (ey[i+1][j][k] - ey[i][j][k]);
Listing 7.1: Main update equations for the 3-D FDTD method.
The flowchart in Figure 7.1 illustrates the main update loop of the FDTD
method for three-dimensional space.
For development and testing purposes, a Gaussian wave propagating from
a dipole antenna into free-space is simulated. After stable and fast speed-
ups were obtained on the CUDA GPU, a head model was then simulated.
The head model was chosen because it was the motivation of the project.
Simulations of the head model were taking a long time to complete using
CPU and thus, it is only appropriate that the findings of this project be used
to improve the execution time on the head model.
While the two-dimensional FDTD method was more challenging to work
with on CUDA than the one-dimensional FDTD method, the three-dimen-
sional FDTD method proves to be significantly more difficult. As will be
discussed in this chapter, the CUDA architecture does not support running
a large number of threads in three dimensions, and thus alternative methods
had to be used to work around this.
As was found in the two-dimensional FDTD simulations, the PML ab-
sorbing regions are not well suited to run on CUDA. In fact, it was found
that updating the PMLs was taking longer than updating the main grid.
Thus, alternative absorbing boundary conditions are explored and the re-
sults are presented at the end of this chapter.
7.2 Test Parameters
Table 7.1 shows the configuration used for the three-dimensional FDTD
simulations during development.
The original CPU execution times are shown in Table 7.2; these execution
times are used as the baseline for the calculation of speed-ups.
Figure 7.1: Flowchart for three-dimensional FDTD simulation. After ini-
tializing the model, each time-step updates the electric fields ex, ey and ez
in the main grid; updates ex, ey and ez in all PML regions; updates the
magnetic fields hx, hy and hz in the main grid; and updates hx, hy and hz
in all PML regions. The loop repeats until the final time-step, after which
the simulation ends.
Number of time steps 1,000
Center of dipole antenna (60, 60, 60)
Location of probe (60, 60, 60)
CUDA x-dimension of a block 16
CUDA y-dimension of a block 16
Simulation Sizes 128 × 128 × 128
160 × 160 × 160
192 × 192 × 192
224 × 224 × 224
256 × 256 × 256
Thickness of PML 4 cells
Table 7.1: Set-up for three-dimensional FDTD simulation.
Simulation Size CPU Execution Time (ms)
128 × 128 × 128 515,913.3910
160 × 160 × 160 947,819.1530
192 × 192 × 192 1,588,263.3060
224 × 224 × 224 2,452,357.4220
256 × 256 × 256 3,609,773.9260
Table 7.2: Results for three-dimensional FDTD simulation on CPU.
7.3 CUDA Block & Grid Configurations
In the one-dimensional FDTD method, the blocks and grids of the CUDA
kernel were configured in one dimension. Similarly, in the two-dimensional
FDTD method, the CUDA kernel was configured for two-dimensional blocks
and grids, because this simplifies programming the CUDA kernel. However,
this could not be extended to the three-dimensional FDTD method. At this
point, it is worth noting the limitations of CUDA's block-grid configuration,
as shown in Table 7.3.
Maximum x- or y-dimension of a grid of thread blocks 65,535
Maximum z-dimension of a grid of thread blocks 1
Maximum x- or y-dimension of a block 512
Maximum z-dimension of a block 64
Table 7.3: Limitations of the block and grid configuration for CUDA for
Compute Capability 1.x [6].
While CUDA does support three-dimensional blocks, the number of threads
in the z-dimension of a block is limited to only 64, compared to 512 threads
for the x- and y-dimensions. More importantly, CUDA does not support
three-dimensional grids.
To circumvent this limitation, a two-dimensional configuration of blocks
and threads is used. Then, within the CUDA kernel, a for-loop is used to
cycle through the z-dimension. An example of this is shown in Listing 7.2.
// Kernel definition
__global__ void kernel(float *odata, int size_x, int size_y, int size_z) {
    unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_x && yid < size_y ) {
        for ( int zid = 0; zid < size_z; zid++ ) {
            ...
            odata[index] = ...;
        }
    }
}

int main() {
    ...
    dim3 grid(ceil((float)size_x/16), ceil((float)size_y/16), 1);
    dim3 threads(16, 16, 1);

    // Kernel invocation
    kernel<<<grid, threads>>>(d_odata, size_x, size_y, size_z);
}
Listing 7.2: Looping through a three-dimensional array using two-
dimensional blocks and grids in CUDA.
7.4 Three-Dimensional Arrays In CUDA
In the two-dimensional FDTD simulation, cudaMallocPitch() was used to
allocate memory for the two-dimensional arrays. The use of
cudaMallocPitch() ensured that the arrays were properly aligned to meet
CUDA's requirements for coalesced memory access. For three-dimensional
arrays, the cudaMalloc3D() and make_cudaExtent() functions are used.
Listing 7.3 shows how these functions are used.
int main() {
    ...
    cudaPitchedPtr ptr;

    cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
    cudaMalloc3D(&ptr, extent);

    ...
}
Listing 7.3: Allocating a three-dimensional array in CUDA.
Further details of these functions are described in the CUDA Reference
Manual [20].
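A kernel then traverses such an allocation through the pitch recorded in the returned cudaPitchedPtr. The following is a minimal sketch following the Programming Guide's conventions; the kernel itself is illustrative, not the thesis code:

// Illustrative sketch: traversing a 3-D pitched allocation from cudaMalloc3D().
// ptr.pitch is the padded row width in bytes, so one x-y slice occupies
// slicePitch = ptr.pitch * height bytes.
__global__ void scale3d(cudaPitchedPtr ptr, int width, int height, int depth)
{
    char *base = (char*)ptr.ptr;
    size_t slicePitch = ptr.pitch * height;
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous dimension
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        for (int z = 0; z < depth; z++) {
            float *row = (float*)(base + z * slicePitch + y * ptr.pitch);
            row[x] *= 2.0f;
        }
    }
}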
7.5 Initial Run
Table 7.4 shows the results obtained from converting the main update loop
of Figure 7.1 to utilize CUDA.
At 256 × 256 × 256 cells, the throughput was calculated to be less than
19 Mcells/s. This is significantly less than the approximately 222 Mcells/s
achieved for the two-dimensional FDTD at the same number of cells
(4096 × 4096).
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 × 128 107,880.5770 4.7823
160 × 160 × 160 194,591.8580 4.8708
192 × 192 × 192 568,707.0920 2.7928
224 × 224 × 224 881,243.1030 2.7828
256 × 256 × 256 890,487.4880 4.0537
Table 7.4: Results for three-dimensional FDTD simulation on CUDA (initial
run).
However, it is worth noting that although few optimizations have been
done yet, CUDA takes less than 15 minutes to complete a simulation that
takes an hour on the CPU.
7.6 Visual Profiling
To improve on the speed-ups, the CUDA textual profiler was used. The
results were similar to those obtained in Chapter 6. The textual profiler
did not show any uncoalesced memory accesses and provided insufficient
information to determine where the bottleneck was.
Later in the project, it was discovered that although Emory does not
have a GUI natively, the Visual Profiler could be executed using X11
forwarding. A secondary computer was used to connect to Emory through
an SSH client with X11 forwarding enabled, which allowed the CUDA
Visual Profiler to display on the secondary computer.
The Visual Profiler proved to be very useful and immediately the bottle-
neck for the three-dimensional FDTD method was found. The Visual Profiler
showed that the global memory load and global memory store efficiencies
were just 0.06 for most of the CUDA kernels. The profiling results for the
six main kernels are shown in Table 7.5.
On the other hand, when the Visual Profiler was executed for the two-
dimensional FDTD method developed in Chapter 6, the results showed global
memory load and global memory store efficiencies that were close to one.
Method     gld_efficiency  gst_efficiency
ex_cuda_d  0.0630682       0.0630686
ey_cuda_d  0.0613636       0.0613641
ez_cuda_d  0.0630682       0.0630686
hx_cuda_d  0.06            0.0600002
hy_cuda_d  0.065625        0.0656253
hz_cuda_d  0.061875        0.0618752
Table 7.5: Results from visual profiling on three-dimensional FDTD simula-
tion.
These results were unexpected because the arrays were allocated using
cudaMalloc3D() as recommended by the CUDA Programming Guide [6] and
the CUDA Best Practices Guide [13].
7.7 Memory Coalescing for 3-D Arrays
The CUDA documentation [6, 13, 20] was consulted, but there was no
mention of how to index three-dimensional arrays in the kernel for coalesced
memory accesses. The NVIDIA forums were not helpful in this matter
either.
However, by slowly analysing and debugging the CUDA kernel, a simple
solution was found for indexing arrays in the kernel while maintaining
aligned accesses to global memory. The only limitation of this solution is
that the dimensions of the threads in a block must be the same; specifically,
the x- and y-dimensions of a block must be equal.
This solution was a huge milestone in the development of the three-
dimensional code and the speed-ups achieved are listed in Table 7.6. Figures
7.2 and 7.3 show the plots for the speed-up and throughput respectively.
Simulation Size CUDA Execution Time (ms) Speed-up
128 × 128 × 128 25,842.8480 19.9635
160 × 160 × 160 35,609.7410 26.6169
192 × 192 × 192 83,756.5080 18.9629
224 × 224 × 224 142,680.5420 17.1877
256 × 256 × 256 126,128.8680 28.6197
Table 7.6: Results for three-dimensional FDTD simulation on CUDA (with
coalesced memory access).
A throughput of over 130 Mcells/s and a speed-up of over 28x were achieved
for a simulation size of 16 Mcells. While CUDA's execution time for the
three-dimensional FDTD is significantly better than that of the CPU, the
throughput is still much lower than that of the two-dimensional FDTD
executed on CUDA. Again, the low throughput can be attributed to the
use of PMLs as absorbing regions. Thus, alternative absorbing boundary
conditions are used in place of the PMLs for comparison; this is discussed
in the following section.
The Visual Profiler was executed on the improved kernels to check the
efficiency of the global memory accesses. The results are shown in Table 7.7.
Comparing the results of Table 7.5 and Table 7.7, it is obvious that there is
Figure 7.2: Speed-up against simulation size (number of cells) for the three-
dimensional FDTD simulation.
Figure 7.3: Throughput (Mcells/s) against simulation size (number of cells)
for the three-dimensional FDTD simulation running on CUDA.
a huge improvement in the memory accesses of the new kernels.
In Table 7.7, some efficiencies exceed 1. This could be due to inaccuracies
in the Visual Profiler program; the CUDA Compute Visual Profiler User
Guide states that the efficiency should be between 0 and 1.
Method     gld_efficiency  gst_efficiency
ex_cuda_d  0.81388         1.01576
ey_cuda_d  0.785047        0.984379
ez_cuda_d  0.75882         1.00782
hx_cuda_d  0.857143        0.666667
hy_cuda_d  0.847646        0.658067
hz_cuda_d  0.941704        0.681822
Table 7.7: Results from visual profiling on three-dimensional FDTD simula-
tion (with new indexing of arrays in kernel).
Also, the plots of Figure 7.2 and Figure 7.3 show an irregularity in per-
formance with respect to simulation size. In the one- and two-dimensional
FDTD simulations, the trend is an increase in throughput and speed-up as
the simulation size increases. However, this trend did not extend to the
three-dimensional FDTD simulations. The results generated from the Visual
Profiler showed no obvious reason for this occurrence, so its cause still needs
to be investigated.
7.8 Discrepancy In Results
The same tests for discrepancy between the CPU and GPU as in the one-
dimensional FDTD simulation, detailed in Section 5.4, are performed here.
The results are shown in Table 7.8 below.
Simulation Size   Difference          Percentage change
128 × 128 × 128   2.980232 × 10^-7    0.113527%
160 × 160 × 160   2.980232 × 10^-7    0.018938%
192 × 192 × 192   2.980232 × 10^-7    0.136416%
224 × 224 × 224   2.980232 × 10^-7    0.079594%
256 × 256 × 256   2.980232 × 10^-7    0.068311%
Table 7.8: Discrepancy in results between CPU and CUDA simulation of the
three-dimensional FDTD method.
While the CUDA framework does not fully conform to the IEEE 754-
2008 standard for floating-point computation (as discussed in Section 3.2),
the results show that the discrepancy between CUDA and the CPU is
relatively small. The speed-ups and throughput that can be achieved by
CUDA are more than sufficient to justify disregarding the small discrepan-
cies. This was observed in one-, two- and three-dimensional FDTD sim-
ulations alike. Moreover, newer CUDA GPUs are designed to conform to
the IEEE 754-2008 standard.
Figure 7.4 and Figure 7.5 show plots of the data generated from the
simulation of 160 × 160 × 160 cells. The plots show that the discrepancy
between the CPU and CUDA is neither significant nor noticeable.
7.9 Alternative Absorbing Boundaries
Berenger’s PMLs have been used as the absorbing region for the three-
dimensional FDTD simulation up until this point. Just like in the case of
the two-dimensional FDTD method, it was found that the PMLs were not
particularly well-suited for running in CUDA. Thus, alternative absorbing
Figure 7.4: Plot of 3-D FDTD simulation results (magnitude against time-
step, CPU versus GPU) to compare accuracy between CPU and GPU. The
plots are generated from the results for simulation size 160 × 160 × 160
cells. The location of the probe is at cell (60, 60, 60) as specified in Table
7.1.
Figure 7.5: Plot of 3-D FDTD simulation results (magnitude against time-
step, CPU versus GPU) to compare accuracy between CPU and GPU. The
plots are generated from the results for simulation size 160 × 160 × 160
cells. The location of the probe is at cell (60, 60, 60) as specified in Table
7.1. The plot is centered between time-steps 300 and 350.
boundary conditions were used to investigate how much more speed-up and
throughput can be achieved.
The three ABCs used are Mur's first order ABC, Mur's second order ABC
and Liao's ABC, similar to what was done for the two-dimensional FDTD
simulation in Section 6.8. The equations for the ABCs are shown below.
The equations assume that the boundary is at x = 0. Although only Ez is
considered, the equations have to be applied to Ex and Ey as well.
Mur's 1st order ABC:

E_z\big|^{n+1}_{0,j,k} = E_z\big|^{n}_{1,j,k}
  + \frac{c\Delta t - \Delta x}{c\Delta t + \Delta x}
    \left( E_z\big|^{n+1}_{1,j,k} - E_z\big|^{n}_{0,j,k} \right)    (7.1)
Mur's 2nd order ABC:

E_z\big|^{n+1}_{0,j,k} = E_z\big|^{n-1}_{1,j,k} + EQ_1 + EQ_2 + EQ_3 + EQ_4    (7.2)

EQ_1 = \frac{c\Delta t - \Delta x}{c\Delta t + \Delta x}
       \left( E_z\big|^{n+1}_{1,j,k} - E_z\big|^{n-1}_{0,j,k} \right)    (7.3)

EQ_2 = \frac{2\Delta x}{c\Delta t + \Delta x}
       \left( E_z\big|^{n}_{0,j,k} - E_z\big|^{n}_{1,j,k} \right)    (7.4)

EQ_3 = \frac{\Delta x (c\Delta t)^2}{2(\Delta y)^2 (c\Delta t + \Delta x)} (C_a + C_b)    (7.5)

EQ_4 = \frac{\Delta x (c\Delta t)^2}{2(\Delta z)^2 (c\Delta t + \Delta x)} (C_c + C_d)    (7.6)

C_a = E_z\big|^{n}_{0,j+1,k} - 2E_z\big|^{n}_{0,j,k} + E_z\big|^{n}_{0,j-1,k}    (7.7)

C_b = E_z\big|^{n}_{1,j+1,k} - 2E_z\big|^{n}_{1,j,k} + E_z\big|^{n}_{1,j-1,k}    (7.8)

C_c = E_z\big|^{n}_{0,j,k+1} - 2E_z\big|^{n}_{0,j,k} + E_z\big|^{n}_{0,j,k-1}    (7.9)

C_d = E_z\big|^{n}_{1,j,k+1} - 2E_z\big|^{n}_{1,j,k} + E_z\big|^{n}_{1,j,k-1}    (7.10)
77
Page 91
7.9. ALTERNATIVE ABSORBING BOUNDARIES
Liao's ABC:

E_z\big|^{n+1}_{0,j,k} = \sum_{m=1}^{N} (-1)^{m+1} C_N^m \,
    E_z\big|^{\,n+1-m}_{\,mc\Delta t,\,j,\,k}    (7.11)

where N is the order of the boundary condition and C_N^m is the binomial
coefficient given by:

C_N^m = \frac{N!}{m!\,(N-m)!}    (7.12)

For example, a second order (N = 2) Liao boundary uses C_2^1 = 2 and
C_2^2 = 1.
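To give a sense of why these ABCs map onto CUDA more readily than the PML, the following is a minimal sketch of a kernel applying Eq. 7.1 on the x = 0 face. The names, array layout and the ez_old buffer (a copy of the Ez field at time n) are assumptions for illustration, not the thesis code:

// Illustrative sketch: Mur's first order ABC (Eq. 7.1) on the x = 0 face,
// launched after the interior E-field update. ez holds the fields at time
// n+1, ez_old holds the fields at time n, pitch_f is the padded row length
// in floats, and coef = (c*dt - dx)/(c*dt + dx) is precomputed on the host.
__global__ void mur1_x0(float *ez, const float *ez_old, float coef,
                        int pitch_f, int ny, int nz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous dimension
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (j < ny && k < nz) {
        int i0 = (0 * ny + j) * pitch_f + k;        // Ez(0, j, k)
        int i1 = (1 * ny + j) * pitch_f + k;        // Ez(1, j, k)
        ez[i0] = ez_old[i1] + coef * (ez[i1] - ez_old[i0]);
    }
}

Each boundary cell is updated independently of its neighbours, so the kernel parallelises trivially; the PML, by contrast, requires extra split-field arrays and per-region bookkeeping in every pass.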
The speed-up and throughput for the three-dimensional FDTD simulation
on CUDA are shown in Figure 7.6 and Figure 7.7 respectively.
It is apparent from the results that the irregularity in performance across
simulation sizes noted with the PMLs still exists when other ABCs are
used. However, Mur's ABCs performed better than the PMLs. It must also
be noted that Liao's ABC was implemented using doubles while the other
ABCs were implemented using floats. From the results, it can be argued
that Liao's ABC performed reasonably well, which could be attributed to
the simplicity of Liao's ABC compared to the PML and Mur's ABCs.
All three types of ABCs produced results that were reasonably similar
to those of the PML. Also, the PML width used for the simulations is only
4 cells, which should be taken into consideration when analysing the dis-
crepancy: if the width of the PML region were increased, there would be
less reflection at the boundaries, but the execution time would increase.
Figure 7.6: Comparison of speed-up against simulation size (number of cells)
for various ABCs (PML, 1st order Mur, 2nd order Mur, Liao) in the three-
dimensional FDTD simulation running on CUDA.
Figure 7.7: Comparison of throughput (Mcells/s) against simulation size
(number of cells) for various ABCs (PML, 1st order Mur, 2nd order Mur,
Liao) in the three-dimensional FDTD simulation running on CUDA.
Figure 7.8: Plot of 3-D FDTD simulation results (magnitude against time-
step for the PML, 1st order Mur, 2nd order Mur and Liao boundaries) to
compare accuracy between the various ABCs. The plots are generated from
the results for simulation size 160 × 160 × 160 cells. The location of the
probe is at cell (60, 60, 60) as specified in Table 7.1.
Figure 7.9: Plot of 3-D FDTD simulation results (magnitude against time-
step for the PML, 1st order Mur, 2nd order Mur and Liao boundaries) to
compare accuracy between the various ABCs. The plots are generated from
the results for simulation size 160 × 160 × 160 cells. The location of the
probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered
between time-steps 400 and 700.
7.10 Conclusions
Porting the FDTD method to execute on CUDA produced significant speed-
ups. However, obtaining optimal performance was significantly more difficult
than for the one- and two-dimensional FDTD simulations. One reason for
this is that CUDA does not support three-dimensional kernel configurations
with a large number of threads.
Although the throughput achieved was less than that of the one- and
two-dimensional code, the speed-up achieved is encouraging. For example,
it takes the CPU more than an hour to complete the simulation for 256 ×
256× 256 cells but CUDA takes just over two minutes to complete the same
simulation.
Also, as was discovered with the two-dimensional FDTD simulations,
there are other ABCs that perform better on CUDA compared to the PMLs.
The PMLs are not as well-suited to leverage CUDA’s parallel architecture.
8 Conclusions
8.1 Thesis Conclusions
The thesis finds that the FDTD method is very well suited to run on par-
allel architectures such as CUDA. Throughputs of over 1,000 Mcells/s can
be achieved on CUDA where the CPU only manages 4 Mcells/s. However,
achieving these speed-ups can be difficult; it is in fact significantly more
difficult to write code that runs optimally on CUDA than on the CPU.
Most of the problems faced during development were related to memory
accesses on CUDA. There is various documentation and there are tools
available to developers, but occasionally these resources are insufficient.
However, it is the author's opinion that the difficulty of producing optimal
CUDA programs is not a strong deterrent from using CUDA to execute the
FDTD method. Taking into perspective the results that have been achieved
in the thesis, the use of CUDA is without doubt a good choice. It is hard
to imagine waiting an hour for the CPU to complete a simulation of
256 × 256 × 256 cells when it can be done in just over two minutes using
CUDA.
It should also be noted that not only are the computational capabilities of
GPUs continuing to rise at a faster pace compared to CPUs [8], the CUDA
framework is also consistently being improved. With these improvements, it
is almost certain that programming on CUDA will become easier and higher
speed-ups will be achieved in the near future.
As for the progress of the thesis, the simulations for the three-dimensional
FDTD method took longer than expected due to unforeseen complexity. As
development progressed, it was found that experimenting with alternative
ABCs in place of the PMLs would be beneficial; thus, Mur's first order,
Mur's second order and Liao's ABCs were implemented as advanced work
for the thesis.
8.2 Future Work
This thesis explored the capabilities and prospects of using CUDA as an
alternative to CPUs, and the implementation of the CUDA program was
therefore kept relatively straightforward. There is still a lot of work that
can be done, such as investigating the use of shared memory and texture
memory. While the thesis focused on the higher-priority recommendations
of the CUDA Best Practices Guide [13], there are numerous other optimiza-
tions that can be explored to produce a more optimal CUDA program.
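As a pointer for that future work, a 2-D stencil update typically stages a tile of the neighbouring field in shared memory so that threads in a block re-use each other's loads. The following is a minimal sketch under assumed names (ex, hz, scalar coefficients ca and cb), not a tested FDTD implementation:

#define TILE 16

// Illustrative sketch: each block stages a TILE x (TILE+1) patch of hz,
// including one halo column, in shared memory, so the hz[i][j-1] neighbour
// is served from shared memory instead of a second global load.
__global__ void update_ex_shared(float *ex, const float *hz,
                                 float ca, float cb,
                                 int pitch_f, int ie, int jb)
{
    __shared__ float s_hz[TILE][TILE + 1];
    int j = blockIdx.x * TILE + threadIdx.x;   // contiguous dimension
    int i = blockIdx.y * TILE + threadIdx.y;

    if (i < ie && j < jb) {
        int idx = i * pitch_f + j;
        s_hz[threadIdx.y][threadIdx.x + 1] = hz[idx];
        if (threadIdx.x == 0 && j > 0)         // load the halo column j-1
            s_hz[threadIdx.y][0] = hz[idx - 1];
    }
    __syncthreads();

    if (i < ie && j > 0 && j < jb) {
        int idx = i * pitch_f + j;
        ex[idx] = ca * ex[idx]
                + cb * (s_hz[threadIdx.y][threadIdx.x + 1]
                      - s_hz[threadIdx.y][threadIdx.x]);
    }
}

Whether this pays off for the FDTD updates, which touch each field value only once or twice per time-step, is exactly the kind of question this future work would answer.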
For the FDTD method on CUDA, the use of Mur’s and Liao’s ABCs can
definitely reduce simulation times compared to using PMLs. This is another
topic that can be studied in the future.
Although the focus of this thesis is on NVIDIA's CUDA architecture,
there are other parallel architectures on the market today. GPU manu-
facturers including NVIDIA and AMD support the Open Computing Lan-
guage (OpenCL), which is a royalty-free standard for parallel computing
[21, 22]. Research into OpenCL and comparisons with CUDA can also be
done as future work.
Bibliography
[1] Kane S. Yee. Numerical solution of initial boundary value problems
involving Maxwell's equations in isotropic media. Antennas and Propa-
gation, IEEE Transactions on, 14(3):302–307, 1966.
[2] K. L. Shlager and J. B. Schneider. A selective survey of the finite-
difference time-domain literature. Antennas and Propagation Magazine,
IEEE, 37(4):39–57, 1995.
[3] L. M. Angelone, S. Tulloch, G. Wiggins, S. Iwaki, N. Makris, and
G. Bonmassar. New high resolution head model for accurate electro-
magnetic field computation. In ISMRM Thirteenth Scientific Meeting,
page 881, Miami, FL, USA, 2005.
[4] Wenhua Yu et al. Parallel Finite-Difference Time-Domain Method.
Artech House electromagnetic analysis series. Artech House, Boston,
MA, 2006.
[5] Dennis M. Sullivan. Electromagnetic Simulation Using the FDTD
Method. IEEE Press series on RF and microwave technology. IEEE
Press, New York, 2000.
[6] NVIDIA. CUDA C Programming Guide Version 3.1.1, 2010.
[7] NVIDIA. Fermi Compute Architecture White Paper, 2009.
[8] Ong Cen Yen, M. Weldon, S. Quiring, L. Maxwell, M. Hughes, C. Whe-
lan, and M. Okoniewski. Speed it up. Microwave Magazine, IEEE,
11(2):70–78, 2010.
[9] NVIDIA. Tesla C2050/C2070 GPU computing processor at 1/10th the
cost, 2010.
[10] Karl E. Hillesland and Anselmo Lastra. GPU floating-point paranoia,
2004.
[11] United States General Accounting Office. Patriot missile defense: Soft-
ware problem led to system failure at Dhahran, Saudi Arabia, February
4, 1992.
[12] Robert Sedgewick and Kevin Daniel Wayne. Floating point, 2010.
[13] NVIDIA. CUDA C Best Practices Guide Version 3.1, 2010.
[14] Paulius Micikevicius. 3D finite difference computation on GPUs using
CUDA, 2009.
[15] James F. Stack. Accelerating the finite difference time domain (FDTD)
method with CUDA, 2010.
[16] Alvaro Valcarce and Jie Zhang. Implementing a 2D FDTD scheme with
CPML on a GPU using CUDA, 2010.
[17] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Mor-
ton, E. Phillips, Zhang Yao, and V. Volkov. Parallel computing experi-
ences with CUDA. Micro, IEEE, 28(4):13–27, 2008.
[18] NVIDIA. Compute Visual Profiler, 2010.
[19] Rob Farber. CUDA, supercomputing for the masses: Part 6, 2008.
[20] NVIDIA. CUDA Reference Manual 3.1, 2010.
[21] Khronos Group. Khronos launches heterogeneous computing initiative,
2008.
[22] Khronos Group. The Khronos Group releases OpenCL 1.0 specification,
2008.
[23] S. Adams, J. Payne, and R. Boppana. Finite difference time do-
main (FDTD) simulations using graphics processors. In DoD High Per-
formance Computing Modernization Program Users Group Conference,
2007, pages 334–338, 2007.
[24] G. Cummins, R. Adams, and T. Newell. Scientific computation through
a GPU. In Southeastcon, 2008. IEEE, pages 244–246, 2008.
[25] Atef Z. Elsherbeni and Veysel Demir. The Finite-Difference Time-
Domain Method for Electromagnetics with MATLAB Simulations.
SciTech Pub., Raleigh, NC, 2009.
[26] C. D. Moss, F. L. Teixeira, and Kong Jin Au. Analysis and compensa-
tion of numerical dispersion in the FDTD method for layered, anisotropic
media. Antennas and Propagation, IEEE Transactions on, 50(9):1174–
1184, 2002.
[27] NVIDIA. CUDA Architecture Introduction & Overview, 2009.
[28] Robert Sedgewick and Kevin Daniel Wayne. Introduction to Program-
ming in Java: An Interdisciplinary Approach. Pearson Addison-Wesley,
Boston, 2008.
[29] Allen Taflove. Computational Electrodynamics: The Finite-Difference
Time-Domain Method. Artech House, Boston, 1995.
[30] F. Zheng and Z. Chen. Numerical dispersion analysis of the uncondi-
tionally stable 3-D ADI-FDTD method. Microwave Theory and Tech-
niques, IEEE Transactions on, 49(5):1006–1009, 2001.
A Emory Specifications
These specifications are obtained by executing the ‘deviceQuery’ program
bundled together with the NVIDIA GPU Computing SDK.
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294770688 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads
                                                 can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No

Device 1: "Tesla T10 Processor"
  (Identical specifications to Device 0.)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10,
CUDA Runtime Version = 3.10, NumDevs = 2, Device = Tesla T10
Processor, Device = Tesla T10 Processor
B Uncoalesced Memory Access Test
/*
 * Test cudaMallocPitch() usage.
 *
 * ./binary <size_x> <size_y>
 */

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>

#define DIM_SIZE 16

#define NUM_REPS 10

__global__ void copy(float *odata, float *idata, int pitch,
                     int size_x, int size_y) {
    unsigned int yid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int xid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_y && yid < size_x ) {
        int index = ( yid * pitch / sizeof(float) ) + xid;
        odata[index] = idata[index] * 2.0;
    }
}

// http://www.drdobbs.com/high-performance-computing/207603131
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( err != cudaSuccess )
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main (int argc, char** argv) {
    int size_x = 64;
    int size_y = 64;

    if ( argc > 2 ) {
        sscanf(argv[1], "%d", &size_x);
        sscanf(argv[2], "%d", &size_y);
    }

    printf("size_x: %d\n", size_x);
    printf("size_y: %d\n", size_y);

    // execution configuration parameters
    dim3 grid(ceil((float)size_x/DIM_SIZE), ceil((float)size_y/DIM_SIZE), 1),
         threads(DIM_SIZE, DIM_SIZE, 1);

    // size of memory required to store the matrix
    const int mem_size = sizeof(float) * size_x * size_y;

    // allocate host memory
    float *h_idata = (float*) malloc(mem_size);
    float *h_odata = (float*) malloc(mem_size);

    // allocate device memory
    float *d_idata, *d_odata;
    size_t pitch;
    cudaMallocPitch((void**)&d_idata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");
    cudaMallocPitch((void**)&d_odata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");

    printf("dpitch: %d\n", (int)pitch);

    // initialize host data
    for ( int i = 0; i < size_x*size_y; i++ ) {
        h_idata[i] = (float)i;
        h_odata[i] = 0.0f;
    }

    // copy host data to device
    cudaMemcpy2D(d_idata, pitch, h_idata, size_y * sizeof(float),
                 size_y * sizeof(float), size_x, cudaMemcpyHostToDevice);
    checkCUDAError("cudaMemcpy2D H to D");

    for ( int i=0; i < NUM_REPS; i++) {
        copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
    }

    // copy device data to host
    cudaMemcpy2D(h_odata, size_y * sizeof(float), d_odata, pitch,
                 size_y * sizeof(float), size_x, cudaMemcpyDeviceToHost);
    checkCUDAError("cudaMemcpy2D D to H");

    for ( int k = 0; k < size_x * size_y; k++ ) {
        if ( h_odata[k] != h_idata[k] * 2.0 ) {
            printf("Mismatch!\n");
            printf("h_idata[%d] = %f\n", k, h_idata[k]);
            printf("h_odata[%d] = %f\n", k, h_odata[k]);

            printf("---result---\n");
            for ( int i = 0; i < size_x; i++ ) {
                for ( int j = 0; j < size_y; j++ ) {
                    printf("%d ", (int)h_odata[i*size_y+j]);
                }
                printf("\n");
            }
            break;
        }
    }

    free(h_idata);
    free(h_odata);

    cudaFree(d_idata);
    cudaFree(d_odata);

    printf("Completed.\n");
    return 0;
}
Listing B.1: Source code for testing uncoalesced memory access.
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla T10 Processor
# CUDA_PROFILE_CSV 1

method         gputime   cputime  gld_coherent  gld_incoherent  gst_coherent  gst_incoherent
memcpyHtoD     3000.768  3602
Z4copyPfS_iii  7579.68   7635     1049600       0               2099200       0
Z4copyPfS_iii  7488.16   7508     1049600       0               2099200       0
Z4copyPfS_iii  7576.928  7597     1047040       0               2094080       0
Z4copyPfS_iii  7628.576  7647     1049600       0               2099200       0
Z4copyPfS_iii  7511.2    7530     1047040       0               2094080       0
Z4copyPfS_iii  7417.344  7436     1049600       0               2099200       0
Z4copyPfS_iii  7609.152  7629     1049600       0               2099200       0
Z4copyPfS_iii  7517.76   7536     1047040       0               2094080       0
Z4copyPfS_iii  7514.816  7530     1049600       0               2099200       0
Z4copyPfS_iii  7563.744  7580     1047040       0               2094080       0
memcpyDtoH     2666.496  3253

Table B.1: CUDA Textual Profiler results from testing uncoalesced memory
access.
C Coalesced Memory Access Test
/*
 * Test cudaMallocPitch() usage.
 *
 * ./binary <size_x> <size_y>
 */

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>

#define DIM_SIZE 16

#define NUM_REPS 10

__global__ void copy(float *odata, float *idata, int pitch,
                     int size_x, int size_y) {
    unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_y && yid < size_x ) {
        int index = ( yid * pitch / sizeof(float) ) + xid;
        odata[index] = idata[index] * 2.0;
    }
}

// http://www.drdobbs.com/high-performance-computing/207603131
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( err != cudaSuccess )
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main (int argc, char** argv) {
    int size_x = 64;
    int size_y = 64;

    if ( argc > 2 ) {
        sscanf(argv[1], "%d", &size_x);
        sscanf(argv[2], "%d", &size_y);
    }

    printf("size_x: %d\n", size_x);
    printf("size_y: %d\n", size_y);

    // execution configuration parameters
    dim3 grid(ceil((float)size_y/DIM_SIZE), ceil((float)size_x/DIM_SIZE), 1),
         threads(DIM_SIZE, DIM_SIZE, 1);

    // size of memory required to store the matrix
    const int mem_size = sizeof(float) * size_x * size_y;

    // allocate host memory
    float *h_idata = (float*) malloc(mem_size);
    float *h_odata = (float*) malloc(mem_size);

    // allocate device memory
    float *d_idata, *d_odata;
    size_t pitch;
    cudaMallocPitch((void**)&d_idata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");
    cudaMallocPitch((void**)&d_odata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");

    printf("dpitch: %d\n", (int)pitch);

    // initialize host data
    for ( int i = 0; i < size_x*size_y; i++ ) {
        h_idata[i] = (float)i;
        h_odata[i] = 0.0f;
    }

    // copy host data to device
    cudaMemcpy2D(d_idata, pitch, h_idata, size_y * sizeof(float),
                 size_y * sizeof(float), size_x, cudaMemcpyHostToDevice);
    checkCUDAError("cudaMemcpy2D H to D");

    for ( int i=0; i < NUM_REPS; i++) {
        copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
    }

    // copy device data to host
    cudaMemcpy2D(h_odata, size_y * sizeof(float), d_odata, pitch,
                 size_y * sizeof(float), size_x, cudaMemcpyDeviceToHost);
    checkCUDAError("cudaMemcpy2D D to H");

    for ( int k = 0; k < size_x * size_y; k++ ) {
        if ( h_odata[k] != h_idata[k] * 2.0 ) {
            printf("Mismatch!\n");
            printf("h_idata[%d] = %f\n", k, h_idata[k]);
            printf("h_odata[%d] = %f\n", k, h_odata[k]);

            printf("---result---\n");
            for ( int i = 0; i < size_x; i++ ) {
                for ( int j = 0; j < size_y; j++ ) {
                    printf("%d ", (int)h_odata[i*size_y+j]);
                }
                printf("\n");
            }
            break;
        }
    }

    free(h_idata);
    free(h_odata);

    cudaFree(d_idata);
    cudaFree(d_odata);

    printf("Completed.\n");
    return 0;
}
Listing C.1: Source code for testing coalesced memory access.
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla T10 Processor
# CUDA_PROFILE_CSV 1

method         gputime   cputime  gld_coherent  gld_incoherent  gst_coherent  gst_incoherent
memcpyHtoD     3002.436  15s
Z4copyPfS_iii  161.792   201      6560          0               26240         0
Z4copyPfS_iii  161.92    190      6560          0               26240         0
Z4copyPfS_iii  159.456   178      6544          0               26176         0
Z4copyPfS_iii  161.472   181      6560          0               26240         0
Z4copyPfS_iii  162.272   181      6544          0               26176         0
Z4copyPfS_iii  161.152   177      6560          0               26240         0
Z4copyPfS_iii  160.224   179      6560          0               26240         0
Z4copyPfS_iii  160.576   176      6544          0               26176         0
Z4copyPfS_iii  160.32    179      6560          0               26240         0
Z4copyPfS_iii  162.112   190      6544          0               26176         0
memcpyDtoH     2678.368  3280

Table C.1: CUDA Textual Profiler results from testing coalesced memory
access.