Page 1: Final Year Research Thesis

Evaluating magnetic susceptibility in Heisenberg chains using OpenCL implementations of Monte Carlo methods

Lee O’ Riordan

School of Science

Waterford Institute of Technology

Supervisors: Dr Kieran Murphy, Dr P.J. Cregg

Submitted in partial fulfilment of the requirements for the degree of BSc (Hons) in Physics with Computing

at Waterford Institute of Technology.

April 2010

Page 2: Final Year Research Thesis

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of BSc (Hons) in Physics with Computing, is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed: ID No: 20022710

Date: 26 April, 2010

Page 3: Final Year Research Thesis

Acknowledgements

To the following people, who have helped me to complete this body of work, I give my sincerest gratitude. I wish to thank my supervisor, Dr Kieran Murphy, for giving me the opportunity to work on such an interesting problem, and also for all of the assistance he has provided me along the way.

To all of you who have helped me stay sane over the past 12 weeks: James and Alan, thanks for helping me out wherever possible in the midst of the insanity — I really appreciate it. Peach, for always trying to get me to stop doing work and have fun — your dedication to it deserves a mention. Conor, for being in an equally challenging position and still giving it your all — great work man! John, for helping me experience the other side of the planet; it gave me a lot of perspective on life! Dan, for always making the days fun, and for all the madness and drama too! Brid, for being there with me from the beginning, and never having a harsh word to say. Eugene, for his fantastic anti-stress regimes. Jamie, for helping me maintain my sanity during the journeys. Last, but not least, Emma — for always helping me to stay smiling, despite what lay ahead :)

A special thank you to John Petrucci, Paul Gilbert, Nobuo Uematsu, Steve Vai, Joe Satriani, Evergrey, Kamelot, Allen-Lande, and the various others who comprise the soundtrack to my studies — be it listening, or playing along! Without the noise to drive me the work ceases to flow.

Finally, I wish to thank my family — their collective hard work, support, patience, and unlimited understanding has always helped me to do what needs to be done.

(P.S. An extra special thank you to Libby and Cleo for always welcoming me home every night. Thanks girls!)

Page 4: Final Year Research Thesis

I dedicate this paper to my parents, without whose support none of this would be possible.

Page 5: Final Year Research Thesis

Abstract

The use of general purpose GPU (GPGPU) computing is emerging as a new field of study for scientific applications. Parallel processing may be implemented on GPU devices to exploit their significant computational ability, such as in the evaluation of complex mathematical functions. Herein we attempt to evaluate a 2N-dimensional classical Heisenberg spin-chain partition function integral through the use of Monte Carlo integration methods on a GPU device. With the goal being a significant speedup, the execution timing of the routines is compared to conventional evaluation using serial computing routines, and the accuracy of the results is compared to known special analytical cases of the partition function.

Page 6: Final Year Research Thesis

Contents

1 Introduction 1

2 Literature Review 3

2.1 Platform & Architecture: OpenCL . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Mathematical Methods: Monte-Carlo . . . . . . . . . . . . . . . . . . . . . 5

2.3 Physical Application: Magnetic Susceptibility . . . . . . . . . . . . . . . . 7

3 Background Theory 10

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Magnetic Susceptibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2.1 Overview and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2.1.1 Paramagnetism . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2.1.2 Susceptibility . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2.2 Classical Heisenberg Model . . . . . . . . . . . . . . . . . . . . . . 13

3.2.3 Calculation of the Susceptibility . . . . . . . . . . . . . . . . . . . . 15

3.3 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3.1 Overview and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3.2 Comparison to other methods . . . . . . . . . . . . . . . . . . . . . 17

3.3.3 Implementation Considerations . . . . . . . . . . . . . . . . . . . . 19

3.3.3.1 Random Number Generation . . . . . . . . . . . . . . . . 19

3.3.3.2 Parallel Versus Serial Computation . . . . . . . . . . . . . 20

3.3.3.3 Machine Epsilon Consideration . . . . . . . . . . . . . . . 21

3.4 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4.1 Overview and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4.1.1 Execution Model . . . . . . . . . . . . . . . . . . . . . . . 24

3.4.1.2 Hardware Considerations . . . . . . . . . . . . . . . . . . 26

3.4.1.3 Comparison to NVidia CUDA and ATI Stream . . . . . . 27

3.4.1.4 Overall Program Structure . . . . . . . . . . . . . . . . . . 28

4 Implementation 30

4.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1.1 System Hardware and Software . . . . . . . . . . . . . . . . . . . . 30

4.1.2 OpenCL System Information . . . . . . . . . . . . . . . . . . . . . . 31


Page 7: Final Year Research Thesis


4.1.3 Program Structure and Design . . . . . . . . . . . . . . . . . . . . . 33

4.2 Evolution of OpenCL Implementations . . . . . . . . . . . . . . . . . . . . 36

4.2.1 Version I — Uniform Distribution . . . . . . . . . . . . . . . . . . . 36

4.2.2 Version II — Multidimensional Integration of a known function . . 37

4.2.3 Version III — Integral Evaluation on Compute Device . . . . . . . 37

4.2.4 Version IV — Integral of Partition Function . . . . . . . . . . . . . 39

4.2.4.1 Subversion 1 — Single Memory Block . . . . . . . . . . . 39

4.2.4.2 Subversion 2 — Random Value Recycling . . . . . . . . . 41

4.2.4.3 Subversion 3 — Dual Global Arrays . . . . . . . . . . . . 42

5 Results 44

5.1 System Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.1.1 Version I — Uniformity Testing . . . . . . . . . . . . . . . . . . . . 44

5.1.2 Version II — Test of Monte Carlo Integration Method . . . . . . . . 44

5.1.3 Known Integral in 1-Dimension: CPU vs GPU . . . . . . . . . . . . 45

5.1.4 Integration of partition function . . . . . . . . . . . . . . . . . . . . 47

6 Conclusions 52

A Code Listing A-1

A.1 System Information Output . . . . . . . . . . . . . . . . . . . . . . . . . . A-1

A.2 Host Machine Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3

A.3 OpenCL Device Kernels and Functions . . . . . . . . . . . . . . . . . . . . A-27


Page 8: Final Year Research Thesis

List of Figures

3.1 OpenCL Program Index Space [32] . . . . . . . . . . . . . . . . . . . . . . 25

3.2 OpenCL Memory Model from ATI OpenCL Intro (source: [20]) . . . . . . . 26

4.1 Statechart of proposed system implementation . . . . . . . . . . . . . . . . 34

4.2 Sequence diagram of proposed system implementation . . . . . . . . . . . . 35

5.1 Single dimensional integral for testing of the Monte Carlo routine comparing CPU and GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.2 Partition function, Z2-spin, with zero magnetic field (ξ = 0) and exchange parameter 0 < K ≤ 50, calculated using Monte Carlo sampling (2^24 samples) and analytic formula [10, equ 4.13]. (Insert shows relative error of Monte Carlo estimates.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.3 Actual error (compared with the analytical result) of the Monte Carlo method as a function of the number of sample points. . . . . . . . . . . . 49

5.4 Workgroup size versus total kernel execution time GPU . . . . . . . . . . . 50

5.5 Workgroup size versus total kernel execution time CPU . . . . . . . . . . . 51


Page 9: Final Year Research Thesis

Notation

Physical Notation

NR Number of generated random values.

NS Number of sample points of the function. NS = NR/(2n).

n Spin chain length.

kB Boltzmann constant. kB = 1.3806503 × 10^−23 J/K

µ0 Vacuum permeability. µ0 = 4π × 10^−7 N/A^2

µB Bohr magneton. µB = 9.27400915 × 10^−24 J/T

H Applied magnetic field.

T Used for temperature in Chapter 2; time elsewhere.

g Lande g-Factor. g ≈ 2

s Spin quantum number.

i, j, k Loop index variables for summation and products.

Jk,k+1 Exchange constant between neighbouring spins k, k + 1.

JCk,k+1 Exchange parameter between neighbouring spins k, k + 1. JCk,k+1 = Jk,k+1 s(s + 1).

m Classical magnetic moment. m = gµB √(s(s + 1)).

ξ Field parameter. ξ = µ0mH/(kBT)

Computational Notation

Workitem OpenCL unit of work. Analogous to a thread.

Workgroup A collection of OpenCL workitems. Subdivision of global space.


Page 10: Final Year Research Thesis


Kernel Program code which runs on OpenCL-enabled devices. Parallel operations take place here.

Host Computer system managing and controlling the OpenCL device. Performs all essential setup and data transfer activities.

Global ID Index of the global problem space for a particular element. Maximum size defined by the NDRange code identifier. Analogous to an array index.

Local ID Index of an element within a specific workgroup.

localSize Code identifier for the number of workitems per workgroup. Also known as groupSize.

Bus Identifier for the width of the device communication bus.

Bandwidth Maximum data transfer per unit time.

FLOPS Floating point operations per second. A measure of computational performance.

All computational notation derived from program source code in this document shall be highlighted blue to distinguish it from mathematical symbols and constants where necessary.


Page 11: Final Year Research Thesis

Chapter 1

Introduction

Recent developments in graphics processing unit (GPU) technologies, such as large-scale increases in processing power, have led to the use of GPU devices in applications other than their native image processing. This field of study has become widely known as general purpose GPU (GPGPU) computing, encompassing techniques such as the acceleration of computationally expensive calculations which, by other means, would take significant time to complete. Compared to a conventional central processing unit (CPU), where modern devices contain multiple cores on a single semiconductor die with up to four cores being commonplace, GPUs may contain several multiples of this number of cores. Treating each core as an individual computing unit, GPUs may be regarded as inherently parallel computing devices, and as such require careful consideration in the implementation of programs to unleash the full processing abilities of the device. Many programming interfaces exist through which access to the GPU for general purpose applications can be achieved; examples of such interfaces are NVidia’s CUDA technology [50] and the Khronos Group’s Open Computing Language (OpenCL).

In the work of Cregg et al [9] the partition function, used to calculate the magnetization and susceptibility of molecular magnets in chains of N molecules, is reduced from a 2N-dimensional integral into sums of known functions. This reduced form is significantly easier to evaluate in comparison to using the multi-dimensional integral. However, there are situations in which the original form of the partition function must be used and we are faced with evaluating a 2N-dimensional integral. The Monte Carlo family of methods is typically used to evaluate high-dimensional integrals, and of interest here is an efficient implementation of a Monte Carlo method on a GPU using the OpenCL standard. The following considerations and interests are undertaken herein:

1. While Monte Carlo methods are trivial to parallelise, naive parallel implementations tend to underperform due to correlations between the random streams in the separate threads. What modifications are needed to Monte Carlo methods in order to fully utilise the computational capabilities of a GPU when using OpenCL?

2. The comparison of CPU and GPU-based Monte Carlo implementations of the resulting 2N-dimensional integral of the partition function, in terms of calculation times and resulting numerical accuracies.

3. Examine whether the use of the multi-dimensional formulation aids the recovery of the exchange parameters when fitting to simulated susceptibility data. In the work of Cregg et al [10] it was shown that, with typical noise levels, it was possible to recover the exchange parameter values using non-linear curve fitting for chains of up to 5 molecules.

All investigations carried out herein are for the purpose of acquiring information pertaining to the above. The three main areas to be explored are the partition function, and the means through which a solution for the magnetic susceptibility may be calculated; the use of Monte Carlo methods for integration and the calculation of results from the partition function and its subsequent derivatives; and the use of OpenCL as a means of achieving a fast and efficient Monte Carlo implementation on GPU hardware.


Page 13: Final Year Research Thesis

Chapter 2

Literature Review

The three main areas of interest of my project are the platform and architecture on which the calculations may be performed, the mathematical methods for performing such calculations, and the physical application of the calculations. As such, the following three areas are explored:

2.1 Platform & Architecture: OpenCL

As OpenCL is a newly developing specification, documentation and working examples can be somewhat difficult to source. The most current material on the subject of OpenCL can be derived from the OpenCL specification, supplied by the Khronos Group, the consortium behind the development of the OpenCL specification [32], with equally relevant documentation supplied by AMD, Apple & NVidia. Although commissioned by Apple, the OpenCL standard was developed by the Khronos Group. Resulting from this, the Khronos Group provide the most in-depth base of information regarding the OpenCL standard and all implementations thereof. As stated in the Khronos specification, OpenCL offers a means of achieving parallelization across independent platforms such as CPUs, GPUs, Cell-based broadband engines and DSP devices [32]. The heterogeneous architecture of the OpenCL standard is described in detail, dealing with the structure of an OpenCL program with regard to the Platform, Execution, Memory, and Programming models, amongst all necessary actions and underlying routines [32]. The specification provides the most in-depth source with regard to all necessary programming paradigms and the organisation of an OpenCL application.

Much of the information supplied with the Khronos specification is regularly undergoing revision, and hence updates may occur on a regular basis until the standard has achieved a certain level of maturity. Although this should not have any direct implications for developers, the implementations of the main vendors may undergo similar updates over time. Currently, the AMD/ATI Stream implementation has reached the final production release 1.0 [17]. Changes from the Beta version exist, wherein all documentation featuring the Beta v4 release is now obsolete due to changes in the implementation. The newly supplied documentation covers the changes to the implementation, although there exist very few drastic alterations. To accompany the new release, AMD have released a series of tutorials based on the OpenCL standard. Provided in this set of examples are simple routines for a “Hello World” based application [15], an image convolution example comparing the implementation to that of a serially executed C program [5], as well as an N-body particle simulation demonstration [66]. Included with the AMD/ATI Stream SDK is a set of sample OpenCL programs, demonstrating the nature of the standard on the vendor-specific hardware, although any SSE2-enabled x86 processor should be capable of running the compiled routines. Documentation provided by AMD on the OpenCL standard mainly discusses the necessary setup conditions for compiling and running OpenCL applications. Currently there remain unresolved issues with the AMD/ATI OpenCL implementation, which are listed in the Developer’s Release Notes documentation [17]. Although several issues exist, they remain very specific, and hence can be avoided with careful planning. Documentation is also provided for performance and optimization of applications, demonstrating potential means of achieving the most efficient use of memory bandwidth, programming paradigms, and dealing with N-dimensional work-groups [13].

The NVidia-based OpenCL implementation effectively deals with OpenCL in comparison to writing an application for its proprietary CUDA platform. All OpenCL-based samples and methods tend to be geared solely towards developers with previous experience on the CUDA platform, wherein most material covered discusses the porting of applications, and the differences in programming models between the two APIs [51]. Provided by NVidia is a “Best Practices Guide” for the OpenCL standard, showing regions where bottlenecks may occur in an OpenCL application, and hence means for optimization on the hardware [49]. Formulae for the calculation of theoretical maximum data throughput and potential speed gains are provided, wherein, depending upon the set of values, an application may dynamically organise the most optimal means of carrying out a function. Memory management and comparisons of various implementations of a matrix multiplication method are also given [53], showing optimisation of the routine for the greatest speed gains. The mapping of OpenCL work units to areas of memory is given, with reference to the CUDA-based thread ideology [52]. As the hardware implementing the OpenCL application will be AMD-based, many aspects of the NVidia documentation are not applicable; however, many of the overviews and the kernel methods may be employed cross-platform.

Documentation provided by Apple offers a clear and concise introduction to the use of OpenCL on the Mac OS X platform [1]. The materials provided by the Apple Developer Connection break down and discuss the required elements for the design and implementation of an OpenCL application. Providing a subset of the material covered in the Khronos specification [32], the ADC document discusses qualitatively the use of each element, with examples cited for demonstration purposes. As an introductory text to OpenCL, this seems to offer the most reader-friendly material for gaining an understanding of the standard. Building upon the material covered for OpenCL on the Mac platform is the work carried out by Gohara at MacResearch.org [27]. During the series of podcasts, OpenCL is explained from its fundamental roots, moving towards the development of applications and dealing with memory issues. The theory of operation of the OpenCL standard is provided with examples where necessary, showing the process of parallelizing routines and the development of optimized kernels for efficient calculations [27, episodes 2–4]. A demonstration of the OpenCL standard versus a serially executed C program is provided, showing speed gains of up to two orders of magnitude [27, episode 3]. Again, as the majority of the system code is written specifically for the Apple Mac platform, many aspects will not be portable across vendors without some form of modification. The compute kernel, however, may be used across all vendor systems without modification [32].

2.2 Mathematical Methods: Monte-Carlo

The Monte Carlo family of methods revolves around the use of random sampling of values to determine the results to a specific problem. The most widely known use of such methods in the fields of science and engineering is in statistical methods, especially the computation of integrals [25, 59]. Many variants of Monte Carlo methods exist, each with varying degrees of convergence, and often with very specific purposes. Simple, or “direct”, Monte Carlo methods such as the “acceptor-rejecter” [11] and “sample-mean” [63] offer a relatively straightforward implementation, compared to that of the more complicated Markov-chain (MCMC) based methods [11, 22]. According to all sources consulted on the use of Monte Carlo methods, the main advantage offered is when dealing with multi-dimensional integrals, which may take a significant time to compute via standard numerical integration methods [22, 11]. In the reviewed material given by the Computational Physics Group at the Israel Institute of Technology [54], for a theoretically determined value for the computation of a particular partition function in 3-dimensional space using a standard numerical integration routine, the computer system would require 10^53 seconds, a number significantly greater than the current estimated age of the universe. Through the use of Monte Carlo sampling methods, the group demonstrate how this figure may be reduced drastically, wherein the expected value may be reached in a much shorter amount of time. As Monte Carlo methods deal with random sampling to achieve results, a degree of variance may exist between the end result and the true result [6]. The degree of variance of the obtained result with respect to the actual result can be, and is shown to be, generally determined relative to the number of samples taken [11, 54, 22]. The convergence of such routines depends largely on the chosen method implemented and the number of samples taken, with a greater number of samples providing a greater degree of accuracy in the overall result [11].

The lecture series on Monte Carlo simulations given at the University of Helsinki by Djurabekova [22] discusses in great detail the various aspects of Monte Carlo methods applied in the fields of thermodynamics and statistical physics, with reference to the more complicated Monte Carlo implementations. The use of Markov chain Monte Carlo methods, such as the Metropolis-Hastings method, can achieve results almost indistinguishable from the exact result with significantly fewer samples than a routine based on independent and identically distributed (i.i.d.) random variables [22, ch5]. Given the proven nature of the Monte Carlo classification of methods, the series also includes the determination of thermodynamic properties based on Monte Carlo methods [22, ch9]. Discussions are given of the use of Monte Carlo methods in determining the partition function of a thermodynamic system in 3-dimensional space. The Metropolis-Hastings algorithm is applied to the Ising model for the study of the ferromagnetic properties of a material, with discussions of the determination of various properties therein. The use of Markov-chain Monte Carlo methods is further discussed by Press et al (2007) [59], with C implementations given for an MCMC algorithm, offering a worked-through example of the use of such methods, and the requirements to effectively implement an accurate and efficient routine.

Due to the nature of Monte Carlo methods, they may inherently be implemented in a parallel manner. Various forms of parallelizing Monte Carlo schemes over a set of heterogeneous systems are discussed by Rosenthal [61], with a critical viewpoint given on each implementation. The use of i.i.d. random-variable-based methods versus Markov-chain Monte Carlo methods is discussed, including potential setbacks and performance issues that may arise therein. A significant amount of material exists on the parallelization of the random number generator to run on multi-core and multi-architecture systems. Generators such as the Mersenne Twister seem to be a popular choice for Monte Carlo based integration methods [58, 65]. The Mersenne Twister PRNG provides an extremely large generation period, with implementations existing for serial generation, SIMD-optimised, and parallel graphics-processor-optimised generators [41]. According to Matsumoto et al, the Mersenne Twister pseudorandom number generator was developed to provide a method for large-scale and large-period number generation to be utilised in statistical physics applications [42]. The developers cite the main area of application for this generator as being Monte Carlo based methods, given its speed and period versus other generators [42]. The use of the Mersenne Twister based approach does limit the implementation to an i.i.d.-based Monte Carlo method, which has been shown elsewhere to require a greater number of samples than an MCMC method [22], although in comparison with the Markov-chain methods the GPU-optimised routine and large period of the Mersenne Twister may offer significant advantages if optimised efficiently.

2.3 Physical Application: Magnetic Susceptibility

The works of Cregg et al cited herein deal primarily with various aspects of molecular magnetics, paying particular attention to the partition functions of the systems in each case. The cited papers are as follows:

• “Series expansions for the magnetisation of a solid superparamagnetic system of non-interacting particles with anisotropy” [8]

• “Partition functions of classical Heisenberg spin chains with arbitrary and different exchange” [9]

• “Low-field susceptibility of classical Heisenberg chains with arbitrary and different nearest-neighbour exchange” [10]

The articles listed represent much effort that has been carried out in the minimisation of partition functions [8, 9], and their subsequent use in the determination of various properties of molecular magnetic systems [10].

Dealing with a system of non-interacting anisotropic superparamagnetic particles, Cregg et al [8] show the means by which a reduction of the partition function to a more manageable form may be achieved. Originating with a typical double-integral partition function, and utilising previous work in the area, this is shown to reduce to a single-integral function. From this new form, the function is further reduced to an infinite series expression of commonly known terms. The merits of this new expression are discussed, with favour given to the fact that numerical integration is no longer required to evaluate a solution. However, the use of numerical integration of the single-integral form is cited as being a superior choice given that convergence of the series may require a significant number of terms. Cregg follows by stating that the use of the reduced forms of the partition function “offer an improvement on previous methods such as that outlined . . . which required sums of integrals” [8, §5].

Reduction of the partition function is again examined by Cregg et al [9], considering classical Heisenberg spin chains of molecular magnets. Applying a similar approach to that used for the reduction of the partition function for superparamagnetic materials [8], Cregg et al provide a means for the reduction of the partition function for chains of between 2 and 4 spins, as well as a general N-chain, in the case of the classical Heisenberg model [9]. Beginning with a typical 2N-dimensional partition function integral for spin chains utilising the classical Heisenberg model, a new expression is achieved by means of reducing these to sums and products of known functions [9, Section 3]. Comparing the results obtained from the newly derived equations to the research and experimental data of others showed agreement in all examined instances. Although the reduction of the partition function simplifies many aspects of the calculations, the use of computer algebra systems such as Maple may be required, alongside specific libraries, to achieve results in particular cases [9, Appendix A].

Following on from the work carried out on the reduction of the partition function of classical Heisenberg spin chains [9], Cregg et al apply the reduction schemes to the determination of the magnetic susceptibility, hence allowing arbitrary exchange parameters to be recovered from sets of data generated from the resulting equations [10]. Citing Luban et al, work in the area utilising the quantum Heisenberg model for the determination of arbitrary exchange parameters has been shown to be effective. Cregg provides a means of achieving this via the classical Heisenberg model, without the need to deal with diagonalising a large quantum Hamiltonian [10, §1]. Taking 2 and 3-spin chains as examples, the method outlined previously [9] is applied, wherein the partition function is expressed in terms of sums of known functions. Utilising this result, expressions for the magnetization and susceptibilities are determined. Following from the applied method, an expression for the susceptibility of an N-spin chain is presented. Utilising the resulting expressions for the susceptibility, Cregg et al proceed to simulate sets of data for classical spin chains. From the results, exchange parameters may be extracted effectively for chains of up to 4 spins [10]. 5-spin chains tended to require more information regarding the system to effectively yield a set of results, yet performed effectively when such information was given.


Page 20: Final Year Research Thesis

Chapter 3

Background Theory

3.1 Introduction

The aim of the project described herein is the evaluation of a problem in physics on a parallel processing device. The problem involves the solution of a high-dimensional integral, which shall be implemented to be solved on a graphics processing unit (GPU). The integral to be solved will be based upon that specified by Cregg et al for the partition function of a classical Heisenberg N-spin chain [9]. The overall goal of the methods utilised herein is to allow for the recovery of the exchange parameters between adjacent spins from experimental data sets of materials under test. Following on from the work of Cregg et al [10], experimentally determined data may be analysed, and using the developed program allow for the recovery of the exchange parameters, JC, for each nearest-neighbour pair in the spin chain. To achieve this, we must derive a means of calculating the material’s magnetization, and hence susceptibility, from the partition function [9].

In order to solve the equation via computational means, a specific technique must be

used which can handle high dimensional integration in a reasonable amount of time, as

with each new spin in the chain the dimension of the integral increases by two. Monte

Carlo simulation methods offer the inherent ability to handle high dimensional integration,

and are superior to numerical integration in such circumstances [23]. To

apply Monte Carlo simulation to the computation of an integral, a large set of random

values must be generated, which in turn requires the use of a pseudo-random generator.

A uniformly distributed generator with a large period is required for an effective

Monte Carlo simulation, and to make the most efficient use of computational resources a

parallelized implementation will be required to generate a significant quantity of numbers at any

given time, allowing the required total of generated values to be reached in a much shorter

time period compared to conventional sequential methods [61].


For an effective parallelised GPU implementation, all capable technologies were con-

sidered. The Open Computing Language (OpenCL) was chosen due to its cross-platform,

open-standard nature. OpenCL offers the inherent ability to allow for parallel

implementations of algorithms on heterogeneous computing platforms [32] with no tie

to any specific vendor. Modern computing systems feature processing units with varying

numbers of cores, each of which may carry out individual operations, such as Single Instruc-

tion Multiple Data (SIMD) or Multiple Instruction Multiple Data (MIMD) [26]. OpenCL

exploits all available cores by executing an implemented algorithm on each

available core at any instant, inherently supporting SIMD operations as standard. Given

that GPU devices feature many cores, the use of such a device for data-parallel operations

has been shown to offer significant acceleration in comparison to sequential CPU-based

methods via the OpenCL standard, and so it was chosen as the ideal technology to realise

the problem's solution [27].

3.2 Magnetic Susceptibility

3.2.1 Overview and Definitions

3.2.1.1 Paramagnetism

Paramagnetic behaviour is defined as a form of magnetism which a material exhibits in

response to an externally applied magnetic field [39]. This occurs in materials where there

exists one or more unpaired electron spins, and as a result yields a dipole moment within

the material. With the application of a magnetic field, these moments tend to align in the

direction of the applied field, the developed field being the sum of the individual

dipole moments. This occurs as a result of a partial alignment of magnetic moments

in the material in response to the torque exerted by the applied magnetic field [12].

In the absence of an external magnetic field, the dipole moments are randomly ordered

within the material, and hence the net magnetic field is zero [35]. The magnetization of

the material is a measure of the magnetic dipole moment per unit volume of the material,

defined as

M = χµ0H = Cµ0H/T    (3.1)


in accordance with Curie's Law, which states that the magnetization of a material is

approximately directly proportional to the applied field, H [45], and the Curie constant, C, is

defined as Nm²/3V kB, where N is the number of magnetic moments in the sample, V is the

volume of the material, m is the value of the magnetic moment with m = gµB√(s(s+1)),

and kB is the Boltzmann constant; for the purpose of this Chapter, T is temperature

(T is used for time elsewhere within this document). Materials of this nature are said

to have a positive magnetic susceptibility, χ > 0,

resulting from the fact that the alignment of the moments occurs in the same direction

to that of the applied field. The above equation however is a linear approximation to the

magnetization, which in the non-linear case is given by the more general equation

M = C1 (1/Z)(∂Z/∂H)    (3.2)

where Z represents the partition function of the material, H is the applied field, and C1

is a constant value. For the work on low field susceptibility Cregg et al give the above

equation with C1 = (Nm/V )(1/N), and the applied field, H, encompassed within the

parameter, ξ = µ0mH/kBT [10].

3.2.1.2 Susceptibility

The magnetic susceptibility of a material is a measure of the degree to which it becomes

magnetized in the presence of an external magnetic field [39]. In particular materials,

the application of an external field causes a normally non-magnetic material to acquire

magnetic behaviour (paramagnetism). The susceptibility of a paramagnetic material can

be determined from the following relationship

χ = µ0M/H    (3.3)

Here M is the magnetic dipole moment per unit volume of the material (magnetization),

and H is the magnetic field strength, both in units of A/m. This case, however, only holds

true for low levels of magnetization, as with the Curie law of Equation (3.1). A

more general equation for the magnetic susceptibility is given as the derivative of the

magnetization with respect to the magnetic field,

χ = C2 ∂M/∂H    (3.4)

Following the work of Cregg et al [10], the above equation is derived by expanding

Equation (3.2), the non-linear magnetization, to first order in ξ.

3.2.2 Classical Heisenberg Model

Although in accordance with condensed matter theory paramagnetism requires modeling

via quantum mechanics, it is possible to use classical statistical methods for this pur-

pose [12]. If we are to consider the system as a collection of N particles, each with a

magnetic moment, µ, subjected to an external field, H, in accordance with Kittel [39] we

can obtain a value for the potential energy of the system via the Hamiltonian

H = −µ0mH Σ_{i=1}^{N} cos θ_i    (3.5)

where µ0 is the permeability of free space, m is the classical magnetic moment, H is the

applied field, and θ_i is the angle between the i-th moment and the applied field. This provides a very

simple example of a paramagnetic statistical model of the system [55]. A more effective

description of the system may be provided by the Ising model, or by the classical Heisenberg

model. The classical Heisenberg formulation may be utilised to effectively model chains

of molecules in instances wherein the spin is sufficiently large and the temperature

does not approach absolute zero. Following the formulation provided by Cregg et al [9],

the Hamiltonian of the resulting N -dimensional spin chain is given by

HN−Spin = −µ0m−→H

N∑k=1

−→e k −N−1∑k=1

JCk,k+1

−→e k−→e k+1 (3.6)

where µ0 is the permeability of free space, k indicates the position of the classical spin

along the chain,−→H is the vector corresponding to the external magnetic field, −→e k is

the unit vector of each classical spin, JCK,K+1is the exchange parameter JCK,K+1

=

JK,K+1s(s+ 1) between neighbouring spins k and k + 1, JK,K+1 is the exchange constant


between nearest-neighbour spins, which in the case of paramagnetic materials satisfies

J_{k,k+1} > 0, the classical magnetic moment is m = gµB√(s(s+1)), g is the Landé

spectroscopic splitting factor, µB is the Bohr magneton, and s is the spin quantum

number. The first summation represents the Zeeman splitting energy term, and the second

represents the classical Heisenberg exchange energy.

The corresponding classical partition function of the system in accordance with Cregg

et al [9] is given by

Z_{N-Spin} = (1/(4π)^N) ∫_{Ω_1} … ∫_{Ω_N} exp(−H_{N-Spin}/kB T) Π_{l=1}^{N} sin θ_l dθ_l dφ_l    (3.7)

where T represents the absolute temperature in Kelvin and kB represents the Boltzmann

constant. Converting the Hamiltonian of Equation (3.6) to spherical polar coordinates

following Cregg et al yields it in the following form

H_{N-Spin} = −µ0mH Σ_{i=1}^{N} cos θ_i − Σ_{k=1}^{N−1} J_{Ck,k+1} (sin θ_k sin θ_{k+1} cos(φ_k − φ_{k+1}) + cos θ_k cos θ_{k+1})    (3.8)

where the symbols retain the meanings given for Equation (3.6): µ0 is the permeability of

free space, k indicates the position of the classical spin along the chain, J_{Ck,k+1} =

J_{k,k+1}s(s+1) is the exchange parameter between neighbouring spins k and k+1 (with

J_{k,k+1} > 0 for paramagnetic materials), m = gµB√(s(s+1)) is the classical magnetic

moment, g is the Landé spectroscopic splitting factor, µB is the Bohr magneton, and s is

the spin quantum number.

From the Hamiltonian we can, following again the work of Cregg et al [9], therefore

determine the partition function of the N -spin chain by integrating over both the polar,

θ, and azimuthal, φ, angles of each spin as follows

Z_{N-Spin} = (1/(4π)^N) ∫_{θ_1=0}^{π} ∫_{φ_1=0}^{2π} … ∫_{θ_N=0}^{π} ∫_{φ_N=0}^{2π} exp(−H_{N-Spin}/kB T) Π_{l=1}^{N} sin θ_l dθ_l dφ_l    (3.9)

The above function requires 2N integrations to evaluate fully, where N represents the number

of spins in the system. An analytical solution of this problem would require

a large amount of effort as the number of integral dimensions increases, and may even

become impossible in higher dimensions; a simulated solution may therefore offer a close

approximation to the desired result with significantly less effort.
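To make this concrete, consider the simplest case, N = 1, where the exchange term vanishes and Equation (3.9) reduces to the known closed form Z₁ = sinh(ξ)/ξ, with ξ = µ0mH/kBT. The following Python sketch is illustrative only (the thesis implementation targets OpenCL, and the function names here are my own); it estimates Z₁ by sample-mean Monte Carlo so the result can be checked against the closed form:

```python
import math
import random

def z_one_spin_mc(xi, n_samples=200_000, seed=42):
    """Sample-mean Monte Carlo estimate of the N = 1 partition function,
    Z_1 = (1/4pi) * Int exp(xi*cos(theta)) sin(theta) dtheta dphi,
    with theta in [0, pi] and phi in [0, 2pi]."""
    rng = random.Random(seed)  # Python's generator is a Mersenne Twister
    total = 0.0
    for _ in range(n_samples):
        theta = rng.uniform(0.0, math.pi)
        # phi does not appear in the integrand; sampling it mirrors Eq. (3.9)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        total += math.exp(xi * math.cos(theta)) * math.sin(theta)
    volume = math.pi * (2.0 * math.pi)  # product of (b - a) per dimension
    return volume * total / n_samples / (4.0 * math.pi)

xi = 1.5
estimate = z_one_spin_mc(xi)
exact = math.sinh(xi) / xi  # known closed form for a single spin
print(estimate, exact)
```

With 2 × 10⁵ samples the estimate typically agrees with sinh(ξ)/ξ to well under one percent, illustrating how a simulated solution can stand in for an analytical one.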

3.2.3 Calculation of the Susceptibility

To determine a value for the susceptibility of the N -Spin chain, we must initially determine

the magnetization. The magnetization of the N-spin chain is derived by Cregg et al [10]

from the partition function as

M = (Nm/V)(1/N)(1/Z)(∂Z/∂ξ)    (3.10)

where N is the number of moments, V is the volume, m = gµB√(s(s+1)) is the classical

magnetic moment, and ξ = µ0mH/kBT. Utilising this expression for the magnetization, the

susceptibility can thus be determined by further taking the derivative of the magnetization

with respect to the applied field, as given by Equation (3.4). This can be evaluated as

χ = C3 (∂M/∂ξ)    (3.11)

where C3 represents a constant of known values. Utilising this expression, we now have a

means to determine the susceptibility of an N-Spin classical Heisenberg chain in terms of

the partition function.
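As an illustrative check of this chain of relationships (again in Python, with function names of my own; this is not the thesis code), the N = 1 case can be verified numerically: there Z = sinh(ξ)/ξ in closed form, and the factor (1/Z)∂Z/∂ξ of Equation (3.10) should equal the Langevin function L(ξ) = coth ξ − 1/ξ.

```python
import math

def z1(xi):
    """Closed-form single-spin partition function, Z_1 = sinh(xi)/xi."""
    return math.sinh(xi) / xi

def reduced_magnetization(xi, h=1e-5):
    """(1/Z) dZ/dxi, evaluated as a central finite difference of ln Z;
    this is the per-moment factor appearing in Eq. (3.10)."""
    return (math.log(z1(xi + h)) - math.log(z1(xi - h))) / (2.0 * h)

def langevin(xi):
    """Langevin function L(xi) = coth(xi) - 1/xi, the known 1-spin result."""
    return 1.0 / math.tanh(xi) - 1.0 / xi

print(reduced_magnetization(2.0), langevin(2.0))
```

The finite-difference derivative and the closed-form Langevin function agree to high precision, confirming that differentiating the partition function recovers the magnetization.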

3.3 Monte Carlo Methods

3.3.1 Overview and Definitions

The Monte Carlo family of routines is a set of simulation methods which are often used

in the computation of high-order integrals. Two basic methods of Monte Carlo integration

exist: the “Hit-or-Miss” and the “Sample-Mean” methods [63]. The hit-or-miss

method is carried out by generating a series of random points within a rectangular region,

the width given by the upper and lower limits of integration, and the height by a bound

on the function between these limits. As random points are generated within the limits of the

rectangular space, those bounded by the curve are counted, while points outside the curve


do not contribute to the count. The number of points which are bounded by the curve

divided by the total number of generated points gives an estimate of the integral between

the specified limits, expressed as a fraction of the total area of the rectangle. In contrast

the sample mean method works via sampling the function at each generated random point,

and averaging all resulting values. Consider, for example, a one-dimensional definite

integral, I, as follows

I = ∫_a^b f(x) dx    (3.12)

where a, b represent the limits of the integration, f(x) represents the function to be inte-

grated. Applying Monte Carlo integration via the sample-mean method, we can approx-

imate the integral as

I ≈ (b− a) 〈f〉 (3.13)

where 〈f〉 represents the average of the function f(x) between the limits of integration a,b.

Determining the average of the function is carried out by the summation of all sampled

values of the function over the range, divided by the total number of samples taken. This

is expressed as

〈f〉 ≡ (1/n) Σ_{i=1}^{n} f(x_i)    (3.14)

where n is the number of samples taken, and xi represents a random value taken within

the limits of the integration specified in Equation (3.12). For an effective Monte Carlo

simulation, the set of random values supplied to the function, xi above, should be chosen

from a uniform random distribution in the range specified by the limits of integration a,

and b. The use of Monte Carlo integration techniques extends to the evaluation of higher

dimensional integrals. Taking as an example the d-dimensional integral

I_d = ∫_{a_1}^{b_1} … ∫_{a_d}^{b_d} f(x_1, …, x_d) dx_1 … dx_d = ∫_V f(x) dV    (3.15)

where x represents the vector (x_1, x_2, …, x_d), to be integrated over the hypercube of

volume V, given by the product of the differences between the upper and lower limits of

integration in each dimension. The integral can be evaluated by Monte Carlo simulation as


follows [59]

I_d ≈ V (1/n) Σ_{i=1}^{n} f(x_i) = V 〈f〉 ± Err    (3.16)

To obtain an estimate for the error in Monte Carlo integration, we must first determine

the variance of the method. The variance of sample-mean Monte Carlo is estimated via

σ_n² = (1/(n − 1)) (〈f(x_i)²〉 − 〈f(x_i)〉²)    (3.17)

in accordance with the formulation by Teukolsky et al [59, p. 398–400]. The error estimate

associated with Monte Carlo integration using n samples is given by

Err = V σ_n/√n    (3.18)

showing no dimensionality dependence [23]. Therefore, it can be stated that the error in

Monte Carlo integration does not depend upon the dimensionality of the integral, allowing

this method to be used for effectively evaluating high dimensional integrals. As such, it

can be seen that Monte Carlo simulation error associated with the sample-mean method

is of the order O(n−1/2) [38].
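The sample-mean estimator and its error estimate, Equations (3.13)–(3.18), can be sketched in a few lines. The Python below is illustrative only (function names are my own, and the standard Bessel-corrected sample variance is used for σ_n²); it integrates sin(x) over [0, π], whose exact value is 2:

```python
import math
import random

def mc_integrate(f, a, b, n, seed=1):
    """Sample-mean Monte Carlo: returns (estimate, error estimate), where
    I ~ (b - a)<f> and Err = (b - a)*sigma_n/sqrt(n), cf. Eqs. (3.13)-(3.18)."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        fx = f(rng.uniform(a, b))
        total += fx
        total_sq += fx * fx
    mean = total / n          # <f>, Eq. (3.14)
    mean_sq = total_sq / n    # <f^2>
    var = (mean_sq - mean * mean) * n / (n - 1)   # sample variance of f
    volume = b - a
    return volume * mean, volume * math.sqrt(var / n)

# Integral of sin(x) over [0, pi] is exactly 2:
estimate, err = mc_integrate(math.sin, 0.0, math.pi, 100_000)
print(estimate, err)
```

The reported error shrinks as 1/√n regardless of how many dimensions the integrand has, which is the property exploited throughout this chapter.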

3.3.2 Comparison to other methods

Although Monte Carlo simulation methods are widely used for evaluation of integrals, they

are not the sole means of calculating integrals via computational methods. Quadrature

routines such as numerical integration offer another approach to the evaluation of integrals,

via methods such as those of Simpson’s and Trapezoidal rules. Taking as an example the

Trapezoidal rule for a one-dimensional integral, I, given by

I ≈ (h/2) [ f(a) + 2 Σ_{k=1}^{n−1} f(a + kh) + f(b) ]    (3.19)

where h is the step size, given as h = (b − a)/n, a and b are the limits

of integration, and n is the number of subintervals taken. We can obtain values for the error

bounds and resulting errors of the above routine for a particular function, compared to


the resulting values for Monte Carlo simulation. For the single dimensional case, the error

bound associated with the Trapezoidal rule is known to be

|E_T| ≤ M(b − a)³/(12n²)    (3.20)

where M is a constant such that |f″(x)| ≤ M, a and b are the limits of the integration,

and n represents the number of sample points in the integration. The error associated

with the trapezoidal rule depends on the second derivative of the function being inte-

grated. The above equation has a known error of O(n−2), where the power 2 denotes the

convergence rate of the method, namely a doubling of n effectively reducing the error to

1/4 its previous value [4]. In a single dimension, the Trapezoidal rule may be utilised

effectively and will yield a significantly faster convergence than that of the known rate

for Monte Carlo simulations, which is of the order O(n−1/2). The so-called “Curse of

dimensionality” however limits the use of such numerical methods as the number of di-

mensions of integration increases. To achieve an error bound of similar order to that of

the one-dimensional integration case, the number of points required will need to increase

with the rising dimensional order.

The rate of convergence of the Trapezoidal rule changes with an increase in dimension

to O(n−2/d), where d represents the dimensional order of the integral [67]. By comparison

the rate of convergence of Monte Carlo integration, given by Equation (3.18), does not

contain a dimensional dependence and hence the overall error of Monte Carlo integration

does not vary with the increase in dimension. Table 3.1 shows a comparison between

Monte Carlo and the Trapezoidal rule from 1 to 10 dimensions. Although initially the

Trapezoidal numerical method does have superiority in the lower dimensional integrals, to

maintain a common error value as the dimensions increase will require a tenfold increase

in the number of points taken per extra dimension. Further examining data obtained by

Djurabekova comparing numerical integration via the midpoint method to that of Monte

Carlo integration [22, chap. 5] the same behaviour can be observed, with the numerical

integration routine losing out to the Monte Carlo simulations as the dimensional order

increases.


Table 3.1: Effect of dimension on error estimates for Trapezoidal and Monte Carlo
integration, showing the number of points needed to maintain an error bound of 0.01.

Dimension (d)   MC Error O(n^{-1/2})   #Values   Trap Error O(n^{-2/d})   #Values
 1              0.010                  10000     0.010                    10
 2              0.010                  10000     0.010                    100
 3              0.010                  10000     0.010                    1000
 4              0.010                  10000     0.010                    10000
 5              0.010                  10000     0.010                    100000
 6              0.010                  10000     0.010                    1.00E+06
 7              0.010                  10000     0.010                    1.00E+07
 8              0.010                  10000     0.010                    1.00E+08
 9              0.010                  10000     0.010                    1.00E+09
10              0.010                  10000     0.010                    1.00E+10
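The right-hand column of Table 3.1 follows directly from the convergence rates: an O(n^{−2/d}) method needs roughly n ≈ ε^{−d/2} points to reach error ε, while the O(n^{−1/2}) Monte Carlo error needs a fixed n ≈ ε^{−2}. A minimal sketch of this scaling (illustrative Python, function names my own):

```python
def points_needed_trapezoid(eps, d):
    """Points for an O(n^(-2/d)) method to reach error eps: n ~ eps^(-d/2)."""
    return round(eps ** (-d / 2.0))

def points_needed_mc(eps):
    """Points for the O(n^(-1/2)) Monte Carlo error to reach eps: n ~ eps^(-2)."""
    return round(eps ** -2)

eps = 0.01
for d in range(1, 11):
    print(d, points_needed_mc(eps), points_needed_trapezoid(eps, d))
```

The output reproduces the tenfold-per-dimension growth of the table's final column, while the Monte Carlo column stays fixed at 10000 points.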

3.3.3 Implementation Considerations

3.3.3.1 Random Number Generation

To make use of Monte Carlo integration the most fundamental requirement is a source

of random numbers. Real world random numbers, such as those measured by counting

radioactive decay, offer what we shall term “true random” numbers. For true random

numbers there is no way to predict the next number within the sequence [34]. Unfortunately,

for general computational work, the rate at which such numbers can be gathered may be

far too slow for an effective algorithm. We instead

must look to a computer for our source of random numbers, provided by some form

of number generation algorithm. Computers are generally deterministic machines, with

predictable outcomes and behaviour. As a consequence of this the generation of truly

random numbers by an algorithmic method is not possible. To overcome this, we must

instead emulate a form of randomness on a computer to provide our source of numbers,

which shall be termed “pseudorandom” numbers. Given an initial value (seed),

such algorithms can provide a set of numbers which appear to offer random behaviour.

However, such generators tend to have a finite number of generatable values before the

sequence will repeat itself, given by the period of the generator.

For a number generator to be of any use, and to appear random as possible we require

that the generator have a significantly large period before repetition may occur, and also


that such numbers are generated with a uniform distribution over the range specified.

Common generators, such as the Linear Congruential Generator (LCG), are

commonly implemented with periods of 2³² [59]. Although such periods may seem inherently

large, the greater the period of the generator, the less likelihood of the same number

appearing more than once, thus inherently improving overall accuracy of the method.

Considering Matsumoto et al's Mersenne Twister algorithm [42], the period of generation

is a significantly higher 2¹⁹⁹³⁷ − 1, with 623-dimensional equidistribution. According to

Matsumoto et al, the extremely large period of this generator effectively eliminates repetition

under normal operation, allowing a large number of unique values to be generated [42].

As a result of this, the Mersenne Twister pseudo-random generator offers itself as an ideal

candidate for use in Monte Carlo simulations.
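Python's standard random module happens to implement this same Mersenne Twister, so the properties relied upon here — seeded, reproducible, uniformly distributed streams — can be demonstrated directly (illustrative only; the thesis targets an OpenCL implementation):

```python
import random

# Two generators seeded with the same value produce identical streams:
gen_a = random.Random(1234)
gen_b = random.Random(1234)
stream_a = [gen_a.uniform(0.0, 1.0) for _ in range(5)]
stream_b = [gen_b.uniform(0.0, 1.0) for _ in range(5)]
print(stream_a == stream_b)

# Reseeding restarts the sequence from the same point:
gen_a.seed(1234)
print(gen_a.uniform(0.0, 1.0) == stream_a[0])
```

Reproducibility from a seed is what makes a pseudo-random Monte Carlo run repeatable and debuggable, while the enormous period keeps the stream free of repetition in practice.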

3.3.3.2 Parallel Versus Serial Computation

Conventional single-core computing architectures are based upon a serialised data input

and output. In the simplest sense, one instruction must be completed before the next

may begin. If working on a large data set, a significant amount of time may be re-

quired for all necessary processing to be completed. An improvement upon this design

is that of multithreading, wherein multiple instances of a program execution may be oc-

curring almost simultaneously. The underlying system, however, is continually switching

the running thread context to effectively appear as though multiple instructions are be-

ing carried out simultaneously. Flynn’s Taxonomy [21] refers to the above single-core

design as Single-Instruction Single-Data (SISD), commonly known as the serial von Neu-

mann architecture. Modern multicore CPU designs are classified as multiple-instruction

multiple-data, wherein each processing core may carry out unique operations on differing

sets of data, inherently achieving a form of parallelism. In comparison to this, modern

GPU designs offer what is known as single-instruction multiple data (SIMD). The device

may effectively apply an operation to a data set, with each element being operated on in

parallel [26]. Contrasting the design of SISD with that of SIMD and MIMD designs, we

can see that it may be beneficial to carry out operations in a parallel manner, as it allows

a much higher throughput of operations across data sets.

Examining the differences achieved by Gohara using a standard sequential C program versus

a parallelised version [27], we can see that the parallel version offers almost two

orders of magnitude of speedup for the same overall result. As Monte Carlo integration

relies largely upon summation and number generation, to achieve an effective

speedup in the overall procedure these areas must be parallelised efficiently. The theo-

retical speedup, S, that can be achieved through parallelising a particular routine over a

sequential version is given by Amdahl’s law [47]

S ≤ 1/(q/p + (1 − q))    (3.21)

where q represents the fraction of operations which can be parallelised, p represents the

number of available processors, and (1 − q) is the fraction of operations which remain

serial. Although no overheads are taken into account, the above equation allows a means

of determining a theoretical limit on the maximum possible speedup such parallelised

routines may achieve. Given the inherently parallel nature of the Monte Carlo routine,

an effective speedup should theoretically be possible utilising a multi-core system, such

as MIMD multi-core CPUs, or accelerator devices such as SIMD GPUs.
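Equation (3.21) is easily evaluated for representative cases. The sketch below (illustrative Python) shows how, for a routine that is 95% parallelisable, the speedup saturates at 1/(1 − q) = 20× no matter how many processors are added:

```python
def amdahl_speedup(q, p):
    """Theoretical speedup bound of Eq. (3.21): S <= 1/(q/p + (1 - q)),
    for a parallelisable fraction q running on p processors."""
    return 1.0 / (q / p + (1.0 - q))

# With 95% of the work parallelisable, speedup saturates at 1/0.05 = 20x:
for p in (1, 4, 16, 64, 1_000_000):
    print(p, round(amdahl_speedup(0.95, p), 2))
```

This is why the serial fraction — host-side setup and result collection in the present case — must be kept as small as possible.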

3.3.3.3 Machine Epsilon Consideration

To determine the range and hence granularity of the integration technique utilised herein,

we must know the smallest representable difference between values in our single-precision

floating point number scheme, namely the machine epsilon, ε. Over the range of our

integration, each value generated can be distinct from the next logical value by epsilon,

given as

xn+1 = xn + ε (3.22)

where x_{n+1}, x_n represent two distinct adjacent numbers. Following from this, it can thus be

stated that there will be a finite count of numbers over our integration range which remain

distinct, N_D. As the quantity of numbers generated approaches this value, the probability

of generating previously generated numbers will increase as

P(x) = n(x)/N(S)    (3.23)


where x is the event linked to generation of a given number of unique random values,

n(x) is the number of values generated, N(S) is the number of possible different outcomes

capable of being generated. Taking the number of possible values which may be generated

as the reciprocal of the machine epsilon value, and the number of points to be generated

as 2²⁴, or approximately 16.7 million values, we may determine a likely value for the

probability of obtaining a random value which has already been used. Since the single-precision

floating point standard features a machine epsilon value of 2⁻²⁴, we can thus say that as the number

of values generated approaches the 16.7 million required, the probability of repetition will

increase towards 1. To minimise this occurring, even in the case of a

large-period number generator (Mersenne Twister), it will be essential

to keep the number of generated values less than the aforementioned theoretical

maximum. This maximum is determined via the following means

N_D = (b − a)/ε    (3.24)

where a,b are the integration range values, ε is the machine epsilon value representing

the minimum distinct number difference. We can see from this that the relationship between

the number of distinct values and the machine epsilon is inverse: to increase the

number of possible values, we must either expand the range of integration, or decrease

the machine epsilon value. Increasing the range of integration will effectively change the

properties of the function, and hence is not a desirable approach. Decreasing

the epsilon value can be accomplished through the use of double or quadruple precision

floating point representation. Making use of the machine epsilon value associated with

double precision, the probability of such issues decreases to approximately 2 × 10⁻⁹.

Given, however, that Monte Carlo simulations offer the best possible use in higher

dimensional integrals [59, pp. 398–399], we can effectively ignore this issue when the

dimension is large. As shown in Equation (3.18), the order of the error in Monte Carlo

methods does not depend upon the dimension of the integral. As a result, in

higher dimensional integrals the number of points generated can be divided evenly between

the integration dimensions, with the number of generated points in each case significantly less

than the saturated limit of generatable points. Maintaining the number of values much


less than the 2²⁴ unique random number limit will allow a total number of values to be

generated which is greater than the imposed limit, without affecting the overall accuracy

of the result. If we consider a 20-dimensional integral (a 10-spin chain), and we require

1 × 10⁸ values for the required accuracy, then the number of values to be generated per

dimension will be

η = n/(2N)    (3.25)

where η represents the number of values per dimension, N represents the spin chain

length, and n represents the total number of points to be generated.
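These quantities can be computed directly. The sketch below (illustrative Python, using the single-precision epsilon figure of 2⁻²⁴ quoted above) evaluates N_D for a unit integration range and η for the 10-spin example:

```python
import sys

EPS_SINGLE = 2.0 ** -24               # single-precision epsilon figure used in the text
EPS_DOUBLE = sys.float_info.epsilon   # 2**-52 for IEEE 754 double precision

def distinct_values(a, b, eps):
    """N_D = (b - a)/eps, Eq. (3.24): distinct sample points over [a, b]."""
    return (b - a) / eps

def values_per_dimension(n_total, n_spins):
    """eta = n/(2N), Eq. (3.25): samples per dimension for an N-spin chain."""
    return n_total / (2 * n_spins)

print(distinct_values(0.0, 1.0, EPS_SINGLE))    # 2**24 = 16777216 points
print(values_per_dimension(1e8, 10))            # 5e6 samples per dimension
```

With 5 × 10⁶ samples per dimension against a ceiling of roughly 1.7 × 10⁷ distinct values, each dimension stays well below the saturation limit discussed above.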

3.4 OpenCL

3.4.1 Overview and Definitions

With the advent of general purpose graphics processing unit (GPGPU) computation in

recent years, the use of graphics processors in scientific applications has seen widespread

acceptance, such as Stanford’s distributed computing program Folding@Home [7], which

now supports GPU devices, in conjunction with conventional CPU’s. As GPU devices

are inherently a large collection of processor cores, designed for processing and accelera-

tion of image-related data, their use in scientific applications as data-parallel processors

offers a significant advantage to conventional CPU-based computation in terms of data

throughput and overall computational efficiency [37]. Although each vendor had its own

implementation for data-parallelism on its devices, such as ATI’s CAL or NVidia’s CUDA,

the code for each application was vendor specific. However, an open standard for achiev-

ing parallelism on heterogeneous computing system was developed by Khronos, the group

behind such open standards as OpenGL [31]. The Open Computing Language, OpenCL,

was designed to be an open standard for heterogeneous system programming, allowing

cross compatibility for code between all OpenCL supported devices. Not limited solely to

GPU computing, OpenCL allows for multicore CPU devices, DSP’s and Cell broadband

engines to be utilised and treated as peers in the context of a system. The OpenCL

specification allows for full exploitation of these multi-core systems with no tie to any

particular vendor, ensuring support across all platforms supported [32].


According to the Khronos OpenCL specification [32] in OpenCL terminology CPU’s

and GPU’s are each viewed as Compute Devices, namely processors which can execute

some form of parallelised program. A CPU device consisting of four cores may be viewed

globally as a single compute device with four individual Compute Units, within each

of which the actual processing work is carried out [27, Cast 2]. CPUs and GPUs may

be grouped together into what is commonly known as a Device Group, which can be a

mix of any form of compute devices. This is what essentially allows for the development

of a heterogeneous computing system. A group of devices is contained within a Host,

which is generally the workstation system housing and supporting the devices. The host

may utilise multiple device groups, such as groups of multiple CPU devices, multiple

GPU devices, or a mixture of either. At runtime of an OpenCL program, the host may

determine which device will offer the most efficient means to execute a particular OpenCL

operation, known as a Kernel.

3.4.1.1 Execution Model

The execution of an OpenCL program consists of two separate parts, that of the kernel

executing on one or more OpenCL devices, and the code executing on the host machine

which manages the device kernel execution. An OpenCL kernel can be thought of as

analogous to a C function. Each instance of a kernel on a compute device is termed

a Work-Item, and is identified by a value in an index space defined when a kernel is sub-

mitted by a host for execution. The index value (globalID) allows this work-item to be

distinguished within the index space. Work-items may be grouped into so-called Work-

Groups, wherein the index space now offers a coarser granularity. Work-groups each

feature an individual work-group ID; any work-item may then be accessed via this

combined with a local identifier value within the work-group, in addition to the global ID value.

Figure 3.1 shows the division of the index space (NDRange) into work-groups, which are

further subdivided into work-items, for a 2-dimensional global problem.

Utilizing this execution model, according to the Khronos specification, a wide range of

Utilizing this execution model, according to the Khronos specification a wide range of

programming models may be mapped onto the design, such as that of data-parallel and

task-parallel programming models. As we intend to utilise a GPU for SIMD operation,

the programming model being utilised will thus be data-parallel.
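The index arithmetic of this model (per axis, global ID = w·S + s for work-group ID w, work-group size S, and local ID s, as in Figure 3.1) can be sketched outside OpenCL; the Python below is illustrative only and is not OpenCL API code:

```python
def global_id(w, S, s):
    """Global index along one axis of the NDRange: g = w*S + s,
    for work-group ID w, work-group size S, and local ID s (cf. Figure 3.1)."""
    return w * S + s

def decompose(g, S):
    """Recover (work-group ID, local ID) from a global ID along one axis."""
    return divmod(g, S)

# A 1-D NDRange of 16 work-items split into 4 work-groups of size 4:
S = 4
ids = [global_id(w, S, s) for w in range(4) for s in range(S)]
print(ids)              # every work-item gets a unique global ID, 0..15
print(decompose(13, S)) # (3, 1): work-group 3, local ID 1
```

In a data-parallel kernel, this unique global ID is what selects which element of the data set each work-item operates on.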


[Figure 3.1: OpenCL Program Index Space [32]. The NDRange of size (Gx, Gy) is divided
into work-groups of size (Sx, Sy); the work-item with local ID (sx, sy) in work-group
(wx, wy) has global position (wxSx + sx, wySy + sy).]

Memory Model The OpenCL memory model is defined as a multi-level design, wherein

each memory type has specific access-rights and properties [20]. The memory can be

divided into four categories:

• Global: Available for access by all work-items and groups. Both host and kernel

are allowed read/write access.

• Constant: A region of global memory which remains constant during execution of

a kernel. The host allocates and initializes memory objects.

• Local: Memory available to a specific workgroup only. Data can be shared between

work items in the workgroup.

• Private: This is memory available to a single work item. Any variables defined

herein are not visible to other work items.

Access rights for each memory region are illustrated in Figure 3.2.

Host and device memory are independent of one another in general operation, as

the host machine is defined outside of the OpenCL standard. The memories may interact


Figure 3.2: OpenCL Memory Model from ATI OpenCL Intro(source: [20])

however, essentially during copies between device and host memory [32][Section 3.3]. Such copies may be time consuming, depending on the number of bytes moved and the bus bandwidth available; to avoid unnecessary overhead, data transfer between device and host should therefore be kept to a minimum [52].

3.4.1.2 Hardware Considerations

For an effective understanding of the various locations wherein the OpenCL runtime may

encounter performance bottlenecks it is essential to have a thorough understanding of the

overall system’s performance. For the system used herein, the graphics processing unit is

that of an ATI Radeon HD5850, and hence all calculations will be carried out utilising

values associated with this device. To calculate the theoretical memory bandwidth of the

device we apply the following formula, adapted from the formulation given by NVidia’s

Best Practice Guide [49]

Bandwidth = fMem × (Bus/8) × D (3.26)

where fMem is the frequency of the memory clock, Bus is the width of the data-bus, D is

a constant (2 if device memory type is GDDR3, 4 if GDDR5). Supplying the associated

values provided within ATI’s documentation [14] allow us to calculate a maximum value

as 128 GB/s, which is in accordance with the value provided by the ATI documentation.


Next, to determine the overall performance of the device we calculate the number of

floating-point operations per second (FLOPS) which is again adapted from the formula-

tion provided by NVidia [49] as

FLOPS = NStream × fStream × k (3.27)

where NStream is the number of stream processors available on the device, fStream is

the clock frequency of the stream processors, k is the number of instructions per cycle.

Supplying the values given by the ATI documentation yields a theoretical maximum of

approximately 2 TeraFLOPS, or 2 × 10¹² FLOPS, which is a considerable figure in comparison to modern CPU values in the range of approximately 10 GigaFLOPS (1 × 10¹⁰) given by the MaxxPI benchmark suite [44].

Considering transfers between device and host, the maximum bandwidth available will

be that of the PCI-Express x16 v2.0 standard, given as 16 GB/s by the PCI-Sig Group

PCI-Express v2.0 specification [56]. In comparison, the maximum available bandwidth

between a device and its on-card memory, for the ATI Radeon HD5850, is 128 GB/s,

which is a factor of 8 higher. Thus, use of private, local and global device memory

will offer a much higher throughput than reads and writes to host memory, and hence

such operations should be avoided where possible.

3.4.1.3 Comparison to NVidia CUDA and ATI Stream

Although OpenCL is the first open specification for heterogeneous parallel programming, CUDA, which is specific to NVidia hardware technologies, remains a popular choice to

achieve GPGPU computation. NVidia’s CUDA architecture technology offers a C-like

language for achieving data-parallelism, and offers a large back catalogue of supported

vendor-specific hardware [50]. Similarly, ATI’s Stream technology was initially designed

to allow for GPGPU operations to be carried out solely on ATI GPU technologies. In

contrast, OpenCL is an open, royalty-free standard, with implementations currently available

for Apple, AMD/ATI, NVidia, and IBM Cell hardware platforms. A host system may

effectively communicate with and utilise the resources provided by a variety of OpenCL

supported hardware devices, such as a system consisting of an ATI GPU, an NVidia GPU,


any x86 supported CPU, and a Cell broadband engine accelerator card [29]. In contrast,

CUDA and Stream-based programs may be executed solely on supported vendor-specific

hardware devices, with no support for CPU or accelerator devices.

With the release of the ATI Stream SDK 2.0, the OpenCL 1.0 standard has been

implemented, allowing full support for both GPU and CPU devices under the SDK [16].

NVidia’s OpenCL implementation, built upon the CUDA architecture, offers no current

support for CPU OpenCL utilization. As OpenCL remains an emerging standard while CUDA and Stream are mature technologies, an effective comparison is difficult: few published benchmarks compare equivalent operations under OpenCL and either architecture.

3.4.1.4 Overall Program Structure

For the successful execution of an OpenCL program, the following set of steps must be

undertaken in the stated order:

• Initialization - Selection of devices and the creation of a context in which the cal-

culation runs.

• Allocation of resources - Setup of memory to use for sending and receiving data

between host and device.

• Creation of program/kernels - Compilation of programs and kernels from source files

(or loading of binaries).

• Execution of program - Kernel arguments are passed and kernel is executed upon

each work item.

• Tear down - Results are read back to the host and previously allocated resources

are freed.

During the initialization phase of the OpenCL program, we create a context within which

our device (or devices) may operate. This is achieved through querying the list of available

OpenCL devices within the system, and creating the context with reference to a particular

device ID within the list. When all the required setup has been achieved, a queue is


specified for commands to be submitted to the device from the host system. The queue is

created in reference to a particular context and device and allows operations to be carried

out upon OpenCL memory objects, kernels and programs [32]. With the command queue

specified, we may now allocate memory for data input and output to and from the kernels

to be executed.

Next, the kernels and programs will be created. This can be achieved via loading

of pre-compiled binary files, or directly from source. Upon creation of the program, it

may be built for all devices within the specific context. Each individual kernel to be

executed must be given an identifier, to allow the host to communicate with it.

Using the kernel identifier, all required arguments are passed from the host system to the

device kernel, and the kernel may now operate on the supplied data. Upon completion

of the OpenCL program, all data is read back from the device to the host system, and

all previously utilised resources are now freed. The host program may now finish its

execution.
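The five phases above map onto the OpenCL 1.0 host API roughly as follows. This is an abbreviated, C-style sketch only (arguments and error handling are elided with "...", so it is not compilable as given, and it is not the thesis code itself):

```c
/* 1. Initialization: query the available devices, create a context and queue. */
clGetPlatformIDs(...);                    /* enumerate platforms              */
clGetDeviceIDs(...);                      /* select a device ID from the list */
context = clCreateContext(..., &device, ..., &err);
queue   = clCreateCommandQueue(context, device, 0, &err);

/* 2. Allocation of resources: buffers for host<->device data transfer. */
buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, nbytes, NULL, &err);

/* 3. Creation of program/kernels: compile from source (or load binaries). */
program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
clBuildProgram(program, ...);             /* build for devices in the context */
kernel  = clCreateKernel(program, "kernelName", &err);

/* 4. Execution: pass arguments, then launch over the index space. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                       0, NULL, NULL);

/* 5. Tear down: read results back, then free all allocated resources. */
clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, nbytes, hostPtr, 0, NULL, NULL);
clReleaseMemObject(buffer);  clReleaseKernel(kernel);
clReleaseProgram(program);   clReleaseCommandQueue(queue);
clReleaseContext(context);
```

All of the calls named here are part of the OpenCL 1.0 API; the variable names are illustrative.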


Chapter 4

Implementation

4.1 Platform Details

4.1.1 System Hardware and Software

For an effective implementation of the OpenCL Monte-Carlo routine, the system must

be comprised of hardware supported by the Stream SDK Version 2.01 [18]. Currently

native hardware support is provided for the ATI Radeon HD 5000 series GPUs, and for x86 architecture CPU devices featuring SSE2 support. The system chosen comprises a triple-core AMD CPU (Phenom II X3 720BE) with each core clocked at 2.8 × 10⁹ Hz, and a corresponding high-performance GPU (Radeon HD 5850 1GB) operating at 0.775 × 10⁹ Hz. In selecting the remaining system components, the following considerations had to be met:

• Must support both CPU and GPU devices chosen.

• All data transfers between devices on the system must be the maximum allowed

value for an effective comparison.

• All components are to remain at their stock configurations where applicable - no

overclocking1 of components for improved performance.

For the first consideration, the only supported mainboard family was the AMD 700 series. To allow for the maximum data transfer between the CPU and RAM, 4 GB of high-performance DDR3 memory clocked at 1.333 × 10⁹ Hz was selected, allowing for peak transfers of 1.0667 × 10¹⁰ B/s (approximately 10.7 GB/s). To minimize the latency of CPU-to-GPU transfers, a PCI-Express v2.0 x16 bus lane for the CPU-to-GPU

¹The increase of a component's clock speed and/or voltages to allow it to offer performance beyond the manufacturer's specifications.


connection was required, achieving a peak transfer of 1.6 × 10¹⁰ B/s (16 GB/s). With the above

criteria fulfilled the system was fully assembled, including all non calculation-essential

components.

The use of a Linux-based operating system was decided upon for the inherent support

of the GNU compiler collection (one of few compiler suites supported by the Stream

SDK), its support of tools such as GNUplot for data visualization, and the fact that no

heavyweight integrated development environment (IDE) is required to fully exploit the

OpenCL implementation chosen. Ubuntu 9.10 64-bit was chosen as the base system for

the implementation from the list of available supported distributions on the Stream SDK

website. The system was configured as standard, with the exception of the removal of

page attribute table (PAT) option at kernel runtime.

Upon successful installation of the Ubuntu system the latest version of the Stream

SDK and corresponding display driver were acquired from the AMD/ATI website [18].

Following the procedure for successful installation of the display driver and SDK, suitable

packages were acquired for the Ubuntu system for application development, such as the full

suite of GNU Compiler Collection tools, a lightweight code IDE (Geany), and GNUplot.

The system was therefore fully equipped to develop using the Stream SDK for OpenCL

applications.

4.1.2 OpenCL System Information

With the system fully setup it was possible to examine the properties and available re-

sources which may be utilised by the OpenCL runtime during an application’s execution.

Using the predefined OpenCL functions for querying system details, the following output2

was returned for the GPU and the CPU device.

-----------------------------
ATI Radeon HD5850 GPU
-----------------------------
Local Mem Type (Local=1, Global=2) : 1
Local Mem Size (KB)                : 32
Global Mem Size (MB)               : 256
Global Mem Cache Size (Bytes)      : 0
Max Mem Alloc Size (MB)            : 256
Max Work Group Size                : 256
-----------------------------
AMD Phenom II X3 720BE CPU
-----------------------------
Local Mem Type (Local=1, Global=2) : 2
Global Mem Size (MB)               : -1024
Global Mem Cache Size (Bytes)      : 64
Max Mem Alloc Size (MB)            : 1024
Max Work Group Size                : 1024

²The above represents a subset of the information returned. For the full system details please refer to Appendix A.1.

The queried properties define all available resources of the system which may be used.

From the above list it is possible to determine any possible limitations which may occur

within the system context. Analyzing the above results gives the maximum available

memory which may be utilized by the program at any given time. In the case of the GPU this value is limited to 256 MB, out of the 1024 MB of internal GPU memory in total [14]. This limits the amount of data which may be stored in device memory at any given time. Compared to the value reported by the CPU of -1024 MB (where the negative value is believed to be a reporting error), the GPU is at a disadvantage in terms of available storage. Since one of the main requirements of the system is to generate and make use of random numbers, it is beneficial to have a large pool of random values available. Therefore, for effective operation of the program, it was decided to fill this 256 MB space in its entirety with random values, comprising two separate arrays corresponding to the angles θ and φ required for the routine.

The data listed above shows a maximum of 32 KB of available local memory for the

GPU. Within the OpenCL specification, this region of memory is used as a scratchpad

for temporary data storage and manipulation [32]. Data transfer within local memory

for the GPU device can be as high as 850 GB/s, as taken from the Stream SDK perfor-

mance tests, compared to the maximum theoretical bandwidth offered between device and

global memory of 128 GB/s. Effective utilization of such space may allow for increased


data throughput where necessary by maintaining minimal transfers to global memory.

Examination of the CPU listing shows a local memory type marked "global", meaning that local memory is emulated in, and mapped onto, global memory. It follows that any performance benefits achieved through the use of local memory for calculations on the GPU will not be seen when using the CPU device.

Details for the workgroup size on each device correspond to the maximum available

workitems, or threads of execution, that may execute within a single workgroup at any

specific moment. As the maximum available number of workitems which may execute

on the GPU is capped to a value of 256, it was decided to limit the number of elements

in each workgroup to this value for both CPU and GPU. The product of the workitems in each of the 3 available dimensions must also not exceed this maximum value.

To enable this to be easily calculated, a one-dimensional index space was considered, with the work-group size set equal to the imposed maximum. No advantage is gained by using a 1, 2 or 3 dimensional index space at runtime, as the dimensionality serves solely to simplify indexing where necessary [19].

4.1.3 Program Structure and Design

For the most efficient solution to the proposed physical problem, the majority of the work-

load should be carried out on the OpenCL device to fully exploit the available multiple

cores, be that on the CPU or GPU. Figure 4.2 shows the system operating sequence. The parameters H, T, N, n represent the magnetic field, absolute temperature, spin-chain length, and number of random values respectively. Figure 4.1 shows the various states of operation the system may be in during execution, giving an abstracted view of the implemented algorithms. Throughout the development process of the OpenCL

program every attempt was made to follow the specified operational sequence and state

diagrams given. The final result of the iterative development cycle adhered to the dia-

grams provided and operated as expected.


[Figure 4.1 depicts the system as three statecharts: the Host Machine (setup host → initialise OpenCL device → run OpenCL kernels → evaluate integral → release OpenCL resources → release host resources); Kernel 1 Execution (initialise generator → load seeds → generate values, with the generator reinitialised and seeds updated until generation is finished); and Kernel 2 Execution (load values into local memory → calculate samples of the function → parallel reduction of workitems).]

Figure 4.1: Statechart of proposed system implementation


[Figure 4.2 depicts the message sequence between End User, Host Machine and OpenCL Device: the user defines H, T, N, n; the host sets itself up, initialises the calculation parameters and the OpenCL device, and transfers n seeds to reserved memory (Kernel 1 generates values and stores them in global memory); the host then transfers H, T and N to the device (Kernel 2 transfers data from global to local memory, performs the MC sampling, generates a single statistic per workgroup (psum), and transfers the results to global memory); finally the host requests the reduced values, finishes the evaluation, frees the OpenCL and host memory buffers, and reports the result.]

Figure 4.2: Sequence diagram of proposed system implementation


4.2 Evolution of OpenCL Implementations

To successfully achieve the operation of the program, various versions and methods of

implementation were tested, with each iteration stage corresponding to a milestone in the

overall program design. Through each iteration various issues were encountered, which

often required several refinements and modifications to achieve the desired results. All

errors encountered were debugged through use of the commandline GNU debugger (gdb)

utility. All compilation was carried out through the use of the GNU C++ compiler (g++),

makefiles, and the compilation macros provided by the Stream SDK.

The remainder of this section outlines the various development versions and in some cases their variants. Recall that, in order to distinguish between mathematical identities and

identifiers that are used in the code, identifiers are typeset in blue.

4.2.1 Version I — Uniform Distribution

For the initial testing of the system, a simple host-to-single kernel setup was employed.

Using a slightly modified SIMD Mersenne Twister kernel [18] provided with the Stream

SDK, which shall hereafter be referred to as kernelGen, a host program was devised

to send the required number of seeds to the kernel, and read back the resulting output.

The output was modified from a Gaussian to a uniform distribution over the range of zero

to one. The resulting cumulative distribution function of the output from kernelGen is

given as

F(x) = 0 for x < a;  (x − a)/(b − a) for a ≤ x < b;  1 for x ≥ b   (4.1)

where a and b are the limits of integration, in the range [0..1]. For every value received

back to the host program, a serial summation was carried out. The resulting output was

then divided by the total number of values to provide an estimate of the area over the

uniform distribution. Using 2²⁰ random numbers, the result varied about 0.5, as expected for the uniform distribution. A graphical representation of the integration performed can be

seen in Figure 5.1.1.


4.2.2 Version II — Multidimensional Integration of a known

function

Building upon the program developed for Version I, a multidimensional function of known

analytical solution was chosen. The following integral was programmed in the same manner as that previously used for the uniform distribution

I₁₆ = ∫_Ω ∏_{i=1}^{16} e^{xᵢ} cos(xᵢ) dx₁ · · · dx₁₆   (4.2)

where Ω represents the hypercube volume over the which the integral is to be calculated.

The resulting values of several runs of the procedure were tabulated and compared against

an analytical result from Maple. Several enhancements were applied to the program

at this stage. The number of random numbers being generated was further increased

through modification of the host program and kernel code. Previously a large number

of seeds were transmitted, with a low number of generated values for each. The current

implementation generated as many random numbers as specified, upto to the maximum

number of supported values on the hardware (226 4-byte floats for the GPU). The need

to transmit a large number of seeds was also reduced to a fixed value, which could be

defined by the user, if necessary. This reduced the amount of data transfer necessary over

the PCI-Express bus between the CPU and GPU.

4.2.3 Version III — Integral Evaluation on Compute Device

For this iteration, the majority of the workload was carried out solely upon the GPU

device. A parallel summation routine was employed to carry out workgroup reduction,

namely reducing all workitems within a chosen workgroup to a single value. This was

achieved through modification of the Stream SDK block reduction algorithm³, which originally accepted small-scale arrays of integers. The modified version, hereafter referred to as kernelSample, accepts a 2²⁴ element array of vector type float4, corresponding to 4 float data types treated as a single 16-byte data type with individually accessible elements, aligned to a 16-byte memory boundary. As we are now dealing with vector

3See appendix II - “calcKernel”


types, each element of the array corresponds to 4 individual float values, therefore giving

a total of 2²⁶ float results. The values were supplied via kernelGen and stored in device global memory. A pointer to the memory location, and the resulting size, were then supplied to kernelSample for evaluation. For each random value x read by kernelSample, a sample was taken of the function

g(x) = eˣ cos(x)   (4.3)

The samples were then summed and the resulting

reduced values were returned to the host for finalization of the calculation. For each

workgroup within the global range of values, a single float4 value was returned. Each

float4 value on GPU memory could therefore be read as 4 subsequent float values by the

host device, yielding a total of 4×NDRange/localSize values to perform serial summation

upon, where NDRange is the global size of the problem domain, and localSize is the

maximum number of workitems per workgroup. As the range of values was again between

zero and one, the integral was evaluated analytically in Maple as the following

I = ∫₀¹ g(x) dx   (4.4)

Upon verification of the result, a series of timing routines were implemented on the host

code to evaluate the execution time per kernel, and the overall program time to comple-

tion. Using a buffer for OpenCL events [32], each kernel execution was allowed to add an

entry to said buffer. Each buffer entry maintained information regarding execution times,

given in nanosecond precision for the devices chosen [13]. The above integral was therefore run for a series of different sample sizes, scaling from 2²⁰ to 2²⁶, with the resulting timings noted. For comparison the OpenCL kernels were also executed upon the CPU device, with the timings again noted. A graph was plotted of the resulting data

comparing the GPU to CPU for the calculation timings.


4.2.4 Version IV — Integral of Partition Function

Over the development of this version many different approaches were tested, with the most

effective implementation chosen as the final version. Issues specific to each implementation

were encountered, with each subsequent version proceeding to overcome these issues.

4.2.4.1 Subversion 1 — Single Memory Block

With the routine in Version III verified to be operational, the OpenCL kernel was modified

to allow for the calculation of a solution to the partition function in the N = 2 case, the

resulting equation of which is given by

Z₂ = 1/(16π²) ∫₀^{2π} ∫₀^{π} ∫₀^{2π} ∫₀^{π} exp( ξ(cos θ₁ + cos θ₂) + K₁,₂(sin θ₁ sin θ₂ cos(φ₁ − φ₂) + cos θ₁ cos θ₂) ) sin θ₁ sin θ₂ dθ₁ dφ₁ dθ₂ dφ₂   (4.5)

where ξ = µ₀mH/kBT and K₁,₂ = JC₁,₂/kBT , and the prefactor 1/(16π²) = 1/(4π)² normalises over the full angular volume so that Z₂ = 1 when ξ = K₁,₂ = 0.

The use of device local memory was considered to allow for the fastest possible calculation and the maximal bandwidth available to the device, showing approximately 850 × 10⁹

bytes per second transfer using the Stream SDK device test code. As each workgroup

in the calculation may share local memory values amongst all its workitems, an effective

means of passing data and values was achieved. For every workgroup a collection of local

arrays were declared for each stage of the calculation. Two arrays were created for storing

the random values, corresponding to θ and φ, three more for the intermediate stages of

the calculations, and the final for the result:

• Two local arrays, localTheta, localPhi, for storing random values with θ ∈ [0, π],

and φ ∈ [0, 2π].

• Three intermediate arrays for calculations, cosSum, sinProd, exchSum, corresponding to the two summation terms and the product term of the partition function

• An array for the result of the calculation, samples

As we are limited to 32 KB of local memory, the local arrays had to be verified to

fit within the limited memory. Both localTheta and localPhi were of the maximum


size available to the workgroup (256), with each entry being of vector data type float4

(16-bytes). Therefore, the total utilised memory by both arrays amounted to 2×256×16,

giving a local memory footprint of 8 KB. Each array was populated from the values in

global memory in alternating sequence, wherein even indexed values were allocated to

localTheta and odd values localPhi. Due to the nature of the partition function, for

every increment of the number of spins, N , the integral grows by 2N . A maximum number

of function samples could therefore be determined using the resulting local values, thus

giving the required size of the resulting local arrays. If we consider 256 random elements

in both random value arrays, in the case where N = 2 we can obtain 256/2 = 128 pairs of

results from both localTheta, and localPhi, each of which correspond to θ1, θ2 and

φ1, φ2. For an N dimensional chain, the number of samples, S, which can be obtained

from the local memory space is thus

S = groupSize / N   (4.6)

where groupSize represents the total number of workitems within the workgroup, set to

the GPU maximum of 256. A flaw with the above method however occurs if N is not a

power of 2, as certain values will remain unused within the local arrays. However, currently

only the N = 2 case is being considered. Using the above result for the N = 2 case allowed

the remaining local array sizes to be specified, wherein each entry will correspond to a

sample for the supplied values. The equations through which arrays cosSum, sinProd

and exchSum are calculated were taken from Equation (4.5) as follows:

cosSum = cos(θ1) + cos(θ2) (4.7)

exchSum = sin(θ1) sin(θ2) cos(φ1 − φ2) + cos(θ1) cos(θ2) (4.8)

sinProd = sin(θ1) sin(θ2) (4.9)


For the final part of the calculation the array samples, utilising all the sampled values

within the above array given in Equation (4.7), performs the following computation


samples = exp( −(−µ₀mH·cosSum − JC₁,₂·exchSum) / kBT ) · sinProd   (4.11)

where all constants are specified explicitly, as opposed to the use of ξ and K₁,₂ as in Equation

(4.5). Setting all constants to their known values, using the equations listed for JC and

m, and also choosing values for H, T , s, and J yielded a method for the determination

of a value for each set of samples. Through use of kernelSample, the resulting values

were summed and passed back to the host for final evaluation. It was noticed that errors

existed within the routine, as certain values were returned as not-a-number results (NaN).

4.2.4.2 Subversion 2 — Random Value Recycling

A series of tests were performed upon the routines of the previous subversion, examining the values stored

in each array location in memory during the sequence of the calculation. It was discovered

that during the transfer of values from global memory to local memory of localTheta and

localPhi that every alternating value was zero. This error was discovered to be resulting

from both the means in which OpenCL performs a parallel read and the implemented

method for reading the values. Thus in order to correct the issue, either the process

was to be serialised, removing the parallel nature of the read operation, or to find an

alternative method for reading values from global to local memory. To correct this issue

it was decided to avoid serializing the read operation, instead opting to read the indexed

values in from global memory and to remove the alternation between index values of the

local memory. This essentially meant that for every value read and stored at localTheta,

the same value was thus read and stored at the same index location in localPhi. Values

were therefore scaled to the appropriate range upon read from global memory, and the

resulting operation was modified to consider each random value to be used twice, thus

resulting in a maximum of 2²⁷ values being utilised.

In line with Equation (4.5), ξ and K1,2 were used to represent the constants and

variables. The resulting calculation of the samples array was thus modified to

samples = exp(ξ·cosSum + K₁,₂·exchSum) · sinProd   (4.12)


Values were chosen for ξ and K1,2 which would be representative of the zero field case

of the partition function, ξ = 0. The paper by Cregg et al on classical Heisenberg spin

chains [9] gives a reduced form of the partition function for this case in terms of known

functions, evaluated as

Z₂(K₁,₂) = sinh(K₁,₂)/K₁,₂   (4.13)

A routine developed by Murphy [48] for the evaluation of this equation allowed for the

determination of values for comparison to the OpenCL Monte Carlo method. A serial

Monte Carlo integration using the GNU scientific library, also developed by Murphy

for calculation of the result, was used for timing comparisons between itself and the

parallelised implementation under OpenCL. Initial results for various values of K₁,₂ showed a large margin of error between the OpenCL implementation and both

the analytical and serial Monte Carlo implementations. It was discovered that the errors

were again resulting from issues with local memory read and write operations. It was

thus decided to restructure the means with which the random values were generated and

stored in local memory, and hence passed into the kernel for use thereafter.

4.2.4.3 Subversion 3 — Dual Global Arrays

Within the host program, the single global memory buffer created for storage of the

generated random values was partitioned in two, giving an array of values each for θ and φ. The indexing method for the number generation kernel was modified to accommodate this change, with the original 2²⁶ range of random values split into two unique global arrays, phiGlobal, thetaGlobal, of size 2²⁵ each. From here, the address of both arrays in global

device memory were passed into the kernel for the integration, wherein each global array

was given a corresponding local array of values as specified in Version 2. An indexing

issue was discovered, requiring further refinement of the number generation kernel, after which all values were read correctly. The values returned for an execution of the routine were within the expected range, and hence the routine was deemed to be operating as expected.

Timings were taken for execution of the routine using the maximum number of samples, 2²⁴, for the OpenCL routine on both the GPU and the CPU. Modifications to the routine


of Murphy [48] were carried out to use the same number of sample points of the partition

function, and also yield the timing of execution. The results were compared with those

obtained through the OpenCL-based methods. No further changes were made to the inner workings of this version of the routine, with the exception of the inclusion of the parameters Kj,j+1, where j < N − 1. Both ξ and the Kj,j+1 parameters were specified to

the kernel at runtime. A series of simulations were carried out for the zero field partition

function over the range of Kj,j+1 values from 0.0 to 50.0 using the Monte Carlo routine,

and plotted against that of the analytical results from Equation (4.13).

Error comparisons were also drawn for the above routine, specifying the zero-field case with Kj,j+1 = 0, which according to Cregg et al evaluates as 1.0 [9, Eqn. 10]. A series of runs were carried out for increasing numbers of generated values, with the resulting error of the Monte Carlo method compared to the known value of the partition function. A graph was plotted of the resulting relative errors versus the number of samples taken, given as Figure 5.3.

Further analysis was performed on the implemented routine, through variation of

the localSize value for the GPU and CPU. Non-optimal values were used to test the

performance of the system, and to demonstrate the difference in data throughput and calculation speed for a non-optimised routine. The resulting graphs, Figure 5.4 and Figure 5.5, demonstrate the results of the testing.


Chapter 5

Results

5.1 System Testing

5.1.1 Version I — Uniformity Testing

For the uniform distribution case, the set of random values was generated by kernelGen, as represented by the figure given.

[Figure: generated random values plotted over the unit square, both axes spanning 0 to 1.]

The integration of the area defined by the limits, corresponding to the range [0..1], yields an overall area of 1 for the region. Taking the

number of random values as 2²⁶, the area of the region defined under the curve amounted to 0.5001. Compared with the known value of 0.5, this can be seen to be sufficiently close to uniform for an effective Monte Carlo integration routine, and so the generator kernelGen

was used throughout.

5.1.2 Version II — Test of Monte Carlo Integration Method

Before use of the implemented sample-mean Monte Carlo integration routine, it must first

be capable of demonstrating a reasonable degree of accuracy. To test the routine’s ability


to produce accurate results, the 16-dimensional integral of Equation (4.2) was solved using both the Maple CAS, to produce a known analytical solution, and the implemented Monte Carlo method. Within each dimension, the limits of integration were defined to be zero and one, resulting in a value of 169.0871798 for the analytical solution given by Maple. Utilising 2²⁶ samples in the case of the sample-mean Monte Carlo method yielded a result of 169.0443945 after a single run. The resulting relative error of the implemented Monte Carlo method versus the analytical solution was determined to be 0.0253%; such a minor degree of error allows a reasonable approximation to the actual analytical result.

Using 5 iterations, giving a total number of samples of 5 × 2²⁶, produced an average result of 169.0794226. Comparing this to the analytical result gives a relative error improved over the single run of samples by almost a full order of magnitude. However, for all intents and purposes, the result for the single run suffices for the current accuracy requirement, and all subsequent evaluations use this number of samples to maintain accurate comparisons.

5.1.3 Known Integral in 1-Dimension: CPU vs GPU

As the accuracy of the sample-mean Monte Carlo integration method has been verified in practice to produce a relative percentage error of approximately 0.03% for a given number of samples (2²⁶ ≈ 6.7 × 10⁷), it is therefore necessary to evaluate the performance of the routine developed, comparing the simulation running on the CPU to that running on the GPU. Using the OpenCL-based sample-mean Monte Carlo routine, an effective comparison can be made between the CPU and the GPU by specifying the device to the program at run-time. As the AMD OpenCL implementation supports multicore x86 CPUs alongside current-generation GPUs, the calculations may make use of both devices. Equation (5.1), defined as

f(x) = ∫ e^x cos x dx    (5.1)

was used as the testbed for this method. A series of runs were taken using both the CPU

and the GPU for the specified number of samples, with each run taking a unique time

period to finish. The average values of each run for the specified number of samples were


taken, and plotted against execution time. Table 5.1 and Figure 5.1 show the comparison between the CPU and GPU for the given number of samples:

Figure 5.1: Single-dimensional integral for testing of the Monte Carlo routine, comparing CPU and GPU. [Plot: number of sample points (0 to 3 × 10⁷) versus time [s] (0 to 3), with CPU and GPU series.]

Table 5.1: Testing of the integration routine, comparing CPU and GPU data

Device  Elements    TGen [ms]     TInt [ms]      TTotal [s]  Rel. Error
GPU     33554432     95.3259832      4.3827151   0.3600000   −9.0141E−5
GPU     16777216     42.1171832      2.4502873   0.2964286   −1.5147E−4
GPU      8388608     17.4728458      1.1042480   0.2603571   −7.4484E−5
GPU      4194304      7.3214064      0.5575617   0.2574999    6.8918E−5
GPU      2097152      5.0152502      0.2826003   0.2450002    7.8200E−4
CPU     33554432    330.8468800   3493.7542656   3.9142858    2.3807E−4
CPU     16777216    173.7412736   1745.2229632   2.0178572    1.0463E−3
CPU      8388608     94.4594992    891.9994240   1.0642858    1.0159E−3
CPU      4194304     62.5397376    431.0678080   0.5871430    1.1962E−3
CPU      2097152     44.2145600    224.7079680   0.3378572    2.1841E−3

Equation (5.1) can be evaluated analytically as

f(x) = (1/2) e^x cos(x) + (1/2) e^x sin(x)    (5.2)

Given the range of integration as [0..2π], the exact solution of the above equation over


the specific range is given as −(1/2) + (1/2)e^(2π), which when approximated to seven decimal places yields 267.2458278. The resulting relative error of the Monte Carlo method can thus be determined, and is given in Table 5.1. As can be seen, the relative error obtained for the GPU execution is generally an order of magnitude smaller than that of the CPU. As the same number of sample points was used for both devices, there is no defined reason why this should occur, and possible causes can only be speculated upon. It is possible that the OpenCL implementation makes better use of the GPU hardware at this point in time, allowing a higher degree of accuracy when evaluating certain functions or arguments.

5.1.4 Integration of partition function

Applying the resulting equation directly to the integration routine yielded significant issues with the range of values. The sampling of each function was determined to be positive infinity or zero, occasionally giving an unrepresentable number (NaN) for the result.

Upon configuration of the OpenCL routine for integration of Equation (3.7), it was

essential to verify the routine would give acceptably accurate results. To test for this, the

partition function of the 2-spin case was taken by setting N = 2, yielding Equation (4.5).

Utilising the special case equations specified by Cregg et al [9] for the partition function

in terms of known functions, and a routine developed by Murphy with the GNU scientific

library [48], values could be determined for the exact solution of the function, along with

a comparative serial Monte Carlo implementation. The resulting values determined by

the OpenCL based Monte Carlo method differed a great deal from those obtained from

the analytical functions, with an initial percentage error of approximately 67%, implying

the routine may have been implemented incorrectly. The results corresponded to Version IV.1 of the Implementation chapter, and hence led to the development of Version IV.2. After modification of the program, the degree of error was reduced significantly to a value of between 3% and 16%, depending upon the parameters specified. This remained relatively large compared to the results determined from the solution given by Cregg et al [10].

After further investigation and amendments to the program, the results obtained were

of the expected range, and mirrored very closely the results of Cregg et al for the reduced

partition function. Sample results obtained for Equation (4.5) allowed comparison of


the known solutions to those of the Monte Carlo results in both the serial and OpenCL

parallelised cases. Timing was carried out for both Monte Carlo routines to allow an

effective comparison of the calculation execution times. The results are given in Table 5.2

as follows:

Table 5.2: Results comparison for Monte Carlo implementations in the zero-field case

Z2−spin(0, −10)             Result        % Error      Time [s]
Analytical                  1101.323287    0.000000     0.000
Serial computation (GSL)    1101.595590   −0.024725    32.240
OpenCL (CPU)                1101.322678    0.000055     7.499
OpenCL (GPU)                1101.321922    0.000124     0.094

As a reference the timing for an implementation of Monte Carlo integration using the

GNU Scientific Library (GSL) is also given. The GSL is a widely used set of numerical

routines implemented in C. The GSL is strictly a serial program library (single processor core), and does not feature any parallelised routines or acceleration.¹

Using Version IV.3, for the zero-field case (ξ = 0), a series of data points was generated for the 2-spin partition function for values of K1,2 over the range [0..50]. From the resulting zero-field derivation provided by Cregg et al, Equation (4.13), a data set of analytically determined values was also generated using the same range of K1,2 values. Both data sets were plotted as K1,2 (exchange) versus Z2−spin (partition function), yielding Figure 5.2.

Using the special-case zero-field partition function, the error of the Monte Carlo routine could be determined. For various numbers of samples, the absolute relative error of a single Monte Carlo integration result was determined. A graph was plotted of the number of samples taken versus the absolute relative error associated with that specific number of samples. The resulting graph, Figure 5.3, demonstrates the expected error dependence on the number of samples taken. As can be seen, the error follows a reciprocal square-root type behaviour, as given by Equation (3.18).

For all timing readings to be an effective representation of the hardware system chosen,

¹However, due to the well-defined nature of the library routines, coding the Monte Carlo partition-function integration program required approximately 30 minutes, as opposed to the many days of implementation and weeks of testing in OpenCL.


Figure 5.2: Partition function, Z2−spin, with zero magnetic field (ξ = 0) and exchange parameter 0 < K ≤ 50, calculated using Monte Carlo sampling (2²⁴ samples) and the analytic formula [10], Equation (4.13). (Inset shows the relative error of the Monte Carlo estimates.) [Plot: K versus Z2−spin(0, K) (up to ~4 × 10¹⁹), Monte Carlo estimation and analytic formula; inset: K versus relative error (±4 × 10⁻³).]

Figure 5.3: Actual error (compared with the analytical result) of the Monte Carlo method as a function of the number of sample points. [Plot: number of sample points (10⁷ to 6 × 10⁷) versus relative error (10⁻³ to 3 × 10⁻³).]


Figure 5.4: Workgroup size versus total kernel execution time, GPU. [Plot: workgroup size (0 to 256) versus total kernel time (0 to 1,400 ms).]

it is necessary that the hardware be utilised to the fullest. This is shown through the variation in the workgroup size for the executing kernels. As the use of local memory is imperative to the performance of the system, this should allow the performance to be assessed for non-optimal system parameters. Starting with the maximum workgroup size supported by the GPU, 256, the value was halved until the lowest supported value, 2, was reached. Figure 5.4 shows the relationship between the time taken for execution of the kernels and the size of the workgroup. For values which are multiples of a wavefront² on ATI hardware, the timings are virtually identical. However, as the workgroup size decreases below this value, the performance diminishes dramatically. Therefore, for optimal results the workgroup size must remain greater than or equal to 64 workitems on the GPU used. For comparison's sake the same test was performed on the CPU, with a starting value of 512 (1024 being the maximum supported by the CPU used). It was seen that for the values chosen throughout, the CPU was not using the most optimal values, with the most effective workgroup size found to occur around 8. Figure 5.5 shows the relationship between time of execution and workgroup size. It can be seen, however, that the optimal CPU execution time does not come anywhere near even the suboptimal GPU values.

²A hardware-specific value defining the number of threads which may be run in parallel at any instant in time on ATI hardware. On current-generation GPUs a wavefront is equal to 64 workitems.

Figure 5.5: Workgroup size versus total kernel execution time, CPU. [Plot: workgroup size (0 to 512) versus total kernel time (4,000 to 8,000 ms).]

The results show what is expected from both devices. With values less than the maximum number of supported simultaneous threads of execution on the GPU, the hardware

is not being used to its fullest. As a result, many processing elements remain idle, and performance for the device decreases drastically. As CPUs are not inherently parallel devices, it is expected that the U-shaped curve will occur at some point in the range of localSize values. Subdivision of the problem space may achieve performance gains when dealing with the total number of threads supported by a device at any instant in time. However, the continuous overhead operations, such as context switching, required to maintain these threads may cause a severe performance hit for certain numbers of elements. For the most optimal use of the CPU hardware, a "sweet spot" must be found, wherein the thread-management overhead is minimised and the data throughput is essentially maximised. For the system used and described herein, the value for optimal CPU performance was a localSize of 8.


Chapter 6

Conclusions

From the work carried out herein it was seen that in all instances the OpenCL-based implementation of Monte Carlo methods offered a significant performance increase over the serialised method. For execution solely upon the CPU, the OpenCL implementation offered over a tenfold performance increase compared to the GSL-enabled serial process, with the GPU implementation offering approximately 320 times the performance for the same number of samples. For a direct GPU-to-CPU comparison under OpenCL, the performance gains from using a GPU device can be seen to approach 80 times the throughput for kernel performance. Thus it was concluded that the resulting hardware, and parallelisation of computation where necessary, offers significant advantages over evaluation via conventional means.

Future Work

Due to time constraints, the evaluation of values for the magnetic susceptibility and the recovery of exchange parameters was not achieved. Possible future work in the area would be to implement the means to achieve this. Implementation of an automated regression analysis routine for the recovery of said exchange parameters would allow for a fully automated and efficient calculation routine.

Performance improvements could be achieved through the use of variants of num-

ber generation routines. The Mersenne Twister for graphics processors (MTGP) [41] is

a newly developed variant of the SIMD Mersenne Twister routine, catering specifically

for GPU devices. Currently, no OpenCL-based implementation exists; if implemented correctly, its performance is expected to be greater than that of the older SIMD Mersenne Twister method, yielding a higher rate of number generation in comparison.

Although the use of local memory is imperative to the performance of the system, as can be seen from the performance graphs, Figure 5.4 and Figure 5.5, serialisation


may occur in local memory access. No attempts have been made to allow for full use

of coalesced1 memory reads, which may allow for further performance increases in the

resulting implementation. Given sufficient time for implementation and testing, this may

be achieved.

As the implemented routine is not designed optimally for non-power-of-two chain lengths, a future revision of the program may remove this limitation, offering equivalent performance regardless of spin-chain length.

¹Wavefront-wide memory read operations, without any serialised local memory access. All values are read simultaneously by each executing thread.


Bibliography

[1] Apple Inc. OpenCL Programming Guide for Mac OS X, 2009. URL http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL-MacProgGuide/OpenCL-MacProgGuide.pdf.

[2] P. Blum. Physical properties handbook, 1998. URL http://www-odp.tamu.edu/

publications/tnotes/tn26/TOC.HTM.

[3] P. Blum. Physical properties handbook, 1998. URL http://www-odp.tamu.edu/

publications/tnotes/tn26/TOC.HTM.

[4] Tom Robey Bob Robey, Nick Bennet. Monte carlo integration, 2009. URL http://

www.challenge.nm.org/archive/09-10/kickoff/classes/MonteCarlo.key.pdf.

New Mexico Supercomputing Challenge 2009.

[5] Advanced Micro Devices: Udeepta Bordoloi. Image convolution using opencl, 2009.

URL http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/

Pages/ImageConvolutionUsingOpenCL.aspx.

[6] Caltech Physics Computation. Monte carlo integration and the metropolis al-

gorithm, 2009. URL http://www.pma.caltech.edu/~physlab/ph22-spring06/

assignment-1.pdf.

[7] Stanford Distributed Computing. Folding@home, 2010 - Verified 06/03/2010. URL

http://folding.stanford.edu/.

[8] P J Cregg and L Bessais. Series expansions for the magnetisation of a solid superparamagnetic system of non-interacting particles with anisotropy. Journal of Magnetism and Magnetic Materials, 202(2):544–564, 1999.

[9] P J Cregg, J L Garcia-Palacios, and P Svedlindh. Partition functions of classical Heisenberg spin chains with arbitrary and different exchange. Journal of Physics A: Mathematical and Theoretical, 41(43):435202 (8pp), 2008. URL http://stacks.iop.org/1751-8121/41/435202.


[10] P J Cregg, J L Garcia-Palacios, P Svedlindh, and K Murphy. Low-field susceptibility of classical Heisenberg chains with arbitrary and different nearest-neighbour exchange. Journal of Physics: Condensed Matter, 20(20):204119 (5pp), 2008. URL http://stacks.iop.org/0953-8984/20/204119.

[11] David Landau and Kurt Binder. A Guide to Monte Carlo Simulations in Statistical Physics, 2000.

[12] P. F. de Chatel. Encyclopedia of Condensed Matter Physics, volume 1, chapter

Paramagnetism. Elsevier, 2005.

[13] Advanced Micro Devices. Opencl programming guide, 2010. URL

http://developer.amd.com/gpu\_assets/ATI\_Stream\_SDK\_OpenCL\

_Programming\_Guide.pdf.

[14] Advanced Micro Devices. ATI Radeon HD 5850 specification, 2009. URL http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5850/Pages/ati-radeon-hd-5850-specifications.aspx. Accessed 19:50 GMT, 21/02/10.

[15] Advanced Micro Devices. Introductory tutorial to opencl, 2009. URL http:

//developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx.

[16] Advanced Micro Devices. Ati stream sdk v2.0 - getting started guide,

2009. URL http://developer.amd.com/gpu/ATIStreamSDK/assets/

ATI-Stream-SDK-Getting-Started-Guide-v2.0.pdf.

[17] Advanced Micro Devices. Ati stream sdk v2.0 - developer release notes,

2009. URL http://developer.amd.com/gpu/ATIStreamSDK/assets/

ATI-Stream-SDK-Release-Notes-Developer.pdf.

[18] Advanced Micro Devices. Ati stream sdk v2.01, 2010. URL http://developer.

amd.com/gpu/ATIStreamSDK/Pages/default.aspx.

[19] Advanced Micro Devices. Ati stream sdk v2.01 - opencl video se-

ries, 2010. URL http://developer.amd.com/DOCUMENTATION/VIDEOS/

OPENCLTECHNICALOVERVIEWVIDEOSERIES/Pages/default.aspx.


[20] Advanced Micro Devices. Ati stream computing - introduction to opencl, 2009. URL

http://ati.amd.com/technology/streamcomputing/intro\_opencl.html.

[21] Oak Ridge National Lab: Physics Division. Flynn’s taxonomy, 1996 - Verified

03/03/10. URL http://www.phy.ornl.gov/csep/ca/node11.html.

[22] Flyura Djurabekova. Monte carlo simulations, 2008. URL http://beam.acclab.

helsinki.fi/~eholmstr/mc/.

[23] Flyura Djurabekova. Monte carlo integration, 2010. URL http://beam.acclab.

helsinki.fi/~djurabek/mc/mc5nc-1x2.pdf.

[24] Flyura Djurabekova. Mc simulation of “thermodynamic” ensembles, 2008. URL

http://beam.acclab.helsinki.fi/~djurabek/mc/mc9nc.pdf.

[25] Marius Lewerenz, Université Pierre et Marie Curie, France. Monte Carlo methods: overview and basics. Quantum Simulations of Complex Many-Body Systems: From Theory to Algorithms, (10):1–24, 2002. URL http://www.fz-juelich.de/nic-series/volume10/lewerenz.pdf.

[26] Colin Fowler. Concurrent systems i: Simd, 2008 - Verified 03/03/2010. URL https:

//www.cs.tcd.ie/~fowlerc/3ba26/simd1.pdf.

[27] Gohara. Opencl podcast, 2009. URL http://www.macresearch.org/files/

opencl/.

[28] Khronos Group. Opencl man pages, 2009. URL http://www.khronos.org/opencl/

sdk/1.0/docs/man/xhtml/.

[29] Khronos Group. Opencl overview, 2009. URL http://www.khronos.org/

developers/library/overview/opencl-overview.pdf.

[30] Khronos Group. Opencl quick reference guide, 2009. URL http://www.khronos.

org/files/opencl-quick-reference-card.pdf/.

[31] Khronos Group. Khronos group, 2009. URL http://www.khronos.org/.


[32] Khronos Group. OpenCL Specification v1.0.48. 2009. URL http://www.khronos.

org/registry/cl/specs/opencl-1.0.48.pdf.

[33] Khronos Group. OpenCL Specification v1.0.48, chapter Chapter 3. 2009. URL

http://www.khronos.org/registry/cl/specs/opencl-1.0.48.pdf.

[34] Dr Mads Haar. Introduction to randomness and random numbers. URL http:

//www.random.org/randomness/.

[35] Walker Halliday, Resnick. Fundamentals of Physics, chapter 32, pages 874–875.

Wiley, 7 edition.

[36] Mark Harris. Gpgpu.org. URL http://gpgpu.org/about.

[37] Mike Houston. General purpose computation on graphics processors. URL http:

//www-graphics.stanford.edu/~mhouston/public\_talks/R520-mhouston.pdf.

[38] University of Oulu Kari Rummukainen Department of physical sciences. Monte carlo

simulations in physics, 2007 - Verified 03/03/2010. URL http://cc.oulu.fi/~tf/

tiedostot/pub/montecarlo/lectures/mc\_notes1.pdf.

[39] Charles Kittel. Introduction to Solid State Physics. Wiley, 7 edition, 1995.

[40] Samrit Maity. Debugging opencl with gdb, 2010 - Verfied 03/03/2010. URL http://

samritmaity.files.wordpress.com/2009/11/opencl-program-structure2.jpg.

[41] Mutsuo Saito Makoto Matsumoto. Mersenne twister for graphics processors, 2009.

URL http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/index.html.

[42] Mutsuo Saito Makoto Matsumoto. Mersenne twister: A 623-dimensionally equidis-

tributed uniform pseudorandom number generator. ACM Transactions on Mod-

eling and Computer Simulation, (8):3–30, 1998. URL http://www.math.sci.

hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/mt.pdf.

[43] Mutsuo Saito Makoto Matsumoto Hiroshima University. Mersenne twister for graphic

processors, 2009. URL http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/

MTGP/index.html.


[44] MaxxPI2. Top 10 - flops. Technical report, 2010. URL http://www.maxxpi.net/

pages/result-browser/top10---flops.php.

[45] Miscellaneous. Curie’s law, 2010 - Verified 16/04/2010. URL http://en.wikipedia.

org/wiki/Curie\%27s\_law.

[46] Bruce M. Moskowitz. Hitchhikers guide to magnetism: Environmental magnetism

workshop, institute for rock magnetism, 1991. URL http://www.irm.umn.edu/

hg2m/hg2m-b/hg2m-b.html.

[47] Hesham El-Rewini Mostafa Abd-El-Barr. Fundamentals of Computer Organization

and Architecture. Wiley, 2005.

[48] Kieran Murphy. Gsl monte carlo integration routine, 2010.

[49] NVidia. Opencl best practices guide, 2009. URL http://

www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/

NVIDIA-OpenCL-BestPracticesGuide.pdf.

[50] NVidia. Nvidia cuda, 2009. URL http://www.nvidia.com/object/cuda\_home.

html.

[51] NVidia. Opencl jumpstart guide, 2009. URL http://developer.download.nvidia.

com/OpenCL/NVIDIAOpenCL-JumpStart-Guide.pdf.

[52] NVidia. Opencl programming guide, 2009. URL http://www.nvidia.com/content/

cudazone/download/OpenCL/NVIDIA-OpenCL-ProgrammingGuide.pdf.

[53] NVidia. Opencl programming overview, 2009. URL http://www.nvidia.com/

content/cudazone/download/OpenCL/NVIDIA-OpenCL-ProgrammingOverview.

pdf.

[54] Israeli Institute of Technology Computational Physics Group. Computational

physics/monte carlo integration. URL http://phycomp.technion.ac.il/~comphy/

classfiles/mci.html.

[55] R. K. Pathria. Statistical Mechanics. Butterworth-HeineMann, 2 edition, 1996.


[56] PCI-Sig. Pcie base 2.1 specification. Technical report, 2009 - Verfied 03/03/2010.

URL http://www.pcisig.com/specifications/pciexpress/base2/.

[57] Victor Podlozhnyuk. Parallel mersenne twister, 2007. URL http:

//developer.download.nvidia.com/compute/cuda/sdk/website/projects/

MersenneTwister/doc/MersenneTwister.pdf.

[58] Victor Podlozhnyuk. Parallel mersenne twister, 2007. URL http:

//developer.download.nvidia.com/compute/cuda/sdk/website/projects/

MersenneTwister/doc/MersenneTwister.pdf.

[59] Teukolsky Vetterling Flannery Press. Numerical Recipes: The Art of Scientific Com-

puting. Cambridge University Press, 3 edition, 2007.

[60] Pedro F. Quintana-Ascencio. Proof that sample variance is unbiased, Accessed 21:50,

24/02/10. URL http://biology.ucf.edu/~pascencio/classes/Methods/.

[61] Jeffrey S. Rosenthal. Parallel computing and monte carlo algorithms. Far-East Jour-

nal of Theoretical Statistics, (4):207–236, 2000. URL http://www.probability.ca/

jeff/ftpdir/para.pdf.

[62] Philip E. Ross. Why cpu frequency stalled, 2008 - Verified 06/03/2010. URL http:

//spectrum.ieee.org/computing/hardware/why-cpu-frequency-stalled.

[63] R. J. Sadus. Introduction to monte carlo methods. URL http://www.swinburne.

edu.au/ict/research/cms/documents/mod1.pdf.

[64] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in

software, 2004, Updated 2009 - Verified 06/03/2010. URL http://www.gotw.ca/

publications/concurrency-ddj.htm.

[65] ATI Stream Development Team. Mersenne twister stream sdk demo, 2009. URL

http://developer.amd.com/gpu/atistreamsdk/pages/default.aspx.

[66] Brown Deer Technology. Opencl tutorial: N-body simulation, 2009. URL http:

//browndeertechnology.com/docs/BDT-OpenCL-Tutorial-NBody.html.


[67] Eric Veach. Robust Monte Carlo Methods for Light Transport Simulation. PhD thesis,

Stanford University, December 1997.

[68] Roy Wikramaratna. Pseudo-random number generation for parallel monte carlo -

a splitting approach. URL http://citeseerx.ist.psu.edu/viewdoc/download?

doi=10.1.1.103.4550\&rep=rep1\&type=pdf.


Appendix A

Code Listing

A.1 System Information Output

-----------------------------
ATI Radeon HD5850 GPU
-----------------------------
Vendor               : Advanced Micro Devices, Inc.
Device Name          : Cypress
Profile              : FULL_PROFILE
Supported Extensions : cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics

Local Mem Type (Local=1, Global=2) : 1
Global Mem Size (MB)               : 256
Global Mem Cache Size (Bytes)      : 0
Max Mem Alloc Size (MB)            : 256
Clock Frequency (MHz)              : 775

Vector type width for : char   = 16
Vector type width for : short  = 8
Vector type width for : int    = 4
Vector type width for : long   = 2
Vector type width for : float  = 4
Vector type width for : double = 0

Max Work Group Size     : 256
Max Work Item Dims      : 3
Max Work Items in Dim 1 : 256
Max Work Items in Dim 2 : 256
Max Work Items in Dim 3 : 256
Max Compute Units       : 18

-----------------------------
AMD Phenom II X3 720BE CPU
-----------------------------
Vendor               : AuthenticAMD
Device Name          : AMD Phenom(tm) II X3 720 Processor
Profile              : FULL_PROFILE
Supported Extensions : cl_khr_icd cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics
  cl_khr_int64_extended_atomics cl_khr_byte_addressable_store

Local Mem Type (Local=1, Global=2) : 2
Global Mem Size (MB)               : -1024
Global Mem Cache Size (Bytes)      : 64
Max Mem Alloc Size (MB)            : 1024
Clock Frequency (MHz)              : 2800

Vector type width for : char   = 16
Vector type width for : short  = 8
Vector type width for : int    = 4
Vector type width for : long   = 2
Vector type width for : float  = 4
Vector type width for : double = 0

Max Work Group Size     : 1024
Max Work Item Dims      : 3
Max Work Items in Dim 1 : 1024
Max Work Items in Dim 2 : 1024
Max Work Items in Dim 3 : 1024
Max Compute Units       : 3


A.2 Host Machine Program

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
#define PI 3.1415926536
//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
#include <CL/cl.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <math.h>
#include <time.h>
#include <fstream>
#include <sys/time.h>
//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
/** CL device handles and setup types */
cl_context gpuContext;
cl_device_id *devices;
cl_command_queue gpuQueue;
/** Input and output host arrays */
cl_uint *input;
cl_float *output;
/** Handles to all CL device memory buffers */
cl_mem seedsInput;
cl_mem randomBufferPhi;
cl_mem randomBufferTheta;
cl_mem resultOutput;
/** Handle to program and kernels */
cl_program program;
cl_kernel kernelSusc;
cl_kernel kernelCalc;
/**< time taken to set up OpenCL resources and build the kernel */
cl_double setupTime;
/**< time taken to run kernel and read result back */
cl_double kernelTime;
/**< Max allowed work-items in a group */
size_t maxWorkGroupSize;
/**< Max group dimensions allowed */
cl_uint maxDimensions;
/**< Max work-item sizes in each dimension */
size_t *maxWorkItemSizes;
/**< Width of the execution domain */
cl_uint width;
/**< Height of the execution domain */
cl_uint height;
/** Group size returned by kernel */
size_t kernelWorkGroupSize;
/**< Block size in x-direction */
size_t blockSizeX;
/**< Block size in y-direction */
size_t blockSizeY;
/**< Number of iterations for kernel execution */
int iterations;
/** Return value for all OpenCL function calls */
cl_int err;
/** Local size value - size of workgroup */
cl_uint BLOCK_SIZE;
/** Multiplication factors for number generator */
cl_uint factor;
cl_uint factorB;
/** Specifies CL device type */
cl_device_type deviceType;
/** Length of spin-chain */
cl_uint N;
/** Number of random values */
cl_uint n;
/** Specifies necessary parameters to kernel */
float params[4];

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//

/** Initialise the OpenCL platform */
void initCL(void);

/** Prints system information */
void deviceInfo(cl_device_id id);

/** Converts kernel file to a string */
std::string convertToString(const char *filename);

/** Set up all necessary values for calculation */
void setupHost(void);

/** Execute kernels */
void runKernels(void);

/** Release CL resources */
void releaseCL(void);

/** Release host resources */
void releaseHost(void);

cl_uint *seedGen(void);
//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//

void initCL()
{
    size_t deviceListSize;

    /** Modified from AMD ICD Page */
    cl_uint numPlatforms;
    cl_platform_id platform = NULL;
    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (err != CL_SUCCESS)
    {
        printf("clGetPlatformIDs failed");
        return;
    }
    if (0 < numPlatforms)
    {
        cl_platform_id *platforms;
        platforms = (cl_platform_id *)
            malloc(numPlatforms * sizeof(cl_platform_id));
        err = clGetPlatformIDs(numPlatforms, platforms, NULL);
        if (err != CL_SUCCESS)
        {
            printf("clGetPlatformIDs failed");
            return;
        }
        for (unsigned i = 0; i < numPlatforms; ++i)
        {
            char pbuf[100];
            err = clGetPlatformInfo(platforms[i],
                                    CL_PLATFORM_VENDOR,
                                    sizeof(pbuf),
                                    pbuf,
                                    NULL);
            if (err != CL_SUCCESS)
            {
                printf("clGetPlatformInfo failed");
                return;
            }
            platform = platforms[i];
            if (!strcmp(pbuf, "Advanced Micro Devices, Inc."))
            {
                break;
            }
        }
        free(platforms);
    }

    if (NULL == platform)
    {
        printf("NULL platform found so Exiting Application.");
        return;
    }
    /*
     * If we could find our platform, use it.
     * Otherwise use just available platform.
     */
    cl_context_properties cps[3] =
    {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties) platform,
        0
    };

    cl_context_properties *cprops = (NULL == platform) ? NULL : cps;
    gpuContext = clCreateContextFromType(cprops,
                                         deviceType,
                                         NULL,
                                         NULL,
                                         &err);

    //%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
    /* First, get the size of device list data */
    err = clGetContextInfo(gpuContext,
                           CL_CONTEXT_DEVICES,
                           0,
                           NULL,
                           &deviceListSize);
    if (err != CL_SUCCESS)
    {
        printf("Error: Getting Context Info \
                (device list size, clGetContextInfo)\n");
        return;
    }

    /////////////////////////////////////////////////////////////////
    // Detect OpenCL devices
    /////////////////////////////////////////////////////////////////
    devices = (cl_device_id *) malloc(deviceListSize);
    if (devices == 0)
    {
        printf("Error: No devices found.\n");
        return;
    }

    /* Now, get the device list data */
    err = clGetContextInfo(
              gpuContext,
              CL_CONTEXT_DEVICES,
              deviceListSize,
              devices,
              NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Getting Context Info \
                (device list, clGetContextInfo)\n");
        return;
    }

    /////////////////////////////////////////////////////////////////
    // Create an OpenCL command queue
    /////////////////////////////////////////////////////////////////
    gpuQueue = clCreateCommandQueue(
                   gpuContext,
                   devices[0],
                   CL_QUEUE_PROFILING_ENABLE,
                   &err);
    if (err != CL_SUCCESS)
    {
        printf("Creating Command Queue. (clCreateCommandQueue)\n");
        return;
    }

    /////////////////////////////////////////////////////////////////
    // Create OpenCL memory buffers
    /////////////////////////////////////////////////////////////////
    seedsInput = clCreateBuffer(
                     gpuContext,
                     CL_MEM_READ_WRITE, // | CL_MEM_USE_HOST_PTR,
                     sizeof(cl_uint4) * width,
                     0,
                     &err);
    if (err != CL_SUCCESS)
    {
        printf("Error: clCreateBuffer (seedsInput)\n");
        return;
    }

    //%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
    size_t bufferSizeAngles = (sizeof(cl_float4) * width * factor * factorB / 2);
    randomBufferPhi = clCreateBuffer(
                          gpuContext,
                          CL_MEM_READ_WRITE,
                          bufferSizeAngles,
                          NULL,
                          &err);
    if (err != CL_SUCCESS)
    {
        printf("Error: clCreateBuffer (phiBuffer)\n");
        return;
    }

    randomBufferTheta = clCreateBuffer(
                            gpuContext,
                            CL_MEM_READ_WRITE,
                            bufferSizeAngles,
                            NULL,
                            &err);
    if (err != CL_SUCCESS)
    {
        printf("Error: clCreateBuffer (thetaBuffer)\n");
        return;
    }

    //%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//

    /////////////////////////////////////////////////////////////////
    // Load CL file, build CL program object, create CL kernel object
    /////////////////////////////////////////////////////////////////
    const char *filename = "HeisenbergSpin_Kernels.cl";
    std::string sourceStr = convertToString(filename);
    const char *source = sourceStr.c_str();
    size_t sourceSize[] = { strlen(source) };

    program = clCreateProgramWithSource(
                  gpuContext,
                  1,
                  &source,
                  sourceSize,
                  &err);
    if (err != CL_SUCCESS)
    {
        printf("Error: Loading Binary into cl_program \
                (clCreateProgramWithBinary)\n");
        return;
    }

    /* create a cl program executable for all the devices specified */
    err = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        if (err == CL_BUILD_PROGRAM_FAILURE)
        {
            cl_int logStatus;
            char *buildLog = NULL;
            size_t buildLogSize = 0;
            logStatus = clGetProgramBuildInfo(program,
                                              devices[0],
                                              CL_PROGRAM_BUILD_LOG,
                                              buildLogSize,
                                              buildLog,
                                              &buildLogSize);

            buildLog = (char *) malloc(buildLogSize);
            if (buildLog == NULL)
            {
                printf("Failed to allocate host memory. (buildLog)\n");
                return;
            }
            memset(buildLog, 0, buildLogSize);

            logStatus = clGetProgramBuildInfo(program,
                                              devices[0],
                                              CL_PROGRAM_BUILD_LOG,
                                              buildLogSize,
                                              buildLog,
                                              NULL);

            printf("\n\t\t\tBUILD LOG\n");
            printf("************************************************\n");
            printf("%s", buildLog);
            printf("************************************************\n");
            free(buildLog);
        }
        if (err != CL_SUCCESS)
        {
            printf("Error: Building Program (clBuildProgram)\n");
            return;
        }
    }

    /* get a kernel object handle for a kernel with the given name */
    kernelSusc = clCreateKernel(program, "suscKernel", &err);
    if (err != CL_SUCCESS)
    {
        printf("Error: Creating Kernel Susc from program. (clCreateKernel)\n");
        return;
    }

    kernelCalc = clCreateKernel(program, "calcKernel", &err);
    if (err != CL_SUCCESS)
    {
        printf("Error: Creating Kernel Reduce from program. (clCreateKernel)\n");
        return;
    }

    err = clGetKernelWorkGroupInfo(kernelSusc,
                                   devices[0],
                                   CL_KERNEL_WORK_GROUP_SIZE,
                                   sizeof(size_t),
                                   &kernelWorkGroupSize,
                                   0);
    if (err != CL_SUCCESS)
    {
        printf("clGetKernelWorkGroupInfo failed.\n");
    }

    if ((BLOCK_SIZE) > kernelWorkGroupSize)
    {
        printf("Out of Resources!\n");
        printf("Group Size specified : %u\n", BLOCK_SIZE);
        printf("Max Group Size supported on the kernel : %u\n",
               kernelWorkGroupSize);
        printf("Falling back to %u\n", kernelWorkGroupSize);
        if (blockSizeX > kernelWorkGroupSize)
        {
            BLOCK_SIZE = kernelWorkGroupSize;
        }
    }

    //****************************************************************//
    err = clGetKernelWorkGroupInfo(kernelCalc,
                                   devices[0],
                                   CL_KERNEL_WORK_GROUP_SIZE,
                                   sizeof(size_t),
                                   &kernelWorkGroupSize,
                                   0);
    if (err != CL_SUCCESS)
    {
        printf("clGetKernelWorkGroupInfo failed.\n");
    }
    if ((BLOCK_SIZE) > kernelWorkGroupSize)
    {
        printf("Out of Resources!\n");
        printf("Group Size specified : %u\n", BLOCK_SIZE);
        printf("Max Group Size supported on the kernel : %u\n",
               kernelWorkGroupSize);
        printf("Falling back to %u\n", kernelWorkGroupSize);
        if (BLOCK_SIZE > kernelWorkGroupSize)
        {
            BLOCK_SIZE = kernelWorkGroupSize;
        }
    }
}

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
void deviceInfo(cl_device_id id)
{
    int err;
    size_t returned_size;

    // Report the device vendor and device name
    cl_char vendor_name[1024] = {0};
    cl_char device_name[1024] = {0};
    cl_char device_profile[1024] = {0};
    cl_char device_extensions[1024] = {0};
    cl_device_local_mem_type local_mem_type;
    cl_ulong global_mem_size, global_mem_cache_size, local_mem_size;
    cl_ulong max_mem_alloc_size;
    cl_uint clock_frequency, vector_width, max_compute_units;
    size_t max_work_item_dims, max_work_group_size, max_work_item_sizes[3];
    cl_uint vector_types[] = {CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR,
                              CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT,
                              CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,
                              CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG,
                              CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                              CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE};
    char *vector_type_names[] = {"char", "short", "int",
                                 "long", "float", "double"};
    err = clGetDeviceInfo(id, CL_DEVICE_VENDOR, sizeof(vendor_name),
                          vendor_name, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_NAME, sizeof(device_name),
                           device_name, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_PROFILE, sizeof(device_profile),
                           device_profile, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_EXTENSIONS, sizeof(device_extensions),
                           device_extensions, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(local_mem_type),
                           &local_mem_type, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem_size),
                           &local_mem_size, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_GLOBAL_MEM_SIZE,
                           sizeof(global_mem_size),
                           &global_mem_size, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE,
                           sizeof(global_mem_cache_size),
                           &global_mem_cache_size, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                           sizeof(max_mem_alloc_size),
                           &max_mem_alloc_size, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                           sizeof(clock_frequency),
                           &clock_frequency, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                           sizeof(max_work_group_size),
                           &max_work_group_size, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                           sizeof(max_work_item_dims),
                           &max_work_item_dims, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                           sizeof(max_work_item_sizes),
                           max_work_item_sizes, &returned_size);
    err |= clGetDeviceInfo(id, CL_DEVICE_MAX_COMPUTE_UNITS,
                           sizeof(max_compute_units),
                           &max_compute_units, &returned_size);
    printf("Vendor : %s\n", vendor_name);
    printf("Device Name : %s\n", device_name);
    printf("Profile : %s\n", device_profile);
    printf("Supported Extensions : %s\n\n", device_extensions);
    printf("Local Mem Type (Local=1, Global=2) : %i\n", (int) local_mem_type);
    printf("Local Mem Size (KB) : %i\n", (int) local_mem_size / 1024); // (1024*1024));
    printf("Global Mem Size (MB) : %i\n", (int) global_mem_size / (1024 * 1024));
    printf("Global Mem Cache Size (Bytes) : %i\n", (int) global_mem_cache_size);
    printf("Max Mem Alloc Size (MB) : %ld\n", (long int) max_mem_alloc_size
           / (1024 * 1024));
    printf("Clock Frequency (MHz) : %i\n\n", clock_frequency);
    for (int i = 0; i < 6; i++)
    {
        err |= clGetDeviceInfo(id, vector_types[i], sizeof(clock_frequency),
                               &vector_width, &returned_size);
        printf("Vector type width for : %s = %i\n", vector_type_names[i],
               vector_width);
    }
    printf("\nMax Work Group Size : %lu\n", max_work_group_size);
    printf("Max Work Item Dims : %lu\n", max_work_item_dims);
    for (size_t i = 0; i < max_work_item_dims; i++)
        printf("Max Work Items in Dim %lu : %lu\n", (long unsigned)(i + 1),
               (long unsigned) max_work_item_sizes[i]);
    printf("Max Compute Units : %i\n", max_compute_units);
    printf("\n");
}

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
std::string convertToString(const char *filename)
{
    size_t size;
    char *str;
    std::string s;
    std::fstream f(filename, (std::fstream::in | std::fstream::binary));
    if (f.is_open())
    {
        size_t fileSize;
        f.seekg(0, std::fstream::end);
        size = fileSize = f.tellg();
        f.seekg(0, std::fstream::beg);
        str = new char[size + 1];
        if (!str)
        {
            f.close();
            return std::string(); // empty string on failure
        }
        f.read(str, fileSize);
        f.close();
        str[size] = '\0';
        s = str;
        return s;
    }
    return std::string(); // empty string on failure
}

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
void setupHost(void)
{
    width = 65536;
    height = 1;
    input = NULL;
    output = NULL;
    input = (cl_uint *) memalign(16, width * sizeof(cl_uint4));
    if (input == NULL)
    {
        printf("Error: Failed to allocate input memory on host\n");
        return;
    }
    output = (cl_float *)
        malloc(factorB * factor * width * sizeof(cl_float4) / N);
    if (output == NULL)
    {
        printf("Error: Failed to allocate output memory on host\n");
        return;
    }
    memset((void *) output, 0, sizeof(cl_float4));
    for (cl_uint i = 0; i < 4 * width; i = i + 1)
    {
        input[i] = (uint) rand() + (0xFFFFFFFF >> 1) * clock();
    }
}

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
void releaseCL(void)
{
    err = clReleaseKernel(kernelSusc);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseKernel Susc\n");
        return;
    }
    err = clReleaseKernel(kernelCalc);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseKernel Reduce\n");
        return;
    }
    err = clReleaseProgram(program);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseProgram\n");
        return;
    }
    err = clReleaseMemObject(randomBufferTheta);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseMemObject (thetaBuffer)\n");
        return;
    }
    err = clReleaseMemObject(randomBufferPhi);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseMemObject (phiBuffer)\n");
        return;
    }
    err = clReleaseCommandQueue(gpuQueue);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseCommandQueue\n");
        return;
    }
    err = clReleaseContext(gpuContext);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseContext\n");
        return;
    }
}

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
void runKernels(void)
{
    cl_event events[2];
    cl_event timingEvent;
    cl_ulong startTime, endTime;
    cl_bool isBlockingWrite = CL_FALSE;
    size_t offSet = 0;
    size_t bufferSize = width * sizeof(cl_uint4);

    // Send seeds to buffer for use on device
    err = clEnqueueWriteBuffer(gpuQueue, seedsInput, isBlockingWrite,
                               offSet, bufferSize, input, 0, 0, 0);
    if (err != CL_SUCCESS)
    {
        printf("clEnqueueWriteBuffer failed. (seedsBuf)\n");
        return;
    }
    err = clFinish(gpuQueue);
    if (err != CL_SUCCESS)
    {
        printf("clFinish failed.");
        return;
    }

    size_t globalThreads[1] = { width };
    size_t globalThreads2[1] = { width * factor * factorB / 2 };
    size_t localThreads[1] = { BLOCK_SIZE };

    //%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
    err = clSetKernelArg(kernelSusc, 0, sizeof(cl_mem),
                         (void *)&seedsInput);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernel arguments 0\n");
        return;
    }
    err = clSetKernelArg(kernelSusc, 1, sizeof(cl_uint),
                         (void *)&factor);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernel arguments 1\n");
        return;
    }
    err = clSetKernelArg(kernelSusc, 2, sizeof(cl_uint),
                         (void *)&factorB);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernel arguments 2\n");
        return;
    }
    err = clSetKernelArg(kernelSusc, 3, sizeof(cl_mem),
                         (void *)&randomBufferPhi);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernel arguments 3\n");
        return;
    }
    err = clSetKernelArg(kernelSusc, 4, sizeof(cl_mem),
                         (void *)&randomBufferTheta);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernel arguments 4\n");
        return;
    }

    //%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
    err = clSetKernelArg(kernelCalc, 0, sizeof(cl_mem),
                         (void *)&randomBufferPhi);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 0\n");
        return;
    }
    err = clSetKernelArg(kernelCalc, 1, sizeof(cl_mem),
                         (void *)&randomBufferTheta);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 1\n");
        return;
    }
    err = clSetKernelArg(kernelCalc, 8, sizeof(cl_mem),
                         (void *)&randomBufferPhi);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 8\n");
        return;
    }

    err = clSetKernelArg(kernelCalc, 2,
                         BLOCK_SIZE * sizeof(cl_float4), NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 2\n");
        return;
    }

    err = clSetKernelArg(kernelCalc, 3,
                         BLOCK_SIZE * sizeof(cl_float4), NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 3\n");
        return;
    }

    err = clSetKernelArg(kernelCalc, 4,
                         ((cl_uint)BLOCK_SIZE / N) * sizeof(cl_float4), NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 4\n");
        return;
    }
    err = clSetKernelArg(kernelCalc, 5,
                         ((cl_uint)BLOCK_SIZE / N) * sizeof(cl_float4), NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 5\n");
        return;
    }
    err = clSetKernelArg(kernelCalc, 6,
                         ((cl_uint)BLOCK_SIZE / N) * sizeof(cl_float4), NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 6\n");
        return;
    }
    err = clSetKernelArg(kernelCalc, 7,
                         ((cl_uint)BLOCK_SIZE / N) * sizeof(cl_float4), NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 7\n");
        return;
    }
    err = clSetKernelArg(kernelCalc, 9, sizeof(cl_float4),
                         (void *)&params);
    if (err != CL_SUCCESS)
    {
        printf("Error setting kernelCalc arguments 9\n");
        return;
    }

    //%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
    err = clEnqueueNDRangeKernel(gpuQueue, kernelSusc, 1, NULL,
                                 globalThreads, localThreads, 0,
                                 NULL, &events[0]);
    if (err != CL_SUCCESS)
    {
        printf("clEnqueueNDRangeKernel failed.");
        return;
    }

    err = clFinish(gpuQueue);
    if (err != CL_SUCCESS)
    {
        printf("Finish Queue failed.");
        return;
    }

    //*************Timing*********************************************//
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &startTime, NULL);
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &endTime, NULL);
    cl_ulong elapsedTimeGen = endTime - startTime;
    //****************************************************************//
    printf("Kernel Time Generate : %f ms\n",
           (cl_float) elapsedTimeGen * 0.000001);
    // printf("%f,", (cl_float) elapsedTime * 0.000001);
    err = clReleaseEvent(events[0]);
    if (err != CL_SUCCESS)
    {
        printf("clReleaseEvent failed.");
        return;
    }

    err = clReleaseMemObject(seedsInput);
    if (err != CL_SUCCESS)
    {
        printf("Error: In clReleaseMemObject (seedsInput)\n");
        return;
    }

    //****************************************************************//

    err = clEnqueueNDRangeKernel(gpuQueue, kernelCalc, 1, NULL,
                                 globalThreads2, localThreads, 0,
                                 NULL, &events[1]);
    if (err != CL_SUCCESS)
    {
        printf("clEnqueueNDRangeKernel failed.");
        return;
    }

    err = clFinish(gpuQueue);
    if (err != CL_SUCCESS)
    {
        printf("Finish Queue failed.");
        return;
    }

    //*************Timing*********************************************//
    clGetEventProfilingInfo(events[1], CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &startTime, NULL);
    clGetEventProfilingInfo(events[1], CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &endTime, NULL);
    cl_ulong elapsedTimeCalc = endTime - startTime;
    printf("Kernel Time Calculate : %f ms\n",
           (cl_float) elapsedTimeCalc * 0.000001);
    //****************************************************************//

    err = clReleaseEvent(events[1]);
    if (err != CL_SUCCESS)
    {
        printf("clReleaseEvent failed.");
        return;
    }

    //****************************************************************//
    err = clEnqueueReadBuffer(
              gpuQueue, randomBufferPhi, CL_TRUE, 0,
              factorB * (factor / 2) * width * sizeof(cl_float4) / BLOCK_SIZE,
              output, 0, NULL, &events[0]);
    if (err != CL_SUCCESS)
    {
        printf("clEnqueueReadBuffer failed.");
        return;
    }
    err = clWaitForEvents(1, &events[0]);
    if (err != CL_SUCCESS)
    {
        printf("clWaitForEvents failed.");
        return;
    }

    //*************Timing*********************************************//
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &startTime, NULL);
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &endTime, NULL);
    cl_ulong elapsedTimeRead = endTime - startTime;
    printf("Function Time Read : %f ms\n",
           (cl_float) elapsedTimeRead * 0.000001);
    printf("Total Kernel Time : %f s\n", (cl_float)(elapsedTimeGen +
           elapsedTimeCalc + elapsedTimeRead) * 0.000000001);
    //****************************************************************//

    err = clReleaseEvent(events[0]);
    if (err != CL_SUCCESS)
    {
        printf("clReleaseEvent failed.");
        return;
    }

    double total = 0.0;
    for (int i = 0; i < ((factorB * (factor / 2) * width * 4) / (BLOCK_SIZE)); i++) // (factor/2)*width*4; i++)
        total += output[i];

    printf("%2.12f\n", (0.5) * (3.14159265358979 * 3.14159265358979) * (total /
           (width * factor * factorB * (4) / (N))));
}

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
void releaseHost(void)
{
    if (input != NULL)
    {
        free(input);
        input = NULL;
    }
    if (output != NULL)
    {
        free(output);
        output = NULL;
    }
    if (devices != NULL)
    {
        free(devices);
        devices = NULL;
    }
}

//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%//
int main(void)
{
    N = 4;
    uint clDevice = 1;
    deviceType = (clDevice == 1) ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU;
    // clock_t start = clock();

    params[0] = 1.0;
    params[1] = 0.0f;
    params[2] = (cl_float) N;
    factorB = 32;
    factor = 8;
    BLOCK_SIZE = 256;

    printf("**************************************************\n");
    setupHost();
    initCL();
    // deviceInfo(devices[0]);
    runKernels();
    releaseCL();
    releaseHost();
    // printf("%lf\n", ((double) difftime(clock(), start)) / (2.8 * CLOCKS_PER_SEC));
    printf("**************************************************\n");

    return 0;
}

/home/mlxd/Exchange/ProjectFiles/HeisenbergSpin.c

A.3 OpenCL Device Kernels and Functions

#pragma OPENCL EXTENSION c l k h r f p 6 4 : enable

2

void
lshift128(uint4 input, uint shift, uint4 *output)
{
    unsigned int invshift = 32u - shift;
    uint4 temp;
    temp.x = input.x << shift;
    temp.y = (input.y << shift) | (input.x >> invshift);
    temp.z = (input.z << shift) | (input.y >> invshift);
    temp.w = (input.w << shift) | (input.z >> invshift);
    *output = temp;
}

void
rshift128(uint4 input, uint shift, uint4 *output)
{
    unsigned int invshift = 32u - shift;
    uint4 temp;
    temp.w = input.w >> shift;
    temp.z = (input.z >> shift) | (input.w << invshift);
    temp.y = (input.y >> shift) | (input.z << invshift);
    temp.x = (input.x >> shift) | (input.y << invshift);
    *output = temp;
}
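The two helpers above treat a uint4 as one 128-bit word and shift it while carrying bits across the four 32-bit lanes. A host-side C sketch of the same carry logic, using a hypothetical struct and helper name, and assuming 0 < shift < 32:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side mirror of the kernel's lshift128: four 32-bit
   lanes x..w form a 128-bit value; bits shifted out of one lane are
   carried into the next via the complementary right shift. */
typedef struct { uint32_t x, y, z, w; } u128;

static u128 lshift128_host(u128 in, unsigned shift)
{
    unsigned invshift = 32u - shift;
    u128 t;
    t.x = in.x << shift;
    t.y = (in.y << shift) | (in.x >> invshift);
    t.z = (in.z << shift) | (in.y >> invshift);
    t.w = (in.w << shift) | (in.z >> invshift);
    return t;
}
```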

kernel void suscKernel(global uint4 *input,
                       uint mulFactor, uint mulfactorB,
                       global float4 *outputPhi,
                       global float4 *outputTheta)
{
    uint p = 0;
    uint4 temp[32];
    uint globalSize = get_global_size(0);
    uint gid = get_global_id(0);
    uint4 state1 = (uint4)(0);
    uint4 state2 = (uint4)(0);
    uint4 state3 = (uint4)(0);
    uint4 state4 = (uint4)(0);
    uint4 state5 = (uint4)(0);

    uint stateMask = 1812433253u;
    uint thirty = 30u;
    uint4 mask4 = (uint4)(stateMask);
    uint4 thirty4 = (uint4)(thirty);
    uint4 one4 = (uint4)(1u);
    uint4 two4 = (uint4)(2u);
    uint4 three4 = (uint4)(3u);
    uint4 four4 = (uint4)(4u);
    uint4 r1 = (uint4)(0);
    uint4 r2 = (uint4)(0);

    uint4 a = (uint4)(0);
    uint4 b = (uint4)(0);
    uint4 e = (uint4)(0);
    uint4 f = (uint4)(0);

    unsigned int thirteen = 13u;
    unsigned int fifteen = 15u;
    unsigned int shift = 8u * 3u;
    unsigned int mask11 = 0xfdff37ffu;
    unsigned int mask12 = 0xef7f3f7du;
    unsigned int mask13 = 0xff777b7du;
    unsigned int mask14 = 0x7ff7fb2fu;
    const float one = 1.0f;
    const float intMax = 4294967296.0f;
    const float PI = 3.14159265358979f;
    uint i = 0;
    uint j = mulfactorB;

    for (p = 0; p < mulfactorB; p++)
    {
        state1 = input[gid];
        state2 = mask4 * (state1 ^ (state1 >> thirty4)) + one4;
        state3 = mask4 * (state2 ^ (state2 >> thirty4)) + two4;
        state4 = mask4 * (state3 ^ (state3 >> thirty4)) + three4;
        state5 = mask4 * (state4 ^ (state4 >> thirty4)) + four4;

        for (i = 0; i < mulFactor; ++i)
        {
            switch (i)
            {
                case 0:
                    r1 = state4;
                    r2 = state5;
                    a = state1;
                    b = state3;
                    break;
                case 1:
                    r1 = r2;
                    r2 = temp[0];
                    a = state2;
                    b = state4;
                    break;
                case 2:
                    r1 = r2;
                    r2 = temp[1];
                    a = state3;
                    b = state5;
                    break;
                case 3:
                    r1 = r2;
                    r2 = temp[2];
                    a = state4;
                    b = state1;
                    break;
                case 4:
                    r1 = r2;
                    r2 = temp[3];
                    a = state5;
                    b = state2;
                    break;
                case 5:
                    r1 = r2;
                    r2 = temp[4];
                    a = temp[0];
                    b = temp[2];
                    break;
                case 6:
                    r1 = r2;
                    r2 = temp[5];
                    a = temp[1];
                    b = temp[3];
                    break;
                case 7:
                    r1 = r2;
                    r2 = temp[6];
                    a = temp[2];
                    b = temp[4];
                    break;
                case 8:
                    r1 = r2;
                    r2 = temp[7];
                    a = temp[3];
                    b = temp[5];
                    break;
                case 9:
                    r1 = r2;
                    r2 = temp[8];
                    a = temp[4];
                    b = temp[6];
                    break;
                case 10:
                    r1 = r2;
                    r2 = temp[9];
                    a = temp[5];
                    b = temp[7];
                    break;
                case 11:
                    r1 = r2;
                    r2 = temp[10];
                    a = temp[6];
                    b = temp[8];
                    break;
                case 12:
                    r1 = r2;
                    r2 = temp[11];
                    a = temp[7];
                    b = temp[9];
                    break;
                case 13:
                    r1 = r2;
                    r2 = temp[12];
                    a = temp[8];
                    b = temp[10];
                    break;
                case 14:
                    r1 = r2;
                    r2 = temp[13];
                    a = temp[9];
                    b = temp[11];
                    break;
                case 15:
                    r1 = r2;
                    r2 = temp[14];
                    a = temp[10];
                    b = temp[12];
                    break;
                case 16:
                    r1 = r2;
                    r2 = temp[15];
                    a = temp[11];
                    b = temp[13];
                    break;
                case 17:
                    r1 = r2;
                    r2 = temp[16];
                    a = temp[12];
                    b = temp[14];
                    break;
                case 18:
                    r1 = r2;
                    r2 = temp[17];
                    a = temp[13];
                    b = temp[15];
                    break;
                case 19:
                    r1 = r2;
                    r2 = temp[18];
                    a = temp[14];
                    b = temp[16];
                    break;
                case 20:
                    r1 = r2;
                    r2 = temp[19];
                    a = temp[15];
                    b = temp[17];
                    break;
                case 21:
                    r1 = r2;
                    r2 = temp[20];
                    a = temp[16];
                    b = temp[18];
                    break;
                case 22:
                    r1 = r2;
                    r2 = temp[21];
                    a = temp[17];
                    b = temp[19];
                    break;
                case 23:
                    r1 = r2;
                    r2 = temp[22];
                    a = temp[18];
                    b = temp[20];
                    break;
                case 24:
                    r1 = r2;
                    r2 = temp[23];
                    a = temp[19];
                    b = temp[21];
                    break;
                case 25:
                    r1 = r2;
                    r2 = temp[24];
                    a = temp[20];
                    b = temp[22];
                    break;
                case 26:
                    r1 = r2;
                    r2 = temp[25];
                    a = temp[21];
                    b = temp[23];
                    break;
                case 27:
                    r1 = r2;
                    r2 = temp[26];
                    a = temp[22];
                    b = temp[24];
                    break;
                case 28:
                    r1 = r2;
                    r2 = temp[27];
                    a = temp[23];
                    b = temp[25];
                    break;
                case 29:
                    r1 = r2;
                    r2 = temp[28];
                    a = temp[24];
                    b = temp[26];
                    break;
                case 30:
                    r1 = r2;
                    r2 = temp[29];
                    a = temp[25];
                    b = temp[27];
                    break;
                case 31:
                    r1 = r2;
                    r2 = temp[30];
                    a = temp[26];
                    b = temp[28];
                    break;
                default:
                    break;
            }

            lshift128(a, shift, &e);
            rshift128(r1, shift, &f);

            temp[i].x = a.x ^ e.x ^ ((b.x >> thirteen) & mask11) ^ f.x ^ (r2.x << fifteen);
            temp[i].y = a.y ^ e.y ^ ((b.y >> thirteen) & mask12) ^ f.y ^ (r2.y << fifteen);
            temp[i].z = a.z ^ e.z ^ ((b.z >> thirteen) & mask13) ^ f.z ^ (r2.z << fifteen);
            temp[i].w = a.w ^ e.w ^ ((b.w >> thirteen) & mask14) ^ f.w ^ (r2.w << fifteen);
        }

        for (i = 0; i < (mulFactor) / 2; i = i + 1)
        {
            outputTheta[gid + globalSize * (i) + globalSize * (mulFactor / 2) * (p)] =
                PI * (convert_float4(temp[2 * i]) * one / intMax);
            outputPhi[gid + globalSize * (i) + globalSize * (mulFactor / 2) * (p)] =
                ((float4)2.0) * PI * (convert_float4(temp[2 * i + 1]) * one / intMax);
        }
        input[gid] = temp[gid % mulFactor];
    }
}
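The final loop of suscKernel maps each 32-bit random integer onto an angle by dividing by 2^32, giving theta in [0, pi) and phi in [0, 2*pi). A scalar C sketch of that mapping, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of the kernel's angle mapping (hypothetical helpers):
   a 32-bit uniform integer u is scaled to [0, 1) by dividing by 2^32,
   then stretched to [0, pi) for theta or [0, 2*pi) for phi. */
static const float PI_F = 3.14159265358979f;
static const float INT_MAX_F = 4294967296.0f; /* 2^32 */

static float uint_to_theta(uint32_t u)
{
    return PI_F * ((float)u / INT_MAX_F);
}

static float uint_to_phi(uint32_t u)
{
    return 2.0f * PI_F * ((float)u / INT_MAX_F);
}
```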

kernel void
calcKernel(global float4 *phi, global float4 *theta,
           local float4 *phiLocal, local float4 *thetaLocal,
           local float4 *cosSum, local float4 *exchSum,
           local float4 *sinProd, local float4 *samples,
           global float4 *output, float4 params)
{
    // Get ID values for indexing
    uint lid = get_local_id(0);
    uint bid = get_group_id(0);
    uint gid = get_global_id(0);

    float xi = params.y;
    uint chainLength = convert_uint(params.z);
    float K[15];
    K[0] = 1.0f;
    K[1] = 1.0f;
    K[2] = 1.0f;
    K[3] = 2.5f;
    K[4] = 3.0f;
    K[5] = 3.5f;
    K[6] = 4.0f;
    K[7] = 4.5f;
    K[8] = 5.0f;
    K[9] = 5.5f;
    K[10] = 6.0f;
    K[11] = 6.5f;
    K[12] = 7.0f;
    K[13] = 7.5f;
    K[14] = 8.0f;

    // Get size of workgroup in workitems
    uint localSize = get_local_size(0);
    uint globalSize = get_global_size(0);

    // Calculate how many samples may be taken
    uint elements = (uint)((localSize) / chainLength);

    phiLocal[lid] = phi[gid];
    thetaLocal[lid] = theta[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    cosSum[lid] = (float4)0.0f;
    exchSum[lid] = (float4)0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint j = 0; j < chainLength - 1; j++)
    {
        float4 term1 = (float4)0.0f;
        float4 term2 = (float4)0.0f;
        float4 term3 = (float4)0.0f;
        if (lid < elements)
        {
            term1 = native_sin(thetaLocal[lid + (elements) * j]) *
                    sin(thetaLocal[lid + (elements) * (j + 1)]);
            term3 = native_cos(thetaLocal[lid + (elements) * j]) *
                    cos(thetaLocal[lid + (elements) * (j + 1)]);
            term2 = native_cos(phiLocal[lid + (elements) * j] -
                               phiLocal[lid + (elements) * (j + 1)]);

            exchSum[lid] += K[j] * ((term1 * term2) + term3);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint i = 0; i < chainLength; i++)
    {
        if (lid < elements)
        {
            cosSum[lid] += native_cos(thetaLocal[lid + (elements) * i]);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    sinProd[lid] = (float4)1.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint i = 0; i < chainLength; i++)
    {
        if (lid < elements)
        {
            sinProd[lid] = sinProd[lid] *
                           native_sin(thetaLocal[lid + (elements) * i]);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid < elements)
    {
        samples[lid] = native_exp(xi * cosSum[lid] + exchSum[lid]) *
                       sinProd[lid] * (1);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint s = elements >> 1; s > 0; s >>= 1)
    {
        if (lid < s)
        {
            samples[lid] += (samples[lid + s]);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
    {
        output[bid] = samples[0];
    }
}
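The closing loop of calcKernel is a standard tree reduction: each pass halves the active range and folds the upper half into the lower, so after log2(n) passes samples[0] holds the total. A sequential C sketch of the same pattern (hypothetical helper name, assuming n is a power of two):

```c
#include <assert.h>

/* Sequential sketch of the kernel's final reduction (hypothetical
   host-side version): on each pass the stride s halves, and element
   lid accumulates element lid + s, leaving the full sum in samples[0]. */
static float tree_reduce(float *samples, unsigned n)
{
    for (unsigned s = n >> 1; s > 0; s >>= 1)
        for (unsigned lid = 0; lid < s; lid++)
            samples[lid] += samples[lid + s];
    return samples[0];
}
```

On the device the inner loop runs as parallel work-items, which is why the kernel needs a barrier after each stride to keep the passes in lockstep.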

/home/mlxd/Exchange/ProjectFiles/HeisenbergSpin Kernels.cl
