A Parallel Heterogeneous Approach to Perturbative Monte Carlo QM/MM Simulations
Sebastião Salvador de Miranda
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Pedro Filipe Zeferino Tomás,
Dr. Nuno Filipe Valentim Roma
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Pedro Filipe Zeferino Tomás
Members of the Committee: Dr. Gabriel Falcão Paiva Fernandes
October 2014
Acknowledgments
Foremost, I would like to thank my supervisors, Doctor Pedro Tomás, Doctor Nuno Roma and Doctor Frederico Pratas, who have provided me with invaluable guidance. I would also like to thank Doctor Gabriel Falcão, who reviewed the intermediate report of this dissertation and provided several insightful comments.
I would like to express my gratitude to Doctor Ricardo Mata, who enlightened me on several occasions about computational chemistry aspects, and invited me to spend a very pleasant month of research at the Free Floater Research Group, Institut für Physikalische Chemie, Georg-August-Universität Göttingen, Germany. Furthermore, I would like to thank Jonas Feldt, who helped me to achieve a greater understanding of the PMC QM/MM simulation method, and with whom I have intensively collaborated in writing research articles and developing new simulation features. I would also like to thank my colleagues Tomás Ferreirinha, David Nogueira, Francisco Gaspar, Andriy Gorobets and João Silva, with whom I have discussed a multitude of topics, doubts and ideas during the development of my dissertation. Furthermore, I would like to thank João Guerreiro and Luís Taniça for having helped me in the development of power and energy measurement techniques.
Special thanks to my girlfriend Mafalda Coelho, who has endured several months of listening to dry technical details about my dissertation. I would also like to thank my father Pedro Miranda and my mother Ana Salvador, for having discussed with me several topics on matters of biology, chemistry, physics and computation.
Finally, I would like to express my gratitude to INESC-ID and the Institut für Physikalische Chemie for having given me access to their infrastructure, namely their high performance computing platforms. Furthermore, the work presented herein was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under projects Threads (ref. PTDC/EEA-ELC/117329/2010) and P2HCS (ref. PTDC/EEI-ELC/3152/2012).
ABSTRACT
Molecular simulations play an increasingly important role in computational chemistry, computational biology and computer-aided drug design. However, traditional single-core implementations hardly satisfy current needs, since they do not exploit the intrinsic data and task parallelism of these methods, often resulting in prolonged runs. To address this limitation, a new heterogeneous parallel solution to Monte Carlo (MC) molecular simulations is herein introduced, exploiting fine-grained parallelism in the inner structure of the bottleneck procedures and coarse-grained parallelism in the MC state-space sampling. Unlike typical high-performance pure Quantum Mechanics (QM) or Molecular Mechanics (MM) parallelization approaches, the work herein presented focuses on accelerating a novel Perturbative Monte Carlo (PMC) mixed QM/MM application. The hybrid nature of the proposed parallel approach warrants an efficient use of heterogeneous systems, composed of single or multiple CPUs and heterogeneous accelerators (e.g., GPUs), by relying on the multi-platform OpenCL programming framework. To efficiently exploit the parallel architecture, load balancing schemes were employed to schedule the work among the available accelerators. A speed-up of 56× is achieved in the computational bottleneck for a relevant chorismate dataset, when compared with an optimized single-core implementation. A speed-up of 38× is observed for the full simulation, using both multi-core CPUs and GPUs, thus effectively reducing the execution time of the full simulation from ∼80 hours to ∼2 hours.
Keywords
Quantum Mechanics (QM), Molecular Mechanics (MM), Monte Carlo (MC) Simulations, Parallel
Computing, Heterogeneous Computing, OpenCL.
RESUMO
Molecular simulations play an increasingly important role in computational chemistry and biology and in computer-aided drug design. However, traditional single-core implementations have very prolonged execution times, as they do not exploit the data and task parallelism intrinsically present in some of these methods. To overcome this limitation, this work introduces a parallel and heterogeneous solution for Monte Carlo (MC) molecular simulations, exploiting fine-grained parallelism in the inner structure of the computational bottleneck and coarse-grained parallelism in the sampling of the MC state space. Unlike typical high-performance approaches to pure Quantum Mechanics (QM) or Molecular Mechanics (MM) algorithms, this work concentrates on the acceleration of a novel Perturbative Monte Carlo (PMC) mixed QM/MM method. The hybrid nature of the proposed parallel approach enables the use of heterogeneous architectures, composed of one or several CPUs and heterogeneous accelerators (e.g., GPUs), by taking advantage of the multi-platform OpenCL library. To efficiently exploit heterogeneous architectures, load balancing schemes were applied to distribute the computational load among the available accelerators. A speed-up of 56× is achieved in the computational bottleneck for a chorismate dataset that is relevant in the field, when compared with an optimized single-core implementation. For the complete simulation, a speed-up of 38× is observed, taking advantage of multi-core CPUs and GPUs. The total time of this simulation was thus reduced from ∼80 hours to ∼2 hours.
Palavras Chave
Quantum Mechanics, Molecular Mechanics, Monte Carlo Simulations, Parallel Computing, Heterogeneous Computing, OpenCL.
LIST OF FIGURES
2.7 Example of a heterogeneous network composed of several compute nodes, each comprising multiple Central Processing Unit (CPU) cores and one or more specialized accelerators.
3.1 A system composed of one QM molecule (C) and two MM solvent molecules (A and B). For each MC step, the difference in energy between the molecule moved (A) and every other molecule has to be computed, but at different levels of theory.
3.2 Perturbative Monte Carlo QM/MM with focus on the simulation bottleneck (PMC cycle, right). Arrows represent data dependencies.
3.3 Main data structures used in the PMC Cycle. Refer to Table 3.1 for parameter definitions.
3.4 Data dependencies within the PMC Cycle. The VDW QMMM and Coulomb Nuclei QMMM processes only read the atoms that are part of the QM molecule, not the whole lattice.
4.1 Independent MC state-space exploration chains (illustrative example for 2 chains), each generating an independent sampling of the conformal space of the target QM/MM system.
4.2 MC State-Space alongside with the execution timeline for three Markov chains.
4.4 Multi-process/multi-threading structure of the designed parallel solution for the PMC method (right), alongside with the original dual-process approach (left).
4.5 Program flow of the devised parallel PMC program, for the case of a single-device single-process instance (in order to keep the illustration clear). The legend for the numbered parts of this figure is presented throughout the text.
4.6 mol2atom data structure, together with the lattice vectors. The mol2atom structure returns the index of the first atom belonging to the target molecule, which can then be used to index the lattice vectors, which contain the {x, y, z, σ, ε, q} data.
4.7 Original approach to distance computation (left), together with the devised on-the-fly solution (right). For the sake of clarity, the distance computation procedures were singled out, although they are executed in the same computation loop as the Coulomb/VDW procedures. The remaining procedures of the PMC Cycle step have been omitted for the sake of clarity.
5.1 Mapping of the PMC Cycle procedures into OpenCL Kernels. It should be noticed that some procedures were merged into the same kernel. Furthermore, the OpenCL version requires additional kernels for the parallel reductions (mm finish and q3m finish, marked with a ∗).
5.2 Memory layout example for the main data structures used in the PMC Cycle.
5.3 Diagram for the devised monte carlo kernel, together with the layout of the data which is manipulated in this procedure.
5.4 Scheme used for partitioning the grid among the work-groups, in order to allow a coalesced memory access pattern. For the sake of keeping the illustration clear, an example for P = 2 and wgsize = 4 is shown.
5.5 q3m c and q3m finish kernels structure. In this example, work-group 0 is presented with additional detail, although all work-groups share an identical structure. Likewise, the 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable. Furthermore, additional details concerning the first global memory accesses (label 1) are depicted in Figure 5.4.
5.6 q3m vdwc kernel structure. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
5.7 decide update kernel diagram. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
5.8 Exploiting multiple heterogeneous OpenCL devices to execute the PMC Cycle. The execution is balanced by executing different kernels on each device and dividing the work of the heavier kernels (q3m c and q3m reduce).
6.1 Time footprint for a single PMC Cycle step for the bench-A dataset running on the avx2-baseline.
6.2 One complete PMC outer iteration, comprised of 10k PMC Cycle steps and a QM Update, for the bench-A dataset running on the avx2-baseline. The bottleneck of each PMC iteration is the PMC Cycle.
6.3 Speed-up obtained for a PMC Cycle with 10k iterations, when using fp64-fp32 mixed precision. The corresponding execution times are presented in Table 6.3.
6.4 OpenCL kernel timings (per step) for the PMC Cycle running on the mcx2 heterogeneous platform. The load is balanced for the heavier kernels (q3m c/q3m finish, corresponding to Coulomb QM/MM), whereas the lighter kernels were scheduled to the first Graphical Processing Unit (GPU). The considered benchmark is bench-A, using mixed fp64-fp32 precision.
6.5 Convergence pattern of the implemented load balancing algorithm (balancing every 2000 steps), for bench-C running on the GTX 780Ti/660Ti platform (mcx2). The presented PMC cycle time measurements represent mean times since the previous balancing.
6.6 Scalability of the PMC Cycle when changing the size of the QM part in bench-A. Speed-up results are presented for a dual GTX680 system with respect to a single GTX680 (platform mcx4).
6.7 QM/MM Simulation box for the bench-R dataset (partial representation), together with the simulation results for the conversion of the chorismate structure into prephenate.
LIST OF TABLES
3.1 QM/MM Run Characterization, together with the typical parameter range for the benchmarks considered in this work. For the case of homogeneous solvents, the Z_MM^(i) parameter (concerning molecule i) will be the same for every MM molecule.
5.1 Complexity of communication and synchronization overheads, with respect to the QM/MM system characteristics and to run parameters.
6.1 Considered QM/MM benchmark datasets. The chemical aspects of bench-R are presented in detail in [11].
6.2 Considered execution platforms in the experimental evaluation.
6.3 Execution time (in seconds) for a PMC Cycle with 10k steps, on several hardware platforms, when using fp64-fp32 mixed precision. The column "Total" corresponds to the complete execution times of the PMC Cycle (10k steps), including the final serial overhead of reading back and writing the output to a file. This overhead is discriminated in column "Output". The presented execution times correspond to a median among four experimental trials, for each platform configuration.
6.4 Kernel execution times obtained on the GTX780Ti accelerator and on the reference avx2-baseline platform, for the particular case of bench-A. The speed-up with respect to the avx2-baseline is also presented, together with the fraction of the PMC Cycle (%) each kernel represents.
6.5 bench-R execution time for the PMC Cycle (50k steps) and QM Update (24.8M iters) stages, as well as for the full PMC simulation. The presented results consider two baselines and four parallel solutions, with either a single or 8 Markov chains and fp64 or fp64-fp32 precision.
6.7 Speed-up of the mixed precision q3m c kernel versions versus the original fp64 version, running on the same machine, for the case of bench-A.
6.8 Obtained numerical precision. The error is shown for the ΔE^C_QM/MM energy term, as well as for the total energy of the system (E), when considering the e_m = 1.0 × 10^-1 kJ/mol maximum error. The average values were taken from the complete set of generated QM/MM systems, by using bench-A.
6.9 Execution time speed-up, energy savings and average power consumption, when comparing the Tesla K20C GPU running all the devised numerical precision approaches with the avx2-baseline (with the original fp64 precision). The testbench was run on the K20C GPU for 100k steps, in order to ensure a representative sampling of the computational cost of q3m c. The default core frequency configuration was used for all experiments.
devices in a heterogeneous environment and allows writing portable code across different architectures. Thus, OpenCL was chosen for the work proposed herein.
OpenCL is organized in a hierarchy of models [22]: Platform Model, Execution Model, Memory Model and Programming Model. Each of these models is explained in the following sections. The OpenCL framework includes the OpenCL compiler (OpenCL C), the OpenCL platform layer and the OpenCL
Runtime. In this project, the newest available OpenCL standard was used for each device (OpenCL 1.1
for the considered Nvidia GPUs and OpenCL 1.2 for the Intel CPUs and AMD GPUs).
2.3.1 Platform Model
The Platform Model defines how a program maps onto the OpenCL platform, which is an abstract
hardware representation of the underlying device. As depicted in Figure 2.4, the platform model is
composed of a Host connected to one or multiple OpenCL devices. An OpenCL device is a collection of
CUs, which in turn are divided into one or more PEs³, where the computation is done. The code that runs
on the host uses the OpenCL Runtime to interface with the OpenCL device, to which it may enqueue
synchronization commands, data or kernels. A kernel is a function written in OpenCL C, and can be
compiled before or during program execution. Within each CU, PEs can execute either in SIMD or
Single Program Multiple Data (SPMD) fashion. In the former, PEs execute in lock-step, whereas in the
latter PEs keep their own program counter and may follow independent execution paths.
Figure 2.4: OpenCL Platform Model [22].
2.3.2 Execution Model
An OpenCL program executes in two main components: host code running on the host device and kernel code that runs on each OpenCL Device over an index space. Kernel instances are called work-items and are further grouped into work-groups. Each work-item has a unique identifier in the global index space and in the local index space (local to each work-group). Index spaces are called NDRanges and can have 1, 2 or 3 dimensions; thus, the local and global indices are 1-, 2- or 3-dimensional vectors. In Figure 2.5, an example of this organization is depicted for 2 dimensions. For GPU devices, the best performance should be attained when the work-group size is an integer multiple of the warp size (NVIDIA) or the wavefront size (AMD), because this is the minimum execution granularity supported. Failing to meet this criterion leaves part of the last warp or wavefront idle, i.e., warp-size − (work-group-size mod warp-size) execution lanes perform no useful work.
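As an illustration of this indexing scheme, the following OpenCL C kernel is a generic sketch (the kernel name and arguments are illustrative, not part of the PMC implementation) that scales an array of n floats, one element per work-item:

__kernel void scale_array(__global float *data,
                          const float factor,
                          const int n)
{
    /* Position of this work-item in the 1-D global index space;
       get_local_id(0) and get_group_id(0) would give the index inside the
       work-group and the work-group index, respectively. */
    size_t gid = get_global_id(0);

    /* The global size is usually rounded up to a multiple of the chosen
       work-group size, so out-of-range work-items must do nothing. */
    if (gid < (size_t)n)
        data[gid] = factor * data[gid];
}

On the host side, such a kernel would be launched with clEnqueueNDRangeKernel, with the global size rounded up to a multiple of the selected work-group size.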
To support different devices with different thread management systems, OpenCL employs a relaxed
synchronization and memory consistency model. This way, execution of work-items is not guaranteed
to follow any specific order. Nevertheless, explicit work-group barrier instructions can be placed in the
kernel code to ensure execution synchronization between work-items of the same work-group. Synchronization of work-items belonging to different work-groups is not possible during the same kernel launch, a behaviour depicted in Figure 2.6. Memory consistency details are explored in Section 2.3.3.

³This structure (and naming) closely resembles the one for AMD devices, presented in Section 2.2.1.

Figure 2.5: Partitioning of work-items into work-groups.
Figure 2.6: Work-group synchronization and memory consistency in OpenCL: work-group (WG) barriers synchronize the work-items of the same work-group within a kernel launch, whereas global synchronization and memory consistency between work-groups are only guaranteed between kernel launches.
Another important concept is the OpenCL Context, which includes a collection of OpenCL Devices, a set of kernels, a set of Programs (the source and compiled binaries that implement the kernels) and a set of Memory Objects. Associated with a Context are one or more Command Queues, via which the host enqueues execution, memory and synchronization commands to the OpenCL Devices. A queue may be set as in-order or out-of-order, which defines whether the order in which commands are enqueued must be respected or not.
2.3.3 Memory Model
The OpenCL standard defines four memory region types, each having different rules for access and
allocation:
(i) Global Memory: This memory region is accessible by all work-items for read/write operations.
Furthermore, the OpenCL-Host has read/write access and is responsible for dynamic memory al-
location. This memory may either be cached or not, depending on the target architecture. AMD
SI-GPU and newer NVIDIA devices, for example, have global memory caches accessible by each
CU. Global memory read/write consistency between work-items of the same work-group is only
guaranteed if they encounter a global work-group barrier. Conversely, there is no guarantee of
memory consistency across different work-groups, during the execution of a kernel. This behavior
is depicted in Figure 2.6.
(ii) Constant Memory: Memory accessible by all work-items for read operations, remaining constant during the execution of a kernel. Like Global Memory, the Host has read/write access and is responsible for (dynamic) memory allocation. Constant memory is usually cacheable (e.g., in the Kepler architecture it is implemented as a configurable fraction of the L1 cache) and typically has a lower average access latency with respect to Global Memory.
(iii) Local Memory: This memory region is shared by work-items of the same work-group for read/write
operations. Allocation can be done either statically by a kernel or dynamically by the Host (although
the Host cannot access this memory region). It is usually implemented as dedicated memory in
each CU, but in some devices it can also be mapped into Global Memory. In AMD SI-GPU, this
memory is mapped into LDS (see Figure 2.2), whereas in Nvidia’s Kepler architecture it is mapped
into the Shared Memory (see Figure 2.3). Local memory is only consistent between work-items of
the same work-group after they encounter a local work-group barrier, as depicted in Figure 2.6.
(iv) Private Memory: Memory region private to each work-item, for read/write access. Neither the Host nor other work-items can access this memory. It must be statically allocated in the kernel and is usually implemented as registers in each CU. A short kernel sketch touching all four of these memory regions is shown below.
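The following OpenCL C sketch (a generic example with assumed names, not a kernel from the PMC implementation) uses the four address-space qualifiers: it reads values from global memory, scales them by a constant-memory factor, stages them in local memory and writes them back in reversed order within each work-group:

__kernel void wg_reverse(__global float *data,        /* global memory, host-allocated      */
                         __constant float *scale,     /* constant memory, read-only         */
                         __local float *tile)         /* local memory, one tile per WG      */
{
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;

    float v = data[base + lid] * scale[0];             /* 'v' lives in private memory */
    tile[lid] = v;

    /* Local memory only becomes consistent across the work-group
       after a local barrier (cf. Figure 2.6). */
    barrier(CLK_LOCAL_MEM_FENCE);

    data[base + lid] = tile[lsz - 1 - lid];
}

The __local buffer is sized by the host with clSetKernelArg(kernel, 2, bytes, NULL), illustrating the dynamic host-side allocation of local memory mentioned above.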
2.3.4 Programming Model
The OpenCL standard supports two programming models: Data Parallel and Task Parallel. In the Data Parallel programming model, parallelism is exploited by executing the same set of operations in parallel over a large collection of data. Considering computation over data in an array, each work-item executes an instance of the kernel on one array index (strictly data parallel model) or on several (relaxed data parallel model). Hierarchical partitioning of work-items into work-groups can be defined explicitly by the programmer or implicitly by the OpenCL implementation.
Conversely, in the Task Parallel programming model, a single instance of the kernel is executed, and parallelism can be extracted by using the vector types supported by the device or by enqueueing multiple tasks (different kernels) to the Device. Intel SSE/AVX/AVX2 vector instructions, for example, can be generated by writing operations with OpenCL vector types (e.g., float4, int4).
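The sketch below (illustrative only, not from the PMC kernels) shows how such vector types look in practice; on an OpenCL CPU device, the float4 arithmetic can be mapped by the compiler onto SSE/AVX lanes:

__kernel void saxpy_float4(__global const float4 *x,
                           __global float4 *y,
                           const float a)
{
    size_t i = get_global_id(0);
    /* Each work-item operates on four packed floats at once;
       OpenCL defines arithmetic operators component-wise on vector types. */
    y[i] = a * x[i] + y[i];
}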
2.3.5 OpenCL Runtime Parametrization
To account for the existing heterogeneity, the OpenCL Host can query the underlying platform, through the OpenCL library, for the available devices and their specific characteristics. As an example, the preferred elementary work-group size of each device can be queried, typically returning the warp size (32) for Nvidia GPUs and the wavefront size (64) for AMD GPUs. For Intel OpenCL-compatible CPUs, this number is usually equal to (or higher than) 64 [24]. According to the results obtained from this device discovery process, different work-group partitioning schemes may be used for each device (e.g., number of work-items, work-group size, amount of data per work-item, etc.). Furthermore, to enable inter-platform portability, the OpenCL framework offers the possibility of compiling the developed kernels at run time, allowing different compilation flags or kernel versions to be chosen according to the target platform.
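A minimal host-side sketch of this discovery step is shown below (illustrative only, with an assumed helper name and no error handling), using standard queries available since OpenCL 1.1:

#include <stdio.h>
#include <CL/cl.h>

/* Query the parameters that typically guide work-group partitioning:
   number of compute units, maximum work-group size and the preferred
   work-group size multiple of an already-built kernel. */
static size_t query_partitioning_hints(cl_device_id dev, cl_kernel krn)
{
    cl_uint cus = 0;
    size_t max_wg = 0, multiple = 0;

    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    /* Typically 32 (warp) on Nvidia GPUs and 64 (wavefront) on AMD GPUs. */
    clGetKernelWorkGroupInfo(krn, dev, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);

    printf("CUs: %u, max work-group: %zu, preferred multiple: %zu\n",
           (unsigned)cus, max_wg, multiple);
    return multiple;
}

Run-time compilation itself is performed with clBuildProgram, whose options string allows passing per-platform flags (e.g., -cl-fast-relaxed-math) or preprocessor definitions that select a kernel variant.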
2.4 Load Balancing Techniques
When considering the trade-offs between the multi-core CPU and GPU architectures, it makes sense
to attempt a simultaneous exploitation of these computational platforms, by scheduling the workload to
the device best suited for each particular task. Figure 2.7 depicts an example network of heterogeneous
computing nodes, each comprising multiple CPU cores and one or more specialized accelerators. In particular, these accelerators can be GPUs with very different compute capabilities, or even other types of hardware platforms (e.g., FPGAs, DSPs).

Figure 2.7: Example of a heterogeneous network composed of several compute nodes, each comprising multiple CPU cores and one or more specialized accelerators.

In such a heterogeneous environment, HPC applications
frequently call for load balancing mechanisms to distribute the workload among the available processing
nodes. A simple and insightful way of posing a typical load balancing problem is the following: Consider
a cluster of p processing nodes; let t_i(d_i^k) be the time taken by node i to compute over the assigned data d_i^k at iteration k, where i ∈ {0, …, p − 1}. The objective is that, at some iteration k = b, all devices take the same time to compute their assigned load, i.e., t_i(d_i^b) = t_j(d_j^b) for every {i, j} node pair. Specialized
algorithms may take into account other performance metrics, such as consumed power [29] or inter-node communication latency [28]. Furthermore, while some publications aim to present generic load balancing methods, others focus on offering a solution for specific applications or scientific fields.
Load balancing algorithms found in the literature can typically be classified according to some fun-
damental characteristics [12]. First of all, the load balancing solution can either be Static or Dynamic.
Static [28] implementations evaluate the characteristics of the application and the target hardware plat-
form (either at compile-time or run-time) and make the workload distribution based on these data. For
example, in [28] the authors introduce an algorithm to find a subset of computing nodes in a complex
network that form an optimal virtual ring network, classifying candidate nodes by considering the pro-
cessing capabilities of each one and the bandwidth of the respective communication links. Conversely,
dynamic [10, 42] load balancing solutions take into account one or more performance metrics (e.g. time, power, accuracy) measured at run-time and dynamically adjust the workload distribution to best fit the
heterogeneous platform. For example, in [42], a dynamic load balancing algorithm is devised for the
Adaptive Fast Multipole Method (AFMM) method, which is a solver for n-body problems (e.g. Colliding
Galaxies, Fluid Dynamics). In order to balance the load in a cluster composed of 10 CPUs and 4 GPUs,
an adaptive decomposition of the particle space is employed, and is modified dynamically according to
a performance model that predicts the performance of future iterations using previous execution time
measurements.
Secondly, load balancing algorithms can either be Centralized [9, 10, 12] or Decentralized [28, 53].
The former concentrate load balancing decisions in one monitoring node that schedules the work among
the cluster, whereas the latter rely on local decisions made on each computing node (possibly using
information from neighbour nodes) to distribute the workload among them. Furthermore, centralized load
balancing algorithms can be further classified as either Task-Queue [10] or Predicting-The-Future [9,
12, 42]. Task-queue algorithms rely on partitioning the work-load into several smaller tasks, which are
continually fetched by the computing nodes. Although they are a relatively simple solution to implement,
a high speed communication link is required between the node managing the task-queue and every
other computing node, since tasks are usually required to be fetched frequently (to ensure a fine-grained
balancing). Conversely, predicting-the-future approaches schedule the work depending on performance
measurements of past iterations. If the balancing solution is well implemented (and the target algorithm allows it), these approaches can converge to a stabilized workload distribution and cease to require intensive inter-node communication.
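As a concrete illustration of the predicting-the-future idea, the following C sketch re-partitions a fixed amount of work among p devices in proportion to the throughput measured since the last balancing point. It is only a minimal sketch of the general principle, with hypothetical names; the actual algorithms employed in this dissertation are described later.

#include <stddef.h>

/* Redistribute 'total_work' items among 'p' devices in proportion to the
   throughput observed in the previous period; 'share[i]' is the amount of
   work assigned to device i for the next period. */
void rebalance(int p, const size_t done[], const double elapsed_s[],
               size_t total_work, size_t share[])
{
    double speed[p];                                   /* C99 variable-length array */
    double speed_sum = 0.0;
    for (int i = 0; i < p; ++i) {
        speed[i] = (double)done[i] / elapsed_s[i];     /* items per second */
        speed_sum += speed[i];
    }

    size_t assigned = 0;
    for (int i = 0; i < p; ++i) {
        share[i] = (size_t)((double)total_work * (speed[i] / speed_sum));
        assigned += share[i];
    }
    share[p - 1] += total_work - assigned;             /* absorb rounding error */
}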
Considering the importance of load balancing methods for scheduling the workload among hetero-
geneous devices, two balancing solutions were employed in the parallelization solution devised in this
dissertation. The first is a task-queue algorithm with a distributed balancing decision, whereas the sec-
ond is a centralized predicting-the-future dynamic load balancing approach. Details about these two
algorithms will be presented in Section 6.2 and Chapter 5.
2.5 Summary
In this chapter, an overview of both state-of-the-art CPU and GPU hardware was presented, and the
architectural differences between the two platform families were discussed. Following, an overview of
the OpenCL programming framework was introduced, highlighting the structure of the framework, and
the opportunities it offers to exploit a wide range of accelerators. The advantages of exploiting heterogeneous platforms comprised of CPU and GPU devices, together with the wide availability of these computational resources among scientific research groups, led to the choice of targeting these types of systems. In this respect, a brief review of the literature on load balancing solutions was presented, covering typical approaches to the problem of efficiently scheduling the workload in a heterogeneous computing environment. Further details about the particular load balancing algorithms employed in this dissertation are presented in the following chapters.
The Perturbative Monte Carlo QM/MM algorithm is a molecular simulation procedure designed to
study mixed QM/MM simulations. These simulations usually consider a circumscribed region of interest
(often referred to as active site) and an immersive environment. It takes a QM/MM system as input and
outputs other chemically viable configurations of the same system, sampled with the Metropolis Monte
Carlo rule [32]. As introduced earlier in this dissertation, the Metropolis MC method samples the sys-
tem in the ensemble space, rather than following a time coordinate. With this method, a sequence of
random configurations is obtained on the basis of Maxwell-Boltzmann statistics, by performing random movements at each step and by evaluating the corresponding change of the system energy. The
underlying methods for calculating this energy can be traditional MM, QM or mixed QM/MM methods.
MM approaches represent atoms and molecules through ball and spring models, with heavily param-
eterized functions to describe their interactions. Alternatively, QM approaches explicitly simulate the
electrons, at the cost of a much higher computational burden. An alternative solution (which is employed
in the algorithm herein studied) consists of a mixed QM/MM approach, which combines the strengths of
each method. In this case, a small active region is simulated with QM, while the remaining environment
is represented by classical MM. Achieving a comprehensive understanding of the algorithm structure
represents a fundamental step to devise the best parallelization approach. Accordingly, a brief char-
acterization of the QM/MM simulations under study, together with an overview of the PMC method, is
presented in this chapter. Then, a computational complexity analysis is conducted, and the strategy applied for the algorithm parallelization is introduced. Finally, the related work on accelerating molecular
simulation algorithms is discussed.
3.1 Algorithm Description
For the purpose of describing the PMC method, a chemical solution composed of a solute (region of interest) and a solvent (environment) will be herein considered as an example. Accordingly, Figure 3.1 depicts a schematic of such a system, comprising a single solute molecule (molecule C), which is
treated at the QM level, and two solvent molecules (molecules A and B), treated at the MM level. Then,
by applying a Metropolis MC step, one of these molecules will be randomly picked, translated and
rotated to generate a new structure. This last MC step will be either accepted or rejected, according to
the resulting energy change. MC steps are accepted if the energy of the obtained configuration is lower than that of the previous reference configuration (which is the last accepted configuration), or accepted with a probability¹ e^(−ΔE/(k_B T)) if the energy of the system has risen.
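In code, this acceptance test amounts to a few lines; the C routine below is a generic sketch of the Metropolis criterion (not the PMC source), where rand01() stands in for the simulation's random number generator and the energy difference is assumed to be given in joules:

#include <math.h>
#include <stdlib.h>

/* Uniform random number in [0, 1); a simple stand-in for the simulation RNG. */
static double rand01(void) { return (double)rand() / ((double)RAND_MAX + 1.0); }

/* Metropolis criterion: moves that lower the energy are always accepted;
   uphill moves are accepted with probability exp(-dE / (kB * T)). */
static int metropolis_accept(double dE, double T)
{
    const double kB = 1.380649e-23;   /* Boltzmann constant, J/K */
    if (dE <= 0.0)
        return 1;
    return rand01() < exp(-dE / (kB * T));
}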
As depicted in Figure 3.1, the energy change is computed by considering two types of interactions
with the changed molecule (e.g. molecule A), resulting in either QM/MM energy terms or pure MM
terms. The QM/MM terms account for the interaction with the QM solute (molecule C), whereas the MM
energy terms account for the interaction with every other solvent molecule (in this case, just molecule
B). Furthermore, for both levels of theory (QM/MM or pure MM), Coulomb and Van der Waals (vdW)
contributions have to be considered.
¹Boltzmann distribution, where k_B stands for the Boltzmann constant and T for the temperature.
Figure 3.1: A system composed of one QM molecule (C) and two MM solvent molecules (A and B). For each MC step, the difference in energy between the molecule moved (A) and every other molecule has to be computed, but at different levels of theory.
Figure 3.2: Perturbative Monte Carlo QM/MM with focus on the simulation bottleneck (PMC cycle, right). Arrows represent data dependencies.
Furthermore, Figure 3.2 illustrates the dataflow of the target PMC method. In each PMC Cycle
(right), K_Cycle Monte Carlo steps of the MM subsystem are executed, while keeping the QM region static. In another process, the electronic density of the QM region is updated (QM Update) by using MOLPRO [55], and the result is subsequently used in the next PMC Cycle. As described earlier, the
system energy variation has to be computed at each MC step (henceforth referred to as a PMC Cycle
step), given by the expression:
ΔE = ΔE^C_MM + ΔE^vdW_MM + ΔE^vdW_QM/MM + ΔE^(C,nuclei)_QM/MM + ΔE^(C,grid)_QM/MM    (3.1)
where each partial ∆E term corresponds to an energy contribution computed in a particular PMC Cy-
cle procedure (see Figure 3.2). As previously introduced, each PMC Cycle step consists of selecting, translating and rotating a random MM molecule, computing ΔE (see Equation 3.1), and checking the current QM/MM system configuration for acceptance. In order to store the obtained results, every F_output iterations (see Table 3.1), the current QM/MM configuration is written to an output file. Moreover, despite
being a good example for illustrating the algorithm, Figure 3.1 only depicts a very small system. Conversely, a more general QM/MM run will have a much higher number of molecules, and is characterized by the parameters presented in Table 3.1.
Table 3.1: QM/MM Run Characterization, together with the typical parameter range for the benchmarks considered in this work. For the case of homogeneous solvents, the Z_MM^(i) parameter (concerning molecule i) will be the same for every MM molecule.

Input QM/MM System
Parameter    | Description                                    | Typical Range
N_QM         | Number of QM grid points                       | [10^5, 10^7]
N_MM         | Number of MM molecules                         | 10^3
Z_QM         | Number of atoms in the QM region               | [10, 10^2]
Z_MM^(i)     | Number of atoms per MM molecule                | [1, 10]
A_MM         | Number of MM atoms                             | [10^3, 10^4]

Run Parameters
K_PMC        | Number of PMC iterations                       | [10^4, 10^7]
K_Cycle      | Number of PMC cycle steps (per PMC iteration)  | [10, 10^3]
The Coulomb QM/MM energy computation is of particular interest, since it is the most computationally intensive calculation in each PMC Cycle. This energy contribution is accounted for by two distinct terms, ΔE^(C,nuclei)_QM/MM and ΔE^(C,grid)_QM/MM. The former accounts for the interaction with the atoms of the QM molecule represented by classic nuclei-centred charges, whereas the latter accounts for the interaction with the QM electronic wave represented by a grid of point charges (henceforth referred to as grid). Between the two, the ΔE^(C,grid)_QM/MM term is considerably more computationally intensive (see Section 3.2 for more details) and corresponds to a discretization of the integral shown in Equation 3.2, where Z_MM and N_QM follow the definitions given in Table 3.1, ρ(·) is the electronic density function, q the charge and r the distance between the changed molecule and each grid point.
ΔE^(C,grid)_QM/MM = Σ_j^(Z_MM) ∫ ρ(r) (q_j / r_ij) dr   −−GRID−→   Σ_j^(Z_MM) Σ_i^(N_QM) (q_i q_j / r_ij)    (3.2)
The pseudo-code for the Coulomb Grid QM/MM energy computation (ΔE^(C,grid)_QM/MM) is presented in Algorithm 1. As shown, for each {atom, grid point} pair (considering the atoms of the displaced molecule), the Coulomb potential is computed. Furthermore, since periodic QM/MM systems (defined by a repeatable simulation box) are herein considered, the spatial range of the considered electrostatic interactions (i.e., Coulomb, vdW) has to be limited by a cutoff distance (r_c). Accordingly, shifted potentials (V_shift) [16] are used in the ΔE^(C,grid)_QM/MM interaction terms
V_shift = 1/r − 1/r_c + (1/r_c²)(r − r_c),   if r < r_c
V_shift = 0,                                 if r ≥ r_c        (3.3)
affecting each term differently, depending on the distance between each {atom, grid point} pair (r), and completely disregarding (setting to 0) the interaction whenever r ≥ r_c. The usage of shifted potentials can be observed in Algorithm 1, resulting in four possible space regions depending on the distance between the considered grid point and both the old and the new set of coordinates of each atom of the displaced molecule. Hence, four slightly different energy expressions (resulting from the application of
V_shift) may be computed. As discussed further in this dissertation, the procedure presented in Algorithm 1 will be one of the main targets of parallelization.
Algorithm 1 Coulomb Grid QM/MM energy (ΔE^(C,grid)_QM/MM). See Table 3.1 for parameter definitions.
Define: atom := {position = {x, y, z}, chemical params = {σ, ε, q}}
Init: Energy = 0.0
Init: r_c → Coulomb cutoff (run parameter)
 1: for each atom i in changed molecule do                 [ Z_MM^(chmol) cycles ]
 2:   for each point j in charge grid do                   [ N_QM cycles ]
 3:     r_old = distance(i, j) in reference system
 4:     r_new = distance(i, j) in new system
 5:     qs = −q_i × q_j
 6:     if r_new < r_c and r_old < r_c then
 7:       Energy += qs × (1/r_new − 1/r_old + (1/r_c²) × (r_new − r_old))
 8:     else if r_new < r_c and r_old ≥ r_c then
 9:       Energy += qs × (1/r_new − 1/r_c + (1/r_c²) × (r_new − r_c))
10:     else if r_old < r_c then
11:       Energy −= qs × (1/r_old − 1/r_c + (1/r_c²) × (r_old − r_c))
12:     end if
13:   end for
14: end for
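For reference, a plain C transcription of Algorithm 1 is sketched below (a serial, illustrative version with hypothetical names and data layout, not the thesis source code); it follows the same four-branch shifted-potential structure:

#include <math.h>

typedef struct { double x, y, z, q; } Charge;   /* coordinates and charge */

static double dist(const Charge *a, const Charge *b)
{
    double dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* Coulomb Grid QM/MM energy difference between the displaced molecule in its
   new coordinates (mol_new) and its reference coordinates (mol_old), against
   all grid points, using the shifted potential of Equation (3.3). */
double coulomb_grid_qmmm(const Charge *mol_new, const Charge *mol_old, int z_chmol,
                         const Charge *grid, int n_qm, double rc)
{
    double energy = 0.0;
    double inv_rc = 1.0 / rc, inv_rc2 = 1.0 / (rc * rc);

    for (int i = 0; i < z_chmol; ++i) {           /* Z_MM^(chmol) iterations */
        for (int j = 0; j < n_qm; ++j) {          /* N_QM iterations         */
            double r_old = dist(&mol_old[i], &grid[j]);
            double r_new = dist(&mol_new[i], &grid[j]);
            double qs = -mol_new[i].q * grid[j].q;

            if (r_new < rc && r_old < rc)
                energy += qs * (1.0 / r_new - 1.0 / r_old + inv_rc2 * (r_new - r_old));
            else if (r_new < rc)                  /* r_old >= rc */
                energy += qs * (1.0 / r_new - inv_rc + inv_rc2 * (r_new - rc));
            else if (r_old < rc)                  /* r_new >= rc */
                energy -= qs * (1.0 / r_old - inv_rc + inv_rc2 * (r_old - rc));
        }
    }
    return energy;
}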
3.2 Computational Complexity Analysis
The computational complexity of the PMC QM/MM method depends on the complexity of the program procedures that comprise the PMC Cycle (see Figure 3.2). The Monte Carlo Step procedure, which consists of rotating and translating a random molecule (henceforth referred to as chmol), has a complexity proportional to the size of that MM molecule

O(Monte Carlo Step) = Z_MM^(chmol)    (3.4)
usually having a very light execution time footprint. On the other hand, the complexity of the Coulomb
Grid QM/MM procedure (Algorithm 1) is proportional to the product of the size of chmol by the number
of grid points
O(Coulomb Grid QM/MM) = Z_MM^(chmol) × N_QM    (3.5)
which will usually be the most time-consuming procedure, since N_QM is typically a large number. The other two Coulomb computations have identical algorithm structures, although the involved data differs. Coulomb Nuclei QM/MM uses nuclei-centred point charges instead of the electronic grid, and thus its complexity is given by:

O(Coulomb Nuclei QM/MM) = Z_MM^(chmol) × Z_QM    (3.6)
yielding a much lower complexity in comparison with Coulomb Grid QM/MM. On the other hand, Coulomb MM computes the interaction between each atom of chmol and each atom of every other MM molecule. Hence, its complexity is given by:

O(Coulomb MM) = Z_MM^(chmol) × Σ_i^(N_MM) Z_MM^(i)    (3.7)
which would simplify to (Z_MM^(chmol))² × N_MM for the case of homogeneous solvents. The total number of MM atoms (A_MM) may also be used in this text as the complexity variable, considering that A_MM = Σ_i^(N_MM) Z_MM^(i), for either homogeneous or heterogeneous solvents.
The vdW energy calculations have a completely different energy expression, as presented in Algo-
rithm 2, which shows the pseudo-code for vdW MM . Likewise, the vdW QM/MM procedure shares
an identical structure, although it loops over the QM nuclei centred charges instead of the MM atoms.
Similarly to the Coulomb computations, the vdW procedures have a nested for-loop structure with four
cutoff branches. Thus, the complexity of the vdW procedures is identical to their Coulomb counterparts,
yielding the following expressions:
O(VDW MM) = Z_MM^(chmol) × Σ_i^(N_MM) Z_MM^(i)    (3.8)

O(VDW QMMM) = Z_MM^(chmol) × Z_QM    (3.9)
Finally, each PMC Cycle step terminates with an update of the current system reference and output
writing. The reference update complexity is proportional to the size of the changed molecule, whereas
the output saving is proportional to the total number of MM atoms over the writing frequency:
O(Update Reference) = Z_MM^(chmol)    (3.10)

O(Output XYZ) = (Σ_i^(N_MM) Z_MM^(i)) / F_output    (3.11)
Having in mind the typical magnitude of the QM/MM parameters (see Table 3.1), one can deduce that the most computationally intensive procedure is the Coulomb Grid QM/MM. This will be taken into consideration when parallelizing the PMC program procedures. By accounting for all the PMC Cycle procedures, the complexity of one PMC Cycle step results in:

O(PMC Cycle) = Z_MM^(chmol) × (N_QM + Z_QM + A_MM) + A_MM / F_output    (3.12)
by recalling that A_MM = Σ_i^(N_MM) Z_MM^(i). Considering the typical ranges for these parameters (see Table 3.1), one can observe that the leading term will be Z_MM^(chmol) × N_QM. In particular, the N_QM parameter will have the heaviest footprint on the resulting complexity.
3.3 Data Dependencies
The PMC Cycle operates over three main data structures, which are depicted in Figure 3.3. Firstly, the changed molecule (chmol), which is composed of Z_MM^(chmol) atoms, each represented by three-dimensional Cartesian coordinates x, y, z and the chemical constants σ, ε, q. Secondly, the QM grid, which is composed of N_QM point charges, each also represented by Cartesian coordinates and a charge² (q). Finally, the MM lattice, which comprises all MM molecules, including the chmol data before the MC step takes place, and the QM molecule represented with classical MM nuclei (Z_QM atoms).

²In this case, the charge is not constant, because it is modified (along with the coordinates) by the QM Update process.
Algorithm 2 VDW MM energy (ΔE^vdW_MM). See Table 3.1 for parameter definitions.
Define: atom := {position = {x, y, z}, chemical params = {σ, ε, q}}
Init: Energy = 0.0
Init: r_c → van der Waals cutoff (run parameter)
 1: for each atom i in changed molecule do                    [ Z_MM^(chmol) cycles ]
 2:   for each atom j in every other MM molecule do           [ Σ_j^(N_MM) Z_MM^(j) = A_MM cycles ]
 3:     r_old = distance(i, j) in reference system
 4:     r_new = distance(i, j) in new system
 5:     if r_new < r_c and r_old < r_c then
 6:       Energy += √(ε_i × ε_j) × ((σ_i × σ_j / r_new²)⁶ − (σ_i × σ_j / r_old²)⁶ − (σ_i × σ_j / r_new²)³ + (σ_i × σ_j / r_old²)³)
 7:     else if r_new < r_c and r_old ≥ r_c then
 8:       Energy += √(ε_i × ε_j) × ((σ_i × σ_j / r_new²)⁶ − (σ_i × σ_j / r_new²)³)
 9:     else if r_old < r_c then
10:       Energy −= √(ε_i × ε_j) × ((σ_i × σ_j / r_old²)⁶ − (σ_i × σ_j / r_old²)³)
11:     end if
12:   end for
13: end for
Figure 3.3: Main data structures used in the PMC Cycle: chmol (Z_MM^(chmol) × {x, y, z, σ, ε, q}), grid (N_QM × {x, y, z, q}) and lattice ((N_MM × Z_MM + Z_QM) × {x, y, z, σ, ε, q}), together with the MC constants. Refer to Table 3.1 for parameter definitions.
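One possible C declaration of these structures is sketched below (an illustrative array-of-structures form with assumed type and field names; the memory layout actually adopted for the OpenCL implementation is discussed in Chapter 5):

#include <stddef.h>

/* Illustrative declarations for the three main PMC Cycle inputs of Figure 3.3;
   parameter names follow Table 3.1. */
typedef struct { double x, y, z, sigma, epsilon, q; } MMAtom;
typedef struct { double x, y, z, q; } GridPoint;

typedef struct {
    MMAtom    *chmol;      /* Z_MM^(chmol) atoms: the molecule moved in the MC step   */
    GridPoint *grid;       /* N_QM point charges, rewritten by each QM Update         */
    MMAtom    *lattice;    /* all MM molecules plus the QM nuclei (A_MM + Z_QM atoms) */
    size_t     n_chmol, n_grid, n_lattice;
} PMCSystem;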
Figure 3.4 shows the data dependencies of each process in the PMC Cycle. In particular, the data
structure corresponding to the changed molecule (chmol) is written by the Monte Carlo Step and sub-
sequently read by all the energy calculation procedures, which compute their respective ∆E energy
terms to be processed by the Decide & Update procedure. Then, if the step under consideration is
accepted, the lattice corresponding to the MM Region (see Figure 3.1) is updated with the tested chmol
configuration, and a new Monte Carlo Step may take place. Unlike the other data structures, the grid
corresponding to the QM Region (see Figure 3.1) is not modified within the PMC Cycle. Instead, it is up-
dated by the QM Update process. Hence, considering the described data dependencies within the PMC
Cycle, it is observed that the energy contribution procedures can be executed in parallel with respect
to each other. Furthermore, the energy calculations are themselves internally amenable to parallelism, as each of them can be mapped onto a parallel reduction structure. For the particular case of the Coulomb Grid QM/MM
procedure, this can be verified by inspecting Algorithm 1, although the other energy calculations share
the same structure, apart from the energy expression and the involved data (e.g., see Algorithm 2).
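To make this reduction structure concrete, the OpenCL C sketch below is a generic per-work-group tree sum (not one of the kernels devised in Chapter 5): it accumulates per-work-item partial energies into one value per work-group, and a second, much smaller pass would then sum the per-work-group partials. A power-of-two work-group size is assumed.

__kernel void reduce_partials(__global const float *partial_in,
                              __global float *partial_out,
                              __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    /* Stage one partial energy per work-item in local memory. */
    scratch[lid] = partial_in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction: halve the number of active work-items each step. */
    for (size_t s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* One result per work-group. */
    if (lid == 0)
        partial_out[get_group_id(0)] = scratch[0];
}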
Figure 3.4: Data dependencies within the PMC Cycle. The VDW QMMM and Coulomb Nuclei QMMM processes only read the atoms that are part of the QM molecule, not the whole lattice.

Having this in mind, the PMC Cycle is the main target of study and parallelization in this work. In this respect, several OpenCL kernels were devised to extract the available parallelism in the PMC Cycle procedures, as well as a capable Host-side management framework to schedule the work among the
available computational resources. The internal implementation of the QM Update will be kept mostly
unchanged³, apart from simple add-ons to accelerate inter-process communication. Nevertheless, a
scalable multiple Markov chain solution, which exploits parallelism in the MC state-space sampling, was
designed to accelerate the QM Update procedure. Chapters 4 and 5 discuss the devised solution in
detail.

³The MOLPRO program suite is a closed-source commercial tool performing extremely complex calculations in the QM Update procedure. Besides not having the main code available (aside from user scripts), it is not the bottleneck of the PMC QM/MM, and thus optimizing it is out of the scope of this work.
3.4 Related Work
Due to the computational complexity of molecular simulation procedures, there has been substantial research work on their acceleration. The literature describing this work can be grouped by: i) the nature of the employed sampling, ii) the type of theory used for the energy calculations and iii) the chemical application for which the algorithms have been tuned. The employed sampling is usually performed in time (MD) or in state-
space (MC) and the energy interactions may consider pure QM, pure MM and mixed QM/MM terms.
Furthermore, for the same state-space sampling strategy, several variants may be considered. For
the case of MC sampling, this includes (among other possible approaches) the Diffusion Monte Carlo
(DMC) [34], the Variational Monte Carlo (VMC) and the PMC [51]. Finally, the application for which the algorithm has been tuned may vary greatly, and this is the main reason why the performance gains attained in the parallelization of the algorithms in this field can seldom be compared to each other.
MD is a popular approach to studying a wide range of dynamical properties, and it has led to several acceleration works, dating from the early days of GPGPU [17, 48] to more recent publications [31, 38, 44].
On the other hand, methods based on MC sampling allow simulating systems with longer timescales,
and several works have also accelerated these algorithms by following GPGPU approaches [2, 3, 13,
23, 30, 52]. Our work falls into the latter category (MC) and therefore we shall present a more detailed
review of those works.
The work in [2] presents a GPGPU solution for Quantum Monte Carlo (QMC), achieving up to 30×
speed-up in individual kernels and up to 6× speed-up in the overall execution. The QMC variety that is
considered by such research is based on DMC, unlike the PMC approach followed in our work. They
employ a scheme for simultaneous state-space exploration (each chain being called a walker), similar to the multiple Markov chain approach that is herein adopted. However, they emphasise exploiting a high amount of parallelism at the walker level (up to 16 simultaneous walker evaluations on the GPU), whereas we focus on exploiting the finer-grained parallelism within each chain (which is heavy enough to keep the GPU busy, in our case), and manage chain-level parallelism with fewer chains per GPU. We took this approach since spawning a very large number of chains on the same GPU would be unfeasible for the PMC method, as each chain must compute not only the MC step trials (in this case, the PMC Cycle procedures), but also the intrinsically serial QM Update process.
In [13], the authors discuss a GPGPU parallel approach to continuum QMC, by considering the
DMC. They target Nvidia GPUs by using the CUDA framework, and MPI to schedule the work among
computational clusters, exploiting walker-level and data-level parallelism, and achieving full-application
speed-ups from 10× to 15× with respect to a quad-core Xeon CPU implementation. However, unlike the
work herein described, they do not target QM/MM systems, focusing only on QM applications.
The work described in [52] uses MC sampling (based on Variational Monte Carlo) and targets
QM/MM systems, by exploiting computational clusters composed of heterogeneous nodes. Accord-
ingly, since their performance bottleneck is on the calculation of the electrostatic potential, they use
GPUs to handle the bottleneck code and CPUs for the remaining procedures, obtaining a speed-up of
up to 23.6× versus a single-core CPU. The adopted GPGPU framework is CUDA, and an MPI solution is shown to scale up to 4 CPU cores. They do not report any explicit load balancing solution, nor do they target
the simultaneous exploitation of heterogeneous GPU platforms, contrary to the work herein presented.
In [3], the authors describe a CUDA GPGPU implementation for many-particle simulations using MC
sampling. They partition the particle set into several cells and apply, in parallel, many MC steps that are known not to interfere with each other. They do not target QM/MM systems. Instead, tests are performed for a "hard disk" system (two-dimensional particles which cannot overlap), and the considered particle interactions are the physical collisions. Unlike physical collisions, the electrostatic potentials considered in our work have a much longer range, and as such the energy terms computed at each MC step depend on a much larger number of neighbouring molecules (the potential cutoffs are about half of the simulation box). Therefore, such a scheme would not be effective to solve the problem that is herein
considered, as most MC steps would interfere with each other. The work presented in [23] also describes
a parallel approach to particle MC simulations using CUDA, without any emphasis on QM/MM systems.
Finally, the work in [38] targets QM/MM simulations, although time sampling (MD) is used instead,
and a special focus is given to accelerating the QM grid generation, achieving up to 30× speed-up. This
contrasts with what happens in the PMC, where the bottleneck is found in the QM/MM electrostatics (the
PMC Cycle), which is significantly accelerated by our implementation.
Before concluding, it is worth recalling that direct performance comparisons are difficult to make in this field, and very few authors attempt them in the literature. Furthermore, very few have considered the
usage of heterogeneous architectures for hybrid QM/MM simulations, whilst using MC sampling. Our
solution efficiently takes advantage of the hybrid nature of QM/MM simulations and the MC state-space
exploration, unlike typical pure QM or MM approaches.
Most existing works adopted CUDA as the programming framework, being constrained to Nvidia GPUs. To circumvent this limitation, other frameworks have been developed to ease the programming of non-conventional architectures, such as StarPU [5] and OpenCL [22]. Due to its simpler means to orchestrate multiple devices in a heterogeneous environment and to write portable code between different architectures, the latter was used in this work. Moreover, by allowing an easy extension with the MPI framework, the proposed approach still leaves open the possibility of exploiting further performance scalability at the chain level, since the most challenging fine-grained part, consisting of the parallelization of the PMC Cycle, has already been overcome.
3.5 Summary
In this chapter, a brief characterization of the QM/MM simulations under study, together with an
overview of the PMC method, was presented. Then, a computational complexity analysis of the PMC Cy-
cle procedures was conducted, revealing the computational bottlenecks and concluding that the Coulomb Grid QM/MM procedure is the most computationally intensive step of the PMC Cycle. In particular, the
dominating term was shown to be the number of QM grid points (NQM ). Following, a description of the
data dependencies present in the PMC Cycle was presented, laying out the basis for the paralleliza-
tion strategy presented in the following Chapters. Finally, the related work on accelerating molecular
simulation algorithms was discussed and commented on. It was concluded that, despite the vast diversity of research in this particular field of application, there is still room for novel contributions from this dissertation. In particular, heterogeneous architectures have seldom been considered, and the usage of the multi-platform, multi-paradigm OpenCL framework, as well as the targeting of the particular PMC QM/MM method, are among the novel contributions of the work herein presented.
The objective of this work is to accelerate the execution of the PMC QM/MM algorithm, by exploiting heterogeneous platforms composed of a multi-core CPU and one or more OpenCL accelerators (e.g.,
GPUs). In this Chapter, a top-level description of the devised parallel solution is introduced. Firstly, the
original PMC QM/MM approach (developed at the Free Floater Research Group) is briefly described.
Then, an introduction on applying Markov chain theory to MC simulations is herein presented, focusing
on the particular case of the PMC QM/MM simulation method. After this, the overall structure of the
parallelization strategy is laid out, discussing details about the developed OpenCL Host program and
the work-flow of the complete application. Then, a coarse-level load balancing solution to schedule the
Markov-Chain workload among heterogeneous devices is described. Finally, a few preliminary data-
structure optimizations are discussed. A detailed description of the developed OpenCL Kernels, as well
as a second load balancing algorithm for scheduling finer-grained workloads, is presented in Chapter 5.
4.1 Original PMC QM/MM
The starting point for the parallelization study developed in this dissertation was the original PMC
QM/MM algorithm implementation, provided by the Free Floater Research Group - Computational Chemistry and Biochemistry, Institut für Physikalische Chemie, Georg-August-Universität Göttingen. This original approach was designed to run on a single-core CPU, by executing two interleaving UNIX processes: the PMC Cycle and the QM Update, which communicated via a file (hard-disk I/O) between PMC iterations. The PMC Cycle was developed at the Free Floater Research Group, and the complete C++ source was made available for this work. The QM Update comprises a few FORTRAN user scripts (also developed at the Free Floater Research Group) which call MOLPRO routines. In contrast with the other program parts, the MOLPRO program suite is a closed-source commercial tool, which performs extremely complex calculations in the QM Update procedure. Since the code is not available for optimization, and since this procedure is not the bottleneck of the original PMC QM/MM (as will be shown in Figure 6.2), optimizing it was deemed to be out of the scope of this work. Nevertheless, a method
for executing several instances of the QM Update in parallel will be introduced in this dissertation, by
exploring multiple Markov chain parallelism, a topic discussed in the following section.
4.2 Exploiting Markov Chain Parallelism
In the context of the Metropolis MC sampling method [32], a sequence of accepted steps is called a Markov chain [19, 20]. For the particular case of the PMC QM/MM algorithm, a Markov chain represents a sequence of accepted QM/MM system configurations, which are generated by independent instances of the PMC (PMC Cycle + QM Update). As depicted in Figure 4.1, several independent MC state-space exploration chains may coexist, each generating an independent sampling of the conformational space of the target QM/MM system.
The exploitation of multiple Markov chains in general-purpose MC methods has been addressed in several works [6, 45], and even in the context of a CPU-GPU environment [57]. In the next subsections, details for the particular case of exploiting Markov chain parallelism in the PMC QM/MM method are presented.
Figure 4.1: Independent MC state-space exploration chains (illustrative example for 2 chains), each generating an independent sampling of the conformational space of the target QM/MM system.
To keep the devised approach as general as possible, and considering the vast diversity of computational platforms commonly available today, two distinct QM/MM simulation scenarios deserve particular attention: running fewer Markov chains than the available number of OpenCL accelerators, and the opposite case. The former is typically found in many-node computational clusters, since these hardware platforms may have more computing nodes than the number of independent Markov chains one wishes to spawn in order to achieve the desired statistical properties of the MC sampling. To address this case, specially tailored load balancing approaches are required, since data from the same Markov chain exploration context has to be shared between several (possibly heterogeneous) nodes. The approach for balancing the work of a single Markov chain among several devices is presented in Chapter 5.
4.2.1 Multiple Markov Chain Parallelism
As introduced earlier, several MC state-space instances can be sampled by running several Markov chains in parallel, thus allowing the simultaneous execution of the respective PMC Cycles. Furthermore, this technique also allows executing the QM Update processes of the several chains in parallel. Since the PMC Cycle is the bottleneck of the PMC QM/MM method (as will be shown in Figure 6.2) and provides several opportunities to extract task-level and data-level parallelism (see Section 3.3), it will be executed on OpenCL accelerators. On the other hand, since it is an intrinsically serial procedure, the QM Update will be executed by spawning independent Markov chain instances on multiple CPU cores. The MC state-space sampling layout corresponding to this approach is shown in Figure 4.2 (left), together with the corresponding execution time-flow (right). Although the depicted example corresponds to three independent chains, this number can scale with the available computational resources, as more OpenCL accelerators and CPU cores are added to a given hardware configuration.
Figure 4.2: MC state-space exploration alongside the execution timeline for three Markov chains.
It is important to note that, although the PMC Cycle was the computational bottleneck in the original implementation, the high performance speed-ups attained in the acceleration of this procedure considerably reduced its execution time (more details in Chapter 6). Therefore, depending on the considered acceleration platform, the ratio between the execution times of the QM Update ($t(QM_{update})$) and the PMC Cycle ($t(PMC_{cycle})$) may vary considerably. With this in mind, and by observing Figure 4.2, one can conclude that the maximum number of independent Markov chains that can be spawned depends on the $t(QM_{update})/t(PMC_{cycle})$ ratio in the following manner:

\[ max_{chains} = \#Accelerators \times \frac{t(QM_{update})}{t(PMC_{cycle})} + 1 \tag{4.1} \]

where the ratio $t(QM_{update})/t(PMC_{cycle})$ represents the number of (accelerated) PMC Cycles required to keep the OpenCL accelerator occupied while the CPU is handling the QM Update (for the sake of keeping the example in Figure 4.2 as simple as possible, a ratio of 2 was assumed, although larger ratios are usually observed in real datasets - see Chapter 6). Moreover, $max_{chains}$ will also be limited by the number of CPU cores available to run the QM Updates. Since the QM Update process relies heavily on disk I/O, the performance of the Host CPU may start to degrade when a higher number of processes is spawned (as shall be shown in Chapter 6).
The multiple Markov chain parallelism strategy presented in [57], and in some of the works discussed in Section 3.4, relies on a very high number of Markov chains to exploit parallelism in the many-core GPU architecture. In the particular case of the approach introduced in [57], a GPU thread is spawned to manage each Markov chain. This approach would be unfeasible for the PMC method, since each chain requires computing not only the MC step trials (in this case, the PMC Cycle procedures), but also the intrinsically serial QM Update process. To tackle this limitation, the approach herein presented focuses instead on exploiting task-level and data-level parallelism in each PMC Cycle step (as will be discussed in Chapter 5), as well as chain-level parallelism, by scheduling the tasks associated with each Markov chain (PMC Cycle and QM Update) among multiple CPU cores and OpenCL accelerators.
4.3 Parallelization Strategy
By considering the multiple Markov chain parallelism method introduced in Section 4.2.1 and the PMC Cycle data dependency analysis presented in Section 3.3, three levels of parallelism can be extracted in the PMC QM/MM method: i) running several independent Markov chains (chain-level parallelism); ii) executing the PMC Cycle procedures in parallel with respect to each other (task-level parallelism); iii) executing the inner iterations of each procedure in parallel, for different sections of the dataset (data-level parallelism). In this respect, Figure 4.3 depicts the exploitation of these levels of parallelism in the PMC QM/MM method. As discussed in Section 4.2.1, the PMC Cycle will be executed on OpenCL accelerators, whereas the QM Update will be executed by spawning independent Markov chain instances on multiple CPU cores. To accomplish this approach, the devised parallel solution is mainly composed of: i) a C++ Host-side CPU program (henceforth referred to as the Host-Program) to manage the OpenCL devices and the QM Update processes; ii) a UNIX pipe interface written in C to manage communications between the Host-Program and the QM Update procedures (replacing the original file-based communication); iii) a set of OpenCL kernels to accelerate the PMC Cycle execution.
Figure 4.3: Simultaneous exploitation of chain-level, task-level and data-level parallelism in the PMC QM/MM method.
The described management approach was taken for several reasons. Firstly, a centralized Host-Program approach was adopted, since the hardware setup targeted in this thesis is a single compute node composed of multi-core CPUs and heterogeneous GPUs, a case in which the overhead of centralized management is not a problem. Although the presented approach could be scaled to a multi-node computing environment (e.g., using MPI), this was not considered a priority in this dissertation, since a single-node heterogeneous system already allows a fairly extensive study of the employed parallelization and load balancing schemes. Secondly, the original file-based communication system was substituted by UNIX pipes in order to: i) free the disk from I/O burden as much as possible (since the MOLPRO program package used in the QM Update already uses the hard-drive intensively for temporary files); ii) provide a faster communication medium (if sufficient memory is available, the pipe inter-process communication transfers are executed via main memory). For the QM Update side of communications, a FORTRAN/C binding was used, and all the pipe communication code was developed in C, due to the easier access to system functions from this language.
Figure 4.4: Multi-process/multi-threading structure of the designed parallel solution for the PMC method (right), alongside the original dual-process approach (left).
Finally, since it is the bottleneck of the PMC QM/MM method, particular focus was given to accelerating the PMC Cycle procedure with OpenCL kernels. Being the main target of acceleration in this work, the PMC Cycle acceleration will be discussed in greater detail in Chapter 5. Likewise, in order to keep the description of the devised approach manageable, this chapter focuses on the top-level parallel approach, leaving a more detailed description of the finer-grained parallelism exploitation and load balancing to Chapter 5.
4.3.1 OpenCL Host-Side Management
Figure 4.4 presents the original dual-process PMC approach1, alongside the multi-process/multi-threading structure of the designed solution. In the latter approach, the PMC Host-Process is mainly composed of: i) a centralized thread to manage synchronization and balancing among all OpenCL devices (OCLManager, label 1); ii) a thread dedicated to managing the OpenCL command queue operations for each device (OCLDevice, label 2); iii) a thread dedicated to each Markov chain (OCLChain, label 3), responsible for managing inter-process communication between the PMC Host-Process and the QM Update processes (label 4). To accomplish inter-thread synchronization, mutex and condition variable primitives were used. Furthermore, inter-process synchronization and communication were accomplished via UNIX pipes, connecting each OCLChain thread to the corresponding QM Update process. This pipe mechanism was implemented to substitute the original file-based (disk I/O) communication system (Figure 4.4, left).
1 Despite being the original PMC implementation, it is not used as the performance baseline in this dissertation, since it would not allow a representative assessment of the performance gains with respect to the devised solution. The performance baseline is defined in Chapter 6.
Figure 4.5 depicts the execution work-flow of the parallel PMC program, for the case of a single-device single-process instance (in order to keep the example manageable). The program starts by reading the input file (step 1, Figure 4.5), containing the run configurations and the input lattice and grid structures that will serve as starting references for the MC sampling. Then, the QM process is created (step 2) via an execlp() call, and a UNIX pipe is opened between this process and the Host-Program to enable inter-process communication. Next, the Host-Program queries the underlying hardware for the available OpenCL platforms (step 3) and attempts to open an OpenCL context for each of them. This discovery process respects several user-provided heuristics, such as allowing only certain device types (e.g., GPUs, CPUs) or setting a maximum number of selected devices. The OpenCL buffers are then allocated on the selected device (step 4), and the starting references for the grid and lattice are transferred to the device (step 5) via an OpenCL command queue. Then, the first PMC Cycle (comprising $K_{cycle}$ steps) is executed on the OpenCL device (step 6), and the resulting lattice configuration and $\Delta E$ term are read back to the Host-Program. The latter then communicates these data to the QM process via the UNIX pipe (step 7), which in turn executes the QM Update (step 8). After this, the obtained grid configuration is sent back to the Host-Program, which finally transfers it to the device, starting the next PMC Cycle. This concludes one PMC iteration. The described work-flow is repeated for $K_{PMC}$ iterations, after which the saved configurations are read back and printed to an output file. Since the OpenCL device may have limited memory, the saved configurations are actually read back to the host periodically, according to the device's maximum memory. A minimal sketch of the process creation and pipe communication of steps 2 and 7 is given below; the following subsection describes how the Markov-chain workload is balanced among multiple devices.
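The sketch below illustrates steps 2 and 7 under simplifying assumptions: the QM-script name (qm_update.sh), the use of stdin/stdout as the pipe endpoints and the message layout are hypothetical, and error handling is reduced to a minimum.

// Sketch (C/C++): spawn the QM Update process and communicate through UNIX pipes,
// replacing the original file-based (disk I/O) exchange.
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int to_qm[2], from_qm[2];                    // Host -> QM and QM -> Host pipes
    if (pipe(to_qm) < 0 || pipe(from_qm) < 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                              // child: becomes the QM Update process (step 2)
        dup2(to_qm[0], STDIN_FILENO);            // reads the lattice/dE sent by the host
        dup2(from_qm[1], STDOUT_FILENO);         // writes the updated grid back
        close(to_qm[1]); close(from_qm[0]);
        execlp("./qm_update.sh", "qm_update.sh", (char*)NULL);   // hypothetical QM-side script
        perror("execlp"); _exit(1);
    }
    close(to_qm[0]); close(from_qm[1]);

    double dE = 0.0;                             // step 7: send the accepted lattice and dE term ...
    write(to_qm[1], &dE, sizeof dE);
    read(from_qm[0], &dE, sizeof dE);            // ... and wait for the refreshed grid once the QM Update finishes
    return 0;
}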
4.3.1.A Load Balancing Among Multiple Markov Chains
Since the results produced by each Markov chain are equivalent, the chains may be sampled for different numbers of steps with respect to each other. Therefore, balancing the execution of the Markov chains across the different OpenCL devices is accomplished via a simple algorithm that works as follows:
1. Access a shared task-queue. If there are no tasks left, finish execution and skip step 2.
2. Execute the task taken from the task-queue and return to step 1.
In this approach, the balancing decision is distributed across the OCLDevice threads, although a centralized task-queue is employed to keep record of the available workload, as sketched below. This algorithm does not fit perfectly into the classification scheme presented in Section 2.4, although it can be considered a task-queue distributed balancing approach.
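A minimal sketch of this scheme follows, assuming a hypothetical Task type that represents the work of one Markov chain; each OCLDevice thread repeatedly pops work from the shared queue until it is empty.

// Sketch (C++): centralized task-queue polled by the per-device threads.
#include <mutex>
#include <queue>

struct Task { int chain_id; };                 // hypothetical unit of Markov-chain work

class TaskQueue {
    std::queue<Task> tasks_;
    std::mutex m_;
public:
    void push(const Task &t) { std::lock_guard<std::mutex> l(m_); tasks_.push(t); }
    bool pop(Task &t) {                        // step 1: access the shared queue
        std::lock_guard<std::mutex> l(m_);
        if (tasks_.empty()) return false;      // no tasks left: the caller finishes execution
        t = tasks_.front(); tasks_.pop();
        return true;
    }
};

void ocl_device_thread(TaskQueue &queue) {
    Task t;
    while (queue.pop(t)) {
        // step 2: run the PMC Cycle of chain t.chain_id on the device owned by this thread
    }
}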
4.4 Data Structure Optimizations
Before describing the fine-grained parallelization strategy in detail (which will be presented in Chapter 5), it is worth discussing the preliminary optimizations made to the original serial code. These optimizations were employed to ensure that the obtained acceleration results were not inflated due to under-performance of the serial baseline (more on the baseline definition in Chapter 6).
Figure 4.5: Program flow of the devised parallel PMC program, for the case of a single-device single-process instance (in order to keep the illustration clear). The legend for the numbered parts of this figure is presented throughout the text.
In this respect, some algorithm modifications that were made in the parallel version were later ported to the serial baseline, whenever such optimizations also led to a decreased serial execution time.
4.4.1 Indexing Molecules and Atoms
As introduced in Section 3, the pure MM electrostatic interactions, which are computed at every PMC Cycle step, consider the interaction between chmol and every other MM molecule stored in the lattice (see Algorithm 2). In the original PMC implementation, the data structures employed to store the elements of the lattice were: i) 6 vectors with $A_{MM}$ entries, storing the parameters {x, y, z, σ, ε, q} for each MM atom; ii) a vector with $A_{MM}$ entries that returned the molecule id, given the atom index (henceforth referred to as atom2mol). This approach caused many inefficient looping cycles, since one had to loop through every pair of atoms {i, j}, access atom2mol[i] and atom2mol[j], and then check whether any of these atoms belonged to chmol. As depicted in Algorithm 3, this wastes many cycles just to find the atoms that belong to chmol.
Algorithm 3 Original interaction loop: $(A_{MM}^2 - A_{MM})/2$ cycles
1: for each atom $i \in [0, A_{MM} - 1[$ do
2:   for each atom $j \in [i+1, A_{MM}[$ do
3:     if atom2mol[i] != chmol and atom2mol[j] != chmol then
4:       continue;
5:     end if
6:     compute interaction ...
7:   end for
8: end for
To address the described inefficiency, an additional data structure was introduced to allow mapping a specific molecule to the list of its respective atoms (henceforth referred to as mol2list). The usage of this new structure reduced the total number of cycles for the MM interaction computations from $(A_{MM}^2 - A_{MM})/2$ to $Z_{MM}^{(chmol)} \times A_{MM}$, which is a much smaller number (see Table 3.1). The resulting iteration structure is presented in Algorithm 4. Since it enabled a faster execution of the electrostatic computations, this improvement was added to the performance baseline used in this work.
Algorithm 4 Improved interaction loop (using mol2list): $Z_{MM}^{(chmol)} \times A_{MM}$ cycles
1: for each atom $i$ in chmol atoms: $i \in [0, Z_{MM}^{(chmol)}[$ do
2:   for each atom $j \in [0, A_{MM}[$ do
3:     compute interaction ...
4:   end for
5: end for
The structure later used in the parallel version was slightly adapted, as depicted in Figure 4.6. Instead of returning a list with the member atoms, this new structure (henceforth referred to as mol2atom) returns the index of the first atom belonging to the target molecule, which can then be used to index the lattice vectors containing the {x, y, z, σ, ε, q} data.
Figure 4.6: mol2atom data structure, together with the lattice vectors. The mol2atom structure returns the index of the first atom belonging to the target molecule, which can then be used to index the lattice vectors containing the {x, y, z, σ, ε, q} data.
This structure is more suitable for GPU platforms, since it keeps the fast access to the atoms of a target molecule that the mol2list structure provided, while also offering the possibility of reading the MM atoms directly from the lattice vectors in a coalesced fashion, as illustrated in the sketch below.
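A simplified, serial view of this access pattern follows; the function and variable names are assumptions, and mol2atom is assumed to hold one extra sentinel entry pointing one past the last atom, so that molecule m spans [mol2atom[m], mol2atom[m+1][ in the lattice vectors.

// Sketch (C++): structure-of-arrays lattice indexed through mol2atom.
#include <vector>

struct Lattice {                               // one 1D vector per atom parameter
    std::vector<double> x, y, z, q, sigma, eps;
};

double mm_interactions(const Lattice &lat, const std::vector<int> &mol2atom,
                       int chmol, int a_mm /* total number of MM atoms */) {
    double e = 0.0;
    for (int i = mol2atom[chmol]; i < mol2atom[chmol + 1]; ++i) {   // Z_MM^(chmol) atoms
        for (int j = 0; j < a_mm; ++j) {                            // A_MM atoms
            if (j >= mol2atom[chmol] && j < mol2atom[chmol + 1])
                continue;                      // skip chmol's own atoms in this simplified sketch
            e += lat.q[i] * lat.q[j];          // placeholder for the actual Coulomb/VDW expression
        }
    }
    return e;
}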
4.4.2 Computing Distances
As introduced in Section 3, the Coulomb MM and VDW MM procedures include the computation of the Cartesian distances between chmol and every other MM molecule, in both the new (after the MC step) and the old system configurations (see Algorithm 2). To save computing operations, the original PMC implementation maintained two distance buffers: one to store the distances between all the atoms in the reference system (old-dists), and another to store the distances in the configuration currently being tested (new-dists). Both buffers were implemented as a symmetric matrix with $A_{MM}^2$ entries, such that entry {i, j} stores the same value as entry {j, i}. By using this mechanism (depicted in Figure 4.7, left), only the new-dists buffer had to be updated after a new MC step, since the distances in the reference system were already stored in the old-dists buffer. Hence, only half of the distance operations have to be executed, resulting in a total of $Z_{MM}^{(chmol)} \times A_{MM}$ calculations. However, to maintain these buffers, additional memory operations had to be performed at the decision step: a) if the MC step is accepted, the old-dists buffer has to be updated with the new distances computed for chmol; b) on the other hand, if the step is rejected, the new-dists buffer needs to be restored to its original state, since it will have to be used again in the next step. Either of these options results in $2 \times Z_{MM}^{(chmol)} \times A_{MM}$ memory operations, since one has to restore every new/old-dists[m][n] entry for $m = chmol$, $n \in [0, A_{MM}[$ and for $m \in [0, A_{MM}[$, $n = chmol$. Considering the specific case of a GPU platform and a new/old-dists buffer implemented as a 1D vector with $A_{MM}^2$ entries, the first memory operation would result in $A_{MM}$ coalesced memory writes, whereas the second would result in $A_{MM}$ non-coalesced memory writes. The latter might introduce significant overhead on a GPU platform, which, when also considering the quadratic memory requirement of these buffers ($2 \times A_{MM}^2$), indicates that the described approach to distance computation is not suitable for GPU platforms.
In order to address this problem, the alternative approach presented in Figure 4.7 (right) was devised.
Figure 4.7: Original approach to distance computation (left), together with the devised on-the-fly solution (right). For the sake of clarity, the distance computation procedures were singled out, although they are executed in the same computation loop as the Coulomb/VDW procedures; the remaining procedures of the PMC Cycle step have been omitted.
This solution exploits the large number of compute units available in typical many-core GPU platforms to compute all the necessary distance operations on-the-fly at every iteration, totaling $2 \times Z_{MM}^{(chmol)} \times A_{MM}$ calculations. By computing these additional terms, the distance buffers cease to be required, avoiding both the quadratic memory requirement and the overhead of updating them. Furthermore, since those buffers would need to be persistent between MC iterations, using them on GPU platforms would require reading and writing them from global memory2, whereas in the on-the-fly version the distance values are generated and consumed in the local scope of the Coulomb/VDW MM procedures, which trades many main memory operations for register operations. Moreover, the on-the-fly version also proved to be more efficient on the CPU platform used as the baseline for this work (more details in Section 6), and was therefore also included there (a simplified sketch is shown below).
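A simplified serial sketch of the on-the-fly variant follows (the OpenCL kernels apply the same idea per work-item); periodic-box handling is omitted, only the Coulomb contribution is written out (the vdW term follows the same pattern), and all names are illustrative. The arrays x, y, z hold the trial coordinates (only chmol atoms differ from the reference), while xo, yo, zo hold the reference coordinates.

// Sketch (C++): distances recomputed on the fly inside the Coulomb/VDW MM loop,
// replacing the persistent new-dists/old-dists buffers.
#include <cmath>

double delta_e_coulomb_mm(const double *x,  const double *y,  const double *z,
                          const double *xo, const double *yo, const double *zo,
                          const double *q, int first, int last /* chmol atom range */, int a_mm) {
    double dE = 0.0;
    for (int i = first; i < last; ++i) {
        for (int j = 0; j < a_mm; ++j) {
            if (j >= first && j < last) continue;          // skip chmol's own atoms in this sketch
            double d_new = std::sqrt((x[i]-x[j])*(x[i]-x[j]) + (y[i]-y[j])*(y[i]-y[j])
                                   + (z[i]-z[j])*(z[i]-z[j]));     // trial configuration
            double d_old = std::sqrt((xo[i]-x[j])*(xo[i]-x[j]) + (yo[i]-y[j])*(yo[i]-y[j])
                                   + (zo[i]-z[j])*(zo[i]-z[j]));   // reference configuration
            dE += q[i]*q[j]*(1.0/d_new - 1.0/d_old);       // new minus old Coulomb contribution
        }
    }
    return dE;        // 2 x Z_MM^(chmol) x A_MM distance operations, no persistent distance buffers
}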
4.5 Summary
In this chapter, a top-level description of the devised parallel solution was introduced. Firstly, the original PMC QM/MM approach (developed at the Free Floater Research Group) was briefly described, and the main pitfalls of that approach were commented on. Then, an introduction to applying Markov chain theory to the particular case of the PMC QM/MM simulation method was presented. After this, the overall structure of the parallelization strategy was laid out, discussing the structure of the developed OpenCL Host-Program and the work-flow of the complete application.
2 Considering the GPU platforms used in this dissertation.
Then, a coarse-level load balancing solution to schedule the Markov-chain workload among heterogeneous devices was described. Finally, a few preliminary data-structure optimizations were discussed. A detailed description of the developed OpenCL kernels, as well as a second load balancing algorithm for scheduling finer-grained workloads, will be presented in Chapter 5.
In this chapter, the devised parallel solution for extracting fine-grained parallelism in the PMC Cycle is presented. In this respect, the OpenCL kernels developed for accelerating the PMC Cycle procedures are introduced and described. Following this, a multi-device approach for executing the workload belonging to a single Markov chain is introduced, and the associated synchronization and communication overheads are discussed. After this, a fine-grained dynamic load balancing solution is presented.
5.1 PMC Cycle Parallelization
The OpenCL kernels that compose the PMC Cycle are listed and mapped to the corresponding procedures in Figure 5.1. To minimize communication, the PMC Cycle procedures that share the same input data (see Figure 3.4) were merged into the same kernel. Furthermore, the OpenCL version requires additional kernels to finish the implemented parallel reductions. In order to keep track of the kernel dependencies with respect to each other, OpenCL events were used to chain the kernel calls, as sketched below. These kernels, and the strategy employed for their parallelization, will be discussed in more detail in the next subsections.
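A host-side sketch of this event chaining is shown below; the kernel and variable names are placeholders, error checking is omitted, and the actual Host-Program may group the launches differently.

// Sketch (C++ / OpenCL C API): chaining dependent kernel launches with OpenCL events.
#include <CL/cl.h>

void enqueue_pmc_step(cl_command_queue queue,
                      cl_kernel k_monte_carlo, cl_kernel k_q3m_c, cl_kernel k_q3m_finish,
                      size_t gsz_mc, size_t gsz_q3m, size_t gsz_fin, size_t lsz)
{
    cl_event ev_mc, ev_q3m, ev_fin;
    clEnqueueNDRangeKernel(queue, k_monte_carlo, 1, NULL, &gsz_mc,  &lsz, 0, NULL,    &ev_mc);
    clEnqueueNDRangeKernel(queue, k_q3m_c,       1, NULL, &gsz_q3m, &lsz, 1, &ev_mc,  &ev_q3m);  // waits for monte_carlo
    clEnqueueNDRangeKernel(queue, k_q3m_finish,  1, NULL, &gsz_fin, &lsz, 1, &ev_q3m, &ev_fin);  // waits for q3m_c
    // mm_vdwc/mm_finish and q3m_vdwc are enqueued analogously; decide_update is made to
    // wait on the completion events of all the energy kernels.
    clReleaseEvent(ev_mc); clReleaseEvent(ev_q3m); clReleaseEvent(ev_fin);
}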
Figure 5.1: Mapping of the PMC Cycle procedures to OpenCL kernels. It should be noticed that some procedures were merged into the same kernel. Furthermore, the OpenCL version requires additional kernels for the parallel reductions (mm_finish and q3m_finish, marked with a ∗).
Before entering into the development details of each OpenCL kernel, the memory layout strategy is discussed. Figure 5.2 presents the memory layout for the main data structures used in the PMC Cycle. The Host-Program tries to fit as much constant data in constant memory as possible, although this memory is usually much more limited than global memory, which for most devices means having to place some constant buffers in global memory. The layout depicted in Figure 5.2 is a possible instance of such a buffer distribution. As introduced in Section 3, the lattice is composed of 3 constant vectors q, σ, ε and 3 non-constant vectors x, y, z, which are altered when the reference is updated in the decision step. The first entries of these vectors hold the variables for the QM atomic nuclei (label 1, Figure 5.2), whereas the remaining entries hold the MM atom data. Furthermore, the mol2atom structure (see Figure 4.6) may also be stored in constant memory (label 2).
Figure 5.2: Memory layout example for the main data structures used in the PMC Cycle.
The grid buffers are also constant by nature, since they are not altered during kernel execution (only by the QM process). However, the size of these buffers is typically prohibitive (up to 320MB, according to Table 3.1) with respect to the available constant memory of typical OpenCL devices, forcing the Host-Program to allocate them in global memory (label 3). All these data buffers were chosen to be represented as one-dimensional vectors, to allow a contiguous placement in main memory and to reduce the level of access indirection (i.e., use a single pointer) as much as possible. As discussed further on, this favors coalesced memory accesses.
5.1.1 Monte Carlo
The Monte Carlo procedure has little potential parallelism to be extracted, since it is mainly composed of a fairly light and intrinsically serial operation: the random translation and rotation of a single random molecule (chmol). Nevertheless, an OpenCL kernel (monte_carlo) was developed to execute this task on the OpenCL device, because this kernel manipulates data that will be used by the kernels that follow, enabling the communication of these data via the device's global memory, without needing the Host-Program as an intermediary. Furthermore, since a random number generator would introduce unnecessary overhead in the GPU1, the random numbers required for the MC perturbation (10 vectors) are pre-generated in the Host-Program and sent to the OpenCL device, as sketched below. The size of these vectors depends on the device's memory capabilities, and it is the Host-Program's responsibility to manage the periodic refresh of these random lists, every $N_{auto}$ steps.
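A sketch of the host-side pre-generation and refresh of one of these vectors is shown below; a uniform generator is used purely for illustration, since the actual distributions of the rotation and translation parameters are defined by the PMC method itself.

// Sketch (C++ / OpenCL C API): pre-generate N_auto random values on the host and
// transfer them to the device; repeated every N_auto steps for each of the 10 vectors.
#include <CL/cl.h>
#include <random>
#include <vector>

void refresh_random_list(cl_command_queue queue, cl_mem dev_buf,
                         std::vector<float> &host_buf, std::mt19937 &rng)
{
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    for (float &v : host_buf) v = u(rng);                 // fill one random-parameter vector
    clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0,      // blocking transfer for simplicity
                         host_buf.size() * sizeof(float),
                         host_buf.data(), 0, NULL, NULL);
}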
As depicted in Figure 5.3, the monte_carlo kernel starts by loading the necessary random parameters from memory, then applies the perturbation to the randomly selected molecule (specified in the vector rIDs and loaded from the lattice structure), and finally writes the displaced molecule (chmol) to global memory. This structure carries the new {x, y, z} values for each atom, the molecule id (ID), and the number of atoms of chmol ($Z_{MM}^{(chmol)}$).
1 A simple pseudo-random generator, such as the one provided by glibc, would require at least two global memory accesses for each random number vector: one to load the current generator sequence state and another to update this same state.
Figure 5.3: Diagram of the devised monte_carlo kernel, together with the layout of the data manipulated in this procedure.
The chmol structure will then be loaded from memory by the energy computation kernels, which are described in the next subsections.
5.1.2 Coulomb Grid QM/MM
The amount of parallelism that can be extracted within each kernel varies according to the existing data dependencies and the amount of input data. Accordingly, it is highest in the q3m_c kernel, not only because the Coulomb QM/MM energy interaction (Algorithm 1) is highly data-parallel, but also due to the size of the grid it takes as input, which may vary from hundreds of thousands to millions of grid points. The data partition scheme employed in the q3m_c kernel consists of tasking each work-item with computing the interaction between P grid points and the atoms belonging to chmol. While the latter are the same for every work-item and can be loaded as a global memory broadcast, the former correspond to different load addresses for each work-item. In order to obtain coalesced memory accesses, the grid partition shown in Figure 5.4 was employed. As depicted, the grid data is stored in four independent one-dimensional vectors, one for each coordinate {x, y, z} and another for the charge q. Each work-group performs P memory loads, where each work-item reads the vector addresses $local_{index} + wgsize \times i$ (for i iterating from 0 to P − 1). Hence, by using this strategy, the work-group's grid point loads always fetch contiguous addresses, thus achieving coalesced memory accesses. It should be noticed that although Figure 5.4 depicts an example for wgsize = 4 and P = 2, this is merely for illustration purposes, as the optimal parameter choice differs according to the target OpenCL device
Figure 5.4: Scheme used for partitioning the grid among the work-groups, in order to allow a coalesced memory access pattern. For the sake of keeping the illustration clear, an example for P = 2 and wgsize = 4 is shown.
(see Chapter 6 for further details).
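A simplified OpenCL C sketch of this access pattern in the q3m_c kernel is given below; the cutoff branches, the periodic-box handling and the boundary checks are omitted, and the packing of the chmol atoms into a float4 vector (with the charge in the w component) is an assumption made only for brevity.

// Sketch (OpenCL C): each work-item accumulates the interaction between P grid points
// and the chmol atoms; grid loads are coalesced within the work-group.
__kernel void q3m_c_sketch(__global const float *gx, __global const float *gy,
                           __global const float *gz, __global const float *gq,
                           __global const float4 *chmol, const int n_chmol_atoms,
                           const int P, __global float *partial_e)
{
    const int wgsize = (int)get_local_size(0);
    const int base   = (int)get_group_id(0) * wgsize * P + (int)get_local_id(0);
    float e = 0.0f;
    for (int i = 0; i < P; ++i) {
        const int p = base + i * wgsize;                 // contiguous addresses across the work-group
        for (int a = 0; a < n_chmol_atoms; ++a) {
            const float dx = gx[p] - chmol[a].x;
            const float dy = gy[p] - chmol[a].y;
            const float dz = gz[p] - chmol[a].z;
            const float r2 = dx*dx + dy*dy + dz*dz;      // squared distance, compared against squared cutoffs
            e += gq[p] * chmol[a].w / sqrt(r2);          // placeholder for the cutoff-dependent energy expression
        }
    }
    partial_e[get_global_id(0)] = e;                     // later reduced in local memory (Algorithm 5)
}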
A diagram of the q3m_c and q3m_finish kernels is presented in Figure 5.5. Firstly, each work-item loads from global memory P grid points and the atoms that comprise the chmol molecule. The latter corresponds to a global memory broadcast, whereas the former is performed by P coalesced global memory read instructions, each reading a contiguous stripe of grid points into the work-group (step 1 in Figure 5.5). Then, for each {atom, grid point} pair, the corresponding work-item computes the squared Cartesian distance (according to a periodic box) and compares it with the squared cutoffs, thus avoiding an expensive sqrt operation. Depending on the resulting distance, the corresponding energy expression is computed (see the cutoff branches in Algorithm 1) and the results are accumulated in private memory (step 2). After this, the work-items of the same work-group reduce the computed energies using local memory, accumulating all terms into one memory address after log2(work-group size) iterations (label 3). Then, the first work-item of each work-group writes the obtained partial result into global memory, and a final reduction kernel with a single work-group is launched (label 4) to reduce the remaining terms into a single value (since different work-groups cannot synchronize with each other within a kernel launch). Hence, by including a first set of energy reductions in the same kernel as the $\Delta E^{C}_{QM/MM}$ energy computation (q3m_c), expensive global memory transfers that would otherwise be required between kernel launches are avoided. Furthermore, all reductions are organized so as to favour warp/wavefront release, ensuring that half of the active work-items finish their execution soon after each reduction iteration, thus promoting higher GPU occupancy. The corresponding reduction structure is presented in Algorithm 5.
Figure 5.5: Structure of the q3m_c and q3m_finish kernels. In this example, work-group 0 is presented with additional detail, although all work-groups share an identical structure. Likewise, the 8 work-items per work-group configuration was adopted for illustrative purposes, as the work-group size is fully parameterizable. Furthermore, additional details concerning the first global memory accesses (label 1) are depicted in Figure 5.4.
5.1.3 Coulomb/VDW MM
The mm_vdwc kernel has a structure similar to q3m_c, except that it accounts for the interaction between the changed molecule (chmol) and the lattice, instead of the grid. In this kernel, the Coulomb and the VDW interactions have been merged together, allowing the result of the distance computation for each atom pair to be shared via private memory (registers) within the same work-item. The reduction structure is the same as the one presented in Figure 5.5. The data structure optimizations discussed in Section 4.4 were employed here, to avoid having to maintain a buffer with the distances. The parallelization structure is identical to the one presented for Coulomb Grid QM/MM (see Section 5.1.2), apart from the involved data.
5.1.4 Coulomb Nuclei/VDW QM/MM
Unlike the other energy computation kernels, q3m_vdwc involves a much smaller amount of input data. Recalling from Section 3.2 that the complexity of the QM/MM Coulomb Nuclei and QM/MM VDW procedures is $Z_{MM}^{(chmol)} \times Z_{QM}$, and consulting Table 3.1, one can conclude that the number of loop iterations of these procedures falls in the order of magnitude of $10^2$. Hence, if an approach similar to the other energy computation kernels were followed, the maximum number of work-items one could spawn would also fall in the order of magnitude of $10^2$. For this reason, the reduction structure of q3m_vdwc is simpler, as shown in Figure 5.6.
Algorithm 5 Pseudo-code for the energy reduction.
Init: local_size = size of this work-group
1: for offset = local_size/2; offset > 0; offset >>= 1 do
2:   if local_id < offset then
3:     local[local_id] = local[local_id + offset] + local[local_id];
4:   end if
5:   Local barrier: wait for the whole work-group.
6: end for
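For reference, the same reduction written as an OpenCL C kernel is sketched below (names are illustrative); note that the barrier is placed outside the conditional so that every work-item of the work-group reaches it, as required by the OpenCL execution model.

// Sketch (OpenCL C): in-work-group reduction of partial energy terms in local memory.
__kernel void reduce_sketch(__global const float *in, __global float *out,
                            __local float *scratch)
{
    const int lid = (int)get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int offset = (int)get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);          // every work-item must reach the barrier
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];     // one partial result per work-group
}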
Figure 5.6: q3m_vdwc kernel structure. An 8 work-items per work-group configuration was adopted for illustrative purposes, as the work-group size is fully parameterizable.
In this kernel, only two work-groups are launched: one for the Coulomb Nuclei QM/MM procedure and another for QM/MM VDW. The former is tasked with computing and summing the $E_C$ Coulomb terms (label 1, Figure 5.6) for each QM atom ($z_i$) and chmol atom ($a_j$), whereas the latter is responsible for computing the $E_{vdw}$ terms (label 2) for the same atom pairs. Finally, when each work-group finishes execution, the accumulated energy terms are written to global memory (label 3), to be subsequently read by the decide_update kernel. In contrast with the q3m_c/finish kernels, a subsequent reduction kernel is not required, since each energy term is reduced completely within one work-group. Depending on the target OpenCL device, the chosen work-group size may vary, and the data each work-item computes varies accordingly.
5.1.5 Decision Step
After all the energy computation kernels have terminated their execution, the decide_update kernel is launched. Figure 5.7 depicts the work-flow of this kernel. First, the accumulated results from the previous kernels are read from global memory and added together (label 1, Figure 5.7). Then, the step is accepted if the energy of the obtained configuration is lower than that of the previous reference configuration, or accepted with probability $e^{-\Delta E/(k_B T)}$ if the energy of the system has risen; otherwise, the step is rejected. For accepted steps, the chmol configuration under test is copied to the current lattice reference (label 2).
Figure 5.7: decide_update kernel diagram. An 8 work-items per work-group configuration was adopted for illustrative purposes, as the work-group size is fully parameterizable.
Regardless of this decision, the current system configuration is saved (label 4) to global memory every $F_{output}$ steps (see Table 3.1). This step-saving operation takes $3\times(A_{MM}+Z_{QM})/wgsize$ cycles and follows a coalesced memory write pattern. The kernel finishes execution either after this memory operation (label 4) or immediately after the step has been decided (label 3). Furthermore, since the saved configurations occupy a fair amount of memory on the OpenCL device (each configuration taking $3\times(A_{MM}+Z_{QM})$ numbers), the host is responsible for periodically reading these buffers back to main memory and writing them to an output file.
Since the typical range of the parameter $F_{output}$ is fairly high (see Table 3.1), the parallelization scheme employed in the step saving does not have much impact. Nevertheless, it was implemented for the sole purpose of having faster debug runs, where one might want to print every step ($F_{output} = 1$) to observe how the QM/MM system is evolving with higher granularity. This is an important feature for code maintainability.
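The acceptance test applied in the decide_update kernel can be sketched in isolation as follows; kBT and the random-number source are illustrative assumptions, and the sketch follows the standard Metropolis criterion.

// Sketch (C++): Metropolis acceptance test used at the decision step.
#include <cmath>
#include <random>

bool accept_step(double dE, double kBT, std::mt19937 &rng) {
    if (dE <= 0.0) return true;                          // energy decreased: always accept
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(-dE / kBT);                 // energy increased: accept with prob. e^(-dE/kBT)
}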
5.2 Exploiting Single Markov Chain Parallelism
As introduced in Chapter 4, each Markov chain represents one Monte Carlo state-space exploration instance. The particular case of having a single Markov chain corresponds to the work-flow depicted in Figure 4.5. In this approach, the QM Update and the PMC Cycle depend strictly on the previous PMC iteration, and thus only one of these procedures can be executed at a given time (as depicted in Figure 4.5). Nevertheless, a higher amount of parallelism can still be extracted by running the PMC Cycle instance belonging to the same Markov chain on multiple OpenCL devices.
Figure 5.8: Exploiting multiple heterogeneous OpenCL devices to execute the PMC Cycle. The execution is balanced by executing different kernels on each device and dividing the work of the heavier kernels (q3m_c and q3m_reduce).
The workload distribution of a single Markov chain among multiple OpenCL devices is discussed in the next sections.
5.2.1 Multiple OpenCL Devices
As discussed in Section 3.2, the most computationally intensive part of each PMC Cycle step corresponds to the computation of the $\Delta E^{C,grid}_{QM/MM}$ energy term. In the presented OpenCL approach, this energy calculation is handled by the q3m_c and q3m_finish kernels (see Section 5.1.2). Furthermore, according to the dependency chart depicted in Figure 3.4, the procedure these kernels execute (Coulomb Grid QM/MM) only depends on the chmol data structure from the MC step and on the grid data (which is written once to the OpenCL device at the start of each PMC Cycle execution). Moreover, typical grids have hundreds of thousands to millions of points, allowing for a fine-grained partition among devices. All these conditions make these two kernels excellent candidates for multi-device acceleration.
Figure 5.8 illustrates the employed multi-device parallelization approach for a generic heterogeneous system composed of a host CPU and N different OpenCL devices. In this approach, the Host is responsible for syncing operations between OpenCL devices, which share partial energy results on every iteration. In this particular example, device 0 runs all kernels, although q3m_c and q3m_reduce only compute part of $\Delta E^{C,grid}_{QM/MM}$. Devices 1 to N, which might be accelerators with different compute capabilities, calculate the remaining terms of $\Delta E^{C,grid}_{QM/MM}$. The relative performance of the accelerators (with respect to each other) determines the fraction of the grid each one gets ($G_0$% to $G_N$%) and where the least complex energy computation kernels are scheduled.
In order to keep synchronization overhead to a minimum, every device computes the MC and decision kernels redundantly, although only one of the devices is responsible for saving the sampled configuration, since this is the heaviest part of that procedure (see Section 5.1.5). The overhead associated with the device synchronization, executed at every step, is caused by several factors. Firstly, to read and write the partial energies of each device, one has to call the OpenCL functions enqueueReadBuffer and enqueueWriteBuffer, which also include an implicit clFinish that waits for the previous kernels in that step to finish (launches are chained using OpenCL events). This is accounted for in the R/W Launches block in Figure 5.8, and is sketched below. Secondly, each memory transfer introduces a small overhead corresponding to the copy of one floating-point number per reduced energy term. The number of communicated terms ranges from 1 to 2 terms per device, according to the employed partitioning, since it depends on which device is computing the lighter kernels. Finally, syncing the Host-side threads that manage the OpenCL accelerators (Barrier Sync) and launching and parametrizing the OpenCL kernels (Launch overhead) also introduce some overhead.
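The per-step exchange of partial energies can be sketched as follows; blocking transfers and a single reduced value per device are assumed for simplicity.

// Sketch (C++ / OpenCL C API): exchange of the partial dE terms between the host and one device.
#include <CL/cl.h>

float read_partial_dE(cl_command_queue q, cl_mem dev_dE) {
    float partial = 0.0f;                                 // one float per reduced energy term
    clEnqueueReadBuffer(q, dev_dE, CL_TRUE, 0, sizeof(float), &partial, 0, NULL, NULL);
    return partial;                                       // blocking read: returns after the chained kernels finish
}

void write_total_dE(cl_command_queue q, cl_mem dev_dE, float total) {
    clEnqueueWriteBuffer(q, dev_dE, CL_TRUE, 0, sizeof(float), &total, 0, NULL, NULL);
}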
The multi-device synchronization overheads discussed above do not scale with the problem size, depending only on the number of devices the Host-Program has to manage. Although the Host-Program allocates a dedicated thread for each device (see Section 4.3.1), these threads compete for the Host resources, and the effective Host-thread parallelism may degrade. Therefore, these overheads have a complexity of $O(N_{devices})$, although for a small number of devices with respect to the maximum number of parallel threads the Host CPU can run, they will in practice be sub-linear in $N_{devices}$. Table 5.1 presents the complexity of the discussed overheads, together with two other overheads: random list refreshing (see Section 5.1.1) and output flushing (see Section 5.1.5). The former depends on the random list refresh frequency (10 arrays with $N_{auto}$ entries, every $N_{auto}$ steps), whereas the latter depends on the number of saved QM/MM systems that the OpenCL device can hold in its global memory ($N_{systems}$), since the host has to read back these systems before the available memory runs out (every $N_{systems}$ steps). Furthermore, each saved system configuration consists of 3 arrays of size $A_{MM} + Z_{QM}$ (see Section 5.1.5), which results in the final expression presented in Table 5.1. As for the dependence on the number of OpenCL devices, the same rationale developed above applies.
Table 5.1: Complexity of communication and synchronization overheads, with respect to the QM/MM system characteristics and the run parameters.

  Overhead                   Complexity per PMC Cycle step
  Launch overhead            $O(N_{devices})$
  R/W launches               $O(N_{devices})$
  Read partial $\Delta E$    $O(N_{devices})$
  Write partial $\Delta E$   $O(N_{devices})$
  Refresh random lists       $O(N_{devices} \times N_{auto}/N_{auto})$
  Flush output               $O(N_{devices} \times (A_{MM} + Z_{QM}) \times N_{systems}/N_{systems})$
5.2.2 Dynamic Load Balancing
To account for the possible heterogeneity of the computational platform, the amount of grid data assigned to each device on each iteration is chosen according to a dynamic load balancing algorithm. Considering the classification scheme for load balancing algorithms presented in Section 2.4, the algorithm described here is a centralized predicting-the-future dynamic load balancing approach. Accordingly, Figure 5.9 depicts the work-flow of this solution, which starts from an unbalanced load distribution and converges to a balanced workload distribution after J iterations. Furthermore, the balancer continues to monitor the performance of the computing nodes, to ensure that the workload distribution remains optimal. This solution was based on one of the algorithms presented in [12], for the case of constant data balancing problems; a minimal sketch of the rebalancing step is given below.
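The rebalancing decision itself can be sketched as follows, assuming that the per-device execution times over the last balancing period have been measured; each device's new share is made proportional to its measured throughput. Function and variable names are assumptions.

// Sketch (C++): recompute each device's share of the grid blocks from measured times.
#include <vector>

std::vector<int> rebalance(const std::vector<double> &times,   // time spent by each device
                           const std::vector<int> &blocks,     // blocks currently assigned to each device
                           int n_total_blocks)
{
    std::vector<double> speed(times.size());
    double speed_sum = 0.0;
    for (size_t k = 0; k < times.size(); ++k) {
        speed[k] = blocks[k] / times[k];                       // measured throughput (blocks per second)
        speed_sum += speed[k];
    }
    std::vector<int> next(times.size());
    int assigned = 0;
    for (size_t k = 0; k < times.size(); ++k) {
        next[k] = (int)(n_total_blocks * speed[k] / speed_sum);
        assigned += next[k];
    }
    next[0] += n_total_blocks - assigned;                      // hand rounding leftovers to device 0
    return next;
}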
Figure 5.9: Work-flow of the centralized predicting-the-future dynamic load balancing solution employed in this dissertation.
In order to apply this approach to the particular case of the q3m_c/finish kernels, the 3D grid is divided into n small and independent grid blocks. In the first step, all p devices are assigned the same number of blocks, $d_i^0 = n/p$. Then, every r steps, this distribution is conveniently updated. Thus, at
The hardware considered for the experimental setup is listed in Table 6.2. The considered platforms correspond to several hardware configurations of the machines available at the SiPS research group, which include Intel i7 CPUs, Nvidia GPUs and AMD GPUs. These platform configurations were selected to allow a fairly complete evaluation of the devised parallel solution: i) mcx0 will be used as the performance baseline (more details in Section 6.1.3); ii) mcx1 and mcx2 were selected to evaluate the load balancing solution between two GPUs with very different compute performances (GTX 780Ti and GTX 660Ti); iii) platform mcx3 was selected to evaluate the performance of a highly heterogeneous system composed of GPUs from different vendors (AMD R9 290X and Nvidia 560Ti); iv) mcx5 will mainly be used to assess energy consumption (since it supports NVML power measurements); v) mcx6 will be used to evaluate the parallel OpenCL solution when running on a multi-core CPU. In the presented platform configurations, the Host-CPU will both manage the OpenCL devices and run the QM Updates, using all the available cores.
Different OpenCL work-group partitioning schemes were used for each device. For Nvidia GPUs, the CUDA occupancy calculator [41] proved to be a useful tool for choosing starting-point parameters. For AMD cards and Intel CPUs, the optimal values were found through testing and experimentation, resulting in small multiples (e.g., 1 to 4) of the preferred elementary work-group size returned by an OpenCL device discovery query, made at runtime to the underlying platform. The newest available OpenCL standard was used for each device (OpenCL 1.1 for the considered Nvidia GPUs and OpenCL 1.2 for the Intel CPUs and AMD GPUs).
6.1.3 Performance Baseline
The original PMC QM/MM single-core code was reviewed and optimized, to ensure that the obtained acceleration results were not inflated due to under-performance of the serial baseline. Accordingly, the optimizations discussed in Section 4.4 were added to the original algorithm. Unless otherwise specified, most of the performance comparisons presented in this chapter are relative to this optimized version of the reference code, executed on a single core of the i7-4770K processor (platform mcx0) and compiled with the Intel compiler (ICC v13.1.3) with flags -O3 -xCORE-AVX2. This baseline will henceforth be referred to as the avx2-baseline. Figures 6.1 and 6.2 illustrate a profiling evaluation of the avx2-baseline, using the bench-A input dataset. In particular, Figure 6.2 presents the overall execution results for one PMC iteration, whereas Figure 6.1 depicts a more detailed overview of each step of the simulation bottleneck.
Figure 6.1: Time footprint for a single PMC Cycle step for the bench-A dataset running on the avx2-baseline.
Figure 6.2: One complete PMC outer iteration, comprised of 10k PMC Cycle steps and a QM Update, for the bench-A dataset running on the avx2-baseline. The bottleneck of each PMC iteration is the PMC Cycle.
As predicted in Section 3.2, the Coulomb Grid QM/MM procedure ($\Delta E^{C,grid}_{QM/MM}$) represents, for all the tested input QM/MM systems, the most time-consuming part of each PMC Cycle step, since O(Coulomb Grid QM/MM) = $Z_{MM}^{(chmol)} \times N_{QM}$ and $N_{QM}$ tends to be a very large number (1,772,972 for the case of bench-A).
Furthermore, both double-precision (fp64) and mixed double- and single-precision (fp64-fp32) data types will be employed in the performance study presented in this section. Details about these numerical configurations and the corresponding compromises, as well as a mixed fixed-point precision approach, will be discussed in Section 6.4.
6.2 PMC Cycle Acceleration
The main performance metric of choice is the execution time of the accelerated application. However, to further show the benefits of the proposed parallelization approach, the application speed-up with respect to the baseline serial execution is also adopted:

\[ Speedup = \frac{T_{baseline}}{T_{parallel}} \tag{6.1} \]
56
where Tbaseline is the execution time of the baseline performance, corresponding to the serial execution
on the host CPU, when compiled in Intel compiler (ICC v13.1.3) with flags -O3 -xCORE-AVX2, such as to
enable automatic loop vectorization and the usage of AVX2 vector instructions in compliant processors
(e.g., Intel 4th generation core i7). Furthermore, Tparallel represents the execution time of the proposed
solution using the system under test. In order to measure Host-side execution times, the PAPI [35]
library is used. For evaluating kernel execution time and buffer transfers to the OpenCL devices, OpenCL
Profiling Events are used instead, since they allow a finer measurement of OpenCL device operations. In
order to identify execution bottlenecks and guide the process of algorithm acceleration, the kcachegrind
tool (based on valgrind [37]) was employed.
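Kernel timing through OpenCL Profiling Events follows the usual pattern sketched below; the command queue is assumed to have been created with the CL_QUEUE_PROFILING_ENABLE property.

// Sketch (C++ / OpenCL C API): elapsed device time of an enqueued command, in milliseconds.
#include <CL/cl.h>

double elapsed_ms(cl_event ev) {
    cl_ulong start = 0, end = 0;
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof start, &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof end,   &end,   NULL);
    return (end - start) * 1e-6;                           // the profiling counters are in nanoseconds
}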
Table 6.3 presents the PMC Cycle execution time (10k steps) for benchmarks bench-A, bench-B and bench-C, profiled on several hardware configurations. The overall execution time corresponds to the cost of running 10k steps plus the final output flushing from the OpenCL device back to the host and the file writing (Output time in Table 6.3, ranging from ∼0.5s to ∼2s). The extra overheads related to OpenCL initialization and input file reading were not accounted for, because they do not scale with the simulation size and would be diluted in longer runs (contrary to the output generation). Since the amount of generated output scales with the number of executed steps, this overhead repeats itself every 10k steps (for this particular run), and it is therefore taken into account in the speed-up calculations.
Table 6.3: Execution time (in seconds) for a PMC Cycle with 10k steps, on several hardware platforms, when using fp64-fp32 mixed precision. The column "Total" corresponds to the complete execution time of the PMC Cycle (10k steps), including the final serial overhead of reading back and writing the output to a file. This overhead is discriminated in the column "Output". The presented execution times correspond to the median of four experimental trials, for each platform configuration.
The difference between the execution times of bench-A and bench-B corresponds to the number of MM molecules, which has two implications: bench-B imposes a heavier footprint on the mm_vdwc and mm_finish kernels, and an increased size of the generated output, which in turn means a heavier decide kernel and a longer output flushing. The latter can be observed in Table 6.3 and mainly depends on the Host-to-Device communication speed to read back the output, and on the time to write the output file. Consequently, it is higher on the parallel platforms, since the output has to be read back from an external device (with respect to the OpenCL Host).
On the other hand, bench-C has a larger QM part, resulting in heavier q3m_c and q3m_finish kernels. This favors the overall performance with respect to bench-B, as the performance of the most data-parallel kernels is favored by a higher number of grid points. The speed-up results of the parallel platforms with respect to the avx2-baseline (corresponding to the execution times presented in Table 6.3) are depicted in Figure 6.3.
Figure 6.3: Speed-up obtained for a PMC Cycle with 10k iterations, when using fp64-fp32 mixed precision. The corresponding execution times are presented in Table 6.3.
According to the presented results, the speed-ups obtained in the PMC Cycle acceleration are fairly
high when compared to the avx2-baseline. This is a direct consequence of a careful exploitation of the
memory hierarchy, together with the higher memory bandwidth of GPU architectures. In fact, although
CPUs compensate their lower main memory bandwidth with multiple levels of high-speed caches, the most
intensive procedure in the PMC Cycle (Coulomb Grid QM/MM) requires loading a huge amount of data
from main memory at each step (e.g., up to 48MB in the case of bench-C), rendering the first cache
levels useless. On the GPU, in contrast, coalesced accesses to the main device memory still exploit the
available bandwidth, regardless of whether the local caches are effective.
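To illustrate the access pattern in question, the fragment below shows a generic OpenCL C kernel (not the thesis kernel; all names are illustrative) in which consecutive work-items touch consecutive grid points, so that loads from the main device memory are grouped into wide, coalesced transactions.

/* Illustrative OpenCL C kernel: consecutive work-items read consecutive grid
 * points, so each warp/wavefront issues one coalesced memory transaction per
 * load instead of many scattered ones. */
__kernel void grid_partial_energy(__global const float *grid_q,    /* grid charges   */
                                  __global const float4 *grid_xyz, /* grid positions */
                                  const float4 atom,               /* moved MM atom  */
                                  __global float *partial)         /* one value/item */
{
    const size_t i = get_global_id(0);          /* i-th work-item -> i-th grid point */
    const float4 d = grid_xyz[i] - atom;        /* coalesced load of grid_xyz[i]     */
    const float r2 = d.x*d.x + d.y*d.y + d.z*d.z;
    partial[i] = grid_q[i] * rsqrt(r2);         /* coalesced load and store          */
}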
Table 6.4 presents kernel execution times for the particular case of the GTX780Ti accelerator, together
with the times corresponding to the reference implementation on the avx2-baseline platform. As
can be observed, the kernels that achieve the highest speed-up are q3m c and q3m reduce, as predicted
in Section 5.1.2. The very large speed-up attained in these kernels (160.84×) is subsequently
affected by Amdahl's Law (considering the fractions and speed-ups of all the other kernels) and results
in an overall PMC Cycle step speed-up of 135.55×. By recalling the execution times presented in Table 6.3
for the particular case of the GTX780Ti accelerator, the speed-up without considering the Output
overhead would be (769.96 − 0.231)/(6.33 − 0.534) ≈ 132.8× (versus the value of 121.29× presented in Figure 6.3, where
every component is taken into account), which is slightly below the speed-up attained in the PMC Cycle
step, due to device management and kernel launching overheads, not accounted for in Table 6.4.
Furthermore, two additional details are worth commenting on. Firstly, the monte carlo kernel is faster
on the GPU because it relies on pre-generated random number lists, which are computed by the Host in
parallel and refreshed when necessary; the baseline version, conversely, computes these numbers
on the fly, resulting in a heavier Monte Carlo step. Secondly, the decide kernel is also faster because
the results are accumulated locally and only read back and written to a file from time to time, and are thus
accounted for in the Output fraction of the profiling (see Table 6.3).
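A minimal sketch of the host-side refresh of such a pre-generated list is shown below; the list length, the generator and the buffer names are placeholders, not the ones used in the actual implementation.

/* Sketch of the host-side refresh of a pre-generated random-number list
 * (illustrative names; the thesis generator and list length may differ).
 * The device kernels only index into rnd_buf, so the Monte Carlo step on
 * the GPU does not pay for random-number generation. */
#include <CL/cl.h>
#include <stdlib.h>

#define RND_LIST_LEN (1 << 20)

static void refresh_random_list(cl_command_queue queue, cl_mem rnd_buf,
                                float *host_rnd)
{
    for (size_t i = 0; i < RND_LIST_LEN; ++i)
        host_rnd[i] = (float)rand() / (float)RAND_MAX;   /* placeholder RNG */

    /* Blocking write for simplicity; a real implementation would overlap the
     * transfer with kernel execution and only block when the list runs out. */
    clEnqueueWriteBuffer(queue, rnd_buf, CL_TRUE, 0,
                         RND_LIST_LEN * sizeof(float), host_rnd, 0, NULL, NULL);
}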
Table 6.4: Kernel execution times obtained on the GTX780Ti accelerator and on the reference avx2-baseline platform, for the particular case of bench-A. The speed-up with respect to the avx2-baseline is also presented, together with the fraction of the PMC Cycle (%) that each kernel represents.
Figure 6.4 presents the kernel timing results per PMC Cycle step, when considering bench-A executing
on the mcx2 heterogeneous platform. The load balancing algorithm introduced in Section 5.2.2 was
used and converged to the grid partitioning depicted in this figure. Figure 6.5 illustrates the time evolution
of the workload balancing. Here, the balancing period r is set to 2000 steps, in order to avoid under-sampling
the computational weight of the q3m/mm kernels (which depends on the randomly picked MM
molecule). The starting workload distribution of 50%/50% converges to approximately 71%/29% in only
4 balancing steps, favoring the more powerful 780Ti GPU. Once this distribution is reached, the execution
of the balanced workload on each GPU takes practically the same time, which means that the load is
balanced and that the balancing mechanism has met its purpose. It is worth recalling that the employed
balancing solution was designed to distribute the workload of the q3m c/finish kernels (corresponding to
the Coulomb QM/MM part in Figure 6.4), although the measurements taken into account in the balancing
decision include all the other kernels and overheads, since the goal is to balance each PMC Cycle step
as a whole. In order to illustrate how the balancing persists even after the 10k-th step, the chart represents
the execution up to 20k steps. When compared with an unbalanced run (e.g., a fixed 50%/50% workload
distribution) on the same platform, the balanced version yields a speed-up of 1.3×, further justifying the
advantage of incorporating a load balancing solution.
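The core of such a balancing rule can be expressed in a few lines. The sketch below shows one possible formulation, assuming the balancer keeps the mean PMC Cycle step time measured on each device since the last re-balancing; the exact update rule used in the implementation may differ.

/* Minimal sketch of a grid-partitioning update for two devices ("frac0" is the
 * fraction of grid points assigned to device 0). Every r steps, the mean step
 * time of each device is used to re-split the grid in proportion to the observed
 * throughput, so that both devices finish a PMC Cycle step at the same time. */
static double rebalance(double frac0, double mean_t0, double mean_t1)
{
    double thr0 = frac0 / mean_t0;          /* work done per unit time, device 0 */
    double thr1 = (1.0 - frac0) / mean_t1;  /* work done per unit time, device 1 */
    return thr0 / (thr0 + thr1);            /* new fraction for device 0         */
}

Starting from a 50%/50% split and applying such an update every r = 2000 steps is consistent with the qualitative behaviour observed in Figure 6.5: the faster device receives a progressively larger share until both devices take the same time per step, at which point the split becomes a fixed point of the update.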
6.2.2 PMC Cycle Scalability
The memory footprint of the PMC Cycle kernels in the OpenCL accelerators is mainly determined by
the program output, the pre-generated random lists, the MM lattice and the QM grid. The first two depend
solely on the number of executed steps, and are addressed by having the Host CPU flush the output
and refresh the random lists periodically. The latter two were also not a problem for the selected
benchmarks, since the largest QM grid and MM lattice used occupy ∼48MB and ∼160KB, respectively.
Nevertheless, it is important to note that the scalability of the proposed implementation is not
compromised even when significantly larger simulations are considered. To address such cases, the
following solution is envisaged: the q3m/mm kernels may concurrently execute over one chunk of data
while the Host CPU is transferring the next chunk. This double-buffering mechanism can be achieved
in OpenCL devices by using a second OpenCL command-queue and another Host CPU thread to issue
the memory transfer operations.
[Stacked timing diagram: average time per PMC Cycle step (µs), broken down per kernel (MC, VDW MM, Coulomb MM, VDW QM/MM, Coulomb QM/MM, Decide & Save, Reduction) and per Host-side overhead (kernel launches, partial ∆E reads/writes, barrier synchronization and summation of partials), for the GTX780Ti (71% of the Coulomb QM/MM grid) and the GTX660Ti (29%); the remaining Host GPP threads run the QM Updates in parallel.]
Figure 6.4: OpenCL kernel timings (per step) for the PMC Cycle running on the mcx2 heterogeneous platform. The load is balanced for the heavier kernels (q3m c/q3m finish, corresponding to Coulomb QM/MM), whereas the lighter kernels were scheduled to the first GPU. The considered benchmark is bench-A, using mixed fp64-fp32 precision.
Figure 6.5: Convergence pattern of the implemented load balancing algorithm (balancing every 2000 steps), for bench-C running on the GTX 780Ti/660Ti platform (mcx2). The presented PMC Cycle time measurements represent mean times since the previous balancing.
Since all the considered QM/MM systems have a memory footprint far
below the maximum memory available on the considered acceleration platforms, implementing this
double-buffering mechanism was not considered a priority in this dissertation.
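For reference, a possible structure for this double-buffering scheme is sketched below; it is not part of the implemented solution, all names are illustrative and error handling is omitted.

/* Sketch of the envisaged double-buffering scheme (not implemented in the
 * dissertation): while the q3m kernels process grid chunk k from one device
 * buffer, a second command queue uploads chunk k+1 into the other buffer. */
void process_grid_in_chunks(cl_command_queue compute_q, cl_command_queue copy_q,
                            cl_kernel q3m_kernel, cl_mem buf[2],
                            const float *host_grid, size_t chunk_pts,
                            size_t n_chunks, size_t local)
{
    cl_event copied[2]   = {NULL, NULL};   /* "chunk is on the device"    */
    cl_event computed[2] = {NULL, NULL};   /* "buffer may be overwritten" */

    clEnqueueWriteBuffer(copy_q, buf[0], CL_FALSE, 0, chunk_pts * sizeof(float),
                         host_grid, 0, NULL, &copied[0]);

    for (size_t k = 0; k < n_chunks; ++k) {
        size_t cur = k & 1, nxt = cur ^ 1;

        /* Upload the next chunk as soon as its buffer is free. */
        if (k + 1 < n_chunks) {
            clEnqueueWriteBuffer(copy_q, buf[nxt], CL_FALSE, 0,
                                 chunk_pts * sizeof(float),
                                 host_grid + (k + 1) * chunk_pts,
                                 computed[nxt] ? 1 : 0,
                                 computed[nxt] ? &computed[nxt] : NULL,
                                 &copied[nxt]);
            if (computed[nxt]) clReleaseEvent(computed[nxt]);
        }

        /* Compute on the current chunk once its upload has completed. */
        clSetKernelArg(q3m_kernel, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(compute_q, q3m_kernel, 1, NULL, &chunk_pts, &local,
                               1, &copied[cur], &computed[cur]);
        clReleaseEvent(copied[cur]);

        /* Flush both queues so that cross-queue event waits make progress. */
        clFlush(copy_q);
        clFlush(compute_q);
    }
    clFinish(compute_q);
    clFinish(copy_q);
    for (int i = 0; i < 2; ++i)
        if (computed[i]) clReleaseEvent(computed[i]);
}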
Other overheads worth discussing are the synchronization events related to scheduling the computations
belonging to a single Markov chain among multiple GPUs. These come with the additional
overhead of synchronizing the PMC Cycle step results among the involved devices at the end of every
step. Fortunately, these overheads do not scale with the simulation size, since the buffers that need to
be synchronized back and forth (between the Host and the accelerators) are reduced energy terms, each
represented by a single number. For the particular example presented in Figure 6.4, each device has
to read/write three reduced terms per step, which has a performance impact of a few dozen microseconds.
Conversely, the computational cost of the q3m c/finish kernels scales with the size of the QM grid,
meaning that multi-device scalability is better for larger grids (which concern the most computationally
challenging problems).
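A simplified view of this per-step exchange is sketched below, assuming (for illustration only) that each device produces a single partial energy term; in the actual implementation three reduced terms are exchanged per device and step, and the names are illustrative.

/* Sketch of the per-step synchronization between two devices sharing one
 * Markov chain: each device reduces its slice of the Coulomb QM/MM term to a
 * single partial energy, the host sums the partials and writes the accumulated
 * value back, so the traffic per step is a handful of scalars regardless of
 * the grid size. */
double sync_partial_energies(cl_command_queue q[2], cl_mem partial_buf[2],
                             cl_mem total_buf[2])
{
    double partial[2], total;

    /* Blocking reads of one scalar per device. */
    clEnqueueReadBuffer(q[0], partial_buf[0], CL_TRUE, 0, sizeof(double),
                        &partial[0], 0, NULL, NULL);
    clEnqueueReadBuffer(q[1], partial_buf[1], CL_TRUE, 0, sizeof(double),
                        &partial[1], 0, NULL, NULL);

    total = partial[0] + partial[1];

    /* Broadcast the accumulated term back to both devices. */
    clEnqueueWriteBuffer(q[0], total_buf[0], CL_TRUE, 0, sizeof(double),
                         &total, 0, NULL, NULL);
    clEnqueueWriteBuffer(q[1], total_buf[1], CL_TRUE, 0, sizeof(double),
                         &total, 0, NULL, NULL);
    return total;
}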
[Plot: speed-up of adding a second GTX 680 as a function of the QM grid size (from 4.5 to 8.5 million points).]
Figure 6.6: Scalability of the PMC Cycle when changing the size of the QM part in bench-A. Speed-up results are presented for a dual GTX680 system with respect to a single GTX680 (platform mcx4).
Figure 6.6 presents the speed-ups obtained for the PMC Cycle kernels when a second
GPU is added to the system (platform mcx4) to balance the same Markov chain, considering several
QM grid sizes. As can be observed, multi-device performance scales better for simulations with larger
QM parts, which is actually a common characteristic of real QM/MM systems. Considering
the high speed-ups obtained in the q3m c/finish kernels, the execution time of these kernels is expected
to rise only slowly as larger grids are introduced, which explains the slow growth of the dual-device
speed-up curve presented in Figure 6.6. Nevertheless, it is important to recall that the execution
time of the PMC Cycle kernels (for each Markov chain) was already accelerated to the microsecond
order of magnitude on a single device, and at this level every overhead is noticeable. Therefore,
the observed dual-device speed-up, ranging from ∼1.5× to ∼1.6×, is deemed favorable for grids of up
to 8 million points. This speed-up would continue to rise (while never exceeding 2×) for larger and more
computationally intensive QM/MM systems.
Finally, it is worth noting that the represented configuration, where a single Markov chain is run
on multiple accelerators, is particularly useful when #accelerators > #chains, which corresponds to a
typical situation in a many-node heterogeneous cluster. The scalability for the multiple Markov chain
situation, when #accelerators < #chains, is discussed in Section 6.3.
6.3 Global PMC Results
To conclude the evaluation of the proposed parallel solution, the execution of the complete PMC
simulation is assessed (including the QM Update and the PMC Cycle stages). For this purpose,
greater focus will be given to bench-R, corresponding to the longest and most realistic dataset. A
detailed discussion of the chemical aspects of the obtained results is presented in [15], which validates
them against the work where this dataset was first described [11]. Figure 6.7 depicts the simulation
results, showing the conversion of the chorismate structure into prephenate.
Table 6.5 presents the execution times for the inner PMC Cycle (comprising 50k steps), the QM Update stage, and the full PMC simulation.
Figure 6.7: QM/MM simulation box for the bench-R dataset (partial representation), together with the simulation results for the conversion of the chorismate structure into prephenate.
Table 6.5: bench-R execution times for the PMC Cycle (50k steps) and QM Update (24.8M iterations) stages, as well as for the full PMC simulation. The presented results consider two baselines and four parallel solutions, with either a single or 8 Markov chains and fp64 or fp64-fp32 precision.
                              Execution time (s)
Setup                         PMC Cycle    QM Update    Full Simulation
mcx4-baseline (fp64)          1883.1       96.4         980038.0 s = 272.23 h
avx2-baseline (fp64)                                    300028.0 s =  83.34 h
mcx4, 1 chain, fp64                                      69026.4 s =  19.17 h
mcx4, 8 chains, fp64                                     10156.2 s =   2.82 h
mcx4, 1 chain, fp64-fp32                                 52833.4 s =  14.68 h
mcx4, 8 chains, fp64-fp32                                 7757.7 s =   2.15 h
As shown in Table 6.6, a speed-up of up to 184.23× is obtained in the PMC Cycle alone. However, this speed-up
is affected by Amdahl's law, due to the QM Update fraction running on the CPU. In fact, by
looking at the mcx4-baseline reference scenario, one can observe that in the original run the PMC Cycle
represented 1883.1/(1883.1 + 96.4) = 95.13% of each PMC iteration (PMC Cycle + QM Update). Hence, the speed-up
of 18.55× presented in Table 6.6 was expected, since the speed-up of 184.23× obtained in the PMC
Cycle (fp64-fp32 version) would yield at most 1/(0.9513/184.23 + 0.0487) ≈ 18.57× global speed-up. Therefore,
one can observe that the single-chain runs are limited by the QM Update fraction (4.87% in the mcx4-baseline
scenario), which uses the closed-source MOLPRO program package, a necessary tool in the
current approach to the involved QM chemical calculations [14]. To tackle this limitation, the multiple
Markov chain approach was devised in this dissertation.
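For clarity, the Amdahl bound used above can be written out explicitly (this is only a restatement of the figures already quoted, with f the PMC Cycle fraction of one PMC iteration in the mcx4-baseline and S_c the PMC Cycle speed-up of the fp64-fp32 version):

\[
  f = \frac{1883.1}{1883.1 + 96.4} = 0.9513, \qquad
  S_{\text{global}} \le \frac{1}{\frac{f}{S_c} + (1 - f)}
                      = \frac{1}{\frac{0.9513}{184.23} + 0.0487}
                      \approx 18.57\times
\]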
For the multiple Markov chain case, the attained speed-up is mainly due to the parallel MC state-space
exploration. In fact, by comparing the single-chain with the multiple-chain speed-up values for the same precision
approach, a scalable speed-up trend can be observed. For example, by
comparing the speed-ups attained with mcx4 fp64-fp32 for 1 and 8 chains, a speed-up
ratio of 126.33×/18.55× = 6.81× is obtained. It is important to recall that the speed-up attainable by adding more
chains is limited by Equation 4.1, by the Host-side thread management, and by the overhead introduced
by the concurrent memory and disk accesses issued by the CPU cores running the QM Updates in parallel.
In this case, one can verify from Table 6.5 that the mean QM Update execution time degraded from
96.4s to 113.4s. Furthermore, although Equation 4.1 would yield a theoretical maximum of 23 chains for
this particular case, the 8 cores available in mcx4 limit the maximum number
of chains that can be run on that platform to 8. Hence, the considered multiple Markov chain solution
achieves an efficiency of 38.68/(5.67 × 8 (#cores)) ≈ 85%. Nevertheless, one can conclude that, by using the
same GPUs for the PMC Cycle acceleration, the proposed implementation would scale well on
a system with up to 23 CPU cores. Increasing the number of OpenCL accelerators would raise this
limit even further.
Among the presented results, the most conservative speed-up of this parallel implementation is
assumed to be 36.86× (mcx4 fp64-fp32 8-chain versus the avx2-baseline, see Table 6.6), as the avx2-baseline
corresponds to the reference with the best performance. Naturally, the 126.33× speed-up obtained when
comparing mcx4 to itself could remain close to this value if a better Intel Xeon CPU had been used in
both the reference and the parallel solutions.
6.4 Numerical Evaluation: Convergence Accuracy and Energy Consumption
While the proposed parallel implementation does not make any approximation or relaxation with
respect to the original sequential method, yielding exactly the same output as the original PMC implementation,
it is important to consider different numerical precisions and to evaluate how they impact
execution performance, energy consumption and the quality of the results. Accordingly, besides the original
64-bit floating-point representation (fp64), the presented OpenCL version offers the following numerical
representation alternatives: i) mixed 64-bit and 32-bit floating-point (fp64-fp32), or ii) mixed 64-bit/32-bit
floating-point and 32-bit fixed-point (fp32-i32). In the fp64-fp32 version, the computationally more complex
q3m finish/mm finish kernels use 32-bit floating-point precision for the ∆E^{C,grid}_{QM/MM} energy computations,
whereas 64-bit floating-point is employed for the remaining energy terms, whose computations are much
faster. This configuration also uses the same data type to store the grid, as well as a copy of the
lattice and the chmol. Likewise, the fp32-i32 version uses a 32-bit floating-point representation
for the same energy computation, but it uses 32-bit fixed-point for the squared distances. The latter
operates on normalized grid and atom coordinates, represented by 32-bit integers, which actually provides
a higher precision than the alternative 32-bit floating-point representation. The usage of mixed precision for different
energy terms calls for casting operations, which may degrade the performance on a GPU accelerator. To
circumvent this degradation, all the necessary casting operations were moved to the monte carlo and
decide kernels, concentrating the conversions in the single-threaded procedures of these
kernels and avoiding redundant casting in the many-thread kernels.
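The idea behind the fp32-i32 squared-distance computation can be illustrated as follows; the scaling factor, the vector types and the function name are illustrative, and the exact normalization used in the implementation is not reproduced here.

/* Illustrative OpenCL C fragment of the fp32-i32 idea: coordinates are
 * normalized to the simulation box and stored as 32-bit integers; the squared
 * distance is accumulated exactly with 64-bit intermediates and converted to
 * floating point only once, when it enters the energy expression. */
inline float fixed_point_r2(int3 grid_pt, int3 atom, float scale2)
{
    long dx = (long)grid_pt.x - atom.x;      /* exact integer differences       */
    long dy = (long)grid_pt.y - atom.y;
    long dz = (long)grid_pt.z - atom.z;
    long r2_fixed = dx*dx + dy*dy + dz*dz;   /* fixed-point squared distance    */
    return (float)r2_fixed * scale2;         /* single cast back to floating point */
}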
The resulting performance gains for each of the considered precisions, when executing the q3m c
kernel, are presented in Table 6.7. Depending on the adopted GPU device, execution speed-ups as
high as 8.89× can be attained simply by adopting the lower-precision representations, with only minor degradations of the
obtained energy results. However, the generated system configurations remain the same as long as
the accumulated error does not cause the sequence of selected systems to diverge, which was verified
to be the case for all the considered benchmarks. In Table 6.8, the error introduced in the ∆E^{C,grid}_{QM/MM}
term and in the total system energy (E) is presented for each kernel version, with respect to the fp64
implementation (which is numerically equivalent to the original serial version). It can be observed that
the fp32-i32 version offers higher precision than fp64-fp32, due to the greater number of significant
bits used in the squared-distance operations. In these simulations, it was ensured that the maximum error
did not exceed e_m = 1.0 × 10⁻¹ kJ/mol, as commonly considered in this research domain.
In order to further assess the impact of the considered mixed-precision solutions, the average
energy consumption was measured on the Nvidia K20C GPU by using the NVML library. The method
introduced in [8] was used to gather the power measurements, with the maximum allowed
sampling frequency of 66.7Hz. Since this frequency is too low to sample a single launch of the q3m c kernel
(which executes in the order of hundreds of microseconds), a testbench with just the q3m c kernel was
built and launched repeatedly for 100k steps.
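A simplified version of such a power-sampling loop, based on the publicly documented NVML calls, could look as follows; the sample count and sleep interval are illustrative, and the actual measurement methodology follows [8].

/* Sketch of NVML-based power sampling on the first GPU; link against
 * libnvidia-ml. Samples nvmlDeviceGetPowerUsage at roughly 66.7 Hz and
 * averages the readings. Error checking is omitted. */
#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int mw;                 /* instantaneous power in milliwatts */
    double sum = 0.0;
    const int n_samples = 6670;      /* ~100 s at ~66.7 Hz */

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int i = 0; i < n_samples; ++i) {
        nvmlDeviceGetPowerUsage(dev, &mw);
        sum += mw / 1000.0;          /* accumulate in watts */
        usleep(15000);               /* ~15 ms sampling period */
    }

    printf("average power: %.1f W\n", sum / n_samples);
    nvmlShutdown();
    return 0;
}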
Table 6.7: Speed-up of the mixed-precision q3m c kernel versions versus the original fp64 version, running on the same machine, for the case of bench-A.
Table 6.8: Obtained numerical precision. The error is shown for the ∆E^{C}_{QM/MM} energy term, as well as for the total energy of the system (E), when considering the e_m = 1.0 × 10⁻¹ kJ/mol maximum error. The average values were taken over the complete set of generated QM/MM systems, using bench-A.
The obtained results are presented in Table 6.9. The first aspect worth noting is the configuration
that presented the highest average power: fp64-fp32. This fact can be justified by the higher
core occupancy allowed by the single-precision floating-point implementation. The fp64 version has
a lower average power dissipation for the opposite reason, i.e., its lower GPU occupancy results
in a reduced dynamic power requirement. For the fp32-i32 configuration, a GPU occupancy similar to
that of fp64-fp32 is expected, although the integer functional units consume less power,
resulting in an 8W decrease in average power. To complement and further justify these observations,
power and energy consumption were also measured on the avx2-baseline configuration, by using the
SchedMon power and energy measurement tool [50]. Although the avx2-baseline draws (on average)
approximately 4 times less power than the most energy-efficient parallel configuration on the K20C
GPU (fp32-i32), the acceleration attained by the GPU in the execution time of the q3m c kernel greatly
compensates for this, yielding a much lower overall energy consumption, with energy savings of up to 28.8×.
Although the same tests could not be performed on the GTX680 and GTX780Ti GPUs (these GPUs
do not feature internal power counters), rather similar energy savings can be predicted for the GTX780Ti
accelerator, since it shares the same Kepler core architecture (GK110) as the K20C.
6.5 Summary
In this chapter, a detailed performance assessment of the devised parallel heterogeneous approach
to the PMC QM/MM method was presented and discussed. The main performance metric
of interest was the execution time speed-up of the parallel solutions with respect to either the avx2-baseline
(a single core of the i7-4770K processor, with AVX2 instructions enabled) or the mcx4-baseline
(a single core of a Xeon E5-2609). To accomplish these profiling measurements, several tools were employed,
namely the PAPI [35] library, OpenCL Profiling Events, and the kcachegrind tool (based on valgrind [37]).
Table 6.9: Execution time speed-up, energy savings and average power consumption, comparing the Tesla K20C GPU running all the devised numerical precision approaches with the avx2-baseline (with the original fp64 precision). The testbench was run on the K20C GPU for 100k steps, in order to ensure a representative sampling of the computational cost of q3m c. The default core frequency configuration was used in all experiments.
Setup                   q3m c time (µs)   q3m c speed-up   Energy savings   Avg. Power
avx2-baseline fp64      76077             1×               1×               34W
K20C fp64               1775              42.9×            10.4×            140W
K20C fp64-fp32          670               113.5×           26.3×            147W
K20C fp32-i32           647               117.6×           28.8×            139W
The performance of the parallel solution was assessed by using four chemical
datasets, carefully designed by chemistry experts from the Institut fur Physikalische Chemie, Georg-August-Universitat
Gottingen. In particular, a chorismate reaction dataset, relevant to the field of application
[11, 15], was benchmarked. The experiments for this particular dataset yielded a 56× execution
time speed-up in the simulation bottleneck (the PMC Cycle) and a 38× speed-up for the full simulation (when
compared to the avx2-baseline). This is a significant acceleration, since it reduced the full execution
time from ∼80 hours to ∼2 hours. Furthermore, a scaling efficiency of 85% was observed for the case of 8
Markov chains executing on a platform with 8 CPU cores and 2 GPUs. Finally, the numerical quality
and energy consumption of the proposed solution were evaluated (by using the SchedMon power and
energy measurement tool [50]) for alternative numerical representation schemes. Energy savings of up
to 28× were observed in the heaviest kernel of the simulation bottleneck.