1/31/2014
1
Hardware and Software for Parallel
Computing
Florent Lebeau, CAPS entreprise
UMPC - janvier 2014
Agenda
Day 1
The Need for Parallel Computing
Introduction to Parallel Computing
o Different Levels of Parallelism
History of Supercomputers
o Hardware Accelerators
Multiprocessing
o Fork/join
o MPI
Multithreading
o Pthread
o TBB
o Cilk
o OpenMP
Day 2
Hardware Accelerators
o GPU
• CUDA
• OpenCL
o Xeon Phi
• Intel Offload
o Directive Standards
• OpenACC
• OpenMP 4.0
CapEx / OpEx with GPU
Porting Methodology
www.caps-entreprise.com 2
1/31/2014
2
The Need for Parallel
Computing
The Demand (1)
Compute for research o Simulate complex physical phenomenon
o Validate a mathematical model
Compute for industry o Quality control by image processing
o DNA sequence alignment
o Oil & gas prospection
o Meteorology
Compute for the masses o Playing a 1080p DVD in real time
o Compressing / uncompressing streams
o Image processing
www.caps-entreprise.com 4
1/31/2014
3
The Demand (2)
Computing needs o Data
o An operation
The computing cost (time) may be
o And even worse, sometimes
To reduce the computation time, one can o Reduce the amount of data
o Reduce the amount of operations
o Increase the computation speed
www.caps-entreprise.com 5
Qcomp = Qdata *Qop
The Demand (3)
The amount of computations keeps increasing
o Games and screens resolutions keep improving
o Longer weather prediction
o More accurate weather prediction
• Which increases the amount of data
Amount of data grows each year
But the given time to exploit these data is still the same
o 24h for another day of weather prediction
o 1/50s for a video stream image
www.caps-entreprise.com 6
1/31/2014
4
The Demand (4)
So we need to increase computations per second
At a lower cost o To purchase
o To develop
o To maintain
Technologically sustainable o Easy adaptation to next architecture
According to the company strategy o Green computing, industrial partnerships with providers…
www.caps-entreprise.com 7
The Solution (1)
By the end of the 20th century, most of applications were written for CPUs (mainly x86) o The regular frequency increase of
micro-processors ensured performance gains without code modification
o The effort focused on hardware vendors, less on developers
Today frequency increase is stuck o Because of thermal dissipation and
power leakage
o Because of components distance and die surface
Computing faster is less and less feasible (sequentially) o But we can compute more things
simultaneously (in parallel)
www.caps-entreprise.com 8
1/31/2014
5
The Solution (2)
Advanced optimizations of code to get the best of cutting-edge CPU functionalities o Vectorization
Code parallelization o Multi-threading
o Parallel computing requires parallel codes
Porting onto specialized architectures o FPGA, cluster, GPU…
o Not only developer’s choice, because may imply long-term hardware investments
www.caps-entreprise.com 9
Introduction to Parallel
Computing
1/31/2014
6
Flynn’s Taxinomy
Classification of computer architectures
o Established by Michael J. Flynn en 1966
4 categories based on the data and instruction flow
o SISD
o SIMD
o MISD
o MIMD
• With shared memory (CPU cores)
• With distributed memory (clusters)
www.caps-entreprise.com 11
Flynn’s Taxinomy : SISD
SISD : Single Instruction Single Data
www.caps-entreprise.com 12
Data Instruction
Processor Memory Control
1/31/2014
7
Flynn’s Taxinomy : SIMD
SIMD : Single Instruction Multiple Data
www.caps-entreprise.com 13
Data Instruction
Processor 0
Processor 1
Processor 2
Memory Control
Flynn’s Taxinomy : MISD
MISD : Multiple Instruction Single Data
www.caps-entreprise.com 14
Data Instruction
Processor 0
Processor 1
Processor 2
Memory Control 1
Control 0
Control 2
1/31/2014
8
Flynn’s Taxinomy : MIMD
MIMD : Multiple Instruction Multiple Data
www.caps-entreprise.com 15
Data Instruction
Processor 0
Processor 1
Processor 2
Memory 0 Control 0
Memory 1
Memory 2
Control 1
Control 2
Distributed Memory Architectures
Processors only see their own memory
They communicate explicitly by message-passing if needed
A processor cannot access to the memory of another
So the distribution must be done to avoid communications
www.caps-entreprise.com 16
Netw
ork
Processor Processor
Processor Processor
1/31/2014
9
Shared Memory Architectures (1)
A unique address space is provided by the hardware
If there is, cache consistency is maintained by hardware
www.caps-entreprise.com 17
Network
Processor Processor Processor Processor
Shared Memory Architectures (2)
Global memory space, accessible by all processors
Processors may have local memory (i.e. cache) to hold copies of some global memory
Consistency of these copies is usually maintained by hardware o Referred as Cache-Coherent
User-friendly
Programmer is in charge of correct synchronization between processes/threads
Suffer from lack of scalability
www.caps-entreprise.com 18
1/31/2014
10
UMA : Unified Memory Access
Most commonly represented today by SMP machines
Identical processors
Equal access and access times to memory
Sometimes called CC-UMA - Cache Coherent UMA
www.caps-entreprise.com 19
NUMA : Non-Uniform Memory Access
Often made by physically linking two or more SMPs
One SMP can directly access memory of another SMP
Memory access across link is slower
www.caps-entreprise.com 20
1/31/2014
11
The Amdahl’s Law
Amdahl’s law states that the sequential part of the execution limits the potential benefit from parallelization o The execution time TP using P cores is given by:
• where seq (in [0,1]) is percentage of execution that is inherently
sequential
Consequences of this law o Potential performance dominated by sequential parts of the
application
www.caps-entreprise.com 21
TP seq T1 (1 seq) *T1
P
What is a Hotspot?
A small part of code
o Most of the execution time spent in it
o Mostly loops that concentrate computation
Identified using performance profilers
Also known as kernels or compute intensive kernels
o But sometimes a hotspot can be implemented as several kernels
www.caps-entreprise.com 22
1/31/2014
12
Speedup
Speedup
o Ratio between execution on 1 process and on P processes
Efficiency
o Ratio between the speedup and the number of cores used for the
parallel version
o Parallel application is scalable when efficiency is close to 1 with
number of cores increasing
www.caps-entreprise.com 23
SP =T1
TP
EP =T1
P*TP
Amdahl’s law
www.caps-entreprise.com 24
1/31/2014
13
Speedup
Speedup
o Ratio between the original time and the optimized time
www.caps-entreprise.com 25
Sp Tseq
Tp
Gustafson’s Law
States that increasing the amount of data increases the
parallelism potential of the application
o The more computations you have, the more computations you may
overlap
A parallel architecture need to be well-exploited to get good
performances
o The more you send parallel computations on it, the best you get
www.caps-entreprise.com 26
1/31/2014
14
Scalability
Scalability gives us an idea about the system behaviour
when the number of processors is increased.
Applications can often exploit large parallel machines by
scaling the problems to larger instances
To improve the scalability :
o Increase the parallelism of the application
www.caps-entreprise.com 27
Load Balancing
It is the capacity of
the application to
adapt the amount of
work between each
Proceesing Elements
It can be statically or
dynamically adapted
www.caps-entreprise.com 28
1/31/2014
15
Granularity
Granularity means the amount of computation compared to
communications
Larger parallel tasks usually provide better speedups
o Reduce startup overhead
o Reduce communication and synchronization overhead
Smaller granularity can be exploited on strongly coupled
architecture, such as multicore
o Can require deep rewriting of the application
www.caps-entreprise.com 29
Different Levels of Parallelism
1/31/2014
16
Different Levels of Parallelism
ALU o Vectorization (MMX / Now!, SSE)
Instruction Pipeline o Instruction Level Parallelism (ILP)
Multi-core o Mass market desktop workstations
Multi-socket o Bi-CPU desktop workstations
Multi-node o Cluster
Worldwide distributed computing o SETI@home
www.caps-entreprise.com 31
Parallelism in CPUs
Scalar / superscalar / SIMD o 80486 / Intel Pentium / Intel Pentium MMX
Mono/multicore o Dual-core (2005 avec AMD Opteron)
o Quad-core, etc.
o “Duplication of the processors”
Mono/multisocket o Intel Xeon bi-socket
www.caps-entreprise.com 32
1/31/2014
17
Scalar Processors
One data is computed at a time
o The architecture is designed to perform one instruction on one data
per clock cycle
o Contrarily to vector (or superscalar architecture)
One data = a value or scalar variable
o i.e.: a value or a recipient determined by a type
o As opposed to a composite data:
• A vector
• A character string in certain programming languages
Ex : Intel 80486
www.caps-entreprise.com 33
Superscalar Processors
Able to perform many instructions simultaneously o Each in a different pipeline
Ex : Intel P5 (1993)
www.caps-entreprise.com 34
1/31/2014
18
Vector Processors
Their architecture is based on pipelines
o A vector instruction executes the same operation on all the data of a
vector
Ex : Cray, NEC, IBM, DEC, Ti, Apple Altivec G4 et G5…
www.caps-entreprise.com 35
SIMD Processors
Single Instruction Multiple Data
Ex : MMX, SSE, ARM Neon, SPARC VIS, MIPS…
www.caps-entreprise.com 36
1/31/2014
19
Today
Most architectures are based on superscalar processors
o And SIMD
Mono-socket for mass market
o Dual-socket or more in clusters or professional workstations
www.caps-entreprise.com 37
History of Supercomputers
1/31/2014
20
Top500.org
Lists the world’s 500 largest supercomputers
www.caps-entreprise.com 39
Architecture Types
www.caps-entreprise.com 40
1/31/2014
21
Architecture Types
Single Processor o But a big one
Cluster
UP^n avec SAS : o SMP if n is small
o MPP if n is large
(UP^n SAS)^m : o If n << m and /SAS : cluster
o Constellation otherwise
www.caps-entreprise.com 41
UP
withoutSAS
nUP )(
withSAS
nUP )(
withSAS
nUP )(
withoutSAS
m
withSAS
nUP )))(((
withoutSAS
m
withSAS
nUP )))(((
Architecture Types
www.caps-entreprise.com 42
1/31/2014
22
Processor Types
www.caps-entreprise.com 43
Processor Types
www.caps-entreprise.com 44
1/31/2014
23
Number of Processors
www.caps-entreprise.com 45
Interconnect Type
www.caps-entreprise.com 46
1/31/2014
24
Installation Type
www.caps-entreprise.com 47
Clusters
1/31/2014
25
An Exemple: Nova Cluster
CAPS entreprise, 2009
www.caps-entreprise.com 49
Nova Architecture
Nova is composed of:
o 1 login node (Nova0)
o 3 storage nodes over Lustre (Nova1 to Nova3)
o 20 compute nodes (Nova4 to Nova23)
Each compute node is made up of:
o A dual-socket Intel Nehalem (bi-processor) machine
• Each Nehalem processor is quad-core (4 CPU cores)
o 2 Nvidia Tesla C1060 GPUs
o 24 GB of memory
www.caps-entreprise.com 50
1/31/2014
26
Nova’s Compute Nodes Architecture
www.caps-entreprise.com 51
12 GB
Intel QPI
12 GB
Intel S5520
chipset
PCIe 2.0 16x
Tesla C1060
Tesla C1060
Nova Architecture
www.caps-entreprise.com 52
nova0.caps-entreprise.com
Nova0
File system
(Nova1-3)
Compute Nodes
(Nova4-23)
Internet
1/31/2014
27
Pros
• Less expensive
• Than one multiprocessor server of similar computing power
• In particular SMP
• More flexible
• The size is adapted to the needs and budget, contrarily to monolithic
configurations
www.caps-entreprise.com 53
Exploiting clusters
As a distributed system
o That is what they actually are
o Resources can be shared among users, applications …
o More complicated to program
As a virtual SMP
o Kerrighed
o The OS hide the underlying architecture
o Easier to program but less control
• A cluster is NUMA-type. Data distribution is important
www.caps-entreprise.com 54
1/31/2014
28
Exploiting Clusters in Distributed Mode
In distributed mode, it is generally provided with a task
scheduler
o Enables to add more servers
o Enable to manage breakdowns
o Slurm, sge, PBS,…
www.caps-entreprise.com 55
$ srun –n 4 ./myProgram.exe myProgram is running on node 13 myProgram is running on node 14 myProgram is running on node 15 myProgram is running on node 16 $
Connection to the Login Node
Before offloading your computations on Nova’s compute
nodes, you need to login to Nova0
o “Secondary” will automatically connect you to Nova0
www.caps-entreprise.com 56
$ ssh [email protected]
mylogin@Nova0 $ #you can now work!
1/31/2014
29
Module and Slurm
Nova is provided with module and srun commands
Useful Module commands
Useful Slurm commands
www.caps-entreprise.com 57
mylogin@Nova0 $ module avail
mylogin@Nova0 $ module load
mylogin@Nova0 $ module unload
mylogin@Nova0 $ module list
mylogin@Nova0 $ module switch
mylogin@Nova0 $ srun
mylogin@Nova0 $ sinfo
mylogin@Nova0 $ salloc
mylogin@Nova0 $ sacct
mylogin@Nova0 $ sreport
Running your Applications
Launch a Bash command on multiple nodes
Launch an application on multiple nodes
Launch n copies of a binary on N nodes
Launch on a specific partition
www.caps-entreprise.com 58
mylogin@Nova0 $ srun –N4 ls
mylogin@Nova0 $ srun –N3 ./myApp
mylogin@Nova0 $ srun –N1 –n3 ./myApp
mylogin@Nova0 $ srun –p hugePart –N16 ./myLargeJob
1/31/2014
30
Clusters and Cloud Computing
Some companies provide an access to their server farm
o Amazon AWS/EC2
o Google App Engine
o IBM Blue Cloud
o Rackspace Mosso
o Argia Faascape
www.caps-entreprise.com 59
Amazon EC2
www.caps-entreprise.com 60
1/31/2014
31
Hardware Accelerators
Clearspeed: Description
Clearspeed designs SIMD
processors for HPC and
embedded systems
Designs e710, e720 and
CATS-700 accelerators
units, based on the CSX700
processor
CSX700 released in 2008
www.caps-entreprise.com 62
1/31/2014
32
Clearspeed: Architecture Overview
CSX700: o Made of 2 SIMD array
processors
o 96 cores on each array
o 256 kb on-chip scratchpad memory
o 2 x 64-bit DDR2 DRAM interface with ECC support
o PCIe 16x host interface
o 96 GFLOPS SP and DP
o 9 W power dissipation
www.caps-entreprise.com 63
Clearspeed: Software Tools
o ANSI C-based Compiler: Cn
o Eclipse IDE
o GDB-based debugger
o Visual profiler
o Libraries: FFT, BLAS, LAPACK …
o MS Windows and Linux tools
www.caps-entreprise.com 64
1/31/2014
33
Clearspeed: Applications
Finance:
o « Fastest solution for credit risk analysis »
Space engineering
HPC
www.caps-entreprise.com 65
CELL: Description
Alliance of Sony, Toshiba and IBM
Dates back to the mid-2000’s
Based on the IBM POWER architecture core
Design to bridge the gap between CPU and GPU
www.caps-entreprise.com 66
1/31/2014
34
CELL: Architecture Overview
1 main processor (PPE: Power Processing Element)
8 coprocessors (SPE: Synergistic Processing Element)
256 GFLOPS SP 26 GFLOPS DP
60 to 80 W power consumption
www.caps-entreprise.com 67
CELL: Software Tools
IBM online resource center no longer available, but the latest
SDK (v3.1) can still be found on the web.
SDK v3.1 contains:
o Eclipse IDE
o GNU toolchain tools (gcc, gdb)
o Performance tools
o Libraries (FFT, LAPACK, BLAS, Monte Carlo)
Linux only
www.caps-entreprise.com 68
1/31/2014
35
CELL: Applications
Multimedia:
o Console video games
o Home cinema
HPC
o IBM Roadrunner
www.caps-entreprise.com 69
Tilera: Description
Fabless semiconductor
company focusing on
scalable multicore embedded
processors
TILE family processors
TILE-based platforms
www.caps-entreprise.com 70
1/31/2014
36
Tilera: Architecture Overview
TILEProX64:
o 8 x 8 grid of identical RISC
processor cores
o 5.6 Mbytes on-chip cache
memory
o 19 – 23 W power
consumption
o 4 DDR2 memory
controllers with optional
ECC
o 443 Giga OPS
www.caps-entreprise.com 71
Tilera: Software Tools
Tilera Multicore
Development
Environment (MDE):
o ANSI C / C++ compiler
o Eclipse IDE
o Tools for gdb, gprof and
oprofile
o Graphical multicore
debugger and profiler
Linux only
www.caps-entreprise.com 72
1/31/2014
37
Tilera: Applications
Cloud computing
o 3X perfomance-per-Watt
when running Memcache,
according to Facebook
Networking
Multimedia
www.caps-entreprise.com 73
Kalray: Description
Spin-off from the CEA
Fabless and software
company delivering
manycore processors
Developed MPPA (Multi-
Purpose Processor Array)
www.caps-entreprise.com 74
1/31/2014
38
Kalray: Architecture Overview
256 VLIW processors organized in 16 clusters of 16 cores for the basic edition
High-end products up to 64 clusters, ie 1024 cores
DDR3 memory controllers
5 W power dissipation
2 Tera OPS for the 1024 cores version
www.caps-entreprise.com 75
Kalray: Software Tools
AccessCore SDK: o C-based programming
model: • Core algorithms in C
• Primitives for task and data parallelism
o GNU gcc and gdb are used for compilation and debug
o Eclipse 3.x IDE
o Libraries
o Linux only
www.caps-entreprise.com 76
1/31/2014
39
Kalray: Applications
Signal Processing
o SIMILAN project
Video processing, industrial Imaging, transportation
o CHAPI project
www.caps-entreprise.com 77
Calxeda: Description
Provides ARM-based
SoC
EnergyCore family
processors
EnergyCards boards
www.caps-entreprise.com 78
1/31/2014
40
Calxeda: Architecture Overview
4 ARM A9 cores
4 Mbytes L2 cache and
DDR3 controller with
ECC
220 MIPS / ARM core
5 W per node
www.caps-entreprise.com 79
Calxeda: Software Tools
Use ARM-dedicated tools:
o GNU gcc ARM
o JTAG debugger
www.caps-entreprise.com 80
1/31/2014
41
Calxeda: Applications
Server market
o Web servers clusters
o Content distribution
networks
o Cloud storage
www.caps-entreprise.com 81
Supercomputers
Vector Architecture
o CRAY, NEC SX-9
Cluster
o Contemporary Supercomputers
www.caps-entreprise.com 82
1/31/2014
42
CRAY Supercomputers
Seymour Cray
o 1957: co-founder of CDC (Control Data Corporation). Takes part in
the design of the first supercomputer: the CDC 6600.
o 1972: co-founder of Cray Research, Inc. Creates the Cray-1 in 1976
and the Cray-2 in 1985.
o 1989: founder of Cray Computer Corporation. The Cray-3 has never
been built and the company went bankrupt in 1995.
www.caps-entreprise.com 83
Multiprocessing
1/31/2014
43
What are Processes?
An instance of a computer program that is being executed
o If a program is launched twice, two instances of this program will be
spawned
According to the OS
o A process can be switched with another
o A process is interruptible
Switches could be performed:
o tasks perform input/output operations
o when a task indicates that it can be switched
o or on hardware interrupts
www.caps-entreprise.com 85
Processes
With processes you can switch between processes o And if you do it quick enough it seems like they execute concurrently
Processes enable multi-tasking o OS, drivers and programs run “concurrently” on the microprocessor
o They actually share the multiprocessor core
• But with dual-cores CPUs, they are distributed over two cores
Processes make multicore possible o And take advantages of it by executing even faster
Processes have separate address spaces o And interact only through system-provided inter-process communication
mechanisms
www.caps-entreprise.com 86
1/31/2014
44
Processes
Imply explicit
communications
o Through inter-
process
communication
mechanisms
www.caps-entreprise.com 87
Multiprocessing with Multiple Cores
Enables to execute different programs concurrently
o One on each core: the overall time needed is reduced
o You have to launch programs asynchronously
Enables to execute several instances of a same program
o On different data -> SPMD
o You need either to use another command line or specific tools like
MPI
www.caps-entreprise.com 88
$ riri.exe & fifi.exe & loulou.exe &
$ srun –N 9 donald.exe data1-9.txt
1/31/2014
45
Tools for Multiprocessing
By hand
o Fork / join
o …
With the help of dedicated tools
o MPI
o PVM
o …
www.caps-entreprise.com 89
Fork/Join
www.caps-entreprise.com 90
1/31/2014
46
Fork / join
The fork() system call will spawn a new child process which
is an identical process to the parent
o Except that has a new system process ID
The process is copied in memory from the parent and a new
process structure is assigned by the kernel
o All data and properties are inherited from parent, but separated
…
www.caps-entreprise.com 91
Fork / Join Example
www.caps-entreprise.com 92
#include …
using namespace std;
main()
{
string sIdentifier;
pid_t pID = fork();
if (pID == 0) // Code only executed by child process
{
sIdentifier = "Child Process: ";
}
else if (pID < 0) // failed to fork
{cerr << "Failed to fork" << endl; exit(1);}
else // Code only executed by parent process
{
sIdentifier = "Parent Process:";
}
// Code executed by both parent and child
cout << sIdentifier << pID << endl;
}
$ g++ -o ForkTest ForkTest.cpp
$ ./ForkTest
Parent Process pID: 1234
Child Process pID: 1245
1/31/2014
47
Multiprocessing on Distributed Systems
The Fork / Join model is on-node programming
But since processes embed all their execution context
o They are quite autonomous
o And may be distributed over separated computers (nodes)
MPI
o Designed to implement multiprocessing on distributed systems
o Is based on the fork / join model
o Is portable (systems of different kinds can communicate)
www.caps-entreprise.com 93
MPI
1/31/2014
48
MPI: Message Passing Interface
A high-level message-passing API
Designed for high performance, scalability and portability
A standard which comes in two versions
o 1995: v1.2 (MPI-1)
o 1997: v2.1 (MPI-2)
An API which comes in several implementations
o Some vendor-specific extensions...
o ...can break the portability of the application
www.caps-entreprise.com 95
MPI distribution
Free implementation
o MPICH (MPI-1)
o MPICH2 (MPI-2)
o OpenMPI
o LAM/MPI
Constructor
o HP MPI
o Intel MPI
www.caps-entreprise.com 96
1/31/2014
49
Programming Paradigms
Sequential Programming Paradigm
Message-Passing Programming Paradigm
www.caps-entreprise.com 97
Data
Program
Sub-Data
Sub-Program
Sub-Data
Sub-Program
Sub-Data
Sub-Program
Sub-Data
Sub-Program
Network
MPI Programming
Each node in an MPI program runs a sub-program
o Written in a sequential language (C, Fortran, ...)
o Typically the same on each node (SPMD)
The variables of each sub-program:
o Have the same name
o Have different locations and different values (distributed memory)
o all variables are private to the sub-program
Communicate via send & receive routines
www.caps-entreprise.com 98
1/31/2014
50
Data & Work Distribution
Each process is given a unique rank from 0 to N-1
System of N independent processes
Data and work distribution decisions based on rank
www.caps-entreprise.com 99
Rank=0 Data
Sub-Program
Rank=1 Data
Sub-Program
Rank=2 Data
Sub-Program
Rank=N-1 Data
Sub-Program
Network
Messages
Messages are blocks of data exchanged by sub-programs
For both send and receive steps, necessary information are:
o Ranks of the source/destination processes
o Data location
o Data type
o Data size
www.caps-entreprise.com 100
Network
1/31/2014
51
Include In The Program File
In C
o #include <mpi.h>
In C++
o #include <mpi++.h>
In Fortran
o #include "mpif.h" ! F77 ou F90
o USE MPI
• Before IMPLICIT NONE
www.caps-entreprise.com 101
MPI C/C++ Functions
Header
o #include <mpi.h>
o #include <mpi++.h>
Function format
o mpierror = MPI_?????( … );
o MPI_?????( … );
MPI_* prefix reserved for MPI macros & routines
www.caps-entreprise.com 102
1/31/2014
52
MPI Fortran Subroutines
Header
o include 'mpif.h’
Function format
o var = MPI_?????( … )
o CALL MPI_?????( … , mpierror)
MPI_* prefix reserved for MPI macros & routines
www.caps-entreprise.com 103
initialization & termination
Initializing MPI o int MPI_Init(int *argc, char **argv) (C)
o Subroutine MPI_Init( mpierror) (Fortran)
Must be the first called MPI routine
It initialize the MPI execution environment (Communicator, …)
Exiting MPI o int MPI_Finalize() (C)
o Subroutine MPI_Finalize(mpierror) (Fortran)
Must be called by all processes before exiting
Terminates MPI execution environment
www.caps-entreprise.com 104
1/31/2014
53
Handles
A handle is a value used to identify an MPI object
For the programmers, handles are:
o Predefined constants (exist only after call to MPI_Init())
• Ex: MPI_COMM_WORLD
o Values returned by MPI routines: defined as special MPI typedefs.
Handles refer to internal data structures
www.caps-entreprise.com 105
Communicator MPI_COMM_WORLD
Communicator: set of interconnected MPI processes
All processes of one MPI program are combined in the
communicator MPI_COMM_WORLD
o Predefined in mpi.h and mpif.h
Each process in a communicator has its own rank
o Starting from 0
o Ending at size-1
www.caps-entreprise.com 106
1/31/2014
54
Size & Rank
The size :
o How many processes in a communicator
• int MPI_Comm_size(MPI_Comm comm, int *size) (c)
• Subroutine MPI_Comm_size(MPI_COMM_WORLD, mpisize, mpierror)
(Fortran)
The rank :
o Uniquely identifies each process in a communicator
o Is the basis for work / data distribution
• int MPI_Comm_rank(MPI_Comm comm, int *rank) (C)
• Subroutine MPI_Comm_rank(MPI_COMM_WORLD, mpirank, mpierror)
(Fortran)
www.caps-entreprise.com 107
Basic Example in Fortran
program firstmpi
! The most basic MPI Program
include 'mpif.h'
integer :: mpierror, mpisize, mpirank
call MPI_Init(mpierror)
call MPI_Comm_size(MPI_COMM_WORLD, mpisize, mpierror)
call MPI_Comm_rank(MPI_COMM_WORLD, mpirank, mpierror)
! Do work here
call MPI_Finalize(mpierror)
end program firstmpi
www.caps-entreprise.com 108
1/31/2014
55
Basic Example in C
#include <mpi.h>
void main(int argc, char *argv[])
{
/* The most basic MPI Program */
int mpierror, mpisize, mpirank;
mpierror=MPI_Init(&argc, &argv);
mpierror=MPI_Comm_size(MPI_COMM_WORLD, &mpisize);
mpierror=MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
/* Do work here */
mpierror=MPI_Finalize();
}
www.caps-entreprise.com 109
Compilation step
Compilation with C code
o Unless you have a specific implementation (e.g. bullmpi)
Compilation with Fortran code
www.caps-entreprise.com 110
$ mpicc -o mpiProg mpiProg.c
$ gcc mpiProg.c -o mpiProg -lmpi
$ icc mpiProg.c -o mpiProg -lmpi
$ mpif90 -o mpiProg mpiProg.f90
$ gfortran mpiProg.c -o mpiProg -lmpi
$ ifort mpiProg.c -o mpiProg -lmpi
1/31/2014
56
Execution Step
If you execute on several nodes o Specify the number of nodes (-n 2 or -np 2)
o Specify the name of the machine in a file (ex:machines)
o Use the option -machinefile
If you use a cluster o Use the system of job
o Specify the number of nodes
• In a script with PBS or SGE
• In the command line with SLURM
• …
www.caps-entreprise.com 111
$ mpirun –np 2 –machinefile machines ./a.out
$ mpiexec –n 2 –machinefile machines ./a.out
Using MPI
The file of your application :
Introduce MPI in the application
Compile the application
Run the application
www.caps-entreprise.com 112
$ myProg.c
$ mpicc myProg.c
$ mpirun –np 2 –machinefile machines ./a.out
1/31/2014
57
Point-to-point communication
Communication between 2 processes
Source process sends message to destination process
Communication takes place within a communicator
o e.g.: MPI_COMM_WORLD
Processes are identified by their rank in the communicator
www.caps-entreprise.com 113
0
4 3
5
1
2 Message
Messages
Blocks of data are exchanged by sub-programs
Contain one or more elements sharing a same datatype
MPI datatypes:
o Basic datatypes
o Derived datatypes (built from basic & after derived datatypes)
Datatype handles are used to describe the data in memory
www.caps-entreprise.com 114
1/31/2014
58
Sending a message
C: o int MPI_Send(void *buff, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
Fortran: o Subroutine MPI_Send(buff, count, datatype, dest, tag, comm, ierror)
buff is the starting point of the message with count elements of type datatype.
Dest is the rank of the destination in the communicator comm.
Tag is an integer added to the message, allowing the identification of the message.
www.caps-entreprise.com 115
Receiving a message
C:
o int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Status *status);
Fortran:
o Subroutine MPI_Recv(buf, count, datatype, source, tag, comm,
status, ierror)
Buff, count and datatype describe the receive buffer.
Receive a message sent by process with rank source in
comm and with the same tag.
Additional info is returned in status.
www.caps-entreprise.com 116
1/31/2014
59
MPI Basic C datatypes
MPI Datatype C Datatype
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
www.caps-entreprise.com 117
MPI Basic Fortran datatypes
www.caps-entreprise.com 118
MPI Datatypes Fortran Datatypes
MPI_INTEGER integer
MPI_REAL real
MPI_DOUBLE_PRECISION double precision
MPI_COMPLEX complex
MPI_LOGICAL logical
MPI_CHARACTER character
MPI_BYTE
MPI_PACKED
1/31/2014
60
Requirements
For a communication to be successful:
o Sender must specify a valid destination rank
o Receiver must specify a valid source rank
o Same communicator on both sides
o Matching tags (choosen by user)
o Matching datatypes
o A large enough buffer on the receiver’s side
www.caps-entreprise.com 119
Wildcarding
The receiver can wildcard.
To receive from any source
o source = MPI_ANY_SOURCE
To receive from any tag
o tag = MPI_ANY_TAG
Source & tag are returned within the receiver's status
parameter.
www.caps-entreprise.com 120
1/31/2014
61
A Simple Example (1)
#include <stdio.h>
#include <mpi.h>
#define MASTER 0 //We assume 2 MPI processes
#define SLAVE 1
int main(int argc, char **argv)
{
int val=0; //Value to be exchanged
int mpi_rank, mpi_size;
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
...
www.caps-entreprise.com 121
A Simple Example (2)
...
if (mpi_size!=2) return -1;
if ( mpi_rank == MASTER ) /* The master sends a value */
{
val = 1492;
printf("I'm process %d and Value is %d\n", mpi_rank, val);
MPI_Send(&val, 1, MPI_INT, SLAVE, 777, MPI_COMM_WORLD);
}
else /* The slave prints the value */
{
printf("I'm process %d and Value is %d\n", mpi_rank, val);
MPI_Recv(&val, 1, MPI_INT, MASTER, 777, MPI_COMM_WORLD, &status);
printf("I'm process %d and Value is %d\n", mpi_rank, val);
}
MPI_Finalize();
}
www.caps-entreprise.com 122
1/31/2014
62
The Sendrecv
C: o int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype, int
dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status )
Fortran : o Subroutine MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf,
recvcount, recvtype, source, recvtag, comm, status, err)
sendbuff is the starting point of the message with sendcount elements of type sendtype.
recvbuff is the starting point of the message with recvcount elements of type recvtype.
dest and source are the ranks of the destination and the sender in the communicator comm.
Sendtag and recvtag are integers added to the message, allowing the identification of the message.
www.caps-entreprise.com 123
Asynchronous Sending
C: o int MPI_Isend(void *buff, int count, MPI_Datatype datatype, int dest, int
tag, MPI_Comm comm, MPI_Request *request)
Fortran: o sSbroutine MPI_Isend(buff, count, datatype, dest, tag, comm, request,
ierror)
buff is the starting point of the message with count elements of type datatype.
Dest is the rank of the destination in the communicator comm.
Tag is an integer added to the message, allowing the identification of the message.
Additional info is returned in status.
request can be used later to query the status of the communication or wait for its completion
www.caps-entreprise.com 124
1/31/2014
63
Asynchronous Receiving
C: o int MPI_Irecv(void *buff, int count, MPI_Datatype datatype, int source, int
tag, MPI_Comm comm, MPI_Status *status, MPI_Request *request);
Fortran: o Subroutine MPI_Irecv(buff, count, datatype, source, tag, comm, status,
request, ierror)
Buff, count and datatype describe the receive buffer.
Receive a message sent by process with rank source in comm and with the same tag.
Additional info is returned in status.
request can be used later to query the status of the communication or wait for its completion
www.caps-entreprise.com 125
The synchronization
C: o int MPI_Wait(MPI_Request *request, MPI_Status *status)
Fortran : o Subroutine MPI_Wait(request, status, ierror)
Wait until the operation identified by request is complete
Additional info is returned in status.
One is allowed to call MPI_Wait with a request like MPI_REQUEST_NULL argument. In this case the operation returns immediately with empty status.
www.caps-entreprise.com 126
1/31/2014
64
Example
CALL MPI_INIT(ierr)
…
CALL MPI_COMM_RANK(comm, rank, ierr)
IF(rank.EQ.0) THEN
CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
**** do some computation to mask latency ****
CALL MPI_WAIT(request, status, ierr)
ELSE
CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)
**** do some computation to mask latency ****
CALL MPI_WAIT(request, status, ierr)
END IF
…
www.caps-entreprise.com 127
Broadcast (1)
MPI_Bcast broadcasts a message from the process with
rank "root" to all other processes of the group
Before a MPI_Bcast :
After a MPI_Bcast :
www.caps-entreprise.com 128
10
Process 1 Process 2 Process 3 Process 4
10
Process 1
10
Process 2
10
Process 3
10
Process 4
1/31/2014
65
Broadcast (2)
C
o int MPI_Bcast ( void *buff, int count, MPI_Datatype datatype, int root,
MPI_Comm comm )
Fortran :
o Subroutine MPI_BCAST(buff, count, datatype, root, comm)
Buff, count and datatype describe the receive/sender buffer
in comm
Root is the rank of the sender
www.caps-entreprise.com 129
Scatter (1)
MPI_Scatter sends data from one task to all other tasks in a
group
Before a MPI_Scatter :
After a MPI_Scatter :
www.caps-entreprise.com 130
10 11 12 13
Process 1 Process 2 Process 3 Process 4
10
Process 1
11
Process 2
12
Process 3
13
Process 4
1/31/2014
66
Scatter (2)
C o int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype,
void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm )
Fortran o Subroutine MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm)
sendbuff, sendcount and sendtype describe the root sender in comm
recvbuff, recvcount and recvtype describe the receivers in comm
www.caps-entreprise.com 131
Gather (1)
MPI_Gather gathers together values from a group of
processes
Before a MPI_Gather :
After a MPI_Gather :
www.caps-entreprise.com 132
10 11 12 13
Process 1 Process 2 Process 3 Process 4
10
Process 1
11
Process 2
12
Process 3
13
Process 4
1/31/2014
67
Gather (1)
C o int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm )
Fortran o Subroutine MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm)
sendbuff, sendcount and sendtype describe the senders in comm
recvbuff, recvcount and recvtype describe the root receiver in comm
www.caps-entreprise.com 133
Reduce (1)
MPI_Reduce reduces values on all processes to a single
value. The example is using the operation MPI_SUM.
Before a MPI_Reduce :
After a MPI_Reduce :
www.caps-entreprise.com 134
46
Process 1 Process 2 Process 3 Process 4
10
Process 1
11
Process 2
12
Process 3
13
Process 4
1/31/2014
68
Reduce (2)
C o int MPI_Reduce ( void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Fortran o Subroutine MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root,
comm, err)
sendbuff, count and datatype describe the senders buffer in comm
Recvbuff and root describe the receiver in comm
op specifies the operator of reduction
www.caps-entreprise.com 135
Reduction Operators
www.caps-entreprise.com 136
Name Meaning
MPI_MAX maximum
MPI_MIN minimum
MPI_SUM sum
MPI_PROD product
MPI_LAND logical and
MPI_BAND bit-wise and
MPI_LOR logical or
MPI_BOR bit-wise or
MPI_LXOR logical xor
MPI_BXOR bit-wise xor
MPI_MAXLOC max value and location
MPI_MINLOC min value and location
1/31/2014
69
Barrier
C
o int MPI_Barrier ( MPI_Comm comm )
Fortran
o Subroutine MPI_BARRIER(comm)
comm is the communicator
Blocks the caller until all group members have called it
www.caps-entreprise.com 137
To go further
It exists different versions of the classical send o Basic send user-provided buffering (MPI_Bsend)
o Blocking ready send (MPI_Rsend)
o Blocking synchronous send (MPI_Ssend)
o …
It exists some completely collective communications o MPI_Allgather
o MPI_Allscatter
o MPI_Alltoall
o MPI_Allreduce
o …
You can mix OpenMP with MPI
www.caps-entreprise.com 138
1/31/2014
70
Multithreading
www.caps-entreprise.com 139
What is a Thread?
An independent stream of instructions
o That can be scheduled to run as such by the operating system
o To the software developer, the concept of a "procedure" that runs
independently from its main program may best describe a thread
Consider a main program (a.out) that contains a number of
procedures being able to be scheduled to run
simultaneously and/or independently by the operating
system
o That would describe a "multi-threaded" program
www.caps-entreprise.com 140
1/31/2014
71
What is a Thread?
So, in summary, in the UNIX environment a thread: o Exists within a process and uses the process resources
o Has its own independent flow of control as long as its parent process exists and the OS supports it
o Duplicates only the essential resources it needs to be independently schedulable
o May share the process resources with other threads that act equally independently (and dependently)
o Dies if the parent process dies - or something similar
o Is "lightweight" because most of the overhead has already been accomplished through the creation of its process
Because threads within the same process share resources: o Changes made by one thread to shared system resources (such as closing a file)
will be seen by all other threads
o Two pointers having the same value point to the same data
o Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer
www.caps-entreprise.com 141
Pthread
1/31/2014
72
What are Pthreads?
Historically, hardware vendors have implemented their own proprietary versions of threads o These implementations differed substantially from each other making
it difficult for programmers to develop portable threaded applications
In order to take full advantage of the capabilities provided by threads, a standardized programming interface was required o For UNIX systems, this interface has been specified by the IEEE
POSIX 1003.1c standard (1995)
o Implementations adhering to this standard are referred to as POSIX threads, or Pthreads
o Most hardware vendors now offer Pthreads in addition to their proprietary API’s
www.caps-entreprise.com 143
Why Pthreads?
When compared to the cost of creating and managing a
process, a thread can be created with much less operating
system overhead
www.caps-entreprise.com 144
1/31/2014
73
On-node Communications
MPI libraries usually implement on-node task communication via shared
memory, which involves at least one memory copy operation (process to
process)
o For Pthreads there is no intermediate memory copy required because
threads share the same address space within a single process.
o There is no data transfer, per se. It becomes more of a cache-to-CPU or
memory-to-CPU bandwidth (worst case) situation
www.caps-entreprise.com 145
Threads = Shared Memory Model
All threads have access to the same global, shared memory
o Threads also have their own private data
o Programmers are responsible for synchronizing access (protecting)
globally shared data
www.caps-entreprise.com 146
1/31/2014
74
Thread-safeness
Refers an application's ability to execute multiple threads simultaneously without damaging shared data or creating race conditions
o The implication to users of external library routines is that if you aren't
100% certain the routine is thread-safe, then you take your chances with problems that could arise
www.caps-entreprise.com 147
Pthread API
For C/C++
From Intel, PathScale, PGI, GNU, IBM
Initially, your main() program comprises a single, default
thread
o All other threads must be explicitly created by the programmer
www.caps-entreprise.com 148
pthread_create () pthread_exit () pthread_cancel () pthread_attr_init () pthread_attr_destroy ()
1/31/2014
75
Threads Life
Threads can spawn other threads
Threads can wait for another (actively or passively), die…
www.caps-entreprise.com 149
Pthread Example
www.caps-entreprise.com 150
#include <pthread.h> #include <stdio.h> #define NUM_THREADS 5 void *PrintHello(void *threadid) { long tid; tid = (long)threadid; printf("Hello World! It's me, thread #%ld!\n", tid); pthread_exit(NULL); }
int main (int argc, char *argv[]) { pthread_t threads[NUM_THREADS]; int rc; long t; for(t=0; t<NUM_THREADS; t++){ printf("In main: creating thread %ld\n", t); rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t); if (rc){ printf("ERROR; return code is %d\n", rc); exit(-1); } } /* Last thing that main() should do */ pthread_exit(NULL); }
1/31/2014
76
Pthread Management Basis
Routines
Joining is one way to accomplish synchronization between
threads
o The pthread_join() subroutine blocks the calling thread until the
specified threadid thread terminates
www.caps-entreprise.com 151
pthread_join () pthread_detach () pthread_attr_setdetachstate () pthread_attr_getdetachstate ()
Intel Thread Building Block: TBB
1/31/2014
77
Intel TBB
Initiated in 2006, open source since 2007
Express Parallelism in C++
Express tasks not threads
http://threadingbuildingblocks.org/
April 13 Intel MIC Programming 153
Intel TBB : Warnings
Far from other ways to express parallelism
Complex to implement
Dedicated to C++
April 13 Intel MIC Programming 154
1/31/2014
78
Intel TBB : C++ vs imperative langages
Object Oriented langage with:
o complex inheritance
o Classes contain data and code (functions)
Template functions: function change with data types
STL widely used.
April 13 Intel MIC Programming 155
Intel TBB : Express tasks not threads
Intel TBB have its own scheduler
o Automatically invoked at the beginning of a program
o Can be accessed using task_scheduler_init class
Work stealing scheduler, can cohabit with Cilk Plus
scheduler.
April 13 Intel MIC Programming 156
1/31/2014
79
Intel TBB: overview
Link with tbb library (-ltbb or –ltbb_debug)
April 13 Intel MIC Programming 157
void SerialApplyFoo( float a[], size_t n )
{
for( size_t i=0; i!=n; ++i ) Foo(a[i]);
}
#include <tbb/tbb.h>
Using namespace tbb;
void SerialApplyFoo( float a[], size_t n )
{
for( size_t i=0; i!=n; ++i ) Foo(a[i]);
}
Intel TBB: parallel For (1)
Parallel for
April 13 Intel MIC Programming 158
#include <tbb/tbb.h>
Using namespace tbb;
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);
}
ApplyFoo( float a[] ) : my_a(a) {}
};
1/31/2014
80
Intel TBB: parallel For (2)
Declare a class containing an operator ()
April 13 Intel MIC Programming 159
#include <tbb/tbb.h>
Using namespace tbb;
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);
}
ApplyFoo( float a[] ) : my_a(a) {}
};
Intel TBB: parallel For (3)
This operator takes a blocked_range<T> argument
April 13 Intel MIC Programming 160
#include <tbb/tbb.h>
Using namespace tbb;
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);
}
ApplyFoo( float a[] ) : my_a(a) {}
};
1/31/2014
81
Intel TBB: parallel For (4)
Some other iteration spaces exists (ex: blocked_range2d)
April 13 Intel MIC Programming 161
#include <tbb/tbb.h>
Using namespace tbb;
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);
}
ApplyFoo( float a[] ) : my_a(a) {}
};
Intel TBB: parallel For (5)
A copy constructor must be declared
April 13 Intel MIC Programming 162
#include <tbb/tbb.h>
Using namespace tbb;
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);
}
ApplyFoo( float a[] ) : my_a(a) {}
};
1/31/2014
82
Intel TBB: parallel For (6)
Then invokation will be as followed:
Blocked_range take 3 arguments:begin, end , grainsize
o Begin, End : iteration start and end
o Grainsize: size of a chunk to be executed by a thread.
April 13 Intel MIC Programming 163
void ParallelApplyFoo( float a[], size_t n )
{
parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}
Intel TBB: Tips on grainsize
No more than total num of iteration/ nb threads
One thread must execute at least 10 000 operations to
overtake overhead of work stealing scheduler.
Small loops with good load-balancing won’t suffer a lot from
scheduler overhead.
April 13 Intel MIC Programming 164
1/31/2014
83
Intel TBB: more on partitioner
You can choose among 3 partionners as a 3rd arg.
o Simple_partitioner
o Auto_partitioner (default)
o Affinity_partitioner
April 13 Intel MIC Programming 165
void ParallelApplyFoo( float a[], size_t n )
{
parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), simple_partitioner ());
}
Intel TBB: Other constructs
Parallel_reduce
Parallel_while
April 13 Intel MIC Programming 166
1/31/2014
84
Intel Cilk Plus
What is Intel Cilk Plus ?
Intel Cilk Plus is an extension to C and C++ bringing:
o Multi-core support.
o Vector Processing support.
It comes with:
o 3 keywords to manage thread-programming
o Vectorization intrinsics and directives
o Array Notation.
April 13 Intel MIC Programming 168
1/31/2014
85
Where can I get Intel Cilk Plus ?
Available with Intel Composer XE Compilers.
Cilk Plus open projects: http://cilkplus.org
o Cilk Plus in GCC
Cilk Plus extension in LLVM.
Specifications are open : http://cilkplus.org
April 13 Intel MIC Programming 169
Data Parallelism with Intel Cilk Plus
3 keywords: _Cilk_spawn, _Cilk_sync, _Cilk_for
To simplify, <clik/cilk.h> header defines macros:
o cilk_spawn
o cilk_sync
o cilk_for
Clik Plus don’t create threads
April 13 Intel MIC Programming 170
1/31/2014
86
Cilk_spawn, Cilk_sync (1)
Cilk_spawn: the statement following can be executed with
an other thread.
o It does not create threads but a new task is queued.
Cilk_sync: all spawned statements using cilk_spawn have to
be completed before execution continues.
o Implicit cilk_sync at the end of a function.
April 13 Intel MIC Programming 171
Cilk_spawn, cilk_sync (2)
Recursive Fibonacci function
April 13 Intel MIC Programming 172
Int fib (int n)
{
if (n < 2)
return n;
int x = cilk_spawn fib(n-1);
int y = fib (n-2);
cilk_sync;
return x+y;
}
1/31/2014
87
Cilk_for
cilk_for is same as #pragma omp for
o Replace a for statement.
o Iterations shared among threads by the runtime and the compiler.
April 13 Intel MIC Programming 173
clik_for (int i = 0; i < 8; i++)
{
do_work(i);
}
Keeping serial execution
Possible to use pre-processing to hide Clik Plus keywords
April 13 Intel MIC Programming 174
#define cilk_spawn
#define cilk_sync
Int fib (int n)
{
if (n < 2)
return n;
int x = cilk_spawn fib(n-1);
int y = fib (n-2);
cilk_sync;
return x+y;
}
1/31/2014
88
Who creates the threads ?
At the beginning of a program, Cilk Plus Runtime asks OS to
create as many threads as available cores.
Each thread manages a queue containing tasks to execute.
April 13 Intel MIC Programming 175
Array notation (1)
Array Notation: Tell the compiler to use SIMD instructions.
Exist in Fortran90
In Cilk plus: A[<lower_bound>:<length>:<stride]
April 13 Intel MIC Programming 176
// for(i = 0; i < N; i++) A[i] = c* B[i];
A[:] = c* B[:];
// for(i = 0; i < N; i++) A[i+N/2] = c* B[i];
A[N/2:N/2] = c* B[0:N/2];
1/31/2014
89
Array notation (2)
Fortran90: arrays can overlap
o It generates temporary arrays
Cilk Plus Arrays Elements must not overlap:
o Better performances / undefined behavior if arrays overlaps
April 13 Intel MIC Programming 177
A[0:10] = c* A[1:11]; // undefined behavior
A[0:10] = c* A[0:10]; // ok, seen as a reduction
A[0:10:2] = c* A[1:11:2]; // ok, elements don’t overlap
Array Notation: Dynamic Arrays (3)
For Dynamically Allocated arrays:
o start and length have to be specified
April 13 Intel MIC Programming 178
Int f (float* a, float * b)
{
A[:] = c* B[:]; // COMPILATION ERROR !
A[0:N] = c*B[0:N]; // OK
}
1/31/2014
90
Array Notation: Multi-dimentional arrays (4)
Multi-dimentional arrays: same constraints as 1D.
April 13 Intel MIC Programming 179
// for (i = 0; i < N; i ++)
// for (i = 0; j < N; j++)
// B[i][j] = 0;
B[:][:] = 0;
Array Notation: conditionals (5)
Conditionals
April 13 Intel MIC Programming 180
// for (i = 0; i < N; i ++)
// if (a[i] == 0)
// result[i] += 1;
If (a[:] == 0)
result[:] += 1;
1/31/2014
91
Array Notation: builtin functions (6)
Built-in functions
o __sec_reduce_add (A[:]) // sum = 𝐴[𝑖]𝑁𝑖=0
o __sec_reduce_mul (A[:]) // product
o __sec_reduce_max(A[:]) // max among elements of A
o __sec_reduce_min(A[:]) // min
o __sec_reduce_max_index(A[:]) // return index of max element
o __sec_reduce_min_index(A[:]) // return index of min element
o __sec_reduce_all_zero (A[:]) // true if all zero
o __sec_reduce_any_zero (A[:]) // true if a zero exist
o __sec_reduce_all_nonzero (A[:]) // true if all elments are not zero
o __sec_reduce_any_nonzero (A[:]) // true if there is non-zero elements
April 13 Intel MIC Programming 181
Array Notation: elemental functions (6)
Function call can be declared with __declspec(vector) or
o 2 functions generated by compiler: 1 SIMD one scalar.
April 13 Intel MIC Programming 182
__declspec (vector) void calc (float* B)
{
B[0] = B[0] + rand ();
}
calc (B[:]); // SIMD version is called
Calc ( B[5]); // sequential version is called
1/31/2014
92
Vectorization directive
#pragma SIMD in front of loops can be used instead of Array
sections
Clauses exist: linear, reduction, private etc…
April 13 Intel MIC Programming 183
#pragma simd
For (int i = 0; i < N; i++)
{
B[i] = B[i] + i;
}
Cilk Plus with MIC
Can be used alone, or combined with other programming
model : offload directives, openMP etc..
Straighforward vectorization with Cilk Plus on MIC.
_Cilk_shared : share memory between host and MIC.
o Only apply on static allocated memories.
April 13 Intel MIC Programming 184
1/31/2014
93
OpenMP
The master thread
The master thread
executes all the
sequential region and
create some slave
threads
Thread ID = 0
www.caps-entreprise.com 186
1/31/2014
94
What’s OpenMP
A standard API for shared memory parallel applications
In C/C++ or Fortran
Consists in compiler directives, runtime routines and
environment variables
Works on the fork and join model
An OpenMP program is portable
Requires less programming effort than Pthread
www.caps-entreprise.com 187
History
OpenMP 1.0 for Fortran (1997) & C/C++ (1998)
OpenMP 2.0 for Fortran (2000) & C/C++ (2002)
o Major revision
2005 : OpenMP 2.5
o Extensive rewrite and clarification
2008 : OpenMP 3.0
o Task
o Better support for nested parallelism
2013: OpenMP 4.0
o Accelerator directives
www.caps-entreprise.com 188
1/31/2014
95
The parallel directives
A parallel region is a block of code that will be executed by
multiple threads
How to declare a parallel region
To put a synchronization barrier use
www.caps-entreprise.com 189
#pragma omp parallel
{
…
…
}
!$OMP PARALLEL
…
…
!$OMP END PARALLEL
#pragma omp barrier !$omp barrier
Example
Code :
Execution :
www.caps-entreprise.com 190
#pragma omp parallel
{
printf("hello world !\n");
}
$ ./a.out
Hello world !
Hello world !
Hello world !
Hello world !
1/31/2014
96
The parallel loops
It exists a useful directive to parallelize the loops
All the threads will execute independently iterations of the loop
In C
In Fortran
We can fuse the directive parallel and the directive for/do
www.caps-entreprise.com 191
#pragma omp for
For (i=0; i<N; i++){
…
}
!$OMP DO
DO i=0,N
…
ENDDO
!$OMP END DO
Example C/C++
www.caps-entreprise.com 192
#include <omp.h>
#define N 10000
int main(){
int tab[N];
init(tab);
#pragma omp parallel for
for (i=0; i<N; i++)
{
tab[i] = foo(i, tab[i]);
}
}
1/31/2014
97
Example Fortran
www.caps-entreprise.com 193
PROGRAM main
USE omp_lib
IMPLICIT NONE
INTEGER N 10000
INTEGER, DIMENSION ( N) :: tab
CALL init(tab);
!$omp parallel do
DO i=1, N
tab(i) = foo(i, tab(i))
ENDDO
!$omp end parallel do
END PROGRAM main
Get some useful informations
To get the ID of the current thread
To get the current number of thread in the region
To set the number of thread in the next parallel region
www.caps-entreprise.com 194
int my_id = omp_get_thread_num()
int nbThreads = omp_get_num_threads()
omp_set_num_threads(int n)
1/31/2014
98
Example
Code :
Execution :
www.caps-entreprise.com 195
#pragma omp parallel
{
int id = omp_get_thread_num();
int size = omp_get_num_threads();
printf("hello world ! From %i of %i\n", id, size);
}
$ ./a.out
Hello world ! From 2 of 4
Hello world ! From 0 of 4
Hello world ! From 3 of 4
Hello world ! From 1 of 4
The data sharing clauses
Data can be shared by threads or private to each threads o Declare a list of private variables to each threads
o Declare a list of shared variables to all threads (default behaviour)
o Declare the default behavior for each variable
o None will return an error at compilation for each variables not explicitly precised
These clauses can be added to the directives o Parallel
o For/Do
o Single
www.caps-entreprise.com 196
#pragma omp directive private( variable [, variable])
#pragma omp directive shared( variable [, variable])
#pragma omp directive default(none | shared | private)
1/31/2014
99
Example using shared & private
#include <omp.h>
#define N 1000
main ()
{
int i;
float a[N], b[N], c[N];
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
#pragma omp parallel for shared(a,b,c) private(i)
for (i=0; i < n; i++)
c[i] = a[i] + b[i];
}
www.caps-entreprise.com 197
Critical part
The compilers and OpenMP
On Linux
o Gcc inputfile -fopenmp
o Icc inputfile -openmp
o Gfortran inputfile -fopenmp
o Ifort inputfile -openmp
On Windows
o Icc inputfile /Qopenmp
o Ifort inputfile /Qopenmp
www.caps-entreprise.com 198
1/31/2014
100
The single directive (1)
Only one thread executes this region of code
Second case : Any of the threads can execute this part, but
only one
www.caps-entreprise.com 199
Code to execute just once
Waiting time
The single directive (2)
A region delimited by these directives will be executed by only one of all the threads (master or slaves)
Only one thread executes this region
Synchronization at the end
www.caps-entreprise.com 200
#pragma omp parallel
{
…
#pragma omp single
{
…
}
…
}
//end of parallel region
!$OMP PARALLEL
! Parallel region
…
!$OMP SINGLE
!only one thread in
…
!$OMP END SINGLE
…
!$OMP END PARALLEL
1/31/2014
101
The master directive (1)
Only the master thread executes a region of code
www.caps-entreprise.com 201
Code to execute just once
The master directive (2)
A region delimited by these directives will be executed by the
master
Only the master thread executes this region
www.caps-entreprise.com 202
#pragma omp parallel
{
…
#pragma omp master
{
…
}
…
}
//end of parallel region
!$OMP PARALLEL
! Parallel region
…
!$OMP MASTER
!only the master in
…
!$OMP END MASTER
…
!$OMP END PARALLEL
1/31/2014
102
The Critical Region (1)
www.caps-entreprise.com 203
Start of critical region
End of critical region
Critical part
Waiting time
The Critical Region (2)
All threads execute the code, but only one at a time
A lightweight special form exists but apply only on the
following statement
www.caps-entreprise.com 204
#pragma omp critical
{
…
}
!$OMP CRITICAL
…
!$OMP END CRITICAL
#pragma omp atomic
…
!$OMP ATOMIC
…
1/31/2014
103
Example : critical & atomic
Is the same result in X at the end ?
www.caps-entreprise.com 205
PROGRAM CRITICAL
INTEGER X
X = 0
…
!$OMP PARALLEL SHARED(X)
…
!$OMP CRITICAL
X = X + 1
X = X * 2
!$OMP END CRITICAL
…
!$OMP END PARALLEL
…
END PROGRAM CRITICAL
PROGRAM ATOMIC
INTEGER X
X = 0
…
!$OMP PARALLEL SHARED(X)
…
!$OMP ATOMIC
X = X + 1
!$OMP ATOMIC
X = X * 2
…
!$OMP END PARALLEL
…
END PROGRAM ATOMIC
The performance clauses
To be sure to have performance, you can use this clause, so
at the runtime, you can adapt the behavior of your
application
At the beginning of a parallel region, you can also set the
number of threads
www.caps-entreprise.com 206
#pragma omp parallel if(cond)
{
…
}
#pragma omp parallel num_threads(n)
{
…
}
1/31/2014
104
The nowait clause
There are some implicit synchronization barriers at the end of the blocks marked by these directives o Do/For
o Single
To avoid the synchronization, put this clause
o The threads will directly continue after the end of the region, but beware of data dependencies
www.caps-entreprise.com 207
#pragma omp for nowait
for(i=0; i<N; i++){
…
}
Environment variables
Number of threads
Limit on the total number of threads in the system
o Default value is 1024
The stack size for each thread
o Default value is 4 MB on 32-bit systems and 8 MB on 64-bit systems
www.caps-entreprise.com 208
$ export OMP_NUM_THREADS=[0-9*]
$ export OMP_THREAD_LIMIT=n
$ export OMP_STACKSIZE=size
1/31/2014
105
Runtime information routines
To set the number of threads for the next parallel region
Get the number of available processors
Get time references
o Get the current time in seconds
o Get the timer resolution with omp_get_wtick
www.caps-entreprise.com 209
omp_set_num_threads(n)
int nb_procs = omp_get_num_procs()
double t = omp_get_wtime()
double t = omp_get_wtick()
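A small sketch showing how these routines are typically combined to time a parallel loop (the workload and array size are illustrative):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int i, n = 1000000;
    float *a = (float*) malloc(n * sizeof(float));

    omp_set_num_threads(omp_get_num_procs()); /* one thread per processor */

    double start = omp_get_wtime();           /* current time in seconds */
    #pragma omp parallel for shared(a, n) private(i)
    for (i = 0; i < n; i++)
        a[i] = i * 0.5f;
    double elapsed = omp_get_wtime() - start;

    printf("loop time: %g s (timer resolution: %g s)\n", elapsed, omp_get_wtick());
    free(a);
    return 0;
}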
Reduction (1)
Reduction applies to the following directives:
o for
o parallel
o sections
Declare a reduction:
o In C/C++ : reduction (operator: list)
o In fortran : reduction (operation|intrinsic : list)
The REDUCTION clause performs a reduction on the
variables that appear in the given list
www.caps-entreprise.com 210
1/31/2014
106
Reduction (2)
Restrictions on the variables in the list:
o must be named scalar variables
o must be declared SHARED in the enclosing context
Principle:
o For each thread a private copy of each list variable is created
o At the end of the reduction, the private copies are combined into one
scalar
o The final result is written to the global shared variable.
www.caps-entreprise.com 211
Example
www.caps-entreprise.com 212
#include <omp.h>
#define N 10000
int main(){
int i;
int tab[N];
int sum = 0;
init(tab);
#pragma omp parallel for reduction(+:sum)
for (i=0; i<N; i++)
{
sum += foo(i, tab[i]);
}
}
1/31/2014
107
Reduction Operator : C/C++
www.caps-entreprise.com 213
Operator Operation
+ Sum
- Subtraction
* Product
&& Logical AND
|| Logical OR
min Minimum (OpenMP 3.1)
max Maximum (OpenMP 3.1)
& Bitwise AND
| Bitwise OR
^ Bitwise Exclusive OR
Reduction Operator : Fortran
www.caps-entreprise.com 214
Intrinsic/Operation Operation
+ Sum
- Subtraction
* Product
.EQV. Equality
.NEQV. Non-equality
.AND. Logical AND
.OR. Logical OR
MIN Minimum
MAX Maximum
IAND Bitwise AND
IOR Bitwise OR
IEOR Bitwise Exclusive OR
1/31/2014
108
Sections
The sections directive specifies that the code in the enclosed section blocks is to be divided among the threads in the team
Each section is executed once
May be declared inside a parallel region
Declare parallel sections:
o omp sections [clause [[,] clause]…]
o omp section followed by a code block (repeated for each section)
o …
o omp end sections [nowait]
www.caps-entreprise.com 215
Sections : example
#pragma omp parallel
{
int tid = omp_get_thread_num();
#pragma omp sections nowait
{
#pragma omp section
{
printf("Thread %d doing section 1\n",tid);
workToDoInSection1();
}
#pragma omp section
{
printf("Thread %d doing section 2\n",tid);
workToDoInSection2();
}
}
}
www.caps-entreprise.com 216
1/31/2014
109
Task
Tasks are the most important change in OpenMP 3.0
Each thread encountering a task construct creates a task and adds it to a task pool
The task may be executed by the encountering thread, or deferred for execution by any other thread in the team
Declare a task: o omp task [clause [,clause] ]
Task synchronization: o omp taskwait
www.caps-entreprise.com 217
What can be done with task ?
Allows parallelizing irregular problems o Unbounded loops
o Recursive algorithms (see the sketch below)
o Producer/Consumer schemes
o …
Tasks are work units whose execution may be deferred
Tasks are executed by the threads of the team
A task is created each time a thread encounters a task construct
Tasks can be nested
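A classic sketch of the recursive case (naive Fibonacci, shown for illustration only): each recursive call becomes a task and taskwait waits for both children before combining their results:
#include <omp.h>

int fib(int n)
{
    int x, y;
    if (n < 2) return n;

    #pragma omp task shared(x)   /* child task computes fib(n-1) */
    x = fib(n - 1);
    #pragma omp task shared(y)   /* child task computes fib(n-2) */
    y = fib(n - 2);
    #pragma omp taskwait         /* wait for both child tasks */
    return x + y;
}

int fib_par(int n)
{
    int result;
    #pragma omp parallel
    {
        #pragma omp single       /* a single thread creates the root tasks */
        result = fib(n);
    }
    return result;
}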
www.caps-entreprise.com 218
1/31/2014
110
Task : example
int iter;
#pragma omp parallel
{
#pragma omp single nowait
{
iter = 25;
while( iter > 0)
{
#pragma omp task firstprivate(iter)
myfunction(iter);
iter = iter - 1;
}
}
#pragma omp taskwait
}
www.caps-entreprise.com 219
Hardware Accelerators
1/31/2014
111
Why Use Accelerators?
For applications containing hotspots
o Parts of code where the majority of the execution time is spent
If hotspots are highly parallel:
o The application can be accelerated
o Otherwise, the algorithm may have to be changed
Accelerators can also be viewed as low-power compute
nodes
o Make the application power efficient
Intel MIC Programming 221
Hardware Accelerator
In HPC, two kinds of accelerators are extremely popular
o GPUs, such as the K20 in Titan (2nd fastest supercomputer)
• Nvidia Tesla
• AMD Firepro
o Intel Xeon Phi, for example in Tianhe 2 (fastest supercomputer)
• Based on x86 technology
1/31/2014
112
GPUs
www.caps-entreprise.com 223
Today’s Hybrid/Heterogeneous Compute Node
Streaming engines (e.g. GPU) o Application specific
architectures (“narrow band”)
o Vector/SIMD
o Can be extremely fast
General purpose cores o Share a main memory
o Core ISA provides fast SIMD instructions
o Large cache memories
www.caps-entreprise.com 224
1/31/2014
113
Stream computing
Stream programming is well suited to GPU
o But memory hierarchy is exposed
A similar computation is performed on a collection of data (stream)
o There is no data dependence between the computation on different stream elements
www.caps-entreprise.com 225
What is GPGPU ?
General Purpose computation using GPU in applications other than 3D graphics
o GPU accelerates critical path of application
Data parallel algorithms leverage GPU attributes
o Large data arrays, streaming throughput
o Fine-grain SIMD parallelism
o Low-latency floating point (FP) computation
Applications – see //GPGPU.org
o Game effects (FX) physics, image processing
o Physical modeling, computational engineering, matrix algebra, convolution, correlation, wave propagation
www.caps-entreprise.com 226
1/31/2014
114
Previous GPGPU Constraints
Dealing with graphics API
o Working with the corner cases of the graphics API
Addressing modes
o Limited texture size/dimension
Shader capabilities
o Limited outputs
Instruction sets
o Lack of Integer & bit ops
Communication limited
o Between pixels
o Scatter a[i] = p
www.caps-entreprise.com 227
[Figure: legacy programmable shader pipeline — input registers feed the fragment program, which writes output registers; constants, texture and temporary registers are available per thread / per shader / per context, with results going to FB memory]
CUDA
1/31/2014
115
“Compute Unified Device Architecture”
General purpose programming model o User kicks off batches of threads on the GPU
o GPU = dedicated super-threaded, massively data parallel co-processor
Targeted software stack o Compute oriented drivers, language, and tools
Driver for loading computation programs into GPU o Standalone Driver - Optimized for computation
o Interface designed for compute - graphics free API
o Data sharing with OpenGL buffer objects
o Guaranteed maximum download & readback speeds
o Explicit GPU memory management
www.caps-entreprise.com 229
An Example of Physical Reality Behind CUDA
www.caps-entreprise.com 230
[Figure: typical PC architecture — the CPU and host memory hang off the northbridge; the GPU and its device memory are attached via PCI-e; disks and peripherals (ATA, SATA, SCSI, USB, …) hang off the southbridge]
1/31/2014
116
Parallel Computing on a GPU with Nvidia
NVIDIA GPU Computing Architecture o Via a separate HW interface
o In laptops, desktops, workstations, servers
The next GK110 GPUs (K20) will deliver up to 4.5 TFLOPS (SP) on compiled parallel C applications o 1 TFLOPS DP
Programmable in C with CUDA tools o Programming model scales transparently
o Multithreaded SPMD model uses application data parallelism and thread parallelism
www.caps-entreprise.com 231
Tesla K20
GeForce GTX 680
Tesla C2075
Extended C
Declspecs
o global, device, shared, local, constant
Keywords
o threadIdx, blockIdx
Intrinsics
o __syncthreads
Runtime API
o Memory, symbol, execution management
Function launch
www.caps-entreprise.com 232
__device__ float filter[N];
__global__ void convolve (float *image) {
__shared__ float region[M];
...
region[threadIdx] = image[i];
__syncthreads()
...
image[j] = result;
}
// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);
// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
1/31/2014
117
Compilation Path
www.caps-entreprise.com 233
[Figure: compilation path — nvcc splits a C/C++ CUDA application into CPU code and virtual PTX code; a PTX-to-target compiler then generates binaries for the physical GPU (GF100, GT200, …)]
Hardware Overview
Device contains:
o Multiprocessors
o Host access interface
o Memory
o 4 generation groups:
• 1.0, 1.1 (8800, 9800)
• 1.2, 1.3 (GTX220, C1060)
• 2.0 (C2050)
• 3.0 (K10), 3.5 (K20)
Multiprocessors contain:
o ALUs
o Registers
o Shared memory
o Access to local memory
o Access to global memory
www.caps-entreprise.com 234
[Figure: device block diagram — N multiprocessors, each with ALUs, registers, shared memory and local memory, all accessing the device's global, constant and texture memories; the host accesses the device through the host interface]
1/31/2014
118
Nvidia Architecture Overview
GT200 / GF100 / GK110
www.caps-entreprise.com 235
SMX
Memory Sizes accross GPU Generations
Specifications | 1.0-1.1 | 1.2-1.3 | 2.0 | 3.x
Multiprocessors | Up to 16 | Up to 30 | Up to 16 | Up to 14
ALU (SP) / Multipro. | 8 | 8 | 32 | 192
32-bit Registers / Multipro. | 8 k | 16 k | 32 k | 64 k
Shared Mem / Multipro. | 16 kB | 16 kB | 16 / 48 kB | 16 / 48 kB
Constant Memory | 64 kB | 64 kB | 64 kB | 64 kB
Local Memory | | | |
Global Memory | Up to 4 GB | Up to 4 GB | Up to 12 GB | Up to 12 GB
Cache on Global Mem | No | No | Yes (L1-L2) | Yes (L1-L2)
Size of L2 Cache | - | - | 768 kB | Up to 1536 kB
Size of L1 Cache / Multipro. | - | - | 16 / 48 kB | 16 / 48 kB
www.caps-entreprise.com 236
1/31/2014
119
Thread Batching:
Grids and Blocks
A kernel is executed as a grid of thread blocks
o All threads share the data memory space
A thread block is a batch of threads that can cooperate with each other by:
o Synchronizing their execution
• For hazard-free shared memory accesses
o Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
o Atomic operations added in HW 1.1
www.caps-entreprise.com 237
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block, e.g. Block (1,1), is a 2D array of threads]
Block and Thread IDs
Threads and blocks have IDs
o So each thread can decide what data to work on
o Block ID: 1D, 2D or 3D
o Thread ID: 1D, 2D or 3D
Simplifies memory addressing when processing multidimensional data
o Image processing
o Solving PDEs on volumes
o …
www.caps-entreprise.com 238
1/31/2014
120
Block and Thread keywords
Block keywords
o threadIdx.{x,y,z} defines the thread index inside the block
o blockDim.{x,y,z} defines the block dimensions
Grid keywords
o blockIdx.{x,y,z} defines the block index inside the grid
o gridDim.{x,y,z} defines the grid dimensions
www.caps-entreprise.com 239
[Figure: a grid of gridDim.x × gridDim.y blocks; inside Block (1,1), threads are indexed in 3D from (0,0,0) to (4,2,1), with blockDim.x giving the block extent along x]
Memory Space Overview
Each thread can:
o R/W per-thread registers
o R/W per-thread local memory
o R/W per-block shared memory
o R/W per-grid global memory
o Read only per-grid constant memory
o Read only per-grid texture memory
The host can:
o R/W global memory
o R/W constant memory
o R/W texture memory
www.caps-entreprise.com 240
[Figure: CUDA memory spaces — each thread has its registers and local memory, each block a shared memory, and the whole grid shares the global, constant and texture memories, which are also accessible from the host]
1/31/2014
121
Memory Allocation
• cudaMalloc()
o Allocates an object in the device global memory
o Requires two parameters
• Address of a pointer to the allocated object
• Size of the allocated object
• cudaFree()
o Frees an object from device global memory
o Requires the pointer to the freed object
www.caps-entreprise.com 241
CUDA Host-Device Data Transfer
• cudaMemcpy() o Memory data transfer o Requires 4 parameters
• Pointer to destination • Pointer to source • Number of bytes copied • Type of transfer
– Host to Host – Host to Device – Device to Host – Device to Device
• Asynchronous variant supported on 1.1+ HW
www.caps-entreprise.com 242
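A minimal host-side sketch (error checking omitted; the array size is illustrative) combining the allocation and transfer calls above:
#include <cuda_runtime.h>

#define N 1024

void copy_roundtrip(float *h_data /* N floats on the host */)
{
    float *d_data = NULL;
    size_t bytes = N * sizeof(float);

    cudaMalloc((void**)&d_data, bytes);                         /* allocate in device global memory */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  /* host -> device */
    /* ... launch kernels working on d_data ... */
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
    cudaFree(d_data);                                           /* release device memory */
}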
1/31/2014
122
CUDA Function Declarations
__global__ defines a kernel function
o Must return void
www.caps-entreprise.com 243
Executed on the: Only callable from the:
__device__ float DeviceFunc() device device
__global__ void KernelFunc() device host
__host__ float HostFunc() host host
CUDA Functions Declaration
Address of __device__ functions cannot be taken
For functions executed on the device:
o No recursion (HW < 2.0)
o Recursion possible for __device__ function (HW >= 2.0)
o No static variable declarations inside the function
o No variable number of arguments
www.caps-entreprise.com 244
1/31/2014
123
Calling a Kernel Function
Thread Creation
A kernel function must be called with an execution configuration:
o Any call to a kernel function is asynchronous, explicit synchronization needed for blocking
www.caps-entreprise.com 245
__global__ void KernelFunc(...);
dim3 DimGrid(100, 50); // 5000 thread blocks
dim3 DimBlock(8, 8, 4); // 256 threads per block
…
KernelFunc<<< DimGrid, DimBlock >>>(...);
…
cudaThreadSynchronize();
Compilation
Any source file containing CUDA language extensions must be compiled with nvcc
nvcc is a compiler driver o Works by invoking all the necessary tools and compilers like
cudacc, g++, cl, ...
nvcc can output: o Either C code
• That must then be compiled with the rest of the application using another tool
o Or object code directly
www.caps-entreprise.com 246
1/31/2014
124
Linking
Any executable with CUDA code requires one dynamic library: o The CUDA runtime library (cudart)
With gcc, you may need to link with the standard C++ library o libstdc++
www.caps-entreprise.com 247
Debugging : CudaGDB
On Linux or Mac OS X
Compile your application with nvcc and –g and –G options
Execute the debugger with : cuda-gdb
Possible to : o Get device information, gridDim and blockDim
o Break on the host and in the kernel
o Switch between CUDA Threads and host thread
Can be integrated to : o Emacs GUI
o DDD
Another available debugger o Allinea DDT
www.caps-entreprise.com 248
1/31/2014
125
GPU Debugging Pitfalls
But not all illegal program behavior can be caught
Conditions to Debug application on the local machine o Linux
• If single GPU, no Graphical Server running on the system
• 2 GPUs on the machine, 1 running the Graphical Server and the second running the CUDA program
o Windows
• Only possible if there are two GPUs
• 1 for the visualization
• 1 to debug the CUDA application
On a remote machine, no problem
www.caps-entreprise.com 249
Profiler
31/01/2014 www.caps-entreprise.com 250
1/31/2014
126
Parallel NSight
Available on Windows and Linux o Integrated to Microsoft
Visual Studio
o Integrated to Eclipse IDE
Debugging CUDA application o Using Microsoft Visual
Studio windows : Memory, Locals, Watches and Breakpoints
Analyzing the performance of your GPGPU applications o CUDA
o OpenCL
o DirectCompute
www.caps-entreprise.com 251
Warps
Each block of threads is split into warps
Each warp contains the same number of threads: 32
Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread #0
Each warp is executed by a multiprocessor in a SIMD fashion
www.caps-entreprise.com 252
1/31/2014
127
Warps (2)
Divergent branches within a warp cause serialization
o If all threads in a warp take the same branch, no extra cost
o If the threads of a warp take two different branches, the entire warp pays the cost of both branches of code
o If threads take n different branches, entire warp pays cost of n branches of code
www.caps-entreprise.com 253
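A small illustration (hypothetical kernel): the first condition diverges inside each warp, so both paths are serialized; the second is uniform within a 32-thread warp, so each warp takes a single path:
__global__ void divergence_example(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    /* Divergent: odd and even threads of the same warp take different paths */
    if (threadIdx.x % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    /* Non-divergent: the condition is identical for all 32 threads of a warp */
    if ((threadIdx.x / 32) % 2 == 0)
        data[tid] -= 3.0f;
    else
        data[tid] += 3.0f;
}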
Coalescing 1.0/1.1
A coordinated read by a half-warp (16 threads)
A contiguous region of global memory o 64 bytes: each thread reads a word (int, float, ...)
o 128 bytes: each thread reads a double-word (int2, float2, ...)
o 256 bytes: each thread reads a quad-word (int4, float4,...)
Additional restrictions on G8x/G9x architecture: o Starting address for a region must be a multiple of region size
o The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating o Predicated access, divergence within a half-warp
www.caps-entreprise.com 254
1/31/2014
128
Coalesced Access 1.0/1.1:
Reading floats
www.caps-entreprise.com 255
Uncoalesced Access 1.0/1.1:
Reading floats
www.caps-entreprise.com 256
1/31/2014
129
Coalesced Access 2.x
Cache on global memory may hide coalescing issues
2 levels of cache
o 16-48 kB of L1 per SM
o 768 kB of L2 shared by all SMs
Memory Latency
o Global: 400-800 cycles
o L2 Cache: 100-200 cycles
o L1 Cache: about 4 cycles (without bank conflict)
www.caps-entreprise.com 257
Shared Memory
Hundreds of times faster than global memory
o About same latency as registers
o 32 banks can be accessed simultaneously with 2.x compute capability
o Successive 32 bits words are assigned to successive banks
Threads of a same block can cooperate via shared memory
o Up to 48 kB per multiprocessor with 2.x compute capability
Can be used to avoid non-coalesced accesses
www.caps-entreprise.com 258
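A common sketch (hypothetical kernel, tile size illustrative): each thread stages one element into shared memory, the block synchronizes, then neighbours are read from the fast on-chip copy instead of global memory:
#define TILE 256

__global__ void shift_sum(const float *in, float *out)
{
    __shared__ float tile[TILE];            /* one element per thread of the block */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[gid];            /* coalesced load from global memory */
    __syncthreads();                        /* make all loads visible to the block */

    /* read the neighbour from shared memory instead of global memory */
    float right = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1] : 0.0f;
    out[gid] = tile[threadIdx.x] + right;
}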
1/31/2014
130
Shared Memory:
Performance Issues
The fast case:
o If all threads of a half-warp (or warp with cc 2.x) access different banks, there is no bank conflict
o If all threads of a half-warp (or warp with cc 2.x) read the identical address, there is no bank conflict (broadcast)
The slow case:
o Bank conflict: multiple threads in the same half-warp (or warp with cc 2.x) access different addresses in the same bank
o Must serialize the accesses
o Cost = max # of simultaneous accesses to a single bank
www.caps-entreprise.com 259
Shared Memory Access
www.caps-entreprise.com 260
Access pattern with no bank conflicts: each thread of the half-warp accesses a different bank
1/31/2014
131
Shared Memory Access (2)
www.caps-entreprise.com 261
All threads read the same address in the same bank: no conflict (broadcast)
Threads accessing different addresses in the same bank: conflict!
Optimizing Threads per Block
Choose the number of threads per block as a multiple of the warp size o Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding o But kernel invocations can fail if too many registers are used
Heuristics o Minimum required by the HW: 64 threads per block
• Only if multiple concurrent blocks
o 192 or 256 threads is a better choice
• Usually still enough registers to compile and invoke successfully
o This all depends on your computation, so experiment!
www.caps-entreprise.com 262
1/31/2014
132
Grid/Block Size Heuristics
# of blocks > # of multiprocessors o So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2 o Multiple blocks can run concurrently in a multiprocessor
o Blocks that aren't waiting at a __syncthreads() keep the hardware busy
o Subject to resource availability - registers, shared memory
# of blocks > 100 to scale to future devices o Blocks executed in pipeline fashion
o 1000 blocks per grid will scale across multiple generations
www.caps-entreprise.com 263
Asynchronicity & Overlapping
Default CUDA API
o Kernel launches are asynchronous with CPU
o Memcopies block CPU thread (H2D=HostToDevice,
D2H=DeviceToHost)
o CUDA calls are sequential on GPU, serialized by the driver
But CUDA also offers asynchronicity and overlapping
o Asynchronous memcopies (D2H, H2D) with CPU
o Ability to concurrently execute a kernel and a memcopy
www.caps-entreprise.com 264
1/31/2014
133
Page-locked Memory, Principles (1)
Operating systems handle memory with a mechanism called
paged virtual memory:
o Divides the virtual address space of an application into memory pages
(default on x86 is 4 KiBytes)
o Allows applications to use more memory than the physical RAM
available on the system, by swapping pages to a disk
o The physical address of the page can change; this is transparent to the
application, as the virtual address does not change
Pages can be locked by the OS into physical memory to prevent
swapping and to guarantee a permanent physical address
www.caps-entreprise.com 265
Page-locked Memory, Principles (2)
A PCI-express device can only directly access physical addresses, never an application's virtual address space o So only page-locked memory can be directly exploited by the
hardware
CUDA allows the application to request page-locked memory from the CUDA kernel driver
Both the application and the device can use directly such memory o No need for time-consuming intermediate copies between the
application virtual address space and the device's on-board memory
www.caps-entreprise.com 266
1/31/2014
134
CUDA page-locked memory
All CUDA versions allow the application to request page-locked
memory, often called pinned memory
o No other applications, not even the OS, can use the locked pages.
Do not use too much page-locked memory!
All CUDA memory copy functions take advantage of pinned
memory
Pinned memory is a prerequisite for asynchronous memory
copies
www.caps-entreprise.com 267
Different way to use Page-Locked Memory
Allocation directly in Page-Locked Memory
www.caps-entreprise.com 268
//Allocate the data in physical RAM
cudaMallocHost((void**) &hostPtr, size);
…
cudaFreeHost(hostPtr);
//Do not forget it or the data will stay alive in your Main memory
1/31/2014
135
Asynchronicity (1)
Synchronous execution example:
o The application waits for the GPU to complete the requested task.
www.caps-entreprise.com 269
Asynchronicity (2)
The asynchronous version:
o Control is returned to the application before the device has completed the requested task.
www.caps-entreprise.com 270
1/31/2014
136
Asynchronicity (3)
Advantages
o Enables full exploitation of the hardware available on the machine
(CPU + GPU together)
o Kernel launches are already asynchronous, no need to modify your
code
Drawback
o Needs explicit synchronization for data coherency
o Transfers require extra work to setup asynchronicity
But speed benefit already makes the extra work useful
www.caps-entreprise.com 271
Overlapping
Concurrent execution of GPU kernel and transfers from/to GPU o Makes use of asynchronicity
o Particularly handy when data makes frequent, expensive round-trips between CPU and GPU
Typical cases o Several independent problems
o Several instances of a problem
o A single problem split into a set of sub-problems
Requires to use streams in your CUDA code
www.caps-entreprise.com 272
1/31/2014
137
Basics of CUDA Streams (1)
You said “stream”?
o A sequence of operations that execute in order on GPU
Streams have the following properties:
Streams use asynchronicity
o Concurrent execution between CPU and GPU
Streams enable overlapping
o Concurrent execution of a kernel on the GPU and
o transfers from or to the GPU
www.caps-entreprise.com 273
Basics of CUDA Streams (2)
How it works:
Operations from different streams can be interleaved
A kernel and a memcopy from different streams can be overlapped
www.caps-entreprise.com 274
1/31/2014
138
Code Example
// data allocation
float * hostPtr;
cudaMallocHost((void**) &hostPtr, 2 * size);
// streams declaration
cudaStream_t stream[2];
for(int i = 0; i < 2; ++i)
cudaStreamCreate(&stream[i]);
// streamed copy from host to device
for(int i = 0; i < 2; ++i)
cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
cudaMemcpyHostToDevice, stream[i]);
// streamed execution of the kernel
for(int i = 0; i < 2; ++i)
myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
// streamed copy from device to host
for(int i = 0; i < 2; ++i)
cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
cudaMemcpyDeviceToHost, stream[i]);
// threads synchronization
cudaThreadSynchronize();
www.caps-entreprise.com 275
Using multiple CUDA Accelerators with MPI
#CUDA accelerators > #cores o Multiple MPI processes per core
(beware of CPU overload)
#CUDA accelerators == #cores o The ideal case: generally one
MPI process per core and GPU
o CPU may be idle while GPU is working
#CUDA accelerators < #cores o Share the GPUs?
o Lock the GPUs?
o Load Balancing CPU & GPU?
www.caps-entreprise.com 276
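A common sketch for the one-MPI-process-per-GPU case (assuming ranks on a node are numbered consecutively): each rank selects its device from its rank:
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, ndevices;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndevices);
    cudaSetDevice(rank % ndevices);   /* map MPI processes round-robin onto the GPUs */

    /* ... allocate device memory, launch kernels, exchange results with MPI ... */

    MPI_Finalize();
    return 0;
}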
1/31/2014
139
Resident Data
Think differently : instead of
Use resident data mechanism
www.caps-entreprise.com 277
[Figure: instead of transferring data to the GPU before each kernel and back after it, with resident data the input is sent once, Kernel A and Kernel B both work on the GPU copy, and the result is read back once]
Reducing Transfers
Use GPU-resident data as much as possible o Send once, use many times, read once
o Can tremendously boost performance
o Transfers can easily be the dominant factor in GPU usage
• Then follow Amdahl’s Law by optimizing transfers rather than kernels
Examples o Multiple steps of computations in a loop
o Multiple steps of computations in sequence
Do everything requiring the resident data on the GPU if possible o Unless the computations do not fit GPU at all
www.caps-entreprise.com 278
1/31/2014
140
Partial transfers
Think differently : instead of
Use partial transfer mechanism
www.caps-entreprise.com 279
[Figure: instead of moving the whole data set around each kernel, partial transfers move only the slices needed for I/O while Kernel A and Kernel B keep working on the GPU-resident data]
Minimizing Quantities
Again, maximize resident data, this time by keeping sub-
arrays on the GPU
o Send once, use and update many times, read once
o If some data must absolutely come from outside the GPU
o If some data must absolutely go outside the GPU
Network or disk I/O
Computation steps that cannot be implemented on the GPU
Warning: each data transfer has an initial overhead
www.caps-entreprise.com 280
1/31/2014
141
Reducing Transfers
The GPU computes faster than it performs transfers o Sometimes it is better to re-compute data than to retrieve it from a remote
memory
Don't try to factorize data to save memory, think performance o Memory saving is often a performance killer
• Allocate more memory to re-align data in the GPU's global memory
• Allocate more memory to avoid bank conflicts in shared memory
• Re-compute data to avoid transfers…
Avoid computing borders on the GPU o Border cases are often performance killers due to:
• Incomplete warps
• Branch divergence
• Incomplete coalesced segments
www.caps-entreprise.com 281
CuComplex Header
Complex numbers : CuComplex Header
o Single or double precision (HW >= 1.3)
o Include cuComplex.h
www.caps-entreprise.com 282
1/31/2014
142
CuBLAS Library
Basic Linear Algebra Subprograms
o Include cublas.h
o Link with libcublas.so (linux) or cublas.dll (windows)
o Up to BLAS3 (same arguments)
Available functions
o Dot-product : cublasXdot()
o Matrix multiplication : cublasXgemm()
o …
User guide : http://developer.nvidia.com/cuda-toolkit-40
www.caps-entreprise.com 283
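A minimal sketch using the legacy cublas.h API (error handling omitted; function and variable names other than the cuBLAS/CUDA calls are illustrative):
#include <cuda_runtime.h>
#include <cublas.h>

float dot_on_gpu(const float *h_x, const float *h_y, int n)
{
    float *d_x, *d_y, result;
    size_t bytes = n * sizeof(float);

    cublasInit();                                    /* initialize the legacy cuBLAS library */
    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    result = cublasSdot(n, d_x, 1, d_y, 1);          /* dot product computed on the GPU */

    cudaFree(d_x);
    cudaFree(d_y);
    cublasShutdown();
    return result;
}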
CuFFT Library
Fast Fourier Transform
o Include cufft.h
o Link with libcufft.so (linux) or cufft.dll (windows)
o 1D, 2D or 3D
Datatype
o Real or complex type
o Single or double precision (HW >= 1.3)
User guide : http://developer.nvidia.com/cuda-toolkit-40
www.caps-entreprise.com 284
1/31/2014
143
Thrust
Templated Performance Primitives Library for CUDA
Similar to the C++ STL
Available functionality
o Containers
o Iterators
o Sort
o Scan
o Reduction
o …
www.caps-entreprise.com 285
NPP Library
NVIDIA Performance Primitives library
GPU-accelerated image, video, and signal processing
functions
5x to 10x faster performance than CPU
Available functions
o Filter functions
o JPEG functions
o Geometry transforms
o Statistics functions
o …
www.caps-entreprise.com 286
1/31/2014
144
OpenCL
Before OpenCL
GPGPU o Vertex / pixel shaders
o Heavily constrained and not well adapted
CTM / Brook o Then Brook+
o Then CAL/IL
CUDA o Widely spread
None of these technologies is hardware-agnostic o Portability is not possible
www.caps-entreprise.com 288
1/31/2014
145
What is Hybrid Computing with OpenCL?
OpenCL is o Open, royalty-free, standard
o For cross-platform, parallel programming of modern processors
o An Apple initiative
o Approved by Intel, Nvidia, AMD, etc.
o Specified by the Khronos group (same as OpenGL)
It intends to unify the access to heterogeneous hardware accelerators o CPUs (Intel i7, …)
o GPUs (Nvidia GTX & Tesla, AMD/ATI 58xx, …)
What’s the difference with CUDA or CAL/IL? o Portability over Nvidia, ATI, S3… platforms + CPUs
www.caps-entreprise.com 289
OpenCL Devices
NVIDIA
o All CUDA cards
AMD GPUs
o Radeon & Radeon HD
o FirePro, FireStream
o Mobility…
Intel & AMD CPUs
o X86 w/ >= SSE 3.x
Cell/B.E.
DSP
ARM
www.caps-entreprise.com 290
1/31/2014
146
Inputs/Outputs with OpenCL programming
OpenCL architecture
www.caps-entreprise.com 291
Application
OpenCL kernels
OpenCL framework
OpenCL C language OpenCL API
OpenCL runtime
Driver
GPU hardware
OpenCL and C for CUDA
www.caps-entreprise.com 292
PTX
GPU
Entry point for developers who prefer high level C
Entry point for developers who prefer low level API
Shared backend compiler and optimization technology
C for OpenCL
C for CUDA
1/31/2014
147
Compilation & Execution
Really simple
Include the OpenCL o #include <CL/cl.h>
Link with the OpenCL library
To execute
www.caps-entreprise.com 293
$ g++ -o myprogram myprogram.cc –L/PATH/TO/OPENCLLIB -lOpenCL
$ ./myprogram
OpenCL APIs
C language API o Binding C++ (official)
o Binding Java
o Binding Python
o …
In the remainder we will only see the C API o And lab sessions focus on the C API
The C++ API is available on the Khronos Website o http://www.khronos.org/registry/cl/
Extensions exist to o OpenGL
o Direct3D
www.caps-entreprise.com 294
1/31/2014
148
Platform Model
Model consists of one or more interconnected devices
Computations occur within the Processing Elements of each device
www.caps-entreprise.com 295
Platform Version
3 different kind of versions for an OpenCL device
The platform version
o Version of the OpenCL runtime linked with the application
The device version
o Version of the hardware
The language version
o Highest revision of the OpenCL standard that this device supports
www.caps-entreprise.com 296
1/31/2014
149
Execution Model
Kernels are submitted by the host application to devices through command queues
Kernel instances, called Work-Items (WI), are identified by their point in the NDRange index space o This enables parallelizing the execution of the kernels
But still 2 programming models are supported o Data-parallel
o Task parallel
So even with a single execution model, there are two different programming approaches depending on the paradigm we are considering
www.caps-entreprise.com 297
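A minimal host-side sketch of the objects needed before any kernel can be submitted (error checks omitted; the device type is illustrative):
#include <CL/cl.h>

int setup_opencl(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);                           /* first available platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); /* first GPU device */
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    queue = clCreateCommandQueue(context, device, 0, &err);         /* in-order queue by default */

    /* ... create buffers, build the program, enqueue kernels and transfers ... */

    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return (err == CL_SUCCESS);
}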
NDRange
NDRange is a N-Dimensional index space
o N is 1, 2 or 3
o NDRange is defined by an integer array of length N specifying the
extent of the index space on each dimension
www.caps-entreprise.com 298
1/31/2014
150
Work-Groups & Work-Items
Work-Items are organized into Work-Groups (WG)
Each Work-group has a unique global ID in the NDRange
Each Work-item has
o A unique global ID in the NDRange
o A unique local ID in its work-group
www.caps-entreprise.com 299
Parallelism Grains
CPU cores can handle only a few tasks
o But more complex
• Hard control flows
• Memory cache
o They can be either CPU threads or processes
• CPU threads: OpenMP, Pthread
• CPU Processes: MPI, fork()…
GPU threads are extremely lightweight
o Very little creation overhead
o Simple and regular computations
o GPU needs 1000s of threads (w.i.) for full efficiency
www.caps-entreprise.com 300
1/31/2014
151
Memory Model
Four distinct memory regions o Global Memory
o Local Memory
o Constant Memory
o Private Memory
Global and Constant memories are common to all WI o May be cached depending on the hardware capabilities
Local memory is shared by all WI of a WG
Private memory is private to each WI
www.caps-entreprise.com 301
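A small OpenCL C sketch (illustrative kernel) showing where the four regions appear: __global and __constant buffers are visible to all work-items, __local memory is shared inside a work-group, and automatic variables are private to each work-item:
__kernel void saxpy(__global float *y,
                    __global const float *x,
                    __constant float *alpha,     /* constant memory, read-only for all WI */
                    __local float *scratch)      /* local memory, shared within the WG */
{
    int gid = get_global_id(0);                  /* private variables, one per work-item */
    int lid = get_local_id(0);

    scratch[lid] = x[gid];                       /* stage the value in local memory */
    barrier(CLK_LOCAL_MEM_FENCE);                /* synchronize the work-group */

    y[gid] = (*alpha) * scratch[lid] + y[gid];   /* result written back to global memory */
}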
Memory Architecture
www.caps-entreprise.com 302
1/31/2014
152
Memory Transfer (1)
2 types of transfers o Blocking (“synchronous”)
o Non-Blocking (“asynchronous”)
In the clEnqueueReadBuffer / clEnqueueWriteBuffer functions, set the “blocking” argument to: o CL_TRUE, to make a blocking transfer
o CL_FALSE, to make a non-blocking transfer
For a non-blocking transfer o Need to link an event to the transfer command
o The event will be used for producer-consumer relationship, and/or explicit waiting
www.caps-entreprise.com 303
Memory Transfer (2)
Synchronizing ensures that data have been transferred to/from the device at this point
Example
Or can be used for the dependency flow in out-of-order queues o Use clEnqueueWaitForEvents() to synchronize in-order queues
www.caps-entreprise.com 304
cl_int clWaitForEvents (cl_uint num_events, const cl_event *evt_list)
cl_event evt;
err = clEnqueueReadBuffer( cmd_queue, buf_on_device, CL_FALSE, 0,
size, buf_on_host, 0, NULL, &evt );
clFlush(cmd_queue);
… //some work that doesn’t change the content of buf_on_host
clWaitForEvents(1, &evt);
… //work that may change buf_on_host (after transfer)
1/31/2014
153
Queue Policy
A command queue is linked to a specific device
By default the command queue is in-order
o But you can use this option to make it out-of-order
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
www.caps-entreprise.com 305
Out-of-order Queue
With out-of-order queues
o No guarantee about the execution order
o So we need events dependence to ensure producer/consumer
relationship
When there is a dependence between commands
o Link cl_event objects to these commands
o Create list of events for this dependence
o Enqueue the command with its list of dependences
o The command will be launched only when all listed events have
terminated
Larger granularity : barrier
o Force waiting for all commands before the barrier to complete
www.caps-entreprise.com 306
1/31/2014
154
Synchronizing Queues and Commands
May I synchronize an event according others in…
• OO: Out-of-Order
• IO: In-Order
• ✓ available
www.caps-entreprise.com 307
(columns: the same OO queue | another OO queue | the same IO queue | another IO queue)
clWaitForEvents() ✓ ✓ ✓ ✓
clEnqueueWaitForEvents() ✓ ✓ Useless ✓
clEnqueueBarrier() ✓ ✓
Intel® Xeon PhiTM
1/31/2014
155
Phi Specifications
Discrete accelerators
Connected to host by PCIe
Passively or actively cooled
Embeds 50+ 64-bit x86 CPU cores
Has its own DRAM
Runs its own Linux OS
www.caps-entreprise.com 309
Host CPU
PCIe
Phi Products
Model | 3120P / 3120A | 5110P / 5120D | 7110P / 7120X
Max. # of cores | 57 | 60 | 61
Frequency (GHz) | 1.1 | 1.053 | 1.238
Cache (MB) | 28.5 | 30 | 30.5
Memory capacity (GB) | 6 | 8 | 16
Memory bandwidth (GB/s) | 240 | 320 / 352 | 352
Peak DP (TFLOPS) | 1.003 | 1.011 | 1.208
TDP (W) | 300 | 225 / 245 | 300
Cooling | Passive / Active | Passive / Dense Form Factor | Passive / None
Applications | Compute-bound workloads (Monte Carlo, Black-Scholes, …) | Memory bandwidth-bound workloads (STREAM, …) | Supercomputing centers
www.caps-entreprise.com 310
1/31/2014
156
Architectures Comparison
311
[Figure: CPU — a few large cores with big control logic and caches, a general-purpose architecture; MIC — many simpler cores forming a power-efficient multiprocessor; GPU — a very large number of small ALUs, massively data parallel; each with its own DRAM]
www.caps-entreprise.com
Coprocessor Card Design
Up to 16 channels of GDDR5 memory
Up to 8 GB GDDR5 (350 GB/s peak)
PCIe Gen2 compliant
Flash memory for coprocessor startup
System Management Controller (SMC) handles monitoring and control chores
www.caps-entreprise.com 312
1/31/2014
157
Microarchitecture of the Entire Coprocessor
Can be viewed as: o a symmetric multiprocessor
(SMP)
o with a shared uniform memory access (UMA) system
Up to 61 cores interconnected by a ring interconnect (ODI) o Transactions are managed
transparently by the hardware
512 kB (L2) cache per core o Cache coherency across the
entire multiprocessor thanks to TD mechanism
o Data remains consistent without software intervention
www.caps-entreprise.com 313
Individual Coprocessor Core Architecture
22nm using Trigate transistors
Designed for high-level of power-efficient parallelism o Based on Intel architecture for programmability
64-bit execution
In-order code execution model with multithreading
4 hardware threads per core
Clocked at ~1GHz
32 kB of L1 instruction cache
32 kB of L1 data cache
512 kB of private (local) L2 cache
Instructions are fetched from memory, decoded, dispatched and executed either by o A scalar unit using traditional x86 and x87
instructions
o A vector processing unit (VPU) using the Intel Initial Many Core Instructions (IMCI) utilizing a 512-bit wide vector length
• No support for MMX instructions, Intel AVX or SSE extensions
www.caps-entreprise.com 314
[Figure: core block diagram — instruction fetch and decode feeding a scalar unit and a vector unit with their register files, 32 kB L1 instruction and data caches, a 512 kB local L2 cache slice, and the on-die interconnect]
1/31/2014
158
Instruction and Multithread Processing
Derived from Pentium design
Two instructions per clock cycle o One on the U-pipe
o One on the V-pipe
The V-pipe cannot execute all instruction types o Vector instructions are mainly
executed on the U-pipe
The instruction decoder is a two-cycle fully pipelined unit o Single-threaded code can use at most
50% of the decode throughput
o At least 2 threads should be run per core
Instruction latency: 4 cycles
www.caps-entreprise.com 315
Xeon Phi Multithreading Capabilities
Xeon Phi has 4 hardware threads
o Xeon Phi can handle 4 live threads at the same time on each core
o Unlike Hyper-threading enabled CPUs, hardware threads cannot be
turned off
Intended to hide latencies
o Memory accesses and computations
o Inherent to in-order architecture
Maximum performance may be reached before that number
o Saturation may happen with two or three threads only
316 www.caps-entreprise.com
1/31/2014
159
Cache Memory Considerations
Cores can supply L2 cache data to each other on-die: data may be replicated
If no data or code is shared between the cores o L2 size is 30.5 MB (61 cores)
If every core shares the same code and data o L2 size is 512 kB
L2 usable size depends on how code and data are shared among the cores
L1 cache latency: 1 cycle
L2 cache latency: 15-30 cycles
GDDR5 latency: 500-1000 cycles
317 www.caps-entreprise.com
VPU Architecture
8 mask registers for SIMD lane predicated execution
Extended Mathematical Unit (EMU) for SP exponent, logarithm, reciprocal and reciprocal square root operations
IEEE 754 2008 compliant
318
[Figure: the VPU has 32 vector registers (V0…V31) of 512 bits and 8 vector mask registers (K0…K7) of 16 bits, processing 16 SP or 8 DP elements per cycle]
www.caps-entreprise.com
1/31/2014
160
Vectorization Brings Performance
Maximizing vectorization is as important as using enough
threads
Single precision
o 1.1 GHz * 61 cores * 16 lanes * 2 flops (FMA) = 2.147 TFLOPS
Double precision
o 1.1 GHz * 61 cores * 8 lanes * 2 flops (FMA) = 1.074 TFLOPS
319
VPU
www.caps-entreprise.com
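A minimal sketch (the function name and flags other than -mmic/-openmp are illustrative): OpenMP threads exploit the cores while the simple loop body lets the compiler vectorize each thread's chunk onto the 512-bit VPU lanes:
#include <omp.h>

void scale_add(float *y, const float *x, float a, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];     /* candidate for 16-wide SP vectorization */
}
Compiled natively for the coprocessor with, for example, icc -mmic -openmp -O3; the vectorization report option (-vec-report) shows whether the loop was vectorized.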
Coprocessor Software Overview
Software tools are similar to those available on the host
Development Tools and Runtimes o Intel Parallel Studio XE 2013
• Intel Composer: Intel C/C++ and Intel Fortran compilers, parallel debugger, performance libraries
• Intel Inspector: identifies memory errors and threading errors
• Intel Advisor: helps programmer parallelize their code
• Intel Amplifier: performance profiler
Intel Manycore Platform Software Stack (Intel MPSS) o Specific to the coprocessor
• Middlewares
• Device drivers
• Coprocessor management utilities
• Linux OS
• GNU tools (gcc, gdb, …)
320 www.caps-entreprise.com
1/31/2014
161
Coprocessor Linux OS
Intel Xeon Phi runs an autonomous linux OS
o Linux kernel version 2.6.34 or greater
o Minimal, with small memory footprint
• Includes Linux Standard Base (LSB) Core libraries and Busybox minimal
shell environment
o Controlled by the host via the PCIe bus
The host boots the card and provides the linux boot image
Be careful, memory on the Xeon Phi is volatile
o Data is lost each time the card is rebooted
321 www.caps-entreprise.com
MPSS Boot
Host coprocessor driver mic.ko:
o Provides PCIe access
o Loads linux kernel into the accelerator’s memory
o Starts booting
mpssd daemon
o Controls booting based on configuration files
o mpssd is a linux service
micctrl application
o Configures the linux OS on the coprocessor
322 www.caps-entreprise.com
1/31/2014
162
Coprocessor Startup Configuration
Enable root access
o Add SSH keys in /root/.ssh
Generate default configuration files
o default.conf and micN.conf in /etc/sysconfig/mic
Start MPSS service
323
user@host $ sudo micctrl --initdefaults
user@host $ sudo service mpss start
www.caps-entreprise.com
Coprocessor Administration
Use micctrl utility
o Check coprocessor(s) status
o Check coprocessor(s) configuration
o Re-boot coprocessor(s)
o Shut down coprocessor(s)
o Add user
324
user@host $ micctrl --config
user@host $ micctrl -s
user@host $ micctrl -R [mic coprocessorlist]
user@host $ micctrl -S [mic coprocessorlist]
user@host $ micctrl --useradd=<name> \
--sshkeys=<keydir> [mic coprocessorlist]
www.caps-entreprise.com
1/31/2014
163
Available Execution Models
Native execution model o Application is compiled for and executed on Xeon Phi
o Or application is both compiled for host and Xeon Phi
• May run on both architectures
• May introduce communications between both versions
• Requires architecture-agnostic infrastructure
o Original application can be reused
Processor-centric execution model o Application runs on the host
o And offloads selected parts of code onto Xeon Phi
o Communications are driven by the host
o Application has to be modified
325 www.caps-entreprise.com
Available Execution Models
Intel MIC
Native
MPI OpenMP
Accelerator
Intel Offload
Intel MKL OpenCL OpenACC
326 www.caps-entreprise.com
1/31/2014
164
Intel Offload
OpenMP Specifications
Currently: version 3.1
o Released in 2011
o Does NOT support accelerators
Heading toward version 4.0
o Released
o Supports accelerators
Intel expects that OpenMP 4.0 will reuse these offload directives
and intends to support it once finalized
Intel MIC Programming 328
1/31/2014
165
Intel Offload Directive Model
Syntax o In C:
o In FORTRAN:
Implements following behaviors: o Coprocessor memory allocation
o Data transfer from host to coprocessor
o Execution on coprocessor
o Data transfer from coprocessor to host
o Coprocessor memory deallocation
Intel MIC Programming 329
#pragma offload <clauses>
<statement>
!DIR$ offload <clauses>
<statement>
Offload Computations to the Coprocessor
Intel MIC Programming 330
! Fortran OpenMP
!dir$ omp offload target(mic)
!$omp parallel do
Do i=1, count
A(i)=B(i)*c+d
End do
!$omp end parallel do
// C/C++ OpenMP
#pragma offload target(mic)
#pragma omp parallel for
for(i=0; i<count; i++)
{
a[i]=b[i]*c+d;
}
Next statement can execute on coprocessor “mic” if available, else processor
Next OpenMP parallel construct can execute on coprocessor “mic” if available, else processor
1/31/2014
166
Function and Variables on Coprocessor
Compile functions for, or allocate variables on, both the host
and the coprocessor
In C
In FORTRAN
Intel MIC Programming 331
__attribute__((target(mic))) <var/function>
__declspec(target(mic)) <var/function>
!DIR$ attributes offload:<mic>::<var/function>
#pragma offload_attribute(push,target(mic))
code
#pragma offload_attribute(pop)
Copy Clauses
Clauses Syntax Semantics
Inputs in(var-list : modifiers) Copy from the host to the accelerator
Outputs out(var-list : modifiers) Copy from the accelerator to the host
Inputs / Outputs inout(var-list : modifiers) Copy to the accelerator before offloading computations and back afterwards
Non-copied data nocopy(var-list : modifiers) Data is local to accelerator target
Intel MIC Programming 332
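A small illustration of these clauses (array names and sizes are illustrative): a and b are inputs, c is an output, and tmp lives only on the coprocessor:
__attribute__((target(mic))) float tmp[1000];   /* exists on both host and coprocessor */

void compute(float *a, float *b, float *c, int n)
{
    #pragma offload target(mic) in(a, b : length(n)) out(c : length(n)) nocopy(tmp)
    {
        int i;
        for (i = 0; i < n; i++) {
            tmp[i % 1000] = a[i] * b[i];   /* scratch data stays on the coprocessor */
            c[i] = tmp[i % 1000] + a[i];
        }
    }
}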
1/31/2014
167
Modifier Options
Modifiers Syntax Semantics
Specify pointer length length(N) Transfer N elements of the pointer’s type (N*sizeof(element) bytes). It cannot be used to offload part of an array
Control target data allocation alloc(array section reference) Limits memory allocation to the shape of the array specified
Control pointer memory allocation alloc_if(condition) Allocate memory for the pointer if condition is true
Control freeing of pointer memory free_if(condition) Free memory used by pointer if condition is true
Move data from one variable to another
into(array section reference) Transfers data from a variable on the host to another variable on the coprocessor and vice versa
Control target data alignment align(expression) Specify minimum memory alignment on accelerator
Intel MIC Programming 333
Allocating Partial Arrays in C
In the example above:
o 7000 integers are allocated on the host
o 6000 integers are allocated on the coprocessor
o The first element on the coprocessor has index 10
o The last element on the coprocessor has index 6009
o 500 elements, in the range ptr[100]-ptr[599], are copied on the
coprocessor
Intel MIC Programming 334
int *ptr = (int*) malloc(7000*sizeof(int));
#pragma offload in(ptr[100:500]:alloc(ptr[10:6000]))
{ … }
1/31/2014
168
Allocating Partial Arrays in FORTRAN
In the example above:
o 30 integers are allocated on the host
o 18 integers are allocated on the coprocessor
o The first element on the coprocessor has index 3
o The last element on the coprocessor has index 20
o 8 elements, in the range T(7)-T(14), are copied on the coprocessor
Intel MIC Programming 335
INTEGER :: T(30)
!DIR$ OFFLOAD IN(T(7:14):ALLOC(T(3:20)))
…
Moving Data Example
The example above performs a copy of the first 1500
elements of t1 to the last 1500 elements of t2
Intel MIC Programming 336
int t1[2000], t2[5000];
#pragma offload in(t1[0:1500]:into(t2[3500:1500]))
…
1/31/2014
169
Memory Management
By default, directives allocate fresh memory on the Xeon Phi
memory for each variable when entering the construct
By default, memory is deallocated when exiting the construct
Memory allocation is expensive: modifiers can change the
default behavior to reuse memory space
o If data has been allocated by a previous offload construct
o If data has been allocated by an attribute directive
o If data should be reused by an other offload construct
Intel MIC Programming 337
Persistent Storage Example
Intel MIC Programming 338
int nb = 1000;
int *t = (int*) malloc(nb*sizeof(int));
void bar()
{
#pragma offload in(t[0:nb]:alloc_if(1))
foo(t, nb);
}
void foo(int * ptr, int size)
{
#pragma offload in(ptr[0:size]:alloc_if(0))
…
}
Allocation of t of size nb on the coprocessor
Reuse of t already on the coprocessor and free t at the end of offload section
1/31/2014
170
Static and Dynamic Memory Example
Intel MIC Programming 339
__declspec(target(mic)) int array_host_mic[5000];
int array_host[5000];
void bar()
{
foo(&array_host[0], 5000);
foo(&array_host_mic[0], 5000);
}
void foo(int *t, int nb)
{
#pragma offload in(t[0:nb]:alloc_if(0))
…
}
Dynamic allocation of t on the coprocessor
Reuse of static allocation of array_host_mic on the coprocessor
Allocation on the processor and the coprocessor
Asynchronous Behavior
By default, Intel Offload directives cause the host thread to wait for the completion of the Xeon Phi instruction before going on to the next statement
Asynchronous behavior can be used specifying a signal clause to the offload directive
An offload_wait directive should be used to ensure completion
Intel MIC Programming 340
[Figure: timelines comparing the synchronous case, where the CPU waits for the MIC at each offloaded step, with the asynchronous case, where CPU and MIC steps overlap]
1/31/2014
171
Asynchronous Computations Example (C)
341
char sig;
int count = 1000;
__attribute__((target(mic))) void mic_compute();
do {
#pragma offload target(mic) signal(&sig)
{
mic_compute();
}
host_activity();
#pragma offload_wait(&sig)
count = count – 1;
} while(count > 0);
www.caps-entreprise.com
Asynchronous Computations Example (FORTRAN)
342
integer sig
integer count
count = 1000
!dir$ attributes offload:mic::mic_compute
do while (count .gt. 0)
!dir$ offload target(mic:0) signal(sig)
call mic_compute()
call host_activity()
!dir$ offload_wait target(mic:0) wait(sig)
count = count – 1
end do
www.caps-entreprise.com
1/31/2014
172
Asynchronous Transfers
Use offload_transfer directives instead, for example in C
343
char sig1, sig2, sig3;
float *p = (float*)malloc(N*sizeof(float));
#pragma offload_transfer in(p:length(N)) signal(&sig1)
host_activity();
#pragma offload wait(sig1) signal(&sig2)
{
foo(N, p);
}
#pragma offload_transfer wait(sig2) out(p:length(N)) signal(&sig3)
host_activity();
#pragma offload_wait(&sig3)
www.caps-entreprise.com
Compile and Execute Intel Offload Applications
Source Intel Compiler environment
Compile with -openmp
o To ignore Intel Offload directives use the -no-offload flag
o To display the offload optimizer report use -opt-report-phase=offload
Execute from the host
To retrieve basic profiling information
To retrieve timing information
344
user@host $ source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64
user@host $ icc -openmp myProgram.c -o myApp.mic
user@host $ ./myApp.mic
user@host $ export H_TIME=1
user@host $ export H_TRACE=1
www.caps-entreprise.com
1/31/2014
173
Directive Standard
Directive-based Programming (1)
Three ways of programming GPGPU applications:
www.caps-entreprise.com 346
o Libraries: ready-to-use acceleration
o Directives: quickly accelerate existing applications
o Programming languages: maximum performance
1/31/2014
174
Advantages of Directive-based Programming
Simple and fast development of accelerated applications
Non-intrusive
Helps to keep a unique version of code o To preserve code assets
o To reduce maintenance cost
o To be portable on several accelerators
Incremental approach
Enables "portable" performance
www.caps-entreprise.com 347
OpenACC
1/31/2014
175
Various Many-core Paths
www.caps-entreprise.com 349
• Large number of small cores • Data parallelism is key • PCIe to CPU connection
AMD Discrete GPU
AMD APU
• Integrated CPU+GPU cores • Target power efficient
devices at this stage • Shared memory system with
partitions
INTEL Many Integrated Cores
• 50+ number of x86 cores • Support conventional programming • Vectorization is key • Run as an accelerator or standalone
NVIDIA GPU
• Large number of small cores • Data parallelism is key • Support nested and dynamic
parallelism • PCIe to host CPU or low
power ARM CPU (CARMA)
OpenACC Initiative
www.caps-entreprise.com 350
Launched by CAPS, Cray, Nvidia and PGI in november 2011 o Allinea, Georgia Tech, U. Houston, ORNL, Rogue
Wave, Sandia NL, Swiss National Computing Center, TUD joined in 2012
Open Standard
A directive-based approach for programming heterogeneous many-core hardware for C/C++ and FORTRAN applications
Specification version 2.0 (June 2013)
http://www.openacc-standard.com
1/31/2014
176
OpenACC Compilers (1)
CAPS Compilers:
Source-to-source
compilers
Support Intel Xeon Phi,
NVIDIA GPUs, AMD
GPUs and APUs
PGI Accelerator
Extension of x86 PGI
compiler
Support Intel Xeon Phi,
NVIDIA GPUs, AMD
GPUs and APUs
Intel MIC Programming 351
Cray Compiler:
Provided with Cray systems only
CAPS Compilers (2)
Take the original application as input and generate another
application source code as output
o Automatically turn the OpenACC source code into accelerator-
specific source code (CUDA, OpenCL)
Compile the entire hybrid application
Just prefix the original compilation line with capsmc to
produce a hybrid application
352
$ capsmc gcc myprogram.c
$ capsmc gfortran myprogram.f90
Intel MIC Programming
1/31/2014
177
Compilation Paths
CAPS Compilers drives all compilation passes
Host application compilation o Calls traditional CPU
compilers o CAPS Runtime is linked
to the host part of the application
Device code production o According to the
specified target o A dynamic library is
built
www.caps-entreprise.com 353
[Figure: the C, C++ and Fortran frontends extract codelets from the host code; CUDA or OpenCL code generation plus the CUDA/OpenCL compilers produce the HWA code as a dynamic library, while the host code is compiled by the CPU compiler (gcc, ifort, …) and linked with the CAPS runtime into the executable (mybin.exe)]
CAPS Compilers Options
Usage:
To specify accelerator-specific code
To display the compilation process
354
$ capsmc --openacc-target OPENCL -d -c gcc myprogram.c
$ capsmc --openacc-target CUDA gcc myprogram.c #(default)
$ capsmc --openacc-target OPENCL gcc myprogram.c #(for Xeon Phi)
$ capsmc [CAPSMC_FLAGS] <host_compiler> [HOST_COMPILER_FLAGS] <source_files>
Intel MIC Programming
1/31/2014
178
Programming Model
Express data and computations to be executed on an accelerator
o Using marked code regions
Main OpenACC constructs
o Parallel and kernel regions
o Parallel loops
o Data regions
o Runtime API
355
[Figure: the CPU and the HWA are linked by a PCIx bus; data/stream/vector parallelism in the marked regions is exploited by the HWA, e.g. through CUDA or OpenCL]
Execution Model
Among a bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators:
o Parallel regions
o Kernels regions
The host is responsible for:
o Allocating memory space on the accelerator
o Initiating data transfers
o Launching computations
o Waiting for completion
o Deallocating memory space
Accelerators execute parallel regions:
o Use work-sharing directives
o Specify the level of parallelization
356 Intel MIC Programming
1/31/2014
179
OpenACC Execution Model
Host-controlled execution
Based on three parallelism levels
o Gangs – coarse grain
o Workers – fine grain
o Vectors – finest grain
357
[Figure: a device runs several gangs; each gang contains workers, and each worker executes vector operations]
Intel MIC Programming
Gangs, Workers, Vectors
In CAPS Compilers, gangs, workers and vectors correspond
to the following in an OpenCL grid
Beware: this implementation is compiler-dependent
358
numGroups(1) = 1
numGroups(0) = number of gangs
localSize(1) = number of workers
localSize(0) = number of vectors
Intel MIC Programming
1/31/2014
180
Directive Syntax
C
#pragma acc directive-name [clause [, clause] …]
{
code to offload
}
Fortran
!$acc directive-name [clause [, clause] …]
code to offload
!$acc end directive-name
359
Intel MIC Programming
Parallel Construct
Starts parallel execution on the accelerator
Creates gangs and workers
The number of gangs and workers remains constant for the parallel region
One worker in each gang begins executing the code in the region
www.caps-entreprise.com 360
#pragma acc parallel […]
{
…
for(i=0; i < n; i++) {
for(j=0; j < n; j++) {
…
}
}
…
}
Code executed on the hardware
accelerator
1/31/2014
181
Gangs, Workers, Vectors in Parallel Constructs
In parallel constructs, the number of gangs, workers and vectors is the same for the entire section
The clauses: o num_gangs o num_workers o vector_length
Enable to specify the number of gangs, workers and vectors in the corresponding parallel section
www.caps-entreprise.com 361
#pragma acc parallel, num_gangs(128) \
num_workers(256)
{
…
for(i=0; i < n; i++) {
for(j=0; j < m; j++) {
…
}
}
…
}
Loop Constructs
A loop directive applies to the loop that immediately follows the directive
The parallelism to use is described by one of the following clause:
o Gang for coarse-grain parallelism
o Worker for middle-grain parallelism
o Vector for fine-grain parallelism
www.caps-entreprise.com 362
1/31/2014
182
Loop Directive Example
With gang, worker or vector clauses, the iterations of the following loop are executed in parallel
Gang, worker or vector clauses enable to distribute the iterations between the available gangs, workers or vectors
www.caps-entreprise.com 363
#pragma acc parallel, num_gangs(128) \
num_workers(192) \
vector_length(32)
{
…
#pragma acc loop gang
for(i=0; i < n; i++) {
#pragma acc loop worker
for(j=0; j < m; j++) {
#pragma acc loop vector
for(k=0; k < l; k++) {
…
}
}
}
…
}
(Diagram: the i loop is distributed over the 128 gangs, the j loop over the 192 workers of each gang, and the k loop over vectors of length 32.)
Kernels Construct
Defines a region of code to be compiled into a sequence of accelerator kernels
o Typically, each loop nest will become a distinct kernel
The number of gangs and workers can be different for each kernel
#pragma acc kernels […]
{
for(i=0; i < n; i++) {
…
}
…
for(j=0; j < n; j++) {
…
}
}
!$acc kernels […]
DO i=1,n
…
END DO
…
DO j=1,n
…
END DO
!$acc end kernels
1st Kernel
2nd Kernel
Gang, Worker, Vector in Kernels Constructs
The parallelism description is the same as in parallel sections
However, these clauses accept an argument to specify the number of gangs, workers or vectors to use
Every loop can have a different number of gangs, workers or vectors in the same kernels region
#pragma acc kernels
{
…
#pragma acc loop gang(128)
for(i=0; i < n; i++) {
…
}
…
#pragma acc loop gang(64)
for(j=0; j < m; j++) {
…
}
}
(Diagram: the first loop is distributed over 128 gangs, the second loop over 64 gangs.)
Data Independency
In kernels sections, the independent clause on a loop directive specifies that the iterations of the loop are data-independent
The user does not have to think about gangs, workers or vector parameters
It allows the compiler to generate code to execute the iterations in parallel with no synchronization
A[0] = 0;
#pragma acc loop independent
for(i=1; i<n; i++)
{
A[i] = A[i]-1;
}
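By contrast, the hypothetical loop below must not be declared independent, because each iteration reads the value written by the previous one:

for(i=1; i<n; i++)
{
    B[i] = B[i-1] + 1;   /* loop-carried dependence on B[i-1] */
}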
What is the problem with discrete accelerators?
PCIe transfers have huge latencies
In kernels and parallel regions, data are implicitly managed
o Data are automatically transferred to and from the device
o This implies possibly unnecessary transfers
Avoiding transfers leads to better performance
OpenACC offers a solution to control transfers
Device Memory Reuse
In this example:
o A and B are allocated and transferred for the first kernels region
o A and C are allocated and transferred for the second kernels region
How to reuse A between the two kernels regions?
o And save transfer and allocation time
float A[n];
#pragma acc kernels
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
Memory Allocations
Avoid data reallocation using the create clause
o It declares variables, arrays or subarrays to be allocated in the device memory
o Data specified in this clause are not copied between host and device
The scope of such a clause corresponds to a data region
Kernels and Parallel regions implicitly define data regions
The present clause declares data that are already present on the device
Create and Present Clause Example
float A[n];
#pragma acc data create(A)
{
#pragma acc kernels present(A)
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels present(A)
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
}
(Annotations: A of size n is allocated on the device at data region entry and deallocated at region exit; both kernels regions reuse the A already allocated on the device.)
Data Storage: Mirroring
How are data stored in a data region?
A data construct defines a section of code where data are mirrored between host and device
Mirroring duplicates a CPU memory block into the HWA memory
o Users ensure consistency of the copies via directives
(Diagram: the master copy lives in host memory; a CAPS runtime descriptor links it to the mirror copy in HWA memory.)
Arrays and Subarrays
In C and C++, arrays are specified with start and length
o For example, with an array of size n
In FORTRAN, arrays are specified with a list of range specifications
o For example, with an array a of size (n,m)
In any language, any array or subarray must be a contiguous block of memory
#pragma acc data create(a[0:n])
!$acc data create(a(1:n,1:m))
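A small sketch (hypothetical array a of size n, not from the course material): in C the subarray is written start:length, so the n-2 interior elements starting at index 1 are specified as a[1:n-2]:

#pragma acc data copy(a[1:n-2])        /* only the interior of a is transferred */
{
    #pragma acc kernels present(a[1:n-2])
    for (int i = 1; i < n - 1; i++)
        a[i] = 2.0f * a[i];
}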
Transfers: Copyin Clause
Declares data that need only to be copied from the host to the device when entering the data section
o Performs input transfers only
It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
#pragma acc kernels present(A[:n])
{
for(i=0; i < n; i++) {
C[i] = A[i] * alpha;
}
}
}
Transfers: Copyout Clause
Declares data that need only to be copied from the device to the host when exiting data section
o Performs output transfers only
It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
#pragma acc kernels present(A[:n]) \
copyout(C[:n])
{
for(i=0; i < n; i++) {
C[i] = A[i] * alpha;
}
}
}
Transfers: Copy Clause
If we change the example, how do we express that both input and output transfers of C are required?
Use the copy clause to:
o Declare data that need to be copied from the host to the device when entering the data section
o Copy values assigned on the device back to the host when exiting the data section
o Allocate scalars, arrays and subarrays on the device memory for the duration of the data region
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels present(A[:n]) \
copy(C[:n])
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
}
Present_or_create Clause
Combines two behaviors
Declares data that may be present
o If data is already present, use value in the device memory
o If not, allocate data on device when entering region and deallocate when exiting
May be shortened to pcreate
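A minimal sketch (hypothetical routine and array names): tmp is used as device scratch storage; if an enclosing data region has already created it, it is reused, otherwise it is allocated just for this region:

void step(int n, float *a, float *tmp)
{
    #pragma acc kernels pcreate(tmp[0:n]) copy(a[0:n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            tmp[i] = 2.0f * a[i];

        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            a[i] = tmp[i] + 1.0f;
    }
}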
Present_or_copyin/copyout/copy Clauses
If data is already present, use the value in the device memory
If not:
o present_or_copyin / present_or_copyout / present_or_copy allocate memory on the device at region entry
o present_or_copyin / present_or_copy transfer data from the host at region entry
o present_or_copyout / present_or_copy transfer data from the device to the host at region exit
o present_or_copyin / present_or_copyout / present_or_copy deallocate the memory at region exit
May be shortened to pcopyin, pcopyout and pcopy
Present_or_* Clauses Example
program main
…
!$acc data create(A(1:n))
call f1( n, A, B )
…
!$acc end data
…
call f1( n, A, C )
…
contains
subroutine f1( n, A, B )
…
!$acc kernels pcopyout(A(1:n)) &
!$acc copyin(B(1:n))
do i=1,n
A(i) = B(n – i)
end do
!$acc end kernels
end subroutine f1
…
end program main
(Annotations: the data region allocates A of size n on the device and deallocates it at region exit. For the first call to f1, inside the data region, A is reused on the device while B is allocated for the duration of the subroutine with an input transfer. For the second call, outside the data region, A and B are both allocated for the duration of the subroutine, with an input transfer of B and an output transfer of A.)
Present_or_* clauses are generally safer
Default Behavior
CAPS Compilers are able to detect the variables required on the device for kernels and parallel constructs.
According to the specification, variables follow these default policies depending on their type:
o Arrays: present_or_copy behavior
o Scalars:
• private behavior if the variable is not live-in or live-out
• copy behavior otherwise
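A small sketch of these defaults (assumed variable names; no data clause is written, so the compiler applies the policies above):

#define N 1024
float a[N];

void scale(float alpha)
{
    #pragma acc kernels          /* a: array, present_or_copy behavior        */
    {                            /* alpha and the loop index: scalar behavior */
        for (int i = 0; i < N; i++)
            a[i] *= alpha;
    }
}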
OpenACC 2.0: New Features (1)
Atomic operations:
o Different kinds of atomic sections can be executed in parallel/kernels constructs
• Read, write, update, capture

#pragma acc atomic read
v = x;
#pragma acc atomic write
x = 42;
#pragma acc atomic update
x++;
#pragma acc atomic capture
v = x++;

Routine directive:
o Function invocation from kernels/parallel constructs
o Functions can be executed from host or device

#pragma acc routine (myfunc)
void myfunc( … ) { … }

int main() {
  #pragma acc kernels
  {
    myfunc( … );
  }
  …
  myfunc( … );
}
OpenACC 2.0: New Features (2)
Enter/Exit data directives
o Enable scope-free data management

void init(int** array, int size) {
  *array = malloc(sizeof(int) * size);
  #pragma acc enter data create((*array)[0:size])
}

int main() {
  int *array;
  …
  init(&array, size);
  …
  #pragma acc exit data delete(array[0:size])
  …
}

Device_type clause
o Enables architecture-specific optimizations

#pragma acc kernels device_type(nvidia)  num_gangs(64) \
                    device_type(xeonphi) num_gangs(128)
{
  #pragma acc loop gang
  for (int i = 0; i < size; i++) {
    …
  }
}
OpenMP 4.0
Intel Offload / OpenMP 4.0 / OpenACC 1.0
Directive sets: Intel Offload / OpenMP 4.0 (to be released) / OpenACC 1.0

Offloading computations
o Intel Offload: #pragma offload / !dir$ offload
o OpenMP 4.0: #pragma omp target / !$omp target
o OpenACC 1.0: #pragma acc kernels, #pragma acc parallel / !$acc kernels, !$acc parallel

Work sharing
o Intel Offload: #pragma omp parallel / !$omp parallel
o OpenMP 4.0: #pragma omp parallel, #pragma omp teams, #pragma omp distribute / !$omp parallel, !$omp teams, !$omp distribute
o OpenACC 1.0: #pragma acc loop [gang/worker/vector] / !$acc loop [gang/worker/vector]

Data clauses
o Intel Offload: in, out, inout, alloc_if, free_if
o OpenMP 4.0: map(to), map(from), map(tofrom), map(alloc)
o OpenACC 1.0: copyin, copyout, copy, create, present, pcopyin, pcopyout, pcopy

Data regions
o Intel Offload: -
o OpenMP 4.0: #pragma omp target data / !$omp target data
o OpenACC 1.0: #pragma acc data / !$acc data

Transfer directives
o Intel Offload: #pragma offload_transfer / !dir$ offload_transfer
o OpenMP 4.0: #pragma omp target update / !$omp target update
o OpenACC 1.0: #pragma acc update / !$acc update
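To make the correspondence concrete, here is the same hypothetical saxpy loop written with each directive set (a syntax sketch only, not from the course material; compilers, flags and supported hardware differ between the three models):

void saxpy_intel_offload(int n, float a, float *x, float *y)
{
    #pragma offload target(mic) in(x:length(n)) inout(y:length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void saxpy_openmp4(int n, float a, float *x, float *y)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void saxpy_openacc(int n, float a, float *x, float *y)
{
    #pragma acc kernels loop independent copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}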
CAPEX / OPEX
with GPU
Goals – Why Use GPUs?
Performance
Energy saving
Cheaper machine
Preparing code for manycore
Is the Machine Cheaper?
You may want
o To run faster than your competitor
o To run faster than the streamed data arrive
o To run faster in order to use less energy to compute
o To run differently to save energy
OPEX or CAPEX?
o Capital Expenditures
• Machine cost, software migration, floor-space cost
o Operational Expenditures
• Energy consumption, hardware and software maintenance
CAPEX-OPEX Analysis for a Heterogeneous System
Capital Expenses (CapEx)
o System acquisition cost
o Software migration cost
o Software acquisition cost
o Training cost
o Real estate cost
Operational Expenses (OpEx)
o Energy cost (system + cooling)
o Maintenance cost
For a given amount of compute work, the CapEx-OpEx analysis indicates the “real” value of a given system
o For instance, if I add GPUs, do I save money?
o And how many should I add?
o Should I then use slower CPUs?
Application Speedup and CapEx-OpEx
Adding GPUs/accelerators to the system
o Increases system cost
o Increases base energy consumption (one idle GPU draws on the order of tens of watts)
Exploiting the GPUs/accelerators
o Decreases execution time, and so potentially the energy consumption for a given amount of work
o Reduces the number of nodes of the architecture
• Threshold effect on the number of routers etc.
o Requires migrating the code
Multiple views of the value of application speedup
o Shorter time-to-market
• Threshold effect
o More work performed during the lifetime of the system
CapEx Hardware Parameters
The hardware configuration can be:
o Fast CPU + Fast GPU (expensive node)
o Slow CPU + Fast GPU
o Fast CPU + Slow GPU
o Slow CPU + Slow GPU
o Fast CPU
o Slow CPU
Node performance impacts the number of nodes
o More nodes mean more network, with non-negligible cost and energy consumption
o Fewer nodes may limit scalability issues, if any
Application workload analysis is the only way to decide
o Optimizing software can significantly increase performance and so reduce the needed hardware
o Code migration to GPU is on the critical path
Small systems: a few nodes (1-8), cost ~x10 k€; large systems: many nodes (x100), cost ~x1 M€
CapEx: Code Migration Cost
Migration cost
o Learning cost
o Software environment cost
o Porting cost
Migration cost is mostly independent of the hardware size
o Not an issue for dedicated large systems
o Different if the machine aims at serving a large community
The main migration benefit is to expose manycore parallelism
o Not specific to one kind of device
o The implementation is specific
Vendor-specific implementation solutions
o Amortization period similar to that of the hardware (~3 years)
Hardware-agnostic parallelism expression
o A portable solution used across multiple hardware generations (amortized over ~10 years)
o Of course not that simple! Still requires some level of tuning
May be very useful for non-scalable message-passing code
Mastering the cost of migration has a significant impact on the total cost for small systems. Typical effort: a few man-months of manpower, cost ~x10 k€.
Two Application Examples
Application 1
• Field: Monte Carlo simulation for thermal radiation
• MPI code
• Migration cost: 1 man-month
Application 2
• Field: astrophysics, hydrodynamics
• MPI code
• Requires 3 GPUs per node to have enough memory space
• Migration cost: 2 man-months
Power Consumption Application 1
(Chart: CPU and GPU energy consumption for Application 1, relative to a baseline of 0.)
Power usage effectiveness (PUE) = Total facility power / IT equipment power
Current: 1.9, best practice: 1.3
Src: http://www.google.com/corporate/datacenter/efficiency-measurements.html
Power Consumption Application 2
CAPEX-OPEX Overview
Comparison on an equivalent workload
o CAPEX = System costs + Migration costs
o OPEX = Energy costs + Computer maintenance costs (10% of computer costs)
o The last column sums the three cost columns, e.g. for Application 1 on 4 nodes: 1.87 € + 0.19 € + 0.37 € = 2.43 €

Configuration: execution time (s) / system costs / maintenance costs / energy costs / CAPEX+OPEX

Application 1 (migration cost = 1 man-month)
o 4 nodes:            6862 s   1.87 €   0.19 €   0.37 €    2.43 €
o 4 nodes + 4 GPUs:   1744 s   0.71 €   0.07 €   0.12 €    0.90 €
o 4 nodes + 8 GPUs:   1000 s   0.51 €   0.05 €   0.08 €    0.64 €
o 4 nodes + 12 GPUs:   731 s   0.45 €   0.04 €   0.08 €    0.57 €

Application 2 (migration cost = 2 man-months)
o 4 nodes:                          713 s          0.19 €   0.02 €   0.025 €   0.239 €
o 4 nodes + 12 GPUs:                485 s          0.30 €   0.03 €   0.034 €   0.358 €
o 4 nodes (slow clock) + 12 GPUs:   500 s (est.)   0.24 €   0.02 €   0.034 €   0.302 €
Cost per Run
(Charts: cost per run for Application 1, with configurations no GPU / 1 GPU per node / 2 GPUs per node / 3 GPUs per node, and for Application 2, with configurations no GPU / 3 GPUs per node / 3 GPUs per node (slow clock). Each bar is broken down into migration costs (4 nodes), energy costs (power + cooling), maintenance costs (10% of computer costs) and system costs.)
Porting Methodology
Methodology to port applications
Migration process
Code analysis and definition step
o This step performs a diagnosis of the application, specifies the main porting operations, and makes a rough estimation of the potential speedup as well as the porting cost
• This step is concluded by a Go / No-Go analysis
First port of the application
o This step implements a first GPGPU version of the code
o With this version, a GPGPU execution profile can be obtained and bottlenecks can be identified
Fine-tuning of the code and setup for production
o This is the last step of the migration process; it aims at getting a well-optimized production code
Phase 1 (details)
Step 1: code analysis
Hotspot identification
o Profile the application
o Identify hotspots to convert to GPU kernels
o Code restructuring may be needed to get parallel hotspots that are computation-intensive enough
CPU analysis
o Check the efficiency of the application on the CPU
Parallelism discovery
o Ensure kernels can be executed in parallel
o If not, reconsider the algorithms
Step 2: Getting a Hybrid GPGPU Program
Converts hotspots to GPU kernels
o Using HMPP codelets
Helps to identify GPGPU issues
Also helps to validate the parallel implementation
o By checking the difference between CPU and GPU results
Phase 2 (details)
Step 3: Optimizing the hybrid program
Optimize the GPGPU kernels with code transformations
o Loop unrolling & jamming
o Increase the grid dimension
o Distribute loops
o Fuse loops
o …
Proceed incrementally
o One transformation at a time
It is easier to be rigorous than to find bugs (see the loop-fusion sketch below)
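For instance, loop fusion on a hypothetical kernels region (a sketch, not from the course material): two loop nests, hence two generated kernels and two passes over a, become one fused kernel:

/* Before: two kernels are generated, a[] is traversed twice */
#pragma acc kernels present(a[0:n], b[0:n])
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * a[i];
    for (int i = 0; i < n; i++)
        b[i] = a[i] + 1.0f;
}

/* After fusion: a single kernel, a single pass over a[] */
#pragma acc kernels present(a[0:n], b[0:n])
{
    #pragma acc loop independent
    for (int i = 0; i < n; i++) {
        a[i] = 2.0f * a[i];
        b[i] = a[i] + 1.0f;
    }
}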
How to Avoid Over-Spending Manpower?
Use version control
Incrementally port the code
Use checkpoint-restart techniques
Do not delay integration
Stick to the plan
Check everything is available
Report every hard issue
Reconsider basic assumptions
Do not minimize the workload
An Example Checklist
Do you know what the target machine is?
Do you have access to all the necessary codes and libraries?
Do you know how to run the codes?
Are there input data sets available?
Do you have an execution profile that is representative for performance measurements?
Do you have access to the target architecture?
Do you know how to check the results?
Are you allowed to change floating point rounding?
Do you have somebody to ask questions about the application?
Do you need an application domain consultant?
Have you agreed with the end-user on the performance gain to achieve?
Are debugging tools installed on the target machines?
Did you check the drivers/libraries/OS versions on the target machines?
Has the application code already been running on the target machine?
Checking the Results
The validation of output data is essential in a porting process, and often neglected
It ensures that the transformations do not affect the application’s global behavior
o Considering the application as a black box, requiring exactly the same output data for a given input data set is not always a good criterion
• Since bit-for-bit repeatability is sometimes impossible to preserve (e.g. when floating-point rounding changes)
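A possible validation helper (hypothetical function; names and tolerance are assumptions): compare the reference CPU results against the GPU results with a relative tolerance rather than expecting bit-identical output:

#include <math.h>
#include <stdio.h>

int check_results(int n, const float *ref, const float *gpu, float tol)
{
    int errors = 0;
    for (int i = 0; i < n; i++) {
        float denom = (fabsf(ref[i]) > 1e-6f) ? fabsf(ref[i]) : 1.0f;
        if (fabsf(ref[i] - gpu[i]) / denom > tol) {
            if (errors < 10)                    /* report the first few mismatches */
                printf("mismatch at %d: ref=%g gpu=%g\n", i, ref[i], gpu[i]);
            errors++;
        }
    }
    return errors;                              /* 0 means the port is accepted */
}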
Hardware
What is the targeted hardware?
o AMD, Nvidia, Xeon Phi?
Do you intend to use multiple targets?
o For example multiple GPUs
Is the target hardware fixed forever?
o Is it expected to change in the coming years?
Software
May I change the compiler?
Is the software target OpenCL or CUDA?
o Is the software target hardware agnostic?
Shall I use OpenACC?
o Can the native CPU version of the application be altered?
(Closing slide keywords: accelerator programming models, directive-based programming, parallel computing, OpenHMPP, OpenACC, GPGPU, many-core programming, parallelization, HPC, OpenCL, code speedup, NVIDIA CUDA, High Performance Computing, CAPS Compilers, CAPS Workbench, portability, performance.)
Visit CAPS Website: www.caps-entreprise.com