Hardware and Software for Parallel
Computing
Florent Lebeau, CAPS entreprise
UMPC - janvier 2014
Agenda
Day 1
The Need for Parallel Computing
Introduction to Parallel Computing
o Different Levels of Parallelism
History of Supercomputers
o Hardware Accelerators
Multiprocessing
o Fork/join
o MPI
Multithreading
o Pthread
o TBB
o Cilk
o OpenMP
Day 2
Hardware Accelerators
o GPU
• CUDA
• OpenCL
o Xeon Phi
• Intel Offload
o Directive Standards
• OpenACC
• OpenMP 4.0
CapEx / OpEx with GPU
Porting Methodology
www.caps-entreprise.com 2
The Need for Parallel
Computing
The Demand (1)
Compute for research o Simulate complex physical phenomena
o Validate a mathematical model
Compute for industry o Quality control by image processing
o DNA sequence alignment
o Oil & gas prospection
o Meteorology
Compute for the masses o Playing a 1080p DVD in real time
o Compressing / uncompressing streams
o Image processing
www.caps-entreprise.com 4
The Demand (2)
Computing needs o Data
o An operation
The computing cost (time) may be
o Qcomp = Qdata * Qop
o And even worse, sometimes
To reduce the computation time, one can
o Reduce the amount of data
o Reduce the amount of operations
o Increase the computation speed
www.caps-entreprise.com 5
The Demand (3)
The amount of computations keeps increasing
o Game and screen resolutions keep improving
o Longer weather prediction
o More accurate weather prediction
• Which increases the amount of data
Amount of data grows each year
But the time available to exploit these data stays the same
o 24h for another day of weather prediction
o 1/50s for a video stream image
www.caps-entreprise.com 6
The Demand (4)
So we need to increase computations per second
At a lower cost o To purchase
o To develop
o To maintain
Technologically sustainable o Easy adaptation to next architecture
According to the company strategy o Green computing, industrial partnerships with providers…
www.caps-entreprise.com 7
The Solution (1)
By the end of the 20th century, most applications were written for CPUs (mainly x86) o The regular frequency increase of
microprocessors ensured performance gains without code modification
o The effort rested on hardware vendors, not on developers
Today the frequency increase has stalled o Because of thermal dissipation and
power leakage
o Because of component distances and die surface
Computing faster sequentially is less and less feasible o But we can compute more things
simultaneously (in parallel)
www.caps-entreprise.com 8
The Solution (2)
Advanced optimizations of code to get the best of cutting-edge CPU functionalities o Vectorization
Code parallelization o Multi-threading
o Parallel computing requires parallel codes
Porting onto specialized architectures o FPGA, cluster, GPU…
o Not only the developer’s choice, because it may imply long-term hardware investments
www.caps-entreprise.com 9
Introduction to Parallel
Computing
Flynn’s Taxonomy
Classification of computer architectures
o Established by Michael J. Flynn in 1966
4 categories based on the data and instruction flows
o SISD
o SIMD
o MISD
o MIMD
• With shared memory (CPU cores)
• With distributed memory (clusters)
www.caps-entreprise.com 11
Flynn’s Taxonomy : SISD
SISD : Single Instruction Single Data
www.caps-entreprise.com 12
[Diagram: one control unit drives one processor working on one memory]
Flynn’s Taxonomy : SIMD
SIMD : Single Instruction Multiple Data
www.caps-entreprise.com 13
[Diagram: one control unit drives processors 0 to 2, all working on a single memory]
Flynn’s Taxonomy : MISD
MISD : Multiple Instruction Single Data
www.caps-entreprise.com 14
[Diagram: control units 0 to 2 each drive a processor, all working on the same data in a single memory]
Flynn’s Taxonomy : MIMD
MIMD : Multiple Instruction Multiple Data
www.caps-entreprise.com 15
[Diagram: control units 0 to 2 each drive a processor with its own memory (memories 0 to 2)]
Distributed Memory Architectures
Processors only see their own memory
They communicate explicitly by message-passing if needed
A processor cannot access the memory of another
So the data distribution must be chosen to avoid communications
www.caps-entreprise.com 16
[Diagram: four processors, each with its own memory, connected by a network]
Shared Memory Architectures (1)
A unique address space is provided by the hardware
If caches are present, cache consistency is maintained by hardware
www.caps-entreprise.com 17
[Diagram: four processors connected to a shared memory through a network]
Shared Memory Architectures (2)
Global memory space, accessible by all processors
Processors may have local memory (i.e. cache) to hold copies of some global memory
Consistency of these copies is usually maintained by hardware o Referred to as Cache-Coherent
User-friendly
The programmer is in charge of correct synchronization between processes/threads
Suffers from a lack of scalability
www.caps-entreprise.com 18
UMA : Uniform Memory Access
Most commonly represented today by SMP machines
Identical processors
Equal access and access times to memory
Sometimes called CC-UMA - Cache Coherent UMA
www.caps-entreprise.com 19
NUMA : Non-Uniform Memory Access
Often made by physically linking two or more SMPs
One SMP can directly access memory of another SMP
Memory access across link is slower
www.caps-entreprise.com 20
Amdahl’s Law
Amdahl’s law states that the sequential part of the execution limits the potential benefit from parallelization o The execution time TP using P cores is given by:
• where seq (in [0,1]) is the fraction of the execution that is inherently
sequential
Consequences of this law o Potential performance dominated by sequential parts of the
application
www.caps-entreprise.com 21
TP = seq * T1 + (1 - seq) * T1 / P
What is a Hotspot?
A small part of code
o Most of the execution time spent in it
o Mostly loops that concentrate computation
Identified using performance profilers
Also known as kernels or compute intensive kernels
o But sometimes a hotspot can be implemented as several kernels
www.caps-entreprise.com 22
Speedup
Speedup
o Ratio between the execution time on 1 process and on P processes
Efficiency
o Ratio between the speedup and the number of cores used for the
parallel version
o A parallel application is scalable when efficiency stays close to 1 as the
number of cores increases
www.caps-entreprise.com 23
SP = T1 / TP
EP = T1 / (P * TP)
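Illustration (not from the original slides): a minimal C sketch computing the speedup, the efficiency and the Amdahl bound from the formulas above; the measured times and the sequential fraction are hypothetical values.
#include <stdio.h>
int main(void)
{
    double t1  = 100.0;  /* measured sequential time (hypothetical) */
    double tp  = 30.0;   /* measured time on P cores (hypothetical) */
    int    p   = 4;      /* number of cores */
    double seq = 0.1;    /* inherently sequential fraction, in [0,1] */
    double speedup    = t1 / tp;        /* SP = T1 / TP */
    double efficiency = t1 / (p * tp);  /* EP = T1 / (P * TP) */
    /* Amdahl: TP = seq*T1 + (1-seq)*T1/P, so the best possible speedup is */
    double amdahl_sp  = 1.0 / (seq + (1.0 - seq) / p);
    printf("speedup=%.2f efficiency=%.2f Amdahl bound=%.2f\n",
           speedup, efficiency, amdahl_sp);
    return 0;
}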
Amdahl’s law
www.caps-entreprise.com 24
Speedup
Speedup
o Ratio between the original time and the optimized time
www.caps-entreprise.com 25
Sp = Tseq / Tp
Gustafson’s Law
States that increasing the amount of data increases the
parallelism potential of the application
o The more computations you have, the more computations you may
overlap
A parallel architecture needs to be well exploited to get good
performance
o The more parallel computation you send to it, the better performance you get
www.caps-entreprise.com 26
Scalability
Scalability gives us an idea about the system behaviour
when the number of processors is increased.
Applications can often exploit large parallel machines by
scaling the problems to larger instances
To improve the scalability :
o Increase the parallelism of the application
www.caps-entreprise.com 27
Load Balancing
It is the capacity of the application to balance the amount of work between Processing Elements
It can be statically or
dynamically adapted
www.caps-entreprise.com 28
Granularity
Granularity refers to the amount of computation relative to
communication
Larger parallel tasks usually provide better speedups
o Reduce startup overhead
o Reduce communication and synchronization overhead
Smaller granularity can be exploited on strongly coupled
architectures, such as multicores
o Can require deep rewriting of the application
www.caps-entreprise.com 29
Different Levels of Parallelism
Different Levels of Parallelism
ALU o Vectorization (MMX / 3DNow!, SSE)
Instruction Pipeline o Instruction Level Parallelism (ILP)
Mono/multicore o Dual-core (2005 with the AMD Opteron)
o Quad-core, etc.
o “Duplication of the processors”
Mono/multisocket o Intel Xeon bi-socket
www.caps-entreprise.com 32
Scalar Processors
One data is computed at a time
o The architecture is designed to perform one instruction on one data
per clock cycle
o Contrary to vector (or superscalar) architectures
One data = a value or scalar variable
o i.e. a value or a variable of a given type
o As opposed to a composite data:
• A vector
• A character string in certain programming languages
Ex : Intel 80486
www.caps-entreprise.com 33
Superscalar Processors
Able to perform many instructions simultaneously o Each in a different pipeline
Ex : Intel P5 (1993)
www.caps-entreprise.com 34
Vector Processors
Their architecture is based on pipelines
o A vector instruction executes the same operation on all the data of a
vector
Ex : Cray, NEC, IBM, DEC, TI, Apple AltiVec G4 and G5…
www.caps-entreprise.com 35
SIMD Processors
Single Instruction Multiple Data
Ex : MMX, SSE, ARM Neon, SPARC VIS, MIPS…
www.caps-entreprise.com 36
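Illustration (not from the original deck): a minimal C sketch of SIMD vectorization with SSE intrinsics, performing four float additions with a single instruction; the array contents are hypothetical.
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>
int main(void)
{
    float a[4] = {1.f, 2.f, 3.f, 4.f};
    float b[4] = {10.f, 20.f, 30.f, 40.f};
    float c[4];
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into one SSE register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, 4 additions */
    _mm_storeu_ps(c, vc);
    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}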
Today
Most architectures are based on superscalar processors
o And SIMD
Mono-socket for mass market
o Dual-socket or more in clusters or professional workstations
www.caps-entreprise.com 37
History of Supercomputers
Top500.org
Lists the world’s 500 most powerful supercomputers
www.caps-entreprise.com 39
Architecture Types
www.caps-entreprise.com 40
Architecture Types
Single Processor o But a big one
Cluster
UP^n with SAS : o SMP if n is small
o MPP if n is large
(UP^n with SAS)^m : o If n << m and no SAS between the groups : cluster
o Constellation otherwise
www.caps-entreprise.com 41
[Notation diagram: UP ; (UP^n) with SAS ; ((UP^n) with SAS)^m without SAS]
Architecture Types
www.caps-entreprise.com 42
Processor Types
www.caps-entreprise.com 43
Processor Types
www.caps-entreprise.com 44
Number of Processors
www.caps-entreprise.com 45
Interconnect Type
www.caps-entreprise.com 46
Installation Type
www.caps-entreprise.com 47
Clusters
An Example: Nova Cluster
CAPS entreprise, 2009
www.caps-entreprise.com 49
Nova Architecture
Nova is composed of:
o 1 login node (Nova0)
o 3 storage nodes over Lustre (Nova1 to Nova3)
o 20 compute nodes (Nova4 to Nova23)
Each compute node is made up of:
o A dual-socket Intel Nehalem (bi-processor) machine
• Each Nehalem processor is quad-core (4 CPU cores)
o 2 Nvidia Tesla C1060 GPUs
o 24 GB of memory
www.caps-entreprise.com 50
Nova’s Compute Nodes Architecture
www.caps-entreprise.com 51
[Diagram: two Nehalem sockets linked by Intel QPI, each with 12 GB of memory; an Intel S5520 chipset connects two Tesla C1060 GPUs through PCIe 2.0 16x]
Nova Architecture
www.caps-entreprise.com 52
[Diagram: the Internet reaches the login node Nova0 (nova0.caps-entreprise.com), which connects to the file system (Nova1-3) and the compute nodes (Nova4-23)]
Pros
• Less expensive
• Than one multiprocessor server of similar computing power
• In particular SMP
• More flexible
• The size is adapted to the needs and budget, contrary to monolithic
configurations
www.caps-entreprise.com 53
Exploiting clusters
As a distributed system
o That is what they actually are
o Resources can be shared among users, applications …
o More complicated to program
As a virtual SMP
o Kerrighed
o The OS hides the underlying architecture
o Easier to program but less control
• A cluster is NUMA-type. Data distribution is important
www.caps-entreprise.com 54
Exploiting Clusters in Distributed Mode
In distributed mode, a cluster is generally provided with a task
scheduler
o Makes it possible to add more servers
o Makes it possible to handle node failures
o Slurm, SGE, PBS,…
www.caps-entreprise.com 55
$ srun –n 4 ./myProgram.exe
myProgram is running on node 13
myProgram is running on node 14
myProgram is running on node 15
myProgram is running on node 16
$
Connection to the Login Node
Before offloading your computations on Nova’s compute
nodes, you need to login to Nova0
o “Secondary” will automatically connect you to Nova0
Broadcast (1)
MPI_Bcast broadcasts a message from the process with
rank "root" to all other processes of the group
Before a MPI_Bcast :
After a MPI_Bcast :
www.caps-entreprise.com 128
[Diagram: before MPI_Bcast, only the root process holds the value 10; after, processes 1 to 4 all hold 10]
Broadcast (2)
C
o int MPI_Bcast ( void *buff, int count, MPI_Datatype datatype, int root,
MPI_Comm comm )
Fortran :
o Subroutine MPI_BCAST(buff, count, datatype, root, comm)
buff, count and datatype describe the send/receive buffer
in comm
root is the rank of the sender
www.caps-entreprise.com 129
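A minimal sketch (not from the slides) of MPI_Bcast in C; rank 0 is used as the root, matching the diagram above.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 10;              /* only the root knows the value */
    /* after the call, every process holds value == 10 */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}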
Scatter (1)
MPI_Scatter sends data from one task to all other tasks in a
group
Before a MPI_Scatter :
After a MPI_Scatter :
www.caps-entreprise.com 130
[Diagram: before MPI_Scatter, process 1 holds 10 11 12 13; after, processes 1 to 4 hold 10, 11, 12 and 13 respectively]
Scatter (2)
C o int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype,
void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm )
Fortran o Subroutine MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm)
sendbuf, sendcount and sendtype describe the root sender in comm
recvbuf, recvcount and recvtype describe the receivers in comm
www.caps-entreprise.com 131
Gather (1)
MPI_Gather gathers together values from a group of
processes
Before a MPI_Gather :
After a MPI_Gather :
www.caps-entreprise.com 132
[Diagram: before MPI_Gather, processes 1 to 4 hold 10, 11, 12 and 13; after, the root process holds 10 11 12 13]
Gather (2)
C o int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm )
Fortran o Subroutine MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm)
sendbuf, sendcount and sendtype describe the senders in comm
recvbuf, recvcount and recvtype describe the root receiver in comm
www.caps-entreprise.com 133
Reduce (1)
MPI_Reduce reduces values on all processes to a single
value. The example is using the operation MPI_SUM.
Before a MPI_Reduce :
After a MPI_Reduce :
www.caps-entreprise.com 134
[Diagram: before MPI_Reduce, processes 1 to 4 hold 10, 11, 12 and 13; after a sum reduction, the root process holds 46]
Reduce (2)
C o int MPI_Reduce ( void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Fortran o Subroutine MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root,
comm, err)
sendbuf, count and datatype describe the send buffer in comm
recvbuf and root describe the receiver in comm
op specifies the operator of reduction
www.caps-entreprise.com 135
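A minimal sketch (not in the original slides) using MPI_Reduce with the MPI_SUM operator; rank 0 receives the result, as in the diagram of the previous slide.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, size, local, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    local = 10 + rank;   /* e.g. ranks 0..3 contribute 10, 11, 12, 13 */
    /* the sum of all local values ends up on the root (rank 0) */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %d\n", sum);   /* 46 with 4 processes */
    MPI_Finalize();
    return 0;
}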
Reduction Operators
www.caps-entreprise.com 136
Name Meaning
MPI_MAX maximum
MPI_MIN minimum
MPI_SUM sum
MPI_PROD product
MPI_LAND logical and
MPI_BAND bit-wise and
MPI_LOR logical or
MPI_BOR bit-wise or
MPI_LXOR logical xor
MPI_BXOR bit-wise xor
MPI_MAXLOC max value and location
MPI_MINLOC min value and location
Barrier
C
o int MPI_Barrier ( MPI_Comm comm )
Fortran
o Subroutine MPI_BARRIER(comm)
comm is the communicator
Blocks the caller until all group members have called it
www.caps-entreprise.com 137
To go further
There exist different variants of the classical send o Buffered send with user-provided buffering (MPI_Bsend)
o Blocking ready send (MPI_Rsend)
o Blocking synchronous send (MPI_Ssend)
o …
There exist completely collective communications o MPI_Allgather
o MPI_Allscatter
o MPI_Alltoall
o MPI_Allreduce
o …
You can mix OpenMP with MPI
www.caps-entreprise.com 138
Multithreading
www.caps-entreprise.com 139
What is a Thread?
An independent stream of instructions
o That can be scheduled to run as such by the operating system
o To the software developer, the concept of a "procedure" that runs
independently from its main program may best describe a thread
Consider a main program (a.out) that contains a number of
procedures being able to be scheduled to run
simultaneously and/or independently by the operating
system
o That would describe a "multi-threaded" program
www.caps-entreprise.com 140
What is a Thread?
So, in summary, in the UNIX environment a thread: o Exists within a process and uses the process resources
o Has its own independent flow of control as long as its parent process exists and the OS supports it
o Duplicates only the essential resources it needs to be independently schedulable
o May share the process resources with other threads that act equally independently (and dependently)
o Dies if the parent process dies - or something similar
o Is "lightweight" because most of the overhead has already been accomplished through the creation of its process
Because threads within the same process share resources: o Changes made by one thread to shared system resources (such as closing a file)
will be seen by all other threads
o Two pointers having the same value point to the same data
o Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer
www.caps-entreprise.com 141
Pthread
What are Pthreads?
Historically, hardware vendors have implemented their own proprietary versions of threads o These implementations differed substantially from each other making
it difficult for programmers to develop portable threaded applications
In order to take full advantage of the capabilities provided by threads, a standardized programming interface was required o For UNIX systems, this interface has been specified by the IEEE
POSIX 1003.1c standard (1995)
o Implementations adhering to this standard are referred to as POSIX threads, or Pthreads
o Most hardware vendors now offer Pthreads in addition to their proprietary APIs
www.caps-entreprise.com 143
Why Pthreads?
When compared to the cost of creating and managing a
process, a thread can be created with much less operating
system overhead
www.caps-entreprise.com 144
On-node Communications
MPI libraries usually implement on-node task communication via shared
memory, which involves at least one memory copy operation (process to
process)
o For Pthreads there is no intermediate memory copy required because
threads share the same address space within a single process.
o There is no data transfer, per se. It becomes more of a cache-to-CPU or
memory-to-CPU bandwidth (worst case) situation
www.caps-entreprise.com 145
Threads = Shared Memory Model
All threads have access to the same global, shared memory
o Threads also have their own private data
o Programmers are responsible for synchronizing access (protecting)
globally shared data
www.caps-entreprise.com 146
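As an illustration of the synchronization responsibility mentioned above (not from the slides), a minimal Pthreads sketch protecting a shared counter with a mutex; the thread count and iteration count are arbitrary.
#include <pthread.h>
#include <stdio.h>
static long counter = 0;                                  /* globally shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* protect the read-modify-write */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000, deterministic thanks to the mutex */
    return 0;
}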
Thread-safeness
Refers to an application's ability to execute multiple threads simultaneously without damaging shared data or creating race conditions
o The implication to users of external library routines is that if you aren't
100% certain the routine is thread-safe, then you take your chances with problems that could arise
www.caps-entreprise.com 147
Pthread API
For C/C++
From Intel, PathScale, PGI, GNU, IBM
Initially, your main() program comprises a single, default
thread
o All other threads must be explicitly created by the programmer
int main (int argc, char *argv[]) {
   pthread_t threads[NUM_THREADS];
   int rc;
   long t;
   for(t=0; t<NUM_THREADS; t++){
      printf("In main: creating thread %ld\n", t);
      rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
      if (rc){
         printf("ERROR; return code is %d\n", rc);
         exit(-1);
      }
   }
   /* Last thing that main() should do */
   pthread_exit(NULL);
}
Pthread Management Basis
Routines
Joining is one way to accomplish synchronization between
threads
o The pthread_join() subroutine blocks the calling thread until the specified thread terminates
[Table fragment, four GPU generations; column headers missing in the transcript]
Global Memory: Up to 4 GB | Up to 4 GB | Up to 12 GB | Up to 12 GB
Cache on Global Mem: No | No | Yes (L1-L2) | Yes (L1-L2)
Size of L2 Cache: - | - | 768 kB | Up to 1536 kB
Size of L1 Cache per multiprocessor: - | - | 16/48 kB | 16/48 kB
www.caps-entreprise.com 236
Thread Batching:
Grids and Blocks
A kernel is executed as a grid of thread blocks o All threads share data
memory space
A thread block is a batch of threads that can cooperate with each other by: o Synchronizing their execution
• For hazard-free shared memory accesses
o Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate o Atomic operations added in HW 1.1
www.caps-entreprise.com 237
[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Grid 1 contains Blocks (0,0) to (2,1); Block (1,1) contains Threads (0,0) to (4,2)]
Block and Thread IDs
Threads and blocks have IDs o So each thread can
decide what data to work on
o Block ID: 1D, 2D or 3D o Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data o Image processing o Solving PDEs on volumes o …
www.caps-entreprise.com 238
[Diagram: a grid of Blocks (0,0) to (2,1); Block (1,1) expanded into its Threads (0,0) to (4,2), illustrating block and thread IDs]
Block and Thread keywords
Block keywords threadIdx.{x,y,z} defines
the thread index inside the block
blockDim.{x,y,z} defines the block dimensions
Grid keywords
blockIdx.{x,y,z} defines the block index inside the grid
gridDim.{x,y,z} defines the grid dimension
www.caps-entreprise.com 239
[Diagram: a block of threads indexed (x,y,z) with blockDim.x along the first dimension; a grid of blocks with dimensions gridDim.x and gridDim.y]
Memory Space Overview
Each thread can:
o R/W per-thread registers
o R/W per-thread local memory
o R/W per-block shared memory
o R/W per-grid global memory
o Read only per-grid constant memory
o Read only per-grid texture memory
The host can:
o R/W global memory
o R/W constant memory
o R/W texture memory
www.caps-entreprise.com 240
[Diagram: on the device, each thread has registers and local memory, each block has a shared memory, and the grid-wide global, constant and texture memories are also accessible from the host]
Memory Allocation
• cudaMalloc()
o Allocates object in the device Global Memory
o Requires two parameters
• Address of a pointer to the allocated object
• Size of allocated object
• cudaFree()
o Frees object from device Global Memory
• Pointer to freed object
www.caps-entreprise.com 241
CUDA Host-Device Data Transfer
• cudaMemcpy()
o Memory data transfer
o Requires 4 parameters
• Pointer to source
• Pointer to destination
• Number of bytes copied
• Type of transfer
– Host to Host
– Host to Device
– Device to Host
– Device to Device
• Asynchronous variant supported on 1.1+ HW
www.caps-entreprise.com 242
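A minimal sketch (not from the slides) combining cudaMalloc(), cudaMemcpy() and cudaFree(); the array size and names are hypothetical.
#include <cuda_runtime.h>
#include <stdio.h>
int main(void)
{
    const int n = 256;
    size_t size = n * sizeof(float);
    float h_data[256];
    for (int i = 0; i < n; i++) h_data[i] = (float)i;
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, size);                        /* allocate in device global memory */
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice); /* host to device */
    /* ... launch kernels working on d_data here ... */
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); /* device to host */
    cudaFree(d_data);                                         /* free device global memory */
    printf("h_data[10] = %f\n", h_data[10]);
    return 0;
}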
CUDA Function Declarations
__global__ defines a kernel function
o Must return void
www.caps-entreprise.com 243
Executed on the: | Only callable from the:
__device__ float DeviceFunc() | device | device
__global__ void KernelFunc() | device | host
__host__ float HostFunc() | host | host
CUDA Functions Declaration
Address of __device__ functions cannot be taken
For functions executed on the device:
o No recursion (HW < 2.0)
o Recursion possible for __device__ function (HW >= 2.0)
o No static variable declarations inside the function
o No variable number of arguments
www.caps-entreprise.com 244
Calling a Kernel Function
Thread Creation
A kernel function must be called with an execution configuration:
o Any call to a kernel function is asynchronous, explicit synchronization needed for blocking
www.caps-entreprise.com 245
__global__ void KernelFunc(...);
dim3 DimGrid(100, 50); // 5000 thread blocks
dim3 DimBlock(8, 8, 4); // 256 threads per block
…
KernelFunc<<< DimGrid, DimBlock >>>(...);
…
cudaThreadSynchronize();
Compilation
Any source file containing CUDA language extensions must be compiled with nvcc
nvcc is a compiler driver o Works by invoking all the necessary tools and compilers like
cudacc, g++, cl, ...
nvcc can output: o Either C code
• That must then be compiled with the rest of the application using another tool
o Or object code directly
www.caps-entreprise.com 246
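As a hedged illustration (file name and architecture flag are hypothetical), a typical build and run line with nvcc:
$ nvcc -O2 -arch=sm_20 -o myApp myApp.cu
$ ./myApp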
Linking
Any executable with CUDA code requires one dynamic library: o The CUDA runtime library (cudart)
With gcc, you may need to link with the standard C++ library o libstdc++
www.caps-entreprise.com 247
Debugging : CudaGDB
On Linux or Mac OS X
Compile your application with nvcc using the –g and –G options
Execute the debugger with : cuda-gdb
Possible to : o Get device information, gridDim and blockDim
o Break on the host and in the kernel
o Switch between CUDA Threads and host thread
Can be integrated to : o Emacs GUI
o DDD
Another available debugger o Allinea DDT
www.caps-entreprise.com 248
GPU Debugging Pitfalls
But not all illegal program behavior can be caught
Conditions to Debug application on the local machine o Linux
• If single GPU, no Graphical Server running on the system
• 2 GPUs on the machine, 1 running the Graphical Server and the second running the CUDA program
o Windows
• Only possible if there are two GPUs
• 1 for the visualization
• 1 to debug the CUDA application
On a remote machine, no problem
www.caps-entreprise.com 249
Profiler
31/01/2014 www.caps-entreprise.com 250
Parallel NSight
Available on Windows and Linux o Integrated to Microsoft
Visual Studio
o Integrated to Eclipse IDE
Debugging CUDA application o Using Microsoft Visual
Studio windows : Memory, Locals, Watches and Breakpoints
Analyzing the performance of your GPGPU applications o CUDA
o OpenCL
o DirectCompute
www.caps-entreprise.com 251
Warps
Each block of threads is split into warps
Each warp contains the same number of threads: 32
Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread #0
Each warp is executed by a multiprocessor in a SIMD fashion
www.caps-entreprise.com 252
Warps (2)
Divergent branches within a warp cause serialization
o If all threads in a warp take the same branch, no extra cost
o If threads in a warp take two different branches, the entire warp pays the cost of both branches of code
o If threads take n different branches, the entire warp pays the cost of all n branches of code
www.caps-entreprise.com 253
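A small CUDA sketch (not from the slides) of a divergent branch: threads of the same warp take different paths depending on their index, so the warp executes both paths one after the other.
__global__ void divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)      /* even and odd threads of the same warp       */
        out[tid] = 1.0f;   /* take different branches: the warp pays the  */
    else                   /* cost of both branches, serialized           */
        out[tid] = 2.0f;
}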
Coalescing 1.0/1.1
A coordinated read by a half-warp (16 threads)
A contiguous region of global memory o 64 bytes: each thread reads a word (int, float, ...)
o 128 bytes: each thread reads a double-word (int2, float2, ...)
o 256 bytes: each thread reads a quad-word (int4, float4,...)
Additional restrictions on G8x/G9x architecture: o Starting address for a region must be a multiple of region size
o The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating o Predicated access, divergence within a half-warp
www.caps-entreprise.com 254
Coalesced Access 1.0/1.1:
Reading floats
www.caps-entreprise.com 255
Uncoalesced Access 1.0/1.1:
Reading floats
www.caps-entreprise.com 256
Coalesced Access 2.x
Cache on global memory may hide coalescing issues
2 levels of cache
o 16-48 KB of L1 per SM
o 768 KB of L2 shared by all SMs
Memory Latency
o Global: 400-800 cycles
o L2 Cache: 100-200 cycles
o L1 Cache: about 4 cycles (without bank conflict)
www.caps-entreprise.com 257
Shared Memory
Hundreds of times faster than global memory
o About same latency as registers
o 32 banks can be accessed simultaneously with 2.x compute capability
o Successive 32 bits words are assigned to successive banks
Threads of a same block can cooperate via shared memory
o Up to 48 KBytes per multiprocessor with 2.x compute capability
Can be used to avoid non-coalesced accesses
www.caps-entreprise.com 258
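A small CUDA sketch (not from the deck) where the threads of a block stage data in shared memory and cooperate through it; it assumes a block size of 256 threads, and __syncthreads() separates the writes from the reads.
__global__ void reverse_block(const float *d_in, float *d_out)
{
    __shared__ float tile[256];          /* one tile per thread block */
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = d_in[gid];               /* coalesced load into shared memory   */
    __syncthreads();                     /* all writes finished before any read */
    d_out[gid] = tile[blockDim.x - 1 - tid];   /* read the tile in reversed order */
}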
Shared Memory:
Performance Issues
The fast case:
o If all threads of a half-warp (or warp with cc 2.x) access different banks, there is no bank conflict
o If all threads of a half-warp (or warp with cc 2.x) read the identical address, there is no bank conflict (broadcast)
The slow case:
o Bank conflict: multiple threads in the same half-warp (or warp with cc 2.x) access different addresses in the same bank
o Must serialize the accesses
o Cost = max # of simultaneous accesses to a single bank
www.caps-entreprise.com 259
Shared Memory Access
www.caps-entreprise.com 260
Pattern accesses with no
bank conflicts:
each thread of the half-warp
accesses a different bank
Shared Memory Access (2)
www.caps-entreprise.com 261
Each thread reads one
address from the same
bank: no conflict (broadcast)
Threads accessing different values in the same bank: conflict!
Optimizing Threads per Block
Choose threads per block as a multiple of the warp size o Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding o Kernel invocations can fail if too many registers are used
Heuristics o Minimum Required by the HW: 64 threads per block
• Only if multiple concurrent blocks
o 192 or 256 threads a better choice
• Usually still enough regs to compile and invoke successfully
o This all depends on your computation, so experiment!
www.caps-entreprise.com 262
Grid/Block Size Heuristics
# of blocks > # of multiprocessors o So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2 o Multiple blocks can run concurrently in a multiprocessor
o Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
o Subject to resource availability - registers, shared memory
# of blocks > 100 to scale to future devices o Blocks executed in pipeline fashion
o 1000 blocks per grid will scale across multiple generations
www.caps-entreprise.com 263
Asynchronicity & Overlapping
Default CUDA API
o Kernel launches are asynchronous with CPU
o Memcopies block CPU thread (H2D=HostToDevice,
D2H=DeviceToHost)
o CUDA calls are sequential on GPU, serialized by the driver
But CUDA also offers asynchronicity and overlapping
o Asynchronous memcopies (D2H, H2D) with CPU
o Ability to concurrently execute a kernel and a memcopy
www.caps-entreprise.com 264
Page-locked Memory, Principles (1)
Operating systems handle memory with a mechanism called
paged virtual memory:
o Divides the virtual address space of an application into memory pages
(default on x86 is 4 KiBytes)
o Allows applications to use more memory than the physical RAM
available on the system, by swapping pages to a disk
o Physical address of the page can change; this is transparent to the
application, as the virtual address does not change
Pages can be locked by the OS into physical memory to prevent
swapping and to guarantee a permanent physical address
www.caps-entreprise.com 265
Page-locked Memory, Principles (2)
A PCI-express device can only directly access physical addresses, never an application's virtual address space o So only page-locked memory can be directly exploited by the
hardware
CUDA allows the application to request page-locked memory from the CUDA kernel driver
Both the application and the device can use directly such memory o No need for time-consuming intermediate copies between the
application virtual address space and the device's on-board memory
www.caps-entreprise.com 266
CUDA page-locked memory
All CUDA versions allow the application to request page-locked
memory, often called pinned memory
o No other applications, not even the OS, can use the locked pages.
Do not use too much page-locked memory!
All CUDA memory copy functions take advantage of pinned
memory
Pinned memory is a prerequisite for asynchronous memory
copies
www.caps-entreprise.com 267
Different ways to use Page-Locked Memory
Allocation directly in Page-Locked Memory
www.caps-entreprise.com 268
//Allocate the data in physical RAM
cudaMallocHost((void**) &hostPtr, size);
…
cudaFreeHost(hostPtr);
//Do not forget it or the data will stay alive in your Main memory
Asynchronicity (1)
Synchronous execution example:
o The application waits for the GPU to complete the requested task.
www.caps-entreprise.com 269
Asynchronicity (2)
The asynchronous version:
o Control is returned to the application before the device has completed the requested task.
www.caps-entreprise.com 270
Asynchronicity (3)
Advantages
o Enables full exploitation of the hardware available on the machine
(CPU + GPU together)
o Kernel launches are already asynchronous, no need to modify your
code
Drawback
o Needs explicit synchronization for data coherency
o Transfers require extra work to setup asynchronicity
But the speed benefit already makes the extra work worthwhile
www.caps-entreprise.com 271
Overlapping
Concurrent execution of GPU kernel and transfers from/to GPU o Makes use of asynchronicity
o Particularly handy when data make frequent,
expensive round-trips between CPU and GPU
Typical cases o Several independent problems
o Several instances of a problem
o Single problem split into a set of sub-problems
Requires to use streams in your CUDA code
www.caps-entreprise.com 272
Basics of CUDA Streams (1)
You said “stream”?
o A sequence of operations that execute in order on GPU
Streams have the following properties:
Streams use asynchronicity
o Concurrent execution between CPU and GPU
Streams enable overlapping
o Concurrent execution of a kernel on the GPU and
transfers from or to the GPU
www.caps-entreprise.com 273
Basics of CUDA Streams (2)
How it works:
Operations from different streams can be interleaved
A kernel and a memcopy from different streams can be overlapped
www.caps-entreprise.com 274
Code Example
// data allocation
float * hostPtr;
cudaMallocHost((void**) &hostPtr, 2 * size);
// streams declaration
cudaStream_t stream[2];
for(int i = 0; i < 2; ++i)
cudaStreamCreate(&stream[i]);
// streamed copy from host to device
for(int i = 0; i < 2; ++i)
cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
cudaMemcpyHostToDevice, stream[i]);
// streamed execution of the kernel
for(int i = 0; i < 2; ++i)
myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
// streamed copy from device to host
for(int i = 0; i < 2; ++i)
cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
cudaMemcpyDeviceToHost, stream[i]);
// threads synchronization
cudaThreadSynchronize();
www.caps-entreprise.com 275
Using multiple CUDA Accelerators with MPI
#CUDA accelerators > #cores o Multiple MPI processes per core
(beware of CPU overload)
#CUDA accelerators == #cores o The ideal case: generally one
MPI process per core and GPU
o CPU may be idle while GPU is working
#CUDA accelerators < #cores o Share the GPUs?
o Lock the GPUs?
o Load Balancing CPU & GPU?
www.caps-entreprise.com 276
[Diagram: possible mappings of MPI processes on CPU cores to one or more GPUs per node]
Resident Data
Think differently : instead of
Use resident data mechanism
www.caps-entreprise.com 277
[Diagram: execution timelines comparing CPU->GPU and GPU->CPU transfers around each of kernels A and B with a resident-data version where the data stays on the GPU between the two kernels]
Reducing Transfers
Use GPU-resident data as much as possible o Send once, use many times, read once
o Can tremendously boost performance
o Transfers can easily be the dominant factor in GPU usage
• Then follow Amdahl’s Law by optimizing transfers rather than kernels
Examples o Multiple steps of computations in a loop
o Multiple steps of computations in sequence
Do everything requiring the resident data on the GPU if possible o Unless the computations do not fit GPU at all
www.caps-entreprise.com 278
Partial transfers
Think differently : instead of
Use partial transfer mechanism
www.caps-entreprise.com 279
[Diagram: execution timelines comparing full CPU->GPU and GPU->CPU transfers around kernels A and B with partial transfers of only the data touched by I/O]
Minimizing Quantities
Again, maximize resident data, this time by keeping sub-
arrays on the GPU
o Send once, use and update many times, read once
o If some data must absolutely come from outside the GPU
o If some data must absolutely go outside the GPU
Network or disk I/O
Computation steps unimplementable on GPU
Warning: each data transfer has an initial overhead
www.caps-entreprise.com 280
Reducing Transfers
The GPU computes faster than it performs transfers o Sometimes it is better to re-compute data than to retrieve it from a remote
memory
Don’t try to factorize data to save memory, think performance o Memory saving is often a performance killer
• Allocate more memory to re-align data onto the GPU’s global memory
• Allocate more memory to avoid bank conflicts in shared memory
• Re-compute data to avoid transfers…
Avoid computing borders on the GPU o Border cases are often performance killers due to:
• Incomplete warps
• Branch divergences
• Incomplete coalesced segments
www.caps-entreprise.com 281
CuComplex Header
Complex numbers : CuComplex Header
o Single or double precision (HW >= 1.3)
o Include cuComplex.h
www.caps-entreprise.com 282
CuBLAS Library
Basic Linear Algebra Subprograms
o Include cublas.h
o Link with libcublas.so (linux) or cublas.dll (windows)
o Up to BLAS3 (same arguments)
Available functions
o Dot-product : cublasXdot()
o Matrix multiplication : cublasXgemm()
o …
User guide : http://developer.nvidia.com/cuda-toolkit-40
www.caps-entreprise.com 283
CuFFT Library
Fast Fourier Transform
o Include cufft.h
o Link with libcufft.so (linux) or cufft.dll (windows)
o 1D, 2D or 3D
Datatype
o Real or Complex type
o Single or double precision (HW >= 1.3)
User guide : http://developer.nvidia.com/cuda-toolkit-40
www.caps-entreprise.com 284
Thrust
Templated Performance Primitives Library for CUDA
Similar to the C++ STL
Available functionality
o Containers
o Iterators
o Sort
o Scan
o Reduction
o …
www.caps-entreprise.com 285
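A minimal Thrust sketch (not from the slides) showing a container, a sort and a reduction; the values are arbitrary and the file is meant to be compiled with nvcc.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>
int main(void)
{
    thrust::device_vector<int> d(4);      /* container living in GPU memory */
    d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 2;
    thrust::sort(d.begin(), d.end());                 /* parallel sort      */
    int sum = thrust::reduce(d.begin(), d.end(), 0);  /* parallel reduction */
    std::printf("sum = %d\n", sum);       /* prints 10 */
    return 0;
}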
NPP Library
NVIDIA Performance Primitives library
GPU-accelerated image, video, and signal processing
functions
5x to 10x faster performance than CPU
Available functions
o Filter functions
o JPEG functions
o Geometry transforms
o Statistics functions
o …
www.caps-entreprise.com 286
OpenCL
Before OpenCL
GPGPU o Vertex / pixel shaders
o Heavily constrained and not adapted
CTM / Brook o Then Brook+
o Then CAL/IL
CUDA o Widely adopted
None of these technologies is hardware agnostic o Portability is not possible
www.caps-entreprise.com 288
What is Hybrid Computing with OpenCL?
OpenCL is o Open, royalty-free, standard
o For cross-platform, parallel programming of modern processors
o An Apple initiative
o Approved by Intel, Nvidia, AMD, etc.
o Specified by the Khronos group (same as OpenGL)
It intends to unify the access to heterogeneous hardware accelerators o CPUs (Intel i7, …)
o GPUs (Nvidia GTX & Tesla, AMD/ATI 58xx, …)
What’s the difference with CUDA or CAL/IL? o Portability over Nvidia, ATI, S3… platforms + CPUs
www.caps-entreprise.com 289
OpenCL Devices
NVIDIA
o All CUDA cards
AMD GPUs
o Radeon & Radeon HD
o FirePro, FireStream
o Mobility…
Intel & AMD CPUs
o X86 w/ >= SSE 3.x
Cell/B.E.
DSP
ARM
www.caps-entreprise.com 290
Inputs/Outputs with OpenCL programming
OpenCL architecture
www.caps-entreprise.com 291
[Diagram: software stack - the application and its OpenCL kernels sit on the OpenCL framework (OpenCL C language, OpenCL API, OpenCL runtime), which sits on the driver and the GPU hardware]
OpenCL and C for CUDA
www.caps-entreprise.com 292
[Diagram: C for CUDA is the entry point for developers who prefer high-level C, OpenCL for those who prefer a low-level API; both share the backend compiler and optimization technology and target PTX on the GPU]
In the remainder we will only see the C API o And lab sessions focus on the C API
The C++ API is available on the Khronos Website o http://www.khronos.org/registry/cl/
Extensions exist to o OpenGL
o Direct3D
www.caps-entreprise.com 294
Platform Model
Model consists of one or more interconnected devices
Computations occur within the Processing Elements of each device
www.caps-entreprise.com 295
Platform Version
3 different kinds of versions for an OpenCL device
The platform version
o Version of the OpenCL runtime linked with the application
The device version
o Version of the hardware
The language version
o Highest revision of the OpenCL standard that this device supports
www.caps-entreprise.com 296
Execution Model
Kernels are submitted by the host application to devices through command queues
Kernel instances, called Work-Items (WI), are identified by their point in the NDRange index space o This makes it possible to parallelize the execution of the kernels
But still 2 programming models are supported o Data-parallel
o Task parallel
So even if we have a single programming model, we should have two different programming approaches according to the paradigm we are considering
www.caps-entreprise.com 297
NDRange
NDRange is a N-Dimensional index space
o N is 1, 2 or 3
o NDRange is defined by an integer array of length N specifying the
extent of the index space on each dimension
www.caps-entreprise.com 298
Work-Groups & Work-Items
Work-Items are organized into Work-Groups (WG)
Each Work-group has a unique global ID in the NDRange
Each Work-item has
o A unique global ID in the NDRange
o A unique local ID in its work-group
www.caps-entreprise.com 299
Parallelism Grains
CPU cores can handle only a few tasks
o But more complex
• Hard control flows
• Memory cache
o They can be either CPU threads or processes
• CPU threads: OpenMP, Pthread
• CPU Processes: MPI, fork()…
GPU threads are extremely lightweight
o Very little creation overhead
o Simple and regular computations
o GPU needs 1000s of threads (w.i.) for full efficiency
www.caps-entreprise.com 300
Memory Model
Four distinct memory regions o Global Memory
o Local Memory
o Constant Memory
o Private Memory
Global and Constant memories are common to all WI o May be cached depending on the hardware capabilities
Local memory is shared by all WI of a WG
Private memory is private to each WI
www.caps-entreprise.com 301
Memory Architecture
www.caps-entreprise.com 302
Memory Transfer (1)
2 types of transfers o Blocking (“synchronous”)
o Non-Blocking (“asynchronous”)
In the clEnqueueReadBuffer / clEnqueueWriteBuffer functions, set the “blocking” attribute to: o CL_TRUE, to make a blocking transfer
o CL_FALSE, to make a non-blocking transfer
For a non-blocking transfer o Need to link an event to the transfer command
o The event will be used for producer-consumer relationship, and/or explicit waiting
www.caps-entreprise.com 303
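A fragment (not from the slides) contrasting a blocking and a non-blocking write; queue, buf, size and host_ptr are assumed to have been created earlier with the OpenCL API.
/* blocking write: returns only once the data has been transferred */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

/* non-blocking write: returns immediately, completion tracked by an event */
cl_event evt;
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_ptr, 0, NULL, &evt);
/* ... other host work ... */
clWaitForEvents(1, &evt);   /* explicit wait before reusing host_ptr */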
Memory Transfer (2)
Synchronizing ensures that data have been transferred to/from the device at this point
Example
Or can be used for the dependency flow in out-of-order queues o Use clEnqueueWaitForEvents() to synchronize in-order queues
By default, directives allocate fresh memory on the Xeon Phi
for each variable when entering the construct
By default, memory is deallocated when exiting the construct
Memory allocation is expensive: modifiers can change the
default behavior to reuse memory space
o If data has been allocated by a previous offload construct
o If data has been allocated by an attribute directive
o If data should be reused by another offload construct
Intel MIC Programming 337
Persistent Storage Example
Intel MIC Programming 338
int nb = 1000;
int *t = (int*) malloc(nb*sizeof(int));
void bar()
{
#pragma offload in(t[0:nb]:alloc_if(1))
foo(t, nb);
}
void foo(int * ptr, int size)
{
#pragma offload in(ptr[0:size]:alloc_if(0))
…
}
Allocation of t of size nb on the coprocessor
Reuse of t already on the coprocessor and free t at the end of offload section
Static and Dynamic Memory Example
Intel MIC Programming 339
__declspec(target(mic)) int array_host_mic[5000];
int array_host[5000];
void bar()
{
foo(&array_host[0], 5000);
foo(&array_host_mic[0], 5000);
}
void foo(int *t, int nb)
{
#pragma offload in(t[0:nb]:alloc_if(0))
…
}
Dynamic allocation of t on the coprocessor
Reuse of static allocation of array_host_mic on the coprocessor
Allocation on the processor and the coprocessor
Asynchronous Behavior
By default, Intel Offload directives cause the host thread to wait for the completion of the Xeon Phi instruction before going on to the next statement
Asynchronous behavior can be obtained by specifying a signal clause on the offload directive
An offload_wait directive should be used to ensure completion
Intel MIC Programming 340
[Diagram: synchronous vs asynchronous timelines of host (CPU) and coprocessor (MIC) activity]
Asynchronous Computations Example (C)
341
char sig;
int count = 1000;
__attribute__((target(mic))) void mic_compute();
do {
#pragma offload target(mic) signal(&sig)
{
mic_compute();
}
host_activity();
#pragma offload_wait(&sig)
count = count - 1;
} while(count > 0);
www.caps-entreprise.com
Asynchronous Computations Example (FORTRAN)
342
integer sig
integer count
count = 1000
!dir$ attributes offload:mic::mic_compute
do while (count .gt. 0)
!dir$ offload target(mic:0) signal(sig)
call mic_compute()
call host_activity()
!dir$ offload_wait target(mic:0) wait(sig)
count = count - 1
end do
www.caps-entreprise.com
Asynchronous Transfers
Use offload_transfer directives instead, for example in C
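The example itself is missing from the transcript; a minimal sketch of what such an asynchronous transfer might look like, reusing the mic_compute() and host_activity() routines of the earlier examples (the data array and its size are hypothetical):
float data[1000];
char sig;

/* start sending data to the coprocessor without blocking the host */
#pragma offload_transfer target(mic:0) in(data : length(1000)) signal(&sig)

host_activity();            /* overlap host work with the transfer */

/* offload a computation that waits for the transfer to complete */
#pragma offload target(mic:0) wait(&sig)
{
    mic_compute();
}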
Express data and computations to be executed on an accelerator
o Using marked code regions
Main OpenACC constructs
o Parallel and kernel regions
o Parallel loops
o Data regions
o Runtime API
[Diagram: CPU and hardware accelerator (HWA) linked by a PCI Express bus; data/stream/vector parallelism exploited by the HWA, e.g. via CUDA or OpenCL]
Intel MIC Programming 355
Execution Model
Among a bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators o Parallel regions o Kernels regions
Host is responsible for: o Allocating memory space on accelerator o Initiating data transfers o Launching computations o Waiting for completion o Deallocating memory space
Accelerators execute parallel regions: o Use work-sharing directives o Specify level of parallelization
356 Intel MIC Programming
OpenACC Execution Model
Host-controlled execution
Based on three parallelism levels
o Gangs – coarse grain
o Workers – fine grain
o Vectors – finest grain
357
[Diagram: a device contains gangs, each gang contains workers, each worker contains vectors]
Intel MIC Programming
Gangs, Workers, Vectors
In CAPS Compilers, gangs, workers and vectors correspond
to the following in an OpenCL grid
Beware: this implementation is compiler-dependent
358
numGroups(1) = 1
numGroups(0) = number of gangs
localSize(1) = number of workers
localSize(0) = number of vectors
Intel MIC Programming
Directive Syntax
C
#pragma acc directive-name [clause [, clause] …]
{
code to offload
}
Fortran
!$acc directive-name [clause [, clause] …]
code to offload
!$acc end directive-name
Intel MIC Programming 359
Parallel Construct
Starts parallel execution on the accelerator
Creates gangs and workers
The number of gangs and workers remains constant for the parallel region
One worker in each gang begins executing the code in the region
www.caps-entreprise.com 360
#pragma acc parallel […]
{
…
for(i=0; i < n; i++) {
for(j=0; j < n; j++) {
…
}
}
…
}
Code executed on the hardware
accelerator
Gangs, Workers, Vectors in Parallel Constructs
In parallel constructs, the number of gangs, workers and vectors is the same for the entire section
The clauses: o num_gangs o num_workers o vector_length
Enable to specify the number of gangs, workers and vectors in the corresponding parallel section
www.caps-entreprise.com 361
#pragma acc parallel, num_gangs(128) \
num_workers(256)
{
…
for(i=0; i < n; i++) {
for(j=0; j < m; j++) {
…
}
}
…
}
[Diagram: 128 gangs of 256 workers]
Loop Constructs
A loop directive applies to the loop that immediately follows the directive
The parallelism to use is described by one of the following clauses:
o Gang for coarse-grain parallelism
o Worker for middle-grain parallelism
o Vector for fine-grain parallelism
www.caps-entreprise.com 362
Loop Directive Example
With gang, worker or vector clauses, the iterations of the following loop are executed in parallel
Gang, worker or vector clauses enable to distribute the iterations between the available gangs, workers or vectors
www.caps-entreprise.com 363
#pragma acc parallel, num_gangs(128) \
num_workers(192) \
vector_length(32)
{
…
#pragma acc loop gang
for(i=0; i < n; i++) {
#pragma acc loop worker
for(j=0; j < m; j++) {
#pragma acc loop vector
for(k=0; k < l; k++) {
…
}
}
}
…
}
[Diagram: 128 gangs, 192 workers and 32 vectors; the i, j and k loop iterations are distributed over gangs, workers and vectors respectively]
Kernels Construct
Defines a region of code to be compiled into a sequence of accelerator kernels o Typically, each loop nest will be a distinct kernel
The number of gangs and workers can be different for each kernel
www.caps-entreprise.com 364
#pragma acc kernels […]
{
for(i=0; i < n; i++) {
…
}
…
for(j=0; j < n; j++) {
…
}
}
$!acc kernels […]
DO i=1,n
…
END DO
…
DO j=1,n
…
END DO
$!acc end kernels
1st Kernel
2nd Kernel
Gang, Worker, Vector in Kernels Constructs
The parallelism description is the same as in parallel sections
However, these clauses accept an argument to specify the number of gangs, workers or vectors to use
Every loop can have a different number of gangs, workers or vectors in the same kernels region
www.caps-entreprise.com 365
#pragma acc kernels
{
…
#pragma acc loop gang(128)
for(i=0; i < n; i++) {
…
}
…
#pragma acc loop gang(64)
for(j=0; j < m; j++) {
…
}
}
…
[Diagram: the first loop is distributed over 128 gangs, the second over 64 gangs]
Data Independency
In kernels sections, the clause independent on loop directive specifies that iterations of the loop are data-independent
The user does not have to think about gangs, workers or vector parameters
It allows the compiler to generate code to execute the iterations in parallel with no synchronization
www.caps-entreprise.com 366
A[0] = 0;
#pragma acc loop independent
for(i=1; i<n; i++)
{
A[i] = A[i]-1;
}
What is the problem using discrete accelerators?
PCIe transfers have huge latencies
In kernels and parallel regions, data are implicitly managed
o Data are automatically transferred to and from the device
o Implies possible useless communications
Avoiding transfers leads to a better performance
OpenACC offers a solution to control transfers
www.caps-entreprise.com 367
Device Memory Reuse
In this example: o A and B are allocated
and transferred for the first kernels region
o A and C are allocated and transferred for the second kernels region
How to reuse A between the two kernels regions? o And save transfer and
allocation time
www.caps-entreprise.com 368
float A[n];
#pragma acc kernels
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
Memory Allocations
Avoid data reallocation using the create clause o It declares variables, arrays or subarrays to be allocated in the device
memory
o No data specified in this clause will be copied between host and device
The scope of such a clause corresponds to a data region
Kernels and Parallel regions implicitly define data regions
The present clause declares data that are already present on the device
www.caps-entreprise.com 369
Create and Present Clause Example
www.caps-entreprise.com 370
float A[n];
#pragma acc data create(A)
{
#pragma acc kernels present(A)
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels present(A)
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
}
Allocation of A of size n on the device
Deallocation of A on the device
Reuse of A already allocated on the device
Reuse of A already allocated on the device
Data Storage: Mirroring
How is the data stored in a data region?
A data construct defines a section of code where data are mirrored between host and device
Mirroring duplicates a CPU memory block into the HWA memory o Users ensure consistency of copies via directives
www.caps-entreprise.com 371
[Diagram: a master copy in host memory is mirrored into the HWA memory; a CAPS runtime descriptor links the master copy and the mirror copy]
Arrays and Subarrays
In C and C++, arrays are specified with start and length
o For example, with an array of size n
In FORTRAN, arrays are specified with a list of range specifications
o For example, with an array a of size (n,m)
In any language, any array or subarray must be a contiguous block of memory
www.caps-entreprise.com 372
#pragma acc data create(a[0:n])
!$acc data create(a(0:n,0:m))
Transfers: Copyin Clause
Declares data that need only to be copied from the host to the device when entering the data section
o Performs input transfers only
It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region
www.caps-entreprise.com 373
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
#pragma acc kernels present(A[:n])
{
for(i=0; i < n; i++) {
C[i] = A[i] * alpha;
}
}
}
Transfers: Copyout Clause
Declares data that need only to be copied from the device to the host when exiting data section
o Performs output transfers only
It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region
www.caps-entreprise.com 374
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
#pragma acc kernels present(A[:n]) \
copyout(C[:n])
{
for(i=0; i < n; i++) {
C[i] = A[i] * alpha;
}
}
}
Transfers: Copy Clause
If we change the example, how to express that input and output transfers of C are required?
Use copy clause to: o Declare data that need to be
copied from the host to the device when entering the data section
o Assign values on the device that need to be copied back to the host when exiting the data section
o Allocate scalars, arrays and subarrays on the device memory for the duration of the data region
www.caps-entreprise.com 375
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels present(A[:n]) \
copy(C[:n])
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
}
Present_or_create Clause
Combines two behaviors
Declares data that may be present
o If data is already present, use value in the device memory
o If not, allocate data on device when entering region and deallocate when exiting
May be shortened to pcreate
www.caps-entreprise.com 376
Present_or_copyin/copyout/copy Clauses
If data is already present, use value in the device memory
If not: o present_or_copyin/present_or_copyout/present_or_copy allocate
memory on device at region entry
o present_or_copyin/present_or_copy transfer data from the host at region entry
o present_or_copyout/present_or_copy transfer data from the device to the host at region exit
o present_or_copyin/present_or_copyout/present_or_copy deallocate memory at region exit
May be shortened to pcopyin, pcopyout and pcopy
www.caps-entreprise.com 377
Present_or_* Clauses Example
www.caps-entreprise.com 378
program main
…
!$acc data create(A(1:n))
call f1( n, A, B )
…
!$acc end data
…
call f1( n, A, C )
…
contains
subroutine f1( n, A, B )
…
!$acc kernels pcopyout(A(1:n)) \
copyin(B(1:n))
do i=1,n
A(i) = B(n – i)
end do
!$acc end kernels
end subroutine f1
…
end program main
Allocation of A of size n on the device
Reuse of A already allocated on the device
Allocation of B of size n on the device for the duration of the subroutine and input transfer of B
Deallocation of A on the device
Allocation of A and B of size n on the device for the duration of the subroutine; input transfer of B and output transfer of A
Present_or_* clauses are generally safer
Default Behavior
The CAPS compiler is able to detect the variables required on the device for the kernels and parallel constructs.
According to the specification, depending on the type of the variables, they follow these policies
o Arrays: present_or_copy behavior
o Scalars
• if not a live-in or live-out variable: private behavior
• copy behavior otherwise
www.caps-entreprise.com 379
OpenACC 2.0: New Features (1)
Atomic operations: o Different kinds of atomic
sections can be executed in parallel/kernels constructs
void init(int** array, int size){
   int *a = (int*) malloc(sizeof(int)*size);
   #pragma acc enter data create(a[0:size])
   *array = a;
}
int main(){
   int *array;
   …
   init(&array, size);
   …
   #pragma acc exit data delete(array)
   …
}
Data regions:
o Intel offload: -
o OpenMP 4.0: #pragma omp target data / !$omp target data
o OpenACC: #pragma acc data / !$acc data
Transfer directives:
o Intel offload: #pragma offload_transfer / !dir$ offload_transfer
o OpenMP 4.0: #pragma omp target update / !$omp target update
o OpenACC: #pragma acc update / !$acc update
Intel MIC Programming 383
CAPEX / OPEX
with GPU
Goals – Why Using GPUs
Performance
Energy saving
Cheaper machine
Preparing code to manycore
www.caps-entreprise.com 385
Is the Machine Cheaper?
You may want
o To run faster than your competitor
o To run faster than the streamed data arrive
o To run faster in order to use less energy to compute
o To run differently to save energy
OPEX or CAPEX?
o Capital Expenditures
• Machine cost and software migration, surface cost
o Operational Expenditures
• Energy consumption, hardware and software maintenance
www.caps-entreprise.com 386
CAPEX-OPEX Analysis for a Heterogeneous
System
Capital Expenses (CapEx) o System acquisition cost
o Software migration cost
o Software acquisition cost
o Teaching cost
o Real estate cost
Operational Expenses (OpEx) o Energy cost (system + cooling)
o Maintenance cost
For a given amount of compute work, the CapEx-OpEx analysis indicates the “real” value of a given system o For instance, if I add GPUs, do I save money?
o And how many should I add?
o Then should I use slower CPU?
www.caps-entreprise.com 387
Application Speedup and CapEx-OpEx
Adding GPUs/accelerators to the system o Increases system cost
o Increases base energy consumption (one GPU = x10 watt idle)
Exploiting the GPUs/accelerators o Decreases execution time, so potentially the energy consumption for a
given amount of work
o Reduces the number of nodes of the architecture • Threshold effect on the number of routers etc.
o Requires to migrate the code
Multiple views of the value of application speedup o Shorten time-to-market
• Threshold effect
o More work performed during the lifetime of the system
www.caps-entreprise.com 388
CapEx Hardware Parameters
Choice of the hardware configuration can be: o Fast CPU + Fast GPU (expensive node)
o Slow CPU + Fast GPU
o Fast CPU + Slow GPU
o Slow CPU + Slow GPU
o Fast CPU
o Slow CPU
Node performance impacts the number of nodes o More nodes means more network, with non-negligible cost and energy
consumption
o Fewer nodes may limit scalability issues if any
Application workload analysis is the only way to decide o Optimizing software can significantly increase performance and so reduce
needed hardware
o Code migration to GPU is on the critical path
www.caps-entreprise.com 389
Small systems: a few nodes (1-8), cost in the x10 k€ range
Large systems: many nodes (x100), cost in the x1 M€ range
CapEx: Code Migration Cost
Migration cost o Learning cost
o Software environment cost
o Porting cost
Migration cost is mostly hardware size independent o Not an issue for dedicated large systems
o Different if the machine aims at serving a large community
Main migration benefit is to highlight manycore parallelism o Not specific to one kind of device
o Implementation is specific
Vendor-specific implementation solution o Amortization period similar to the one of the hardware (3 years)
Agnostic parallelism expression o Using portable solution for multiple hardware generations (amortized on 10 years)
o Of course not that simple! Still requires some level of tuning
May be very useful for non scalable message passing code
www.caps-entreprise.com 390
Mastering the cost of migration has a significant impact on the total cost for small systems
Typical effort: manpower of a few man-months, cost in the x10 k€ range
Two Applications Examples
Application 1
• Field: Monte Carlo simulation for thermal radiation
• MPI code
• Migration cost: 1 man month
Application 2
• Field: astrophysics, hydrodynamic
• MPI code
• Requires 3 GPUs per node for having enough memory space
• Migration cost: 2 man month
www.caps-entreprise.com 391
Power Consumption Application 1
www.caps-entreprise.com 392
[Chart: CPU and GPU energy consumption relative to the baseline (0 = baseline energy consumption)]
Power usage effectiveness (PUE) = Total facility power / IT equipment power Current 1.9, best practice 1.3 Src: http://www.google.com/corporate/datacenter/efficiency-measurements.html