Hardware and Software for Parallel
Computing
Florent Lebeau, CAPS entreprise
UMPC - janvier 2014
Agenda
Day 1
The Need for Parallel Computing
Introduction to Parallel Computing
o Different Levels of Parallelism
History of Supercomputers
o Hardware Accelerators
Multiprocessing
o Fork/join
o MPI
Multithreading
o Pthread
o TBB
o Cilk
o OpenMP
Day 2
Hardware Accelerators
o GPU
• CUDA
• OpenCL
o Xeon Phi
• Intel Offload
o Directive Standards
• OpenACC
• OpenMP 4.0
CapEx / OpEx with GPU
Porting Methodology
www.caps-entreprise.com 2
The Need for Parallel
Computing
The Demand (1)
Compute for research o Simulate complex physical phenomena
o Validate a mathematical model
Compute for industry o Quality control by image processing
o DNA sequence alignment
o Oil & gas prospection
o Meteorology
Compute for the masses o Playing a 1080p DVD in real time
o Compressing / uncompressing streams
o Image processing
www.caps-entreprise.com 4
The Demand (2)
Computing needs o Data
o An operation
The computing cost (time) may be
o Qcomp = Qdata * Qop
o And even worse, sometimes
To reduce the computation time, one can
o Reduce the amount of data
o Reduce the amount of operations
o Increase the computation speed
www.caps-entreprise.com 5
The Demand (3)
The amount of computations keeps increasing
o Game and screen resolutions keep improving
o Longer weather prediction
o More accurate weather prediction
• Which increases the amount of data
Amount of data grows each year
But the time available to exploit these data stays the same
o 24h for another day of weather prediction
o 1/50s for a video stream image
www.caps-entreprise.com 6
The Demand (4)
So we need to increase computations per second
At a lower cost o To purchase
o To develop
o To maintain
Technologically sustainable o Easy adaptation to next architecture
According to the company strategy o Green computing, industrial partnerships with providers…
www.caps-entreprise.com 7
The Solution (1)
By the end of the 20th century, most applications were written for CPUs (mainly x86) o The regular frequency increase of
microprocessors ensured performance gains without code modification
o The effort rested on hardware vendors, not on developers
Today the frequency increase has stalled o Because of thermal dissipation and
power leakage
o Because of component distances and die surface
Computing faster sequentially is less and less feasible o But we can compute more things
simultaneously (in parallel)
www.caps-entreprise.com 8
The Solution (2)
Advanced optimizations of code to get the best of cutting-edge CPU functionalities o Vectorization
Code parallelization o Multi-threading
o Parallel computing requires parallel codes
Porting onto specialized architectures o FPGA, cluster, GPU…
o Not only the developer’s choice, because it may imply long-term hardware investments
www.caps-entreprise.com 9
Introduction to Parallel
Computing
Flynn’s Taxonomy
Classification of computer architectures
o Established by Michael J. Flynn in 1966
4 categories based on the data and instruction flows
o SISD
o SIMD
o MISD
o MIMD
• With shared memory (CPU cores)
• With distributed memory (clusters)
www.caps-entreprise.com 11
Flynn’s Taxonomy : SISD
SISD : Single Instruction Single Data
www.caps-entreprise.com 12
[Diagram: one control unit drives one processor working on one memory]
Flynn’s Taxonomy : SIMD
SIMD : Single Instruction Multiple Data
www.caps-entreprise.com 13
[Diagram: one control unit drives processors 0 to 2, all working on a single memory]
Flynn’s Taxonomy : MISD
MISD : Multiple Instruction Single Data
www.caps-entreprise.com 14
[Diagram: control units 0 to 2 each drive a processor, all working on the same data in a single memory]
Flynn’s Taxonomy : MIMD
MIMD : Multiple Instruction Multiple Data
www.caps-entreprise.com 15
[Diagram: control units 0 to 2 each drive a processor with its own memory (memories 0 to 2)]
Distributed Memory Architectures
Processors only see their own memory
They communicate explicitly by message-passing if needed
A processor cannot access the memory of another
So the data distribution must be chosen to avoid communications
www.caps-entreprise.com 16
[Diagram: four processors, each with its own memory, connected by a network]
Shared Memory Architectures (1)
A unique address space is provided by the hardware
If caches are present, cache consistency is maintained by hardware
www.caps-entreprise.com 17
[Diagram: four processors connected to a shared memory through a network]
Shared Memory Architectures (2)
Global memory space, accessible by all processors
Processors may have local memory (i.e. cache) to hold copies of some global memory
Consistency of these copies is usually maintained by hardware o Referred to as Cache-Coherent
User-friendly
The programmer is in charge of correct synchronization between processes/threads
Suffers from a lack of scalability
www.caps-entreprise.com 18
UMA : Uniform Memory Access
Most commonly represented today by SMP machines
Identical processors
Equal access and access times to memory
Sometimes called CC-UMA - Cache Coherent UMA
www.caps-entreprise.com 19
NUMA : Non-Uniform Memory Access
Often made by physically linking two or more SMPs
One SMP can directly access memory of another SMP
Memory access across link is slower
www.caps-entreprise.com 20
Amdahl’s Law
Amdahl’s law states that the sequential part of the execution limits the potential benefit from parallelization o The execution time TP using P cores is given by:
• where seq (in [0,1]) is the fraction of the execution that is inherently
sequential
Consequences of this law o Potential performance dominated by sequential parts of the
application
www.caps-entreprise.com 21
TP = seq * T1 + (1 - seq) * T1 / P
What is a Hotspot?
A small part of code
o Most of the execution time spent in it
o Mostly loops that concentrate computation
Identified using performance profilers
Also known as kernels or compute intensive kernels
o But sometimes a hotspot can be implemented as several kernels
www.caps-entreprise.com 22
Speedup
Speedup
o Ratio between the execution time on 1 process and on P processes
Efficiency
o Ratio between the speedup and the number of cores used for the
parallel version
o A parallel application is scalable when efficiency stays close to 1 as the
number of cores increases
www.caps-entreprise.com 23
SP = T1 / TP
EP = T1 / (P * TP)
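Illustration (not from the original slides): a minimal C sketch computing the speedup, the efficiency and the Amdahl bound from the formulas above; the measured times and the sequential fraction are hypothetical values.
#include <stdio.h>
int main(void)
{
    double t1  = 100.0;  /* measured sequential time (hypothetical) */
    double tp  = 30.0;   /* measured time on P cores (hypothetical) */
    int    p   = 4;      /* number of cores */
    double seq = 0.1;    /* inherently sequential fraction, in [0,1] */
    double speedup    = t1 / tp;        /* SP = T1 / TP */
    double efficiency = t1 / (p * tp);  /* EP = T1 / (P * TP) */
    /* Amdahl: TP = seq*T1 + (1-seq)*T1/P, so the best possible speedup is */
    double amdahl_sp  = 1.0 / (seq + (1.0 - seq) / p);
    printf("speedup=%.2f efficiency=%.2f Amdahl bound=%.2f\n",
           speedup, efficiency, amdahl_sp);
    return 0;
}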
Amdahl’s law
www.caps-entreprise.com 24
Speedup
Speedup
o Ratio between the original time and the optimized time
www.caps-entreprise.com 25
Sp = Tseq / Tp
Gustafson’s Law
States that increasing the amount of data increases the
parallelism potential of the application
o The more computations you have, the more computations you may
overlap
A parallel architecture needs to be well exploited to get good
performance
o The more parallel computation you send to it, the better performance you get
www.caps-entreprise.com 26
Scalability
Scalability gives us an idea about the system behaviour
when the number of processors is increased.
Applications can often exploit large parallel machines by
scaling the problems to larger instances
To improve the scalability :
o Increase the parallelism of the application
www.caps-entreprise.com 27
Load Balancing
It is the capacity of the application to balance the amount of work between Processing Elements
It can be statically or
dynamically adapted
www.caps-entreprise.com 28
Granularity
Granularity refers to the amount of computation relative to
communication
Larger parallel tasks usually provide better speedups
o Reduce startup overhead
o Reduce communication and synchronization overhead
Smaller granularity can be exploited on strongly coupled
architectures, such as multicores
o Can require deep rewriting of the application
www.caps-entreprise.com 29
Different Levels of Parallelism
Different Levels of Parallelism
ALU o Vectorization (MMX / 3DNow!, SSE)
Instruction Pipeline o Instruction Level Parallelism (ILP)
Mono/multicore o Dual-core (2005 with the AMD Opteron)
o Quad-core, etc.
o “Duplication of the processors”
Mono/multisocket o Intel Xeon bi-socket
www.caps-entreprise.com 32
Scalar Processors
One data is computed at a time
o The architecture is designed to perform one instruction on one data
per clock cycle
o Contrary to vector (or superscalar) architectures
One data = a value or scalar variable
o i.e. a value or a variable of a given type
o As opposed to a composite data:
• A vector
• A character string in certain programming languages
Ex : Intel 80486
www.caps-entreprise.com 33
Superscalar Processors
Able to perform many instructions simultaneously o Each in a different pipeline
Ex : Intel P5 (1993)
www.caps-entreprise.com 34
Vector Processors
Their architecture is based on pipelines
o A vector instruction executes the same operation on all the data of a
vector
Ex : Cray, NEC, IBM, DEC, TI, Apple AltiVec G4 and G5…
www.caps-entreprise.com 35
SIMD Processors
Single Instruction Multiple Data
Ex : MMX, SSE, ARM Neon, SPARC VIS, MIPS…
www.caps-entreprise.com 36
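Illustration (not from the original deck): a minimal C sketch of SIMD vectorization with SSE intrinsics, performing four float additions with a single instruction; the array contents are hypothetical.
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>
int main(void)
{
    float a[4] = {1.f, 2.f, 3.f, 4.f};
    float b[4] = {10.f, 20.f, 30.f, 40.f};
    float c[4];
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into one SSE register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, 4 additions */
    _mm_storeu_ps(c, vc);
    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}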
Today
Most architectures are based on superscalar processors
o And SIMD
Mono-socket for mass market
o Dual-socket or more in clusters or professional workstations
www.caps-entreprise.com 37
History of Supercomputers
Top500.org
Lists the world’s 500 most powerful supercomputers
www.caps-entreprise.com 39
Architecture Types
www.caps-entreprise.com 40
Architecture Types
Single Processor o But a big one
Cluster
UP^n with SAS : o SMP if n is small
o MPP if n is large
(UP^n with SAS)^m : o If n << m and no SAS between the groups : cluster
o Constellation otherwise
www.caps-entreprise.com 41
[Notation diagram: UP ; (UP^n) with SAS ; ((UP^n) with SAS)^m without SAS]
Architecture Types
www.caps-entreprise.com 42
Processor Types
www.caps-entreprise.com 43
Processor Types
www.caps-entreprise.com 44
Number of Processors
www.caps-entreprise.com 45
Interconnect Type
www.caps-entreprise.com 46
Installation Type
www.caps-entreprise.com 47
Clusters
An Example: Nova Cluster
CAPS entreprise, 2009
www.caps-entreprise.com 49
Nova Architecture
Nova is composed of:
o 1 login node (Nova0)
o 3 storage nodes over Lustre (Nova1 to Nova3)
o 20 compute nodes (Nova4 to Nova23)
Each compute node is made up of:
o A dual-socket Intel Nehalem (bi-processor) machine
• Each Nehalem processor is quad-core (4 CPU cores)
o 2 Nvidia Tesla C1060 GPUs
o 24 GB of memory
www.caps-entreprise.com 50
Nova’s Compute Nodes Architecture
www.caps-entreprise.com 51
[Diagram: two Nehalem sockets linked by Intel QPI, each with 12 GB of memory; an Intel S5520 chipset connects two Tesla C1060 GPUs through PCIe 2.0 16x]
Nova Architecture
www.caps-entreprise.com 52
[Diagram: the Internet reaches the login node Nova0 (nova0.caps-entreprise.com), which connects to the file system (Nova1-3) and the compute nodes (Nova4-23)]
Pros
• Less expensive
• Than one multiprocessor server of similar computing power
• In particular SMP
• More flexible
• The size is adapted to the needs and budget, contrary to monolithic
configurations
www.caps-entreprise.com 53
Exploiting clusters
As a distributed system
o That is what they actually are
o Resources can be shared among users, applications …
o More complicated to program
As a virtual SMP
o Kerrighed
o The OS hides the underlying architecture
o Easier to program but less control
• A cluster is NUMA-type. Data distribution is important
www.caps-entreprise.com 54
Exploiting Clusters in Distributed Mode
In distributed mode, a cluster is generally provided with a task
scheduler
o Makes it possible to add more servers
o Makes it possible to handle node failures
o Slurm, SGE, PBS,…
www.caps-entreprise.com 55
$ srun –n 4 ./myProgram.exe
myProgram is running on node 13
myProgram is running on node 14
myProgram is running on node 15
myProgram is running on node 16
$
Connection to the Login Node
Before offloading your computations on Nova’s compute
nodes, you need to login to Nova0
o “Secondary” will automatically connect you to Nova0
Broadcast (1)
MPI_Bcast broadcasts a message from the process with
rank "root" to all other processes of the group
Before a MPI_Bcast :
After a MPI_Bcast :
www.caps-entreprise.com 128
[Diagram: before MPI_Bcast, only the root process holds the value 10; after, processes 1 to 4 all hold 10]
Broadcast (2)
C
o int MPI_Bcast ( void *buff, int count, MPI_Datatype datatype, int root,
MPI_Comm comm )
Fortran :
o Subroutine MPI_BCAST(buff, count, datatype, root, comm)
buff, count and datatype describe the send/receive buffer
in comm
root is the rank of the sender
www.caps-entreprise.com 129
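A minimal sketch (not from the slides) of MPI_Bcast in C; rank 0 is used as the root, matching the diagram above.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 10;              /* only the root knows the value */
    /* after the call, every process holds value == 10 */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}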
Scatter (1)
MPI_Scatter sends data from one task to all other tasks in a
group
Before a MPI_Scatter :
After a MPI_Scatter :
www.caps-entreprise.com 130
[Diagram: before MPI_Scatter, process 1 holds 10 11 12 13; after, processes 1 to 4 hold 10, 11, 12 and 13 respectively]
Scatter (2)
C o int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype,
void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm )
Fortran o Subroutine MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm)
sendbuf, sendcount and sendtype describe the root sender in comm
recvbuf, recvcount and recvtype describe the receivers in comm
www.caps-entreprise.com 131
Gather (1)
MPI_Gather gathers together values from a group of
processes
Before a MPI_Gather :
After a MPI_Gather :
www.caps-entreprise.com 132
[Diagram: before MPI_Gather, processes 1 to 4 hold 10, 11, 12 and 13; after, the root process holds 10 11 12 13]
Gather (2)
C o int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm )
Fortran o Subroutine MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm)
sendbuf, sendcount and sendtype describe the senders in comm
recvbuf, recvcount and recvtype describe the root receiver in comm
www.caps-entreprise.com 133
Reduce (1)
MPI_Reduce reduces values on all processes to a single
value. The example is using the operation MPI_SUM.
Before a MPI_Reduce :
After a MPI_Reduce :
www.caps-entreprise.com 134
[Diagram: before MPI_Reduce, processes 1 to 4 hold 10, 11, 12 and 13; after a sum reduction, the root process holds 46]
Reduce (2)
C o int MPI_Reduce ( void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Fortran o Subroutine MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root,
comm, err)
sendbuf, count and datatype describe the send buffer in comm
recvbuf and root describe the receiver in comm
op specifies the operator of reduction
www.caps-entreprise.com 135
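A minimal sketch (not in the original slides) using MPI_Reduce with the MPI_SUM operator; rank 0 receives the result, as in the diagram of the previous slide.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, size, local, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    local = 10 + rank;   /* e.g. ranks 0..3 contribute 10, 11, 12, 13 */
    /* the sum of all local values ends up on the root (rank 0) */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %d\n", sum);   /* 46 with 4 processes */
    MPI_Finalize();
    return 0;
}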
Reduction Operators
www.caps-entreprise.com 136
Name Meaning
MPI_MAX maximum
MPI_MIN minimum
MPI_SUM sum
MPI_PROD product
MPI_LAND logical and
MPI_BAND bit-wise and
MPI_LOR logical or
MPI_BOR bit-wise or
MPI_LXOR logical xor
MPI_BXOR bit-wise xor
MPI_MAXLOC max value and location
MPI_MINLOC min value and location
Barrier
C
o int MPI_Barrier ( MPI_Comm comm )
Fortran
o Subroutine MPI_BARRIER(comm)
comm is the communicator
Blocks the caller until all group members have called it
www.caps-entreprise.com 137
To go further
There exist different variants of the classical send o Buffered send with user-provided buffering (MPI_Bsend)
o Blocking ready send (MPI_Rsend)
o Blocking synchronous send (MPI_Ssend)
o …
There exist completely collective communications o MPI_Allgather
o MPI_Allscatter
o MPI_Alltoall
o MPI_Allreduce
o …
You can mix OpenMP with MPI
www.caps-entreprise.com 138
Multithreading
www.caps-entreprise.com 139
What is a Thread?
An independent stream of instructions
o That can be scheduled to run as such by the operating system
o To the software developer, the concept of a "procedure" that runs
independently from its main program may best describe a thread
Consider a main program (a.out) that contains a number of
procedures being able to be scheduled to run
simultaneously and/or independently by the operating
system
o That would describe a "multi-threaded" program
www.caps-entreprise.com 140
What is a Thread?
So, in summary, in the UNIX environment a thread: o Exists within a process and uses the process resources
o Has its own independent flow of control as long as its parent process exists and the OS supports it
o Duplicates only the essential resources it needs to be independently schedulable
o May share the process resources with other threads that act equally independently (and dependently)
o Dies if the parent process dies - or something similar
o Is "lightweight" because most of the overhead has already been accomplished through the creation of its process
Because threads within the same process share resources: o Changes made by one thread to shared system resources (such as closing a file)
will be seen by all other threads
o Two pointers having the same value point to the same data
o Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer
www.caps-entreprise.com 141
Pthread
What are Pthreads?
Historically, hardware vendors have implemented their own proprietary versions of threads o These implementations differed substantially from each other making
it difficult for programmers to develop portable threaded applications
In order to take full advantage of the capabilities provided by threads, a standardized programming interface was required o For UNIX systems, this interface has been specified by the IEEE
POSIX 1003.1c standard (1995)
o Implementations adhering to this standard are referred to as POSIX threads, or Pthreads
o Most hardware vendors now offer Pthreads in addition to their proprietary APIs
www.caps-entreprise.com 143
Why Pthreads?
When compared to the cost of creating and managing a
process, a thread can be created with much less operating
system overhead
www.caps-entreprise.com 144
On-node Communications
MPI libraries usually implement on-node task communication via shared
memory, which involves at least one memory copy operation (process to
process)
o For Pthreads there is no intermediate memory copy required because
threads share the same address space within a single process.
o There is no data transfer, per se. It becomes more of a cache-to-CPU or
memory-to-CPU bandwidth (worst case) situation
www.caps-entreprise.com 145
Threads = Shared Memory Model
All threads have access to the same global, shared memory
o Threads also have their own private data
o Programmers are responsible for synchronizing access (protecting)
globally shared data
www.caps-entreprise.com 146
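As an illustration of the synchronization responsibility mentioned above (not from the slides), a minimal Pthreads sketch protecting a shared counter with a mutex; the thread count and iteration count are arbitrary.
#include <pthread.h>
#include <stdio.h>
static long counter = 0;                                  /* globally shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* protect the read-modify-write */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000, deterministic thanks to the mutex */
    return 0;
}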
Thread-safeness
Refers to an application's ability to execute multiple threads simultaneously without damaging shared data or creating race conditions
o The implication to users of external library routines is that if you aren't
100% certain the routine is thread-safe, then you take your chances with problems that could arise
www.caps-entreprise.com 147
Pthread API
For C/C++
From Intel, PathScale, PGI, GNU, IBM
Initially, your main() program comprises a single, default
thread
o All other threads must be explicitly created by the programmer
int main (int argc, char *argv[]) {
   pthread_t threads[NUM_THREADS];
   int rc;
   long t;
   for(t=0; t<NUM_THREADS; t++){
      printf("In main: creating thread %ld\n", t);
      rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
      if (rc){
         printf("ERROR; return code is %d\n", rc);
         exit(-1);
      }
   }
   /* Last thing that main() should do */
   pthread_exit(NULL);
}
Pthread Management Basis
Routines
Joining is one way to accomplish synchronization between
threads
o The pthread_join() subroutine blocks the calling thread until the specified thread terminates
[Table fragment, four GPU generations; column headers missing in the transcript]
Global Memory: Up to 4 GB | Up to 4 GB | Up to 12 GB | Up to 12 GB
Cache on Global Mem: No | No | Yes (L1-L2) | Yes (L1-L2)
Size of L2 Cache: - | - | 768 kB | Up to 1536 kB
Size of L1 Cache per multiprocessor: - | - | 16/48 kB | 16/48 kB
www.caps-entreprise.com 236
Thread Batching:
Grids and Blocks
A kernel is executed as a grid of thread blocks o All threads share data
memory space
A thread block is a batch of threads that can cooperate with each other by: o Synchronizing their execution
• For hazard-free shared memory accesses
o Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate o Atomic operations added in HW 1.1
www.caps-entreprise.com 237
[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Grid 1 contains Blocks (0,0) to (2,1); Block (1,1) contains Threads (0,0) to (4,2)]
Block and Thread IDs
Threads and blocks have IDs o So each thread can
decide what data to work on
o Block ID: 1D, 2D or 3D o Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data o Image processing o Solving PDEs on volumes o …
www.caps-entreprise.com 238
[Diagram: a grid of Blocks (0,0) to (2,1); Block (1,1) expanded into its Threads (0,0) to (4,2), illustrating block and thread IDs]
Block and Thread keywords
Block keywords threadIdx.{x,y,z} defines
the thread index inside the block
blockDim.{x,y,z} defines the block dimensions
Grid keywords
blockIdx.{x,y,z} defines the block index inside the grid
gridDim.{x,y,z} defines the grid dimension
www.caps-entreprise.com 239
[Diagram: a block of threads indexed (x,y,z) with blockDim.x along the first dimension; a grid of blocks with dimensions gridDim.x and gridDim.y]
Memory Space Overview
Each thread can:
o R/W per-thread registers
o R/W per-thread local memory
o R/W per-block shared memory
o R/W per-grid global memory
o Read only per-grid constant memory
o Read only per-grid texture memory
The host can:
o R/W global memory
o R/W constant memory
o R/W texture memory
www.caps-entreprise.com 240
[Diagram: on the device, each thread has registers and local memory, each block has a shared memory, and the grid-wide global, constant and texture memories are also accessible from the host]
Memory Allocation
• cudaMalloc()
o Allocates object in the device Global Memory
o Requires two parameters
• Address of a pointer to the allocated object
• Size of allocated object
• cudaFree()
o Frees object from device Global Memory
• Pointer to freed object
www.caps-entreprise.com 241
CUDA Host-Device Data Transfer
• cudaMemcpy()
o Memory data transfer
o Requires 4 parameters
• Pointer to source
• Pointer to destination
• Number of bytes copied
• Type of transfer
– Host to Host
– Host to Device
– Device to Host
– Device to Device
• Asynchronous variant supported on 1.1+ HW
www.caps-entreprise.com 242
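A minimal sketch (not from the slides) combining cudaMalloc(), cudaMemcpy() and cudaFree(); the array size and names are hypothetical.
#include <cuda_runtime.h>
#include <stdio.h>
int main(void)
{
    const int n = 256;
    size_t size = n * sizeof(float);
    float h_data[256];
    for (int i = 0; i < n; i++) h_data[i] = (float)i;
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, size);                        /* allocate in device global memory */
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice); /* host to device */
    /* ... launch kernels working on d_data here ... */
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); /* device to host */
    cudaFree(d_data);                                         /* free device global memory */
    printf("h_data[10] = %f\n", h_data[10]);
    return 0;
}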
CUDA Function Declarations
__global__ defines a kernel function
o Must return void
www.caps-entreprise.com 243
Executed on the: | Only callable from the:
__device__ float DeviceFunc() | device | device
__global__ void KernelFunc() | device | host
__host__ float HostFunc() | host | host
CUDA Functions Declaration
Address of __device__ functions cannot be taken
For functions executed on the device:
o No recursion (HW < 2.0)
o Recursion possible for __device__ function (HW >= 2.0)
o No static variable declarations inside the function
o No variable number of arguments
www.caps-entreprise.com 244
Calling a Kernel Function
Thread Creation
A kernel function must be called with an execution configuration:
o Any call to a kernel function is asynchronous, explicit synchronization needed for blocking
www.caps-entreprise.com 245
__global__ void KernelFunc(...);
dim3 DimGrid(100, 50); // 5000 thread blocks
dim3 DimBlock(8, 8, 4); // 256 threads per block
…
KernelFunc<<< DimGrid, DimBlock >>>(...);
…
cudaThreadSynchronize();
Compilation
Any source file containing CUDA language extensions must be compiled with nvcc
nvcc is a compiler driver o Works by invoking all the necessary tools and compilers like
cudacc, g++, cl, ...
nvcc can output: o Either C code
• That must then be compiled with the rest of the application using another tool
o Or object code directly
www.caps-entreprise.com 246
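As a hedged illustration (file name and architecture flag are hypothetical), a typical build and run line with nvcc:
$ nvcc -O2 -arch=sm_20 -o myApp myApp.cu
$ ./myApp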
Linking
Any executable with CUDA code requires one dynamic library: o The CUDA runtime library (cudart)
With gcc, you may need to link with the standard C++ library o libstdc++
www.caps-entreprise.com 247
Debugging : CudaGDB
On Linux or Mac OS X
Compile your application with nvcc using the –g and –G options
Execute the debugger with : cuda-gdb
Possible to : o Get device information, gridDim and blockDim
o Break on the host and in the kernel
o Switch between CUDA Threads and host thread
Can be integrated to : o Emacs GUI
o DDD
Another available debugger o Allinea DDT
www.caps-entreprise.com 248
GPU Debugging Pitfalls
But not all illegal program behavior can be caught
Conditions to Debug application on the local machine o Linux
• If single GPU, no Graphical Server running on the system
• 2 GPUs on the machine, 1 running the Graphical Server and the second running the CUDA program
o Windows
• Only possible if there are two GPUs
• 1 for the visualization
• 1 to debug the CUDA application
On a remote machine, no problem
www.caps-entreprise.com 249
Profiler
31/01/2014 www.caps-entreprise.com 250
Parallel NSight
Available on Windows and Linux o Integrated to Microsoft
Visual Studio
o Integrated to Eclipse IDE
Debugging CUDA application o Using Microsoft Visual
Studio windows : Memory, Locals, Watches and Breakpoints
Analyzing the performance of your GPGPU applications o CUDA
o OpenCL
o DirectCompute
www.caps-entreprise.com 251
Warps
Each block of threads is split into warps
Each warp contains the same number of threads: 32
Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread #0
Each warp is executed by a multiprocessor in a SIMD fashion
www.caps-entreprise.com 252
Warps (2)
Divergent branches within a warp cause serialization
o If all threads in a warp take the same branch, no extra cost
o If threads in a warp take two different branches, the entire warp pays the cost of both branches of code
o If threads take n different branches, the entire warp pays the cost of all n branches of code
www.caps-entreprise.com 253
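A small CUDA sketch (not from the slides) of a divergent branch: threads of the same warp take different paths depending on their index, so the warp executes both paths one after the other.
__global__ void divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)      /* even and odd threads of the same warp       */
        out[tid] = 1.0f;   /* take different branches: the warp pays the  */
    else                   /* cost of both branches, serialized           */
        out[tid] = 2.0f;
}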
Coalescing 1.0/1.1
A coordinated read by a half-warp (16 threads)
A contiguous region of global memory o 64 bytes: each thread reads a word (int, float, ...)
o 128 bytes: each thread reads a double-word (int2, float2, ...)
o 256 bytes: each thread reads a quad-word (int4, float4,...)
Additional restrictions on G8x/G9x architecture: o Starting address for a region must be a multiple of region size
o The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating o Predicated access, divergence within a half-warp
www.caps-entreprise.com 254
Coalesced Access 1.0/1.1:
Reading floats
www.caps-entreprise.com 255
Uncoalesced Access 1.0/1.1:
Reading floats
www.caps-entreprise.com 256
Coalesced Access 2.x
Cache on global memory may hide coalescing issues
2 levels of cache
o 16-48 KB of L1 per SM
o 768 KB of L2 shared by all SMs
Memory Latency
o Global: 400-800 cycles
o L2 Cache: 100-200 cycles
o L1 Cache: about 4 cycles (without bank conflict)
www.caps-entreprise.com 257
Shared Memory
Hundreds of times faster than global memory
o About same latency as registers
o 32 banks can be accessed simultaneously with 2.x compute capability
o Successive 32 bits words are assigned to successive banks
Threads of a same block can cooperate via shared memory
o Up to 48 KBytes per multiprocessor with 2.x compute capability
Can be used to avoid non-coalesced accesses
www.caps-entreprise.com 258
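A small CUDA sketch (not from the deck) where the threads of a block stage data in shared memory and cooperate through it; it assumes a block size of 256 threads, and __syncthreads() separates the writes from the reads.
__global__ void reverse_block(const float *d_in, float *d_out)
{
    __shared__ float tile[256];          /* one tile per thread block */
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = d_in[gid];               /* coalesced load into shared memory   */
    __syncthreads();                     /* all writes finished before any read */
    d_out[gid] = tile[blockDim.x - 1 - tid];   /* read the tile in reversed order */
}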
Shared Memory:
Performance Issues
The fast case:
o If all threads of a half-warp (or warp with cc 2.x) access different banks, there is no bank conflict
o If all threads of a half-warp (or warp with cc 2.x) read the identical address, there is no bank conflict (broadcast)
The slow case:
o Bank conflict: multiple threads in the same half-warp (or warp with cc 2.x) access different addresses in the same bank
o Must serialize the accesses
o Cost = max # of simultaneous accesses to a single bank
www.caps-entreprise.com 259
Shared Memory Access
www.caps-entreprise.com 260
Pattern accesses with no
bank conflicts:
each thread of the half-warp
accesses a different bank
Shared Memory Access (2)
www.caps-entreprise.com 261
Each thread reads one
address from the same
bank: no conflict (broadcast)
Threads accessing different values in the same bank: conflict!
Optimizing Threads per Block
Choose threads per block as a multiple of the warp size o Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding o Kernel invocations can fail if too many registers are used
Heuristics o Minimum Required by the HW: 64 threads per block
• Only if multiple concurrent blocks
o 192 or 256 threads a better choice
• Usually still enough regs to compile and invoke successfully
o This all depends on your computation, so experiment!
www.caps-entreprise.com 262
Grid/Block Size Heuristics
# of blocks > # of multiprocessors o So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2 o Multiple blocks can run concurrently in a multiprocessor
o Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
o Subject to resource availability - registers, shared memory
# of blocks > 100 to scale to future devices o Blocks executed in pipeline fashion
o 1000 blocks per grid will scale across multiple generations
www.caps-entreprise.com 263
Asynchronicity & Overlapping
Default CUDA API
o Kernel launches are asynchronous with CPU
o Memcopies block CPU thread (H2D=HostToDevice,
D2H=DeviceToHost)
o CUDA calls are sequential on GPU, serialized by the driver
But CUDA also offers asynchronicity and overlapping
o Asynchronous memcopies (D2H, H2D) with CPU
o Ability to concurrently execute a kernel and a memcopy
www.caps-entreprise.com 264
Page-locked Memory, Principles (1)
Operating systems handle memory with a mechanism called
paged virtual memory:
o Divides the virtual address space of an application into memory pages
(default on x86 is 4 KiBytes)
o Allows applications to use more memory than the physical RAM
available on the system, by swapping pages to a disk
o Physical address of the page can change; this is transparent to the
application, as the virtual address does not change
Pages can be locked by the OS into physical memory to prevent
swapping and to guarantee a permanent physical address
www.caps-entreprise.com 265
Page-locked Memory, Principles (2)
A PCI-express device can only directly access physical addresses, never an application's virtual address space o So only page-locked memory can be directly exploited by the
hardware
CUDA allows the application to request page-locked memory from the CUDA kernel driver
Both the application and the device can use directly such memory o No need for time-consuming intermediate copies between the
application virtual address space and the device's on-board memory
www.caps-entreprise.com 266
CUDA page-locked memory
All CUDA versions allow the application to request page-locked
memory, often called pinned memory
o No other applications, not even the OS, can use the locked pages.
Do not use too much page-locked memory!
All CUDA memory copy functions take advantage of pinned
memory
Pinned memory is a prerequisite for asynchronous memory
copies
www.caps-entreprise.com 267
Different ways to use Page-Locked Memory
Allocation directly in Page-Locked Memory
www.caps-entreprise.com 268
//Allocate the data in physical RAM
cudaMallocHost((void**) &hostPtr, size);
…
cudaFreeHost(hostPtr);
//Do not forget it or the data will stay alive in your Main memory
Asynchronicity (1)
Synchronous execution example:
o The application waits for the GPU to complete the requested task.
www.caps-entreprise.com 269
Asynchronicity (2)
The asynchronous version:
o Control is returned to the application before the device has completed the requested task.
www.caps-entreprise.com 270
Asynchronicity (3)
Advantages
o Enables full exploitation of the hardware available on the machine
(CPU + GPU together)
o Kernel launches are already asynchronous, no need to modify your
code
Drawback
o Needs explicit synchronization for data coherency
o Transfers require extra work to setup asynchronicity
But the speed benefit already makes the extra work worthwhile
www.caps-entreprise.com 271
Overlapping
Concurrent execution of GPU kernel and transfers from/to GPU o Makes use of asynchronicity
o Particularly handy when data make frequent,
expensive round-trips between CPU and GPU
Typical cases o Several independent problems
o Several instances of a problem
o Single problem split into a set of sub-problems
Requires to use streams in your CUDA code
www.caps-entreprise.com 272
Basics of CUDA Streams (1)
You said “stream”?
o A sequence of operations that execute in order on GPU
Streams have the following properties:
Streams use asynchronicity
o Concurrent execution between CPU and GPU
Streams enable overlapping
o Concurrent execution of a kernel on the GPU and
transfers from or to the GPU
www.caps-entreprise.com 273
Basics of CUDA Streams (2)
How it works:
Operations from different streams can be interleaved
A kernel and a memcopy from different streams can be overlapped
www.caps-entreprise.com 274
Code Example
// data allocation
float * hostPtr;
cudaMallocHost((void**) &hostPtr, 2 * size);
// streams declaration
cudaStream_t stream[2];
for(int i = 0; i < 2; ++i)
cudaStreamCreate(&stream[i]);
// streamed copy from host to device
for(int i = 0; i < 2; ++i)
cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
cudaMemcpyHostToDevice, stream[i]);
// streamed execution of the kernel
for(int i = 0; i < 2; ++i)
myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
// streamed copy from device to host
for(int i = 0; i < 2; ++i)
cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
cudaMemcpyDeviceToHost, stream[i]);
// threads synchronization
cudaThreadSynchronize();
www.caps-entreprise.com 275
Using multiple CUDA Accelerators with MPI
#CUDA accelerators > #cores o Multiple MPI processes per core
(beware of CPU overload)
#CUDA accelerators == #cores o The ideal case: generally one
MPI process per core and GPU
o CPU may be idle while GPU is working
#CUDA accelerators < #cores o Share the GPUs?
o Lock the GPUs?
o Load Balancing CPU & GPU?
www.caps-entreprise.com 276
[Diagram: possible mappings of MPI processes on CPU cores to one or more GPUs per node]
Resident Data
Think differently : instead of
Use resident data mechanism
www.caps-entreprise.com 277
[Diagram: execution timelines comparing CPU->GPU and GPU->CPU transfers around each of kernels A and B with a resident-data version where the data stays on the GPU between the two kernels]
Reducing Transfers
Use GPU-resident data as much as possible o Send once, use many times, read once
o Can tremendously boost performance
o Transfers can easily be the dominant factor in GPU usage
• Then follow Amdahl’s Law by optimizing transfers rather than kernels
Examples o Multiple steps of computations in a loop
o Multiple steps of computations in sequence
Do everything requiring the resident data on the GPU if possible o Unless the computations do not fit GPU at all
www.caps-entreprise.com 278
Partial transfers
Think differently : instead of
Use partial transfer mechanism
www.caps-entreprise.com 279
[Diagram: execution timelines comparing full CPU->GPU and GPU->CPU transfers around kernels A and B with partial transfers of only the data touched by I/O]
Minimizing Quantities
Again, maximize resident data, this time by keeping sub-
arrays on the GPU
o Send once, use and update many times, read once
o If some data must absolutely come from outside the GPU
o If some data must absolutely go outside the GPU
Network or disk I/O
Computation steps unimplementable on GPU
Warning: each data transfer has an initial overhead
www.caps-entreprise.com 280
Reducing Transfers
The GPU computes faster than it performs transfers o Sometimes it is better to re-compute data than to retrieve it from a remote
memory
Don’t try to factorize data to save memory, think performance o Memory saving is often a performance killer
• Allocate more memory to re-align data onto the GPU’s global memory
• Allocate more memory to avoid bank conflicts in shared memory
• Re-compute data to avoid transfers…
Avoid computing borders on the GPU o Border cases are often performance killers due to:
• Incomplete warps
• Branch divergences
• Incomplete coalesced segments
www.caps-entreprise.com 281
CuComplex Header
Complex numbers : CuComplex Header
o Single or double precision (HW >= 1.3)
o Include cuComplex.h
www.caps-entreprise.com 282
CuBLAS Library
Basic Linear Algebra Subprograms
o Include cublas.h
o Link with libcublas.so (linux) or cublas.dll (windows)
o Up to BLAS3 (same arguments)
Available functions
o Dot-product : cublasXdot()
o Matrix multiplication : cublasXgemm()
o …
User guide : http://developer.nvidia.com/cuda-toolkit-40
www.caps-entreprise.com 283
CuFFT Library
Fast Fourier Transform
o Include cufft.h
o Link with libcufft.so (linux) or cufft.dll (windows)
o 1D, 2D or 3D
Datatype
o Real or Complex type
o Single or double precision (HW >= 1.3)
User guide : http://developer.nvidia.com/cuda-toolkit-40
www.caps-entreprise.com 284
Thrust
Templated Performance Primitives Library for CUDA
Similar to the C++ STL
Available functionality
o Containers
o Iterators
o Sort
o Scan
o Reduction
o …
www.caps-entreprise.com 285
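A minimal Thrust sketch (not from the slides) showing a container, a sort and a reduction; the values are arbitrary and the file is meant to be compiled with nvcc.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>
int main(void)
{
    thrust::device_vector<int> d(4);      /* container living in GPU memory */
    d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 2;
    thrust::sort(d.begin(), d.end());                 /* parallel sort      */
    int sum = thrust::reduce(d.begin(), d.end(), 0);  /* parallel reduction */
    std::printf("sum = %d\n", sum);       /* prints 10 */
    return 0;
}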
NPP Library
NVIDIA Performance Primitives library
GPU-accelerated image, video, and signal processing
functions
5x to 10x faster performance than CPU
Available functions
o Filter functions
o JPEG functions
o Geometry transforms
o Statistics functions
o …
www.caps-entreprise.com 286
OpenCL
Before OpenCL
GPGPU o Vertex / pixel shaders
o Heavily constrained and not adapted
CTM / Brook o Then Brook+
o Then CAL/IL
CUDA o Widely adopted
None of these technologies is hardware agnostic o Portability is not possible
www.caps-entreprise.com 288
What is Hybrid Computing with OpenCL?
OpenCL is o Open, royalty-free, standard
o For cross-platform, parallel programming of modern processors
o An Apple initiative
o Approved by Intel, Nvidia, AMD, etc.
o Specified by the Khronos group (same as OpenGL)
It intends to unify the access to heterogeneous hardware accelerators o CPUs (Intel i7, …)
o GPUs (Nvidia GTX & Tesla, AMD/ATI 58xx, …)
What’s the difference with CUDA or CAL/IL? o Portability over Nvidia, ATI, S3… platforms + CPUs
www.caps-entreprise.com 289
OpenCL Devices
NVIDIA
o All CUDA cards
AMD GPUs
o Radeon & Radeon HD
o FirePro, FireStream
o Mobility…
Intel & AMD CPUs
o X86 w/ >= SSE 3.x
Cell/B.E.
DSP
ARM
www.caps-entreprise.com 290
Inputs/Outputs with OpenCL programming
OpenCL architecture
www.caps-entreprise.com 291
[Diagram: software stack - the application and its OpenCL kernels sit on the OpenCL framework (OpenCL C language, OpenCL API, OpenCL runtime), which sits on the driver and the GPU hardware]
OpenCL and C for CUDA
www.caps-entreprise.com 292
[Diagram: C for CUDA is the entry point for developers who prefer high-level C, OpenCL for those who prefer a low-level API; both share the backend compiler and optimization technology and target PTX on the GPU]
In the remainder we will only see the C API o And lab sessions focus on the C API
The C++ API is available on the Khronos Website o http://www.khronos.org/registry/cl/
Extensions exist to o OpenGL
o Direct3D
www.caps-entreprise.com 294
Platform Model
Model consists of one or more interconnected devices
Computations occur within the Processing Elements of each device
www.caps-entreprise.com 295
Platform Version
3 different kinds of versions for an OpenCL device
The platform version
o Version of the OpenCL runtime linked with the application
The device version
o Version of the hardware
The language version
o Highest revision of the OpenCL standard that this device supports
www.caps-entreprise.com 296
Execution Model
Kernels are submitted by the host application to devices through command queues
Kernel instances, called Work-Items (WI), are identified by their point in the NDRange index space o This makes it possible to parallelize the execution of the kernels
But still 2 programming models are supported o Data-parallel
o Task parallel
So even if we have a single programming model, we should have two different programming approaches according to the paradigm we are considering
www.caps-entreprise.com 297
NDRange
NDRange is a N-Dimensional index space
o N is 1, 2 or 3
o NDRange is defined by an integer array of length N specifying the
extent of the index space on each dimension
www.caps-entreprise.com 298
Work-Groups & Work-Items
Work-Items are organized into Work-Groups (WG)
Each Work-group has a unique global ID in the NDRange
Each Work-item has
o A unique global ID in the NDRange
o A unique local ID in its work-group
www.caps-entreprise.com 299
Parallelism Grains
CPU cores can handle only a few tasks
o But more complex
• Hard control flows
• Memory cache
o They can be either CPU threads or processes
• CPU threads: OpenMP, Pthread
• CPU Processes: MPI, fork()…
GPU threads are extremely lightweight
o Very little creation overhead
o Simple and regular computations
o GPU needs 1000s of threads (w.i.) for full efficiency
www.caps-entreprise.com 300
Memory Model
Four distinct memory regions o Global Memory
o Local Memory
o Constant Memory
o Private Memory
Global and Constant memories are common to all WI o May be cached depending on the hardware capabilities
Local memory is shared by all WI of a WG
Private memory is private to each WI
www.caps-entreprise.com 301
Memory Architecture
www.caps-entreprise.com 302
Memory Transfer (1)
2 types of transfers o Blocking (“synchronous”)
o Non-Blocking (“asynchronous”)
In the clEnqueueReadBuffer / clEnqueueWriteBuffer functions, set the “blocking” attribute to: o CL_TRUE, to make a blocking transfer
o CL_FALSE, to make a non-blocking transfer
For a non-blocking transfer o Need to link an event to the transfer command
o The event will be used for producer-consumer relationship, and/or explicit waiting
www.caps-entreprise.com 303
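A fragment (not from the slides) contrasting a blocking and a non-blocking write; queue, buf, size and host_ptr are assumed to have been created earlier with the OpenCL API.
/* blocking write: returns only once the data has been transferred */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

/* non-blocking write: returns immediately, completion tracked by an event */
cl_event evt;
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_ptr, 0, NULL, &evt);
/* ... other host work ... */
clWaitForEvents(1, &evt);   /* explicit wait before reusing host_ptr */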
Memory Transfer (2)
Synchronizing ensures that data have been transferred to/from the device at this point
Example
Or can be used for the dependency flow in out-of-order queues o Use clEnqueueWaitForEvents() to synchronize in-order queues
By default, directives allocate fresh memory on the Xeon Phi
for each variable when entering the construct
By default, memory is deallocated when exiting the construct
Memory allocation is expensive: modifiers can change the
default behavior to reuse memory space
o If data has been allocated by a previous offload construct
o If data has been allocated by an attribute directive
o If data should be reused by another offload construct
Intel MIC Programming 337
Persistent Storage Example
Intel MIC Programming 338
int nb = 1000;
int *t = (int*) malloc(nb*sizeof(int));
void bar()
{
#pragma offload in(t[0:nb]:alloc_if(1))
foo(t, nb);
}
void foo(int * ptr, int size)
{
#pragma offload in(ptr[0:size]:alloc_if(0))
…
}
Allocation of t of size nb on the coprocessor
Reuse of t already on the coprocessor and free t at the end of offload section
Static and Dynamic Memory Example
Intel MIC Programming 339
__declspec(target(mic)) int array_host_mic[5000];
int array_host[5000];
void bar()
{
foo(&array_host[0], 5000);
foo(&array_host_mic[0], 5000);
}
void foo(int *t, int nb)
{
#pragma offload in(t[0:nb]:alloc_if(0))
…
}
Dynamic allocation of t on the coprocessor
Reuse of static allocation of array_host_mic on the coprocessor
Allocation on the processor and the coprocessor
Asynchronous Behavior
By default, Intel Offload directives cause the host thread to wait for the completion of the Xeon Phi instruction before going on to the next statement
Asynchronous behavior can be obtained by specifying a signal clause on the offload directive
An offload_wait directive should be used to ensure completion
Intel MIC Programming 340
[Diagram: synchronous vs asynchronous timelines of host (CPU) and coprocessor (MIC) activity]
Asynchronous Computations Example (C)
341
char sig;
int count = 1000;
__attribute__((target(mic))) void mic_compute();
do {
#pragma offload target(mic) signal(&sig)
{
mic_compute();
}
host_activity();
#pragma offload_wait(&sig)
count = count - 1;
} while(count > 0);
www.caps-entreprise.com
Asynchronous Computations Example (FORTRAN)
342
integer sig
integer count
count = 1000
!dir$ attributes offload:mic::mic_compute
do while (count .gt. 0)
!dir$ offload target(mic:0) signal(sig)
call mic_compute()
call host_activity()
!dir$ offload_wait target(mic:0) wait(sig)
count = count - 1
end do
www.caps-entreprise.com
Asynchronous Transfers
Use offload_transfer directives instead, for example in C
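The example itself is missing from the transcript; a minimal sketch of what such an asynchronous transfer might look like, reusing the mic_compute() and host_activity() routines of the earlier examples (the data array and its size are hypothetical):
float data[1000];
char sig;

/* start sending data to the coprocessor without blocking the host */
#pragma offload_transfer target(mic:0) in(data : length(1000)) signal(&sig)

host_activity();            /* overlap host work with the transfer */

/* offload a computation that waits for the transfer to complete */
#pragma offload target(mic:0) wait(&sig)
{
    mic_compute();
}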
Express data and computations to be executed on an accelerator
o Using marked code regions
Main OpenACC constructs
o Parallel and kernel regions
o Parallel loops
o Data regions
o Runtime API
[Diagram: CPU and hardware accelerator (HWA) linked by a PCI Express bus; data/stream/vector parallelism exploited by the HWA, e.g. via CUDA or OpenCL]
Intel MIC Programming 355
Execution Model
Among a bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators o Parallel regions o Kernels regions
Host is responsible for: o Allocating memory space on accelerator o Initiating data transfers o Launching computations o Waiting for completion o Deallocating memory space
Accelerators execute parallel regions: o Use work-sharing directives o Specify level of parallelization
356 Intel MIC Programming
OpenACC Execution Model
Host-controlled execution
Based on three parallelism levels
o Gangs – coarse grain
o Workers – fine grain
o Vectors – finest grain
357
[Diagram: a device contains gangs, each gang contains workers, each worker contains vectors]
Intel MIC Programming
Gangs, Workers, Vectors
In CAPS Compilers, gangs, workers and vectors correspond
to the following in an OpenCL grid
Beware: this implementation is compiler-dependent
358
numGroups(1) = 1
numGroups(0) = number of gangs
localSize(1) = number of workers
localSize(0) = number of vectors
Intel MIC Programming
Directive Syntax
C
#pragma acc directive-name [clause [, clause] …]
{
code to offload
}
Fortran
!$acc directive-name [clause [, clause] …]
code to offload
!$acc end directive-name
Intel MIC Programming 359
Parallel Construct
Starts parallel execution on the accelerator
Creates gangs and workers
The number of gangs and workers remains constant for the parallel region
One worker in each gang begins executing the code in the region
www.caps-entreprise.com 360
#pragma acc parallel […]
{
…
for(i=0; i < n; i++) {
for(j=0; j < n; j++) {
…
}
}
…
}
Code executed on the hardware
accelerator
Gangs, Workers, Vectors in Parallel Constructs
In parallel constructs, the number of gangs, workers and vectors is the same for the entire section
The clauses: o num_gangs o num_workers o vector_length
Enable to specify the number of gangs, workers and vectors in the corresponding parallel section
www.caps-entreprise.com 361
#pragma acc parallel, num_gangs(128) \
num_workers(256)
{
…
for(i=0; i < n; i++) {
for(j=0; j < m; j++) {
…
}
}
…
}
[Diagram: 128 gangs of 256 workers]
Loop Constructs
A loop directive applies to the loop that immediately follows the directive
The parallelism to use is described by one of the following clauses:
o Gang for coarse-grain parallelism
o Worker for middle-grain parallelism
o Vector for fine-grain parallelism
www.caps-entreprise.com 362
Loop Directive Example
With gang, worker or vector clauses, the iterations of the following loop are executed in parallel
Gang, worker or vector clauses enable to distribute the iterations between the available gangs, workers or vectors
www.caps-entreprise.com 363
#pragma acc parallel, num_gangs(128) \
num_workers(192) \
vector_length(32)
{
…
#pragma acc loop gang
for(i=0; i < n; i++) {
#pragma acc loop worker
for(j=0; j < m; j++) {
#pragma acc loop vector
for(k=0; k < l; k++) {
…
}
}
}
…
}
[Diagram: 128 gangs, 192 workers and 32 vectors; the i, j and k loop iterations are distributed over gangs, workers and vectors respectively]
Kernels Construct
Defines a region of code to be compiled into a sequence of accelerator kernels o Typically, each loop nest will be a distinct kernel
The number of gangs and workers can be different for each kernel
www.caps-entreprise.com 364
#pragma acc kernels […]
{
for(i=0; i < n; i++) {
…
}
…
for(j=0; j < n; j++) {
…
}
}
$!acc kernels […]
DO i=1,n
…
END DO
…
DO j=1,n
…
END DO
$!acc end kernels
1st Kernel
2nd Kernel
Gang, Worker, Vector in Kernels Constructs
The parallelism description is the same as in parallel sections
However, these clauses accept an argument to specify the number of gangs, workers or vectors to use
Every loop can have a different number of gangs, workers or vectors in the same kernels region
www.caps-entreprise.com 365
#pragma acc kernels
{
…
#pragma acc loop gang(128)
for(i=0; i < n; i++) {
…
}
…
#pragma acc loop gang(64)
for(j=0; j < m; j++) {
…
}
}
…
[Diagram: the first loop is distributed over 128 gangs, the second over 64 gangs]
Data Independency
In kernels sections, the clause independent on loop directive specifies that iterations of the loop are data-independent
The user does not have to think about gangs, workers or vector parameters
It allows the compiler to generate code to execute the iterations in parallel with no synchronization
www.caps-entreprise.com 366
A[0] = 0;
#pragma acc loop independent
for(i=1; i<n; i++)
{
A[i] = A[i]-1;
}
What is the problem using discrete accelerators?
PCIe transfers have huge latencies
In kernels and parallel regions, data are implicitly managed
o Data are automatically transferred to and from the device
o Implies possible useless communications
Avoiding transfers leads to a better performance
OpenACC offers a solution to control transfers
www.caps-entreprise.com 367
Device Memory Reuse
In this example: o A and B are allocated
and transferred for the first kernels region
o A and C are allocated and transferred for the second kernels region
How to reuse A between the two kernels regions? o And save transfer and
allocation time
www.caps-entreprise.com 368
float A[n];
#pragma acc kernels
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
Memory Allocations
Avoid data reallocation using the create clause o It declares variables, arrays or subarrays to be allocated in the device
memory
o No data specified in this clause will be copied between host and device
The scope of such a clause corresponds to a data region
Kernels and Parallel regions implicitly define data regions
The present clause declares data that are already present on the device
www.caps-entreprise.com 369
Create and Present Clause Example
www.caps-entreprise.com 370
float A[n];
#pragma acc data create(A)
{
#pragma acc kernels present(A)
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels present(A)
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
}
Allocation of A of size n on the device
Deallocation of A on the device
Reuse of A already allocated on the device
Reuse of A already allocated on the device
Data Storage: Mirroring
How is the data stored in a data region?
A data construct defines a section of code where data are mirrored between host and device
Mirroring duplicates a CPU memory block into the HWA memory o Users ensure consistency of copies via directives
www.caps-entreprise.com 371
[Diagram: a master copy in host memory is mirrored into the HWA memory; a CAPS runtime descriptor links the master copy and the mirror copy]
Arrays and Subarrays
In C and C++, arrays are specified with start and length
o For example, with an array of size n
In FORTRAN, arrays are specified with a list of range specifications
o For example, with an array a of size (n,m)
In any language, any array or subarray must be a contiguous block of memory
www.caps-entreprise.com 372
#pragma acc data create(a[0:n])
!$acc data create(a(0:n,0:m))
Transfers: Copyin Clause
Declares data that need only to be copied from the host to the device when entering the data section
o Performs input transfers only
It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region
www.caps-entreprise.com 373
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
#pragma acc kernels present(A[:n])
{
for(i=0; i < n; i++) {
C[i] = A[i] * alpha;
}
}
}
Transfers: Copyout Clause
Declares data that need only to be copied from the device to the host when exiting data section
o Performs output transfers only
It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region
www.caps-entreprise.com 374
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
#pragma acc kernels present(A[:n]) \
copyout(C[:n])
{
for(i=0; i < n; i++) {
C[i] = A[i] * alpha;
}
}
}
Transfers: Copy Clause
If we change the example, how to express that input and output transfers of C are required?
Use copy clause to: o Declare data that need to be
copied from the host to the device when entering the data section
o Assign values on the device that need to be copied back to the host when exiting the data section
o Allocate scalars, arrays and subarrays on the device memory for the duration of the data region
www.caps-entreprise.com 375
#pragma acc data create(A[:n])
{
#pragma acc kernels present(A[:n]) \
copyin(B[:n])
{
for(i=0; i < n; i++) {
A[i] = B[n – i];
}
}
…
init(C)
…
#pragma acc kernels present(A[:n]) \
copy(C[:n])
{
for(i=0; i < n; i++) {
C[i] += A[i] * alpha;
}
}
}
Present_or_create Clause
Combines two behaviors
Declares data that may be present
o If data is already present, use value in the device memory
o If not, allocate data on device when entering region and deallocate when exiting
May be shortened to pcreate
www.caps-entreprise.com 376
Present_or_copyin/copyout/copy Clauses
If data is already present, use value in the device memory
If not: o present_or_copyin/present_or_copyout/present_or_copy allocate
memory on device at region entry
o present_or_copyin/present_or_copy transfer data from the host at region entry
o present_or_copyout/present_or_copy transfer data from the device to the host at region exit
o present_or_copyin/present_or_copyout/present_or_copy deallocate memory at region exit
May be shortened to pcopyin, pcopyout and pcopy
www.caps-entreprise.com 377
Present_or_* Clauses Example
www.caps-entreprise.com 378
program main
…
!$acc data create(A(1:n))
call f1( n, A, B )
…
!$acc end data
…
call f1( n, A, C )
…
contains
subroutine f1( n, A, B )
…
!$acc kernels pcopyout(A(1:n)) \
copyin(B(1:n))
do i=1,n
A(i) = B(n – i)
end do
!$acc end kernels
end subroutine f1
…
end program main
Allocation of A of size n on the device
Reuse of A already allocated on the device
Allocation of B of size n on the device for the duration of the subroutine and input transfer of B
Deallocation of A on the device
Allocation of A and B of size n on the device for the duration of the subroutine; input transfer of B and output transfer of A
Present_or_* clauses are generally safer
Default Behavior
The CAPS compiler is able to detect the variables required on the device for the kernels and parallel constructs.
According to the specification, depending on the type of the variables, they follow these policies
o Arrays: present_or_copy behavior
o Scalars
• if not a live-in or live-out variable: private behavior
• copy behavior otherwise
www.caps-entreprise.com 379
OpenACC 2.0: New Features (1)
Atomic operations: o Different kinds of atomic
sections can be executed in parallel/kernels constructs
void init(int** array, int size){
   int *a = (int*) malloc(sizeof(int)*size);
   #pragma acc enter data create(a[0:size])
   *array = a;
}
int main(){
   int *array;
   …
   init(&array, size);
   …
   #pragma acc exit data delete(array)
   …
}
Data regions:
o Intel offload: -
o OpenMP 4.0: #pragma omp target data / !$omp target data
o OpenACC: #pragma acc data / !$acc data
Transfer directives:
o Intel offload: #pragma offload_transfer / !dir$ offload_transfer
o OpenMP 4.0: #pragma omp target update / !$omp target update
o OpenACC: #pragma acc update / !$acc update
Intel MIC Programming 383
CAPEX / OPEX
with GPU
Goals – Why Using GPUs
Performance
Energy saving
Cheaper machine
Preparing code to manycore
www.caps-entreprise.com 385
Is the Machine Cheaper?
You may want
o To run faster than your competitor
o To run faster than the streamed data arrive
o To run faster in order to use less energy to compute
o To run differently to save energy
OPEX or CAPEX?
o Capital Expenditures
• Machine cost and software migration, surface cost
o Operational Expenditures
• Energy consumption, hardware and software maintenance
www.caps-entreprise.com 386
CAPEX-OPEX Analysis for a Heterogeneous
System
Capital Expenses (CapEx) o System acquisition cost
o Software migration cost
o Software acquisition cost
o Teaching cost
o Real estate cost
Operational Expenses (OpEx) o Energy cost (system + cooling)
o Maintenance cost
For a given amount of compute work, the CapEx-OpEx analysis indicates the “real” value of a given system o For instance, if I add GPUs, do I save money?
o And how many should I add?
o Then should I use slower CPU?
www.caps-entreprise.com 387
Application Speedup and CapEx-OpEx
Adding GPUs/accelerators to the system o Increases system cost
o Increases base energy consumption (one GPU = x10 watt idle)
Exploiting the GPUs/accelerators o Decreases execution time, so potentially the energy consumption for a
given amount of work
o Reduces the number of nodes of the architecture • Threshold effect on the number of routers etc.
o Requires to migrate the code
Multiple views of the value of application speedup o Shorten time-to-market
• Threshold effect
o More work performed during the lifetime of the system
www.caps-entreprise.com 388
CapEx Hardware Parameters
Choice of the hardware configuration can be: o Fast CPU + Fast GPU (expensive node)
o Slow CPU + Fast GPU
o Fast CPU + Slow GPU
o Slow CPU + Slow GPU
o Fast CPU
o Slow CPU
Node performance impacts the number of nodes o More nodes means more network, with non-negligible cost and energy
consumption
o Fewer nodes may limit scalability issues if any
Application workload analysis is the only way to decide o Optimizing software can significantly increase performance and so reduce
needed hardware
o Code migration to GPU is on the critical path
www.caps-entreprise.com 389
Small systems: a few nodes (1-8), cost in the x10 k€ range
Large systems: many nodes (x100), cost in the x1 M€ range
CapEx: Code Migration Cost
Migration cost o Learning cost
o Software environment cost
o Porting cost
Migration cost is mostly hardware size independent o Not an issue for dedicated large systems
o Different if the machine aims at serving a large community
Main migration benefit is to highlight manycore parallelism o Not specific to one kind of device
o Implementation is specific
Vendor-specific implementation solution o Amortization period similar to the one of the hardware (3 years)
Agnostic parallelism expression o Using portable solution for multiple hardware generations (amortized on 10 years)
o Of course not that simple! Still requires some level of tuning
May be very useful for non scalable message passing code
www.caps-entreprise.com 390
Mastering the cost of migration has a significant impact on the total cost for small systems
Typical effort: manpower of a few man-months, cost in the x10 k€ range
Two Applications Examples
Application 1
• Field: Monte Carlo simulation for thermal radiation
• MPI code
• Migration cost: 1 man month
Application 2
• Field: astrophysics, hydrodynamic
• MPI code
• Requires 3 GPUs per node for having enough memory space
• Migration cost: 2 man month
www.caps-entreprise.com 391
Power Consumption Application 1
www.caps-entreprise.com 392
[Chart: CPU and GPU energy consumption relative to the baseline (0 = baseline energy consumption)]
Power usage effectiveness (PUE) = Total facility power / IT equipment power Current 1.9, best practice 1.3 Src: http://www.google.com/corporate/datacenter/efficiency-measurements.html