OpenMP Martin Kruliš Jiří Dokulil

OpenMP Martin Kruliš Jiří Dokulil. OpenMP OpenMP Architecture Review Board Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy,… .

Jan 02, 2016


OpenMP

Martin Kruliš

Jiří Dokulil


OpenMP

OpenMP Architecture Review Board
  Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy,…

http://www.openmp.org
  specifications (freely available)
  1.0 – C/C++ and FORTRAN versions
  2.0 – C/C++ and FORTRAN versions
  2.5 – combined C/C++ and FORTRAN
  3.0 – combined C/C++ and FORTRAN
  4.0 – combined C/C++ and FORTRAN (July 2013)


Basics

fork – join model
tailored mostly for large array operations
pragmas
  #pragma omp …
only a few constructs
programs should run without OpenMP
  possible but not enforced
  #ifdef _OPENMP
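
A minimal sketch of the "should run without OpenMP" rule, assuming a compiler that defines _OPENMP only when OpenMP is enabled. The helper names available_threads and make_iota are hypothetical and used only for illustration.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>   // runtime functions exist only in an OpenMP build
#endif

// Hypothetical helper: upper bound on the threads a parallel region may use.
int available_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;      // serial build: unknown pragmas are ignored by the compiler
#endif
}

// Hypothetical helper: fills v[i] = i, in parallel when OpenMP is enabled.
std::vector<int> make_iota(int n) {
    std::vector<int> v(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        v[i] = i;
    }
    return v;
}
```

The same source gives the same result with and without OpenMP; only the call to the runtime library has to be guarded.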


Simple example

#define N (1024*1024)
int* data=new int[N];
for(int i=0; i<N; ++i)
{
  data[i]=i;
}


Simple example – cont.

#define N (1024*1024)
int* data=new int[N];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
  data[i]=i;
}


Another example

int sum;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
  sum+=data[i];
}

WRONG: sum is shared, so the threads race on sum+=data[i] (and sum is never initialized)


Variable scope

shared
  one instance for all threads
private
  one instance for each thread
reduction
  special variant for reduction operations
valid within lexical extent
  no effect in called functions


Variable scope – private

default for loop control variable
  only for the parallelized loop
  should (probably always) be made private
  all loops in Fortran
all variables declared within the parallelized block
all non-static variables in called functions
  allocated on stack – private for each thread
uninitialized values
  at start of the block and after the block
  except for classes
    default constructor (must be accessible)
    may not be shared among the threads


Variable scope – private

int j;
#pragma omp parallel for private(j)
for(int i=0; i<N/2; ++i)
{
  j=i*2;
  data[j]=i;
  data[j+1]=i;
}


Variable scope – reduction

performing e.g. sum of an array
cannot use only private variable
shared requires explicit synchronization
combination is possible and (relatively) efficient
  but unnecessarily complex
each thread works on a private copy
  initialized to a default value (0 for +, 1 for *,…)
final results are joined and available to the master thread
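
The "combination" of private and shared variables mentioned above can be sketched roughly as follows. This is only an illustration of the hand-written variant that the reduction clause replaces; the function name sum_array_manual is hypothetical.

```cpp
#include <vector>

// Hypothetical helper: sums an array with a private partial sum per thread
// and one synchronized update of the shared total per thread.
long long sum_array_manual(const std::vector<int>& data) {
    long long sum = 0;                       // shared by all threads
    const int n = static_cast<int>(data.size());
    #pragma omp parallel
    {
        long long partial = 0;               // declared in the block: private
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            partial += data[i];              // no synchronization needed here
        }
        #pragma omp critical
        sum += partial;                      // one guarded update per thread
    }
    return sum;
}
```

It works, but it takes three constructs to express what a single reduction clause states directly.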


Variable scope – reduction

long long sum=0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<N; ++i)
{
  sum+=data[i];
}
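
The fragment above can be wrapped into a self-contained function, sketched here under the assumption that the data lives in a std::vector. The name sum_array is hypothetical.

```cpp
#include <vector>

// Hypothetical helper: parallel sum using the reduction clause.
long long sum_array(const std::vector<int>& data) {
    long long sum = 0;
    const int n = static_cast<int>(data.size());
    // Each thread accumulates into its own copy of sum (initialized to 0);
    // the copies are added to the original variable at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```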


Variable scope – firstprivate and lastprivate

values of private variables at the start of the block and after end of the block are undefined
firstprivate
  all values are initialized to the value of the master thread
lastprivate
  variable after the parallelized block is set to the value of the last iteration (last in the serial version)
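
A small sketch of both clauses on one loop, assuming a non-empty input so that the last iteration really assigns the lastprivate variable. The names ScaleResult and scale_and_track are hypothetical.

```cpp
#include <vector>

struct ScaleResult {
    int last_index;        // value written by the sequentially last iteration
    long long scaled_sum;  // sum of factor * data[i]
};

// Hypothetical helper; assumes data is not empty.
ScaleResult scale_and_track(const std::vector<int>& data, int factor) {
    int last = -1;
    long long total = 0;
    const int n = static_cast<int>(data.size());
    // firstprivate: every thread's copy of factor starts with the caller's value.
    // lastprivate: after the loop, last holds the value from iteration n-1.
    #pragma omp parallel for firstprivate(factor) lastprivate(last) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        last = i;
        total += static_cast<long long>(factor) * data[i];
    }
    ScaleResult r;
    r.last_index = last;
    r.scaled_sum = total;
    return r;
}
```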


parallel

#pragma omp parallel
  launches threads and executes block in parallel
modifiers
  if (scalar expression)
  variable scope modifiers (including reduction)
  num_threads
especially useful in conjunction with omp_get_thread_num
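
As a rough illustration of a bare parallel block combined with omp_get_thread_num, each thread below writes its own number into its own slot. The function name collect_thread_ids is hypothetical, and the sketch assumes requested is at least 1; the runtime may provide fewer threads than requested.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns {0, 1, ..., team-1}, where team is the number
// of threads that actually executed the block (assumes requested >= 1).
std::vector<int> collect_thread_ids(int requested) {
    std::vector<int> ids(requested, -1);
    int team = 1;                              // shared
    #pragma omp parallel num_threads(requested)
    {
#ifdef _OPENMP
        int me = omp_get_thread_num();         // 0 = master
        #pragma omp single
        team = omp_get_num_threads();          // written by one thread only
#else
        int me = 0;                            // serial build: one "thread"
#endif
        ids[me] = me;                          // each thread has its own slot
    }
    ids.resize(team);
    return ids;
}
```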


Loop-level parallelism

#pragma omp parallel for
  launch threads and execute loop in parallel
  can be nested
#pragma omp for
  parallel loop within another parallel block
  no (direct) nesting
“simple” for expression
implicit barrier at the end
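
A sketch of two #pragma omp for loops inside one parallel block, where the second loop relies on the implicit barrier after the first. The function name reversed_iota is hypothetical.

```cpp
#include <vector>

// Hypothetical helper: returns {n-1, ..., 1, 0}.
std::vector<int> reversed_iota(int n) {
    std::vector<int> a(n), b(n);
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            a[i] = i;
        }
        // Implicit barrier: a is completely filled before any thread goes on.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            b[i] = a[n - 1 - i];   // reads elements written by other threads
        }
    }
    return b;
}
```

The threads are launched once and reused for both loops, which is the main reason to prefer this form over two separate parallel for loops.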


Loop-level parallelism – modifiers 1

variable scope modifiers
nowait – removes barrier
  cannot be used with #pragma omp parallel for
ordered
  loop (or called function) may contain block marked #pragma omp ordered
  such block is executed in the same order as in serial execution of the loop
  at most one such block may exist
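
A minimal sketch of the ordered modifier: the computation may run in parallel, while the marked block runs in serial iteration order. The function name squares_in_order is hypothetical.

```cpp
#include <vector>

// Hypothetical helper: returns {0, 1, 4, 9, ...} in iteration order.
std::vector<int> squares_in_order(int n) {
    std::vector<int> out;
    #pragma omp parallel for ordered
    for (int i = 0; i < n; ++i) {
        int sq = i * i;            // may be computed concurrently
        #pragma omp ordered
        out.push_back(sq);         // executed one iteration at a time, in order
    }
    return out;
}
```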


Loop-level parallelism – modifiers 2

schedule
  schedule(static[, chunk_size])
    round robin
    no chunk size → equal size to all threads
  schedule(dynamic[, chunk_size])
    threads request chunks
    default chunk size is 1
  schedule(guided[, chunk_size])
    like dynamic with size of chunks proportional to the amount of remaining work, but at least chunk_size
    default chunk size is 1
  auto
    selected by implementation
  runtime
    use value stored in variable run-sched-var (set e.g. by OMP_SCHEDULE)
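
A sketch of where a non-default schedule may help: the cost of an iteration grows with i, so fixed equal-sized blocks would be unbalanced. The function name count_primes and the chunk size 64 are arbitrary choices made for illustration.

```cpp
// Hypothetical helper: counts primes in [2, limit) by trial division.
int count_primes(int limit) {
    int count = 0;
    // Threads request chunks of 64 iterations as they become free.
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count)
    for (int i = 2; i < limit; ++i) {
        bool prime = true;
        for (int d = 2; d * d <= i; ++d) {
            if (i % d == 0) { prime = false; break; }
        }
        if (prime) ++count;
    }
    return count;
}
```

Replacing the clause with schedule(runtime) would let the schedule be selected without recompiling.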


Parallel sections

#pragma omp sections
  #pragma omp section
  #pragma omp section
  …
several blocks of code that should be evaluated in parallel
modifiers
  private, firstprivate, lastprivate, reduction
  nowait
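
A sketch of two independent blocks run as sections, using the combined parallel sections form. The names Range and min_max are hypothetical, and the input is assumed to be non-empty.

```cpp
#include <cstddef>
#include <vector>

struct Range {
    int min;
    int max;
};

// Hypothetical helper; assumes v is not empty.
Range min_max(const std::vector<int>& v) {
    int lo = v[0];
    int hi = v[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        {   // only this section writes lo
            for (std::size_t i = 1; i < v.size(); ++i)
                if (v[i] < lo) lo = v[i];
        }
        #pragma omp section
        {   // only this section writes hi
            for (std::size_t i = 1; i < v.size(); ++i)
                if (v[i] > hi) hi = v[i];
        }
    }
    Range r;
    r.min = lo;
    r.max = hi;
    return r;
}
```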


Single

#pragma omp single
  code is executed by only one thread of the team
modifiers
  private, firstprivate
  nowait
    when not used, there is a barrier at the end of the block
  copyprivate
    final value of the variable is distributed to all threads in the team after the block is executed
    incompatible with nowait
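
A rough sketch of single with copyprivate: one thread sets a private variable and the value is then visible in every thread's copy. The function name count_threads_missing_value is hypothetical; it counts the threads that did not receive the value, which is expected to be zero.

```cpp
// Hypothetical helper: returns how many threads did NOT see the broadcast value.
int count_threads_missing_value(int value) {
    int mismatches = 0;
    #pragma omp parallel reduction(+:mismatches)
    {
        int local = value - 1;               // private, deliberately different
        #pragma omp single copyprivate(local)
        {
            local = value;                   // executed by one thread only
        }
        // After the single block, local holds value in every thread.
        if (local != value) ++mismatches;
    }
    return mismatches;
}
```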


Workshare

Fortran only…

SUBROUTINE A11_1(AA, BB, CC, DD, EE, FF, N)
INTEGER N
REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N), EE(N,N), FF(N,N)
!$OMP PARALLEL
!$OMP WORKSHARE
AA = BB
CC = DD
EE = FF
!$OMP END WORKSHARE
!$OMP END PARALLEL
END SUBROUTINE A11_1


Master

#pragma omp master
executed only by the master thread


Critical section

#pragma omp critical [(name)]
the well-known critical section
at most one thread can execute critical section with certain name
multiple pragmas with same name form one section
names have external linkage
all unnamed pragmas form one section
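
A sketch of a named critical section guarding a shared container. The function name collect_even and the section name evens_update are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: returns the even numbers in [0, n), sorted.
std::vector<int> collect_even(int n) {
    std::vector<int> evens;                  // shared by all threads
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            // Only one thread at a time may modify the vector.
            #pragma omp critical(evens_update)
            evens.push_back(i);
        }
    }
    // Insertion order depends on thread timing, so sort before returning.
    std::sort(evens.begin(), evens.end());
    return evens;
}
```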


Barrier

#pragma omp barrier
no associated block of code
some restrictions on placement, e.g. it may not be the immediate substatement of an if (the following is invalid)

if (a<10)
#pragma omp barrier
{ do_something(); }


Atomic

#pragma omp atomic
followed by expression in the form
  x op= expr
    op is +, *, -, /, &, ^, |, <<, or >>
    expr must not reference x
  x++
  ++x
  x--
  --x
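
A sketch of atomic used for a histogram, where different threads may increment the same counter. The function name histogram is hypothetical, and the values are assumed to be non-negative.

```cpp
#include <vector>

// Hypothetical helper: counts values by (value % buckets).
// Assumes buckets > 0 and all values are non-negative.
std::vector<int> histogram(const std::vector<int>& values, int buckets) {
    std::vector<int> counts(buckets, 0);
    int* c = counts.data();                  // plain array view for the update
    const int n = static_cast<int>(values.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int b = values[i] % buckets;
        #pragma omp atomic
        c[b]++;                              // x++ form; the update is atomic
    }
    return counts;
}
```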


Flush

#pragma omp flush (variable list)
make thread’s view of variables consistent with the main memory
variable list may be omitted, flushes all
similar to volatile in C/C++
influences memory operation reordering that can be performed by the compiler
  cannot move read/write of the flushed variable to the other “side” of the flush operation
all values of flushed variables are saved to the memory before flush finishes
first read of flushed variable after flush is performed from the main memory
same placement restrictions as barrier


threadprivate

#pragma omp threadprivate(list)
makes global variable private for each thread
complex restrictions
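
A sketch of a threadprivate global counter that is updated from a called function without synchronization. The names call_counter, count_call and total_calls are hypothetical.

```cpp
// One instance of the counter per thread.
static int call_counter = 0;
#pragma omp threadprivate(call_counter)

// Hypothetical helper: touches only the calling thread's copy.
void count_call() {
    ++call_counter;
}

// Hypothetical helper: returns the number of calls made by all threads.
int total_calls(int n) {
    int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        call_counter = 0;                    // each thread resets its own copy
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            count_call();
        }
        total += call_counter;               // per-thread counts are summed
    }
    return total;
}
```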


copyin, copyprivate

copyin(list)
  copy value of threadprivate variable from master thread to other members of the team
  used as modifier in #pragma omp parallel
  values copied at the start of the block
copyprivate(list)
  copy value from one thread’s private or threadprivate variable to all other members of the team
  used as modifier in #pragma omp single
  values copied at the end of the block


Task

new in OpenMP 3.0
#pragma omp task
piece of code to be executed in parallel
  immediately or later
    if clause forces immediate execution when false
  tied or untied (to a thread)
  can be suspended, e.g. by launching nested task
modifiers
  default, private, firstprivate, shared
  untied
  if


Task scheduling points

after explicit generation of a task
after the last instruction of a task region
taskwait region
in implicit and explicit barriers
(almost) anywhere in untied tasks


Taskwait

#pragma omp taskwait
wait for completion of all child tasks generated since the start of the current task
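
Tasks and taskwait are commonly illustrated with a recursive computation; the sketch below uses Fibonacci numbers purely as a toy workload. The function names fib_task and fib, and the cut-off value 20 in the if clause, are hypothetical choices.

```cpp
// Hypothetical helper: recursive step, must be called from inside a parallel region
// (it also works serially, where the pragmas are ignored).
long long fib_task(int n) {
    if (n < 2) return n;
    long long a = 0, b = 0;
    // Small subproblems are executed immediately (if clause is false).
    #pragma omp task shared(a) if(n > 20)
    a = fib_task(n - 1);
    #pragma omp task shared(b) if(n > 20)
    b = fib_task(n - 2);
    // Wait for both child tasks before using their results.
    #pragma omp taskwait
    return a + b;
}

// Hypothetical helper: one thread creates the root task, the team executes tasks.
long long fib(int n) {
    long long result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```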


Functions

omp_set_num_threads, omp_get_max_threads
  number of threads used for parallel regions without num_threads clause
omp_get_num_threads
  number of threads in the team
omp_get_thread_num
  number of calling thread within the team
  0 = master
omp_get_num_procs
  number of processors available to the program


Functions – cont.

omp_in_parallel
  checks if the caller is in active parallel region
  active region is region without if or if the condition was true
omp_set_dynamic, omp_get_dynamic
  dynamic adjustment of thread number on/off
omp_set_nested, omp_get_nested
  nested parallelism on/off


Locks

plain and nested
  omp_lock_t, omp_nest_lock_t
omp_init_lock, omp_init_nest_lock
  initializes the lock
omp_destroy_lock, omp_destroy_nest_lock
  uninitializes
  must be unlocked
omp_set_lock, omp_set_nest_lock
  must be initialized
  locks the lock
  blocks until the lock is acquired
omp_unset_lock, omp_unset_nest_lock
  must be locked and owned by the calling thread
  unlocks
omp_test_lock, omp_test_nest_lock
  like set but does not block
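
The lock life cycle (init, set, unset, destroy) can be sketched with a small wrapper class. The names Lock and locked_sum are hypothetical, and the empty fallback for builds without OpenMP is an assumption made so that the sketch compiles either way.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical wrapper around a plain (non-nested) OpenMP lock.
class Lock {
public:
#ifdef _OPENMP
    Lock()       { omp_init_lock(&lock_); }     // must precede any set
    ~Lock()      { omp_destroy_lock(&lock_); }  // lock must be unlocked here
    void set()   { omp_set_lock(&lock_); }      // blocks until acquired
    void unset() { omp_unset_lock(&lock_); }    // caller must own the lock
private:
    omp_lock_t lock_;
#else
    void set()   {}                             // serial build: nothing to do
    void unset() {}
#endif
};

// Hypothetical helper: sums 1..n with every update guarded by the lock.
long long locked_sum(int n) {
    long long sum = 0;
    Lock lock;
    #pragma omp parallel for
    for (int i = 1; i <= n; ++i) {
        lock.set();
        sum += i;
        lock.unset();
    }
    return sum;
}
```

For a simple sum a reduction is the better tool; the lock is used here only to show the calls in the required order.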


Timing routines

double omp_get_wtime()
  wall clock time in seconds
  since “time in the past”
  may not be consistent between threads
double omp_get_wtick()
  number of seconds between successive clock ticks of the timer used by omp_get_wtime
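
A sketch of timing a loop with omp_get_wtime, taking both readings on the same thread. The names now_seconds and time_sum are hypothetical, and the std::chrono fallback for builds without OpenMP is an assumption of this sketch.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
#include <chrono>
#endif

// Hypothetical helper: wall clock time in seconds since some point in the past.
double now_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    typedef std::chrono::steady_clock clock_type;
    return std::chrono::duration<double>(
               clock_type::now().time_since_epoch()).count();
#endif
}

// Hypothetical helper: sums 0..n-1 into result and returns the elapsed seconds.
double time_sum(int n, long long& result) {
    const double start = now_seconds();      // read on the calling thread
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += i;
    }
    result = sum;
    return now_seconds() - start;            // read on the same thread
}
```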


Environment variables

OMP_NUM_THREADS
  number of threads launched in parallel regions
  omp_set_num_threads, omp_get_num_threads
OMP_SCHEDULE
  used in loops with schedule(runtime)
  "guided,4", "dynamic"
OMP_DYNAMIC
  set if implementation may change number of threads
  omp_set_dynamic, omp_get_dynamic
  true or false
OMP_NESTED
  controls nested parallelism
  true or false
  default is false


Nesting of regions

some limitations
“close nesting”
  no #pragma omp parallel nested between the two regions
“work-sharing region”
  for, sections, single, (workshare)
work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region
barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region
master region may not be closely nested inside a work-sharing region
ordered region may not be closely nested inside a critical region
ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause
critical region may not be nested (closely or otherwise) inside a critical region with the same name
  note that this restriction is not sufficient to prevent deadlock


OpenMP 4.0

The newest version (July 2013)
  No implementations yet
Thread affinity
  proc_bind(master | close | spread)
SIMD support
  Explicit loop vectorization (by SSE, AVX, …)
User defined reduction
  #pragma omp declare reduction (identifier : typelist : combiner-expr) [initializer-clause]
Atomic operations with sequential consistency (seq_cst)
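
A sketch of a user defined reduction that tracks minimum and maximum in one pass, assuming a compiler with OpenMP 4.0 support (without OpenMP the pragmas are ignored and the loop runs serially). The names MinMax, merge, minmax and find_min_max are hypothetical.

```cpp
#include <climits>
#include <vector>

struct MinMax {
    int min;
    int max;
    MinMax() : min(INT_MAX), max(INT_MIN) {}   // identity for the reduction
};

// Hypothetical combiner: folds b into a and returns the result.
MinMax merge(MinMax a, const MinMax& b) {
    if (b.min < a.min) a.min = b.min;
    if (b.max > a.max) a.max = b.max;
    return a;
}

// identifier : typelist : combiner-expr, plus an initializer-clause
#pragma omp declare reduction(minmax : MinMax : omp_out = merge(omp_out, omp_in)) \
    initializer(omp_priv = MinMax())

// Hypothetical helper: returns the identity MinMax for an empty input.
MinMax find_min_max(const std::vector<int>& v) {
    MinMax mm;
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for reduction(minmax : mm)
    for (int i = 0; i < n; ++i) {
        if (v[i] < mm.min) mm.min = v[i];
        if (v[i] > mm.max) mm.max = v[i];
    }
    return mm;
}
```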


OpenMP 4.0

Accelerator support
  Xeon Phi cards, GPUs, …
#pragma omp target – offloads computation
  device(idx)
  map(variable map)
#pragma omp target update