Page 1:

OpenMP – an overview
Seminar Non-uniform Memory Access (NUMA), WS2014/15

Matthias Springer

Hasso Plattner Institute, Operating Systems and Middleware Group

January 14, 2015

Page 2:

Overview

What is OpenMP?

Comparison of Multiprocessing Libraries

OpenMP API

ForestGOMP: NUMA with OpenMP

Matrix Multiply with OpenMP and MPI

Page 3:

What is OpenMP? [5]

• OpenMP = Open Multi-Processing

• API for multi-platform shared memory multiprocessing

• Set of compiler directives, library routines, and environment variables

• Programming languages: C, C++, Fortran

• Operating Systems: e.g. Solaris, AIX, Linux, Mac OS X, Windows

• OpenMP Architecture Review Board: group of hardware and software vendors

Page 4:

Shared Memory Multiprocessing [4]

[Figure: four CPUs connected to a single shared memory via an interconnect]

Page 5:

Overview

What is OpenMP?

Comparison of Multiprocessing Libraries

OpenMP API

ForestGOMP: NUMA with OpenMP

Matrix Multiply with OpenMP and MPI

Page 6:

OpenMP vs. pthreads vs. MPI

• pthreads: low-level programming
  − Programmer specifies the behavior of each thread
  − Links against libpthread: no compiler support required

• OpenMP: higher-level programming
  − Programmer specifies that a piece of code should be executed in parallel
  − Requires compiler support (e.g., preprocessor)

• MPI: Message Passing Interface
  − Communication based on sending and receiving messages
  − No shared memory, designed for distributed systems

Page 7:

Overview

What is OpenMP?

Comparison of Multiprocessing Libraries

OpenMP API
  parallel Directive
  Compiling OpenMP programs
  Notation
  Scope of Variables
  parallel for Directive
  reduction Clause
  Synchronization

ForestGOMP: NUMA with OpenMP

Matrix Multiply with OpenMP and MPI

Page 8:

Running a Function on Multiple Threads [4]

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void hello_world(void)
{
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    printf("Thread %d of %d says Hello!\n", my_rank, thread_count);
}

int main(int argc, char* argv[])
{
    int thread_count = strtol(argv[1], NULL, 10);

    #pragma omp parallel num_threads(thread_count)
    hello_world();

    return 0;
}

Page 9:

Compiling OpenMP Programs

• Compile: gcc -fopenmp -o hello hello.c

• Run: ./hello 3

Thread 1 of 3 says Hello!

Thread 0 of 3 says Hello!

Thread 2 of 3 says Hello!

Page 10:

OpenMP Compilation Process

• Annotated Source Code → OpenMP Compiler → Parallel Object Code

• Compiler can also generate sequential object code

• Compiler Front End: parse OpenMP directives, correctness checks

• Compiler Back End: replace constructs by calls to the runtime library, change the structure of the program (e.g., put the parallel section into its own function so that it can be forked; a rough illustration follows below)

• See https://iwomp.zih.tu-dresden.de/downloads/OpenMP-compilation.pdf for more details
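As a rough, hand-written illustration of what such outlining amounts to (not actual compiler output — a real OpenMP compiler emits calls into its own runtime library, e.g. libgomp, rather than raw pthreads):

#include <pthread.h>
#include <stdio.h>

/* Body of the former parallel region, moved ("outlined") into its own function. */
static void *outlined_region(void *arg)
{
    int rank = *(int *)arg;
    printf("Thread %d says Hello!\n", rank);
    return NULL;
}

int main(void)                            /* compile with: gcc -pthread outline.c */
{
    enum { NTHREADS = 4 };
    pthread_t threads[NTHREADS];
    int ranks[NTHREADS];

    for (int i = 0; i < NTHREADS; ++i) {  /* fork the team */
        ranks[i] = i;
        pthread_create(&threads[i], NULL, outlined_region, &ranks[i]);
    }
    for (int i = 0; i < NTHREADS; ++i)    /* join: the implicit barrier at the end of the region */
        pthread_join(threads[i], NULL);
    return 0;
}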

Page 11:

Notation (Syntax)

#include <stdio.h>

#include <stdlib.h>

#include <omp.h>

void hello_world(void)

{

int my_rank = omp_get_thread_num();

int thread_count = omp_get_num_threads();

printf("Thread %d of %d says Hello!\n", my_rank , thread_count);

}

int main(int argc, char* argv[])

{

int thread_count = strtol(argv[1], NULL, 10);

#pragma omp parallel num_threads(thread_count)

hello_world();

return 0;

}

Page 12:

Notation (Syntax) [1]

• Directive: pragma statement
  e.g. #pragma omp parallel [ clause [ [,] clause ] ... ]
       structured-block

• Runtime Library Routine: function defined in omp.h
  e.g. omp_get_thread_num()

• Structured Block: single statement or compound statement with a single entry at the top and a single exit at the bottom

• Clause: modifies a directive's behavior
  e.g. num_threads( integer-expression )

• Environment Variable: defined outside the program
  e.g. OMP_NUM_THREADS

Page 13:

Notation (OpenMP)

• Master Thread: original thread

• Slave Threads: all additional threads

• Team: master thread + slave threads

Page 14:

Scope of Variables

• shared scope: the variable can be accessed by all threads in the team;
  applies to variables declared outside a structured block following a parallel directive

• private scope: the variable can be accessed by a single thread only;
  applies to variables declared inside a structured block following a parallel directive

int foo = 42;

int bar = 40;

#pragma omp parallel private(foo) shared(bar) default(none)

{

int x;

/* foo and x are private */

/* bar is shared */

}

Page 15:

Handout only: Scope of Variables

• Private variables are uninitialized.

• Initialize private variables with the values from the master thread: firstprivate (see the sketch below).

• default(none) requires the programmer to specify the visibility of all variables explicitly (good practice).
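A minimal sketch (not from the slides) of firstprivate: every thread starts with the master thread's value of foo and modifies only its own copy.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int foo = 42;

    #pragma omp parallel firstprivate(foo) default(none)
    {
        foo += omp_get_thread_num();   /* modifies the thread-local copy only */
        printf("Thread %d sees foo = %d\n", omp_get_thread_num(), foo);
    }

    printf("After the parallel region, foo is still %d\n", foo);
    return 0;
}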

Page 16:

parallel for Directive

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 3; ++i)
        {
            printf("Thread %d of %d says Hello!\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
    }
}

Page 17:

Handout only: parallel for Directive

• Runs loop iterations in parallel.

• Shortcut: #pragma omp parallel for

• Loop iterations must be data-independent.

• Loop must be in canonical form.
  − E.g., the test condition is <, <=, >, etc., and the update operation is an increment.
  − OpenMP must be able to determine the number of iterations before the loop is executed.

Page 18:

parallel for Directive

The mapping of iterations to threads is controlled by the schedule clause (a sketch follows after this list).

• schedule(static [, chunksize]): blocks of chunksize iterations are statically assigned to the threads

• schedule(dynamic [, chunksize]): a thread reserves chunksize iterations from a queue

• schedule(guided [, chunksize]): same as dynamic, but the chunk size starts big and gets smaller and smaller until it reaches chunksize

• schedule(runtime): scheduling behavior is determined by an environment variable (OMP_SCHEDULE)
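A minimal sketch (not from the slides) of the dynamic schedule: threads grab chunks of 4 iterations from a shared queue, so a slow iteration does not stall a statically assigned block.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < 16; ++i)
    {
        printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}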

Page 19:

Example: Sum of List of Integers

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int sum = 0;
    int A[100];
    int i;
    for (i = 0; i < 100; ++i) A[i] = i;

    #pragma omp parallel for
    for (i = 0; i < 100; ++i)
    {
        sum += A[i];   /* data race: all threads update sum without synchronization */
    }

    printf("Sum: %d\n", sum);
}

Page 20:

reduction Clause

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int sum = 0;
    int A[100];
    int i;
    for (i = 0; i < 100; ++i) A[i] = i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 100; ++i)
    {
        sum += A[i];
    }

    printf("Sum: %d\n", sum);
}

Page 21:

reduction Clause

• The compiler creates a local private copy of the reduction variable for each thread and combines the copies at the end

• +, initial value 0

• −, initial value 0

• ∗, initial value 1

• Also supported in C/C++: &, |, ^, &&, ||

Page 22:

Critical Sections

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int sum = 0;
    int A[100];
    int i;
    for (i = 0; i < 100; ++i) A[i] = i;

    #pragma omp parallel for
    for (i = 0; i < 100; ++i)
    {
        #pragma omp critical
        sum += A[i];   /* only one thread at a time executes this statement */
    }

    printf("Sum: %d\n", sum);
}

Page 23:

Atomic Statements

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int sum = 0;
    int A[100];
    int i;
    for (i = 0; i < 100; ++i) A[i] = i;

    #pragma omp parallel for
    for (i = 0; i < 100; ++i)
    {
        #pragma omp atomic
        sum += A[i];   /* the update of sum is performed atomically */
    }

    printf("Sum: %d\n", sum);
}

Page 24:

Handout only: Atomic Statements

• Behavior is implementation-specific, but might use special CPU instructions (e.g., atomic fetch-and-add).

• Supports x binop= y, x++, ++x, x--, --x.

Page 25:

More Synchronization Constructs

• #pragma omp barrier: wait until all threads arrive

• #pragma omp for nowait: removes the implicit barrier after the for loop (also exists for other directives)

• #pragma omp master: only executed by master thread

• #pragma omp single: only executed by one thread

• Sections: define a number of blocks; each block is executed by one of the threads

• Locks: omp_init_lock(), omp_set_lock(), omp_unset_lock(), ... (see the sketch below)
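A minimal sketch (not from the slides) using the lock routines listed above; omp_destroy_lock() releases the lock's resources at the end.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);

    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);      /* enter the critical region */
        ++counter;
        omp_unset_lock(&lock);    /* leave the critical region */
    }

    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);
    return 0;
}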

Page 26:

Overview

What is OpenMP?

Comparison of Multiprocessing Libraries

OpenMP API

ForestGOMP: NUMA with OpenMP

Matrix Multiply with OpenMP and MPI

Page 27:

Implementation: ForestGOMP [2]

• Objectives and Motivation
  − Keep buffers and the threads operating on them on the same NUMA node (reducing contention)
  − Processor level: group threads that share data intensively (improving cache usage)

• Triggers for scheduling
  − Allocation/deallocation of resources
  − A processor becomes idle
  − Change of hardware counters (e.g., cache miss rate, remote access rate)

Page 28:

BubbleSched: Hierarchical Bubble-Based Thread Scheduler

[Figure: runqueue hierarchy with one machine runqueue, 4 NUMA-node runqueues, and 4 core runqueues per node]

• Runqueues exist at the different hierarchical levels (machine, NUMA node, core)

• Bubble: group of threads that share data or synchronize heavily

• BubbleSched is responsible for scheduling the threads onto these runqueues

Page 29:

Mami: NUMA-aware Memory Manager

• API for memory allocation

• Can migrate memory to a different NUMA node

• Supports the Next Touch policy: migrate data to the NUMA node of the accessing thread

Page 30:

Handout only: Memory Relocation with Next Touch [3]

• Buffers are marked as migrate-on-next-touch when a thread migration is expected

• A buffer is relocated when a thread touches it and it is not located on the thread's local node

• Implemented in kernel mode

Page 31:

ForestGOMP: Mami-aware OpenMP Runtime

• Mami attaches memory hints: e.g., which regions are accessed frequently by a certain thread

• Initial distribution: put a thread and its corresponding memory on the same NUMA node (local accesses)

• Handling idleness: steal threads from a core on the local NUMA node first, then from a different NUMA node (this also migrates memory and prefers threads with less attached memory)

• Two levels of distribution: memory-aware, then cache-aware

Page 32:

References

[1] OpenMP Architecture Review Board. OpenMP 3.1 API C/C++ Syntax Quick Reference Card, 2011.

[2] François Broquedis, Nathalie Furmento, Brice Goglin, Pierre-André Wacrenier, and Raymond Namyst. ForestGOMP: An efficient OpenMP environment for NUMA architectures. International Journal of Parallel Programming, 38(5-6):418–439, 2010.

[3] Brice Goglin, Nathalie Furmento, et al. Memory migration on next-touch. In Linux Symposium, 2009.

[4] P. S. Pacheco. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.

[5] Wikipedia. OpenMP — Wikipedia, The Free Encyclopedia, 2014. [Online; accessed 14-December-2014].

Page 33:

Matrix Multiply with OpenMP and MPI
Seminar Non-uniform Memory Access (NUMA), WS2014/15

Carolin Fiedler, Matthias Springer

Hasso Plattner Institute, Operating Systems and Middleware Group

January 14, 2015

Page 34:

Overview

What is OpenMP?

Comparison of Multiprocessing Libraries

OpenMP API

ForestGOMP: NUMA with OpenMP

Matrix Multiply with OpenMP and MPI

Page 35:

Idea

• Distribute work across nodes with MPI (no memory sharing): 1 worker per node

• Parallelize work within a single node with OpenMP (shared memory): 1 thread per core

Page 36:

Message Passing

[Figure: matrix A split into horizontal stripes, matrix B replicated in full, result matrix C assembled from one stripe per worker]

• Replicate B on all MPI workers

• For n MPI workers, divide A into n stripes; every worker gets one stripe

• The result matrix C contains one stripe per worker

• Message passing (remote memory access) only during the send (distribute) and collect phases

• Local memory accesses only during the multiply-and-add phase (see the MPI sketch below)
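A rough sketch of this distribution scheme (illustrative only, not the demo code; it assumes the matrix dimension N is divisible by the number of workers and contiguous row-major storage):

#include <mpi.h>
#include <stdlib.h>

#define N 2048

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                                   /* stripe height per worker */
    double *B        = malloc((size_t)N * N * sizeof(double));
    double *A_stripe = malloc((size_t)rows * N * sizeof(double));
    double *C_stripe = calloc((size_t)rows * N, sizeof(double));
    double *A = NULL, *C = NULL;

    if (rank == 0) {                                       /* root owns the full matrices */
        A = malloc((size_t)N * N * sizeof(double));
        C = malloc((size_t)N * N * sizeof(double));
        for (size_t i = 0; i < (size_t)N * N; ++i) { A[i] = 1.0; B[i] = 1.0; }  /* dummy data */
    }

    /* distribute: replicate B on all workers, scatter one stripe of A to each */
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, rows * N, MPI_DOUBLE, A_stripe, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... local multiplication of A_stripe with B into C_stripe (OpenMP-parallel) ... */

    /* collect: gather the C stripes on the root */
    MPI_Gather(C_stripe, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}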

Page 37:

Live Demo and Source Code

Page 38:

Source Code (OpenMP)

/* Fragment: M and P are compile-time constants; offset and rows select this worker's stripe of C. */
#pragma omp parallel default(none) shared(A, B, C, offset, rows) private(i, j, k)
{
    #pragma omp for
    for (j = 0; j < M; j++)
        for (i = offset; i < offset + rows; i++)
            for (k = 0; k < P; k++)
                C[i][j] += A[i][k] * B[k][j];
}

Page 39:

Performance Measurements

• Matrix size: 2048 x 2048

• (MPI) 1 x (OpenMP) 1: 110.3 s

• 1 x 2: 55.6 s

• 1 x 12: 12.7 s

• 1 x 24: 10 s

• 2 x 12: 9.8 s

• 12 x 2: 10.7 s

• 24 x 1: 11.5 s

Page 40:

Best Configuration

• 2 x 12: 9.8 s

• System has 2 NUMA nodes (sockets)

• Every socket has 12 logical cores (6 physical cores, doubled by Hyper-Threading)

• Only local memory accesses inside an OpenMP thread

export OMP_NUM_THREADS=12

mpirun -np 2 --bysocket --bind-to-socket --report-bindings ./a.out

Page 41:

Hardware-specific Performance Optimizations

• ubuntu-numa0101 machine details
  − 2x Intel Xeon E5-2620 (Sandy Bridge) CPU
    • 6 cores, 2.0 GHz each
    • 6x L1 cache (32 KB instruction + 32 KB data)
    • 6x 256 KB L2 cache
    • 1x 15 MB shared L3 cache
  − 64 GB RAM

• Optimizations
  − Transposition: read both matrices row-wise
  − Blocking: access matrices in chunks that fit into the cache
  − SSE (Streaming SIMD Extensions): add or multiply two 128-bit vectors at once; some CPUs have fused multiply-add units
  − Alignment: aligned (16-byte) loads are faster than unaligned loads
  − Loop Unrolling: fewer branches in the assembly code, more instruction-level parallelism
  − Parameter Tuning: brute-force different blocking sizes per matrix size

Page 42:

Matrix Transposition (image taken from Wikipedia [6])

[Figure: a 4×2 matrix A with elements a_i,j and a 2×3 matrix B with elements b_i,j]

• Matrices stored row-wise in main memory

• Matrix A is read row-wise, matrix B is read column-wise

• Prefetching: a 64-byte cache line fetched from B holds 8 doubles, but only one of them is used

• Transpose matrix B for more cache hits (see the sketch below)
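A minimal sketch (not the measured implementation; function name and signature are illustrative) of multiplying with a transposed copy of B so that both operands are traversed row-wise:

#include <stdlib.h>

/* C += A * B, where A is n x p, B is p x m, and C is n x m (all row-major). */
void matmul_transposed(int n, int p, int m,
                       const double A[n][p], const double B[p][m], double C[n][m])
{
    double (*Bt)[p] = malloc(sizeof(double[m][p]));    /* Bt[j][k] = B[k][j] */
    for (int k = 0; k < p; ++k)
        for (int j = 0; j < m; ++j)
            Bt[j][k] = B[k][j];

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            double acc = 0.0;
            for (int k = 0; k < p; ++k)
                acc += A[i][k] * Bt[j][k];              /* row-wise reads of both operands */
            C[i][j] += acc;
        }

    free(Bt);
}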

Page 43:

L1/L2 Blocking

[Figure: matrices A and B divided into blocks]

• Divide the matrices into blocks and iterate over all combinations of blocks (a sketch follows below)

• L1 Blocking: the L1 data cache holds 32768 / 8 = 4096 doubles; block size √(4096 / 2) ≈ 45.2, so use 40 × 40 blocks to ensure that entire cache lines are used

• L2 Blocking: 128 × 128 blocks
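A minimal sketch (again illustrative, not the measured code) of L1 blocking with 40 × 40 tiles on square row-major matrices:

#include <stddef.h>

#define BS 40   /* block size derived above */

/* C += A * B for n x n row-major matrices, processed block by block. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* multiply the (ii,kk) block of A with the (kk,jj) block of B */
                for (size_t i = ii; i < ii + BS && i < n; ++i)
                    for (size_t k = kk; k < kk + BS && k < n; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}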
