Parallel Processing 1
High Performance Computing(CS 540)
Shared Memory Programming with OpenMP and Pthreads*
Jeremy R. Johnson
*Some of this lecture was derived from Pthreads Programming by Nichols, Buttlar, and Farrell and POSIX Threads Programming Tutorial (computing.llnl.gov/tutorials/pthreads) by Blaise Barney
Parallel Processing 2
Introduction
• Objective: To further study the shared memory model of parallel programming. Introduction to OpenMP and Pthreads for shared memory parallel programming.
• Topics
– Concurrent programming with UNIX processes
– Introduction to shared memory parallel programming with Pthreads
• Threads
• fork/join
• race conditions
• Synchronization
• performance issues: synchronization overhead, contention and granularity, load balance, cache coherency and false sharing
– Introduction to parallel program design paradigms
• Data parallelism (static scheduling)
• Task parallelism with workers
• Divide and conquer parallelism (fork/join)
Parallel Processing 3
Introduction
• Topics
– OpenMP vs. Pthreads
• hello_pthreads.c
• hello_openmp.c
– Parallel regions and execution model
– Data parallelism with loops
– Shared vs. private variables
– Scheduling and chunk size
– Synchronization and reduction variables
– Functional parallelism with parallel sections
– Case studies
Processes
• Processes contain information about program resources and program execution state:
– Process ID, process group ID, user ID, and group ID
– Environment
– Working directory
– Program instructions
– Registers
– Stack
– Heap
– File descriptors
– Signal actions
– Shared libraries
– Inter-process communication tools (such as message queues, pipes, semaphores, or shared memory)
Parallel Processing 4
UNIX Process
Parallel Processing 5
Threads
• An independent stream of instructions that can be scheduled to run
– Stack pointer
– Registers (program counter)
– Scheduling properties (such as policy or priority)
– Set of pending and blocked signals
– Thread-specific data
• “lightweight process”
– Cost of creating and managing threads is much less than for processes
– Threads live within a process and share process resources such as the address space
• Pthreads – standard thread API (IEEE Std 1003.1)
Parallel Processing 6
Threads within a UNIX Process
Parallel Processing 7
Shared Memory Model
• All threads have access to the same global, shared memory
• All threads within a process share the same address space
• Threads also have their own private data
• Programmers are responsible for synchronizing access (protecting) globally shared data.
Parallel Processing 8
Simple Example
void do_one_thing(int *);
void do_another_thing(int *);
void do_wrap_up(int, int);
int r1 = 0, r2 = 0;
int
main(void)
{
do_one_thing(&r1);
do_another_thing(&r2);
do_wrap_up(r1, r2);
return 0;
}
Parallel Processing 9
Parallel Processing 10
[Figure: stack frames and local variables (i, j, k) for do_one_thing(), do_another_thing(), and main()]

printf("Counters finished with count = %d\n", sum);
printf("Count should be %d X %d = %d\n", numcounters, limit, numcounters*limit);
return 0;
}
Mutex
• Mutex variables are for protecting shared data when multiple writes occur.
• A mutex variable acts like a "lock" protecting access to a shared data resource. Only one thread can own (lock) a mutex at any given time
Parallel Processing 25
Mutex Operations
• pthread_mutex_lock(mutex)
– Used by a thread to acquire a lock on the specified mutex variable. If the mutex is already locked by another thread, this call blocks the calling thread until the mutex is unlocked.
• pthread_mutex_unlock(mutex)
– Unlocks a mutex if called by the owning thread. Calling this routine is required after a thread has completed its use of protected data if other threads are to acquire the mutex for their work with the protected data.
Better Count
int sum = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void count(int *arg)
{
    int i;
    int localsum = 0;

    /* accumulate privately; no synchronization needed here */
    for (i = 0; i < *arg; i++)
    {
        localsum++;
    }

    /* one locked update per thread instead of one per increment */
    pthread_mutex_lock(&lock);
    sum = sum + localsum;
    pthread_mutex_unlock(&lock);
}
Parallel Processing 28
Threadsafe Code
• Refers to an application's ability to execute multiple threads simultaneously without "clobbering" shared data or creating "race" conditions.
Parallel Processing 29
Condition Variables
• While mutexes implement synchronization by controlling thread access to data, condition variables allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met.
• A condition variable is a way to achieve the same goal without polling
• Always used with a mutex

Parallel Processing 30
Using Condition variables
Thread A
• Do work up to the point where a certain condition must occur (such as "count" must reach a specified value)
• Lock associated mutex and check value of a global variable
• Call pthread_cond_wait() to perform a blocking wait for signal from Thread-B. Note that a call to pthread_cond_wait() automatically and atomically unlocks the associated mutex variable so that it can be used by Thread-B.
• When signalled, wake up. Mutex is automatically and atomically locked.
• Explicitly unlock mutex
• Continue
Thread B
• Do work
• Lock associated mutex
• Change the value of the global variable that Thread-A is waiting upon.
• Check value of the global Thread-A wait variable. If it fulfills the desired condition, signal Thread-A.
• Unlock mutex
• Continue
• Chandra, Dagum, Kohr, Maydan, McDonald, Menon, “Parallel Programming in OpenMP”, Morgan Kaufman Publishers, 2001.
• Chapman, Jost, and Van der Pas, “Using OpenMP: Portable Shared Memory Parallel Programming,” The MIT Press, 2008.
Parallel Processing 36
Shared vs. Distributed Memory
[Figure: shared memory (processors P0 ... Pn all access a single shared Memory through an interconnection network) vs. distributed memory (processors P0 ... Pn each paired with a local memory M0 ... Mn, connected by an interconnection network)]
Parallel Processing 37
Shared Memory Programming Model
• Shared memory programming does not require physically shared memory so long as there is support for logically shared memory (in either hardware or software)
• If memory is only logically shared, there may be different costs for accessing memory depending on the physical location.
• UMA – uniform memory access
– SMP – symmetric multi-processor
– typically memory connected to processors via a bus
• NUMA – non-uniform memory access
– typically physically distributed memory connected via an interconnection network
Parallel Processing 38
Hello_openmp.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
int n;
if (argc > 1) {
n = atoi(argv[1]);
omp_set_num_threads(n);
}
printf("Number of threads = %d\n",omp_get_num_threads());
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello World from %d\n",id);
if (id == 0)
printf("Number of threads = %d\n",omp_get_num_threads());
}
exit(0);
}
Parallel Processing 39
Compiling & Running Hello_openmp
% gcc -fopenmp hello_openmp.c -o hello
% ./hello 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 3
Number of threads = 4
Hello World from 2
The order of the print statements is nondeterministic
Parallel Processing 40
Execution Model
Master thread
Master and slave threads
Master thread
Implicit barrier synchronization(join)
Implicit thread creation (fork)
Parallel Region
Parallel Processing 41
Explicit Barrier

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
int n;
if (argc > 1) {
n = atoi(argv[1]);
omp_set_num_threads(n);
}
printf("Number of threads = %d\n",omp_get_num_threads());
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello World from %d\n",id);
#pragma omp barrier
if (id == 0) printf("Number of threads = %d\n",omp_get_num_threads());
}
exit(0);
}
Parallel Processing 42
Output with Barrier
% ./hellob 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 2
Hello World from 3
Number of threads = 4
The order of the “Hello World” print statements is nondeterministic; however, the “Number of threads” print statement always comes at the end
Parallel Processing 43
Hello_pthreads.c

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>
#define MAXTHREADS 32
int main(int argc, char **argv)
{
int error,i,n;
void hello(int *pid);
pthread_t tid[MAXTHREADS],mytid;
int pid[MAXTHREADS];
if (argc > 1) {
n = atoi(argv[1]);
if (n > MAXTHREADS) {
printf("Too many threads\n"); exit(1);
}
pthread_setconcurrency(n);
}
printf("Number of threads = %d\n",pthread_getconcurrency());