ECE1747 Parallel Programming Shared Memory Multithreading Pthreads
Transcript
Page 1: ECE1747 Parallel Programming Shared Memory Multithreading Pthreads.

ECE1747 Parallel Programming

Shared Memory Multithreading Pthreads

Page 2:

Shared Memory

[Diagram: proc1, proc2, proc3, …, procN all accessing one shared memory address space]

• All threads access the same shared memory data space.

Page 3:

Shared Memory (continued)

• Concretely, it means that a variable x, a pointer p, or an array a[] refers to the same object, no matter which processor the reference originates from.

• We have more or less implicitly assumed this to be the case in earlier examples.

Page 4:

Shared Memory

[Diagram: proc1 … procN all referencing the same copy of a in shared memory]

Page 5:

Distributed Memory - Message Passing

The alternative model to shared memory.

[Diagram: proc1 … procN, each with its own local memory mem1 … memN, connected by a network; each memory holds its own copy of a]

Page 6:

Shared Memory vs. Message Passing

• Same terminology is used in distinguishing hardware.

• For us: distinguish programming models, not hardware.

Page 7:

Programming vs. Hardware

• One can implement a shared memory programming model
  – on shared or distributed memory hardware
  – (also in software or in hardware)

• One can implement a message passing programming model
  – on shared or distributed memory hardware

Page 8:

Portability of programming models

[Diagram: shared memory programming and message passing programming can each be implemented on a distributed memory machine or a shared memory machine]

Page 9:

Shared Memory Programming: Important Point to Remember

• No matter what the implementation, it conceptually looks like shared memory.

• There may be some (important) performance differences.

Page 10:

Multithreading

• User has explicit control over threads.

• Good: control can be used for performance benefit.

• Bad: user has to deal with it.

Page 11:

Pthreads

• POSIX standard shared-memory multithreading interface.

• Provides primitives for process management and synchronization.

Page 12:

What does the user have to do?

• Decide how to decompose the computation into parallel parts.

• Create (and destroy) processes to support that decomposition.

• Add synchronization to make sure dependences are covered.

Page 13:

General Thread Structure

• Typically, a thread is a concurrent execution of a function or a procedure.

• So, your program needs to be restructured such that parallel parts form separate procedures or functions.

Page 14:

Example of Thread Creation

[Diagram: main() calls pthread_create(func), which starts a new thread running func() concurrently with main()]

Page 15:

Thread Joining Example

void *func(void *arg) { … }

pthread_t id;
int X;

pthread_create(&id, NULL, func, &X);
…
pthread_join(id, NULL);
…

Page 16:

Example of Thread Creation (contd.)

main()

pthread_create(func) func()

pthread_join(id)

pthread_ exit()

Page 17:

Sequential SOR

for some number of timesteps/iterations {
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      temp[i][j] = 0.25 *
        ( grid[i-1][j] + grid[i+1][j] +
          grid[i][j-1] + grid[i][j+1] );
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      grid[i][j] = temp[i][j];
}

Page 18:

Parallel SOR

• First (i,j) loop nest can be parallelized.

• Second (i,j) loop nest can be parallelized.

• Must wait to start the second loop nest until all processors have finished the first.

• Must wait to start the first loop nest of the next iteration until all processors have finished the second loop nest of the previous iteration.

• Give n/p rows to each processor.

Page 19:

Pthreads SOR: Parallel parts (1)

void *sor_1(void *s)
{
  int slice = (int)s;
  int from = (slice * n) / p;
  int to = ((slice + 1) * n) / p;
  for( i=from; i<to; i++ )
    for( j=1; j<n; j++ )
      temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                            grid[i][j-1] + grid[i][j+1] );
}

Page 20:

Pthreads SOR: Parallel parts (2)

void *sor_2(void *s)
{
  int slice = (int)s;
  int from = (slice * n) / p;
  int to = ((slice + 1) * n) / p;
  for( i=from; i<to; i++ )
    for( j=1; j<n; j++ )
      grid[i][j] = temp[i][j];
}

Page 21:

Pthreads SOR: main

for some number of timesteps {
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_1, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_2, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
}

Page 22:

Summary: Thread Management

• pthread_create(): creates a parallel thread executing a given function (and arguments), returns thread identifier.

• pthread_exit(): terminates thread.

• pthread_join(): waits for thread with particular thread identifier to terminate.
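The three thread-management primitives can be sketched as one complete, compilable example; the function and variable names (worker, run_worker) and the trivial +1 workload are illustrative, not from the slides:

```c
/* Minimal sketch of the create/join pattern. */
#include <pthread.h>

static int result;

static void *worker(void *arg)
{
    result = *(int *)arg + 1;   /* "work": read argument, store result */
    return NULL;                /* returning from the function ends the thread */
}

/* Create one thread, wait for it to finish, return what it computed. */
int run_worker(int x)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, &x);  /* spawn worker(&x) */
    pthread_join(tid, NULL);                 /* block until it terminates */
    return result;
}
```

Compile with -pthread. Because pthread_join() returns only after worker() has finished, reading result afterwards is safe.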

Page 23:

Summary: Program Structure

• Encapsulate parallel parts in functions.

• Use function arguments to parameterize what a particular thread does.

• Call pthread_create() with the function and arguments, save thread identifier returned.

• Call pthread_join() with that thread identifier.

Page 24:

Pthreads Synchronization

• Create/exit/join
  – provide some form of synchronization,
  – at a very coarse level,
  – require thread creation/destruction.

• Need for finer-grain synchronization:
  – mutex locks,
  – condition variables.

Page 25:

Use of Mutex Locks

• To implement critical sections.

• Pthreads provides only exclusive locks.

• Some other systems allow shared-read, exclusive-write locks.
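A sketch of a critical section protected by a Pthreads mutex; the shared-counter workload and the thread/iteration counts are illustrative:

```c
/* Several threads increment a shared counter; the lock/unlock pair
   makes the increment a critical section, so no updates are lost. */
#include <pthread.h>

#define NTHREADS 4
#define NITERS   100000

static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;                    /* shared update, now race-free */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}

long run_counter(void)
{
    pthread_t tid[NTHREADS];
    counter = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, increment, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return counter;   /* NTHREADS * NITERS with the lock in place */
}
```

Without the lock/unlock pair the final count would usually be smaller than NTHREADS * NITERS, because concurrent read-modify-write sequences would interleave.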

Page 26:

Barrier Synchronization

• A wait at a barrier causes a thread to wait until all threads have performed a wait at the barrier.

• At that point, they all proceed.

Page 27:

Implementing Barriers in Pthreads

• Count the number of arrivals at the barrier.

• Wait if this is not the last arrival.

• Make everyone unblock if this is the last arrival.

• Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.

Page 28:

Implementing Barriers in Pthreads

void barrier()
{
  pthread_mutex_lock(&mutex_arr);
  arrived++;
  if (arrived < N) {
    pthread_cond_wait(&cond, &mutex_arr);
  }
  else {
    pthread_cond_broadcast(&cond);
    arrived = 0; /* be prepared for next barrier */
  }
  pthread_mutex_unlock(&mutex_arr);
}

Page 29:

Parallel SOR with Barriers (1 of 2)

void *sor(void *arg)
{
  int slice = (int)arg;
  int from = (slice * (n-1)) / p + 1;
  int to = ((slice+1) * (n-1)) / p + 1;
  for some number of iterations { … }
}

Page 30:

Parallel SOR with Barriers (2 of 2)

for( i=from; i<to; i++ )
  for( j=1; j<n; j++ )
    temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                          grid[i][j-1] + grid[i][j+1] );
barrier();
for( i=from; i<to; i++ )
  for( j=1; j<n; j++ )
    grid[i][j] = temp[i][j];
barrier();

Page 31:

Parallel SOR with Barriers: main

int main(int argc, char *argv[])
{
  pthread_t thrd[p];
  /* initialize mutex and condition variables */
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], &attr, sor, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
  /* destroy mutex and condition variables */
}

Page 32:

Note again

• Many shared memory programming systems (other than Pthreads) have barriers as a basic primitive.

• If they do, you should use it, not construct it yourself.

• Implementation may be more efficient than what you can do yourself.

Page 33:

Busy Waiting

• Not an explicit part of the API.

• Available in a general shared memory programming environment.

Page 34:

Busy Waiting

initially: flag = 0;

P1: produce data;
    flag = 1;

P2: while( !flag ) ;
    consume data;
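A runnable sketch of this flag idiom. One assumption worth flagging: in modern C the flag must be _Atomic (or equivalent) so the compiler and hardware do not reorder the data write and the flag write; the plain int of the slide is only safe on the idealized machine assumed there. The value 42 and the function names are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>

static int data;
static atomic_int flag;

static void *producer(void *arg)
{
    data = 42;                  /* produce data */
    atomic_store(&flag, 1);     /* then publish it */
    return NULL;
}

static void *consumer(void *arg)
{
    while (!atomic_load(&flag)) /* busy wait until published */
        ;
    *(int *)arg = data;         /* consume data */
    return NULL;
}

int run_flag(void)
{
    pthread_t p, c;
    int out = 0;
    atomic_store(&flag, 0);
    pthread_create(&c, NULL, consumer, &out);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return out;
}
```

The sequentially consistent atomic store/load pair ensures the write to data happens-before the consumer's read, which is exactly the ordering the slide's pseudocode takes for granted.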

Page 35:

Use of Busy Waiting

• On the surface, simple and efficient.

• In general, not a recommended practice.

• Often leads to messy and unreadable code (blurs data/synchronization distinction).

• May be inefficient (the waiting thread consumes processor cycles while spinning).

Page 36:

Private Data in Pthreads

• To make a variable private in Pthreads, you need to make an array out of it.

• Index the array by thread identifier, which you have to keep track of yourself.

• Not very elegant or efficient.
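A sketch of the idiom (names illustrative). Note that adjacent slots of the array share cache lines here, so the threads falsely share them; that is one source of the inefficiency just mentioned:

```c
/* One "private" slot per thread, indexed by the id we pass by hand. */
#include <pthread.h>

#define NTHREADS 4

static long priv[NTHREADS];

static void *work(void *arg)
{
    int id = *(int *)arg;        /* the thread id we must track ourselves */
    priv[id] = 0;
    for (int i = 0; i < 1000; i++)
        priv[id] += id;          /* only this thread touches priv[id] */
    return NULL;
}

long run_private(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, work, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        total += priv[i];        /* 1000 * (0+1+2+3) */
    return total;
}
```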

Page 37:

Other Primitives in Pthreads

• Set the attributes of a thread.

• Set the attributes of a mutex lock.

• Set scheduling parameters.

Page 38:

ECE 1747 Parallel Programming

Machine-independent Performance Optimization Techniques

Page 39:

Returning to Sequential vs. Parallel

• Sequential execution time: t seconds.
• Startup overhead of parallel execution: t_st seconds (depends on architecture).
• (Ideal) parallel execution time: t/p + t_st.
• If t/p + t_st > t, no gain.
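Rearranging the break-even condition: t/p + t_st < t is equivalent to t > t_st * p/(p-1). A tiny sketch of checking it; the function names are illustrative:

```c
/* Ideal parallel execution time: t/p plus the startup overhead t_st. */
double parallel_time(double t, int p, double t_st)
{
    return t / p + t_st;
}

/* 1 if the parallel version would beat the sequential one, else 0. */
int worth_parallelizing(double t, int p, double t_st)
{
    return parallel_time(t, p, t_st) < t;
}
```

For example, with p = 4 and t_st = 1: a 10-second job parallelizes (10/4 + 1 = 3.5 < 10), but a 1-second job does not (0.25 + 1 > 1).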

Page 40:

General Idea

• Parallelism limited by dependences.

• Restructure code to eliminate or reduce dependences.

• Sometimes possible by compiler, but good to know how to do it by hand.

Page 41:

Optimizations: Example 16

for( i=0; i<100000; i++ )
  a[i+1000] = a[i] + 1;

Cannot be parallelized as is.
May be parallelized by applying certain code transformations.
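One possible transformation (a sketch; the slides do not show which one they used): the dependence distance is 1000, since iteration i writes a[i+1000], which is read only 1000 iterations later. Strip-mining the loop into blocks of 1000 therefore yields an inner loop whose iterations are mutually independent:

```c
#include <stdlib.h>

/* a[] has 101000 elements: 100000 iterations plus the 1000-element shift. */
void shifted_copy(int *a)
{
    for (int block = 0; block < 100000; block += 1000)
        /* within a block, reads a[block..block+999] and writes
           a[block+1000..block+1999] are disjoint, so these 1000
           iterations are independent and could run in parallel */
        for (int i = block; i < block + 1000; i++)
            a[i + 1000] = a[i] + 1;
}

/* helper for checking: on a zeroed array, a[k] ends up floor(k/1000) */
int run_example(void)
{
    int *a = calloc(101000, sizeof *a);
    shifted_copy(a);
    int r = a[100999];
    free(a);
    return r;
}
```

The outer block loop must still run sequentially (block k reads what block k-1 wrote), but each group of 1000 iterations is parallel work.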

Page 42:
Page 43:
Page 44:

Summary

• Reorganize code such that
  – dependences are removed or reduced,
  – large pieces of parallel work emerge,
  – loop bounds become known,
  – …

• Code can become messy … there is a point of diminishing returns.

Page 45:

Factors that Determine Speedup

• Characteristics of parallel code:
  – granularity
  – load balance
  – locality
  – communication and synchronization

Page 46:

Granularity

• Granularity = size of the program unit that is executed by a single processor.

• May be a single loop iteration, a set of loop iterations, etc.

• Fine granularity leads to:
  – (positive) ability to use lots of processors
  – (positive) finer-grain load balancing
  – (negative) increased overhead

Page 47:

Granularity and Critical Sections

• Small granularity => more processors => more critical section accesses => more contention.

Page 48:

Issues in Performance of Parallel Parts

• Granularity.

• Load balance.

• Locality.

• Synchronization and communication.

Page 49:

Load Balance

• Load imbalance = difference in execution time between processors between barriers.

• Execution time may not be predictable:
  – regular data parallel: yes,
  – irregular data parallel or pipeline: perhaps,
  – task queue: no.

Page 50:
Page 51:

Static vs. Dynamic

• Static: done once, by the programmer
  – block, cyclic, etc.
  – fine for regular data parallel

• Dynamic: done at runtime
  – task queue
  – fine for unpredictable execution times
  – usually high overhead

• Semi-static: done once, at run-time

Page 52:

Choice is not inherent

• MM or SOR could be done using task queues: put all iterations in a queue.
  – In a heterogeneous environment.
  – In a multitasked environment.

Page 53:

Static Load Balancing

• Block
  – best locality
  – possibly poor load balance

• Cyclic
  – better load balance
  – worse locality

• Block-cyclic
  – load balancing advantages of cyclic (mostly)
  – better locality
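The three distributions amount to simple index arithmetic; a sketch, with illustrative helper names, for n iterations over p processors:

```c
/* block: processor id gets the contiguous range [block_start, block_end) */
int block_start(int id, int n, int p) { return (id * n) / p; }
int block_end(int id, int n, int p)   { return ((id + 1) * n) / p; }

/* cyclic: iteration i goes to processor i mod p */
int cyclic_owner(int i, int p) { return i % p; }

/* block-cyclic: blocks of b consecutive iterations dealt round-robin */
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }
```

For example, with n = 100 and p = 4, block gives processor 1 iterations 25..49, while cyclic gives it iterations 1, 5, 9, …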

Page 54:

Dynamic Load Balancing (1 of 2)

• Centralized: single task queue.
  – Easy to program.
  – Excellent load balance.

• Distributed: task queue per processor.
  – Less communication/synchronization.

Page 55:

Dynamic Load Balancing (2 of 2)

• Task stealing:
  – Processes normally remove and insert tasks from their own queue.
  – When their queue is empty, remove task(s) from other queues.

• Extra overhead and programming difficulty.

• Better load balancing.

Page 56:

Semi-static Load Balancing

• Measure the cost of program parts.

• Use measurement to partition computation.

• Done once, every iteration, or every n iterations.

Page 57:

Molecular Dynamics (MD)

• Simulation of a set of bodies under the influence of physical laws.

• Atoms, molecules, celestial bodies, ...

• All have the same basic structure.

Page 58:

Molecular Dynamics (Skeleton)

for some number of timesteps {
  for all molecules i
    for all other molecules j
      force[i] += f( loc[i], loc[j] );
  for all molecules i
    loc[i] = g( loc[i], force[i] );
}

Page 59:

Molecular Dynamics

• To reduce amount of computation, account for interaction only with nearby molecules.

Page 60:

Molecular Dynamics (continued)

for some number of timesteps {
  for all molecules i
    for all nearby molecules j
      force[i] += f( loc[i], loc[j] );
  for all molecules i
    loc[i] = g( loc[i], force[i] );
}

Page 61:

Molecular Dynamics (continued)

for each molecule i
  count[i] = number of nearby molecules
  index[j] = index of the j-th nearby molecule ( 0 <= j < count[i] )
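A sketch of how count[] and the index arrays might be built. Two simplifying assumptions, not from the slides: positions are 1-D, and the per-molecule index lists are stored as rows of one flattened n-by-n array:

```c
#include <math.h>

/* count[i] = number of neighbors of molecule i;
   index[i*n + 0 .. i*n + count[i]-1] = their indices. */
void build_neighbors(const double *loc, int n, double cutoff,
                     int *count, int *index)
{
    for (int i = 0; i < n; i++) {
        count[i] = 0;
        for (int j = 0; j < n; j++)
            /* j is a neighbor of i if it is a different molecule
               within the cutoff distance */
            if (j != i && fabs(loc[i] - loc[j]) <= cutoff)
                index[i * n + count[i]++] = j;
    }
}
```

Rebuilding this list is the O(n^2) step; the force loop then only touches count[i] entries per molecule, which is the computational saving the slide describes.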

Page 62:

Molecular Dynamics (continued)

for some number of timesteps {
  for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
      force[i] += f( loc[i], loc[index[j]] );
  for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );
}

Page 63:

Molecular Dynamics (simple)

for some number of timesteps {
  parallel for
    for( i=0; i<num_mol; i++ )
      for( j=0; j<count[i]; j++ )
        force[i] += f( loc[i], loc[index[j]] );
  parallel for
    for( i=0; i<num_mol; i++ )
      loc[i] = g( loc[i], force[i] );
}

Page 64:

Molecular Dynamics (simple)

• Simple to program.

• Possibly poor load balance:
  – block distribution of i iterations (molecules)
  – could lead to uneven neighbor distribution
  – cyclic does not help

Page 65:

Better Load Balance

• Assign iterations such that each processor has ~ the same number of neighbors.

• Array of “assign records”:
  – size: number of processors
  – two elements:
    • beginning i value (molecule)
    • ending i value (molecule)

• Recompute partition periodically.
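A sketch of computing such assign records from the neighbor counts; the greedy cut rule (end processor k's range where the running work total reaches its (k+1)/p share) and all names are illustrative assumptions, not from the slides:

```c
/* On return, processor k owns molecules begin[k] .. end[k]-1,
   with roughly equal sums of count[] per processor. */
void balance(const int *count, int n, int p, int *begin, int *end)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += count[i];       /* total amount of neighbor work */

    long done = 0;
    int i = 0;
    for (int k = 0; k < p; k++) {
        begin[k] = i;
        long target = (total * (k + 1)) / p;  /* cumulative share */
        while (i < n && done + count[i] <= target)
            done += count[i++];
        end[k] = i;
    }
    end[p - 1] = n;              /* last processor takes any remainder */
}
```

Running this once per neighbor-list rebuild is exactly the semi-static scheme of the earlier slide: measure (count[]), repartition, then proceed statically until the next rebuild.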

Page 66:

Frequency of Balancing

• Every time neighbor list is recomputed.– once during initialization.– every iteration.– every n iterations.

• Extra overhead vs. better approximation and better load balance.

Page 67:

Summary

• Parallel code optimization:
  – critical section accesses
  – granularity
  – load balance