Parallelization at a Glance
Rechen- und Kommunikationszentrum (RZ)
Christian Terboven <terboven@rz.rwth-aachen.de>
20.11.2012, Aachen, Germany
Version 2.3 (as of 19.11.2012)
Agenda
Basic Concepts
Parallelization Strategies
Possible Issues
Efficient Parallelization
Basic Concepts
Processes and Threads
Process: an instance of a computer program.
Thread: a sequence of instructions with its own program counter.
Parallel / concurrent execution can use multiple processes or multiple threads.
Per process: address space, files, environment, …
Per thread: program counter, stack, …
[Figure: a computer running processes, each containing threads with their own program counters]
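To make the distinction concrete, here is a minimal C sketch (not from the slides; it assumes a POSIX system with pthreads): one process, two threads, a shared global in the common address space, and a private counter on each thread's own stack.

    #include <pthread.h>
    #include <stdio.h>

    int shared = 0;                  /* one copy in the process-wide address space */

    void *worker(void *arg) {
        int local = 0;               /* each thread gets its own copy on its stack */
        local++;
        __sync_fetch_and_add(&shared, 1);   /* GCC/Clang atomic add on the shared copy */
        printf("thread %ld: local=%d\n", (long)arg, local);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared=%d\n", shared);   /* 2: both threads updated the same variable */
        return 0;
    }

Compile with cc -pthread. Both threads print local=1, while shared ends up at 2: the address space is per process, the stack per thread.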
Process and Thread Scheduling by the OS
Even on a multi-socket multi-core system you should not make any assumptions about which process / thread is executed when and where!
Two threads on one core: the OS interleaves them (time slicing).
Two threads on two cores: they can run truly in parallel, but the OS may still migrate threads between cores; "pinned" threads stay on their assigned cores.
[Figure: execution timelines for Thread 1, Thread 2, and a system thread, contrasting thread migration with "pinned" threads]
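To observe this, a sketch (assuming Linux/glibc, where sched_getcpu() is available, and an OpenMP compiler) lets every thread report the core it currently runs on:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            /* The reported core may differ from run to run, and threads may
               migrate during the run unless they are pinned. */
            printf("thread %d on core %d\n", omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }

Setting OMP_PROC_BIND=true before running asks the OpenMP runtime to pin the threads so the OS does not migrate them.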
Shared Memory Parallelization
Memory can be accessed by several threads running on different cores in a multi-socket multi-core system:
[Figure: two CPUs attached to the same memory; CPU1 executes a=4 while CPU2 executes c=3+a]
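The two statements in the figure can be played through in OpenMP (a sketch, not slide code): thread 0 writes the shared variable, a barrier makes the write visible, and thread 1 reads it.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int a = 0, c = 0;
        #pragma omp parallel num_threads(2) shared(a, c)
        {
            if (omp_get_thread_num() == 0)
                a = 4;              /* CPU1: store to shared memory */
            #pragma omp barrier     /* synchronize before reading */
            if (omp_get_thread_num() == 1)
                c = 3 + a;          /* CPU2: load a from shared memory */
        }
        printf("c = %d\n", c);      /* 7 */
        return 0;
    }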
Parallelization Strategies
Finding Concurrency
Chances for concurrent execution:
Look for tasks that can be executed simultaneously (task parallelism).
Decompose data into distinct chunks to be processed independently (data parallelism).
Organize by task: task parallelism, divide and conquer.
Organize by data decomposition: geometric decomposition, recursive data.
Organize by flow of data: pipeline, event-based coordination.
Divide and conquer
[Figure: a problem is recursively split into subproblems, the subproblems are solved in parallel, and the subsolutions are merged step by step back into the overall solution]
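As a sketch of this pattern (my example, not the course's): a recursive array sum with OpenMP tasks, where split, solve, and merge map directly onto the diagram.

    #include <stdio.h>

    long sum(const long *a, int n) {
        if (n < 1000) {                   /* small subproblem: solve directly */
            long s = 0;
            for (int i = 0; i < n; i++) s += a[i];
            return s;
        }
        long left, right;
        #pragma omp task shared(left)     /* split: one half becomes a task */
        left = sum(a, n / 2);
        right = sum(a + n / 2, n - n / 2);
        #pragma omp taskwait              /* wait for the subsolution */
        return left + right;              /* merge */
    }

    int main(void) {
        enum { N = 1000000 };
        static long a[N];
        for (int i = 0; i < N; i++) a[i] = 1;
        long s;
        #pragma omp parallel
        #pragma omp single                /* one thread starts the recursion */
        s = sum(a, N);
        printf("sum = %ld\n", s);         /* 1000000 */
        return 0;
    }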
Geometric decomposition
The data domain is split into geometric chunks (subdomains) that are processed in parallel.
Example: ventricular assist device (VAD)
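The VAD simulation itself is out of scope here, but the idea can be sketched on a 1-D domain (an illustration of mine): schedule(static) assigns each thread one contiguous chunk of the data.

    #include <stdio.h>

    #define N 1000000
    static double u[N], u_new[N];

    int main(void) {
        for (int i = 0; i < N; i++) u[i] = (double)i;

        /* Geometric decomposition: the iteration space (and with it the
           data) is cut into contiguous chunks, one per thread. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < N - 1; i++)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);   /* simple stencil */

        printf("u_new[%d] = %f\n", N / 2, u_new[N / 2]);
        return 0;
    }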
Pipeline
Analogy: an assembly line. Assign different stages to different PEs.
[Figure: five computations C1…C5 flow through four pipeline stages over time; at any point the stages work on different computations in parallel]
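One way to express a pipeline in shared memory (a sketch using OpenMP 4.0 task dependencies; not code from the course) is to create one task per stage and item and let depend clauses order the stages, so different items can be in different stages at the same time:

    #include <stdio.h>

    #define ITEMS 5

    int main(void) {
        int data[ITEMS];
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < ITEMS; i++) {
            #pragma omp task depend(out: data[i])
            data[i] = i;                           /* stage 1: produce */
            #pragma omp task depend(inout: data[i])
            data[i] *= 2;                          /* stage 2: transform */
            #pragma omp task depend(in: data[i])
            printf("item %d -> %d\n", i, data[i]); /* stage 3: consume */
        }
        return 0;
    }

The dependencies order the three stages per item; across items the tasks run concurrently, like the computations C1…C5 flowing through the stages above. The output order across items is not fixed.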
Further parallel design patterns
Recursive data pattern: parallelize operations on recursive data structures.
See example later on: Fibonacci
Event-based coordination pattern: decomposition into semi-independent tasks interacting in an irregular fashion.
See example later on: while-loop with tasks
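In the spirit of the referenced while-loop example (a sketch of mine, not the later course example): a while-loop walks an irregular, pointer-linked structure and spawns one task per element.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct node { int value; struct node *next; } node;

    int main(void) {
        node *head = NULL;                  /* build a small linked list */
        for (int i = 5; i > 0; i--) {
            node *n = malloc(sizeof *n);
            n->value = i; n->next = head; head = n;
        }
        #pragma omp parallel
        #pragma omp single
        {
            node *p = head;
            while (p != NULL) {             /* irregular traversal */
                #pragma omp task firstprivate(p)
                printf("processing %d\n", p->value);
                p = p->next;
            }
        }
        return 0;
    }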
Possible Issues
Data Races / Race Conditions
Data race: concurrent access to the same memory location by multiple threads without proper synchronization.
Let variable x have the initial value 1; one thread executes x=5; while another executes printf(x).
Depending on which thread is faster, you will see either 1 or 5.
The result is nondeterministic (i.e. it depends on OS scheduling).
Data races (and how to detect and avoid them) will be covered in more detail later!
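The slide's scenario as a runnable C sketch (pthread-based; my code): one thread stores 5, the other prints, with no synchronization in between.

    #include <pthread.h>
    #include <stdio.h>

    int x = 1;                                   /* initial value 1 */

    void *writer(void *arg) { (void)arg; x = 5; return NULL; }
    void *reader(void *arg) { (void)arg; printf("x = %d\n", x); return NULL; }

    int main(void) {
        pthread_t t1, t2;
        /* Unsynchronized write and read of the same location: depending on
           which thread is faster, the program prints 1 or 5. */
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Race detectors such as ThreadSanitizer (compile with -fsanitize=thread) report exactly this kind of access pair.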
Serialization
Serialization: when threads / processes wait "too much", you get limited scalability, if any at all.
Simple (and stupid) example:
[Figure: timeline of processes alternating Calc, Send/Recv, and Data Transfer phases; while one process communicates, the other sits in Wait, so the transfers run one after another]
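A shared-memory analogue of this effect (a sketch of mine, not the slide's message-passing scenario): if every thread's work sits in one critical section, the threads run it strictly one after another, and adding threads cannot help.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        double sum = 0.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            #pragma omp critical                 /* only one thread at a time: */
            for (int i = 0; i < 10000000; i++)   /* all work is serialized */
                sum += 1e-7;
        }
        printf("sum = %.2f, time = %.3f s\n", sum, omp_get_wtime() - t0);
        return 0;
    }

The wall time stays (at best) constant as threads are added; with a reduction instead of the critical section it would shrink.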
Efficient Parallelization
Parallelization Overhead
Overhead introduced by the parallelization:
Time to start / end / manage threads
Time to send / exchange data
Time spent in synchronization of threads / processes
With parallelization: the total CPU time increases, the wall time decreases, and the system time stays the same.
Efficient parallelization is about minimizing the overhead introduced by the parallelization itself!
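The management and synchronization cost can be made visible directly (a sketch; absolute numbers depend on machine and runtime): time an empty parallel region, which contains nothing but fork/join overhead.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* warm-up: the first region usually pays for thread creation */
        #pragma omp parallel
        { }

        double t0 = omp_get_wtime();
        for (int i = 0; i < 1000; i++) {
            #pragma omp parallel
            { }                      /* no work: only fork/join overhead */
        }
        double t = (omp_get_wtime() - t0) / 1000;
        printf("average fork/join overhead: %g s\n", t);
        return 0;
    }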
Load Balancing
[Figure: two timelines; with perfect load balancing all bars end together, with load imbalance some bars run longer]
Perfect load balancing: all threads / processes finish at the same time.
Load imbalance: some threads / processes take longer than others, and all threads / processes have to wait for the slowest one, which thus limits the scalability.
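When iterations have very different costs, the schedule decides the balance. A sketch (the work function is made up for illustration): with schedule(dynamic), threads fetch new chunks on demand instead of waiting.

    #include <stdio.h>

    /* artificial work whose cost grows with i, so equal-sized static
       chunks would be imbalanced */
    double work(int i) {
        double s = 0.0;
        for (int k = 0; k < i; k++) s += 1.0 / (k + 1);
        return s;
    }

    int main(void) {
        double total = 0.0;
        /* schedule(dynamic) hands out chunks on demand, so threads that
           got cheap iterations pick up more work instead of idling */
        #pragma omp parallel for schedule(dynamic, 100) reduction(+:total)
        for (int i = 0; i < 20000; i++)
            total += work(i);
        printf("total = %f\n", total);
        return 0;
    }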
Speedup and Efficiency
Time using 1 CPU: T(1). Time using p CPUs: T(p).
Speedup S: S(p) = T(1)/T(p). It measures how much faster the parallel computation is.
Efficiency E: E(p) = S(p)/p
Example: T(1) = 6 s, T(2) = 4 s gives S(2) = 1.5 and E(2) = 0.75.
Ideal case: T(p) = T(1)/p, S(p) = p, E(p) = 1.0
Amdahl's Law
Describes the influence of the serial part on scalability (without taking any overhead into account).
S(p) = T(1)/T(p) = T(1) / (f*T(1) + (1-f)*T(1)/p) = 1 / (f + (1-f)/p)
f: serial part (0 ≤ f ≤ 1)
T(1): time using 1 CPU
T(p): time using p CPUs
S(p): speedup; S(p) = T(1)/T(p)
E(p): efficiency; E(p) = S(p)/p
It is rather easy to scale to a small number of cores, but any parallelization is limited by the serial part of the program!
Amdahl's Law illustrated
If 80% of your work (measured in program runtime) can be parallelized and "just" 20% still runs sequentially, then your speedup will be:
1 processor: time 100%, speedup 1.0
2 processors: time 60%, speedup 1.7
4 processors: time 40%, speedup 2.5
∞ processors: time 20%, speedup 5.0
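These numbers drop straight out of the formula; a small sketch to reproduce them:

    #include <stdio.h>

    /* Amdahl's law: S(p) = 1 / (f + (1 - f) / p), f = serial fraction */
    double amdahl(double f, double p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void) {
        const double f = 0.2;                 /* 20% serial, 80% parallelizable */
        const double p[] = { 1, 2, 4, 1e9 };  /* 1e9 approximates "infinity" */
        for (int i = 0; i < 4; i++)
            printf("p = %10.0f  S(p) = %.2f\n", p[i], amdahl(f, p[i]));
        return 0;                             /* limit for p -> infinity: 1/f = 5 */
    }

It prints 1.00, 1.67, 2.50, and 5.00, matching the slide (1.67 rounded to 1.7).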
Speedup in Practice
After the initial parallelization of a program, you will typically see speedup curves like this:
[Figure: speedup over the number of processors p, for p = 1 to 8 and beyond; the ideal speedup S(p) = p is a straight line, the curve predicted by Amdahl's law bends below it, and the realistic speedup stays below both]