Post on 30-Dec-2015
1
Parallel Computing 5
Parallel Application Design Ondřej Jakl
Institute of Geonics, Academy of Sci. of the CR
2
• Task/channel model
• Foster’s design methodology
• Partitioning
• Communication analysis
• Agglomeration
• Mapping to processors
• Examples
Outline of the lecture
3
• In general a very creative process
• Only methodical frameworks available
• Usually more alternatives to be considered
• The best parallel solution may differ from suggestions of the sequential approach
Design of parallel algorithms
4
• Introduced in Ian Foster’s Designing and Building Parallel Programs [Foster 1995]
– http://www-unix.mcs.anl.gov/dbpp
• Represents a parallel computation as a set of tasks
– a task is a program, its local memory, and a collection of I/O ports
• task can send local data values to other tasks via output ports
• task can receive data values from other tasks via input ports
• The tasks may interact with each other by sending messages through channels
– channel is a message queue that connects one task’s output port with another task’s input port
– a nonblocking (asynchronous) send and a blocking receive are assumed
• An abstraction close to the message passing model
Task/channel model (1)
5
Task/channel model (2)
[Figure after [Quinn 2004]: a directed graph of tasks (vertices) and channels (edges); each task consists of a program with input and output ports]
6
• Design stages:
1. partitioning into concurrent tasks
2. communication analysis to coordinate tasks
3. agglomeration into larger tasks with respect to the target platform
4. mapping of tasks to processors
• Stages 1, 2 are on the conceptual level; stages 3, 4 are implementation dependent
• In practice often considered simultaneously
Foster’s methodology [Foster 1995]
7
• Process of dividing the computation and the data into pieces – primitive tasks
• Goal: Expose the opportunities for parallel processing
• Maximal (fine-grained) decomposition for greater flexibility
• Complementary techniques:
– domain decomposition (data-centric approach)
– functional decomposition (computation-centric approach)
• Combinations possible
– usual scenario: primary decomposition – functional; secondary decomposition – domain
Partitioning (decomposition)
8
• Primary object of decomposition: the processed data
– first, the data associated with the problem is divided into pieces
• focus on the largest and/or most frequently accessed data
• pieces should be of comparable size
– next, the computation is partitioned according to the data on which it operates
• usually the same code for each task (SPMD – Single Program Multiple Data)
• may be non-trivial, may bring up complex mathematical problems
• Most often used technique in parallel programming
Domain (data) decomposition
3D grid data: one-, two-, three-dimensional decomposition [Foster 1995]
9
• Primary object of decomposition: the computation
– first, the computation is decomposed into disjoint tasks
• different codes for the tasks (MPMD – Multiple Program Multiple Data)
• methodological benefits: implies program structuring
– gives rise to simpler modules with interfaces
– c.f. object-oriented programming, etc.
– next, data is partitioned according to the requirements of the tasks
• data requirements may be disjoint, or overlap (→ communication)
• Sources of parallelism:
– concurrent processing of independent tasks
– concurrent processing of a stream of data through pipelining
• a stream of data is passed through a succession of tasks, each of which performs some operation on it
• MPSD – Multiple Program Single Data
• The number of tasks usually does not scale with the problem size
– for greater scalability, combine with domain decomposition on the subtasks
Functional (task) decomposition
Climate model [Foster 1995]
10
• More tasks (at least by an order of magnitude) than processors
– if not: little flexibility
• No redundancy in processing and data
– if not: little scalability
• Comparable size of tasks
– if not: difficult load balancing
• Number of tasks proportional to the size of the problem
– if not: problems utilizing additional processors
• Alternate partitions available?
Good decomposition
11
• Calculation of π by the standard numerical integration formula
• Consider numerical integration based on the rectangle method
– integral is approximated by the area of evenly spaced rectangular strips
– height of the strips is calculated as the value of the integrated function at the midpoint of the strips
Example: PI calculation
[Figure: graph of F(x) = 4/(1 + x²) on the interval [0.0, 1.0], the area under the curve approximated by rectangular strips; π = ∫₀¹ 4/(1 + x²) dx]
12
Sequential pseudocode:

set n (number of strips)
for each strip
    calculate the height y of the strip (rectangle) at its midpoint
    sum all y to the result S
endfor
multiply S by the width of the strips
print result
PI calculation – sequential algorithm
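The sequential pseudocode above can be sketched in Python (an illustration added to the transcript, not part of the original slides; the function name and the choice of n are arbitrary):

```python
# Midpoint (rectangle) rule for pi = integral of 4/(1+x^2) over [0, 1].
def pi_rectangle(n):
    width = 1.0 / n                           # width of each strip
    s = 0.0
    for i in range(n):
        x = (i + 0.5) * width                 # midpoint of strip i
        s += 4.0 / (1.0 + x * x)              # height at the midpoint
    return s * width                          # total area of the strips

print(pi_rectangle(1_000_000))
```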
13
Parallel pseudocode (for the task/channel model):

if master then
    set n (number of strips)
    send n to the workers
else // worker
    receive n from the master
endif
for each strip assigned to this task
    calculate the height y of the strip (rectangle) at its midpoint
    sum all y to the (partial) result S
endfor
if master then
    receive S from all workers
    sum all S and multiply by the width of the strips
    print result
else // worker
    send S to the master
endif
PI calculation – parallel algorithm
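The parallel pseudocode can be simulated in plain Python, with a loop standing in for the workers and a list standing in for the channels to the master (a sketch; the cyclic strip assignment is one possible choice, not fixed by the slides):

```python
# Master sets n; each "worker" sums its cyclically assigned strips and
# "sends" its partial result; the master combines and scales them.
def pi_parallel(n, workers=4):
    width = 1.0 / n
    partials = []                             # stand-in for worker->master channels
    for rank in range(workers):               # each pass plays one worker
        s = 0.0
        for i in range(rank, n, workers):     # strips assigned to this task
            x = (i + 0.5) * width
            s += 4.0 / (1.0 + x * x)
        partials.append(s)                    # "send S to the master"
    return sum(partials) * width              # master: sum all S and scale

print(pi_parallel(100_000))
```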
14
• Domain decomposition:
– primitive task – calculation of one strip height
• Functional decomposition:
– manager task: controls the computation
– worker task(s): perform the main calculation
• manager/worker technique (also called control decomposition)
• more or less a technical decomposition
• A perfectly/embarrassingly parallel problem: the (worker) processes are (almost) independent
Parallel PI calculation – partitioning
15
• Determination of the communication pattern among the primitive tasks
• Goal: Expose the information flow
• The tasks generated by partitioning are as a rule not independent – they cooperate by exchanging data
• Communication means overhead – minimize!
– not included in the sequential algorithm
• Efficient communication may be difficult to organize
– especially in domain-decomposed problems
Communication analysis
16
Categorization:
– local: between a small number of “neighbours”
– global: many “distant” tasks participate
– structured: regular and repeated communication patterns in place and time
– unstructured: communication networks are arbitrary graphs
– static: communication partners do not change over time
– dynamic: communication depends on the computation history and changes at runtime
– synchronous: communication partners cooperate in data transfer operations
– asynchronous: producers are not able to determine the data requests of consumers

The first item of each pair is to be preferred in parallel programs
Parallel communication
17
• Preferably no communication involved in parallel algorithm– if not: overhead decreasing parallel efficiency
• Tasks have comparable communication demands– if not: little scalability
• Tasks communicate only with a small number of neighbours– if not: loss of parallel efficiency
• Communication operations and computation in different tasks can proceed concurrently
– communication and computation can overlap
– if not: inefficient and nonscalable algorithm
Good communication
18
Example: Jacobi finite differences

Jacobi finite difference method
• Repeated update (in timesteps) of the values assigned to the points of a multidimensional grid
• In 2-D, the grid point i, j gets in timestep t+1 a value given by the formula (a weighted mean):

  X_{i,j}(t+1) = (4·X_{i,j}(t) + X_{i-1,j}(t) + X_{i+1,j}(t) + X_{i,j-1}(t) + X_{i,j+1}(t)) / 8

[Foster 1995]
19
Jacobi: parallel algorithm
• Decomposition (domain):
– primitive task – calculation of the weighted mean in one grid point
• Parallel code main loop:

for each timestep t
    send X_{i,j}(t) to each neighbour
    receive X_{i-1,j}(t), X_{i+1,j}(t), X_{i,j-1}(t), X_{i,j+1}(t) from neighbours
    calculate X_{i,j}(t+1)
endfor

• Communication:
– communication channels between neighbours
– local, structured, static, synchronous
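A minimal Python sketch of one Jacobi timestep on a small 2-D grid (an illustration assuming fixed boundary values; the names and grid size are arbitrary):

```python
# Jacobi update: X[i][j](t+1) = (4*X[i][j] + four neighbours) / 8,
# reading only timestep-t values and writing a fresh grid.
def jacobi_step(grid):
    n = len(grid)
    new = [row[:] for row in grid]            # boundary values are kept
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (4 * grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 8
    return new

grid = [[0.0] * 5 for _ in range(5)]
grid[0] = [1.0] * 5                           # "heated" top boundary
for _ in range(50):
    grid = jacobi_step(grid)
print(grid[2][2])
```

Because every update reads only the previous timestep, all grid points can be computed concurrently, matching the parallel main loop above.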
20
Example: Gauss-Seidel scheme

  X_{i,j}(t+1) = (4·X_{i,j}(t) + X_{i-1,j}(t+1) + X_{i+1,j}(t) + X_{i,j-1}(t+1) + X_{i,j+1}(t)) / 8

• More efficient in sequential computing
• Not easy to parallelize

[Foster 1995]
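For contrast, a Gauss-Seidel sweep can be sketched by updating the same grid in place, so already-updated neighbours (smaller i or j) feed into later points; this data dependence is what makes the scheme hard to parallelize (an illustrative sketch, not from the slides):

```python
# Gauss-Seidel sweep: same stencil as Jacobi, but the grid is updated
# in place, so points with smaller i or j already hold t+1 values.
def gauss_seidel_step(grid):
    n = len(grid)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            grid[i][j] = (4 * grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                          + grid[i][j - 1] + grid[i][j + 1]) / 8
    return grid

grid = [[0.0] * 5 for _ in range(5)]
grid[0] = [1.0] * 5                           # "heated" top boundary, as before
for _ in range(50):
    gauss_seidel_step(grid)
print(grid[2][2])
```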
21
• Process of grouping primitive tasks into larger tasks
• Goal: revision of the (abstract, conceptual) partitioning and communication to improve performance (and to simplify programming demands)
– choose a granularity appropriate to the target parallel computer
• Large numbers of fine-grained tasks tend to be inefficient because of great
– communication cost
– task creation cost
• the spawn operation is rather expensive
• Agglomeration increases granularity
– potential conflict with retaining flexibility and scalability [next slides]
• Closely related to mapping to processors
Agglomeration
22
• Measure characterizing the size and quantity of tasks
• Increasing granularity by combining several tasks into larger ones
– reduces communication cost
• less communication (a)
• fewer, but larger messages (b)
– reduces task creation cost
• fewer processes
• Agglomerate tasks that
– frequently communicate with each other
• increases locality
– cannot execute concurrently
• Consider also [next slides]
– surface-to-volume effects
– replication of computation/data
Agglomeration & granularity
[Quinn 2004]
23
• The communication/computation ratio decreases with increasing granularity:
– computation cost is proportional to the “volume” of the subdomain
– communication cost is proportional to the “surface”
• Agglomeration in all dimensions is most efficient
– reduces the surface for a given volume
– in practice more difficult to code
• Difficult with unstructured communication
• Ex.: Jacobi finite differences [next slide]
Surface-to-volume effects (1)
24
Surface-to-volume effects (2)
[Foster 1995]
Ex.: Jacobi finite differences – agglomeration

No agglomeration (one grid point per primitive task):
  comm/comp = 4 values communicated / 1 point computed = 4

Agglomeration into 4 × 4 blocks (16 grid points per task):
  comm/comp = 4·4 values communicated / 16 points computed = 1

4 > 1: agglomeration reduces the communication per point computed
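The surface-to-volume argument can be checked numerically: for a b × b block, computation grows as b² while communication grows as 4b (a sketch under the one-value-per-boundary-point assumption; not from the slides):

```python
# Communication/computation ratio of one agglomerated task that owns a
# b x b block of grid points: 4*b boundary values in, b*b points updated.
def comm_comp_ratio(b):
    comm = 4 * b                              # "surface" of the block
    comp = b * b                              # "volume" of the block
    return comm / comp

print([comm_comp_ratio(b) for b in (1, 4, 16)])
```

The ratio 4/b shrinks as the blocks grow, which is exactly the surface-to-volume effect.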
25
• Ability to make use of diverse computing environments
– good parallel programs are resilient to changes in processor count
• scalability – ability to employ an increasing number of tasks
• Too coarse granularity reduces flexibility
• Usual practical design: agglomerate to one task per processor
– can be controlled by a compile-time or runtime parameter
– with some MPS (PVM, MPI-2) on the fly (dynamic spawn)
• But consider also creating more tasks than processors:
– when tasks often wait for remote data: several tasks mapped to one processor permit overlapping computation and communication
– greater scope for mapping strategies that balance the computational load over available processors
• a rule of thumb: an order of magnitude more tasks
• Optimal number of tasks: determined by a combination of analytic modelling and empirical studies
Agglomeration & flexibility
• To reduce communication requirements, the same computation is repeated in several tasks
– compute once & distribute vs. compute repeatedly & don’t communicate – a trade-off
• Redundant computation pays off when its computational cost is less than the communication cost
– moreover, it removes dependences
• Ex.: summation of n numbers (located on separate processors) with distribution of the result
Replicating computation
[Figure: summation trees – gather-then-broadcast vs. replicated summation in every task]

Without replication: 2(n – 1) steps
• (n – 1) additions – the necessary minimum
With replication: (n – 1) steps
• n(n – 1) additions, (n – 1)² of them redundant
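The step counts above can be written down directly (a counting sketch, not a message-passing implementation; the interpretation of "steps" as sequential communication/addition rounds is an assumption):

```python
# Step counts for summing n values (one per task) so that every task
# ends up with the result, using the two strategies on the slide.
def without_replication(n):
    # gather: n-1 sequential additions, then n-1 sends of the result
    return {"steps": 2 * (n - 1), "additions": n - 1}

def with_replication(n):
    # every one of the n tasks sums all n values itself, concurrently
    return {"steps": n - 1, "additions": n * (n - 1)}

print(without_replication(8), with_replication(8))
```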
27
• Increased locality of communication
• Beneficial replication of computation
• Replication of data does not compromise scalability
• Similar computation and communication costs of the agglomerated tasks
• Number of tasks can scale with the problem size
• Fewer large-grained tasks are usually more efficient than many fine-grained tasks
Good agglomeration
28
• Process of assigning (agglomerated) tasks to processors for execution
• Goal: Maximize processor utilization, minimize interprocessor communication
– load balancing
• Concerns multicomputers only
– multiprocessors: automatic task scheduling
• Guidelines to minimize execution time (the two are conflicting):
– place concurrent tasks on different processors (increases concurrency)
– place tasks with frequent communication on the same processor (enhances locality)
• Optimal mapping is generally an NP-complete problem
– strategies, heuristics for special classes of problems available
Mapping
29
Basic mapping strategies
[Quinn 2004]
30
• Mapping strategy with the aim of keeping all processors busy during the execution of the parallel program
– minimization of idle time
• In a heterogeneous computing environment every parallel application may need (dynamic) load balancing
• Static load balancing
– performed before the program enters the solution phase
• Dynamic load balancing
– needed when tasks are created/destroyed at run-time and/or the communication/computation requirements of tasks vary widely
– invoked occasionally during the execution of the parallel program
• analyses the current computation and rebalances it
• may imply significant overhead!
Load balancing
[Figure: bad load balancing – tasks idle at a barrier; after [LLNL 2010]]
31
• Most appropriate for domain decomposed problems
• Representative examples [next slides]
– recursive bisection
– probabilistic methods
– local algorithms
Load-balancing algorithms
32
• Recursive cuts into subdomains of nearly equal computational cost while attempting to minimize communication
– allows the partitioning algorithm itself to be executed in parallel
Recursive bisection
Coordinate bisection:
• for irregular grids with local communication
• cuts into halves based on the physical coordinates of the grid points
• simple, but does not take communication into account
• unbalanced bisection: does not necessarily divide into halves, in order to reduce communication
• a lot of variants, e.g. recursive graph bisection
Irregular grid for a superconductivity simulation [Foster 1995]
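A toy recursive coordinate bisection can be sketched in Python: sort along one axis, cut at the median, and recurse with alternating axes (illustrative only; it assumes equal per-point cost and a power-of-two number of parts):

```python
# Recursive coordinate bisection: cut the point set at the median of
# the current axis, alternating axes, until `parts` subdomains remain.
def bisect(points, parts, axis=0):
    if parts == 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                       # nearly equal computational cost
    nxt = 1 - axis                            # alternate the cut direction
    return (bisect(pts[:mid], parts // 2, nxt)
            + bisect(pts[mid:], parts // 2, nxt))

pts = [(x, y) for x in range(8) for y in range(8)]
doms = bisect(pts, 4)
print([len(d) for d in doms])
```

Because the two halves recurse independently, the partitioning itself can run in parallel, as the slide notes.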
33
• Allocate tasks randomly to processors
– about the same computational load can be expected for a large number of tasks
• typically at least ten times as many tasks as processors required
• Communication is usually not considered
– appropriate for tasks with little communication and/or little locality in communication
• Simple, low cost, scalable
• Variant: cyclic mapping for spatial locality in load levels
– each of the p processors is allocated every pth task
• Variant: block-cyclic distributions
– blocks of tasks are allocated to processors
Probabilistic methods
[Figure: strips of F(x) = 4/(1 + x²) on [0.0, 1.0] assigned cyclically to processors #1, #2, #3]
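The cyclic and block-cyclic variants can be sketched as a one-line assignment rule (an illustration; `block=1` reduces to the plain cyclic mapping):

```python
# Task i is mapped to processor (i // block) % p; block=1 gives the
# cyclic mapping, larger blocks give block-cyclic distributions.
def block_cyclic(n_tasks, p, block=1):
    return [(i // block) % p for i in range(n_tasks)]

print(block_cyclic(8, 3))                     # cyclic over 3 processors
print(block_cyclic(8, 2, block=2))            # blocks of 2 over 2 processors
```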
34
• Compensate for changes in computational load using only local information obtained from a small number of “neighbouring” tasks
– do not require expensive global knowledge of the computational state
• If an imbalance exists (above a threshold), some computational load is transferred to the less loaded neighbour
• Simple, but less efficient than global algorithms
– slow when adjusting to major changes in load characteristics
• Advantageous for dynamic load balancing
Local algorithms
Local algorithm for a grid problem [Foster 1995]
35
• Suitable for a pool of independent tasks
– tasks represent stand-alone problems, contain solution code + data
– can be conceived as a special kind of data
• Often obtained from functional decomposition
– many tasks with weak locality
• Centralized or distributed variants
• Dynamic load balancing by default
• Examples:
– (hierarchical) manager/worker
– decentralized schemes
Task-scheduling algorithms
36
• Simple task-scheduling scheme
– sometimes called “master/slave”
• A central manager task is responsible for problem allocation
– maintains a pool (queue) of problems
• e.g. a search in a particular tree branch
• Workers run on separate processors and repeatedly request and solve assigned problems
– may also send new problems to the manager
• Efficiency:
– consider the cost of problem transfer
• prefetching, caching applicable
– the manager must not become a bottleneck
• Hierarchical manager/worker variant
– introduces a layer of submanagers, each responsible for a subset of workers
Manager/worker
[Wilkinson 1999]
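The manager/worker scheme can be sketched with threads standing in for worker processors and a shared queue standing in for the manager's problem pool (an illustration; squaring a number stands in for solving a problem):

```python
# Manager/worker sketch: the manager keeps a shared queue of problems;
# idle workers repeatedly pull the next problem and record its result.
import queue
import threading

def run_pool(problems, n_workers=3):
    pool = queue.Queue()                      # the manager's problem pool
    for p in problems:
        pool.put(p)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                p = pool.get_nowait()         # request a problem
            except queue.Empty:
                return                        # no work left: terminate
            r = p * p                         # stand-in for "solve problem"
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

print(run_pool(range(5)))
```

With a real message-passing system the queue would live in the manager task and the workers would request problems over channels, but the control flow is the same.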
37
• Task-scheduling without global management
• Task pool is a data structure distributed among many processors
• The pool is accessed asynchronously by idle workers
– various access policies: neighbours, at random, etc.
• Termination detection may be difficult
Decentralized schemes
38
• In general: Try to balance conflicting requirements for equitable load distribution and low communication cost
• When possible, use static mapping allocating each process to a single processor
• Dynamic load balancing / task scheduling can be appropriate when the number or size of tasks is variable or not known until runtime
• With centralized load-balancing schemes verify that the manager will not become a bottleneck
• Consider implementation cost
Good mapping
39
• Foster’s design methodology is conveniently applicable
– [Quinn 2004] makes use of it in the design of many parallel programs in MPI (OpenMP)
• In practice, all phases are often considered in parallel
• In bad practice, the conceptual phases are skipped
– machine-dependent design from the very beginning
• The methodology is a kind of “life-belt” (“fixed point”) when the development runs into trouble
Conclusions
40
Further study
• [Foster 1995] Designing and Building Parallel Programs
• [Quinn 2004] Parallel Programming in C with MPI and OpenMP
• In most textbooks a chapter like “Principles of parallel algorithm design”
– often concentrated on the mapping step
41
45
Example tree search