Post on 30-Dec-2015
1
Parallel Computing 5
Parallel Application Design Ondřej Jakl
Institute of Geonics, Academy of Sci. of the CR
2
• Task/channel model
• Foster’s design methodology
• Partitioning
• Communication analysis
• Agglomeration
• Mapping to processors
• Examples
Outline of the lecture
3
• In general a very creative process
• Only methodical frameworks available
• Usually more alternatives to be considered
• The best parallel solution may differ from suggestions of the sequential approach
Design of parallel algorithms
4
• Introduced in Ian Foster’s Designing and Building Parallel Programs [Foster 1995]
– http://www-unix.mcs.anl.gov/dbpp
• Represents a parallel computation as a set of tasks
– a task is a program, its local memory, and a collection of I/O ports
• task can send local data values to other tasks via output ports
• task can receive data values from other tasks via input ports
• The tasks may interact with each other by sending messages through channels
– channel is a message queue that connects one task’s output port with another task’s input port
– a nonblocking (asynchronous) send and a blocking receive are assumed
• An abstraction close to the message passing model
Task/channel model (1)
5
Task/channel model (2)
[Figure after [Quinn 2004]: a directed graph of tasks (vertices) and channels (edges); each task consists of a program with input and output ports]
6
• Design stages:
1. partitioning into concurrent tasks
2. communication analysis to coordinate tasks
3. agglomeration into larger tasks with respect to the target platform
4. mapping of tasks to processors
• Stages 1, 2 are on the conceptual level; stages 3, 4 are implementation dependent
• In practice often considered simultaneously
Foster’s methodology [Foster 1995]
7
• Process of dividing the computation and the data into pieces – primitive tasks
• Goal: Expose the opportunities for parallel processing
• Maximal (fine-grained) decomposition for greater flexibility
• Complementary techniques:
– domain decomposition (data-centric approach)
– functional decomposition (computation-centric approach)
• Combinations possible
– usual scenario: primary decomposition – functional; secondary decomposition – domain
Partitioning (decomposition)
8
• Primary object of decomposition: the processed data
– first, the data associated with the problem is divided into pieces
• focus on the largest and/or most frequently accessed data
• pieces should be of comparable size
– next, the computation is partitioned according to the data on which it operates
• usually the same code for each task (SPMD – Single Program Multiple Data)
• may be non-trivial, may bring up complex mathematical problems
• Most often used technique in parallel programming
Domain (data) decomposition
3D grid data: one-, two-, three-dimensional decomposition [Foster 1995]
9
• Primary object of decomposition: the computation
– first, the computation is decomposed into disjoint tasks
• different codes for the tasks (MPMD – Multiple Program Multiple Data)
• methodological benefits: implies program structuring
– gives rise to simpler modules with interfaces
– c.f. object-oriented programming, etc.
– next, data is partitioned according to the requirements of the tasks
• data requirements may be disjoint, or overlap (→ communication)
• Sources of parallelism:
– concurrent processing of independent tasks
– concurrent processing of a stream of data through pipelining
• a stream of data is passed through a succession of tasks, each of which performs some operation on it
• MPSD – Multiple Program Single Data
• The number of tasks usually does not scale with the problem size
– for greater scalability, combine with domain decomposition on the subtasks
Functional (task) decomposition
Climate model [Foster 1995]
10
• More tasks (at least by an order of magnitude) than processors
– if not: little flexibility
• No redundancy in processing and data
– if not: little scalability
• Comparable size of tasks
– if not: difficult load balancing
• Number of tasks proportional to the size of the problem
– if not: problems utilizing additional processors
• Alternate partitions available?
Good decomposition
11
• Calculation of π by the standard numerical integration formula
• Consider numerical integration based on the rectangle method
– integral is approximated by the area of evenly spaced rectangular strips
– height of the strips is calculated as the value of the integrated function at the midpoint of the strips
Example: PI calculation
[Figure: graph of F(x) = 4/(1 + x²) on the interval [0.0, 1.0], the area under the curve approximated by rectangular strips; π = ∫₀¹ 4/(1 + x²) dx]
12
Sequential pseudocode:

set n (number of strips)
for each strip
    calculate the height y of the strip (rectangle) at its midpoint
    sum all y to the result S
endfor
multiply S by the width of the strips
print result
PI calculation – sequential algorithm
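The sequential pseudocode above can be sketched in Python (an illustration added to the transcript, not part of the original slides; the function name and the choice of n are arbitrary):

```python
# Midpoint (rectangle) rule for pi = integral of 4/(1+x^2) over [0, 1].
def pi_rectangle(n):
    width = 1.0 / n                           # width of each strip
    s = 0.0
    for i in range(n):
        x = (i + 0.5) * width                 # midpoint of strip i
        s += 4.0 / (1.0 + x * x)              # height at the midpoint
    return s * width                          # total area of the strips

print(pi_rectangle(1_000_000))
```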
13
Parallel pseudocode (for the task/channel model):

if master then
    set n (number of strips)
    send n to the workers
else // worker
    receive n from the master
endif
for each strip assigned to this task
    calculate the height y of the strip (rectangle) at its midpoint
    sum all y to the (partial) result S
endfor
if master then
    receive S from all workers
    sum all S and multiply by the width of the strips
    print result
else // worker
    send S to the master
endif
PI calculation – parallel algorithm
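The parallel pseudocode can be simulated in plain Python, with a loop standing in for the workers and a list standing in for the channels to the master (a sketch; the cyclic strip assignment is one possible choice, not fixed by the slides):

```python
# Master sets n; each "worker" sums its cyclically assigned strips and
# "sends" its partial result; the master combines and scales them.
def pi_parallel(n, workers=4):
    width = 1.0 / n
    partials = []                             # stand-in for worker->master channels
    for rank in range(workers):               # each pass plays one worker
        s = 0.0
        for i in range(rank, n, workers):     # strips assigned to this task
            x = (i + 0.5) * width
            s += 4.0 / (1.0 + x * x)
        partials.append(s)                    # "send S to the master"
    return sum(partials) * width              # master: sum all S and scale

print(pi_parallel(100_000))
```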
14
• Domain decomposition:
– primitive task – calculation of one strip height
• Functional decomposition:
– manager task: controls the computation
– worker task(s): perform the main calculation
• manager/worker technique (also called control decomposition)
• more or less a technical decomposition
• A perfectly/embarrassingly parallel problem: the (worker) processes are (almost) independent
Parallel PI calculation – partitioning
15
• Determination of the communication pattern among the primitive tasks
• Goal: Expose the information flow
• The tasks generated by partitioning are as a rule not independent – they cooperate by exchanging data
• Communication means overhead – minimize!
– not included in the sequential algorithm
• Efficient communication may be difficult to organize
– especially in domain-decomposed problems
Communication analysis
16
Categorization:
– local: between a small number of “neighbours”
– global: many “distant” tasks participate
– structured: regular and repeated communication patterns in place and time
– unstructured: communication networks are arbitrary graphs
– static: communication partners do not change over time
– dynamic: communication depends on the computation history and changes at runtime
– synchronous: communication partners cooperate in data transfer operations
– asynchronous: producers are not able to determine the data requests of consumers

The first item of each pair is to be preferred in parallel programs
Parallel communication
17
• Preferably no communication involved in parallel algorithm– if not: overhead decreasing parallel efficiency
• Tasks have comparable communication demands– if not: little scalability
• Tasks communicate only with a small number of neighbours– if not: loss of parallel efficiency
• Communication operations and computation in different tasks can proceed concurrently
– communication and computation can overlap
– if not: inefficient and nonscalable algorithm
Good communication
18
Example: Jacobi finite differences

Jacobi finite difference method
• Repeated update (in timesteps) of the values assigned to the points of a multidimensional grid
• In 2-D, the grid point i, j gets in timestep t+1 a value given by the formula (a weighted mean):

  X_{i,j}(t+1) = (4·X_{i,j}(t) + X_{i-1,j}(t) + X_{i+1,j}(t) + X_{i,j-1}(t) + X_{i,j+1}(t)) / 8

[Foster 1995]
19
Jacobi: parallel algorithm
• Decomposition (domain):
– primitive task – calculation of the weighted mean in one grid point
• Parallel code main loop:

for each timestep t
    send X_{i,j}(t) to each neighbour
    receive X_{i-1,j}(t), X_{i+1,j}(t), X_{i,j-1}(t), X_{i,j+1}(t) from neighbours
    calculate X_{i,j}(t+1)
endfor

• Communication:
– communication channels between neighbours
– local, structured, static, synchronous
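A minimal Python sketch of one Jacobi timestep on a small 2-D grid (an illustration assuming fixed boundary values; the names and grid size are arbitrary):

```python
# Jacobi update: X[i][j](t+1) = (4*X[i][j] + four neighbours) / 8,
# reading only timestep-t values and writing a fresh grid.
def jacobi_step(grid):
    n = len(grid)
    new = [row[:] for row in grid]            # boundary values are kept
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (4 * grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 8
    return new

grid = [[0.0] * 5 for _ in range(5)]
grid[0] = [1.0] * 5                           # "heated" top boundary
for _ in range(50):
    grid = jacobi_step(grid)
print(grid[2][2])
```

Because every update reads only the previous timestep, all grid points can be computed concurrently, matching the parallel main loop above.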
20
Example: Gauss-Seidel scheme

  X_{i,j}(t+1) = (4·X_{i,j}(t) + X_{i-1,j}(t+1) + X_{i+1,j}(t) + X_{i,j-1}(t+1) + X_{i,j+1}(t)) / 8

• More efficient in sequential computing
• Not easy to parallelize

[Foster 1995]
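For contrast, a Gauss-Seidel sweep can be sketched by updating the same grid in place, so already-updated neighbours (smaller i or j) feed into later points; this data dependence is what makes the scheme hard to parallelize (an illustrative sketch, not from the slides):

```python
# Gauss-Seidel sweep: same stencil as Jacobi, but the grid is updated
# in place, so points with smaller i or j already hold t+1 values.
def gauss_seidel_step(grid):
    n = len(grid)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            grid[i][j] = (4 * grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                          + grid[i][j - 1] + grid[i][j + 1]) / 8
    return grid

grid = [[0.0] * 5 for _ in range(5)]
grid[0] = [1.0] * 5                           # "heated" top boundary, as before
for _ in range(50):
    gauss_seidel_step(grid)
print(grid[2][2])
```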
21
• Process of grouping primitive tasks into larger tasks
• Goal: revision of the (abstract, conceptual) partitioning and communication to improve performance (and to simplify programming demands)
– choose a granularity appropriate to the target parallel computer
• Large numbers of fine-grained tasks tend to be inefficient because of great
– communication cost
– task creation cost
• the spawn operation is rather expensive
• Agglomeration increases granularity
– potential conflict with retaining flexibility and scalability [next slides]
• Closely related to mapping to processors
Agglomeration
22
• Measure characterizing the size and quantity of tasks
• Increasing granularity by combining several tasks into larger ones
– reduces communication cost
• less communication (a)
• fewer, but larger messages (b)
– reduces task creation cost
• fewer processes
• Agglomerate tasks that
– frequently communicate with each other
• increases locality
– cannot execute concurrently
• Consider also [next slides]
– surface-to-volume effects
– replication of computation/data
Agglomeration & granularity
[Quinn 2004]
23
• The communication/computation ratio decreases with increasing granularity:
– computation cost is proportional to the “volume” of the subdomain
– communication cost is proportional to the “surface”
• Agglomeration in all dimensions is most efficient
– reduces the surface for a given volume
– in practice more difficult to code
• Difficult with unstructured communication
• Ex.: Jacobi finite differences [next slide]
Surface-to-volume effects (1)
24
Surface-to-volume effects (2)
[Foster 1995]
Ex.: Jacobi finite differences – agglomeration

No agglomeration (one grid point per primitive task):
  comm/comp = 4 values communicated / 1 point computed = 4

Agglomeration into 4 × 4 blocks (16 grid points per task):
  comm/comp = 4·4 values communicated / 16 points computed = 1

4 > 1: agglomeration reduces the communication per point computed
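The surface-to-volume argument can be checked numerically: for a b × b block, computation grows as b² while communication grows as 4b (a sketch under the one-value-per-boundary-point assumption; not from the slides):

```python
# Communication/computation ratio of one agglomerated task that owns a
# b x b block of grid points: 4*b boundary values in, b*b points updated.
def comm_comp_ratio(b):
    comm = 4 * b                              # "surface" of the block
    comp = b * b                              # "volume" of the block
    return comm / comp

print([comm_comp_ratio(b) for b in (1, 4, 16)])
```

The ratio 4/b shrinks as the blocks grow, which is exactly the surface-to-volume effect.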
25
• Ability to make use of diverse computing environments
– good parallel programs are resilient to changes in processor count
• scalability – ability to employ an increasing number of tasks
• Too coarse granularity reduces flexibility
• Usual practical design: agglomerate to one task per processor
– can be controlled by a compile-time or runtime parameter
– with some MPS (PVM, MPI-2) on the fly (dynamic spawn)
• But consider also creating more tasks than processors:
– when tasks often wait for remote data: several tasks mapped to one processor permit overlapping computation and communication
– greater scope for mapping strategies that balance the computational load over available processors
• a rule of thumb: an order of magnitude more tasks
• Optimal number of tasks: determined by a combination of analytic modelling and empirical studies
Agglomeration & flexibility
• To reduce communication requirements, the same computation is repeated in several tasks
– compute once & distribute vs. compute repeatedly & don’t communicate – a trade-off
• Redundant computation pays off when its computational cost is less than the communication cost
– moreover, it removes dependences
• Ex.: summation of n numbers (located on separate processors) with distribution of the result
Replicating computation
[Figure: summation trees – gather-then-broadcast vs. replicated summation in every task]

Without replication: 2(n – 1) steps
• (n – 1) additions – the necessary minimum
With replication: (n – 1) steps
• n(n – 1) additions, (n – 1)² of them redundant
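The step counts above can be written down directly (a counting sketch, not a message-passing implementation; the interpretation of "steps" as sequential communication/addition rounds is an assumption):

```python
# Step counts for summing n values (one per task) so that every task
# ends up with the result, using the two strategies on the slide.
def without_replication(n):
    # gather: n-1 sequential additions, then n-1 sends of the result
    return {"steps": 2 * (n - 1), "additions": n - 1}

def with_replication(n):
    # every one of the n tasks sums all n values itself, concurrently
    return {"steps": n - 1, "additions": n * (n - 1)}

print(without_replication(8), with_replication(8))
```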
27
• Increased locality of communication
• Beneficial replication of computation
• Replication of data does not compromise scalability
• Similar computation and communication costs of the agglomerated tasks
• Number of tasks can scale with the problem size
• Fewer large-grained tasks are usually more efficient than many fine-grained tasks
Good agglomeration
28
• Process of assigning (agglomerated) tasks to processors for execution
• Goal: Maximize processor utilization, minimize interprocessor communication
– load balancing
• Concerns multicomputers only
– multiprocessors: automatic task scheduling
• Guidelines to minimize execution time (the two are conflicting):
– place concurrent tasks on different processors (increases concurrency)
– place tasks with frequent communication on the same processor (enhances locality)
• Optimal mapping is generally an NP-complete problem
– strategies, heuristics for special classes of problems available
Mapping
29
Basic mapping strategies
[Quinn 2004]
30
• Mapping strategy with the aim of keeping all processors busy during the execution of the parallel program
– minimization of idle time
• In a heterogeneous computing environment every parallel application may need (dynamic) load balancing
• Static load balancing
– performed before the program enters the solution phase
• Dynamic load balancing
– needed when tasks are created/destroyed at run-time and/or the communication/computation requirements of tasks vary widely
– invoked occasionally during the execution of the parallel program
• analyses the current computation and rebalances it
• may imply significant overhead!
Load balancing
[Figure: bad load balancing – tasks idle at a barrier; after [LLNL 2010]]
31
• Most appropriate for domain decomposed problems
• Representative examples [next slides]
– recursive bisection
– probabilistic methods
– local algorithms
Load-balancing algorithms
32
• Recursive cuts into subdomains of nearly equal computational cost while attempting to minimize communication
– allows the partitioning algorithm itself to be executed in parallel
Recursive bisection
Coordinate bisection:
• for irregular grids with local communication
• cuts into halves based on the physical coordinates of the grid points
• simple, but does not take communication into account
• unbalanced bisection: does not necessarily divide into halves, in order to reduce communication
• a lot of variants, e.g. recursive graph bisection
Irregular grid for a superconductivity simulation [Foster 1995]
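A toy recursive coordinate bisection can be sketched in Python: sort along one axis, cut at the median, and recurse with alternating axes (illustrative only; it assumes equal per-point cost and a power-of-two number of parts):

```python
# Recursive coordinate bisection: cut the point set at the median of
# the current axis, alternating axes, until `parts` subdomains remain.
def bisect(points, parts, axis=0):
    if parts == 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                       # nearly equal computational cost
    nxt = 1 - axis                            # alternate the cut direction
    return (bisect(pts[:mid], parts // 2, nxt)
            + bisect(pts[mid:], parts // 2, nxt))

pts = [(x, y) for x in range(8) for y in range(8)]
doms = bisect(pts, 4)
print([len(d) for d in doms])
```

Because the two halves recurse independently, the partitioning itself can run in parallel, as the slide notes.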
33
• Allocate tasks randomly to processors
– about the same computational load can be expected for a large number of tasks
• typically at least ten times as many tasks as processors required
• Communication is usually not considered
– appropriate for tasks with little communication and/or little locality in communication
• Simple, low cost, scalable
• Variant: cyclic mapping for spatial locality in load levels
– each of the p processors is allocated every pth task
• Variant: block-cyclic distributions
– blocks of tasks are allocated to processors
Probabilistic methods
[Figure: strips of F(x) = 4/(1 + x²) on [0.0, 1.0] assigned cyclically to processors #1, #2, #3]
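The cyclic and block-cyclic variants can be sketched as a one-line assignment rule (an illustration; `block=1` reduces to the plain cyclic mapping):

```python
# Task i is mapped to processor (i // block) % p; block=1 gives the
# cyclic mapping, larger blocks give block-cyclic distributions.
def block_cyclic(n_tasks, p, block=1):
    return [(i // block) % p for i in range(n_tasks)]

print(block_cyclic(8, 3))                     # cyclic over 3 processors
print(block_cyclic(8, 2, block=2))            # blocks of 2 over 2 processors
```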
34
• Compensate for changes in computational load using only local information obtained from a small number of “neighbouring” tasks
– do not require expensive global knowledge of the computational state
• If an imbalance exists (above a threshold), some computational load is transferred to the less loaded neighbour
• Simple, but less efficient than global algorithms
– slow when adjusting to major changes in load characteristics
• Advantageous for dynamic load balancing
Local algorithms
Local algorithm for a grid problem [Foster 1995]
35
• Suitable for a pool of independent tasks
– tasks represent stand-alone problems, contain solution code + data
– can be conceived as a special kind of data
• Often obtained from functional decomposition
– many tasks with weak locality
• Centralized or distributed variants
• Dynamic load balancing by default
• Examples:
– (hierarchical) manager/worker
– decentralized schemes
Task-scheduling algorithms
36
• Simple task-scheduling scheme
– sometimes called “master/slave”
• A central manager task is responsible for problem allocation
– maintains a pool (queue) of problems
• e.g. a search in a particular tree branch
• Workers run on separate processors and repeatedly request and solve assigned problems
– may also send new problems to the manager
• Efficiency:
– consider the cost of problem transfer
• prefetching, caching applicable
– the manager must not become a bottleneck
• Hierarchical manager/worker variant
– introduces a layer of submanagers, each responsible for a subset of workers
Manager/worker
[Wilkinson 1999]
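The manager/worker scheme can be sketched with threads standing in for worker processors and a shared queue standing in for the manager's problem pool (an illustration; squaring a number stands in for solving a problem):

```python
# Manager/worker sketch: the manager keeps a shared queue of problems;
# idle workers repeatedly pull the next problem and record its result.
import queue
import threading

def run_pool(problems, n_workers=3):
    pool = queue.Queue()                      # the manager's problem pool
    for p in problems:
        pool.put(p)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                p = pool.get_nowait()         # request a problem
            except queue.Empty:
                return                        # no work left: terminate
            r = p * p                         # stand-in for "solve problem"
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

print(run_pool(range(5)))
```

With a real message-passing system the queue would live in the manager task and the workers would request problems over channels, but the control flow is the same.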
37
• Task-scheduling without global management
• Task pool is a data structure distributed among many processors
• The pool is accessed asynchronously by idle workers
– various access policies: neighbours, at random, etc.
• Termination detection may be difficult
Decentralized schemes
38
• In general: Try to balance conflicting requirements for equitable load distribution and low communication cost
• When possible, use static mapping allocating each process to a single processor
• Dynamic load balancing / task scheduling can be appropriate when the number or size of tasks is variable or not known until runtime
• With centralized load-balancing schemes verify that the manager will not become a bottleneck
• Consider implementation cost
Good mapping
39
• Foster’s design methodology is conveniently applicable
– [Quinn 2004] makes use of it in the design of many parallel programs in MPI (OpenMP)
• In practice, all phases are often considered in parallel
• In bad practice, the conceptual phases are skipped
– machine-dependent design from the very beginning
• The methodology is a kind of “life-belt” (“fixed point”) when the development runs into trouble
Conclusions
40
Further study
• [Foster 1995] Designing and Building Parallel Programs
• [Quinn 2004] Parallel Programming in C with MPI and OpenMP
• In most textbooks a chapter like “Principles of parallel algorithm design”
– often concentrated on the mapping step
41
45
Example tree search