Page 1: Parallel Algorithm Design

Parallel Computing

Parallel Algorithm Design

Page 2: Parallel Algorithm Design

2010@FEUP Parallel Algorithm Design 2

Task/Channel Model

• Parallel computation = set of tasks
• Task
  • Program
  • Local memory
  • Collection of I/O ports
• Tasks interact by sending messages through channels

Page 3: Parallel Algorithm Design

Task/Channel Model

[Figure: a task/channel graph — nodes are tasks, directed edges are channels]

Page 4: Parallel Algorithm Design

Foster’s Design Methodology

Problem → Partitioning → Communication → Agglomeration → Mapping

1. Partitioning

2. Communication

3. Agglomeration

4. Mapping

Page 5: Parallel Algorithm Design

1. Partitioning

• Dividing computation and data into pieces
• Domain decomposition
  • Divide data into pieces
  • e.g., an array into sub-arrays (reduction); a loop into sub-loops (matrix multiplication); a search space into sub-spaces (chess)
• Functional decomposition
  • Divide computation into pieces
  • e.g., pipelines (floating-point multiplication), workflows (payroll processing)
• Determine how to associate data with computations
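A domain decomposition such as the sub-array example above can be sketched in Python; this helper is illustrative and not from the slides:

```python
def partition(data, num_tasks):
    """Domain decomposition: split a list into num_tasks near-equal
    contiguous chunks (chunk sizes differ by at most one element)."""
    base, extra = divmod(len(data), num_tasks)
    chunks, start = [], 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(data[start:start + size])
        start += size
    return chunks

print(partition(list(range(10)), 4))  # → [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Each chunk would become one primitive task in the partitioning step.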

Page 6: Parallel Algorithm Design

Partitioning

• The individual pieces are called primitive tasks.
• Desirable attributes for a partition:
  • Many more primitive tasks than processors on the target computer.
  • Tasks of roughly equal size (in computation and data).
  • Number of tasks increases with problem size.

Page 7: Parallel Algorithm Design

Example of domain decomposition

Page 8: Parallel Algorithm Design

Example of Functional Decomposition

Page 9: Parallel Algorithm Design

2. Communication

• Determine values passed among tasks
• Local communication
  • Task needs values from a small number of other tasks
  • Create channels illustrating data flow
• Global communication
  • Significant number of tasks contribute data to perform a computation
  • Don’t create channels for them early in design

Page 10: Parallel Algorithm Design

Desirable attributes for communication

• Balanced
  • Communication operations balanced among tasks
• Small degree
  • Each task communicates with only a small group of neighbors
• Concurrency
  • Tasks can perform communications concurrently
  • Tasks can perform computations concurrently

Page 11: Parallel Algorithm Design

3. Agglomeration

• Agglomeration is the process of grouping tasks into larger tasks to improve performance.
• Minimizing communication is typically a design goal here:
  • Grouping tasks that communicate with each other eliminates that communication; this is called increasing locality.
  • Grouping tasks can also allow us to combine multiple communications into one.
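A back-of-envelope sketch of the locality gain, assuming (hypothetically) a 1-D chain of primitive tasks where each neighboring pair shares one channel:

```python
def channels_after_agglomeration(n, p):
    """In a 1-D chain of n primitive tasks there are n - 1 channels.
    Agglomerating the chain into p contiguous blocks keeps only the
    channels that cross a block boundary (p - 1 of them); the rest
    become local memory accesses — increased locality."""
    return n - 1, p - 1

print(channels_after_agglomeration(1000, 4))  # → (999, 3)
```

Almost all inter-task communication disappears into the agglomerated tasks.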

Page 12: Parallel Algorithm Design

Desirable attributes of agglomeration

• Increased locality of the parallel algorithm
• Agglomerated tasks have similar computational and communication costs
• Number of tasks increases with problem size
• Number of tasks is as small as possible, yet at least as great as the number of processors on the target computer

Page 13: Parallel Algorithm Design

4. Mapping

• Mapping is the process of assigning agglomerated tasks to processors
• Here, we’re thinking of a distributed-memory machine
• If we choose the number of agglomerated tasks to equal the number of processors, the mapping is already done: each processor gets one agglomerated task

Page 14: Parallel Algorithm Design

Mapping Goals

• Processor utilization: we would like processors to have roughly equal computational and communication costs
• Minimize interprocessor communication
• This can be posed as a graph-partitioning problem:
  • Each partition should have roughly the same number of nodes
  • The partition should cut a minimal number of edges

Page 15: Parallel Algorithm Design

Partitioning a graph

[Figure: a task graph partitioned between processors P0 and P1]

Equalizing processor utilization and minimizing interprocessor communication are often competing forces.

Page 16: Parallel Algorithm Design

Mapping heuristics

• Static number of tasks
  • Structured communication
    • Constant computation time per task
      − Agglomerate tasks to minimize communication
      − Create one task per processor
    • Variable computation time per task
      − Cyclically map tasks to processors
  • Unstructured communication
    − Use a static load-balancing algorithm
• Dynamic number of tasks
  • Use a run-time task-scheduling algorithm
    − e.g., a master/slave strategy
  • Use a dynamic load-balancing algorithm
    − e.g., share load among neighboring processors; remap periodically
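The cyclic mapping heuristic above is a one-liner; a minimal sketch (the function name is our own):

```python
def cyclic_map(num_tasks, num_procs):
    """Cyclic mapping: task i goes to processor i mod p. This tends to
    balance load when per-task cost varies but is uncorrelated with
    the task index."""
    return {task: task % num_procs for task in range(num_tasks)}

mapping = cyclic_map(10, 3)
print(mapping[0], mapping[1], mapping[2], mapping[3])  # → 0 1 2 0
```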

Page 17: Parallel Algorithm Design

Example

• Example 1: boundary value problems

[Figure: a thin rod, insulated along its length, with both ends held in ice water]

Page 18: Parallel Algorithm Design

Boundary Value Problem

Heat conduction physics:

  ∂u/∂t = a² ∂²u/∂x²,  with a² = k/c

Discretization (u_{i,j} = temperature at position i and time j):

  ∂u/∂t ≈ (u_{i,j+1} − u_{i,j}) / Δt
  ∂²u/∂x² ≈ (u_{i+1,j} − 2 u_{i,j} + u_{i−1,j}) / (Δx)²

which give the explicit update rule

  u_{i,j+1} = r u_{i−1,j} + (1 − 2r) u_{i,j} + r u_{i+1,j},  where r = a² Δt / (Δx)²
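The update rule can be exercised sequentially before parallelizing it; this is a minimal sketch with illustrative parameter values (grid size, r, step count are not from the slides):

```python
def heat_step(u, r):
    """One explicit finite-difference time step:
    u[i] <- r*u[i-1] + (1 - 2r)*u[i] + r*u[i+1].
    The two boundary points (rod ends) are held fixed."""
    new = u[:]                          # copy; boundaries keep their values
    for i in range(1, len(u) - 1):
        new[i] = r * u[i-1] + (1 - 2*r) * u[i] + r * u[i+1]
    return new

# Rod initially at 100 degrees inside, 0 at both ends (ice water).
u = [0.0] + [100.0] * 8 + [0.0]
for _ in range(10):                     # 10 time steps with r = 0.25
    u = heat_step(u, 0.25)
```

In the parallel version each task owns a block of grid points and only the two edge values must cross a channel per step.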

Page 19: Parallel Algorithm Design

Boundary Value Problem

• Partitioning
  • One data item per grid point
  • Associate one primitive task with each grid point
  • Two-dimensional domain decomposition
• Communication
  • Identify the communication pattern between primitive tasks
  • Each interior primitive task has three incoming and three outgoing channels

Page 20: Parallel Algorithm Design

Boundary Value Problem

• Agglomeration and mapping

[Figure: agglomeration of the grid-point tasks]

Page 21: Parallel Algorithm Design

Model Analysis

• Sequential execution
  • χ – time to update one element
  • n – number of elements
  • m – number of iterations
  • Sequential execution time: m n χ
• Parallel execution
  • p – number of processors
  • λ – message latency, β – bandwidth; message time = λ + q/β ≈ λ, if q ≪ β
  • Parallel execution time: m (χ ⌈n/p⌉ + 2λ)
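The two time models are easy to compare numerically; a sketch with illustrative values of χ and λ (not from the slides):

```python
import math

def seq_time(n, m, chi):
    """Sequential model: m iterations, each updating all n elements."""
    return m * n * chi

def par_time(n, m, p, chi, lam):
    """Parallel model: each of m iterations updates ceil(n/p) local
    elements and exchanges 2 boundary messages of latency lam."""
    return m * (chi * math.ceil(n / p) + 2 * lam)

# Illustrative numbers: element updates are cheap, messages are not.
print(seq_time(n=10_000, m=100, chi=1e-6))               # ≈ 1.0 s
print(par_time(n=10_000, m=100, p=8, chi=1e-6, lam=1e-4))  # ≈ 0.145 s
```

Note that for small n/p the 2λ term dominates, which is why agglomeration matters.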

Page 22: Parallel Algorithm Design

Example – Parallel reduction

• Given an associative operator ⊕, compute

  a₀ ⊕ a₁ ⊕ a₂ ⊕ … ⊕ aₙ₋₁

• Examples
  • Add
  • Multiply
  • And, Or
  • Maximum, Minimum

Data decomposition: one primitive task per value to operate (one of the a’s)

Page 23: Parallel Algorithm Design

Parallel reduction

Further steps to reach a binomial tree

Page 24: Parallel Algorithm Design

Parallel reduction

4 2 0 7

-3 5 -6 -3

8 1 2 3

-4 4 6 -1

Page 25: Parallel Algorithm Design

Parallel reduction

1 7 -6 4

4 5 8 2

Page 26: Parallel Algorithm Design

Parallel reduction

8 -2

9 10

Page 27: Parallel Algorithm Design

Parallel reduction

17 8

Page 28: Parallel Algorithm Design

Parallel reduction

25

Binomial tree
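The 16-value example on the preceding slides can be checked with a small sketch of a binomial-tree reduction (the pairing order here differs from the slides’ 2-D layout, but by associativity the result is the same):

```python
def tree_reduce(values, op):
    """Binomial-tree reduction: in each step, each surviving task in
    the first half receives and combines its partner's value, halving
    the active tasks. Assumes a power-of-two number of values."""
    vals = list(values)
    while len(vals) > 1:
        half = len(vals) // 2
        vals = [op(vals[i], vals[i + half]) for i in range(half)]
    return vals[0]

data = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
print(tree_reduce(data, lambda x, y: x + y))  # → 25
```

The `while` loop runs log₂ p times, matching the ⌈log p⌉ communication steps counted below.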

Page 29: Parallel Algorithm Design

Agglomeration

[Figure: each agglomerated task computes a local sum before the tree reduction of the partial sums]

Page 30: Parallel Algorithm Design

Analysis

• Parallel running time
  • χ – time to perform the binary operation
  • λ – time to communicate a value via a channel
  • n values and p tasks
  • Time for each task to perform its inner calculations: (⌈n/p⌉ − 1) χ
  • Communication steps: ⌈log p⌉
  • After each receive there is one operation
  • Total time: (⌈n/p⌉ − 1) χ + ⌈log p⌉ (λ + χ)
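The total-time expression can be evaluated directly; a sketch with abstract (made-up) units for χ and λ:

```python
import math

def reduction_time(n, p, chi, lam):
    """Predicted binomial-tree reduction time: local combining of
    ceil(n/p) values costs (ceil(n/p) - 1)*chi, followed by
    ceil(log2 p) communicate-then-combine steps of (lam + chi)."""
    local = (math.ceil(n / p) - 1) * chi
    steps = math.ceil(math.log2(p))
    return local + steps * (lam + chi)

# n = 1024 values on p = 16 tasks, chi = 1, lam = 10 (abstract units):
print(reduction_time(1024, 16, 1, 10))  # → 107
```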

Page 31: Parallel Algorithm Design

Example: the N-body problem

[Figure: bodies B1, B2, B3; a body of mass m at position (x, y) moves with velocity v under the forces f1 and f2 exerted by the other bodies]

Page 32: Parallel Algorithm Design

The N-body problem

Page 33: Parallel Algorithm Design

The N-body problem partitioning

• Domain partitioning
  • Assume one task per particle
  • Each task has its particle’s position, velocity vector, and mass
• Iteration
  • Get positions and masses of all other particles
  • Compute new position and velocity
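One iteration of this scheme can be sketched sequentially (a toy 2-D gravity step; units, G, and the simple Euler update are illustrative assumptions, not the slides’ method):

```python
import math

def nbody_step(bodies, dt, G=1.0):
    """One iteration: each body uses all other positions/masses (what
    the all-gather provides in the parallel version), sums pairwise
    gravitational accelerations, then updates velocity and position.
    A body is a tuple (x, y, vx, vy, m)."""
    new = []
    for i, (x, y, vx, vy, m) in enumerate(bodies):
        ax = ay = 0.0
        for j, (xj, yj, _, _, mj) in enumerate(bodies):
            if i == j:
                continue
            dx, dy = xj - x, yj - y
            d = math.hypot(dx, dy)
            a = G * mj / (d * d)        # acceleration magnitude toward body j
            ax += a * dx / d
            ay += a * dy / d
        vx, vy = vx + ax * dt, vy + ay * dt
        new.append((x + vx * dt, y + vy * dt, vx, vy, m))
    return new

# Two equal bodies attract each other symmetrically.
bodies = [(0.0, 0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0, 1.0)]
bodies = nbody_step(bodies, dt=0.1)
```

The inner loop over all n bodies is why each task first needs every particle’s position and mass.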

Page 34: Parallel Algorithm Design

Gather and All-Gather operations

[Figure: gather — one task collects the values of the other p − 1 tasks, sequentially over p − 1 channels; all-gather — every task ends up with all p values]

Page 35: Parallel Algorithm Design

All-Gather

To avoid conflicts, all-gather is performed in ⌈log p⌉ steps, doubling the data held by each task in each step.

Communication time for n items: λ + n/β

With p tasks there are ⌈log p⌉ iterations, and the number of items doubles at each iteration:

  Σ_{i=1}^{⌈log p⌉} (λ + 2^{i−1} n / (β p)) = λ ⌈log p⌉ + n (p − 1) / (β p)
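The doubling scheme can be simulated; a sketch assuming p is a power of two and using XOR partner selection (one common way to realize recursive doubling, stated here as an assumption):

```python
def all_gather(local_values):
    """Recursive-doubling all-gather for p = 2^k tasks: in step s,
    task t exchanges its accumulated block with task t XOR 2^s, so
    the data each task holds doubles every step."""
    p = len(local_values)
    held = [[v] for v in local_values]   # what each task currently has
    step = 1
    while step < p:                      # log2(p) steps
        held = [held[t] + held[t ^ step] for t in range(p)]
        step *= 2
    return [sorted(h) for h in held]     # arrival order differs per task

print(all_gather([10, 20, 30, 40])[0])  # → [10, 20, 30, 40]
```

After log₂ p steps every task holds all p values, matching the cost sum above.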

Page 36: Parallel Algorithm Design

Analysis

• N-body problem, parallel version
  • n bodies and p tasks
  • m iterations over time

Total time excluding I/O (χ = time to update one body, given all positions):

  m ( λ ⌈log p⌉ + n (p − 1) / (β p) + χ n / p )

Page 37: Parallel Algorithm Design

Considering I/O

Reading or writing n items of data through an I/O channel takes λ_io + n/β_io.

In the N-body problem the initial values, once read, must be transmitted to the other tasks.

Page 38: Parallel Algorithm Design

Scatter operation

Improving the scatter:

1. The first task transmits n/2 items to another task
2. The 2 tasks transmit n/4 items to 2 other tasks
3. The 4 tasks transmit n/8 items to 4 other tasks
4. And so on …

Step i sends n/2^i items, and with p tasks there are ⌈log p⌉ steps:

  Σ_{i=1}^{⌈log p⌉} (λ + n / (2^i β)) = λ ⌈log p⌉ + n (p − 1) / (β p)
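The halving steps above can be simulated directly; a sketch assuming p is a power of two and that n divides evenly among the tasks:

```python
def scatter(data, p):
    """Recursive-halving scatter for p = 2^k tasks: task 0 starts with
    all n items; in each step every task holding data sends the upper
    half of its block to a task `step` positions away. After log2(p)
    steps each task holds n/p items."""
    held = {0: list(data)}
    step = p // 2                        # distance to the receiving task
    while step >= 1:
        for t in list(held):
            block = held[t]
            half = len(block) // 2
            held[t], held[t + step] = block[:half], block[half:]
        step //= 2
    return [held[t] for t in range(p)]

print(scatter(list(range(8)), 4))  # → [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Only log₂ p sequential steps are needed instead of p − 1 sends from task 0.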

Page 39: Parallel Algorithm Design

Analysis considering I/O

• Total time after m iterations:
  • Initial reading + scattering
  • Computing m iterations
  • Final gathering + writing

  2 (λ_io + n / β_io)
  + 2 (λ ⌈log p⌉ + n (p − 1) / (β p))
  + m (λ ⌈log p⌉ + n (p − 1) / (β p) + χ n / p)