

Lecture 6

• Objectives

• Communication Complexity Analysis

• Collective Operations
– Reduction
– Binomial Trees
– Gather and Scatter Operations

• Review Communication Analysis of Floyd’s Algorithm

Parallel Reduction Evolution

Binomial Trees

• A binomial tree with 2^k nodes is a subgraph of a k-dimensional hypercube

Finding Global Sum

A worked example: 16 values, one per process, combined pairwise so the number of partial sums halves at each step.

Initial values:
4 2 0 7
-3 5 -6 -3
8 1 2 3
-4 4 6 -1

After step 1 (8 partial sums):
1 7 -6 4
4 5 8 2

After step 2 (4 partial sums):
8 -2
9 10

After step 3 (2 partial sums):
17 8

After step 4, the global sum:
25
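A minimal sketch of this pairwise combining, written with the blocking MPI_Send/MPI_Recv calls this lecture series uses; the per-process value (just the rank here) is an arbitrary assumption:

```c
/* Binomial-tree reduction as in the global-sum example: each step
 * halves the number of active processes until process 0 holds the
 * global sum. Works for any p; easiest to trace when p is a power
 * of 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int sum = id;            /* each process contributes one value */

    /* In step s, a process whose id has bit s set sends its partial
       sum to partner id - 2^s and drops out of the tree. */
    for (int step = 1; step < p; step <<= 1) {
        if (id & step) {
            MPI_Send(&sum, 1, MPI_INT, id - step, 0, MPI_COMM_WORLD);
            break;           /* this process is done */
        } else if (id + step < p) {
            int partial;
            MPI_Recv(&partial, 1, MPI_INT, id + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partial;
        }
    }
    if (id == 0) printf("global sum = %d\n", sum);
    MPI_Finalize();
    return 0;
}
```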

Binomial Tree

Agglomeration

[Figure: after agglomeration each process first computes a local sum of the values mapped to it; the partial sums are then combined up the binomial tree.]
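In MPI the agglomerated version is exactly what MPI_Reduce expresses: each process sums its own block locally, then a single reduction (itself typically a binomial tree) combines the p partial sums. A sketch, with the block contents as placeholder assumptions:

```c
/* Agglomeration sketch: local sum first, then one MPI_Reduce
 * combines the p partial sums on process 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* assumed local block: a few values per process */
    int block[4] = { id, id + 1, id + 2, id + 3 };

    int local_sum = 0;                 /* agglomerated local work */
    for (int i = 0; i < 4; i++)
        local_sum += block[i];

    int global_sum;                    /* combined up the tree */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (id == 0) printf("global sum = %d\n", global_sum);
    MPI_Finalize();
    return 0;
}
```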

Gather

• Collect the pieces of a distributed data set onto a single process.

All-gather

• Collect the pieces of a distributed data set onto every process.

Complete Graph for All-gather

• Every process sends its block directly to each of the other p − 1 processes.

Hypercube for All-gather

• Processes exchange across one hypercube dimension per step, doubling the data each holds; finished in ⌈log p⌉ steps.
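A sketch of the corresponding MPI call; the two-int block size is an arbitrary assumption, and which algorithm (hypercube exchange, ring, etc.) MPI_Allgather uses internally is implementation-dependent:

```c
/* Each process contributes one block; after MPI_Allgather every
 * process holds all p blocks, as in the all-gather slides. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int mine[2] = { 2 * id, 2 * id + 1 };    /* this process's block */
    int *all = malloc(2 * p * sizeof(int));  /* room for every block */

    MPI_Allgather(mine, 2, MPI_INT, all, 2, MPI_INT, MPI_COMM_WORLD);

    if (id == 0) {
        for (int i = 0; i < 2 * p; i++) printf("%d ", all[i]);
        printf("\n");
    }
    free(all);
    MPI_Finalize();
    return 0;
}
```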

Analysis of Communication

• λ (lambda) is the latency
– the message delay
– the fixed overhead to send one message
• β (beta) is the bandwidth
– the number of data items that can be transmitted per unit time (e.g., bytes per second)

Sending a message with n data items costs λ + n/β. For example, with λ = 250 μs and β = 10 MB/s (the values used later for Floyd's algorithm), a 4,000-byte message costs 250 μs + 400 μs = 650 μs.

Communication Time for All-Gather

Hypercube: ⌈log p⌉ exchange steps; in step i each process sends the 2^{i−1} blocks of n/p items it has accumulated so far:

$$\sum_{i=1}^{\log p}\left(\lambda + \frac{2^{i-1}n}{p\beta}\right) = \lambda\log p + \frac{(p-1)n}{p\beta}$$

Complete graph: each process sends its n/p items directly to each of the other p − 1 processes:

$$(p-1)\left(\lambda + \frac{n/p}{\beta}\right) = (p-1)\lambda + \frac{(p-1)n}{p\beta}$$
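A small sketch that evaluates the two cost models side by side; the λ, β, n values are illustrative assumptions:

```c
/* Evaluate the two all-gather cost models derived above. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double lambda = 250e-6;   /* latency: 250 usec          */
    double beta   = 10e6;     /* bandwidth: 10M items/sec   */
    double n      = 1e6;      /* total items gathered       */

    for (int p = 2; p <= 64; p *= 2) {
        double hyper = lambda * log2(p) + n * (p - 1) / (p * beta);
        double graph = (p - 1) * (lambda + n / (p * beta));
        printf("p=%2d  hypercube=%.4f s  complete graph=%.4f s\n",
               p, hyper, graph);
    }
    return 0;
}
```

Both designs move the same (p − 1)n/p items per process; the hypercube's advantage is entirely in the latency term, ⌈log p⌉ message start-ups instead of p − 1.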

Adding Data Input

Scatter

• The inverse of gather: one process starts with all the data and delivers a distinct block to every process.

Scatter in log p Steps

Example with p = 4: the root starts with 12345678; after the first step two processes hold 1234 and 5678; after the second step the four processes hold 12, 34, 56, and 78.
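The corresponding MPI call, sketched with an assumed block size of two ints per process:

```c
/* Scatter sketch: process 0 starts with all p blocks and each
 * process ends with one, mirroring the 12345678 example above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *all = NULL;
    if (id == 0) {                       /* root holds 1, 2, ..., 2p */
        all = malloc(2 * p * sizeof(int));
        for (int i = 0; i < 2 * p; i++) all[i] = i + 1;
    }

    int mine[2];
    MPI_Scatter(all, 2, MPI_INT, mine, 2, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d got %d %d\n", id, mine[0], mine[1]);

    free(all);                           /* free(NULL) is safe */
    MPI_Finalize();
    return 0;
}
```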

Communication Time for Scatter

Hypercube: ⌈log p⌉ steps; the data forwarded halves at each step (the root first sends half of the n items, then a quarter, and so on), which sums to the same total as the all-gather:

$$\sum_{i=1}^{\log p}\left(\lambda + \frac{2^{i-1}n}{p\beta}\right) = \lambda\log p + \frac{(p-1)n}{p\beta}$$

Complete graph: the root sends a separate message of n/p items to each of the other p − 1 processes:

$$(p-1)\left(\lambda + \frac{n/p}{\beta}\right) = (p-1)\lambda + \frac{(p-1)n}{p\beta}$$

Recall Parallel Floyd’s Computational Complexity

• Innermost loop has complexity Θ(n)

• Middle loop executed at most ⌈n/p⌉ times

• Outer loop executed n times

• Overall computation complexity Θ(n³/p)

Floyd’s Communication Complexity

• No communication in inner loop
• No communication in middle loop
• Broadcast in outer loop: each broadcast of a row of n elements takes ⌈log p⌉(λ + n/β) time, i.e., complexity Θ(n log p)
• Executed n times
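A sketch of the loop structure being counted here (not the lecture's full program); it assumes a rowwise block-striped distribution where each process stores rows = ⌈n/p⌉ complete rows of the distance matrix, and uses the usual block-decomposition macros:

```c
/* Parallel Floyd's algorithm loop structure: one broadcast per
 * outer iteration, no communication in the middle or inner loops. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_OWNER(k, p, n) (((p) * ((k) + 1) - 1) / (n))

void parallel_floyd(int n, int rows, int **a, int id, int p) {
    int *tmp = malloc(n * sizeof(int));  /* broadcast buffer: row k */

    for (int k = 0; k < n; k++) {            /* outer loop: n iterations */
        int root = BLOCK_OWNER(k, p, n);
        if (id == root)                      /* copy out local row k */
            memcpy(tmp, a[k - BLOCK_LOW(id, p, n)], n * sizeof(int));

        /* the one broadcast per outer iteration: ceil(log p) messages */
        MPI_Bcast(tmp, n, MPI_INT, root, MPI_COMM_WORLD);

        for (int i = 0; i < rows; i++)       /* middle: ceil(n/p) times */
            for (int j = 0; j < n; j++)      /* inner: Theta(n) work */
                if (a[i][k] + tmp[j] < a[i][j])
                    a[i][j] = a[i][k] + tmp[j];
    }
    free(tmp);
}
```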

Execution Time Expression (1)

$$n\left\lceil\frac{n}{p}\right\rceil n\chi + n\lceil\log p\rceil\left(\lambda + \frac{4n}{\beta}\right)$$

• n: iterations of outer loop
• ⌈n/p⌉: iterations of middle loop
• n: iterations of inner loop
• χ: cell update time
• n: iterations of outer loop (one broadcast each)
• ⌈log p⌉: messages per broadcast
• λ + 4n/β: message-passing time (4 bytes per matrix element, so a row message is 4n bytes)

Accounting for Computation/communication Overlap

Note that after the 1st broadcast all the wait times overlap the computation time of Process 0.

Execution Time Expression (2)

$$n\left\lceil\frac{n}{p}\right\rceil n\chi + n\lambda\lceil\log p\rceil + \frac{4n}{\beta}$$

• n ⌈n/p⌉ n χ: iterations of outer loop × iterations of middle loop × iterations of inner loop × cell update time
• n λ ⌈log p⌉: iterations of outer loop × messages per broadcast × message latency, which is not hidden
• 4n/β: message transmission time not hidden by computation

Predicted vs. Actual Performance

Execution Time (sec)

Processes  Predicted  Actual
1          25.54      25.54
2          13.02      13.89
3           9.01       9.60
4           6.89       7.29
5           5.86       5.99
6           5.01       5.16
7           4.40       4.50
8           3.94       3.98

Model parameters: χ = 25.5 nsec, λ = 250 μsec, β = 10 MB/sec, n = 1000
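As a sanity check, a sketch that evaluates expression (2) at these parameters; small gaps versus the predicted column are expected from rounding (e.g., of χ):

```c
/* Evaluate Execution Time Expression (2) with chi = 25.5 ns,
 * lambda = 250 us, beta = 10 MB/s, n = 1000, for p = 1..8. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double chi = 25.5e-9, lambda = 250e-6, beta = 10e6;
    int n = 1000;

    for (int p = 1; p <= 8; p++) {
        int rows = (n + p - 1) / p;               /* ceil(n/p)   */
        double t = (double)n * rows * n * chi     /* computation */
                 + n * lambda * ceil(log2(p))     /* latencies   */
                 + 4.0 * n / beta;                /* transmission*/
        printf("p=%d  predicted=%.2f s\n", p, t);
    }
    return 0;
}
```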

Summary

• Two matrix decompositions
– Rowwise block striped
– Columnwise block striped

• Blocking send/receive functions
– MPI_Send
– MPI_Recv

• Overlapping communications with computations
