CS 770G - Parallel Algorithms in Scientific Computing
May 9, 2001
Lecture 2: Message-Passing I: Communication
References
• Parallel Computer Architecture: A Hardware/Software Approach, Culler, Singh, and Gupta, Morgan Kaufmann.
• Introduction to Parallel Computing: Design and Analysis of Algorithms, Kumar, Grama, Gupta, and Karypis, Benjamin Cummings.
Routing Mechanism for Static Networks
• Routing mechanism determines the path a message takes through the network to get from the source to the destination proc.
• Minimal routing: selects one of the shortest paths.
• Nonminimal routing: may use a longer path to avoid network congestion.
• Deterministic routing: determines a unique path based on the source and destination.
• Adaptive routing: determines a path dynamically to avoid network congestion.
Examples of Deterministic Minimal Routing
• XY-routing for a 2D mesh: route along the X-dimension first, then along the Y-dimension.
• E-cube routing for a hypercube: based on the Hamming distance between source and destination addresses, correcting differing bits starting with the least significant bit.
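The two routing schemes above can be sketched as small path-computing functions (function names and the coordinate/bit conventions are illustrative, not from the lecture):

```python
def xy_route(src, dst):
    """XY-routing on a 2D mesh: correct the X coordinate first, then Y.
    src and dst are (x, y) tuples; returns the list of nodes visited."""
    x, y = src
    path = [src]
    step = 1 if dst[0] > x else -1
    while x != dst[0]:          # move along the X-dimension
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:          # then move along the Y-dimension
        y += step
        path.append((x, y))
    return path

def ecube_route(src, dst, dims):
    """E-cube routing on a hypercube of 2^dims procs: flip differing
    address bits, starting with the least significant bit."""
    path = [src]
    cur = src
    for bit in range(dims):
        if (cur ^ dst) & (1 << bit):   # this address bit differs
            cur ^= (1 << bit)          # hop across that dimension
            path.append(cur)
    return path
```

Both routes are minimal: the path length equals the Manhattan distance on the mesh and the Hamming distance on the hypercube.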
Communication Cost
• Latency
– Sum of the time to prepare a message for transmission and the time taken by the message to traverse the network to its destination.
• Principal parameters:
– Startup time (ts): time required to handle a message (prepare the message, execute the routing algorithm, establish an interface between proc & router).
– Per-hop time (th): time taken by the header of a message to travel between 2 directly connected procs.
– Per-word transfer time (tw): if the channel bandwidth is r words/s, then tw = 1/r.
Switching Techniques I
Store-and-forward routing:
• When a message traverses a path with multiple links, each intermediate proc forwards the message only after it has received the entire message.
• Suppose message size = m words, number of links = l.
• Total traversal time = (th + m tw) l.
• Total communication time:
tcomm = ts + (th + tw m) l
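The store-and-forward cost model above can be written as a one-line function (the function name and example numbers below are illustrative):

```python
def t_store_and_forward(ts, th, tw, m, l):
    """tcomm = ts + (th + tw*m)*l: every one of the l links must carry
    the entire m-word message before the next hop can begin."""
    return ts + (th + tw * m) * l

# Example: ts=10, th=1, tw=2, m=100 words, l=4 links
# -> 10 + (1 + 2*100)*4 = 814 time units
```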
Switching Techniques II
Cut-through routing:
• A message travels in small units called flits (flow-control digits).
• As soon as a flit is received at an intermediate proc, it is passed on to the next proc.
• No need for a (large) buffer to store the entire message at each intermediate proc.
• Uses less memory bandwidth.
• Deadlock may occur! Can be avoided by using, e.g., XY-routing or E-cube routing.
• tcomm = ts + l th + tw m
Basic Communication Operations
• Point-to-point communication.
• One-to-all broadcast.
• All-to-all broadcast.
• One-to-all personalized.
• All-to-all personalized.
Point-to-Point Communication
• Sending a message from one proc to another.
• Store-and-forward routing (tcomm = ts + tw m l). Single message transfer time with p procs:
– Ring: ts + tw m p/2.
– Mesh: ts + 2 tw m √p/2.
– Hypercube: ts + tw m log p.
• Cut-through routing: tcomm = ts + tw m + th l.
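The per-topology times above are the store-and-forward formula tcomm = ts + tw m l evaluated at each topology's maximum hop count. A sketch, assuming p is a power of 2 (and a perfect square for the mesh); the function name is illustrative:

```python
import math

def p2p_time(ts, tw, m, p, topology):
    """Worst-case store-and-forward transfer time ts + tw*m*l, where l
    is the maximum hop count of the topology (per-hop time th is
    neglected, as on the slide)."""
    max_hops = {
        "ring": p // 2,                        # halfway around the ring
        "mesh": 2 * (math.isqrt(p) // 2),      # sqrt(p)/2 hops per dimension
        "hypercube": int(math.log2(p)),        # one hop per differing bit
    }[topology]
    return ts + tw * m * max_hops
```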
One-to-All Broadcast
• One proc sends a message to all (or a subset) of the procs.
• Reverse direction → all-to-one (e.g., reduction) communication.
• Store-and-forward routing (tcomm = ts + tw m l):
• Ring
– Each proc receives the message on one of its links and passes it on via its other link.
– Time = (ts + tw m) p/2.
• Mesh
– Ring broadcast along the rows, and then along the columns.
– Time = 2 (ts + tw m) √p/2.
One-to-All Broadcast (cont.)
• Hypercube
– The message is sent along one dimension at a time.
– Time = (ts + tw m) log p.
• Note: (under certain assumptions) one-to-all broadcast cannot be performed in less than (ts + tw m) log p. One reason is that, on a hypercube, every opportunity to send the message is used.
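The hypercube broadcast can be simulated directly: in step i, every proc that already holds the message forwards it across dimension i, so the informed set doubles each step and the broadcast finishes in log p steps (function name is illustrative):

```python
def hypercube_broadcast_steps(p, source=0):
    """One-to-all broadcast on a hypercube of p = 2^d procs.
    In step i, every proc holding the message forwards it across
    dimension i. Returns the number of steps until all p procs
    have the message."""
    have = {source}
    steps = 0
    while len(have) < p:
        have |= {proc ^ (1 << steps) for proc in have}  # double the set
        steps += 1
    return steps
```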
All-to-All Broadcast
• All p procs simultaneously initiate a broadcast.
• Reverse direction → reduction communication.
• Store-and-forward routing (tcomm = ts + tw m l):
• Ring
– Each proc first sends to one of its neighbors the data it needs to broadcast.
– Then, in each subsequent step, it forwards the data received from one neighbor to its other neighbor.
– Time = (ts + tw m) (p - 1).
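The ring algorithm above can be simulated to confirm that p - 1 forwarding steps suffice for every proc to collect all p pieces (function name and representation are illustrative):

```python
def ring_all_to_all(p):
    """All-to-all broadcast on a ring of p procs: in each of p-1 steps,
    every proc forwards the piece it received in the previous step to
    its other neighbor. Returns the set of pieces held by each proc."""
    held = [{i} for i in range(p)]    # each proc starts with its own piece
    outgoing = list(range(p))         # piece each proc forwards next
    for _ in range(p - 1):
        # proc i receives what its left neighbor (i-1) is forwarding
        incoming = [outgoing[(i - 1) % p] for i in range(p)]
        for i in range(p):
            held[i].add(incoming[i])
        outgoing = incoming           # forward the freshly received piece
    return held
```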
All-to-All Broadcast (cont.)
• Mesh
– First phase: all-to-all ring broadcast along the rows.
– time_x = (ts + tw m) (√p - 1).
– Second phase: all-to-all ring broadcast along the columns; each proc now forwards the √p pieces collected in the first phase.
– time_y = (ts + tw m √p) (√p - 1).
– Total time = 2 ts (√p - 1) + tw m (p - 1).
All-to-All Broadcast (cont.)
• Hypercube
– log p steps.
– In every step, pairs of procs exchange their data and double the size of the message to be transmitted in the next step.
– time_i = ts + tw m 2^(i-1).
– Total time = ts log p + tw m (p - 1).
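The doubling exchange can be simulated to verify both claims: log p steps suffice, and the total data sent per proc is 1 + 2 + ... + 2^(log p - 1) = p - 1 pieces, matching the tw m (p - 1) term (function name is illustrative; m = 1 word per piece is assumed):

```python
import math

def hypercube_all_to_all(p):
    """All-to-all broadcast on a hypercube of p = 2^d procs: in step i,
    each proc exchanges everything it holds with its neighbor across
    dimension i, so the message size doubles each step. Returns
    (number of steps, total pieces sent per proc)."""
    d = int(math.log2(p))
    held = [{proc} for proc in range(p)]
    pieces_sent = 0
    for i in range(d):
        pieces_sent += len(held[0])     # message size is 2^i in step i+1
        held = [held[proc] | held[proc ^ (1 << i)] for proc in range(p)]
    assert all(s == set(range(p)) for s in held)  # everyone has everything
    return d, pieces_sent
```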
One-to-All Personalized Comm.
• Scatter.
• A single proc sends a unique message to every other proc.
• Reverse direction → gather communication.
• Similar to all-to-all broadcast.
• In all-to-all broadcast, each proc receives m(p-1) words. In one-to-all personalized comm, the source proc sends m words to each of the other p-1 procs.
All-to-All Personalized Comm.
• All scatter.
• Each proc sends a unique message to every other proc.
• Used in the parallel fast Fourier transform, matrix transpose, etc.