Communication operations Efficient Parallel Algorithms COMP308
Jan 07, 2016
Communication time
Communicating a message incurs 3 costs:
1. Static start-up time (ts): the time required to handle a message at the sending processor.
2. Per-hop time (th), with l the number of links that the message passes: after a message leaves a processor, it takes a finite amount of time to reach the next processor in its path.
3. Per-word transfer time (tw), with m the message size in words: if the channel bandwidth is r words per second, then each word takes time tw = 1/r to traverse the link.
There are 2 main communication schemes: “store and forward” vs “cut-through”.
In “store and forward” routing, when a message is traversing a path with multiple links, each intermediate node on the path forwards the message to the next node only after it has received and stored the entire message.
In “cut-through” routing, the message is divided into small units called flits, and an intermediate node does not wait for the entire message to arrive before forwarding it.
– A tracer is first sent from the source to the destination node to establish a connection.
– Once a connection has been established, the flits are sent one after the other. All flits follow the same path in a dovetailed fashion.
– As soon as a flit is received at an intermediate node, the flit is passed on to the next node.
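To compare the two schemes numerically, here is a minimal sketch of the usual textbook cost model built from ts, th, tw, m and l. The slides do not state these formulas, so they are an assumption: store and forward pays the full transfer m*tw on every link, while cut-through pays it only once. The function names are hypothetical.

```python
def store_and_forward_time(ts, th, tw, m, l):
    """Assumed model: every one of the l hops receives all m words before forwarding."""
    return ts + (m * tw + th) * l


def cut_through_time(ts, th, tw, m, l):
    """Assumed model: hops are pipelined, so m*tw is paid once and th once per link."""
    return ts + l * th + m * tw


if __name__ == "__main__":
    # Illustrative values only, not measurements.
    print(store_and_forward_time(ts=10, th=1, tw=2, m=100, l=4))
    print(cut_through_time(ts=10, th=1, tw=2, m=100, l=4))
```

Under this model the two schemes cost the same for l = 1, and the gap grows with both the message size and the number of links.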
One to All Broadcast
Initially, only the source processor has the data of size m that needs to be broadcast. When the procedure terminates, there are p copies of the initial data, one residing at each processor.
Broadcast on ring (Store and Forward)
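If the sender sends the message consecutively to the p-1 other processors, it takes p-1 steps.
By forwarding the message in both directions around the ring, so that every processor that has received the data passes it on, we can reduce this to p/2 steps.
E.g.: an 8-processor ring requires 4 steps.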
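The step count can be checked with a small simulation. This is a sketch under an assumed one-port model (each processor sends at most one message per step), so the source serves its clockwise neighbour first and its counter-clockwise neighbour one step later. The function name is hypothetical.

```python
def ring_broadcast_steps(p, source=0):
    """Count store-and-forward steps to inform all p processors on a ring."""
    informed = {source}
    right = source  # last processor reached in the clockwise direction
    left = source   # last processor reached in the counter-clockwise direction
    steps = 0
    while len(informed) < p:
        steps += 1
        # The clockwise arm advances by one processor in every step.
        right = (right + 1) % p
        informed.add(right)
        # The counter-clockwise arm starts in step 2, because the source
        # can send only one message per step in this model.
        if steps >= 2 and len(informed) < p:
            left = (left - 1) % p
            informed.add(left)
    return steps


if __name__ == "__main__":
    print(ring_broadcast_steps(8))
```

For even p this gives exactly p/2 steps; for odd p the same model needs p/2 rounded up.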
NS diagram for “broadcast on ring”
Ring network, Cut-Through routing
With cut-through routing, messages can be sent faster to nodes that are multiple hops away in the network. Using this, we send the message first to the farthest node.
In general, in a p-processor ring the source processor first sends the data to the processor at distance p/2; then both processors send the message to the processors at distance p/4 in the same direction, then to p/8, etc.
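The halving pattern described above can be written out as a schedule of (sender, receiver) pairs. This is a minimal sketch that assumes p is a power of two and that every message travels in the same direction around the ring; the function name is hypothetical.

```python
def ring_ct_broadcast_schedule(p, source=0):
    """Return one list of (sender, receiver) pairs per step."""
    if p <= 0 or p & (p - 1) != 0:
        raise ValueError("this sketch assumes p is a power of two")
    informed = [source]
    schedule = []
    dist = p // 2
    while dist >= 1:
        # Every processor that already holds the data sends it dist hops ahead.
        step = [(s, (s + dist) % p) for s in informed]
        schedule.append(step)
        informed += [dst for _, dst in step]
        dist //= 2
    return schedule


if __name__ == "__main__":
    for number, step in enumerate(ring_ct_broadcast_schedule(8), start=1):
        print(number, step)
```

The number of informed processors doubles in each step, so the schedule has log p steps.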
Broadcast on mesh (Store and Forward)
Most of the optimised communication algorithms on a mesh are simple extensions of their ring counterparts, by consecutively applying the ring algorithm on each dimension of the mesh.
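As an illustration of applying the ring algorithm dimension by dimension, the sketch below broadcasts on a q x q mesh in two phases: first along the source's row, then along all columns at once. It assumes wraparound links (each row and column is a ring) and that both directions advance in the same step; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def mesh_broadcast(q, source=(0, 0)):
    """Two-phase store-and-forward broadcast on a q x q wraparound mesh."""
    r0, c0 = source
    informed = {source}
    steps = 0
    # Phase 1: ring broadcast along the source's row.
    for d in range(1, q // 2 + 1):
        informed.add((r0, (c0 + d) % q))
        informed.add((r0, (c0 - d) % q))
        steps += 1
    # Phase 2: every processor in that row broadcasts along its own column.
    for d in range(1, q // 2 + 1):
        for c in range(q):
            informed.add(((r0 + d) % q, c))
            informed.add(((r0 - d) % q, c))
        steps += 1
    return steps, informed


if __name__ == "__main__":
    steps, informed = mesh_broadcast(4)
    print(steps, len(informed))
```

Under these assumptions the broadcast takes twice the step count of a single q-processor ring.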
Hypercube
The regular binary structure of the hypercube plays an important role in optimising communication.
Here, a broadcast is performed by sending the message along one dimension at each step. This results in d = log p steps.
log p is also the minimal number of steps on any network: if each processor sends at most one message per step, the number of processors holding the data can at most double in each step.
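A minimal sketch of the dimension-by-dimension broadcast, assuming processors are labelled with d-bit integers so that the neighbour across dimension k is obtained by flipping bit k. The function name and the order in which dimensions are used are illustrative choices.

```python
def hypercube_broadcast(d, source=0):
    """Return (schedule, informed) for a broadcast on a d-dimensional hypercube."""
    informed = {source}
    schedule = []
    for k in range(d):
        # Every informed processor sends across dimension k (flip bit k).
        step = [(node, node ^ (1 << k)) for node in sorted(informed)]
        schedule.append(step)
        informed.update(dst for _, dst in step)
    return schedule, informed


if __name__ == "__main__":
    schedule, informed = hypercube_broadcast(3)
    for number, step in enumerate(schedule, start=1):
        print(number, step)
```

After step k there are 2^k informed processors, so all p = 2^d processors hold the data after d steps.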
Hypercube
Important properties of the networks:
– Small degree,
– Small diameter,
– Regular recursive structure,
– Easy way to embed trees, etc.
Hypercube – two nodes are connected if their labels differ in precisely one bit
[Figure: hypercubes of dimension 1, 2, 3 and 4, with nodes labelled 0 to 1, 00 to 11, 000 to 111 and 0000 to 1111.]
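The connection rule (labels differ in precisely one bit) can be expressed with bitwise XOR. This is a small sketch with hypothetical function names, treating each label as an integer.

```python
def are_connected(a, b):
    """True if labels a and b differ in exactly one bit."""
    return bin(a ^ b).count("1") == 1


def neighbours(node, d):
    """All d neighbours of a node in a d-dimensional hypercube."""
    return [node ^ (1 << k) for k in range(d)]


if __name__ == "__main__":
    print(are_connected(0b0101, 0b0111))
    print([format(n, "03b") for n in neighbours(0b000, 3)])
```

Each node has exactly d neighbours, one per bit position, which is the small degree listed among the properties.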
Broadcast on hypercube (S&F)
Broadcast on ring (Cut-Through)
Broadcast on mesh (C-T)
Broadcast on binary tree (C-T)
Gossiping
All-to-All Communication
Gossiping on Ring (Store and Forward)
Gossiping on Mesh (Store and Forward)
Gossiping on Hypercube (S&F)
Gossiping on Ring (and Mesh), Cut-Through Routing
Each processor sends m(p-1) words of data because it has an m-word packet for every other processor.
The average distance that an m-word packet travels is
$\frac{\sum_{i=1}^{p-1} i}{p-1} = \frac{p}{2}$
Since there are p processors, each performing the same type of communication, the total traffic on the network is
$m(p-1) \times \frac{p}{2} \times p = \frac{m(p-1)p^2}{2}$
The total number of communication channels in the network to share this load is p, so the communication time is at least
$\frac{t_w \, m(p-1)p^2/2}{p} = \frac{t_w \, m(p-1)p}{2}$
Hence this procedure cannot be improved by using CT routing.
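The counting argument can be checked numerically. The sketch below follows the three quantities in the argument (average distance, total traffic, number of channels) and uses exact fractions to avoid rounding; the function name and the sample values are hypothetical.

```python
from fractions import Fraction


def ring_all_to_all_lower_bound(p, m, tw):
    """Lower bound on communication time from the traffic-counting argument."""
    # Average distance of a packet: (1 + 2 + ... + (p-1)) / (p-1) = p/2.
    avg_distance = Fraction(sum(range(1, p)), p - 1)
    # Each of the p processors sends m(p-1) words over that average distance.
    total_traffic = m * (p - 1) * avg_distance * p
    channels = p  # channels sharing the load, as stated in the argument
    return tw * total_traffic / channels


if __name__ == "__main__":
    print(ring_all_to_all_lower_bound(p=8, m=1, tw=1))
```

The bound grows as p squared whatever the routing scheme, because it counts only words crossing channels and not how they are forwarded.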
Gossiping on Hypercube (CT routing)