
CS575 Parallel Processing

Feb 08, 2022

Transcript
Page 1: CS575 Parallel Processing

CS575 Parallel Processing

Lecture 3: Interconnection Networks
Wim Bohm, CSU

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

Page 2: CS575 Parallel Processing

Interconnection networks
- Connect
  - Processors, memories, I/O devices
- Dynamic interconnection networks
  - Connect any to any using switches or buses
  - Two types of switches
    - On / off: 1 input, 1 output
    - Pass through / cross over: 2 inputs, 2 outputs
- Static interconnection networks
  - Connect point to point using “wires”

Page 3: CS575 Parallel Processing

Dynamic Interconnection Network: Crossbar
- Connects e.g. p processors to b memories
- p * b switch matrix
  - p horizontal lines, b vertical lines
  - Cross points: on/off switches
  - At most one switch on per row and per column
  - Non-blocking: Pi to Mj does not block Pl to Mk (see the sketch below)
- Very costly, does not scale well
  - p * b switches, complex timing and checking
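
A minimal sketch of the non-blocking property, modeling the crossbar state as at most one closed switch per row and per column; the Crossbar class and its method names are illustrative, not from the slides:

    class Crossbar:
        """p*b crossbar: each processor (row) and memory (column) closes at most one switch."""
        def __init__(self, p, b):
            self.p, self.b = p, b
            self.mem_of = {}   # processor -> memory it is connected to
            self.proc_of = {}  # memory -> processor it is connected to

        def connect(self, proc, mem):
            # Succeeds whenever both endpoints are free: an existing Pi-Mj
            # connection never blocks a request between two other devices.
            if proc in self.mem_of or mem in self.proc_of:
                return False
            self.mem_of[proc] = mem
            self.proc_of[mem] = proc
            return True

    xb = Crossbar(4, 8)
    assert xb.connect(0, 3)
    assert xb.connect(1, 5)   # independent (proc, mem) pairs never block each other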

Page 4: CS575 Parallel Processing

Dynamic Interconnection Network: Bus
- Connects processors, memories, I/O devices
  - Master: can issue a request to get the bus
  - Slave: can respond to a request once the bus is granted
  - If there are multiple masters, we need an arbiter
- Sequential
  - Only one communication at a time
  - Bottleneck
  - But simple and cheap

Page 5: CS575 Parallel Processing

Crossbar vs bus
- Crossbar
  - Scalable in performance
  - Not scalable in hardware complexity
- Bus
  - Not scalable in performance
  - Scalable in hardware complexity
- Compromise: multistage network

Page 6: CS575 Parallel Processing

Multi-stage network
- Connects n components to each other
- Usually built from O(n log n) 2x2 switches
- Cheaper than crossbar
- Faster than bus
- Many topologies
  - e.g. Omega (book fig 2.12), Butterfly, ... (see the Omega sketch below)
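
As an illustration of how such a network routes, here is a sketch of the standard self-routing property of an Omega network with p = 2^k inputs: the wires between stages follow a perfect shuffle, and after stage i a message from src to dst sits at the label obtained by sliding a k-bit window from the source bits toward the destination bits. The function name is mine, not the book's:

    def omega_path(src, dst, k):
        """Labels occupied after each of the k shuffle-and-switch stages."""
        mask = (1 << k) - 1
        return [((src << i) | (dst >> (k - i))) & mask for i in range(1, k + 1)]

    # 8 inputs (k = 3): routing 010 -> 101 occupies labels 101, 010, 101
    print([format(v, "03b") for v in omega_path(0b010, 0b101, 3)])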

Page 7: CS575 Parallel Processing

Static Interconnection Networks
- Fixed wires (channels) between devices
- Many topologies
  - Completely connected
    - n(n-1)/2 channels
    - Static counterpart of crossbar
  - Star
    - One central PE for message passing
    - Static counterpart of bus
  - Multistage network with PE at each switch

Page 8: CS575 Parallel Processing

More topologies
- Necklace or ring
- Mesh / Torus
  - 2D, 3D
- Trees
  - Fat tree
- Hypercube
  - 2^n nodes in an nD hypercube
  - n links per node in an nD hypercube
  - Addressing: 1 bit per dimension

Page 9: CS575 Parallel Processing

Hypercube
- Two connected nodes differ in one bit
- An nD hypercube can be divided into
  - 2 (n-1)D cubes, in n ways
  - 4 (n-2)D cubes
  - 8 (n-3)D cubes
- To get from node s to node t
  - Follow the path determined by the differing bits (see the sketch below)
  - E.g. 01100 → 11000: 01100 → 11100 → 11000
- Question: how many (simple) paths from one node to another?
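
A minimal sketch of this routing rule, flipping the differing bits of s XOR t from the least significant one upward (this matches the E-routing rule on the Routing Mechanisms slide later; function names are illustrative):

    def neighbors(v, n):
        """All n neighbors of node v in an nD hypercube: flip one bit per dimension."""
        return [v ^ (1 << i) for i in range(n)]

    def ecube_path(s, t, n):
        """Route s -> t by flipping differing bits, least significant first."""
        path = [s]
        for i in range(n):
            if (s ^ t) >> i & 1:
                s ^= 1 << i
                path.append(s)
        return path

    # 01100 -> 11000 (flips bit 2, then bit 4): 01100 -> 01000 -> 11000
    print([format(v, "05b") for v in ecube_path(0b01100, 0b11000, 5)])

Note that the slide's example flips the most significant differing bit first; both orders give shortest paths, which is exactly what the question above is getting at.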

Page 10: CS575 Parallel Processing

Measures of static networks
- Diameter
  - Maximal shortest path between two nodes
  - Ring: ⌊p/2⌋, hypercube: log(p), 2D wraparound mesh: 2⌊sqrt(p)/2⌋
- Connectivity
  - Measure of multiplicity of paths between nodes
  - Arc connectivity
    - Minimum #arcs to be removed to create two disconnected networks
    - Ring: 2, hypercube: log(p), mesh: 2, wraparound mesh: 4

Page 11: CS575 Parallel Processing

More measures
- Bisection width
  - Minimal #arcs to be removed to partition the network in two (off by one node) equal halves
  - Ring: 2, complete binary tree: 1, 2D mesh: sqrt(p)
  - Question: bisection width of a hypercube?
- Channel width
  - #bits communicated simultaneously over a channel
- Channel rate / bandwidth
  - Peak communication rate (#bits/second)
- Bisection bandwidth
  - Bisection width * channel bandwidth

Page 12: CS575 Parallel Processing

Summary of measures: p nodes

Network                 Diameter          Bisection width   Arc connectivity   #links
Completely-connected    1                 p^2/4             p-1                p(p-1)/2
Star                    2                 ⌊p/2⌋ *           1                  p-1
Ring                    ⌊p/2⌋             2                 2                  p
Complete binary tree    2 log((p+1)/2)    1                 1                  p-1
Hypercube               log(p)            p/2               log(p)             p log(p)/2

* The textbook gives the bisection width of a star as 1, but the only way to split a star into (almost) equal halves is by cutting half of its links.
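
A small sketch that evaluates the formulas in this table (log is base 2; the function name is mine). Each row only applies to values of p for which the topology exists:

    import math

    def measures(p):
        """(diameter, bisection width, arc connectivity, #links) per topology.

        Assumes a valid p per topology, e.g. p = 2^d for the hypercube
        and p = 2^d - 1 for the complete binary tree.
        """
        lg = int(math.log2(p))
        return {
            "completely-connected": (1, p * p // 4, p - 1, p * (p - 1) // 2),
            "star":                 (2, p // 2, 1, p - 1),
            "ring":                 (p // 2, 2, 2, p),
            "complete binary tree": (2 * int(math.log2((p + 1) / 2)), 1, 1, p - 1),
            "hypercube":            (lg, p // 2, lg, p * lg // 2),
        }

    for name, row in measures(16).items():
        print(f"{name:22}", row)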

Page 13: CS575 Parallel Processing

Meshes and Hypercubes
- Mesh
  - Buildable, scalable, cheaper than hypercubes
  - Many (e.g. grid) applications map naturally
  - Cut-through routing works well in meshes
  - Commercial systems based on it
- Hypercube
  - Recursive structure nice for algorithm design
  - Often same O complexity as PRAMs
  - Often a hypercube algorithm is also good for other topologies, so a good starting point

Page 14: CS575 Parallel Processing

Embedding
- Relationship between two networks
- Studied by mapping one into the other
- Why?
- G(V,E) → G’(V’,E’)
  - graphs G, G’ with vertices V, V’ and edges E, E’
  - Map E → E’, V → V’
- Congestion k: k (>1) edges e map to one edge e’
- Dilation k: 1 edge e maps to k edges e’
- Expansion: |V’| / |V|
- Often we want congestion = dilation = expansion = 1

Page 15: CS575 Parallel Processing

Ring into hypercube
- Number the nodes of the ring s.t.
  - the Hamming distance between two adjacent nodes = 1
- Gray code provides such a numbering
  - Can be built recursively: binary reflected Gray code
  - 2 nodes: 0 1 OK
  - 2^k nodes:
    - take the Gray code for 2^(k-1) nodes
    - concatenate it with the reflected Gray code for 2^(k-1) nodes
    - put 0 in front of the first batch, 1 in front of the second
- Mesh can be embedded into a hypercube
  - (Toroidal) mesh = rings of rings

Page 16: CS575 Parallel Processing

Ring to hypercube cont’

Gray code sequences for 1, 2, and 3 bits:

  1 bit:  0 1
  2 bits: 00 01 11 10
  3 bits: 000 001 011 010 110 111 101 100

Recursive definition of the i-th codeword i → G(i,dim) (|| is concatenation):

  G(0,1) = 0
  G(1,1) = 1
  G(i, x+1) = 0 || G(i, x)                 if i < 2^x
            = 1 || G(2^(x+1) - i - 1, x)   if i >= 2^x
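
A direct transcription of this recursion into code, plus a check that consecutive codewords, including the wraparound pair that the ring embedding needs, differ in exactly one bit:

    def gray(i, x):
        """i-th codeword of the x-bit binary reflected Gray code, as a bit string."""
        if x == 1:
            return "01"[i]                        # G(0,1) = 0, G(1,1) = 1
        if i < 2 ** (x - 1):
            return "0" + gray(i, x - 1)           # first half: prefix 0
        return "1" + gray(2 ** x - i - 1, x - 1)  # reflected half: prefix 1

    ring = [gray(i, 3) for i in range(8)]
    print(ring)  # ['000', '001', '011', '010', '110', '111', '101', '100']
    for a, b in zip(ring, ring[1:] + ring[:1]):
        # adjacent ring nodes sit at Hamming distance 1 in the cube
        assert sum(x != y for x, y in zip(a, b)) == 1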

Page 17: CS575 Parallel Processing

2D Mesh into hypercube
- Note a 2D wraparound mesh consists of
  - Rows: rings
  - Cols: rings
- 2^r * 2^s wraparound mesh into a 2^(r+s)-node cube
  - Map node (i,j) onto node G(i,r) || G(j,s) (see the sketch below)
  - A row coincides with a sub-cube
  - A column coincides with a sub-cube
  - S.t. if adjacent in the mesh then adjacent in the cube
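
The mapping in code, reusing the gray() sketch above; the check confirms that mesh neighbors, including wraparound ones, land on cube neighbors:

    def mesh_to_cube(i, j, r, s):
        """Map node (i, j) of a 2^r x 2^s wraparound mesh to a cube address."""
        return gray(i, r) + gray(j, s)   # concatenate row and column Gray codes

    r, s = 2, 3
    for i in range(2 ** r):
        for j in range(2 ** s):
            here = mesh_to_cube(i, j, r, s)
            right = mesh_to_cube(i, (j + 1) % 2 ** s, r, s)  # ring neighbor in the row
            down = mesh_to_cube((i + 1) % 2 ** r, j, r, s)   # ring neighbor in the column
            for nbr in (right, down):
                assert sum(a != b for a, b in zip(here, nbr)) == 1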

Page 18: CS575 Parallel Processing

Complete binary tree into hypercube

- Map the tree root to any cube node
- Left child: same node as its parent
- Right child at level j: invert bit j of the parent's node

With root 000, the node labels level by level:

  level 0: 000
  level 1: 000 001
  level 2: 000 010 001 011
  level 3: 000 100 010 110 001 101 011 111
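
A sketch of this embedding that generates the levels above; the function name is mine, and the slide's "bit j" is counted from 1, so the right child at level j flips bit j-1 in 0-based terms:

    def tree_levels(root, depth):
        """Cube labels of a complete binary tree embedded level by level."""
        levels = [[root]]
        for j in range(1, depth + 1):
            nxt = []
            for v in levels[-1]:
                nxt.append(v)                   # left child: same cube node
                nxt.append(v ^ (1 << (j - 1)))  # right child: invert bit j (1-based)
            levels.append(nxt)
        return levels

    for lvl in tree_levels(0b000, 3):
        print([format(v, "03b") for v in lvl])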

Page 19: CS575 Parallel Processing

Routing Mechanisms
- Determine all source → destination paths
- Minimal: a shortest path
- Deterministic: one path per (src,dst) pair
  - Mesh: dimension ordered (XY routing, sketched below)
  - Cube: E-routing
    - Send along the least significant 1 bit in src XOR dst
- Adaptive: many paths per (src,dst) pair
  - Minimal: only shortest
- Why adaptive? Discuss.
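
E-routing is exactly the ecube_path sketch shown earlier. Here is a companion sketch of dimension-ordered XY routing on a 2D mesh, assuming integer coordinates and no wraparound (names are illustrative): resolve the X coordinate first, then Y.

    def xy_route(src, dst):
        """Dimension-ordered routing on a 2D mesh: move along x first, then y."""
        (x, y), (tx, ty) = src, dst
        path = [(x, y)]
        while x != tx:                 # step 1: walk the x dimension
            x += 1 if tx > x else -1
            path.append((x, y))
        while y != ty:                 # step 2: walk the y dimension
            y += 1 if ty > y else -1
            path.append((x, y))
        return path

    print(xy_route((0, 0), (2, 1)))   # [(0, 0), (1, 0), (2, 0), (2, 1)]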

Page 20: CS575 Parallel Processing

Routing (communication) Costs
- Three factors
  - Start-up time at the source (ts)
    - OS, buffers, error correction info, routing algorithm
  - Hop time (th)
    - The time it takes to get from one PE to the next
    - Also called node latency
  - Word transfer time (tw)
    - Inverse of channel bandwidth

Page 21: CS575 Parallel Processing

Two rout(switch)ing techniques
- Store and forward: O(m*l)
  - Strict: the whole message travels from PE to PE
  - m words, l links: tcomm = ts + (m*tw + th)*l
  - Often th is much less than m*tw: tcomm ≈ ts + m*l*tw
- Cut-through: O(m + l)
  - Non-strict: the message is broken into flits (packets)
  - Flits are pipelined through the network: tcomm = ts + l*th + m*tw
  - A circular path + finite flit buffers can give rise to deadlock

(A cost sketch for both techniques follows below.)
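
The two cost formulas side by side in code; the parameter values in the example are made up for illustration, not from the course:

    def store_and_forward(ts, th, tw, m, l):
        """tcomm = ts + (m*tw + th)*l : the whole m-word message crosses each of l links."""
        return ts + (m * tw + th) * l

    def cut_through(ts, th, tw, m, l):
        """tcomm = ts + l*th + m*tw : flits pipeline, so m and l add instead of multiply."""
        return ts + l * th + m * tw

    # Illustrative values: startup 100, 1 per hop, 2 per word, 64-word message, 5 links
    print(store_and_forward(100, 1, 2, 64, 5))  # 100 + (128 + 1)*5 = 745
    print(cut_through(100, 1, 2, 64, 5))        # 100 + 5 + 128    = 233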