Ishikawa Laboratory, UNIVERSITY OF TOKYO. Stage-Distributed Time-Division Permutation.

http://www.k2.t.u-tokyo.ac.jp/

http://www.k2.t.u-tokyo.ac.jp/index-e.html

Stage-Distributed Time-Division Permutation Routing in a Multistage Optically Interconnected Fabric

Alvaro Cassinelli(1), Makoto Naruse(2), Alain Goulet(1), and Masatoshi Ishikawa(1)

(1) University of Tokyo, Dept. Information Physics and Computing, 7-3-1 Hongo Bunkyo-ku, Tokyo 113-0033, Japan.

(2) Communications Research Laboratory, 4-2-1 Nukui-kita, Koganei, Tokyo 184-8795, Japan.

[Figure: multistage optical hypercube interconnecting processor arrays; axes X, Y, Z, W]


PLAN of the presentation

I. Introduction: space-domain optical switching fabrics

II. Column-Control in Multistage Interconnection Networks (CCMINs)

III. Folded Optical Implementation of a transparent CCMIN

IV. Packet switching in a buffered CCMIN (“new”)

V. Conclusion and Further Research

VI. Some References


I. Introduction: the problem under study

How to design an efficient optical switching fabric for addressing:

1) Processor-memory bottleneck in supercomputers

2) Router bottleneck in the Next Generation Optical Internet

These problems have some similarities:

low latency required, synchronization, high bandwidth…

Traffic characteristics differ:

synchronous/asynchronous, regular/arbitrary request patterns, fixed/variable length of data bursts (granularity)

In fact, the above problems are case studies among a continuum of situations…


I. Introduction: optics inside routers

[Figure: scheme of a router: input interface, switching fabric, output interface, controller]

Where optics?

• to interconnect router subsystems

• at the (unbuffered) switching fabric (OXC)

• at the interfaces and controller (“all-optical routing”)

This presentation concerns:

SPACE-DOMAIN OPTICAL SWITCHING FABRICS


II. Column-Control in Multistage Interconnection Networks

II.1 Multistage Interconnection Networks

II.2 Column-Control in MINs

II.3 Permutation Capacity of CCMIN

II.4 Unbuffered CCMIN for permutation routing



“Basic” switching fabric:

Full-Crossbar (XC)

Advantages:

• Wide-sense non-blocking

• Low latency

Drawbacks:

• O(N^2) complexity (using 2x2 switches)

• Simultaneous switching noise

• Central controller bottleneck

• Poor modularity

Circuit Switching: good for low-latency memory-processor communications.

Packet Switching: Maximum throughput of 63% without buffers (uniform traffic).
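The 63% figure can be recovered from a standard uniform-traffic argument that the slide does not spell out: if each of the N inputs issues a request with probability p to a uniformly chosen output, a given output is requested by at least one input with probability 1 - (1 - p/N)^N, and only one of those requests is served. The Python sketch below evaluates this expression; the function name and the chosen port counts are illustrative only.

```python
import math

def crossbar_acceptance(n_ports: int, load: float = 1.0) -> float:
    """Fraction of requests accepted by an unbuffered N x N crossbar
    under uniform, independent requests (assumed traffic model)."""
    # Probability that a given output is requested by at least one input.
    busy_output = 1.0 - (1.0 - load / n_ports) ** n_ports
    # Each busy output serves exactly one request; normalise by the offered load.
    return busy_output / load

if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(n, round(crossbar_acceptance(n), 4))
    # Limit for N -> infinity at full load: 1 - 1/e
    print("limit", round(1.0 - math.exp(-1.0), 4))
```

At full load the value decreases from 0.75 for N = 2 towards 1 - 1/e ≈ 0.632 as N grows, which is the 63% quoted above.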


Alternative architecture:

Multistage Interconnection Network (MIN)

…It still has point-to-point full connectivity (and is “self-routing”).

Advantages:

• Full point-to-point connectivity

• O(N log2 N) complexity

• Distributed routing possible

• Fault tolerance possible (re-routing)

• Easier repairing thanks to modularity

Drawbacks:

• Internal blocking

• Large optical losses

• Large crosstalk



II.2 Column-Control in MINs

• Column-control simplifies hardware and control: each stage becomes a single “stage-global switch” driven by column-control lines…

• Two-state “global” switches with long-range interconnections are well suited for optical implementation (free-space, guided-wave)

• Possible physical merge of active switching and passive interconnection

Nice: the CCMIN is still capable of point-to-point connectivity
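To make the hardware and control savings concrete, the following Python sketch counts elements for the three fabrics discussed so far. It assumes a MIN with log2 N stages of N/2 two-by-two switches, and a CCMIN with one control signal per stage; the function and key names are illustrative.

```python
def fabric_costs(n_ports: int) -> dict:
    """Element counts for N-port fabrics (N must be a power of two)."""
    stages = n_ports.bit_length() - 1
    if 1 << stages != n_ports:
        raise ValueError("n_ports must be a power of two")
    switches_per_stage = n_ports // 2
    return {
        # Full crossbar: one crosspoint per input/output pair.
        "crossbar_crosspoints": n_ports ** 2,
        # Standard MIN: log2(N) stages of N/2 independent 2x2 switches.
        "min_2x2_switches": switches_per_stage * stages,
        "min_control_signals": switches_per_stage * stages,
        # Column control: all switches of a stage share one control line.
        "ccmin_control_signals": stages,
    }

if __name__ == "__main__":
    print(fabric_costs(64))
```

For N = 64 this gives 4096 crosspoints for the crossbar, 192 independently controlled switches for a standard MIN, and only 6 control signals for the column-controlled version.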


…if blocking was a problem for a MIN…

…things are much worse for the CCMIN

[Figure: local blocking vs. “global-stage” blocking]

As a consequence of “global-stage” blocking, permutation capacity of the CCMIN is extremely reduced.
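A minimal Python sketch of how small this permutation capacity is, assuming a cube-type CCMIN in which stage k applies either the identity or the cube permutation c_k to all lines at once. Under that assumption a permutation is admissible only if every source reaches its destination through the same bit mask, that is, destination = source XOR mask. The helper names are hypothetical.

```python
from itertools import permutations
from math import factorial

def ccmin_mask(perm):
    """Return the common XOR mask if `perm` is realisable by a cube-type
    CCMIN (every line needs the same switch state at every stage),
    otherwise None."""
    mask = perm[0]  # mask required by source 0
    if all(dst == src ^ mask for src, dst in enumerate(perm)):
        return mask
    return None

def count_admissible(n_bits):
    """Brute-force count of admissible permutations of 2**n_bits lines."""
    size = 1 << n_bits
    return sum(ccmin_mask(p) is not None for p in permutations(range(size)))

if __name__ == "__main__":
    n_bits = 3
    print(count_admissible(n_bits), "of", factorial(1 << n_bits))
```

For 8 lines only 8 of the 40320 possible permutations pass in a single circuit-switched configuration; any request pattern outside this set suffers “global-stage” blocking.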


However…

• Requests are serviced by circuit switching (or by on-the-fly packet switching)

• Input requests are independent Bernoulli trials (parameter: the input request probability per unit time)

• Uniform traffic: equal probability of requesting any output port

[Figure: probability of request acceptance vs. input request probability per unit time, for a 64x64 network; curves: crossbar, standard MIN, CCMIN]

• Crossbar: tends to 63% when N → ∞, because of HOL blocking.

• Standard MIN and CCMIN: both tend to 0 when N → ∞.

CCMIN cannot be used to service arbitrary requests in a circuit-switched manner!
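The slide does not say how the curves were computed. As a rough cross-check, the crossbar and standard-MIN curves can be approximated with two textbook models: the uniform-traffic crossbar formula, and Patel's stage-by-stage recursion for an unbuffered MIN built from 2x2 switches. The Python sketch below uses these models as an assumption; it does not attempt to model the CCMIN curve.

```python
def crossbar_acceptance(n_ports, load):
    """Unbuffered N x N crossbar, uniform independent requests."""
    return (1.0 - (1.0 - load / n_ports) ** n_ports) / load

def min_acceptance(n_stages, load):
    """Unbuffered MIN of 2x2 switches, Patel-style recursion:
    the load on a switch output is 1 - (1 - p/2)**2 for input load p."""
    p = load
    for _ in range(n_stages):
        p = 1.0 - (1.0 - p / 2.0) ** 2
    return p / load

if __name__ == "__main__":
    stages = 6  # 64 x 64 network: log2(64) stages
    for load in (0.2, 0.5, 1.0):
        print(load,
              round(crossbar_acceptance(64, load), 3),
              round(min_acceptance(stages, load), 3))
```

Adding stages to `min_acceptance` keeps lowering the acceptance probability, which is consistent with the statement that the MIN curve tends to 0 as N grows, while the crossbar settles near 63% at full load.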



II.4 Unbuffered CCMIN for permutation routing

4-D hypercube-connected multiprocessor…

Synchronous, weakly connected parallel computer

(all processors use the same permutation in each time slot)

[Figure: 16 processors (1 to 16) connected through a four-stage CCMIN with cube permutations C1, C2, C3, C4]

Reduced permutation capacity may still be useful for synchronous “permutation routing” in parallel computers (*)

(*) issue well studied in the past on “standard” blocking MINs


III. Folded Optical Implementation of a transparent CCMIN

III.1 Designing a CCMIN for circuit-switched permutation routing

III.2 “Folded” Optical Implementation

III.3 Experimental Demonstration

III.4 Possible applications


III.1 Designing a CCMIN for circuit-switched permutation routing

3-stage CC “Baseline Network”:

{c3, id} {c2, id} {c1, id}

• Number of permutations: 2^n (n=3)

• These are {c3, id}x{c2, id}x{c1, id}

• These are just the required permutations to implement the (3D) hypercube!

[Figure: cube permutations c1, c2, c3]

A multistage version of most parallel-computer direct-network topologies (hypercube, cube-connected-cycles, deBruijn, etc.) can be implemented as a CCMIN with properly designed inter-stage permutation modules.
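The set of products {c3, id}x{c2, id}x{c1, id} can be enumerated directly. The Python sketch below assumes that c_k complements bit k (counted from 1) of the node address and that permutations are stored as lists mapping source to destination; all helper names are illustrative.

```python
from itertools import product

def cube(k, n_bits):
    """Cube permutation c_k: node i goes to i with bit k (1-based) flipped."""
    return [i ^ (1 << (k - 1)) for i in range(1 << n_bits)]

def identity(n_bits):
    return list(range(1 << n_bits))

def compose(first, second):
    """Permutation obtained by applying `first`, then `second`."""
    return [second[x] for x in first]

def ccmin_permutations(n_bits):
    """All end-to-end permutations of a CCMIN whose stages are
    {c_n, id}, ..., {c_1, id}, one binary choice per stage."""
    result = set()
    for choice in product((False, True), repeat=n_bits):
        perm = identity(n_bits)
        for k, use_cube in zip(range(n_bits, 0, -1), choice):
            stage = cube(k, n_bits) if use_cube else identity(n_bits)
            perm = compose(perm, stage)
        result.add(tuple(perm))
    return result

if __name__ == "__main__":
    perms = ccmin_permutations(3)
    print(len(perms))
    for p in sorted(perms):
        print(p)
```

With n = 3 the eight stage settings give eight distinct permutations, and setting a single stage to c_k links every node to its hypercube neighbour along dimension k.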


III.2 “Folded” Optical Implementation

Optical Multistage Architecture Paradigm:

Multistage Interconnection Network architecture (fixed interconnections)

+

Dense & efficient 3D folded inter-stage optical interconnects

[Figure: cascaded shuffle stages]

Plane implementation:

• electronic

• planar lightwave circuit (PLC)

3D implementation:

• free space

• guided-wave


III.2 “Folded” Optical Implementation (slide not shown in main presentation)

Guided-wave (fiber-based) Modules vs. Free-Space

• fixed, no broadcast: optical fiber ok.

• better efficiency (and just like free-space optics, no cross-talk in 3D).

• No space-invariance imposed.

• Precise and robust alignment possible.

• Theoretically more volume efficient than free-space counterpart.

• “hard” to build? not fundamentally difficult (can be automated, permutation decomposition possible)

• Alignment of output and input

• Power dissipation: fundamental limit very far compared with electronics.

[Figure: prototype fiber module (fibers and holders) with input and output faces; “integrated” 2D folded perfect shuffle permutation module]

Waveguide arrays for fixed, point-to-point and space-variant interconnections are an interesting alternative to free-space optics


Prototype (non-integrated) 4x4 fiber module (slide not shown in main presentation)

Two holder prototypes: Zirconium, SiO2

Pitch: 250±5 µm

Multimode graded-index fibers: NA = 0.21 (core 50 µm, cladding 126 µm)

Transmission loss: 3 dB/km

Input: VCSEL (854±4 nm)

Output: CCD

[Figure: input and output faces of the module]


III.2 Multiple-permutation module

Besides density, reduced crosstalk and optical efficiency, there is another nice feature of the guided-wave approach to plane-to-plane optical interconnections…

Interleaving multiple permutations is possible

[Figure: 3D bi-permutation module built by stacking planar lightwave circuits (for instance), with input and output planes]

A small mechanical/optical perturbation produces a drastic change of the interconnection pattern

Multi-permutation modules as CCMIN’s “global-stage” switches (*)

(*) and not only CCMIN’s stage permutations


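Cube Permutations for N = 2^n

Cube permutation c_k (complements bit k of the node address):

$c_k: (b_n, \dots, b_{k+1}, b_k, b_{k-1}, \dots, b_2, b_1) \mapsto (b_n, \dots, b_{k+1}, \bar{b}_k, b_{k-1}, \dots, b_2, b_1)$

If k ≤ n/2, c_k exchanges only rows; if k > n/2, c_k exchanges only columns.

The modules are just the same, rotated.

[Figure: unfolded (example with n=4) and folded versions of c1, c2, c3, c4]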
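The row/column property can be checked with a small Python sketch. It assumes one particular plane mapping, which the slides do not fix (the folding is not unique): the lower n/2 address bits select the row and the upper n/2 bits select the column. The names `fold`, `cube` and `moves` are illustrative.

```python
N_BITS = 4                      # n = 4, i.e. 16 nodes on a 4 x 4 plane
SIDE = 1 << (N_BITS // 2)

def fold(address):
    """Assumed plane mapping: low half of the bits -> row,
    high half of the bits -> column."""
    return address % SIDE, address // SIDE   # (row, column)

def cube(k, address):
    """Cube permutation c_k: complement bit k (1-based) of the address."""
    return address ^ (1 << (k - 1))

def moves(k):
    """Which plane coordinate changes when c_k is applied to every node."""
    changed = set()
    for a in range(1 << N_BITS):
        (r0, c0), (r1, c1) = fold(a), fold(cube(k, a))
        if r0 != r1:
            changed.add("rows")
        if c0 != c1:
            changed.add("columns")
    return changed

if __name__ == "__main__":
    for k in range(1, N_BITS + 1):
        print(f"c{k} exchanges", sorted(moves(k)))
```

Under this mapping c1 and c2 only move nodes between rows and c3 and c4 only between columns, so a single row-exchanging module design, rotated by 90 degrees, can serve for the column-exchanging permutations.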




Unfolded hypercube and identity permutations

Row-Column Folded bi-permutation module

Prototype implementation using optical fibers

[Figure: 16 nodes (0 to 15) mapped onto a plane of rows and columns (plane mapping, or “folding” (*)); fiber module with 250 µm pitch, dummy fiber rows and a switching pitch of 125 µm, selecting Id or C2]

(*) not unique!


“Spanned” 4D hypercube (uses four bi-permutation modules)

four-dimensional hypercube-connected multiprocessor…

…topology is mapped on a plane

(processors interconnected through a 2D optical “socket”, or lying in a VLSI chip matrix)

[Figure: four bi-permutation modules {c1, id}, {c2, id}, {c3, id}, {c4, id} arranged around the processor plane]



III.2 Experimental Demonstration (slide not shown in main presentation)

[Figure: input (VCSEL array), two cascaded modules {c2, id} and {c1, id}, output (CCD camera); close-ups of the exit of the first module and the input of the second module]

Inter-module coupling efficiency: 1.7 dB (no additional optics, matching oil or antireflection coating).

Alignment tolerance: 5 µm (half peak power).

Commutation pitch: 125 µm

Validation of simple cascaded architecture.


Visualization of 2D permutation switching using a pair of modules (C1 or Id, C2 or Id)

Selected permutation product: Id x Id, C1 x C2, Id x C2, C1 x Id

[Figure: input (VCSEL array) and output (CCD) patterns for each permutation product]


III.2 Demonstration: electromechanical actuator

X-Y electromagnetically actuated device

Resonant frequency = 430 Hz (62.5 µm)

(can vibrate the module in both X and Y directions; in principle, permutation interleaving is possible in both directions)

(Micro-electromechanical systems (MEMS) actuators may also be an interesting alternative when switching latency in the millisecond range is tolerable)


III.2 Demonstration: electromechanical actuator (slide not shown in main presentation)

Resonant-frequency: round-robin permutation scheduling

[Figure: time axis divided into time slots: Interconnect 1, Interconnect 2, Interconnect 3, …, Interconnect N]


III.2 Demonstration: electromechanical actuator (slide not shown in main presentation)

Input: slow row/column scan of VCSEL array

No electromagnetic actuation: fixed Identity permutation.

Electromagnetic actuation: Identity & Cube2 permutations alternate at 860 Hz.


III.2 Demonstration: electromechanical actuator

[Figure: actuator position and photodetector signal vs. time; marked interval: 200 µs]

Input: 635 nm laser modulated at 500 MHz

Output: high-speed photodetector

• Switching latency between interconnections ≈ 0.96 ms (*)

• Time slot (3 dB) ≈ 200 µs

With a 10 Gb/s optical link, the burst size is 2 Mbits per channel (every millisecond), i.e. an average bandwidth of about 2 Gb/s per channel.

(*) MEMS routers: ms range.
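The burst-size and average-bandwidth figures can be checked with a few lines of arithmetic. The Python sketch below reads the time slot as 200 µs, the value that is consistent with a 2 Mbit burst on a 10 Gb/s link; the variable names are illustrative.

```python
LINK_RATE = 10e9          # bit/s, the 10 Gb/s link assumed on the slide
TIME_SLOT = 200e-6        # s, usable (3 dB) time slot, read as 200 microseconds
SWITCH_LATENCY = 0.96e-3  # s, latency between two interconnections

# Bits that fit into one time slot on one channel.
burst_bits = LINK_RATE * TIME_SLOT

# One burst per (slot + latency) period, and the slide's rounded
# "every millisecond" version of the same estimate.
period = TIME_SLOT + SWITCH_LATENCY
average_rate = burst_bits / period
rounded_rate = burst_bits / 1e-3

if __name__ == "__main__":
    print(f"burst size       : {burst_bits / 1e6:.2f} Mbit")
    print(f"period           : {period * 1e3:.2f} ms")
    print(f"average rate     : {average_rate / 1e9:.2f} Gb/s")
    print(f"rounded (per ms) : {rounded_rate / 1e9:.2f} Gb/s")
```

The period of about 1.16 ms also appears consistent with the 860 Hz alternation reported for the actuator (1/860 Hz ≈ 1.16 ms), and rounding it to one millisecond gives the 2 Gb/s quoted above.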


III.4 Possible applications of an optical CCMIN

Possible computing applications:

• The present system is not usable for typical memory-processor communications, which require low latencies (< 100 ns), unless other switching hardware is used (acousto-optic cells: µs range / electro-optical materials: ns range)

• If the processing time is long (so that a slow switching latency is tolerable) and the “bursts” of data are large, the electromechanical system may be used (FFT, large database retrieval, ?…)

Communication networks:

• burst switching at the WAN level (ms range reconfiguration times).

• scientific-dedicated, transparent networks with long holding times and high bandwidth (TransLight, GLIF). MEMS switches are currently used (reconfiguration times in the range of a second are ok). An optical GSMIN may be used to regularly provide interconnection configurations.

• if the switching time is reduced, it can be used to perform cyclic permutation scheduling in a virtual output queued (VOQ) switch, leading to 100% throughput (Stanford “Tiny-Tera Switch”)


Burst Interconnects

[Figure: timeline alternating one-stage computation (ex. 1 ms) with burst interconnection slots: Interconnect 1, Interconnect 2, …]

Burst interconnection within “short” time slot (Ex. 10 Gbps, 100 nsec → 1 kbit)

Interconnection switching interval (Ex. 1 ms) = one-stage computation time

…Slow switching may be okay



IV. Packet switching in a buffered CCMIN

IV.1 Buffering in blocking networks

IV.2 FIFO Buffered CCMIN architecture

IV.3 Performance evaluation

IV.4 Delay-line “buffered” architecture


IV.1 Buffering for packet switching

Blocking is a serious drawback for circuit switching

…Less serious for packet switching

Buffering is a solution adopted in “usual” MINs…

• Unbuffered networks (even wide-sense non-blocking) suffer from HOL blocking: buffering is unavoidable.

• Input queues, output queues, virtual output queues and internal buffering have been explored in crossbars as well as in MINs;

• However, an advantage of buffered MINs over buffered crossbars is that the stage-distributed switching marries well with the distribution of buffering (thus avoiding large buffers)

…how much is a CCMIN improved by buffering?


IV.2 FIFO Buffered CCMIN architecture

[Figure: buffered CCMIN from input to output with inter-stage FIFO buffers; each stage has Buffer 1 … Buffer N, an arbitration block and a global switch GS-En(k); labels: total length of buffer, depth of analysis, length of transferred packets/cycle]

Why may this architecture compare well with “standard” buffered MINs?

• For uniform traffic, at each stage half of the packets wait, and half pass: individual switch/buffer control is, presumably, not really required…

What’s more:

• Arbitration for configuring the Global Switches may not be necessary at all!


IV.3 Performance: global control vs. local control

Seven-stage, 128x128 input/output fabrics (rem: inter-stage transfer with maximum speed-up equal to the size of the buffer)

[Figure: probability of packet acceptance as a function of buffer size (0 to 6 packets); curves: crossbar, standard MIN, Global Switched MIN]

• GSMIN performance improves faster with buffer size

• For buffer size = 5 packets, equivalent performances

• For buffer size = 3 packets, performance is better than that of the crossbar

Performance of Global Switched MIN compares very well with that of a standard MIN.

IV.3 Performance: global control with blind alternate

“Blind” switch alternation of a GSMIN

[Figure: probability of packet acceptance as a function of buffer size (0 to 6 packets); curves: crossbar, “fair” switching, “blind” alternate]

As expected, “blind” alternation of switch states gives the same performance as a “fair” switch selection (for uniform traffic).

This is very interesting, because it means that a Standard MIN can be operated “blindly” if traffic is uniform enough.

The interconnection scheduling bottleneck (CLOS, etc.) is eliminated by using a Time-Division Permutation Routing strategy.
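The idea of “blind” alternation can be illustrated with a toy discrete-time model. The Python sketch below is not the simulator behind the plotted results: it assumes cube-permutation stages, one FIFO per line in front of each stage, at most one packet transferred per FIFO and per cycle (no speed-up), a packet dropped at the input when the first FIFO is full, and an arbitrary phase for the alternation. All names and parameters are illustrative.

```python
import random
from collections import deque

def simulate_blind_ccmin(n=4, buffer_size=2, load=0.5, cycles=2000, seed=0):
    """Toy n-stage cube CCMIN with FIFO buffers and blind alternation.
    Returns (offered, accepted, delivered, misrouted)."""
    rng = random.Random(seed)
    size = 1 << n
    # buffers[k][p]: FIFO in front of stage k on line p, storing destinations.
    buffers = [[deque() for _ in range(size)] for _ in range(n)]
    offered = accepted = delivered = misrouted = 0
    for t in range(cycles):
        # Serve the last stage first so that freed slots can be reused upstream
        # and no packet crosses more than one stage per cycle.
        for k in reversed(range(n)):
            cross = (t + k) % 2 == 1       # blind alternation (assumed phase)
            bit = 1 << k
            for p in range(size):
                queue = buffers[k][p]
                if not queue:
                    continue
                dest = queue[0]
                needs_cross = bool((p ^ dest) & bit)
                if needs_cross != cross:
                    continue               # head packet waits for the other state
                nxt = p ^ bit if cross else p
                if k == n - 1:
                    queue.popleft()
                    delivered += 1
                    misrouted += nxt != dest
                elif len(buffers[k + 1][nxt]) < buffer_size:
                    buffers[k + 1][nxt].append(queue.popleft())
        # New requests: Bernoulli arrivals with uniformly random destinations.
        for p in range(size):
            if rng.random() < load:
                offered += 1
                if len(buffers[0][p]) < buffer_size:
                    buffers[0][p].append(rng.randrange(size))
                    accepted += 1
    return offered, accepted, delivered, misrouted

if __name__ == "__main__":
    for b in (1, 2, 3, 5):
        offered, accepted, _, _ = simulate_blind_ccmin(buffer_size=b, load=0.8)
        print(b, round(accepted / offered, 3))
```

Because each stage applies a permutation to all lines at once, two packets never compete for the same downstream FIFO within a cycle, so the model needs no arbitration. A packet at the head of a FIFO simply waits until the alternating state matches the one it needs.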


IV.4 Delay-line “buffered” architecture

Reliable optical memories are still too difficult to implement...

What about just delaying packets?

(since there are only two states per stage, even a single delay line may give good performance)

[Figure: CCMIN from input to output with a switch and a delay-line “buffer” at each stage]

… we didn’t study a “standard” MIN with delay-lines



IV.4 Performance of a delay-line “buffered” architecture

[Figure: probability of packet acceptance as a function of buffer size (0 to 6 packets); curves: crossbar, Global Switched MIN, delay-line]

Using a single selectable delay per channel and per stage, performance lies somewhere in between that of the one-packet and two-packet FIFO buffered architectures.

Blind alternation of global switch states is assumed

(we didn’t study a “standard” MIN with delay-lines)


V. Conclusion

V.1 Results

V.2 Further Research


V.1 Conclusion

Summarizing:

• Column-Control simplifies MIN hardware and control;

• Column-Controlled MIN can be efficiently implemented using dense plane-to-plane optical interconnections;

• Column-Controlled MIN may have enough permutation capacity for specific applications (highly parallel algorithms);

• Column-Controlled MIN can be used for packet switching if buffered, giving roughly the same performance as “standard” MINs;

• Path-selection mechanism may be “blind” (i.e. round-robin, time-division permutation routing) without appreciable degradation of performance.


V.2 Further Research

On transparent circuit-switched CCMINs:

• An arbitrary permutation request may be serviced by multiplexing in time the available set of permutations. This needs input buffers and speed-up (i.e. short switching latency). This has been explored in standard MINs using 2x2 switches…

• Design of “active” modules, and multi-function modules (containing more than two permutations, but also other optical functions, e.g. optical delay lines)

On buffered packet-switched CCMINs:

• Other models of buffers: in particular, inter-stage virtual output queues (VOQ) may give very good performance in a CCMIN (because with a speed-up of only 2, each stage will have 100% throughput). Two parallel delay-line buffers?

• How heavily do the studied architectures rely on the URM assumption? Study more realistic traffic models / ways to balance the non-regular traffic.


V.2 Fast switching permutation modules (slide not shown in main presentation)

[Figure: stack of PLC layers coupled in the normal direction; input and switching region shown in the cross state (cross) and in the by-pass state (straight)]

Reconfiguration time can be of the order of nanoseconds!

• Simulation of a crossbar by speed-up (TDM connections for local area networks)

• Core of permutation routing switches for inter-processor communications in a parallel computer


Based on the observation that VOQ and speed-up, plus optimal permutation decomposition, are the basic ingredients of the Birkhoff-von Neumann Switch (plus load-balancing to simplify the decomposition => Tiny-Tera switch) with 100% throughput, it will be interesting to study:

1) a “constrained” decomposition of a rate matrix onto the set of available CCMIN permutations

2) a multistage version of the BVN switch, where the permutation decomposition is done:

a) at each stage (using bi-permutation modules, this will probably lead to a simple forced-alternate mode, and reduce the size of the VOQ to only 2, which may be accommodated by simple delay-lines!),

b) every few stages, so that the available set of permutations will be very reduced, but still larger than 2. This may optimize the design of buffer functions (no need to put them in all stages).
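For the special case of a uniform rate matrix, the “constrained” decomposition of item 1) is easy to exhibit. The Python sketch below assumes a cube-type CCMIN whose available permutations are the N maps i → i XOR m, and checks that serving them round-robin with equal weight 1/N reproduces the uniform rate matrix exactly. The function names are illustrative, and non-uniform rate matrices are not addressed.

```python
from fractions import Fraction

def xor_permutation_matrix(mask, size):
    """0/1 matrix of the CCMIN permutation i -> i XOR mask."""
    return [[1 if (i ^ mask) == j else 0 for j in range(size)]
            for i in range(size)]

def round_robin_rate_matrix(n_bits):
    """Rate matrix obtained by giving each of the 2**n_bits CCMIN
    permutations an equal share of the time slots."""
    size = 1 << n_bits
    weight = Fraction(1, size)
    total = [[Fraction(0)] * size for _ in range(size)]
    for mask in range(size):
        perm = xor_permutation_matrix(mask, size)
        for i in range(size):
            for j in range(size):
                total[i][j] += weight * perm[i][j]
    return total

if __name__ == "__main__":
    rates = round_robin_rate_matrix(3)
    print(all(entry == Fraction(1, 8) for row in rates for entry in row))
```

Each input/output pair (i, j) is connected by exactly one mask, m = i XOR j, which is why a blind round-robin over the available permutations matches uniform traffic.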



Thank you for your attention

V.2 …“advanced” further research


VI. Some References

Traffic models:

J. Cao et al., “Internet traffic tends toward Poisson and Independent as load Increases”, Nonlinear Estimation and Classification, eds. C. Holmes et al., Springer, NY, 2002.

thermo-optic matrix [Goh01]

round-robin (TDM). [Thompson91].

Crosstalk can be solved decomposing a permutation into semi-permutations, with an increase of the number of network stages [Qiao]

“Volume-consumption comparisons of free-space and guided-wave optical interconnections”, Y.Li and J. Popelek, p.1815-1825, Appl.Opt. Vol 39, n.11, april 2000.

Study of inter-stage VOQ in MINs:

Kolias, “Dual Banyan Switch”, [Kolias]

W.J. Dally, “Virtual-Channel Flow Control”, IEEE Trans. Parallel and Distr. Systems, Vol. 3, No. 2, Mar. 1992, pp. 194-205. Dally studies “DAMQ” (dynamically allocated multi-queue buffers), which look quite similar to “hop-mode” buffers.

