Ishikawa Laboratory, UNIVERSITY OF TOKYO, http://www.k2.t.u-tokyo.ac.jp/
http://www.k2.t.u-tokyo.ac.jp/index-e.html
Stage-Distributed Time-Division Permutation Routing in a Multistage Optically Interconnected Fabric
Alvaro Cassinelli(1), Makoto Naruse(2), Alain Goulet(1), and Masatoshi Ishikawa(1)
(1) University of Tokyo, Dept. Information Physics and Computing, 7-3-1 Hongo Bunkyo-ku, Tokyo 113-0033, Japan.
(2) Communications Research Laboratory, 4-2-1 Nukui-kita, Koganei, Tokyo 184-8795, Japan.
[Figure: multistage optical hypercube interconnecting processor arrays X, Y, W and Z]
PLAN of the presentation
I. Introduction: space-domain optical switching fabrics
II. Column-Control in Multistage Interconnection Networks (CCMINs)
III. Folded Optical Implementation of a transparent CCMIN
IV. Packet switching in a buffered CCMIN (“new”)
V. Conclusion and Further Research
VI. Some References
I. Introduction: the problem under study
How to design an efficient optical switching fabric for addressing:
1) Processor-memory bottleneck in supercomputers
2) Router bottleneck in the Next Generation Optical Internet
These problems have some similarities:
low latency required, synchronization, high bandwidth…
Traffic characteristics change:
synchronous/asynchronous, regular/arbitrary request patterns, fixed/variable length of data bursts (granularity)
In fact, the above problems are case studies among a continuum of situations…
I. Introduction: optics inside routers
[Figure: scheme of a router: controller, input interface, switching fabric, output interface]
Where optics?
• to interconnect router subsystems
• at the (unbuffered) switching fabric (OXC)
• at the interfaces and controller (“all-optical routing”)
This presentation concerns:
SPACE-DOMAIN OPTICAL SWITCHING FABRICS
II. Column-Control in Multistage Interconnection Networks
II.1 Multistage Interconnection Networks
II.2 Column-Control in MINs
II.3 Permutation Capacity of CCMIN
II.4 Unbuffered CCMIN for permutation routing
Circuit Switching: good for low-latency memory-processor communications.
Packet Switching: Maximum throughput of 63% without buffers (uniform traffic).
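The 63% figure can be checked with a short calculation. The sketch below assumes the usual textbook model, which the slide does not spell out: each of the N inputs of an unbuffered crossbar issues a request with probability p to a uniformly chosen output, and an output serves one of the requests addressed to it. The function names are illustrative only.

```python
import math


def crossbar_throughput(p: float, n_ports: int) -> float:
    """Expected fraction of outputs that receive at least one request per slot."""
    return 1.0 - (1.0 - p / n_ports) ** n_ports


def crossbar_acceptance(p: float, n_ports: int) -> float:
    """Probability that an issued request is accepted (throughput / offered load)."""
    return crossbar_throughput(p, n_ports) / p


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(n, round(crossbar_throughput(1.0, n), 4))
    # Large-N limit at full load: 1 - 1/e
    print("limit", round(1.0 - math.exp(-1.0), 4))
```

At full load the throughput decreases with N towards 1 - 1/e, roughly 63%, which matches the value quoted on the slide.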
II.1 Multistage Interconnection Networks
Alternative architecture:
Multistage Interconnection Network (MIN)
…It still has point-to-point full connectivity (and is “self-routing”).
Advantages:
• Full point-to-point connectivity
• O(N log2 N) complexity
• Distributed routing possible
• Fault tolerance possible (re-routing)
• Easier repair thanks to modularity
Drawbacks:
• Internal blocking
• Large optical losses
• Large crosstalk
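To make “self-routing” and “internal blocking” concrete, here is a minimal sketch of destination-tag routing. It assumes an omega network (perfect shuffle followed by a column of 2x2 switches at every stage); the slides do not name the MIN topology at this point, so the topology and the function names are assumptions chosen for illustration.

```python
def omega_route(src: int, dst: int, n_stages: int) -> list:
    """Return the link index occupied after each stage on the path src -> dst."""
    mask = (1 << n_stages) - 1
    pos, links = src, []
    for k in range(n_stages):
        bit = (dst >> (n_stages - 1 - k)) & 1  # destination bit, MSB first
        pos = ((pos << 1) & mask) | bit        # shuffle, then the switch sets the low bit
        links.append(pos)
    return links


def is_blocked(perm: list, n_stages: int) -> bool:
    """True if two requests src -> perm[src] need the same internal link."""
    for k in range(n_stages):
        used = set()
        for src, dst in enumerate(perm):
            link = omega_route(src, dst, n_stages)[k]
            if link in used:
                return True
            used.add(link)
    return False


if __name__ == "__main__":
    identity = list(range(8))
    bit_reversal = [0, 4, 2, 6, 1, 5, 3, 7]
    print(omega_route(5, 3, 3))         # each switch only reads one destination bit
    print(is_blocked(identity, 3))      # the identity passes without conflict
    print(is_blocked(bit_reversal, 3))  # bit reversal collides inside the network
```

Each switch decides locally from one bit of the destination address, which is why no central controller is needed; the price is that some permutations cannot be set up in a single pass.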
II.2 Column-Control in MINs
• Column-control simplifies hardware and control
[Figure: “stage-global switch”, column-control lines…]
Nice: the CCMIN is still capable of point-to-point connectivity
• Possible physical merge of active switching and passive interconnection: 2-state “global” switches with long-range interconnections, suited for optical implementation (free-space, guided-wave)
II.3 Permutation Capacity of CCMIN
…if blocking was a problem for a MIN…
…things are much worse for the CCMIN
[Figure: local blocking vs. “global-stage” blocking]
As a consequence of “global-stage” blocking, the permutation capacity of the CCMIN is extremely reduced.
However…
II.3 Permutation Capacity of CCMIN
• Requests are serviced by circuit switching (or by on-the-fly packet switching)
• Input requests are independent Bernoulli trials (parameter: input request probability per unit time)
• Uniform traffic: equal probability of requesting any output port
[Figure: probability of request acceptance (0 to 1) vs. input request probability per unit time (0 to 1) for a 64x64 network; curves: crossbar, standard MIN, CCMIN]
• Crossbar: tends to 63% when N → ∞, because of HOL blocking.
• Standard MIN and CCMIN: both tend to 0 when N → ∞.
CCMIN cannot be used to service arbitrary requests in a circuit-switched manner!
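The trend in the plot can be reproduced qualitatively with a small model. This is a sketch under stated assumptions, not the simulation used for the slides: the standard MIN is modelled with the classical recurrence for unbuffered networks of 2x2 switches under uniform traffic, and the CCMIN is modelled as a hypercube-type fabric in which stage k either exchanges along dimension k or does nothing, so that one global setting corresponds to one XOR mask. The “best mask” selection policy and all names are illustrative.

```python
import random


def min_acceptance(p: float, n_stages: int) -> float:
    """Acceptance probability of an unbuffered MIN of 2x2 switches (uniform traffic)."""
    q = p
    for _ in range(n_stages):
        q = 1.0 - (1.0 - q / 2.0) ** 2  # load on an output link of a 2x2 switch
    return q / p


def ccmin_acceptance(p: float, n_stages: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo acceptance of a column-controlled fabric (assumed XOR-mask model)."""
    rng = random.Random(seed)
    n_ports = 1 << n_stages
    offered = accepted = 0
    for _ in range(trials):
        per_mask = {}
        for src in range(n_ports):
            if rng.random() < p:
                mask = src ^ rng.randrange(n_ports)  # global setting this request needs
                per_mask[mask] = per_mask.get(mask, 0) + 1
        if per_mask:
            offered += sum(per_mask.values())
            accepted += max(per_mask.values())  # serve the most requested setting
    return accepted / offered if offered else 0.0


if __name__ == "__main__":
    for load in (0.2, 0.6, 1.0):
        print(load, round(min_acceptance(load, 6), 3), round(ccmin_acceptance(load, 6), 3))
```

In this model only the requests that happen to need the same global setting can be served together, so the accepted fraction collapses as the number of ports grows.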
II.4 Unbuffered CCMIN for permutation routing
Reduced permutation capacity may still be useful for synchronous “permutation routing” in parallel computers (*)
4-D hypercube-connected multiprocessor…
Synchronous, weakly connected parallel computer
(processors use the same permutation in each time slot)
[Figure: 16-node 4-D hypercube with dimensions C1 to C4, and the equivalent four-stage CCMIN with stages labelled C1 to C4]
(*) issue well studied in the past on “standard” blocking MINs
III. Folded Optical Implementation of a transparent CCMIN
III.1 Designing a CCMIN for circuit-switched permutation routing
III.2 “Folded” Optical Implementation
III.3 Experimental Demonstration
III.4 Possible applications
III.1 Designing a CCMIN for circuit-switched permutation routing
3-stage CC “Baseline Network”
{c3, id} {c2, id} {c1, id}
• Number of permutations: 2^n (n=3)
• These are {c3, id}x{c2, id}x{c1, id}
• These are just the required permutations to implement the (3D) hypercube!
[Figure: 3D hypercube with dimensions c1, c2, c3]
A multistage version of most parallel-computer direct-network topologies (hypercube, cube-connected-cycles, de Bruijn, etc.) can be implemented as a CCMIN with properly designed inter-stage permutation modules.
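The set {c3, id}x{c2, id}x{c1, id} can be enumerated directly. The sketch below assumes that c_k denotes the usual cube permutation (exchange along hypercube dimension k, i.e. flipping bit k-1 of the port index) and ignores the fixed inter-stage wiring of the baseline network; both are simplifying assumptions, and the function names are illustrative.

```python
from itertools import product


def cube(k: int, n_ports: int) -> list:
    """c_k: exchange along hypercube dimension k (1-based): i -> i XOR 2^(k-1)."""
    return [i ^ (1 << (k - 1)) for i in range(n_ports)]


def compose(first: list, second: list) -> list:
    """Permutation obtained by applying `first`, then `second`."""
    return [second[first[i]] for i in range(len(first))]


def ccmin_permutations(n_stages: int) -> set:
    """All permutations reachable with one two-state global switch per stage."""
    n_ports = 1 << n_stages
    perms = set()
    for states in product((0, 1), repeat=n_stages):
        perm = list(range(n_ports))
        for k, crossed in enumerate(states, start=1):
            if crossed:
                perm = compose(perm, cube(k, n_ports))
        perms.add(tuple(perm))
    return perms


if __name__ == "__main__":
    perms = ccmin_permutations(3)
    print(len(perms))  # 2^3 distinct permutations
    for perm in sorted(perms):
        print(perm)
```

Every permutation in the set sends port i to i XOR m for a single mask m, which is why the three cube exchanges (masks 1, 2 and 4) are enough to reach every hypercube neighbour.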
III.2 “Folded” Optical Implementation
Optical Multistage Architecture Paradigm (fixed interconnections):
Multistage Interconnection Network architecture
+
Dense & Efficient 3D folded inter-stage optical interconnects
[Figure: shuffle inter-stage interconnections]
Plane implementation:
• electronic
• planar lightwave circuit (PLC)
3D implementation:
• free space
• guided-wave
III.2 Demonstration: electromechanical actuator (slide not shown in main presentation)
• Fixed, no broadcast: optical fiber is OK.
• Better efficiency (and, just like free-space optics, no cross-talk in 3D).
• No space-invariance imposed.
• Precise and robust alignment possible.
• Theoretically more volume-efficient than the free-space counterpart.
• “Hard” to build? Not fundamentally difficult (can be automated, permutation decomposition possible).
• Alignment of output and input.
• Fundamental power-dissipation limit is very far away compared with electronics.
III.2 Demonstration: electromechanical actuator (slide not shown in main presentation)
Input: slow row/column scan of VCSEL array
• No electromagnetic actuation: fixed Identity permutation.
• Electromagnetic actuation: Identity & Cube2 permutations alternate at 860 Hz.
III.2 Demonstration: electromechanical actuator
Input: 635 nm laser modulated at 500 MHz
Output: high-speed photodetector
[Figure: actuator position and photodetector signal vs. time; marked interval: 200 µs]
• Switching latency between interconnections ≈ 0.96 ms (*)
• Time slot (3 dB) ≈ 200 µs
With a 10 Gb/s optical link, the burst size is 2 Mbits per channel (every millisecond): average bandwidth of 2 Gb/s per channel.
(*) MEMS routers: ms range.
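The burst figures follow from simple arithmetic. The sketch below uses the slide's own numbers (10 Gb/s link, one burst every millisecond) and a usable time slot of 200 µs, the value for which the quoted 2 Mbit burst and 2 Gb/s average are mutually consistent; the function names are illustrative.

```python
def burst_size_bits(link_rate_bps: float, time_slot_s: float) -> float:
    """Bits that fit in one usable time slot at the given line rate."""
    return link_rate_bps * time_slot_s


def average_bandwidth_bps(burst_bits: float, period_s: float) -> float:
    """Average per-channel bandwidth when one burst is sent every period."""
    return burst_bits / period_s


if __name__ == "__main__":
    burst = burst_size_bits(10e9, 200e-6)     # 10 Gb/s during a 200 µs slot
    avg = average_bandwidth_bps(burst, 1e-3)  # one burst per millisecond
    print(f"burst = {burst / 1e6:.1f} Mbit, average = {avg / 1e9:.1f} Gb/s")
```

The same two functions can be reused with a faster actuator to see how the average bandwidth approaches the line rate as the switching latency shrinks relative to the time slot.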
III.4 Possible applications of an optical CCMIN
Possible computing applications:
• The present system is not usable for typical memory-processor communications, which require low latencies (< 100 ns), unless other switching hardware is used (acousto-optic cells: µs range / electro-optical materials: ns range).
• If processing time is long (so that a slow switching latency is tolerable) and the “bursts” of data are large, the electromechanical system may be used (FFT, large database retrieval, …?)
Communication networks:
• burst switching at the WAN level (ms-range reconfiguration times).
• scientific-dedicated, transparent networks with long holding times and high bandwidth (TransLight, GLIF). MEMS switches are currently used (reconfiguration times in the range of a second are OK). An optical GSMIN may be used to regularly provide interconnection configurations.
• if the switching time is reduced, it can be used to perform cyclic permutation scheduling in a virtual output queued (VOQ) switch, leading to 100% throughput (Stanford “Tiny-Tera Switch”).
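The last point can be illustrated with a toy simulation: a VOQ switch whose fabric blindly cycles through a fixed set of permutations, with no scheduler. The sketch assumes uniform Bernoulli arrivals, one packet per matched input/output pair per slot, and a schedule that cycles through the N XOR masks (input i is connected to output i XOR (t mod N)), which is one possible choice for a hypercube-type CCMIN; none of these details come from the slides.

```python
import random


def simulate_cyclic_voq(n_ports: int, load: float, n_slots: int, seed: int = 0) -> float:
    """Return the throughput per port of a VOQ switch with a blind cyclic schedule.

    n_ports must be a power of two so that i ^ m stays inside the port range.
    """
    rng = random.Random(seed)
    # voq[i][j] = number of packets queued at input i for output j
    voq = [[0] * n_ports for _ in range(n_ports)]
    departed = 0
    for t in range(n_slots):
        # Uniform Bernoulli arrivals
        for i in range(n_ports):
            if rng.random() < load:
                voq[i][rng.randrange(n_ports)] += 1
        # Fabric configuration of this slot: one fixed permutation, chosen blindly
        mask = t % n_ports
        for i in range(n_ports):
            j = i ^ mask
            if voq[i][j] > 0:
                voq[i][j] -= 1
                departed += 1
    return departed / (n_ports * n_slots)


if __name__ == "__main__":
    for load in (0.5, 0.8, 0.95):
        print(load, round(simulate_cyclic_voq(8, load, 20000), 3))
```

Under uniform traffic each queue receives load/N packets per slot and is served once every N slots, so the carried load follows the offered load for any load below 1. With non-uniform traffic this no longer holds without some form of load balancing.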
…Slow switching may be okay (slide not shown in main presentation)
[Figure: timeline alternating one-stage computation phases (ex. 1 ms) with burst interconnects (Interconnect 1, Interconnect 2); burst interconnection within a “short” time slot (ex. 10 Gbps, 100 ns → 1 kbit); interconnection switching interval (ex. 1 ms)]
IV. Packet switching in a buffered CCMIN
IV.1 Buffering in blocking networks
IV.2 FIFO Buffered CCMIN architecture
IV.3 Performance evaluation
IV.4 Delay-line “buffered” architecture
IV.1 Buffering for packet switching
Blocking is a serious drawback for circuit switching…
…Less serious for packet switching
Buffering is a solution adopted in “usual” MINs…
• Unbuffered networks (even wide-sense non-blocking) suffer from HOL blocking: buffering is unavoidable.
• Input queues, output queues, virtual output queues and internal buffering have been explored in crossbars as well as in MINs.
• However, an advantage of buffered MINs over buffered crossbars is that stage-distributed switching marries well with distributed buffering (thus avoiding large buffers).
…how much is a CCMIN improved by buffering?
IV.2 FIFO Buffered CCMIN architecture
[Figure: input → cascade of global switches, Switch GS-En(k), separated by inter-stage FIFO buffers (Buffer 1 … Buffer N) with arbitration → output; annotations: total length of buffer, depth of analysis, length of transferred packets/cycle]
Why may this architecture compare well with “standard” buffered MINs?
• For uniform traffic, at each stage half of the packets wait and half pass: individual switch/buffer control is, presumably, not really required…
What’s more:
• Arbitration for configuring the Global Switches may not be necessary at all!
IV.3 Performance: global control vs. local control
Seven-stage, 128x128 input/output fabrics (rem: inter-stage transfer with maximum speed-up equal to the size of the buffer)
[Figure: probability of packet acceptance (0 to 1) vs. input request probability per unit time (0 to 1); curves labelled by buffer size (0 to 6) for: crossbar, standard MIN, Global Switched MIN]
• GSMIN performance evolves quicker with buffer size
• For buffer size = 5 packets, equivalent performances
• For buffer size = 3 packets, performances are better than Xbar
Performance of the Global Switched MIN compares very well with that of a standard MIN.
IV.3 Performance: global control with blind alternate
“Blind” switch alternation of a GSMIN
[Figure: probability of packet acceptance (0 to 1) vs. input request probability per unit time (0 to 1); curves labelled by buffer size (0 to 6) for: crossbar, “fair” switching, “blind” alternate]
As expected, “blind” alternation of switch states gives the same performance as a “fair” switch selection (for uniform traffic).
This is very interesting, because it means that a Standard MIN can be operated “blindly” if traffic is uniform enough.
The interconnection scheduling bottleneck (CLOS, etc.) is eliminated by using a Time-Division Permutation Routing strategy.
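A toy simulation helps to see how blind alternation behaves with inter-stage FIFO buffers. This is a minimal sketch, not the authors' simulator: it assumes a hypercube-type fabric in which stage k, when crossed, flips bit k of the line index; every stage toggles between straight and cross at each cycle; up to `buffer_size` packets may leave a FIFO per cycle (the speed-up mentioned for the plots); and a new packet is dropped when the first-stage FIFO is full. All names and modelling choices are assumptions.

```python
import random
from collections import deque


def simulate_buffered_ccmin(n_stages: int, buffer_size: int, load: float,
                            n_cycles: int, seed: int = 0) -> float:
    """Return the packet acceptance probability at the fabric inputs."""
    rng = random.Random(seed)
    n_ports = 1 << n_stages
    # buffers[k][line]: FIFO of destination addresses waiting in front of stage k
    buffers = [[deque() for _ in range(n_ports)] for _ in range(n_stages)]
    offered = accepted = 0
    for t in range(n_cycles):
        state = t % 2  # blind alternation: 0 = straight, 1 = cross, for every stage
        # Last stage first, so a packet advances by at most one stage per cycle
        for k in reversed(range(n_stages)):
            for line in range(n_ports):
                fifo = buffers[k][line]
                out_line = line ^ (state << k)
                moved = 0
                while fifo and moved < buffer_size:
                    dst = fifo[0]
                    if ((line ^ dst) >> k) & 1 != state:
                        break  # head packet needs the other switch state: it waits
                    if k < n_stages - 1:
                        nxt = buffers[k + 1][out_line]
                        if len(nxt) >= buffer_size:
                            break  # next-stage FIFO is full: back-pressure
                        nxt.append(dst)
                    # at the last stage the packet leaves the fabric on line dst
                    fifo.popleft()
                    moved += 1
        # New arrivals: Bernoulli sources with uniformly distributed destinations
        for src in range(n_ports):
            if rng.random() < load:
                offered += 1
                if len(buffers[0][src]) < buffer_size:
                    buffers[0][src].append(rng.randrange(n_ports))
                    accepted += 1
    return accepted / offered if offered else 1.0


if __name__ == "__main__":
    for size in (1, 3, 5):
        print(size, round(simulate_buffered_ccmin(4, size, 1.0, 2000), 3))
```

No arbitration appears anywhere in the loop: the switch state depends only on the cycle counter, which is the point of the “blind” strategy. The absolute numbers depend on the modelling choices above and should not be compared directly with the plotted curves.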
IV.4 Delay-line “buffered” architecture
What about just delaying packets?
Reliable optical memories are still too difficult to implement...
[Figure: input → switch stages with delay-line “buffers” → output]
(since there are only two states per stage, a single delay line alone may give good performance)
[Figure: input → switch with delay-line “buffer” → output] (slide not shown in main presentation)
… we didn’t study a “standard” MIN with delay-lines
IV.4 Performance of a delay-line “buffered” architecture
[Figure: probability of packet acceptance (0 to 1) vs. input request probability per unit time (0 to 1); curves: crossbar, Global Switched MIN (labelled by buffer size, 0 to 6), delay-line]
Using a single selectable delay per channel and per stage, performance lies somewhere between that of the one-packet and two-packet FIFO buffered architectures.
Blind alternation of global switch states is assumed.
(we didn’t study a “standard” MIN with delay-lines)
V. Conclusion
V.1 Results
V.2 Further Research
V.1 Conclusion
Summarizing:
• Column-Control simplifies MIN hardware and control;
• Column-Controlled MIN can be efficiently implemented using dense plane-to-plane optical interconnections;
• Column-Control MIN may have enough permutation capacity for specific applications (highly parallel algorithms);
• Column-Controlled MIN can be used for packet switching if buffered, giving roughly the same performance as “standard” MINs;
• Path-selection mechanism may be “blind” (i.e. round-robin, time-division permutation routing) without appreciable degradation of performance.
V.2 Further Research
On transparent circuit switched CCMINs:
• An arbitrary permutation request may be serviced by multiplexing in time the available set of permutations. This needs input buffers and speed-up (i.e. short switching latency). This has been explored in standard MINs using 2x2 switches…
• Design of “active” modules, and multi-function modules (containing more than two permutations, but also other optical functions, e.g. optical delay lines)
On buffered packet switched CCMINs:
• Other models of buffers: in particular, inter-stage virtual output queues (VOQ) may give very good performance in a CCMIN (because with a speed-up of only 2, each stage will have 100% throughput). Two parallel delay-line buffers?
• How heavily do the studied architectures rely on the URM assumption? Study more realistic traffic models / ways to balance the non-regular traffic.
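The time-multiplexing idea for arbitrary permutations can be sketched in a few lines. The example assumes a hypercube-type CCMIN whose available permutations are the XOR masks i → i XOR m, and simply groups the requests of a permutation by the mask that serves them; each group can then be transmitted in its own time slot. The function name and the grouping strategy are illustrative assumptions.

```python
from collections import defaultdict


def schedule_permutation(perm: list) -> dict:
    """Group the requests src -> perm[src] by the XOR mask (global setting) they need."""
    slots = defaultdict(list)
    for src, dst in enumerate(perm):
        slots[src ^ dst].append((src, dst))
    return dict(slots)


if __name__ == "__main__":
    bit_reversal = [0, 4, 2, 6, 1, 5, 3, 7]
    cyclic_shift = [(i + 1) % 8 for i in range(8)]
    for name, perm in (("bit reversal", bit_reversal), ("cyclic shift", cyclic_shift)):
        slots = schedule_permutation(perm)
        print(name, "needs", len(slots), "time slots")
        for mask, pairs in sorted(slots.items()):
            print(f"  mask {mask:03b}: {pairs}")
```

The number of distinct masks is the number of time slots, and therefore the speed-up, needed to emulate the requested permutation; in the worst case it equals the number of ports.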
V.2 Fast switching permutation modules (slide not shown in main presentation)
[Figure: stack of PLC layers coupled in the normal direction; each layer has an input and a switching region with two states: cross state (cross) and by-pass state (straight)]
Reconfiguration time can be of the order of nanoseconds!
• Simulation of a crossbar by speed-up (TDM connections for local area networks)
• Core of permutation routing switches for inter-processor communications in a parallel computer
V.2 …“advanced” further research (slide not shown in main presentation)
Based on the observation that VOQ and speed-up, plus optimal permutation decomposition, are the basic ingredients of the Birkhoff-von Neumann (BVN) switch with 100% throughput (plus load-balancing to simplify the decomposition => Tiny-Tera switch), it will be interesting to study:
1) a “constrained” decomposition of a rate matrix onto the set of available CCMIN permutations
2) a multistage version of the BVN switch, where the permutation decomposition is done:
a) at each stage (using bi-permutation modules, this will probably lead to a simple forced-alternate mode and reduce the size of the VOQ to only 2, which may be accommodated by simple delay-lines!),
b) every few stages, so that the available set of permutations will be very reduced, but still larger than 2. This may optimize the design of buffer functions (no need to put them in all stages).
Thank you for your attention
VI. Some References
Traffic models:
J. Cao et al., “Internet traffic tends toward Poisson and Independent as load Increases”, Nonlinear Estimation and Classification, eds. C. Holmes et al., Springer, NY, 2002.
thermo-optic matrix [Goh01]
round-robin (TDM). [Thompson91].
Crosstalk can be solved decomposing a permutation into semi-permutations, with an increase of the number of network stages [Qiao]
Y. Li and J. Popelek, “Volume-consumption comparisons of free-space and guided-wave optical interconnections”, Appl. Opt., Vol. 39, No. 11, pp. 1815-1825, April 2000.
Study of inter-stage VOQ in MINs:
Kolias, “Dual Banyan Switch”, [Kolias]
W.J. Dally, “Virtual-Channel Flow Control”, IEEE Trans. Parallel and Distr. Systems, Vol. 3, No. 2, Mar. 1992, pp. 194-205. Dally studies “DAMQ” (dynamically allocated multi-queue buffers), which look quite similar to “hop-mode” buffers.