Page 1
1
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.
EE384Y: Packet Switch Architectures, Part II
Load-balanced Switch
(Borrowed from Isaac Keslassy’s Defense Talk)
Nick McKeown
Professor of Electrical Engineering and Computer Science, Stanford University
[email protected] http://www.stanford.edu/~nickm
Page 2
2
The Arbitration Problem
A packet switch fabric is reconfigured for every packet transfer.
For example, at 160Gb/s, a new IP packet can arrive every 2ns.
The configuration is picked to maximize throughput and not waste capacity.
Known algorithms are probably too slow.
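The 2ns figure can be checked with a little arithmetic. The sketch below assumes a minimum-size IP packet of 40 bytes, which is an assumption and is not stated on the slide; the function name is only illustrative.

```python
def packet_time_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time to receive one packet, in nanoseconds."""
    bits = packet_bytes * 8
    # bits divided by Gb/s gives nanoseconds
    return bits / line_rate_gbps

# Assumed 40-byte packet on a 160Gb/s linecard.
print(packet_time_ns(40, 160))  # 2.0
```

The fabric configuration therefore has to be chosen within roughly this time budget for every packet.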
Page 3
3
Approach
We know that a crossbar with VOQs and uniform Bernoulli i.i.d. arrivals gives 100% throughput for the following scheduling algorithms:
  - Pick a permutation uniformly at random (u.a.r.) from all permutations.
  - Pick a permutation u.a.r. from a set of N permutations in which each input-output pair (i,j) is connected exactly once.
  - From the same set as above, repeatedly cycle through a fixed sequence of the N different permutations.
Can we make non-uniform, bursty traffic uniform “enough” for the above to hold?
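One concrete set with the "each pair connected exactly once" property is the set of N cyclic shifts. The sketch below builds that set and checks the property; the choice of cyclic shifts and the function names are assumptions made for illustration.

```python
def cyclic_shift_set(n):
    """Return n permutations; permutation k connects input i to output (i + k) % n."""
    return [[(i + k) % n for i in range(n)] for k in range(n)]

def connects_each_pair_once(perms):
    """True if every input-output pair (i, j) appears in exactly one permutation."""
    n = len(perms)
    counts = [[0] * n for _ in range(n)]
    for perm in perms:
        for i, j in enumerate(perm):
            counts[i][j] += 1
    return all(c == 1 for row in counts for c in row)

perms = cyclic_shift_set(4)
print(connects_each_pair_once(perms))  # True
```

Cycling through `perms` in a fixed order corresponds to the third scheduling option in the list.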
Page 4
4
Design Example
Goals
  - Scale to High Linecard Speeds (160Gb/s)
      - No Centralized Scheduler
      - Optical Switch Fabric
      - Low Packet-Processing Complexity
  - Scale to High Number of Linecards (640)
  - Provide Performance Guarantees
      - 100% Throughput Guarantee
      - No Packet Reordering
Stanford “Optics in Routers” project: http://yuba.stanford.edu/or/
Some challenging numbers:
  - 100Tb/s
  - 160Gb/s linecards
  - 640 linecards
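The three numbers are related: 640 linecards at 160Gb/s each add up to roughly the 100Tb/s target. A quick check (the function name is illustrative):

```python
LINECARDS = 640
LINECARD_RATE_GBPS = 160

def router_capacity_tbps(linecards, rate_gbps):
    """Aggregate router capacity N * R, converted from Gb/s to Tb/s."""
    return linecards * rate_gbps / 1000

print(router_capacity_tbps(LINECARDS, LINECARD_RATE_GBPS))  # 102.4
```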
Page 5
5
Outline
  - Basic idea of load-balancing
  - Packet mis-sequencing
  - An optical switch fabric
  - Scaling number of linecards
  - Arbitrary arrangement of linecards
Page 6
6
100% Throughput in a Mesh Fabric
[Figure: inputs and outputs at line rate R, joined by a full mesh; each mesh link is labeled R and marked "?".]
Router capacity = NR
Switch capacity = N^2 R
Page 7
7
If Traffic Is Uniform
[Figure: inputs and outputs at line rate R, joined by a full mesh; each mesh link is labeled R/N.]
Page 8
8
Real Traffic is Not Uniform
[Figure: the mesh with links of rate R/N, with traffic at rate R between some inputs and outputs, marked "?".]
Page 9
9
Load-Balanced Switch
[Figure: two meshes in series, a load-balancing stage followed by a forwarding stage; inputs and outputs run at rate R and every mesh link runs at R/N.]
100% throughput for weakly mixing traffic (Valiant, C.-S. Chang)
Page 10
10
Load-Balanced Switch
[Figure: the two-mesh switch (links at R/N) with packets labeled 1, 2 and 3.]
Page 11
11
Load-Balanced Switch
[Figure: the two-mesh switch (links at R/N) with packets labeled 1, 2 and 3.]
Page 12
12
Intuition: 100% Throughput
[Figure: the two-mesh switch; inputs and outputs at rate R, mesh links at R/N.]
Arrivals to second mesh: $b = \frac{1}{N} U a$, where $a$ is the arrival-rate matrix and $U$ is the $N \times N$ matrix of all ones.
Capacity of second mesh: $C = \frac{R}{N} U$
Second mesh: arrival rate < service rate: $C - b = \frac{1}{N} (R U - U a) > 0$
[C.-S. Chang]
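The intuition can be tried on a small non-uniform example. The sketch below assumes the first mesh spreads every input evenly over the N middle ports, so each second-mesh link (k, j) carries 1/N of the total traffic for output j. The traffic matrix and names are made up for illustration.

```python
def second_mesh_load(a):
    """Load on each second-mesh link, assuming the first mesh spreads each input evenly."""
    n = len(a)
    col_sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    # b[k][j] = (1/N) * sum_i a[i][j], identical for every middle port k
    return [[col_sums[j] / n for j in range(n)] for _ in range(n)]

R = 1.0
# Hypothetical non-uniform traffic; every output receives 0.9R in total.
a = [[0.9, 0.0, 0.0],
     [0.0, 0.5, 0.4],
     [0.0, 0.4, 0.5]]
b = second_mesh_load(a)
link_capacity = R / len(a)
print(all(load < link_capacity for row in b for load in row))  # True
```

In a single mesh the link from input 1 to output 1 would have to carry 0.9R, far more than R/N; after spreading, no second-mesh link exceeds its R/N capacity as long as no output is oversubscribed.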
Page 13
13
Another way of thinking about it
[Figure: External Inputs 1..N, a load-balancing cyclic shift, Internal Inputs 1..N, a switching cyclic shift, External Outputs 1..N.]
  - First stage load-balances incoming packets
  - Second stage is a cyclic shift
Page 14
14
Load-Balanced Switch
[Figure: External Inputs 1..N, a load-balancing cyclic shift, Internal Inputs 1..N, a switching cyclic shift, External Outputs 1..N; packets labeled 1 and 2 are shown passing through the stages.]
Page 15
15
Main Result [Chang et al.]:
1. Consider a periodic sequence of permutation matrices: $P(t) = P^{\hat{t}}$, where $P$ is a one-cycle permutation matrix (for example, a TDM sequence), and $\hat{t} = t \bmod N$.
2. If the 1st stage is scheduled by a sequence of permutation matrices $P_1(t)$, which follows $P(t)$ from a random starting phase, and
3. the 2nd stage is scheduled by a sequence of permutation matrices $P_2(t)$, which also follows $P(t)$,
4. then the switch gives 100% throughput for a very broad range of traffic types.
Observation: 1st stage makes non-uniform traffic uniform, and breaks up burstiness.
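A small simulation illustrates the result for one heavily non-uniform pattern. This is a sketch under simplifying assumptions that are not from the slides: both stages use plain cyclic shifts with no random starting phase, each input sends only to the output with the same index, and packet order is ignored.

```python
import random

def simulate(n=4, slots=4000, load=0.9, seed=1):
    """Two-stage switch with cyclic-shift stages; returns (arrived, delivered)."""
    rng = random.Random(seed)
    # voq[k][j]: packets waiting at middle port k for output j
    voq = [[0] * n for _ in range(n)]
    arrived = delivered = 0
    for t in range(slots):
        for i in range(n):
            if rng.random() < load:
                k = (i + t) % n      # stage 1: input i is connected to middle port k
                voq[k][i] += 1       # assumed traffic: input i sends only to output i
                arrived += 1
        for k in range(n):
            j = (k + t) % n          # stage 2: middle port k is connected to output j
            if voq[k][j] > 0:
                voq[k][j] -= 1
                delivered += 1
    return arrived, delivered

arrived, delivered = simulate()
print(delivered / arrived)
```

A single cyclic-shift stage would connect input i to output i only once every N slots, so it could carry at most 1/N of this load; with the spreading stage in front, almost every arrival is delivered.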
Observation:
Page 16
16
Outline of Chang’s Proof
1. Let $a(t)$ be the matrix of arrivals at time $t$, where $a_{ij}(t) = 1$ indicates an arrival at $i$ for $j$.
2. Let $b(t) = P_1(t)\,a(t)$ be the input traffic to the second stage.
3. Let $q(t)$ be the queue length matrix:
   $q(t+1) = \max\{\, q(t) + b(t+1) - P_2(t+1),\ 0 \,\}$,
   which expands to
   $q(t) = \max_{0 \le s \le t} \sum_{\tau = s+1}^{t} \left[ b(\tau) - P_2(\tau) \right]$.
Theorem: If no output is oversubscribed, $q(t)$ converges to steady state: $q(t) \to q(\infty)$.
Proof:
   $E[b(t)] = E[P_1(t)\,a(t)] = E[P_1(t)]\,E[a(t)] = \frac{1}{N} e e^{T} E[a(t)]$, where $e$ is the all-ones column vector.
   $\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \left[ b(s) - P_2(s) \right] = \frac{1}{N} e e^{T} E[a(t)] - \frac{1}{N} e e^{T} < 0$.
Holds under some mild conditions on $a(t)$ (weakly mixing arrival processes).
Page 17
17
Outline
  - Basic idea of load-balancing
  - Packet mis-sequencing
  - An optical switch fabric
  - Scaling number of linecards
  - Arbitrary arrangement of linecards
Page 18
18
Packet Reordering
[Figure: the two-mesh switch (links at R/N) with packets labeled 1 and 2.]
Page 19
19
Bounding Delay Difference Between Middle Ports
[Figure: the two-mesh switch (links at R/N) with packets labeled 1 and 2; label: "cells".]
Page 20
20
UFS (Uniform Frame Spreading)
[Figure: the two-mesh switch (links at R/N) with packets labeled 1, 2 and 3.]
Page 21
21
FOFF (Full Ordered Frames First)
[Figure: the two-mesh switch (links at R/N) with packets labeled 1 and 2.]
Page 22
22
FOFF (Full Ordered Frames First)
Input Algorithm
  - N FIFO queues corresponding to the N output flows
  - Spread each flow uniformly: if last packet was sent to middle port k, send next to k+1.
  - Every N time-slots, pick a flow:
      - If full frame exists, pick it and spread like UFS
      - Else if all frames are partial, pick one in round-robin order and send it
[Figure: input FIFO queues holding packets labeled 1, 2, 3, 4; N middle ports.]
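The input algorithm can be sketched as below. This is a simplified illustration, not the authors' implementation: the class and method names are hypothetical, a "full frame" is taken to be N queued packets of one flow, and when several flows have full frames the lowest-numbered one is picked (a simplifying assumption).

```python
from collections import deque

class FoffInput:
    """Sketch of one FOFF input with n output flows and n middle ports."""

    def __init__(self, n):
        self.n = n
        self.queues = [deque() for _ in range(n)]  # one FIFO per output flow
        self.next_port = [0] * n                   # per flow: next middle port to use
        self.rr = 0                                # round-robin pointer for partial frames

    def enqueue(self, output, packet):
        self.queues[output].append(packet)

    def pick_frame(self):
        """Called every n time-slots; returns a list of (middle_port, output, packet)."""
        n = self.n
        full = [j for j in range(n) if len(self.queues[j]) >= n]
        if full:
            flow = full[0]                         # assumed tie-break: lowest index
        else:
            flow = None
            for step in range(n):                  # round-robin over partial frames
                j = (self.rr + step) % n
                if self.queues[j]:
                    flow, self.rr = j, (j + 1) % n
                    break
            if flow is None:
                return []
        sends = []
        for _ in range(min(n, len(self.queues[flow]))):
            packet = self.queues[flow].popleft()
            port = self.next_port[flow]            # continue where this flow left off
            sends.append((port, flow, packet))
            self.next_port[flow] = (port + 1) % n
        return sends

inp = FoffInput(3)
for seq in range(4):
    inp.enqueue(0, ("a", seq))
print(inp.pick_frame())  # full frame of flow 0, spread over middle ports 0, 1, 2
print(inp.pick_frame())  # partial frame: the remaining packet goes to middle port 0
```

Keeping a per-flow pointer means a partial frame leaves off at some middle port and the next packet of that flow resumes at the following port, which is the "send next to k+1" rule.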
Page 23
23
Bounding Reordering
[Figure: the two-mesh switch (links at R/N) with packets labeled 1, 2 and 3; label: N.]
Page 24
24
FOFF
Output properties
  - N FIFO queues corresponding to the N middle ports
  - Buffer size less than N^2 packets
  - If there are N^2 packets, one of the head-of-line packets is in order
[Figure: output FIFO queues holding packets labeled 1, 2, 3, 4; N middle ports.]
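The output side can be pictured as a head-of-line resequencer. The sketch below is an assumption-laden illustration: it supposes that every packet carries its source input and a per-flow sequence number (neither is stated on the slide), and the class name is hypothetical.

```python
from collections import deque

class Resequencer:
    """Sketch of one output: n FIFOs, one per middle port."""

    def __init__(self, n):
        self.fifos = [deque() for _ in range(n)]
        self.expected = {}  # source input -> next sequence number to deliver

    def receive(self, middle_port, src, seq):
        self.fifos[middle_port].append((src, seq))

    def deliver_one(self):
        """Deliver a head-of-line packet that is next in order, if any."""
        for fifo in self.fifos:
            if fifo:
                src, seq = fifo[0]
                if seq == self.expected.get(src, 0):
                    fifo.popleft()
                    self.expected[src] = seq + 1
                    return (src, seq)
        return None  # every head-of-line packet is still waiting for an earlier one

out = Resequencer(2)
out.receive(1, "in0", 1)   # packet 1 arrives first, through middle port 1
out.receive(0, "in0", 0)   # packet 0 arrives later, through middle port 0
print(out.deliver_one())   # ('in0', 0)
print(out.deliver_one())   # ('in0', 1)
```

Only head-of-line packets are examined, which matches the slide's claim that once enough packets are buffered, one of the head-of-line packets is in order.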
Page 25
25
FOFF Properties
Property 1: FOFF maintains packet order.
Property 2: FOFF has O(1) complexity.
Property 3: Congestion buffers operate independently.
Property 4: FOFF maintains an average packet delay within a constant of that of an ideal output-queued router.
Corollary: FOFF has 100% throughput for any adversarial traffic.
Page 26
26
Output-Queued Router?
[Figure: inputs and outputs at line rate R, joined by a full mesh; each mesh link is labeled R and marked "?".]
Page 27
27
Outline
  - Basic idea of load-balancing
  - Packet mis-sequencing
  - An optical switch fabric
  - Scaling number of linecards
  - Arbitrary arrangement of linecards
Page 28
28
From Two Meshes to One Mesh
[Figure: the two-mesh switch (links at R/N); one In port and one Out port are marked as belonging to one linecard.]
Page 29
29
From Two Meshes to One Mesh
[Figure: linecards (each with In and Out) connected through a first mesh and a second mesh; links to each mesh are labeled R. One linecard is marked.]
Page 30
30
From Two Meshes to One Mesh
[Figure: linecards (each with In and Out) attached to one combined mesh; the links between each linecard and the mesh are labeled 2R.]
Page 31
31
Many Fabric Options
Options
  - Space: Full uniform mesh
  - Time: Round-robin crossbar
  - Wavelength: Static WDM
[Figure: linecards (each with In and Out) send channels C1, C2, …, CN into any spreading device, which delivers C1, C2, C3, …, CN to the linecards. One linecard is marked.]
N channels each at rate 2R/N
Page 32
32
AWGR (Arrayed Waveguide Grating Router): A Passive Optical Component
  - Wavelength i on input port j goes to output port (i+j-1) mod N
  - Can shuffle information from different inputs
[Figure: Linecards 1..N each send wavelengths 1, 2, …, N into an NxN AWGR, whose outputs 1..N feed Linecards 1..N.]
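The routing rule can be checked numerically. The sketch below applies (i+j-1) mod N directly and makes one assumption not stated on the slide: with ports numbered 1..N, a result of 0 is read as port N.

```python
def awgr_output(wavelength, in_port, n):
    """Output port for a wavelength on an input port, using (i + j - 1) mod N."""
    port = (wavelength + in_port - 1) % n
    return port if port != 0 else n  # assumed convention: 0 means port N

N = 4
for in_port in range(1, N + 1):
    outputs = [awgr_output(w, in_port, N) for w in range(1, N + 1)]
    print(in_port, outputs)
```

Each input reaches every output on a different wavelength, and on any single wavelength the N inputs land on N distinct outputs, which is why the device can act as a static spreading stage.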
Page 33
33
Static WDM Switching: Packaging
[Figure: four linecards (each with In and Out) around an AWGR labeled "Passive and Almost Zero Power". Each linecard sends wavelengths A, B, C, D; the AWGR delivers A, A, A, A to one linecard, B, B, B, B to the next, then C, C, C, C and D, D, D, D.]
N WDM channels, each at rate 2R/N
Page 34
34
Outline
  - Basic idea of load-balancing
  - Packet mis-sequencing
  - An optical switch fabric
  - Scaling number of linecards
  - Arbitrary arrangement of linecards
Page 35
35
Scaling Problem
  - For N < 64, an AWGR is a good solution.
  - We want N = 640.
  - Need to decompose.
Page 36
36
A Different Representation of the Mesh
[Figure: linecards (each with In and Out) attached to a mesh by links labeled R and 2R, redrawn with the linecards shown on both sides of the mesh.]
Page 37
37
A Different Representation of the Mesh
[Figure: linecards (each with In and Out) on the left and on the right, fully connected; linecard rate R, each link labeled 2R/N.]
Page 38
38
Example: N=8
[Figure: linecards 1..8 on the left fully connected to linecards 1..8 on the right; each link labeled 2R/8.]
Page 39
39
When N is Too Large
Decompose into groups (or racks)
[Figure: linecards 1..8 on each side, each at 2R, arranged into groups; labels: 4R, 4R/4.]
Page 40
40
When N is Too Large
Decompose into groups (or racks)
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1..L at 2R; each group aggregates 2RL and connects to the groups on the other side over links of 2RL/G.]
Page 41
41
Outline
  - Basic idea of load-balancing
  - Packet mis-sequencing
  - An optical switch fabric
  - Scaling number of linecards
  - Arbitrary arrangement of linecards
Page 42
42
When Linecards Fail
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1..L at 2R; each group aggregates 2RL and connects to the groups on the other side over links of 2RL/G.]
Solution: replace mesh with sum of permutations
[Figure: the mesh of 2RL/G links drawn as a sum of permutations, each with links of 2RL/G.]
2RL ≤ G * 2RL/G
Page 43
43
Hybrid Electro-Optical Architecture Using MEMS Switches
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1..L at 2R, connected through MEMS switches.]
Page 44
44
When Linecards Fail
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1..L at 2R, connected through MEMS switches.]
Page 45
45
Fiber Link Capacity
[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1..L at 2R, connected through MEMS switches; labels: Laser/Modulator, MUX.]
Link Capacity ≈ 64 λ’s * 5 Gb/s/λ = 320 Gb/s = 2R
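The link-capacity figure follows from the per-wavelength rate, and matches 2R for the 160Gb/s linecards of the design example. A quick check (constant names are illustrative):

```python
WAVELENGTHS = 64
RATE_PER_WAVELENGTH_GBPS = 5
R_GBPS = 160  # linecard rate from the design example

link_capacity_gbps = WAVELENGTHS * RATE_PER_WAVELENGTH_GBPS
print(link_capacity_gbps, link_capacity_gbps == 2 * R_GBPS)  # 320 True
```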
Page 46
46
Example: 2 Groups of 2 Linecards
[Figure: Group/Rack 1 and Group/Rack 2 on each side, each holding linecards 1 and 2 at 2R (4R per group); the groups are joined by links of 2R.]
Page 47
47
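Number of MEMS Switches
G groups, $L_i$ linecards in group $i$, $N = \sum_{i=1}^{G} L_i$, $L = \max_k L_k$
Theorem: $M \equiv L + G - 1$ MEMS switches are sufficient for bandwidth.
Examples:
  - $N = 640,\ L = 16,\ G = 40 \Rightarrow M = 55$
  - $L = G = \sqrt{N} \Rightarrow M \approx 2\sqrt{N}$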
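The bound is easy to evaluate for a given arrangement of linecards. A small sketch, where the function name and the list-of-group-sizes input are illustrative choices:

```python
def mems_switches(group_sizes):
    """M = L + G - 1, with L the largest group size and G the number of groups."""
    largest_group = max(group_sizes)
    num_groups = len(group_sizes)
    return largest_group + num_groups - 1

print(mems_switches([16] * 40))  # 55 (N = 640, L = 16, G = 40)
print(mems_switches([2, 2]))     # 3  (2 groups of 2 linecards)
```

Because L is the maximum group size, the same formula applies when groups are unevenly populated.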
Page 48
48
Packet Schedule
[Figure: Group A and Group B on each side, each holding linecards 1 and 2 at 2R (4R per group); the groups are joined by links of 2R.]
Page 49
49
Rules for Packet Schedule
At each time-slot:
  - Each transmitting linecard sends one packet
  - Each receiving linecard receives one packet
  - (MEMS constraint) Each transmitting group i sends at most one packet to each receiving group j through each MEMS connecting them
In a schedule of N time-slots:
  - Each transmitting linecard sends exactly one packet to each receiving linecard
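These rules can be turned into a small checker and applied to the two 4-slot schedules shown on the slides for 2 groups of 2 linecards. The sketch assumes one packet per transmitting group per receiving group per time-slot, as in the group schedule "AB" on the slides; the parameter `max_per_group_pair` and the function names are hypothetical.

```python
from collections import Counter

def group_of(linecard):
    return linecard[0]  # "A1" -> "A"

def check_schedule(schedule, max_per_group_pair=1):
    """Return a list of rule violations; an empty list means the schedule is valid."""
    txs = list(schedule)
    n = len(txs)
    problems = []
    for t in range(n):
        dests = [schedule[tx][t] for tx in txs]
        if len(set(dests)) != n:
            problems.append(f"slot {t + 1}: a receiver gets more than one packet")
        pairs = Counter((group_of(tx), group_of(schedule[tx][t])) for tx in txs)
        for (g_tx, g_rx), count in pairs.items():
            if count > max_per_group_pair:
                problems.append(f"slot {t + 1}: group {g_tx} sends {count} packets to group {g_rx}")
    for tx in txs:
        if sorted(schedule[tx]) != sorted(txs):
            problems.append(f"{tx} does not reach every receiver exactly once")
    return problems

# Rows are transmitting linecards, columns are time-slots T+1..T+4.
BAD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "A1", "A2", "B1"],
       "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "B1", "B2", "A1"]}
GOOD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "B1", "A2", "A1"],
        "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "A1", "B2", "B1"]}

print(check_schedule(GOOD))  # []
for problem in check_schedule(BAD):
    print(problem)
```

The bad schedule satisfies the per-linecard rules but fails the group constraint in slots T+2 and T+4, where both linecards of a group send to the same receiving group.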
Page 50
50
Packet Schedule
              T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1    ?    ?    ?    ?
  Tx LC A2    ?    ?    ?    ?
Tx Group B
  Tx LC B1    ?    ?    ?    ?
  Tx LC B2    ?    ?    ?    ?
Page 51
51
Packet Schedule
              T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1    A1   A2   B1   B2
  Tx LC A2    B2   A1   A2   B1
Tx Group B
  Tx LC B1    B1   B2   A1   A2
  Tx LC B2    A2   B1   B2   A1
Page 52
52
Bad Packet Schedule
              T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1    A1   A2   B1   B2
  Tx LC A2    B2   A1   A2   B1
Tx Group B
  Tx LC B1    B1   B2   A1   A2
  Tx LC B2    A2   B1   B2   A1
At T+2 and T+4, both linecards of a Tx group send to the same Rx group.
Page 53
53
Group Schedule
T+1 T+2 T+3 T+4
Tx Group A AB AB AB AB
Tx Group B AB AB AB AB
Page 54
54
Good Packet Schedule
              T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1    A1   A2   B1   B2
  Tx LC A2    B2   B1   A2   A1
Tx Group B
  Tx LC B1    B1   B2   A1   A2
  Tx LC B2    A2   A1   B2   B1
Theorem: There exists a polynomial-time algorithm that finds the correct packet schedule.
Page 55
55
Outline
  - Basic idea of load-balancing
  - Packet mis-sequencing
  - An optical switch fabric
  - Scaling number of linecards
  - Arbitrary arrangement of linecards