Scaling Internet Routers Using Optics
UW, October 16th, 2003
Nick McKeown
Joint work with research groups of: David Miller, Mark Horowitz, Olav Solgaard. Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu.
Department of Electrical Engineering, Stanford University
Paper: http://klamath.stanford.edu/~nickm/papers/sigcomm2003.pdf
Web site: http://klamath.stanford.edu/or
Unless something changes, operators will need:
• 16 times as many routers
• consuming 16 times as much space
• 256 times the power
• costing 100 times as much
Actually need more than that…
Stanford 100Tb/s Internet Router
Goal: Study scalability
• Challenging, but not impossible
• Two orders of magnitude faster than deployed routers
• We will build components to show feasibility
[Figure: Electronic Linecard #1 through Electronic Linecard #625, each with 40Gb/s line interfaces and 160Gb/s of capacity, connected at 160-320Gb/s through an optical switch. Each linecard performs line termination, IP packet processing and packet buffering.]
100Tb/s = 640 * 160Gb/s
Throughput Guarantees
Operators increasingly demand throughput guarantees:
• To maximize use of expensive long-haul links
• For predictability and planning
Despite lots of effort and theory, no commercial router today has a throughput guarantee.
Requirements of our router
• 100Tb/s capacity
• 100% throughput for all traffic
• Must work with any set of linecards present
• Use technology available within 3 years
• Conform to RFC 1812
What limits router capacity?
[Figure: Approximate power consumption per rack. Vertical axis: power (kW), 0 to 12. Horizontal axis: year, 1990 to 2003.]
Power density is the limiting factor today
Trend: Multi-rack routers
Reduces power density
[Figure: a single rack holding crossbar and linecards, versus a switch rack connected to separate linecard racks.]
Examples: Alcatel 7670 RSP, Juniper TX8/T640, Chiaro, Avici TSR
Limits to scaling
Overall power is dominated by linecards:
• Sheer number
• Optical WAN components
• Per packet processing and buffering
Instead, can we use an optical fabric at 100Tb/s with 100% throughput?
Conventional answer: No.
• Need to reconfigure switch too often
• 100% throughput requires complex electronic scheduler
Outline
• How to guarantee 100% throughput?
• How to eliminate the scheduler?
• How to use an optical switch fabric?
• How to make it scalable and practical?
[Figure: N inputs and N outputs, each at line rate R, connected by a switch fabric.]
Router capacity = NR
Switch capacity = N^2 R
100% Throughput?
[Figure: inputs and outputs at rate R; the rate needed on each internal link is marked "?".]
If traffic is uniform
[Figure: N inputs and N outputs at rate R, connected by a full mesh of links, each at rate R/N.]
Real traffic is not uniform
[Figure: the same mesh of R/N links between inputs and outputs at rate R, shown with non-uniform traffic.]
Two-stage load-balancing switch
[Figure: two meshes of R/N links in series between inputs and outputs at rate R. The first mesh is the load-balancing stage and the second is the switching stage.]
100% throughput for weakly mixing, stochastic traffic. [C.-S. Chang, Valiant]
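A rough way to see why the spreading stage helps is a fluid (rate-based) calculation. The sketch below is not a packet simulator and is not taken from the talk: the function name `second_stage_link_loads` and the `demand` matrix are made up for illustration, and it assumes the first stage spreads every input's traffic evenly over all N intermediate ports regardless of destination.

```python
from fractions import Fraction


def second_stage_link_loads(demand):
    """Rate on each switching-stage link (intermediate port k -> output j).

    demand[i][j] is the arrival rate at input i destined for output j.
    Assumption: the load-balancing stage sends 1/N of every flow through
    each of the N intermediate ports.
    """
    n = len(demand)
    loads = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            share = Fraction(demand[i][j]) / n  # 1/N of flow (i, j) per intermediate port
            for k in range(n):
                loads[k][j] += share
    return loads


R = Fraction(1)  # line rate, normalised to 1

# Hypothetical non-uniform demand. It is admissible: no input or output
# is asked to carry more than R in total.
demand = [
    [R, 0, 0],          # input 0 sends everything to output 0
    [0, R / 2, R / 2],
    [0, R / 2, R / 2],
]
N = len(demand)

largest_direct_flow = max(max(row) for row in demand)
loads = second_stage_link_loads(demand)
largest_balanced_load = max(max(row) for row in loads)

print("link rate available:", R / N)
print("largest input-to-output flow:", largest_direct_flow)
print("largest switching-stage link load:", largest_balanced_load)
```

In this example the single flow of rate R would overload a direct link of rate R/N, while after spreading no switching-stage link is asked to carry more than R/N.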
[Figure: animation of packets labelled 1, 2 and 3 passing through the two stages of R/N links.]
Chang’s load-balanced switch: Good properties
1. 100% throughput for broad class of traffic
2. No scheduler needed
   Scalable
Chang’s load-balanced switch: Bad properties
1. Packet mis-sequencing
2. Pathological traffic patterns
   Throughput 1/N-th of capacity
3. Uses two switch fabrics
   Hard to package
4. Doesn’t work with some linecards missing
   Impractical
FOFF: Load-balancing algorithm
• Packet sequence maintained
• No pathological patterns
• 100% throughput - always
• Delay within bound of ideal
(See paper for details)
Single Mesh Switch
[Figure: the two stages folded into a single mesh. Inputs and outputs run at rate R, and each link in the mesh runs at rate 2R/N.]
Packaging
[Figure: one linecard holds one input and one output, each at rate R. Linecards connect to each other across the backplane by links at rate 2R/N.]
Many fabric options
Options
• Space: Full uniform mesh
• Time: Round-robin crossbar
• Wavelength: Static WDM
[Figure: linecards (In/Out) attached to any permutation network that cycles through configurations C1, C2, …, CN. N channels, each at rate 2R/N.]
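The time-domain option can be sketched as a fixed cyclic schedule. The code below assumes the simplest choice, a cyclic shift in which configuration t connects input i to output (i + t) mod N; the slides do not say which permutations C1, …, CN are used, and the names `round_robin_schedule` and `is_latin` are illustrative only.

```python
def round_robin_schedule(n):
    """schedule[t][i] = output connected to input i during time slot t.

    Assumed configuration: a cyclic shift by t positions.
    """
    return [[(i + t) % n for i in range(n)] for t in range(n)]


def is_latin(schedule):
    """True if every slot is a permutation and every input reaches every output once."""
    n = len(schedule)
    full = set(range(n))
    each_slot_is_permutation = all(set(slot) == full for slot in schedule)
    each_input_sees_every_output = all(
        {schedule[t][i] for t in range(n)} == full for i in range(n)
    )
    return each_slot_is_permutation and each_input_sees_every_output


N = 4
schedule = round_robin_schedule(N)
for t, slot in enumerate(schedule):
    print(f"slot {t}: " + ", ".join(f"{i}->{o}" for i, o in enumerate(slot)))
print("every pair connected once per period:", is_latin(schedule))
```

Because each input-output pair is connected for one slot in every N, a port running at 2R gives each pair an average rate of 2R/N, the same as one channel of the uniform mesh.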
Static WDM switching
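Arrayed Waveguide Grating Router (AWGR): passive and almost zero power
[Figure: linecards A, B, C and D each send one fiber into the AWGR carrying four wavelengths, one per destination (A, B, C, D). Each output fiber carries the four wavelengths destined for one linecard (A, A, A, A; B, B, B, B; C, C, C, C; D, D, D, D).]
4 WDM channels, each at rate 2R/N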
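The static wavelength routing can be illustrated with a small model. It assumes the common cyclic convention in which light entering input port i on wavelength index w leaves on output port (i + w) mod N; real devices may number ports and wavelengths differently, and the function names here are hypothetical.

```python
def awgr_output_port(input_port, wavelength, n):
    """Output port reached from input_port on a given wavelength index.

    Assumed cyclic routing convention; not taken from the talk.
    """
    return (input_port + wavelength) % n


def wavelength_for(src, dst, n):
    """Wavelength index that linecard src must use to reach linecard dst."""
    return (dst - src) % n


N = 4
names = "ABCD"
for dst in range(N):
    arrivals = [(names[src], wavelength_for(src, dst, N)) for src in range(N)]
    listing = ", ".join(f"{name} on wavelength {w}" for name, w in arrivals)
    print(f"output {names[dst]} receives: {listing}")
```

Under this convention each linecard reaches every destination by choosing a wavelength, and the N signals arriving at any one output are all on different wavelengths, so nothing has to be reconfigured or scheduled inside the device.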
Linecard dataflow
[Figure: dataflow through one linecard, with steps numbered 1 to 4. The input and output run at rate R, and the WDM interfaces each carry channels 1 to N.]
Problems of scale
For N < 64, WDM is a good solution. We want N = 640. Need to decompose.
Decomposing the mesh
[Figure: a full mesh between linecards 1 to 8, each link at rate 2R/8.]
Decomposing the mesh
[Figure: the mesh between linecards 1 to 8 decomposed so that links at rate 2R/8 are aggregated into links at rate 2R/4. Labels: TDM, WDM.]
When N is too large
Decompose into groups (or racks)
[Figure: Group/Rack 1 through Group/Rack G, each holding linecards 1, 2, …, L at rate 2R, connected through an Arrayed Waveguide Grating Router (AWGR) on wavelengths 1, 2, …, G.]
When a linecard is missing
Each linecard spreads its data equally over every other linecard.
Problem: If one is missing, or failed, then the spreading no longer works.
When a linecard fails
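[Figure: three linecards in a single mesh, with inputs and outputs at rate R and each link at rate 2R/3. One linecard fails.]
With the links fixed at 2R/3, each remaining linecard has only 2R/3 + 2R/3 = (4/3)R of capacity.
To spread over the two remaining linecards, each link needs 2R/3 + 2R/6 = R, for a total of 2R/3 + 2R/6 + 2R/3 + 2R/6 = 2R.
Solution:
1. Move light beams
   Replace AWGR with MEMS switch. Reconfigure when linecard added, removed or fails.
2. Finer channel granularity
   Multiple paths.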
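The arithmetic of the three-linecard failure example can be reproduced with exact fractions. This is only a worked check of the numbers on the slide; `required_link_rate` is a name introduced here, and it assumes each linecard spreads a total of 2R evenly over all active linecards, itself included.

```python
from fractions import Fraction


def required_link_rate(active_linecards, R=Fraction(1)):
    """Per-link rate needed to spread 2R evenly over all active linecards."""
    return 2 * R / active_linecards


R = Fraction(1)          # line rate, normalised to 1
installed = 3
fixed_link = required_link_rate(installed, R)       # links are built for 3 linecards

remaining = installed - 1                           # one linecard fails
needed_link = required_link_rate(remaining, R)      # rate each link now has to carry
extra_per_link = needed_link - fixed_link

available_total = remaining * fixed_link            # what the fixed mesh still offers
needed_total = remaining * needed_link              # what even spreading requires

print("fixed link rate:", fixed_link, "x R")
print("needed link rate after failure:", needed_link, "x R")
print("extra per link:", extra_per_link, "x R")
print("available per linecard:", available_total, "x R; needed:", needed_total, "x R")
```

The shortfall of 2R/6 per link is why a static fabric sized for N linecards stops working when one is missing.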
Solution: Use transparent MEMS switches
[Figure: Group/Rack 1 through Group/Rack G=40, each holding linecards 1, 2, …, L at rate 2R, connected through a bank of MEMS switches, each with ports 1 to G.]
Theorems:
1. Require L+G-1 MEMS switches
2. Polynomial time reconfiguration algorithm
MEMS switches reconfigured only when linecard added, removed or fails.
Hybrid Architecture: Logical View
[Figure: three-stage network. First stage: in each of Groups 1 to G, linecards 1 to L feed an LxM local switch. Middle stage: M GxG middle switches, numbered 1, 2, 3, …, M. Final stage: in each of Groups 1 to G, an MxL local switch feeds linecards 1 to L.]
Hybrid Electro-Optical Architecture
[Figure: the same three-stage network in hardware. In each of Groups 1 to G, linecards 1 to L feed an electronic LxM crossbar with fixed lasers. The middle stage is M static GxG MEMS switches, numbered 1, 2, 3, …, M. In each receiving group, optical receivers and an electronic MxL crossbar feed linecards 1 to L.]
Number of MEMS Switches
[Figure: linecards 1 to 4 grouped behind crossbars and connected through static MEMS switches; each link is labelled R.]
Number of MEMS Switches
[Figure: linecards 1 to 3 grouped behind crossbars and connected through static MEMS switches; link labels include R, R/3, 2R/3 and 4R/3.]
Number of MEMS needed for a schedule
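L_i: number of linecards in group i, 1 ≤ i ≤ G.
Group i needs to send to group j:
$$ (L_i R)\left(\frac{L_j}{N}\right), \quad \text{where } N = \sum_{i=1}^{G} L_i $$
Assume each group can send at most R to each MEMS. Number of MEMS needed between groups i and j:
$$ A_{ij} = \left\lceil \frac{1}{R}\,(L_i R)\left(\frac{L_j}{N}\right) \right\rceil = \left\lceil \frac{L_i L_j}{N} \right\rceil $$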
Number of MEMS needed for a schedule
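The number of MEMS needed for group i to send to group j is Aij.
The total number of MEMS needed for group i is the sum of the Aij’s:
$$ \sum_{j=1}^{G} A_{ij} = \sum_{j=1}^{G} \left\lceil \frac{L_i L_j}{N} \right\rceil < \sum_{j=1}^{G} \left( \frac{L_i L_j}{N} + 1 \right) = L_i + G $$
Since the sum is an integer:
$$ \sum_{j=1}^{G} A_{ij} \le L + G - 1, \quad \text{where } L = \max_i (L_i) $$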
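The bound can be sanity-checked by brute force on small configurations. The helper names below are introduced here for illustration. The last lines use 640 linecards split into G = 40 equal groups of 16, which is an assumed split: the slides give G = 40 and 640 linecards but not the group sizes.

```python
from itertools import product


def mems_matrix(group_sizes):
    """A[i][j] = ceil(L_i * L_j / N), using integer arithmetic only."""
    n = sum(group_sizes)
    return [[-((-li * lj) // n) for lj in group_sizes] for li in group_sizes]


def mems_per_group(group_sizes):
    """Total MEMS switches each group needs: the row sums of A."""
    return [sum(row) for row in mems_matrix(group_sizes)]


def bound(group_sizes):
    """L + G - 1, with L the largest group size and G the number of groups."""
    return max(group_sizes) + len(group_sizes) - 1


def bound_holds(group_sizes):
    return max(mems_per_group(group_sizes)) <= bound(group_sizes)


# Exhaustive check over every configuration with up to 4 groups of 1 to 6 linecards.
all_small_cases_ok = all(
    bound_holds(sizes)
    for g in range(1, 5)
    for sizes in product(range(1, 7), repeat=g)
)
print("bound holds on all small cases:", all_small_cases_ok)

equal_groups = [16] * 40  # assumed split of 640 linecards
print("MEMS needed:", max(mems_per_group(equal_groups)), "bound:", bound(equal_groups))
```

A passing check on small cases is not a proof; the inequality chain above is what establishes the bound in general.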
Constraints for the TDM Schedule
1. Latin Square: In any period N, each transmitting linecard is connected to each receiving linecard exactly once.
2. MEMS constraint: In any time-slot, there are at most Aij connections between transmitting group i and receiving group j, where:
$$ A_{ij} = \left\lceil \frac{L_i L_j}{N} \right\rceil $$
Example
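Assume L_1 = 3, L_2 = 2, L_3 = 1 (so N = 6).
Then
$$ [A_{ij}] = \left[ \left\lceil \frac{L_i L_j}{N} \right\rceil \right] = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} $$
E.g., at most 2 packets from the first group to the first group at each time-slot.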
Bad TDM Transmit Schedule
        t = 0   t = 1   t = 2   t = 3   t = 4   t = 5
LC 1      1       2       3       4       5       6
LC 2      6       1       2       3       4       5
LC 3      5       6       1       2       3       4
LC 4      4       5       6       1       2       3
LC 5      3       4       5       6       1       2
LC 6      2       3       4       5       6       1
Good TDM Transmit Schedule
        t = 0   t = 1   t = 2   t = 3   t = 4   t = 5
LC 1      1       2       3       4       5       6
LC 2      5       1       2       3       6       4
LC 3      6       5       4       1       2       3
LC 4      2       3       1       6       4       5
LC 5      4       6       5       2       3       1
LC 6      3       4       6       5       1       2
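Both schedules can be checked mechanically against the two constraints. The sketch below assumes the grouping of the example (L1=3, L2=2, L3=1) with linecards assigned to groups in order, so linecards 1 to 3 form group 1, linecards 4 and 5 form group 2, and linecard 6 forms group 3; the slides do not state this assignment explicitly. All function names are introduced here for illustration.

```python
def mems_matrix(group_sizes):
    """A[i][j] = ceil(L_i * L_j / N)."""
    n = sum(group_sizes)
    return [[-((-li * lj) // n) for lj in group_sizes] for li in group_sizes]


def group_index(group_sizes):
    """Map 1-based linecard number -> 0-based group, filling groups in order (assumed)."""
    index = {}
    linecard = 1
    for g, size in enumerate(group_sizes):
        for _ in range(size):
            index[linecard] = g
            linecard += 1
    return index


def check_schedule(schedule, group_sizes):
    """schedule[i][t] = destination of linecard i+1 in time slot t (1-based linecards).

    Returns (is_latin, violations); each violation is
    (time slot, sending group, receiving group, connections used), groups 1-based.
    """
    n = sum(group_sizes)
    everyone = set(range(1, n + 1))
    rows_ok = all(set(row) == everyone for row in schedule)
    slots_ok = all({row[t] for row in schedule} == everyone for t in range(n))

    limit = mems_matrix(group_sizes)
    group = group_index(group_sizes)
    g = len(group_sizes)
    violations = []
    for t in range(n):
        used = [[0] * g for _ in range(g)]
        for src, row in enumerate(schedule, start=1):
            used[group[src]][group[row[t]]] += 1
        for a in range(g):
            for b in range(g):
                if used[a][b] > limit[a][b]:
                    violations.append((t, a + 1, b + 1, used[a][b]))
    return rows_ok and slots_ok, violations


GROUP_SIZES = [3, 2, 1]
BAD = [
    [1, 2, 3, 4, 5, 6],
    [6, 1, 2, 3, 4, 5],
    [5, 6, 1, 2, 3, 4],
    [4, 5, 6, 1, 2, 3],
    [3, 4, 5, 6, 1, 2],
    [2, 3, 4, 5, 6, 1],
]
GOOD = [
    [1, 2, 3, 4, 5, 6],
    [5, 1, 2, 3, 6, 4],
    [6, 5, 4, 1, 2, 3],
    [2, 3, 1, 6, 4, 5],
    [4, 6, 5, 2, 3, 1],
    [3, 4, 6, 5, 1, 2],
]

bad_latin, bad_violations = check_schedule(BAD, GROUP_SIZES)
good_latin, good_violations = check_schedule(GOOD, GROUP_SIZES)
print("bad schedule:  Latin square =", bad_latin, "| MEMS violations =", len(bad_violations))
print("good schedule: Latin square =", good_latin, "| MEMS violations =", len(good_violations))
```

Under that grouping, both tables are Latin squares, so the difference lies entirely in the MEMS constraint: in the cyclic schedule, for instance, linecards 1, 2 and 3 all send within the first group at t = 2, which needs three connections where only two are allowed.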
Configuration Algorithm
1. Assign connections between groups, so MEMS constraint is satisfied.
2. Assign group connections to specific linecards, so there is exactly one connection per linecard pair in the schedule.
Comments: Algorithm is surprisingly complex. Best running time so far: 40 seconds for 640 linecards.