Scaling Internet Routers Using Optics Isaac Keslassy, Shang-Tse Da Chuang, Kyoungsik Yu, David Miller, Mark Horowitz, Olav Solgaard, Nick McKeown Department of Electrical Engineering Stanford University http://yuba/~keslassy/papers/
Jan 15, 2016
Backbone router capacity
[Figure: backbone router capacity, 1986 to 2004, log scale from 1Gb/s to 1Tb/s.]
• Router capacity per rack: 2x every 18 months
• Traffic: 2x every year
Extrapolating
[Figure: extrapolation, 2003 to 2015, log scale from 1Tb/s to 100Tb/s.]
• Router capacity: 2x every 18 months
• Traffic: 2x every year
• 2015: 16x disparity
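The 16x figure follows from compounding the two growth rates over the 12 years from 2003 to 2015. A minimal sketch of that arithmetic, assuming both curves start level in 2003 (the function name is illustrative, not from the slides):

```python
def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years` when doubling every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)

years = 2015 - 2003                      # 12 years of extrapolation
traffic = growth_factor(years, 1.0)      # traffic doubles every year
capacity = growth_factor(years, 1.5)     # router capacity doubles every 18 months
disparity = traffic / capacity           # 2**12 / 2**8

print(traffic, capacity, disparity)
```

Traffic grows by 2^12 and capacity by 2^8, so the gap is 2^4 = 16.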
Consequence
Unless something changes, operators will need:
• 16 times as many routers,
• consuming 16 times as much space,
• 256 times the power,
• costing 100 times as much.
Actually need more than that…
Stanford 100Tb/s Internet Router
Goal: Study scalability
• Challenging, but not impossible
• Two orders of magnitude faster than deployed routers
• We will build components to show feasibility
[Figure: electronic linecards #1 to #625 connected through an optical switch. Each linecard performs line termination, IP packet processing and packet buffering. Rate labels: 40Gb/s, 160Gb/s, 160-320Gb/s.]
100Tb/s = 640 * 160Gb/s
Throughput Guarantees
Operators increasingly demand throughput guarantees:
• To maximize use of expensive long-haul links
• For predictability and planning
Despite lots of effort and theory, no commercial router today has a throughput guarantee.
Requirements of our router
• 100Tb/s capacity
• 100% throughput for all traffic
• Must work with any set of linecards present
• Use technology available within 3 years
• Conform to RFC 1812
What limits router capacity?
[Figure: approximate power consumption per rack, in kW (axis 0 to 12), 1990 to 2003.]
Power density is the limiting factor today
Trend: Multi-rack routers
Reduces power density
[Figure: a rack holding crossbar and linecards, next to a multi-rack layout with a switch and separate linecards. Examples shown: Alcatel 7670 RSP, Juniper TX8/T640, Chiaro, Avici TSR.]
Limits to scaling
Overall power is dominated by linecards
• Sheer number
• Optical WAN components
• Per packet processing and buffering.
But power density is dominated by switch fabric
Multi-rack routers
[Figure: linecards with WAN In and Out connected to a separate switch fabric.]
Limit today ~2.5Tb/s
• Electronics
• Scheduler scales <2x every 18 months
• Opto-electronic conversion
Question
Instead, can we use an optical fabric at 100Tb/s with 100% throughput?
Conventional answer: No.
• Need to reconfigure switch too often
• 100% throughput requires complex electronic scheduler.
Outline
• How to guarantee 100% throughput?
• How to eliminate the scheduler?
• How to use an optical switch fabric?
• How to make it scalable and practical?
[Figure: N inputs and N outputs, each at rate R, with every input linked to every output at rate R.]
Router capacity = NR
Switch capacity = N²R
100% Throughput?
[Figure: N inputs and N outputs at rate R; the rates needed on the links between them are marked "?".]
If traffic is uniform
[Figure: N inputs and N outputs at rate R, meshed with links at rate R/N. Each input's rate R splits into N flows of R/N.]
Real traffic is not uniform
[Figure: the same mesh with links at rate R/N under non-uniform traffic; whether each input's rate R can be carried is marked "?".]
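Why uniform traffic fits a mesh of R/N links while non-uniform traffic does not can be checked with a small fluid calculation. This is only a sketch: the traffic matrices and the function name are illustrative assumptions, and rates are normalized so that R = 1.

```python
from fractions import Fraction

def mesh_throughput(demand, line_rate):
    """Fraction of offered traffic carried by a full mesh whose links run at line_rate / N.

    demand[i][j] is the rate offered from input i to output j.
    """
    n = len(demand)
    link_rate = Fraction(line_rate) / n
    offered = sum(sum(row) for row in demand)
    # Each input-output pair can carry at most one link's worth of traffic.
    carried = sum(min(Fraction(d), link_rate) for row in demand for d in row)
    return carried / offered

N, R = 4, 1
uniform = [[Fraction(R, N)] * N for _ in range(N)]
# Hypothetical non-uniform pattern: input i sends everything to output (i + 1) mod N.
permutation = [[R if j == (i + 1) % N else 0 for j in range(N)] for i in range(N)]

print(mesh_throughput(uniform, R))      # 1
print(mesh_throughput(permutation, R))  # 1/4
```

Under the permutation pattern only one link per input is used, so throughput drops to 1/N.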
Two-stage load-balancing switch
[Figure: N inputs and N outputs at rate R joined by two meshes of links at rate R/N: a load-balancing stage followed by a switching stage.]
100% throughput for weakly mixing, stochastic traffic. [C.-S. Chang, Valiant]
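The intuition behind the two-stage result can be sketched with a fluid model: if the first stage spreads every input's traffic equally over all intermediate ports, the load on each second-stage link depends only on the column sums of the traffic matrix. This is an illustration under that idealized spreading assumption, not the proof cited on the slide; the function name and the traffic pattern are hypothetical.

```python
from fractions import Fraction

def second_stage_link_load(demand):
    """Load on each intermediate-to-output link after uniform spreading.

    The first stage sends 1/N of every input's traffic to each intermediate
    port, so intermediate k forwards sum_i demand[i][j] / N to output j.
    """
    n = len(demand)
    column_sums = [sum(Fraction(demand[i][j]) for i in range(n)) for j in range(n)]
    return [[column_sums[j] / n for j in range(n)] for _k in range(n)]

N, R = 4, 1
# Hypothetical worst case for a single mesh: input i sends rate R to output (i + 1) mod N.
permutation = [[R if j == (i + 1) % N else 0 for j in range(N)] for i in range(N)]
loads = second_stage_link_load(permutation)
print(max(max(row) for row in loads))  # 1/4, i.e. R/N
```

As long as no output is oversubscribed (column sums at most R), no second-stage link needs more than R/N.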
[Figure: two animation frames of the two-stage switch with links at rate R/N; packets labelled 3 pass through intermediate ports 1, 2, 3.]
Chang’s load-balanced switch
Good properties
1. 100% throughput for broad class of traffic
2. No scheduler needed
   Scalable

Chang’s load-balanced switch
Bad properties
1. Packet mis-sequencing
2. Pathological traffic patterns
   Throughput 1/N-th of capacity
3. Uses two switch fabrics
   Hard to package
4. Doesn’t work with some linecards missing
   Impractical

FOFF: Load-balancing algorithm
• Packet sequence maintained
• No pathological patterns
• 100% throughput - always
• Delay within bound of ideal
(See paper for details)
Single Mesh Switch
[Figure: N inputs and N outputs at rate R joined by a single mesh of links at rate 2R/N.]
Packaging
[Figure: each linecard holds one In and one Out port at rate R; linecards are connected across the backplane by links at rate 2R/N.]
Many fabric options
Options
• Space: Full uniform mesh
• Time: Round-robin crossbar
• Wavelength: Static WDM
[Figure: linecards (In, Out) attached to any permutation network that cycles through configurations C1, C2, …, CN.]
N channels each at rate 2R/N
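The time option can be sketched as a crossbar that cycles through N fixed configurations, so every input-output pair is connected for 1/N of the time. The cyclic-shift configurations below are an assumption chosen for illustration; any sequence of N permutations that connects each pair exactly once per cycle gives the same rates.

```python
def round_robin_schedule(n):
    """Configurations C1..CN: in slot t, input i is connected to output (i + t) mod n."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]

def channel_rates(n, fabric_rate):
    """Average rate of each input-output channel over one cycle of n slots."""
    slots = [[0] * n for _ in range(n)]
    for config in round_robin_schedule(n):
        for i, j in enumerate(config):
            slots[i][j] += 1
    return [[fabric_rate * s / n for s in row] for row in slots]

N, R = 8, 160.0                      # R in Gb/s; N = 8 chosen only for illustration
rates = channel_rates(N, 2 * R)      # each fabric port runs at 2R
print(rates[0][0])                   # 40.0 Gb/s = 2R/N
```

No scheduler is involved: the sequence of configurations is fixed and independent of the traffic.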
Static WDM switching
Arrayed Waveguide Grating Router (AWGR)
Passive and Almost Zero Power
[Figure: 4 x 4 AWGR with ports A, B, C, D. On one side each fiber carries channels A, B, C, D; on the other side the fibers carry A, A, A, A; B, B, B, B; C, C, C, C; D, D, D, D.]
4 WDM channels, each at rate 2R/N
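A static wavelength-routed mesh can be sketched with a cyclic routing rule. The rule below (wavelength index w entering port i leaves on port (i + w) mod N) is an assumed convention for illustration; real devices may number ports and wavelengths differently, and the function names are hypothetical.

```python
def awgr_output(input_port, wavelength, n):
    """Assumed cyclic rule: wavelength index w entering port i leaves on port (i + w) mod n."""
    return (input_port + wavelength) % n

def wavelength_for(src, dst, n):
    """Wavelength a source linecard uses to reach a destination under the rule above."""
    return (dst - src) % n

N = 4
for dst in range(N):
    arriving = [wavelength_for(src, dst, N) for src in range(N)]
    # Every output sees N distinct wavelengths, one from each input.
    print(dst, sorted(arriving))
```

Because the mapping is fixed, the device itself needs no control or power; each linecard picks the destination by choosing the wavelength.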
Linecard dataflow
[Figure: linecard dataflow in four numbered steps. Input at rate R, WDM channels 1 to N to and from the fabric, output at rate R.]
Problems of scale
For N < 64, WDM is a good solution. We want N = 640. Need to decompose.
Decomposing the mesh
[Figure: 8 x 8 mesh, linecards 1 to 8 on each side, links at rate 2R/8.]
Decomposing the mesh
[Figure: 8 x 8 mesh, linecards 1 to 8 on each side, with link rates 2R/4 and 2R/8 and the labels TDM and WDM.]
When N is too large
Decompose into groups (or racks)
[Figure: Group/Rack 1 to Group/Rack G, each with linecards 1, 2, …, L at rate 2R, connected through an Arrayed Waveguide Grating Router (AWGR) with ports 1, 2, …, G.]
When a linecard is missing
Each linecard spreads its data equally over every other linecard.
Problem: If one is missing, or failed, then the spreading no longer works.
When a linecard fails
[Figure: three linecards (In, Out at rate R) joined by links at rate 2R/3, with one linecard failed.]
2R/3 + 2R/3 = 4R/3
2R/3 + 2R/6 + 2R/3 + 2R/6 = 2R
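One reading of the arithmetic in this figure: with channels fixed at 2R/3 and one of three linecards gone, each remaining linecard reaches only 4R/3 of fabric capacity, and sharing the freed channel equally restores the full 2R. The sketch below encodes that reading; it is an interpretation, and both function names are hypothetical.

```python
from fractions import Fraction

def capacity_fixed_mesh(n_design, n_present, line_rate):
    """Fabric capacity per linecard when each channel is fixed at 2R / n_design."""
    channel = Fraction(2 * line_rate, n_design)
    return n_present * channel

def capacity_after_reallocation(n_design, n_present, line_rate):
    """Capacity when the channels of missing linecards are shared equally among those present."""
    channel = Fraction(2 * line_rate, n_design)
    freed = (n_design - n_present) * channel
    return n_present * (channel + freed / n_present)

R = 1
print(capacity_fixed_mesh(3, 2, R))          # 4/3, i.e. 2R/3 + 2R/3
print(capacity_after_reallocation(3, 2, R))  # 2, i.e. 2R
```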
Solution:
1. Move light beams
   Replace AWGR with MEMS switch. Reconfigure when linecard added, removed or fails.
2. Finer channel granularity
   Multiple paths.
Solution
Use transparent MEMS switches
[Figure: Group/Rack 1 to Group/Rack G=40, each with linecards 1, 2, …, L at rate 2R, connected through MEMS switches with ports 1 to G.]
Theorems:
1. Require L+G-1 MEMS switches
2. Polynomial time reconfiguration algorithm
MEMS switches reconfigured only when linecard added, removed or fails.
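A quick check of the sizing, taking L = 16 linecards per rack, G = 40 racks and R = 160Gb/s as quoted elsewhere in this deck. The variable names are illustrative.

```python
L = 16            # linecards per group/rack
G = 40            # groups/racks
R_GBPS = 160      # line rate per linecard, Gb/s

linecards = L * G
mems_switches = L + G - 1
capacity_tbps = linecards * R_GBPS / 1000

print(linecards, mems_switches, capacity_tbps)  # 640 55 102.4
```

The count of 55 switches is consistent with the 55 that appears in the 16 x 55 opto-electronic crossbar, and 640 linecards at 160Gb/s give slightly more than the nominal 100Tb/s.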
Challenges
[Figure: linecard dataflow with R=160Gb/s in four numbered steps: address lookup, packet switch, WDM channels 1 to G to and from the fabric, input and output at rate R.]
• How to build a 250ms 160Gb/s buffer?
• Low-cost, low-power optoelectronic conversion?
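The size of the buffer in question follows from rate times delay. A minimal sketch of the arithmetic (the function name is illustrative):

```python
def buffer_bits(rate_bps, delay_s):
    """Bits needed to hold `delay_s` seconds of traffic arriving at `rate_bps`."""
    return rate_bps * delay_s

rate = 160e9          # 160 Gb/s
delay = 0.250         # 250 ms
bits = buffer_bits(rate, delay)
print(bits / 1e9, "Gbit")        # 40.0 Gbit
print(bits / 8 / 1e9, "GB")      # 5.0 GB
```

So each linecard would need about 40 Gbit of packet memory that can be written and read at line rate.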
What we are building
Chip #1: 160Gb/s Packet Buffer
[Figure: buffer manager (90nm ASIC) with 250ms DRAM. Rate labels: 160Gb/s, 160Gb/s, 320Gb/s.]
Chip #2: 16 x 55 Opto-electronic crossbar
[Figure: CMOS ASIC with 16 x 10Gb/s to linecards and 55 x 10Gb/s to optical fabric; optical source, optical detector, optical modulator.]
100Tb/s Load-Balanced Router
[Figure: Linecard Rack 1 to Linecard Rack G = 40, each with L = 16 160Gb/s linecards, connected to a Switch Rack (< 100W) of 40 x 40 MEMS switches numbered 1, 2, …, 55, 56.]