Scaling Internet Routers Using Optics

UW, October 16th, 2003

Nick McKeown

Joint work with research groups of: David Miller, Mark Horowitz, Olav Solgaard. Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu.

Department of Electrical Engineering, Stanford University

Paper: http://klamath.stanford.edu/~nickm/papers/sigcomm2003.pdf

Web site: http://klamath.stanford.edu/or

Backbone router capacity

[Figure: capacity on a log scale (1Gb/s to 1Tb/s) versus year, 1986 to 2004.]

Router capacity per rack: 2x every 18 months

Traffic: 2x every year

Extrapolating

[Figure: capacity on a log scale (1Tb/s to 100Tb/s) versus year, 2003 to 2015.]

Router capacity: 2x every 18 months

Traffic: 2x every year

2015: 16x disparity
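The 16x figure follows from compounding the two growth rates over the 12 years from 2003 to 2015. A minimal sketch, assuming exact doubling periods and a common starting point in 2003 (the function name `growth` is illustrative, not from the talk):

```python
def growth(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` for a quantity that doubles every period."""
    return 2.0 ** (years / doubling_period_years)

years = 2015 - 2003
traffic = growth(years, 1.0)    # traffic doubles every year
capacity = growth(years, 1.5)   # router capacity doubles every 18 months
disparity = traffic / capacity  # how far traffic outruns capacity

print(f"traffic x{traffic:.0f}, capacity x{capacity:.0f}, disparity x{disparity:.0f}")
```

Traffic grows by 2^12 and capacity by 2^8, so the gap is 2^4 = 16.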

Consequence

Unless something changes, operators will need:
• 16 times as many routers,
• consuming 16 times as much space,
• 256 times the power,
• costing 100 times as much.

Actually need more than that…

Stanford 100Tb/s Internet Router

Goal: Study scalability
• Challenging, but not impossible
• Two orders of magnitude faster than deployed routers
• We will build components to show feasibility

[Figure: electronic linecards #1 to #625, each doing line termination, IP packet processing and packet buffering, connected through an optical switch. Link labels: 40Gb/s, 160Gb/s, 160-320Gb/s.]

100Tb/s = 640 * 160Gb/s

Throughput Guarantees

Operators increasingly demand throughput guarantees:
• To maximize use of expensive long-haul links
• For predictability and planning

Despite lots of effort and theory, no commercial router today has a throughput guarantee.

Requirements of our router

• 100Tb/s capacity
• 100% throughput for all traffic
• Must work with any set of linecards present
• Use technology available within 3 years
• Conform to RFC 1812

What limits router capacity?

[Figure: approximate power consumption per rack, in kW (0 to 12), for 1990 to 2003.]

Power density is the limiting factor today

Trend: Multi-rack routers

Reduces power density

[Figure: a single rack holding crossbar and linecards, versus separate switch and linecard racks.]

Examples: Alcatel 7670 RSP, Juniper TX8/T640, Chiaro, Avici TSR

Limits to scaling

Overall power is dominated by linecards:
• Sheer number
• Optical WAN components
• Per-packet processing and buffering

But power density is dominated by switch fabric

Multi-rack routers

[Figure: linecards with WAN In/Out ports, connected to a separate switch fabric.]

Limit today ~2.5Tb/s:
• Electronics
• Scheduler scales <2x every 18 months
• Opto-electronic conversion

Question

Instead, can we use an optical fabric at 100Tb/s with 100% throughput?

Conventional answer: No.
• Need to reconfigure switch too often
• 100% throughput requires complex electronic scheduler.

Outline

1. How to guarantee 100% throughput?
2. How to eliminate the scheduler?
3. How to use an optical switch fabric?
4. How to make it scalable and practical?

[Figure: N inputs and N outputs, each at rate R, joined by a full mesh of links at rate R.]

Router capacity = NR

Switch capacity = N^2 R

100% Throughput?

[Figure: the same full mesh between inputs and outputs at rate R, with the rate of each mesh link marked "?".]

If traffic is uniform

[Figure: with uniform traffic, each mesh link needs only rate R/N.]

Real traffic is not uniform

[Figure: the mesh with links of rate R/N, offered non-uniform traffic of up to rate R on a single input-output pair.]

Two-stage load-balancing switch

[Figure: two meshes in series, each with links of rate R/N: a load-balancing stage followed by a switching stage. Inputs and outputs run at rate R.]

100% throughput for weakly mixing, stochastic traffic. [C.-S. Chang, Valiant]
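The spreading effect of the first stage can be sketched in a few lines. This is only an illustration, not the proof behind the throughput result: it assumes the load-balancing stage cycles through a fixed round-robin sequence of connections, and that every input is fully loaded with traffic for a single output. The function name and the traffic pattern are hypothetical.

```python
from collections import Counter

def intermediate_arrivals(dest_of_input: list[int], n_slots: int) -> Counter:
    """Count packets reaching each (intermediate port, output) queue.

    Assumption: in time-slot t the first stage connects input i to
    intermediate port (i + t) mod N, regardless of packet destinations.
    """
    n = len(dest_of_input)
    arrivals: Counter = Counter()
    for t in range(n_slots):
        for i, dest in enumerate(dest_of_input):
            mid = (i + t) % n           # fixed round-robin, no scheduler
            arrivals[(mid, dest)] += 1  # packet waits at `mid` for `dest`
    return arrivals

# Highly non-uniform traffic: each input sends everything to one output.
dest_of_input = [2, 0, 3, 1]
arrivals = intermediate_arrivals(dest_of_input, n_slots=40)
print(len(arrivals), sorted(set(arrivals.values())))
```

With N = 4 and 40 time-slots, each of the 16 queues receives 10 packets, a rate of 1/N per queue. That is the load a second stage with links of rate R/N can carry.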

[Figure: animation of packets destined to output 3 being spread over the intermediate ports (1, 2, 3) by the load-balancing stage, then delivered by the switching stage. All mesh links run at rate R/N.]

Chang’s load-balanced switch: Good properties

1. 100% throughput for broad class of traffic

2. No scheduler needed: scalable

Chang’s load-balanced switch: Bad properties

1. Packet mis-sequencing

2. Pathological traffic patterns: throughput 1/N-th of capacity

3. Uses two switch fabrics: hard to package

4. Doesn’t work with some linecards missing: impractical

FOFF: Load-balancing algorithm
• Packet sequence maintained
• No pathological patterns
• 100% throughput - always
• Delay within bound of ideal
(See paper for details)

Single Mesh Switch

[Figure: the two stages combined into a single mesh. Inputs and outputs run at rate R; each mesh link runs at rate 2R/N.]

Packaging

[Figure: each linecard holds one input and one output at rate R. Linecards are interconnected across the backplane by mesh links of rate 2R/N.]

Many fabric options

Options:
• Space: Full uniform mesh
• Time: Round-robin crossbar
• Wavelength: Static WDM

[Figure: N linecards (In/Out) connected through any permutation network that cycles through configurations C1, C2, …, CN.]

N channels each at rate 2R/N

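Static WDM switching

Array Waveguide Router (AWGR): passive and almost zero power

[Figure: a 4 x 4 AWGR serving linecards A, B, C and D. Each port on one side carries four wavelengths from a single linecard (A, A, A, A; B, B, B, B; C, C, C, C; D, D, D, D); each port on the other side carries one wavelength from every linecard (A, B, C, D).]

4 WDM channels, each at rate 2R/N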
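The static permutation performed by an AWGR can be sketched as follows. The cyclic rule used here, where wavelength w on input port i leaves on output port (i + w) mod N, is an assumed convention for illustration; real devices differ in port and wavelength numbering.

```python
def awgr_output(in_port: int, wavelength: int, n_ports: int) -> int:
    """Output port for a given input port and wavelength index (assumed cyclic rule)."""
    return (in_port + wavelength) % n_ports

n_ports = 4
names = "ABCD"  # one linecard per input port, as in the figure
outputs = {port: [] for port in range(n_ports)}

# Every linecard transmits on all N wavelengths at once.
for in_port in range(n_ports):
    for wavelength in range(n_ports):
        out_port = awgr_output(in_port, wavelength, n_ports)
        outputs[out_port].append(names[in_port])

for port, sources in outputs.items():
    print(port, sources)
```

Each output port collects exactly one channel from each of A, B, C and D, which is the full uniform mesh built from a single passive device.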

Linecard dataflow

[Figure: dataflow through one linecard in numbered steps 1 to 4, between the input at rate R, the WDM transmitters and receivers (channels 1 to N), and the output at rate R.]

Problems of scale

For N < 64, WDM is a good solution. We want N = 640. Need to decompose.
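A quick calculation shows how the mesh changes with N. It uses R = 160Gb/s from the linecard figures and the 2R/N channel rate of the single-mesh design; the helper name is hypothetical. One reading of the N < 64 limit, offered as an assumption, is that static WDM needs one wavelength per linecard.

```python
R_GBPS = 160  # linecard rate R used in the talk

def mesh_channel_rate_gbps(n_linecards: int, r_gbps: float = R_GBPS) -> float:
    """Rate of each of the N channels leaving a linecard in the single mesh."""
    return 2 * r_gbps / n_linecards

for n in (64, 640):
    rate = mesh_channel_rate_gbps(n)
    print(f"N = {n}: {n} channels per linecard, each at {rate} Gb/s")
```

At N = 640 each linecard would need 640 separate channels of only 0.5Gb/s each, which motivates grouping linecards before switching optically.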

Decomposing the mesh

[Figure: an 8 x 8 uniform mesh between linecards 1 to 8, with links of rate 2R/8.]

[Figure: the same mesh decomposed, with links labelled 2R/4 and 2R/8 and the two parts labelled TDM and WDM.]

When N is too large

Decompose into groups (or racks)

[Figure: groups/racks 1 to G, each holding linecards 1 to L sending at 2R. Each group connects on wavelengths 1, 2, …, G through an Array Waveguide Router (AWGR).]

When a linecard is missing

Each linecard spreads its data equally over every other linecard.

Problem: If one is missing, or failed, then the spreading no longer works.

When a linecard fails

[Figure: three linecards in a mesh with links of rate 2R/3. With one linecard gone, each remaining link would need 2R/3 + 2R/6, since 2R/3 + 2R/6 + 2R/3 + 2R/6 = 2R, but the existing links provide only 2R/3 + 2R/3 = (4/3)R.]

Solution:
1. Move light beams
   • Replace AWGR with MEMS switch.
   • Reconfigure when linecard added, removed or fails.
2. Finer channel granularity
   • Multiple paths.

Solution: Use transparent MEMS switches

[Figure: groups/racks 1 to G = 40, each holding linecards 1 to L sending at 2R, connected through a set of G x G MEMS switches.]

Theorems:
1. Require L+G-1 MEMS switches
2. Polynomial time reconfiguration algorithm

MEMS switches reconfigured only when linecard added, removed or fails.

Hybrid Architecture: Logical View

[Figure: three-stage view. First stage: in each of groups 1 to G, linecards 1 to L feed an L x M local switch. Middle stage: M middle switches, each G x G. Final stage: in each group, an M x L local switch feeds linecards 1 to L.]

Hybrid Electro-Optical Architecture

[Figure: the same three stages in hardware. Electronic switches: an L x M crossbar per group, driving fixed lasers. Static MEMS: M optical switches, each G x G. Optical receivers feed an M x L electronic crossbar per group, which delivers to linecards 1 to L.]

Number of MEMS Switches

[Figure: linecards connected through electronic crossbars and static MEMS switches. In the first example (four linecards), every link is labelled R. In the second example (three linecards), the links are labelled 4R/3, 2R/3, R/3 and R.]

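Number of MEMS needed for a schedule

$L_i$: number of linecards in group i, 1 ≤ i ≤ G.

Group i needs to send to group j:

$$ (L_i)\left(\frac{L_j}{N}\right) R, \quad \text{where } N = \sum_{i=1}^{G} L_i $$

Assume each group can send at most R to each MEMS. Number of MEMS needed between groups i and j:

$$ A_{ij} = \left\lceil (L_i)\left(\frac{L_j}{N}\right) R \cdot \frac{1}{R} \right\rceil = \left\lceil \frac{L_i L_j}{N} \right\rceil $$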

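Number of MEMS needed for a schedule

The number of MEMS needed for group i to send to group j is $A_{ij}$.

The total number of MEMS needed for group i is the sum of the $A_{ij}$'s:

$$ \sum_{j=1}^{G} A_{ij} = \sum_{j=1}^{G} \left\lceil \frac{L_i L_j}{N} \right\rceil < \sum_{j=1}^{G} \left( \frac{L_i L_j}{N} + 1 \right) = L_i + G $$

$$ \Rightarrow \sum_{j=1}^{G} A_{ij} \le L + G - 1, \quad \text{where } L = \max_i (L_i) $$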
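The bound can be checked numerically. The sketch below takes $A_{ij}$ as the ceiling of $L_i L_j / N$ and compares each group's total against L + G - 1. The uniform case uses the G = 40, L = 16 configuration from the talk; the uneven group sizes are a made-up example, and the function names are illustrative.

```python
def mems_needed_per_group(group_sizes: list[int]) -> list[int]:
    """Sum over j of ceil(L_i * L_j / N) for every group i."""
    n = sum(group_sizes)
    return [
        sum((li * lj + n - 1) // n for lj in group_sizes)  # integer ceiling
        for li in group_sizes
    ]

def mems_bound(group_sizes: list[int]) -> int:
    """The L + G - 1 bound, with L the largest group size."""
    return max(group_sizes) + len(group_sizes) - 1

uniform = [16] * 40       # G = 40 racks of L = 16 linecards
uneven = [16, 16, 3, 1]   # hypothetical partially populated router

for sizes in (uniform, uneven):
    print(max(mems_needed_per_group(sizes)), "<=", mems_bound(sizes))
```

For the uniform router every group needs 40 switches against a bound of 55. The uneven example needs 19, which meets its bound exactly, so the bound is tight for some configurations.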

Constraints for the TDM Schedule

1. Latin Square: In any period N, each transmitting linecard is connected to each receiving linecard exactly once.

2. MEMS constraint: In any time-slot, there are at most $A_{ij}$ connections between transmitting group i and receiving group j, where:

$$ A_{ij} = \left\lceil \frac{L_i L_j}{N} \right\rceil $$

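Example

Assume L1=3, L2=2, L3=1 (so N = 6)

Then

$$ A = \left( \left\lceil \frac{L_i L_j}{N} \right\rceil \right)_{ij} = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} $$

E.g., at most 2 packets from the first group to the first group at each time-slot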

Bad TDM Transmit Schedule

        t=0  t=1  t=2  t=3  t=4  t=5
LC 1     1    2    3    4    5    6
LC 2     6    1    2    3    4    5
LC 3     5    6    1    2    3    4
LC 4     4    5    6    1    2    3
LC 5     3    4    5    6    1    2
LC 6     2    3    4    5    6    1

Good TDM Transmit Schedule

        t=0  t=1  t=2  t=3  t=4  t=5
LC 1     1    2    3    4    5    6
LC 2     5    1    2    3    6    4
LC 3     6    5    4    1    2    3
LC 4     2    3    1    6    4    5
LC 5     4    6    5    2    3    1
LC 6     3    4    6    5    1    2
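Both schedules can be tested against the two constraints. The sketch below rests on two assumptions that the slides imply but do not state: each table entry is the receiving linecard for that row's transmitting linecard in that time-slot, and the groups follow the earlier example (LC 1-3 in group 1, LC 4-5 in group 2, LC 6 in group 3). All function names are illustrative.

```python
def mems_limits(group_sizes):
    """A[i][j] = ceil(L_i * L_j / N)."""
    n = sum(group_sizes)
    return [[(li * lj + n - 1) // n for lj in group_sizes] for li in group_sizes]

def is_latin_square(schedule):
    """Every row and every column contains each linecard exactly once."""
    full = set(range(1, len(schedule) + 1))
    rows_ok = all(set(row) == full for row in schedule)
    cols_ok = all({row[t] for row in schedule} == full for t in range(len(schedule)))
    return rows_ok and cols_ok

def mems_violations(schedule, group_sizes):
    """List (time-slot, tx group, rx group, connections) exceeding A[i][j]."""
    group_of = {}
    linecard = 1
    for group, size in enumerate(group_sizes):
        for _ in range(size):
            group_of[linecard] = group
            linecard += 1
    limits = mems_limits(group_sizes)
    violations = []
    for t in range(len(schedule[0])):
        used = {}
        for tx, row in enumerate(schedule, start=1):
            key = (group_of[tx], group_of[row[t]])
            used[key] = used.get(key, 0) + 1
        for (gi, gj), count in used.items():
            if count > limits[gi][gj]:
                violations.append((t, gi + 1, gj + 1, count))
    return violations

BAD = [[1, 2, 3, 4, 5, 6], [6, 1, 2, 3, 4, 5], [5, 6, 1, 2, 3, 4],
       [4, 5, 6, 1, 2, 3], [3, 4, 5, 6, 1, 2], [2, 3, 4, 5, 6, 1]]
GOOD = [[1, 2, 3, 4, 5, 6], [5, 1, 2, 3, 6, 4], [6, 5, 4, 1, 2, 3],
        [2, 3, 1, 6, 4, 5], [4, 6, 5, 2, 3, 1], [3, 4, 6, 5, 1, 2]]
GROUPS = [3, 2, 1]

print(is_latin_square(BAD), mems_violations(BAD, GROUPS)[:1])
print(is_latin_square(GOOD), mems_violations(GOOD, GROUPS))
```

Under these assumptions both schedules are Latin squares, but only the good one respects the MEMS constraint. The bad one needs two connections from group 2 to group 2 at t = 1, where only one is allowed.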

Configuration Algorithm

1. Assign connections between groups, so MEMS constraint is satisfied.

2. Assign group connections to specific linecards, so there is exactly one connection per linecard pair in the schedule.

Comments:
• Algorithm is surprisingly complex.
• Best running time so far: 40 seconds for 640 linecards.

Challenges

[Figure: linecard dataflow at R = 160Gb/s in numbered steps 1 to 4, showing address lookup on the input, a packet switch, and WDM channels 1 to G to and from the fabric.]

• How to build a 250ms 160Gb/s buffer?
• Low-cost, low-power optoelectronic conversion?
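The size of the buffer challenge follows from the two numbers on the slide. A minimal sketch, assuming the buffer must hold 250ms of traffic arriving at the full 160Gb/s line rate (variable names are illustrative):

```python
RATE_BPS = 160 * 10**9   # 160Gb/s line rate
BUFFER_MS = 250          # buffering depth in milliseconds

buffer_bits = RATE_BPS * BUFFER_MS // 1000
buffer_bytes = buffer_bits // 8

print(f"{buffer_bits / 1e9:.0f} Gbit = {buffer_bytes / 1e9:.0f} GB per linecard")
```

That is 40Gbit, or 5GB, of memory per linecard, which must also sustain 160Gb/s of writes and 160Gb/s of reads at the same time.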

What we are building

Chip #1: 160Gb/s Packet Buffer
• Buffer Manager: 90nm ASIC
• 250ms DRAM
• Interfaces: 160Gb/s, 160Gb/s, 320Gb/s

Chip #2: 16 x 55 Opto-electronic crossbar
• CMOS ASIC
• 16 x 10Gb/s to linecards
• 55 x 10Gb/s and 55 x 10Gb/s to optical fabric
• 1500nm optical source, optical modulator, optical detector

100Tb/s Load-Balanced Router

[Figure: linecard racks 1 to G = 40, each holding L = 16 linecards at 160Gb/s, connected to a switch rack (< 100W) of 40 x 40 MEMS switches numbered 1, 2, …, 55, 56.]
