The Fork-Join Router
Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science, Stanford University
[email protected]
http://www.stanford.edu/~nickm
Jan 13, 2016
High Performance Switching and Routing, Telecom Center Workshop: Sept 4, 1997
Outline
• Quick Background on Packet Switches
• What's the problem? "What if data rates exceed memory bandwidth?"
• The Fork-Join Router
• Parallel Packet Switches
First Generation Packet Switches

[Figure: a shared backplane connects a CPU with buffer memory to several line interfaces, each with its own DMA engine and MAC. Fixed-length "DMA" blocks or cells cross the backplane and are reassembled on the egress linecard; the external links carry fixed-length cells or variable-length packets.]
Second Generation Packet Switches

[Figure: a shared backplane with a CPU and central buffer memory; each line card now has its own DMA engine, MAC, and local buffer memory.]
Third Generation Packet Switches

[Figure: a switched backplane connects line cards (MAC plus local buffer memory) and a CPU card (CPU plus memory); each line interface attaches to the switched fabric directly.]
Fourth Generation Packet Switches
Two Basic Techniques

• Input-queued crossbar: 1 + 1 = 2 memory operations per cell time.
• Shared memory: N + N = 2N memory operations per cell time.
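The operation counts above can be sketched as a tiny calculation (illustrative only; the port count N = 32 is an arbitrary example, not from the slides):

```python
# Memory operations per cell time, per buffer memory:
# input-queued crossbar -> each port's buffer sees 1 write + 1 read;
# shared memory -> the single buffer sees N writes + N reads.
def memory_ops_per_cell_time(n_ports: int, architecture: str) -> int:
    if architecture == "input-queued":
        return 1 + 1
    if architecture == "shared-memory":
        return n_ports + n_ports
    raise ValueError(architecture)

N = 32
print(memory_ops_per_cell_time(N, "input-queued"))   # 2
print(memory_ops_per_cell_time(N, "shared-memory"))  # 64
```

The point of the comparison: the shared-memory requirement grows linearly with the port count, while the input-queued requirement does not.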
Shared Memory: The Ideal

[Figure: a single shared memory holding per-output queues of cells.]

A large body of work has proven and made possible:
– Fairness
– Delay guarantees
– Delay-variation control
– Loss guarantees
– Statistical guarantees
Precise Emulation of an Output-Queued Switch

[Figure: an N-port output-queued switch compared with an N-port combined input- and output-queued switch driven by a scheduler. Question: can the two behave identically?]

Theorem: A speedup of 2 - 1/N is necessary and sufficient for a combined input- and output-queued switch to precisely emulate an output-queued switch for all traffic.

Joint work with Balaji Prabhakar at Stanford.
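The bound in the theorem can be tabulated to see that the required speedup always stays below 2 (a simple illustration, not part of the original slides):

```python
# Speedup required by the theorem for an N-port CIOQ switch
# to precisely emulate an output-queued switch: S = 2 - 1/N.
def required_speedup(n_ports: int) -> float:
    return 2 - 1 / n_ports

for n in (2, 4, 16, 256):
    print(n, required_speedup(n))
```

As N grows the bound approaches, but never reaches, a speedup of 2.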
Outline
• Quick Background on Packet Switches
• What's the problem? "What if data rates exceed memory bandwidth?"
• The Fork-Join Router
• Parallel Packet Switches
Buffer Memory: How Fast Can I Make a Packet Buffer?

[Figure: a buffer memory built from 5 ns SRAM, with a 64-byte-wide bus on the write side and a 64-byte-wide bus on the read side.]

Rough estimate:
– 5 ns per memory operation.
– Two memory operations per packet (one write, one read).
– Therefore, a maximum of 51.2 Gb/s.
– In practice, closer to 40 Gb/s.
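The slide's arithmetic can be reproduced directly: 64 bytes per access, 5 ns per access, and two accesses per packet give 51.2 Gb/s:

```python
# Rough upper bound on packet-buffer throughput, as on the slide.
BUS_BYTES = 64        # 64-byte-wide memory bus
ACCESS_NS = 5.0       # 5 ns SRAM access time
OPS_PER_PACKET = 2    # one write on arrival, one read on departure

bits_per_access = BUS_BYTES * 8                        # 512 bits
seconds_per_packet = OPS_PER_PACKET * ACCESS_NS * 1e-9 # 10 ns
max_rate_gbps = bits_per_access / seconds_per_packet / 1e9
print(max_rate_gbps)  # 51.2
```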
Buffer Memory: Is It Going to Get Better?

[Figure: two trend plots over time. Specmarks, memory size, and gate density all grow quickly; memory bandwidth (to the core) grows slowly.]
Optical Physical Layers… are Going to Make Things "Worse"

DWDM:
– More λ's per fiber → more "ports" per switch.
– Number of ports: 16, …, 1000's.

Data rate:
– More b/s per λ → higher capacity.
– Data rates: 2.5 Gb/s, 10 Gb/s, 40 Gb/s, 160 Gb/s, …
Approach #1: Ping-Pong Buffering

[Figure: two buffer memories, each behind a 64-byte-wide bus; consecutive accesses alternate ("ping-pong") between the two memories.]

Memory bandwidth doubled to ~80 Gb/s.
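A minimal sketch of the ping-pong idea, assuming a hypothetical two-bank buffer where writes simply alternate between banks (not the slides' actual hardware design):

```python
# Ping-pong buffering: alternate consecutive writes between two banks,
# so each bank sees only half the access rate.
class PingPongBuffer:
    def __init__(self):
        self.banks = ([], [])
        self.next_bank = 0

    def write(self, cell):
        self.banks[self.next_bank].append(cell)
        self.next_bank ^= 1  # flip to the other bank on every write

buf = PingPongBuffer()
for cell in range(6):
    buf.write(cell)
print(buf.banks[0])  # [0, 2, 4]
print(buf.banks[1])  # [1, 3, 5]
```

With each bank handling every other access, the aggregate bandwidth roughly doubles, matching the slide's ~80 Gb/s figure.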
Approach #2: Multiple Parallel Buffers
(aka Banking, Interleaving)

[Figure: four parallel buffer memories sharing the access load.]
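The same idea generalizes to k banks; the round-robin `stripe` helper below is hypothetical, purely to illustrate interleaving:

```python
# Banking/interleaving: stripe writes round-robin over k parallel
# memories, so each bank needs only 1/k of the aggregate bandwidth.
def stripe(cells, k):
    banks = [[] for _ in range(k)]
    for i, cell in enumerate(cells):
        banks[i % k].append(cell)
    return banks

print(stripe(list(range(8)), 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Round-robin striping only balances perfectly when accesses arrive in order; reads in a different order can collide on a single bank, which is one reason banking alone is not a complete answer.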
Outline
• Quick Background on Packet Switches
• What's the problem? "What if data rates exceed memory bandwidth?"
• The Fork-Join Router
• Parallel Packet Switches
The Fork-Join Router

[Figure: N external ports at rate R feed a bufferless demultiplexing stage, which forks traffic across k internal routers; their outputs are joined back onto the N external ports at rate R.]
The Fork-Join Router

• Advantages:
– 1/k memory bandwidth.
– 1/k lookup/classification rate.
– 1/k routing/classification table size.

• Problems:
– How to demultiplex prior to lookup/classification?
– How does the system perform/behave?
– Can we predict/guarantee performance?
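The demultiplexing problem raised above can be made concrete with a toy fork stage. The least-loaded dispatch policy below is a hypothetical placeholder, not the project's algorithm; choosing this policy well is exactly the open question:

```python
# Toy fork stage: spread arriving packets (given by size in bytes)
# over k slower internal routers. Here we dispatch each packet to the
# currently least-loaded router (an illustrative policy only).
def fork(packets, k):
    load = [0] * k            # bytes queued at each internal router
    assignment = []
    for size in packets:
        target = load.index(min(load))
        load[target] += size
        assignment.append(target)
    return assignment, load

assignment, load = fork([100, 1500, 40, 1500, 600], 3)
print(assignment)  # which router each packet went to
print(load)        # bytes queued per router
```

Note the difficulty the slide alludes to: dispatch happens *before* lookup/classification, so the policy cannot use the packet's destination, and variable packet sizes make perfect balance impossible.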
Outline
• Quick Background on Packet Switches
• What's the problem? "What if data rates exceed memory bandwidth?"
• The Fork-Join Router
• Parallel Packet Switches
A Parallel Packet Switch

[Figure: N external ports at rate R; a demultiplexing stage spreads arriving traffic across k output-queued switches, whose outputs are recombined onto the N external ports at rate R.]
Parallel Packet Switch: Questions

1. Can it be work-conserving?
2. Can it emulate a single big output-queued switch?
3. Can it support delay guarantees, strict priorities, WFQ, …?
4. What happens with multicast?
Parallel Packet Switch: Work Conservation

[Figure: each external link at rate R connects to the k internal switches over links of rate R/k. An arriving packet can use only internal links its input can currently write to (the input link constraint) and only internal links its output can currently read from (the output link constraint); a cell-by-cell example illustrates both constraints.]
Parallel Packet Switch: Work Conservation

[Figure: the same parallel packet switch, but with every internal link sped up from rate R/k to rate S(R/k).]
Precise Emulation of an Output-Queued Switch

[Figure: an N-port output-queued switch compared with an N-port parallel packet switch. Question: can the two behave identically?]
Parallel Packet Switch: Theorems

1. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can be work-conserving for all traffic.

2. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic.
Parallel Packet Switch: Theorems

3. If S > 3k/(k+3) ≈ 3, then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
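The two speedup bounds can be evaluated numerically to see why they are read as roughly 2 and roughly 3 (illustrative calculation only):

```python
# Speedup bounds from the theorems: 2k/(k+2) for work conservation and
# FCFS emulation, 3k/(k+3) for WFQ/strict-priority emulation.
def s_fcfs(k):
    return 2 * k / (k + 2)

def s_qos(k):
    return 3 * k / (k + 3)

for k in (2, 8, 32, 1024):
    print(k, round(s_fcfs(k), 3), round(s_qos(k), 3))
```

Both bounds increase with k but stay strictly below 2 and 3 respectively, so a constant-factor internal speedup suffices no matter how many layers are used.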
An Aside: Unbuffered Clos Circuit Switch
Expansion factor required = 2-1/N
Clos Network

[Figure: a three-stage Clos network with input switches I1…IX, R middle-stage switches, and output switches O1…OX; each input switch has m links to the middle stages, and each output switch has m links from them. A connection from an input switch to an output switch must find a middle-stage switch reachable from both. In the middle-stage occupancy table there are ≤ min(R, m) entries in each row and ≤ min(R, m) entries in each column.]
Define:
UIL(Ii) = used links from input switch Ii to the middle stages.
UOL(Oj) = used links from output switch Oj to the middle stages.

Suppose we wish to add a connection from Ii to Oj. When adding the connection, |UIL(Ii)| <= m-1 and |UOL(Oj)| <= m-1, so in the worst case |UIL(Ii) ∪ UOL(Oj)| = 2m-2 middle stages are already unusable.

Therefore, if R >= 2m-1 there is always a free middle stage.
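The counting argument can be checked exhaustively for small m. This is a sketch: the worst case is modeled as two disjoint, maximal used-link sets, which together block 2m-2 middle stages:

```python
# Counting argument for the unbuffered Clos switch: with at most m-1
# middle stages blocked on the input side and m-1 on the output side,
# R = 2m-1 middle stages always leave one free; R = 2m-2 may not.
def worst_case_free_stage_exists(R, m):
    blocked_in = set(range(m - 1))                   # input side
    blocked_out = set(range(m - 1, 2 * (m - 1)))     # output side, disjoint
    return len(set(range(R)) - (blocked_in | blocked_out)) > 0

for m in range(1, 10):
    assert worst_case_free_stage_exists(2 * m - 1, m)
    if m > 1:
        assert not worst_case_free_stage_exists(2 * m - 2, m)
print("ok")
```

Dividing R = 2m-1 middle stages by m links per first-stage switch gives the expansion factor (2m-1)/m = 2 - 1/m quoted on the slide.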
An Aside: Unbuffered Clos Circuit Switch

• Clos circuit switch: expansion factor required = 2 - 1/N.
• Parallel packet switch: expansion factor required = 2k/(k+2) = 2 - 4/(k+2).
Fork-Join Router Project: What's Next?

• Theory:
– Extending results to distributed algorithms.
– Extending results to multicast.

• Implementation/Prototyping:
– Under discussion...