DATA CENTER FABRIC COOKBOOK
Do It Yourself!
How to prepare something new from well-known ingredients
Emil Gągała
WHAT DOES AN IDEAL FABRIC LOOK LIKE?
Copyright © 2011 Juniper Networks, Inc. www.juniper.net
REQUIREMENTS - THE NETWORK FABRIC
A Network Fabric has the…

Scalability and resilience of a network:
1. Any-to-any flat connectivity, with fairness and fully non-blocking
2. Low latency and jitter
3. No packet drops under congestion

Performance and simplicity of a single switch:
4. Linear cost and power scaling with the number of interfaces
5. Support for virtual Layer 2 and Layer 3 networks and services
6. A modular, distributed implementation that is highly reliable and scalable
7. A single logical device
FABRIC OF SWITCHES OR … DISTRIBUTED SWITCH FABRIC
[Diagram: a traditional three-tier network (Access Layer, Aggregation Layer, Core Layer) collapsed into a single Switch Fabric: one network with flat, any-to-any connectivity]
…BUT WHAT ABOUT SCALING?
[Charts: Scale vs. Latency and Scale vs. Bandwidth, comparing a traditional multi-tier Ethernet design against a fabric connecting servers and NAS]
INGREDIENTS - THE NETWORK FABRIC
1. Control Plane - Routing Engine
2. Switching Plane - Fabric
3. Forwarding Plane - I/O Modules
Single device (N=1)
A single switch does not scale and is a single point of failure
CONTROL PLANE
- Can we have a single Routing Engine that controls all the TOR switches as line cards?
- The answer is NO
  - Key reason: scalability issues, e.g.:
    - On a typical router, the RE distributes the complete forwarding state to all PFEs; this cannot scale to fabric levels, with a few thousand TORs
    - Protocol processing cannot scale with such a large number of interfaces:
      - LACP: O(3000) sessions
      - ARP: O(100K–500K endpoints)
- Solution: use multiple REs in a distributed and virtualized control plane
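To make the scaling argument concrete, a back-of-the-envelope sketch of the state a single central RE would have to replicate. The constants are illustrative, chosen to match the orders of magnitude quoted above:

```python
# Back-of-the-envelope: forwarding state a single central RE would
# have to push. Constants are illustrative, matching the orders of
# magnitude quoted above.
TORS = 3000               # TOR switches acting as line cards
ARP_ENDPOINTS = 500_000   # upper bound on learned endpoints
BYTES_PER_ENTRY = 64      # assumed wire size of one forwarding entry

# Replicating the complete forwarding state to every TOR's PFE:
total_entries = TORS * ARP_ENDPOINTS
total_bytes = total_entries * BYTES_PER_ENTRY

print(f"entries pushed: {total_entries:,}")           # 1,500,000,000
print(f"state volume:   {total_bytes / 1e9:.1f} GB")  # 96.0 GB per full sync
```

Even with generous per-entry compression, pushing billions of entries on every topology change is untenable, which is why the state is partitioned across multiple REs.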
3 BASIC FUNCTIONS ANY SWITCH HAS TO SOLVE
1. System Discovery
   - Who are we?
2. Fabric Discovery
   - How many ways can we send data to each other?
3. Control State Propagation
   - How do we exchange control state between ourselves?
THE 3 INTERNAL PROTOCOLS - EXAMPLES
1. System Discovery
   - IS-IS based
   - Runs on the Control Plane Ethernet network
2. Fabric Topology Discovery
   - IS-IS based
3. Control State Propagation
   - Fabric Control Protocol, BGP based
LOGICAL SYSTEM VIEW AFTER SYSTEM DISCOVERY
One Big LAN
1. System Discovery
   - IS-IS based
   - Runs on the Control Plane Ethernet network
[Diagram: Node 1 … Node 128, Interconnect 1 … Interconnect 4, Fabric Manager, Fabric Control 1 and 2, and the NNG RE (Active) and NNG RE (Backup), all on one control-plane LAN]
WHY CENTRALIZE FABRIC DATA PLANE FORWARDING LOGIC?
- "Centralize what you can, distribute what you must"
- Frees us from the tyranny of equal-cost shortest-path routing
- Can use all possible paths between any two QFNodes within the fabric
- Traffic on each path is proportional to the (inverse) cost of each path
- Faster convergence times
- Less computation-intensive

Example: A and B connected by Link 1 (4 Gbps), Link 2 (2 Gbps), and Link 3 (4 Gbps)

With Spanning Tree:
- Only one link is used
- Complex configuration to make sure a 4 Gbps link is used
- Effective fabric bandwidth: 4 Gbps

With TRILL (FabricPath):
- Both 4 Gbps links are used
- The distributed equal-cost constraint rules out Link 2
- Effective fabric bandwidth: 8 Gbps

With Fabric:
- All links are used
- Spray weights intelligently send the appropriate amount of traffic on each link
- Effective fabric bandwidth: 10 Gbps
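The spray-weight idea above can be sketched as a weight table with slots proportional to link capacity. This is a simplification (the real fabric sprays in hardware, per cell or per flow) and the function names are illustrative:

```python
# Sketch: weighted "spray" load balancing across unequal links, using
# the example capacities above. A link gets one table slot per Gbps,
# so traffic lands on each link in proportion to its capacity.
LINKS = {"link1": 4, "link2": 2, "link3": 4}  # Gbps

def build_spray_table(links):
    """Expand each link into slots proportional to its capacity."""
    table = []
    for name, gbps in links.items():
        table.extend([name] * gbps)
    return table

def pick_link(table, flow_hash):
    """Map a flow hash onto a weighted slot."""
    return table[flow_hash % len(table)]

table = build_spray_table(LINKS)
picks = [pick_link(table, h) for h in range(100)]

# Over many hashes, link2 carries ~20% of traffic, link1/link3 ~40% each,
# so all 4 + 2 + 4 = 10 Gbps are usable.
print(sum(LINKS.values()), "Gbps usable")
print(picks.count("link2") / len(picks))  # 0.2
```

Because the weights are not constrained to equal-cost paths, the 2 Gbps link contributes its fair share instead of being excluded.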
FABRIC CONTROL PROTOCOL - WHY BGP?
- Let's consider the desirable attributes of any such protocol:
  - Should have a built-in scaling model
  - Should have multi-version support, needed for partitions
  - Should be extensible, as it needs to carry both MAC and IP routes
  - Should support overlapping address spaces
  - Should be hardened, as it is the heart of the system after all
- BGP fits the bill perfectly:
  - Route Reflector mechanism
  - Standard, open protocol
  - TLV mechanism
  - Route Distinguisher and Route Target constructs
  - Field proven
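As a concrete illustration of the Route Distinguisher construct: a Type 0 RD (RFC 4364) packs a 2-byte type, a 2-byte AS number, and a 4-byte assigned number into 8 bytes, and prepending it to a route keeps otherwise-overlapping address spaces distinct. A minimal sketch; the ASN and tenant number here are made up:

```python
import struct

# Sketch: encoding a Type 0 Route Distinguisher (RFC 4364), the kind
# of construct a BGP-based fabric control plane can prepend to MAC/IP
# routes so that overlapping address spaces stay distinct.
def encode_rd_type0(asn: int, assigned: int) -> bytes:
    # 2-byte type (0) + 2-byte AS number + 4-byte assigned number = 8 bytes
    return struct.pack("!HHI", 0, asn, assigned)

rd = encode_rd_type0(64512, 1001)  # private ASN, illustrative tenant id
print(rd.hex())  # 0000fc00000003e9
```

The same 8-byte shape (with types 1 and 2 using IP-address or 4-byte-ASN administrators) is what makes the construct cheap to carry in front of every route.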
DIFFERENT DC FABRIC ARCHITECTURES
- Architecture #1: L2 at access; L3, ACLs, buffering in core (feature-rich core)
- Architecture #2: L2, FC, ACLs in access; L3, FC, ACLs, buffering in core (feature-rich core, minimal edge)
- Architecture #3: L2, L3, FC, ACLs in access; no features in core (rich edge, minimal core)
[Diagrams contrast the cost and scale of the three architectures]
KNOWING FABRIC – IN BRIEF
- A LARGE-scale distributed system which acts as a single logical L2/L3 switch
- Physically consists of multiple chassis
- Composed of:
  - An intelligent edge which makes complex forwarding decisions: the TOR/LCC
  - A high-speed but dumb core which transfers packets between the intelligent edges: the Interconnect chassis
CLOS TOPOLOGY
Topology developed in the 1950s for telephone switching equipment
[Diagram: a three-stage Clos network with stages F1, F2, F3; inputs I/p 0–15 and outputs O/p 0–15, grouped four per first- and third-stage switch]
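The path diversity of the topology above can be sketched by enumerating middle-stage choices: every first-stage switch links to every middle-stage switch, and every middle-stage switch to every third-stage switch, so an ingress/egress pair has exactly one path per middle switch. Function names are illustrative; the classic result (Clos, 1953) is that the fabric is rearrangeably non-blocking when the middle-stage count m is at least the input count n per first-stage switch:

```python
# Sketch: path diversity in a 3-stage Clos fabric. With full links
# between adjacent stages, an ingress/egress pair has exactly one
# distinct path per middle-stage switch.
def clos_paths(num_middle: int, ingress="F1", egress="F3"):
    """Enumerate the distinct paths between one ingress/egress pair."""
    return [(ingress, f"F2[{m}]", egress) for m in range(num_middle)]

def rearrangeably_nonblocking(m: int, n: int) -> bool:
    """Clos (1953): holds when middle switches m >= inputs n per
    first-stage switch, as in the 4-input example above."""
    return m >= n

paths = clos_paths(4)
print(len(paths))  # 4 redundant paths per source/destination pair
print(rearrangeably_nonblocking(4, 4))  # True
```

This one-path-per-middle-switch structure is what spraying exploits: distributing cells over all m paths uses the fabric as if circuits were continually rearranged.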
CLOS SWITCH OPERATION
- First stage: sprays cells evenly
- Second stage: provides all-to-all connectivity
- Third stage: provides the non-blocking property
[Diagram: cells flowing through stages F1, F2, F3]
CLOS PROPERTIES
- Multiple redundant paths per source/destination pair
- The topology is rearrangeably non-blocking
  - In a circuit-switched Clos network, a new connection can always be made if existing connections can be moved to different paths
  - Spraying cells achieves the same effect as moving circuits
- Each cell may have a different transit time through the fabric
  - A reorder buffer in the egress PFE puts cells back into sequence before the packet is forwarded
- Multiple links between each stage of the fabric
  - Graceful bandwidth degradation when a component fails
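The egress reorder step can be sketched as a small sequence-number buffer: cells that took different paths (and so arrive out of order) are held until the next expected sequence number shows up, then released in order. The names are illustrative, not an actual PFE API:

```python
import heapq

# Sketch: a minimal egress reorder buffer. Cells are buffered in a
# min-heap keyed by sequence number and released only in sequence,
# so the reassembled packet sees its cells in original order.
class ReorderBuffer:
    def __init__(self):
        self._heap = []      # pending (seq, cell) pairs
        self._next_seq = 0   # next sequence number eligible for release

    def receive(self, seq, cell):
        """Buffer one arriving cell; return any in-order run released."""
        heapq.heappush(self._heap, (seq, cell))
        released = []
        while self._heap and self._heap[0][0] == self._next_seq:
            released.append(heapq.heappop(self._heap)[1])
            self._next_seq += 1
        return released

buf = ReorderBuffer()
print(buf.receive(1, "B"))  # [] -- cell 0 not yet seen, hold
print(buf.receive(0, "A"))  # ['A', 'B'] -- in-order run released
print(buf.receive(2, "C"))  # ['C']
```

A real PFE also needs per-flow contexts and a timeout for lost cells; this sketch only shows the resequencing core.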
FABRIC TOPOLOGY EXAMPLE
[Diagram: 128 TORs (TOR #01 through TOR #127), each with 128 links fanning into interconnect chassis DCF#0 and DCF#1]
3 years in development
1 million man-hours
$100s of millions invested
Over 125 patents pending