© 2009 IBM Corporation
© 2013 IBM Corporation
MEMS Optical Switching in the Datacenter
Silicon Photonics for Next Generation Computing Systems
HiPEAC Computer Systems Week, October 2013
Kostas Katrinis – IBM Research, Ireland
May 06, 2015
Outline
● Part I - Introduction
  ● Scope & Background
  ● Motivation & Challenges
● Part II – Arch & Tech
  ● Hybrid Network Architecture
  ● Data Plane
  ● Control-Plane
● Part III – Evaluation
  ● System Evaluation
● Part IV – Use Cases
  ● Use Cases
  ● Conclusion
Scope
Part I - Background
● Target Markets
  ● (Cloud) Datacenters - Θ(10K) Servers
  ● HPC Clusters (82% in Nov'12 Top-500)
● Target Systems:
  ● Data Network Fabric
DC Traffic Trends
Part I - Background
● 76% of the traffic is intra-datacenter *
● Total DC traffic CAGR 33% to 2015 *
● Traffic percentage exiting the rack is high (up to 90%) **
● ...and we expect it to increase (scale-out workloads)
* Cisco Global Cloud Index: Forecast and Methodology, 2011–2016
** Benson et al., “Network Traffic Characteristics of Data Centers in the Wild”, IMC'10
Design Trade-offs (Performance)
[Figure: cost vs. performance trade-off, ranging from highly oversubscribed to high-bisection fabrics]
Part I - Background
● We need high-capacity between any two points in the DC
● and at various scales (incremental deployment)
● ... we need $$$
Motivating Example
Estimated list prices (**source: ibm.com)
Item | List Price (USD)** | Qty | Total List Price (USD)
BNT G8264 (64-port switch) | 30,000 | 5,120 | 153,600,000
BNT SFP+ SR Transceiver | 665 | 262,144 | 174,325,760
MM Fiber Cable | 28 | 131,072 | 3,670,016
Fabric price: 331M USD ≈ compute price (@5K USD/server)
Part I - Motivation
● Full-bisection fat-tree @ 65k servers
● Building block: 64-port Ethernet switches (à la VL2*)
● Denser switches will not help you (e.g. 288-port Mellanox Vantage)
* Greenberg et al., “VL2: A Scalable and Flexible Data Center Network”, SIGCOMM 2009
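The component counts in the price table can be reproduced from the standard three-tier k-ary fat-tree formulas. The sketch below is an illustration only: it assumes k = 64, counts inter-switch cables only (server-to-TOR cabling is left out, which is what matches the 131,072 figure), and uses the hypothetical helper names `fat_tree_bom` and `fabric_price`. Prices are the estimated list prices from the slide.

```python
# Estimated list prices quoted on the slide (USD).
PRICES = {"switch": 30_000, "transceiver": 665, "cable": 28}
SERVER_PRICE = 5_000  # the slide's "@5K/server" figure


def fat_tree_bom(k: int) -> dict:
    """Component counts for a three-tier fat-tree built from k-port switches."""
    servers = k ** 3 // 4        # k pods with (k/2)^2 servers each
    switches = 5 * k ** 2 // 4   # k^2 edge+aggregation plus (k/2)^2 core switches
    cables = k ** 3 // 2         # edge-aggregation plus aggregation-core links
    transceivers = 2 * cables    # one optical transceiver per cable end
    return {
        "servers": servers,
        "switches": switches,
        "cables": cables,
        "transceivers": transceivers,
    }


def fabric_price(k: int) -> int:
    """Total list price (USD) of switches, transceivers and cables."""
    bom = fat_tree_bom(k)
    return (
        bom["switches"] * PRICES["switch"]
        + bom["transceivers"] * PRICES["transceiver"]
        + bom["cables"] * PRICES["cable"]
    )


bom = fat_tree_bom(64)
print(bom)
print("fabric  (USD):", fabric_price(64))
print("compute (USD):", bom["servers"] * SERVER_PRICE)
```

With k = 64 the transceiver count comes out as two per inter-switch cable, and the fabric total lands within about 1% of the compute price at 5K USD per server.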
Motivating Example (cont.)
Total price: 331M USD
AND.....
#Cables to route: ~131,000
Can you count the birds in the nest?
Part I - Motivation
Paradigm Shift - Switch Light
Tiltable mirrors implemented via MEMS (Micro-Electro-Mechanical Systems)
+ High-radix (320 ports you can buy, 1024 feasible)
+ No transceivers
+ Decreasing $/port
+ 50x less Watts/port vs. electronics
+ Can switch up to ~1Tbps
+ Protocol Agnostic
Metric | Electronic Switch (Ethernet) | Optical MEMS switch | Advantage
Price/Port (USD) | 1100 (includes TxRx cost) | 350 | x3
Bandwidth/Port | 10Gbps | “Rate-free” | x∞
Power/Port (W) | 10 | 0.2 | x50
Requires TxRxs | Yes | No |
Part I - Motivation
MEMS Switch in the DC
Part I - Challenges
● Repurposing is not free:
  ● 10-200ms switching latency vs. sub-μsec Ethernet switch (point-to-point “circuits”)
  ● L2 spanning-tree forwarding bad option for ROI (applies to electronic redundant topologies too!)
  ● Traffic Engineering (becomes dynamic Topology Management?) is important
  ● Collectives?
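One way to reason about the 10-200ms switching latency is as an amortization problem: a circuit only pays off if it is held long enough between reconfigurations. The sketch below is a back-of-the-envelope model, not something taken from the talk; the function names and the 90% efficiency target are assumptions made for illustration.

```python
def circuit_efficiency(hold_ms: float, switch_ms: float) -> float:
    """Fraction of time a circuit carries traffic when each `hold_ms` of
    transmission is preceded by one reconfiguration of `switch_ms`."""
    return hold_ms / (hold_ms + switch_ms)


def min_hold_ms(switch_ms: float, target: float) -> float:
    """Shortest hold time that reaches the `target` efficiency (0 < target < 1).

    Obtained by solving hold / (hold + switch) = target for hold.
    """
    return switch_ms * target / (1.0 - target)


# Both ends of the switching-latency range quoted on the slide.
for switch_ms in (10, 200):
    hold = min_hold_ms(switch_ms, target=0.9)  # 0.9 is an assumed target
    print(f"switch={switch_ms}ms -> hold each circuit for at least {hold:.0f}ms")
```

Under this model the required hold time scales linearly with the switching latency, which is one reason long-lived elephant flows are the natural fit for the optical fabric.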
Related Approaches
Codename | Affiliation | Targets | Working Prototype | Comments
Helios | UCSD/Google | HPC/DC | Yes | First-principles, lacking integration, no edge routing, supporting infrastructure (e.g. monitoring)
c-Through | CMU/Rice/Intel | DC | No (Emulation) | Reconfiguration algorithms, traffic splitting, problems not addressed at scale
OSA (previously Proteus) | Northwestern/UIUC/NEC | DC | Yes (with Wavelength-Division Multiplexing) | Mostly pursuing multiple wavelengths/fiber
Plexxi | Plexxi | DC | Product offering | Not a re-configurable architecture, low-bisection ring between racks
Part II – Architecture
Hybrid Fabric Architecture
Part II – Architecture
High-level Functionality
Part II – Architecture
● Bijective TE:
  ● Mice are routed via the 1G electronic fabric
  ● Elephants are routed via the 10G optical fabric
● Optical Fabric is reconfigurable
  ● Centralized control optimizes topology against traffic pattern and demand volume
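The mice/elephant split can be pictured as a per-flow decision on byte volume. The slides do not say how elephants are detected, so the sketch below makes that explicit as assumptions: the 10 MiB threshold, the fabric labels and the function name `select_fabric` are all hypothetical.

```python
# Assumed cut-off between mice and elephants; not a value from the talk.
ELEPHANT_BYTES = 10 * 1024 * 1024


def select_fabric(flow_bytes: int, optical_path_up: bool) -> str:
    """Pick the fabric for one flow.

    Elephants go over the 10G optical fabric when a circuit path exists;
    everything else stays on the 1G electronic fabric.
    """
    if flow_bytes >= ELEPHANT_BYTES and optical_path_up:
        return "optical-10G"
    return "electronic-1G"


# Hypothetical per-flow byte counters, keyed by (source rack, destination rack).
flows = {("rack1", "rack3"): 512, ("rack2", "rack4"): 64 * 1024 * 1024}
for flow, nbytes in flows.items():
    print(flow, "->", select_fabric(nbytes, optical_path_up=True))
```

Falling back to the electronic fabric when no circuit is up keeps elephants connected while the optical topology is being reconfigured.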
Multi-hop & Multi-path Data Plane
Part II – Data Plane
● Our simulation work showed that multi-hop reduces overhead of slow switching latency
  ● Relaxes the impact of slowly movable p2p circuits
  ● Larger topology space (not just bipartite graphs)
● Multi-path as throughput booster (utilization)
Multi-hop: Rack-2 reaches Rack-4 via Rack-3 TOR switch
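The Rack-2 to Rack-4 example can be expressed as a shortest-path search over the racks that currently share an optical circuit. This is a minimal sketch using breadth-first search; the circuit list and the function name `optical_path` are illustrative assumptions, and the talk does not state which path-selection method the controller actually uses.

```python
from collections import deque


def optical_path(circuits, src, dst):
    """Return the rack sequence with the fewest hops from src to dst over the
    current optical circuits, or None if dst is not reachable optically."""
    adjacency = {}
    for a, b in circuits:  # circuits are treated as bidirectional
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    parent = {src: None}
    queue = deque([src])
    while queue:
        rack = queue.popleft()
        if rack == dst:
            path = []
            while rack is not None:  # walk back to the source
                path.append(rack)
                rack = parent[rack]
            return path[::-1]
        for neighbour in sorted(adjacency.get(rack, ())):
            if neighbour not in parent:
                parent[neighbour] = rack
                queue.append(neighbour)
    return None


# Assumed circuit set: no direct Rack-2/Rack-4 circuit exists.
circuits = [(1, 2), (2, 3), (3, 4)]
print(optical_path(circuits, 2, 4))
```

Without multi-hop, the same demand would have to wait for a circuit to be moved, which costs one full switching latency.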
VLAN-based Forwarding
● Routing over 802.1Q VLAN overlays
● TOR ports along a multi-hop path are assigned the same VLAN-ID
● Paths “touching” common TOR switch(es) use distinct VLAN-IDs
● Dynamic VLAN-ID assignment/revoking via central controller
Part II – Data Plane
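The two VLAN-ID rules (one ID per path, distinct IDs for paths that touch a common TOR switch) amount to a graph-colouring constraint. The sketch below uses a greedy first-fit assignment as an illustration; it is an assumption about how a controller could do it, not the controller described in the talk, and `assign_vlans` is a hypothetical name.

```python
MAX_VLAN_ID = 4094  # usable VLAN-ID space


def assign_vlans(paths):
    """Give each path (a list of TOR switch ids) the lowest VLAN-ID that is
    not already used by another path touching one of the same TOR switches."""
    used_on_tor = {}  # TOR id -> VLAN-IDs already configured on that switch
    assignment = []
    for path in paths:
        taken = set()
        for tor in path:
            taken |= used_on_tor.get(tor, set())
        vlan = next((v for v in range(1, MAX_VLAN_ID + 1) if v not in taken), None)
        if vlan is None:
            raise RuntimeError("VLAN-ID space exhausted")
        for tor in path:
            used_on_tor.setdefault(tor, set()).add(vlan)
        assignment.append(vlan)
    return assignment


# Paths 0 and 1 share TOR 2, so they must differ; path 2 is disjoint.
print(assign_vlans([[2, 3, 4], [1, 2], [5, 6]]))
```

Reusing IDs across disjoint paths stretches the limited ID space, but the assignment still runs out once many paths cross the same switch.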
Server-based Path Selection
Part II – Data Plane
● OVS (Open vSwitch) based
● Mice flows go via eth0 by default; elephant flow specs are pushed to OVS by the controller
VLAN forwarding - A bird's-eye view
● Not clean: re-purposing a feature to cancel another feature (spanning-trees)
● Not infinitely scalable (4094 IDs)
● Server-side support is outside the datacenter provider's/networking vendor's control in some models (e.g. IaaS)
  ● Tenant is the master of the server
● VLAN tagging is slow (coming up...)
Part II – Data Plane
VLANs vs. OpenFlow Performance
Part II – Data Plane
● All measurements taken on an IBM G8264 (firmware 7.6.1)
● When switching 32 ports, OpenFlow (OF) is 2x faster
● VLAN tagging latency has a constant (“DC”) component of 700ms
● OF support is work-in-progress
[Figure: panels “802.1p” (VLAN tagging) and “Openflow”]
Controller Loop
Part II – Control Plane
Dynamic Topology Management
● Input:
  ● Traffic Matrix (bytes)
  ● Optical physical topology
  ● Circuit state (used/not-used)
● Output:
  ● Optical Topology (optical cross-connections)
  ● Mapping of multi-hop paths to circuits
● Goal:
  ● Maximize optical throughput (volume of TM routed optically)
Part II – Control Plane
Dynamic Topology Mgmt Algorithms
● Showed that the problem is NP-complete (reduction to circular arrangement problem)
● Heuristic approaches:
  ● High-Demand First (HDF): cluster demand based on proximity and fit as much demand as possible to optical fabric available capacity
  ● Simulated-Annealing (SA): couple HDF loops with SA optimization
● ILP modelling to gauge optimality at smaller scale
Part II – Control Plane
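To give a feel for the "highest demand first" idea, the sketch below greedily assigns optical circuits to the rack pairs with the largest demand until optical ports run out. It is a deliberately simplified stand-in and not the HDF algorithm from the cited papers: it sets up single-hop circuits only, ignores proximity clustering and per-circuit capacity, and the names `greedy_circuits` and `ports_per_rack` are assumptions.

```python
def greedy_circuits(traffic, ports_per_rack):
    """Pick circuits for the heaviest rack pairs first.

    traffic: dict mapping (src_rack, dst_rack) -> bytes
    ports_per_rack: optical ports available on every rack (assumed uniform)
    Returns (list of circuits, bytes served by those circuits).
    """
    # Circuits are bidirectional, so merge the two directions of each pair.
    demand = {}
    for (src, dst), nbytes in traffic.items():
        if src != dst:
            pair = (min(src, dst), max(src, dst))
            demand[pair] = demand.get(pair, 0) + nbytes

    free = {}  # rack -> optical ports still unused
    circuits, routed = [], 0
    ranked = sorted(demand.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), nbytes in ranked:
        if free.setdefault(a, ports_per_rack) > 0 and free.setdefault(b, ports_per_rack) > 0:
            free[a] -= 1
            free[b] -= 1
            circuits.append((a, b))
            routed += nbytes
    return circuits, routed


# Hypothetical traffic matrix in bytes.
tm = {(1, 2): 100, (2, 1): 50, (2, 3): 120, (3, 4): 80, (1, 4): 10}
print(greedy_circuits(tm, ports_per_rack=1))
```

A simulated-annealing stage could start from such a greedy solution and try swapping circuits, accepting occasional worse moves to escape local optima.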
Topology Mgmt Algos Evaluation
● Hop-bytes as throughput measure here (lower is better)
● SA-100 best in throughput vs. performance trade-off
Part II – Control Plane
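Hop-bytes weights every demand by the number of hops its path takes, so a topology that places heavy talkers close together scores lower. The sketch below computes it over a set of optical circuits using breadth-first hop counts; the function names and the `fallback_hops` parameter (a hop count charged to pairs with no optical path) are assumptions made for illustration.

```python
from collections import deque


def hop_counts(circuits, src):
    """Hop distance from src to every rack reachable over the circuits."""
    adjacency = {}
    for a, b in circuits:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        rack = queue.popleft()
        for neighbour in adjacency.get(rack, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[rack] + 1
                queue.append(neighbour)
    return dist


def hop_bytes(traffic, circuits, fallback_hops=None):
    """Sum of bytes * hops over all demands (lower is better)."""
    total = 0
    for (src, dst), nbytes in traffic.items():
        hops = hop_counts(circuits, src).get(dst, fallback_hops)
        if hops is None:
            raise ValueError(f"no optical path for {src}->{dst}")
        total += nbytes * hops
    return total


tm = {(2, 4): 100, (1, 2): 50}  # hypothetical demands in bytes
print(hop_bytes(tm, [(1, 2), (2, 3), (3, 4)]))  # chain topology
print(hop_bytes(tm, [(1, 2), (2, 4), (3, 4)]))  # direct 2-4 circuit
```

Comparing the two calls shows how moving a single circuit next to the heaviest demand lowers the metric.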
Cost Competitiveness
● Comparison vs. fat-tree at various over-subscription levels (parameter β)
● Hybrid is 30% cheaper at full-bisection
● Competitiveness diminishes but hybrid is a winner throughout
Part III – Cost Eval.
Proof-of-Concept Prototype
Part III – Perf. Eval.
Evaluation Scenarios
● 4 racks, 40 servers (10 servers/rack)
● Equi-cost comparisons vs. fat-tree
  ● For a given hybrid network setup (parameter β), evaluate application performance against electronic fat-tree
● HPC Workload Input
  ● NAS Parallel Benchmarks
  ● FFTW
Part III – Perf. Eval.
Evaluation Results Set-1
● Comparison vs. 1:25 fat-tree
  ● 25% improvement for most workloads
  ● At least as good for 2 cases
Part III – Perf. Eval.
Evaluation Results Set-2
● Comparison vs. 1:4 fat-tree
  ● Up to 35% improvement
  ● At least as good for 2 cases
Part III – Perf. Eval.
Further “Killer” Use-cases
Part IV – Use-cases
● HPC workloads are challenging (collectives, dynamic)
● We are working on integrating and evaluating:
  ● Data-intensive (Big Data) frameworks (Hadoop)
  ● Massive VM migration
  ● Checkpointing
● ...on-going
Conclusions
● Hybrid optical/electrical networks are cost-competitive
● Results show that performance is not degraded (to say the least)
● Edge engineering burden is not necessarily less than routing/flow scheduling in electronic fat-tree
● Main Challenges Ahead:
  ● SDN edge
  ● Bring Traffic Engineering/Topology Management closer to the application
  ● Optical performance in multi-stage optical setups
  ● More use-cases to increase confidence/persuasion
Part IV – Conclusions
Results Publication
● Diego Lugones, Konstantinos Christodoulopoulos, Kostas Katrinis, Marco Ruffini, Donal O'Mahony, and Martin Collier, "Accelerating communication-intensive parallel workloads using commodity optical switches and a software-configurable control stack", in Proceedings of the 2013 International European Conference on Parallel and Distributed Computing (Euro-Par 2013), Aachen, Germany, August 2013
● Kostas Katrinis, Guohui Wang and Laurent Schares, "SDN control for hybrid OCS/electrical datacenter networks: an enabler or just a convenience?", in Proceedings of the 2013 IEEE Summer Topicals, IEEE Photonics Society, Hawaii, USA, July 2013
● Konstantinos Christodoulopoulos, Kostas Katrinis, Marco Ruffini and Donal O'Mahony, "Tailoring the Network to the Problem: Topology Configuration in Hybrid EPS/OCS Interconnects", in Concurrency and Computation: Practice and Experience, Wiley Interscience, invited article (in press)
● Diego Lugones, Kostas Katrinis, Martin Collier and Georgios Theodoropoulos, "Parallel Simulation Models for the Evaluation of Future Large-Scale Datacenter Networks", in Proceedings of the 16th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, Dublin, Ireland, October 2012
● Konstantinos Christodoulopoulos, Marco Ruffini, Donal O'Mahony and Kostas Katrinis, "Topology Configuration in Hybrid EPS/OCS Interconnects", in Proceedings of the 2012 International European Conference on Parallel and Distributed Computing (Euro-Par 2012), Rhodes Island, Greece, August 2012 (Distinguished Paper Award)
● Diego Lugones, Kostas Katrinis and Martin Collier, "A Reconfigurable Optical/Electrical Interconnect Architecture for Large-scale Clusters and Datacenters", in Proceedings of the ACM International Conference on Computing Frontiers (CF '12), Cagliari, Italy, May 2012 (Best Paper Award)
Credit
Dr. Diego Lugones (co-worker)
Dr. Martin Collier (co-author)
Dr. K Christodoulopoulos (co-worker)
Dr. Marco Ruffini (co-author)
Prof. Dr. Donal O'Mahony (co-author)
Trinity College Dublin
Dublin City University
THANK YOU!
Q&A