Page 1: Adaptive System Fabric

Adaptive System Fabric
An adaptive fabric-based datacenter architecture for Exascale systems

G S Madhusudan
Principal Research Scientist
Department of Computer Science and Engineering
IIT-Madras, Chennai, India

Page 2: Adaptive System Fabric


Introduction

● IIT-Madras is defining an adaptive fabric-based architecture to support exascale data centers

● A hybrid memory + compute/I/O fabric is proposed to unify memory, compute, storage and networking

● The CPU memory and I/O architecture is also being redefined so that CPUs can have a single physical interface for any type of connectivity
– Application-specific protocols will dynamically configure an interconnect link for a specific purpose
– The fabric's topology can be changed dynamically to suit the workload in question

● The intelligence will reside in the fabric router
– The CPU itself can dynamically reconfigure ports for different usage scenarios (sketched below)

● A microkernel-based OS architecture is also being proposed to take advantage of the HW fabric: computation can dynamically move to where the data is located
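
As a rough illustration of the single-physical-interface idea, the sketch below shows what a per-port protocol-selection call might look like. All names (asf_port_cfg_t, asf_port_configure, ASF_PROTO_HMC, etc.) are invented for illustration; no such driver API is defined by the proposal.

```c
/* Hypothetical sketch of per-port protocol selection on an ASF-style CPU.
 * All names are illustrative assumptions, not a defined interface.          */
#include <stdint.h>
#include <stdio.h>

typedef enum { ASF_PROTO_HMC, ASF_PROTO_SRIO } asf_proto_t;

typedef struct {
    uint8_t     port_id;        /* physical SERDES port on the CPU          */
    uint8_t     lanes;          /* 4, 8 or 16 lanes                         */
    uint8_t     gbps_per_lane;  /* 10 or 25                                 */
    asf_proto_t proto;          /* protocol personality loaded on the port  */
} asf_port_cfg_t;

/* In a real system this would program the SERDES and the fabric router;
 * here it only validates and reports the requested configuration.          */
static int asf_port_configure(const asf_port_cfg_t *cfg)
{
    if (cfg->lanes != 4 && cfg->lanes != 8 && cfg->lanes != 16)
        return -1;
    if (cfg->gbps_per_lane != 10 && cfg->gbps_per_lane != 25)
        return -1;
    printf("port %u: %u x %u G configured as %s\n",
           (unsigned)cfg->port_id, (unsigned)cfg->lanes,
           (unsigned)cfg->gbps_per_lane,
           cfg->proto == ASF_PROTO_HMC ? "HMC" : "SRIO");
    return 0;
}

int main(void)
{
    asf_port_cfg_t mem = { .port_id = 0, .lanes = 16, .gbps_per_lane = 25,
                           .proto = ASF_PROTO_HMC };
    asf_port_cfg_t io  = { .port_id = 1, .lanes = 8,  .gbps_per_lane = 25,
                           .proto = ASF_PROTO_SRIO };
    int rc = asf_port_configure(&mem);
    rc |= asf_port_configure(&io);
    return rc ? 1 : 0;
}
```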

Page 3: Adaptive System Fabric


Rationale – Convergence Trends

● Convergence of networking, storage and computing requires a new unified interconnect standard based on an easily routable network fabric with low-latency switching
– It should allow unlimited scaling and multi-terabit interconnect speeds
– The fabric should also allow unlimited redundant paths for 100% fabric availability

● Memory interfaces have started following the same trend, with standards like the Hybrid Memory Cube encapsulating the memory interface inside the memory complex and exposing a high-level SERDES interface

● Since all major I/O standards also use SERDES-based links, the physical fabric connections can be based on a unified SERDES/serial link standard (electrical or optical)
– A combination of direct links and routed links can accommodate the latency-sensitive and latency-tolerant portions of the compute stack

● With the slowing down of Moore's law, massive core counts are the only way to increase computing power
– In multi-core architectures too, specialization is the key to increasing efficiency, leading to heterogeneous architectures with disaggregated CPU complexes
– This mandates a distributed SW architecture, which in turn necessitates adaptive fabrics

Page 4: Adaptive System Fabric


Normative Architecture

[Diagram: CPU nodes and I/O nodes interconnected through a RapidIO fabric; HMC banks interconnected through an HMC fabric]

Page 5: Adaptive System Fabric


Fabric Components

● Phase 1
– Memory interconnect fabric (electrical)
– CPU + I/O interconnect fabric (electrical/optical)

● Phase 2 (if it makes sense)
– Combined fabric (most likely optical)

● The challenge in combining the fabrics is the switching element

Page 6: Adaptive System Fabric


Example Configuration

[Diagram: four SRIO switches forming the fabric, connecting three CPUs, four HMC devices, two storage processors and a NIC bank]

Page 7: Adaptive System Fabric


Memory Fabric

Page 8: Adaptive System Fabric


Hybrid memory cube

Page 9: Adaptive System Fabric


Networked DRAM

Page 10: Adaptive System Fabric


Extending HMC logic layer

Page 11: Adaptive System Fabric


HMC Connectivity

● HMC fabrics can be configured in any topology; a mesh topology is being explored here

● Average latency increases, but it is constant regardless of an individual HMC's depth in the mesh

● Latency-sensitive uses such as main memory can connect directly to the required HMC bank

● Shared memory between CPUs can tolerate greater latencies, so those memory requests can be routed through adjacent HMC nodes (a path-selection sketch follows below)
– Even this worst-case scenario is significantly more efficient than shipping data through the RapidIO fabric
– Ideally, when CPUs share an HMC complex, the SRIO fabric should be used for the control plane and the HMC fabric for the data plane
– For distant nodes, the SRIO fabric will serve as both the control and data plane
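
The fragment below is a minimal sketch of the path-selection policy implied by the bullets above: direct HMC attach for latency-sensitive traffic, the HMC mesh for nearby sharers, and SRIO for distant nodes. The threshold and the function name are assumptions made for illustration.

```c
/* Sketch of the data-path selection policy described above.
 * The hop threshold and all names are illustrative assumptions.             */
typedef enum { PATH_HMC_DIRECT, PATH_HMC_MESH, PATH_SRIO } asf_path_t;

/* hops: number of HMC mesh hops between the requesting CPU and the target
 * bank; 0 means the CPU is directly attached to that HMC.                   */
static asf_path_t select_data_path(int hops, int max_mesh_hops)
{
    if (hops == 0)
        return PATH_HMC_DIRECT;   /* latency-sensitive main memory           */
    if (hops <= max_mesh_hops)
        return PATH_HMC_MESH;     /* shared memory routed via adjacent HMCs  */
    return PATH_SRIO;             /* distant nodes: SRIO carries both the
                                     control and the data plane              */
}
```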

Page 12: Adaptive System Fabric


Compute + I/O Fabric

Page 13: Adaptive System Fabric


Serial RapidIO 3.0

● The upcoming Serial RapidIO standard is proposed as the CPU + I/O interconnect

● The 10/25G SERDES version of the standard is proposed. An ASF port will consist of 4, 8 or 16 lanes of 10/25 G each, transported over electrical or optical links.

● Extensions to the SRIO standard, if necessary:
– 10 lanes of 10G or 4 lanes of 25G muxed onto the 802.3bm standard's 100G optical link
– Extending the maximum SRIO packet size to 4 KB (TBD, based on the performance of 256-byte packets)

Page 14: Adaptive System Fabric


Fabric Topology

● No specific topology is envisaged: ASF is a switched fabric, and both SRIO and HMC allow flexible routing

● Node IDs and node-count limitations will be determined by RapidIO

● Performance and redundancy requirements will dictate the topology for specific deployments

● Switching latency for SRIO is expected to be 100 ns or lower per hop, with end-to-end application latency on the order of 1-2 µs (a rough budget is sketched below)
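
A back-of-the-envelope check of that latency claim follows. The 100 ns per-hop figure comes from the slide; the 500 ns per-endpoint overhead is an assumption added here purely to make the arithmetic concrete.

```c
/* Rough end-to-end latency budget for a switched SRIO path, using the
 * per-hop figure quoted above and an assumed endpoint overhead.             */
#include <stdio.h>

int main(void)
{
    const double hop_ns      = 100.0;   /* per-switch latency (upper bound)  */
    const double endpoint_ns = 500.0;   /* assumed controller/stack cost per end */

    for (int hops = 1; hops <= 10; hops++) {
        double total_us = (hops * hop_ns + 2 * endpoint_ns) / 1000.0;
        printf("%2d hops -> ~%.1f us end to end\n", hops, total_us);
    }
    return 0;
}
```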

Page 15: Adaptive System Fabric


Compute Nodes

Page 16: Adaptive System Fabric


Adaptivity enabled by Memory fabric

● The memory fabric allows the following:
– A self-healing memory architecture with multiple redundant memory banks, all with uniform latency
– Dynamic allocation of memory to different nodes (a minimal allocator sketch follows below)

● Distributed shared memory with uniform latency allows computation to be shifted to a node depending on its capability or the availability of resources

● This is typically feasible only with CPUs on the same card or adjacent cards, but it is still far better than DDR controllers embedded in a CPU die
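
To make "dynamic allocation of memory to different nodes" concrete, here is a minimal sketch of a fabric-wide region table owned by whatever entity manages the memory fabric. The fixed 1 GiB region size, the table layout and the names are assumptions for illustration only.

```c
/* Minimal sketch of dynamic memory-to-node allocation over a memory fabric.
 * Region size, table layout and names are illustrative assumptions.         */
#include <stdint.h>

#define ASF_REGIONS     64
#define ASF_REGION_SIZE (1ULL << 30)    /* assume 1 GiB fixed-size regions   */

typedef struct {
    uint16_t owner_node;                /* 0 = unassigned                    */
} asf_region_t;

static asf_region_t region_table[ASF_REGIONS];

/* Assign the first free region to a node; returns its fabric base address,
 * or UINT64_MAX if the fabric memory is exhausted.                          */
static uint64_t asf_alloc_region(uint16_t node)
{
    for (int i = 0; i < ASF_REGIONS; i++) {
        if (region_table[i].owner_node == 0) {
            region_table[i].owner_node = node;
            return (uint64_t)i * ASF_REGION_SIZE;
        }
    }
    return UINT64_MAX;
}
```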

Page 17: Adaptive System Fabric


Compute enabled I/O nodes

● I/O processors (NICs, storage processors) are increasingly intelligent, and an adaptive architecture requires the ability to run portions of the OS on the I/O processor
– TCP/IP offload
– File system buffer manager
– First level of key-value search in Hadoop
– This necessarily has to be dynamic, since the I/O processor cannot be pre-programmed with all possible usage scenarios
– Typically, most off-loaded code tends to be user defined (an offload sketch follows below)

● Example scenario
– An RDBMS may want custom atomic operations performed on the SSD instead of on the CPU
– Since the memory fabric allows memory sharing, only the computation has to shift to the I/O processor
– Even if the memory complex is not directly accessible, the DSM feature of the SRIO fabric still allows location transparency of the memory, albeit at the cost of latency
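
The sketch below illustrates one way dynamic offload could look: the host registers a small handler that the NIC or storage processor runs against data it already sees through the shared memory fabric. Every type and function here is hypothetical; the example handler is the "first-level key-value filter" case mentioned above.

```c
/* Hedged sketch of dynamic off-load to an I/O processor. The host registers
 * a handler; the I/O processor runs it over buffers reachable through the
 * memory fabric (or DSM over SRIO), so no bulk copy is needed.
 * All names are hypothetical.                                               */
#include <stddef.h>
#include <stdint.h>

/* Handler executed on the I/O processor. */
typedef int (*asf_offload_fn)(const void *buf, size_t len, void *arg);

struct asf_offload {
    uint16_t       ioproc_id;   /* target NIC/storage processor node         */
    asf_offload_fn fn;          /* user-defined code to run there            */
    void          *arg;         /* handler parameters, e.g. a key to match   */
};

/* Example handler: a trivial first-level key filter, run next to the SSD;
 * returns the offset of the first match or -1.                              */
static int kv_first_level_filter(const void *buf, size_t len, void *arg)
{
    const uint8_t key = *(const uint8_t *)arg;
    const uint8_t *p  = buf;
    for (size_t i = 0; i < len; i++)
        if (p[i] == key)
            return (int)i;
    return -1;
}
```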

Page 18: Adaptive System Fabric


Microkernel OS

● While nothing in the architecture prevents an OS like Linux from being used, most of the advantages accrue only when a microkernel is used

● A microkernel disaggregates compute into a set of collaborating processes communicating through IPC, so:
– Location transparency of functional components is assured
– This in turn allows components to migrate to the most suitable nodes; effectively, a virtual topology is layered on top of the physical topology
– A nice bonus is redundancy of functionality, since failover is automatically achieved through location transparency
– The memory fabric allows IPC to be achieved with zero copies (sketched below)

● The size and complexity of the OS can be changed dynamically depending on the capability of the node
– Nodes can be added or deleted at will
– Nodes can be heterogeneous
– Trusted nodes at the most secure level of the system TCB can be kept small (less than 15k LOC) to allow formal verification of the OS
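
One way to picture zero-copy IPC over the memory fabric is a message descriptor that carries only a fabric address and length; the payload itself never moves because both ends see the same fabric address space. The layout below is an assumption, not a defined microkernel ABI.

```c
/* Sketch of a zero-copy IPC descriptor over the memory fabric.
 * Field layout and names are assumptions for illustration.                  */
#include <stdint.h>

typedef struct {
    uint64_t fabric_addr;   /* payload location in the shared HMC fabric     */
    uint32_t length;        /* payload size in bytes                         */
    uint16_t sender;        /* sending node/process identifier               */
    uint16_t tag;           /* service-specific message type                 */
} asf_ipc_msg_t;

/* Send = enqueue this 16-byte descriptor on the receiver's ring; the
 * receiver then reads the payload in place through the memory fabric.       */
```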

Page 19: Adaptive System Fabric


Shakti S Class Processor

● An experimental version of the server variant of IIT-M's Shakti processor family will support ASF natively

● All standard ports will be based on 25 Gbps SERDES
– Each port will consist of a maximum of 16 SERDES lanes; smaller port sizes can be used, subject to the availability of controller blocks
– Each port can be configured as an HMC or SRIO port
– One port will be dedicated as an HMC port and one as an SRIO port to ensure minimal memory and I/O connectivity

● Note the absence of PCIe or Ethernet controllers on-chip
– These are expected to be provided on separate node cards

Page 20: Adaptive System Fabric


Shakti-S Architecture

[Diagram: core cluster of 2-16 CPUs including L2 cache banks, plus an L3 bank, connected through a crossbar/NoC to SRIO (1-4) and HMC (1-4) controllers, each side feeding a SERDES bank labelled 256 lanes]

Page 21: Adaptive System Fabric


SRIO based CPU-CPU interconnect

● Where a lack of memory-fabric connectivity prevents nodes from sharing memory, ASF provides two fall-back options
– Application-driven DSM, provided as standard by the RapidIO fabric
– An ASF extension to RapidIO's DSM mechanism to allow transparent cache-coherent links between nodes
● For low node counts, a MOESIF-like protocol is proposed
● For higher node counts, a directory-based scheme is planned (an illustrative directory entry is sketched below)

● IIT-M's SHAKTI processors will support this in their S profile
– The SRIO controller will be linked directly to the cache-coherency port of the CPU so that no SW stack is needed in the data path
– An extension to the ISA is also being examined to allow message passing without SW stack overhead
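
For the directory-based scheme, a per-line directory entry might look roughly like the structure below. The field widths (a 64-node sharer bitmap, a 16-bit owner) are assumptions for illustration; the actual ASF/SHAKTI format is not defined by the slides.

```c
/* Illustrative directory entry for the higher-node-count coherence scheme.
 * Field widths and names are assumptions, not a defined format.             */
#include <stdint.h>

enum line_state { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_MODIFIED };

typedef struct {
    uint64_t sharers;   /* one bit per node (up to 64 nodes) holding the line */
    uint16_t owner;     /* node currently allowed to write the line           */
    uint8_t  state;     /* enum line_state                                    */
} dir_entry_t;

/* On a remote miss, the home node consults the entry and issues invalidations
 * or forwards directly from the coherence port over SRIO, with no software
 * in the data path.                                                          */
```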

Page 22: Adaptive System Fabric


Storage Nodes (see the Lightstor proposal for more details)

Page 23: Adaptive System Fabric


Lightstor Standard

● The storage nodes will use the proposed Lightstor standard, a storage fabric standard scalable to hundreds of terabits/sec of aggregate bandwidth

● The proposed standard's key features are
– Clear separation of the logical storage layers from the physical storage layers
– Virtual channels with flow control and QoS support
– An extensible, declarative Storage Configuration Language (SCL) to specify the fabric's virtual topology, storage behaviour such as failover/RAID/replication, and QoS parameters to define SLAs
– A clear separation of control and data planes at the architecture and fabric level, with appropriate virtual channel support

● The standard will be based on extensions to the T10 and NVMe command sets

Page 24: Adaptive System Fabric


Normative Storage Architecture

[Diagram: appliance processors (file system, object store, RAID, fail-over) and storage processors (new storage API, security) attached to an SRIO fabric via a switch (TBD); each storage processor drives 32/64-channel NAND controllers and their NAND banks; a control processor, which can be dedicated or run on a host or appliance processor, manages the SRIO fabric]

Page 25: Adaptive System Fabric


Control Processor

– The control plane of Lightstor is a collection of Control Processors
● The control-processing function can be hosted on a dedicated machine, on a host, or on an appliance processor; the preferred configuration is a dedicated machine
● In keeping with Lightstor's philosophy of redundancy, multiple Control Processors can be configured with replicated metadata
● Control commands will run on dedicated virtual channels with guaranteed QoS

– Lightstor will specify a standard set of APIs (T10 + significant extensions); vendors can extend this API

– APIs and behaviour will be specified for
● Redundancy/fail-over – snapshots, replication, copy
● Storage pools/volumes, enclosure, migration, provisioning, service class/QoS
● Security – key management, capabilities (experimental); will leverage the Storage Processor's encryption capabilities
● Control Processor app security – all SCL apps will be digitally signed and distributed to the data-plane components using a public/private key mechanism
– The standard will specify the security characteristics of the OS and the HW components (symmetric keys: 256-bit AES; public keys: 2048-bit RSA or elliptic curve)
– HW support for secure boot and verification of signed applications will be required

Page 26: Adaptive System Fabric


Control Processor Applications

● Control Processor applications are written in the Storage Configuration Language

● Applications will have two components (see the sketch after this list)
– a configuration component that specifies features like topology, failover scheme, RAID levels etc.
– a dynamic component (essentially a collection of agent code) that is invoked by various components of the system based on event triggers

● Data-plane components will have a control-plane agent running in a protected VM to execute this agent code

● Control Processor applications typically run in a decoupled mode
– the data-plane components are initialized with the policies and agent code specified in the SCL application
– the control plane's proxy in the data-plane component is responsible for enforcing these policies and handling exceptions
– provision is made for raising exceptions to a Control Processor, but this is not expected to be the normal behaviour
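
The C rendering below is only a way to visualise the two-part structure of an SCL application (static configuration plus event-driven agent code pushed into the data plane); the SCL itself is declarative and its syntax is not defined here. Event names, types and the example agent are assumptions.

```c
/* Sketch of a Control Processor application decomposed into a configuration
 * part and an event-driven agent, as described above. Names and types are
 * illustrative assumptions, not the SCL.                                     */
#include <stdint.h>

typedef enum { EV_DRIVE_FAIL, EV_QOS_VIOLATION, EV_CAPACITY_LOW } scl_event_t;

typedef int (*scl_agent_fn)(scl_event_t ev, void *ctx);

struct scl_app {
    /* configuration component: topology, failover scheme, RAID level, ...   */
    uint8_t      raid_level;
    uint8_t      replicas;
    /* dynamic component: agent code invoked on event triggers               */
    scl_agent_fn on_event;
};

/* Example agent: handle a drive failure in the data plane, escalate the rest
 * to the Control Processor (the exceptional path).                           */
static int example_agent(scl_event_t ev, void *ctx)
{
    (void)ctx;
    return (ev == EV_DRIVE_FAIL) ? 0 /* rebuild locally */
                                 : 1 /* raise to the Control Processor */;
}
```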

Page 27: Adaptive System Fabric


Storage Data Plane

Page 28: Adaptive System Fabric


Storage Processor – Virtualized Addressing

● The Storage Processor will provide an SSD-optimized API
● Specific optimizations for file systems, caching systems, key-value based systems and database systems
● A virtual storage API with a unified global address space (a possible address split is sketched below)
– A unified address range across multiple storage processors, achieved using distributed metadata (host + SP, or host only; final architecture TBD)
● This gives client applications access to a single-level store and removes the need to deal with various page-mapping issues
● Not involving the host is less optimal than running the FTL completely on the host, but it allows multiple hosts to share a global address space
– This makes shared-disk architectures trivial to implement
– Node-based access control and distributed lock manager support are natural extensions
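
One simple way to realise a unified global address is to reserve the top bits for a storage-processor identifier and the remainder for a local address, with distributed metadata mapping SP ids to machines. The 8/56-bit split below is an assumption for illustration.

```c
/* Sketch of a unified global storage address split into a storage-processor
 * id and a local address. The bit split is an assumption.                    */
#include <stdint.h>

#define SP_ID_BITS 8                        /* up to 256 storage processors   */
#define LOCAL_BITS (64 - SP_ID_BITS)

static inline uint16_t sp_of(uint64_t gaddr)
{
    return (uint16_t)(gaddr >> LOCAL_BITS);
}

static inline uint64_t local_of(uint64_t gaddr)
{
    return gaddr & ((1ULL << LOCAL_BITS) - 1);
}

/* A read then becomes: look up sp_of(gaddr) in the distributed metadata and
 * issue the request for local_of(gaddr) to that storage processor.           */
```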

Page 29: Adaptive System Fabric


Storage API – Distributed FTL

● Distributed FTL (host + storage processor)
– TRIM-like functionality becomes more effective, since decisions like garbage collection can be taken at the right layer of the storage processing stack
– Allows mitigation of bulk erasure latencies
– Application-controlled wear leveling
– Log-structured allocation strategy (sketched below)
– Allows coupling of the caching and FTL strategies for efficient operation
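
A log-structured allocation strategy, at its simplest, appends every write at the log head and turns full blocks into garbage-collection candidates. The sketch below shows only that core idea; the geometry and names are assumptions, not the proposed FTL.

```c
/* Minimal log-structured allocation sketch for the distributed FTL.
 * Block geometry and names are illustrative assumptions.                     */
#include <stdint.h>

#define PAGES_PER_BLOCK 256

struct ftl_log {
    uint32_t active_block;   /* block currently being filled                  */
    uint32_t next_page;      /* next free page within the active block        */
};

/* Returns the physical page for a new write, advancing the log head. */
static uint64_t ftl_append(struct ftl_log *log)
{
    if (log->next_page == PAGES_PER_BLOCK) {   /* block full: seal it and     */
        log->active_block++;                   /* (in a real FTL) queue it    */
        log->next_page = 0;                    /* for garbage collection      */
    }
    return (uint64_t)log->active_block * PAGES_PER_BLOCK + log->next_page++;
}
```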

Page 30: Adaptive System Fabric


Storage API – Native data store support

– Atomic storage API – allows delegation of transactional writes to the storage layer instead of doing them sub-optimally at the application layer (a possible API shape is sketched below)
● Atomic batching of multiple I/Os
● FTL layers optimized for atomic writes
– Initially a log-structured FTL is envisaged, but other schemes can be added

– Key-Value API – allows optimized implementation of Hadoop-like systems that rely on key-value stores
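
The prototype below suggests one possible shape for the atomic-batch call: either every write in the batch becomes durable or none does, which is exactly what a log-structured FTL can commit with a single metadata update. The names and types are assumptions, not a defined command set.

```c
/* Hypothetical shape of the atomic-batch API described above.
 * Names and types are assumptions.                                           */
#include <stddef.h>
#include <stdint.h>

struct asf_write {
    uint64_t    gaddr;    /* target in the unified storage address space      */
    const void *data;
    size_t      len;
};

/* All-or-nothing submission of a batch of writes: returns 0 when the whole
 * batch commits, -1 when it aborts and no write is visible.                  */
int asf_atomic_batch(const struct asf_write *writes, size_t count);
```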

Page 31: Adaptive System Fabric


Storage API - Other

● The proposed API will also support the following functionality
– User-defined commands (the limits and context of UDCs are a major area of concern)
– Performance – write caching, NVRAM/DRAM support, QoS/priority
– Failover – RAID, with a separate RAID scheme for block-level metadata
– Management – SSD configuration, cache, environment/enclosure, redundancy
– Security – encryption (AES), authentication, user-defined encryption

Page 32: Adaptive System Fabric


Storage API – hosting OS code

● The OS storage layer can be partly hosted on the Storage Processor
– A microkernel-like distributed server strategy
● The Hybrid Memory Cube can be used to share DRAM with the host

Page 33: Adaptive System Fabric


Appliance Processor

● The Appliance Processor hosts standard storage appliance functionality and exposes it through a standardized interface
– It is envisaged that storage functionality like RDBMS storage layers, file systems and backup/replication can be delegated to the appliance processor
– Since the API is standardized, users can mix and match appliance processors from multiple vendors

● Standard T10, file (CIFS, NFS) and object-store level commands
● LightStor extensions
– User-defined commands
– Performance – easy clustering semantics for scale-out, mirroring
– Redundancy/fail-over – snapshots, replication, copy
– Management (client side) – cluster/redundancy, storage pools/volumes, enclosure, migration, provisioning, service class/QoS
– Security (client side) – key management, capabilities (experimental); will leverage the Storage Processor's encryption capabilities
– Storage optimization – compression, de-duplication (TBD – partly to be done on the Storage Processor)

Page 34: Adaptive System Fabric


I/O Virtualization

Page 35: Adaptive System Fabric


Virtualization

● ASF provides standardized I/O virtualization
– A virtualized fabric is provided by the enhancements proposed to SRIO (the changes are not anticipated to be major, since message passing is easily virtualizable)

● Existing host-side virtualization can continue to be used, but it can now leverage ASF's virtualization features to reduce CPU load
– In systems like ASF, virtualization probably belongs more in the fabric layer than in the individual nodes

Page 36: Adaptive System Fabric


Storage Virtualization

Page 37: Adaptive System Fabric


Implementation

Page 38: Adaptive System Fabric


Staging

● The first versions of the ASF system will be built using commodity components, to leverage off-the-shelf SRIO parts and to complete reference implementations of the Lightstor standard

● Later variants will first use FPGA versions of the Shakti processor to test the memory fabric, followed by ASIC versions of the processor

Page 39: Adaptive System Fabric


Collaboration

● We are looking for academic and industry partners to take the ASF effort forward

● All of IIT-M's designs – HDL code for the processors, interconnect and SSD controllers; schematics/Gerbers for the reference HW; the microkernel OS and storage system SW – will be released as open source
– Please see www.bitbucket.org/riselab for further details

Page 40: Adaptive System Fabric


Architecture experiments with HMC

- For a CPU with fully virtual caches, it would make sense to shift the MMU to the HMC, so that an HMC block can provide a section of total system memory to the CPUs attached to it. This works well in single-address-space OSs, which can have fully virtual caches.

So in a sense, the HMC boots first, sets up a virtual memory region, and the CPUs then attach to these regions (sketched below).

- Once you have logic inside the DRAM, you can try out all kinds of in-memory functions, such as data prefetch and virus scanning.
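
The boot order suggested above could look roughly like the two-step interface below: the HMC logic layer carves out a region first, and CPUs then attach to it. This is a thought-experiment sketch only; every name and signature here is hypothetical.

```c
/* Sketch of the "HMC boots first, CPUs attach later" experiment.
 * All names and signatures are hypothetical.                                 */
#include <stdint.h>

typedef struct {
    uint64_t base;      /* start of the region in the single address space    */
    uint64_t size;
} hmc_region_t;

/* Step 1 (runs on the HMC logic layer at boot): carve out a virtual region
 * of system memory and publish it on the memory fabric.                      */
hmc_region_t hmc_create_region(uint64_t size);

/* Step 2 (runs on each CPU): attach to the published region; with fully
 * virtual caches the CPU keeps no local translation state for it, since the
 * MMU function lives in the HMC.                                             */
int cpu_attach_region(uint16_t cpu_id, const hmc_region_t *region);
```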