Page 1: Adaptive System Fabric

Adaptive System Fabric
An adaptive fabric-based datacenter architecture for Exascale systems

G S Madhusudan
Principal Research Scientist
Department of Computer Science and Engineering
IIT-Madras, Chennai, India

Page 2: Adaptive System Fabric


Introduction

● IIT-Madras is defining an adaptive fabric-based architecture to support exascale data centers

● A hybrid memory + compute/I/O fabric is proposed to unify memory, compute, storage and networking

● The CPU memory and I/O architecture is also being redefined so that CPUs can have a single physical interface for any type of connectivity
– Application-specific protocols will dynamically configure an interconnect link for a specific purpose
– The fabric's topology can be changed dynamically to suit the workload in question

● The intelligence will reside in the fabric router
– The CPU itself can dynamically reconfigure ports for different usage scenarios (sketched below)

● A microkernel-based OS architecture is also being proposed to take advantage of the HW fabric: computation can dynamically move to where the data is located
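
As a rough illustration of the single-physical-interface idea, the sketch below shows what a per-port protocol-selection call might look like. All names (asf_port_cfg_t, asf_port_configure, ASF_PROTO_HMC, etc.) are invented for illustration; no such driver API is defined by the proposal.

```c
/* Hypothetical sketch of per-port protocol selection on an ASF-style CPU.
 * All names are illustrative assumptions, not a defined interface.          */
#include <stdint.h>
#include <stdio.h>

typedef enum { ASF_PROTO_HMC, ASF_PROTO_SRIO } asf_proto_t;

typedef struct {
    uint8_t     port_id;        /* physical SERDES port on the CPU          */
    uint8_t     lanes;          /* 4, 8 or 16 lanes                         */
    uint8_t     gbps_per_lane;  /* 10 or 25                                 */
    asf_proto_t proto;          /* protocol personality loaded on the port  */
} asf_port_cfg_t;

/* In a real system this would program the SERDES and the fabric router;
 * here it only validates and reports the requested configuration.          */
static int asf_port_configure(const asf_port_cfg_t *cfg)
{
    if (cfg->lanes != 4 && cfg->lanes != 8 && cfg->lanes != 16)
        return -1;
    if (cfg->gbps_per_lane != 10 && cfg->gbps_per_lane != 25)
        return -1;
    printf("port %u: %u x %u G configured as %s\n",
           (unsigned)cfg->port_id, (unsigned)cfg->lanes,
           (unsigned)cfg->gbps_per_lane,
           cfg->proto == ASF_PROTO_HMC ? "HMC" : "SRIO");
    return 0;
}

int main(void)
{
    asf_port_cfg_t mem = { .port_id = 0, .lanes = 16, .gbps_per_lane = 25,
                           .proto = ASF_PROTO_HMC };
    asf_port_cfg_t io  = { .port_id = 1, .lanes = 8,  .gbps_per_lane = 25,
                           .proto = ASF_PROTO_SRIO };
    int rc = asf_port_configure(&mem);
    rc |= asf_port_configure(&io);
    return rc ? 1 : 0;
}
```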

Page 3: Adaptive System Fabric


Rationale – Convergence Trends

● Convergence of networking, storage and computing requires a new unified interconnect standard based on an easily routable network fabric with low-latency switching
– It should allow unlimited scaling and multi-terabit interconnect speeds
– The fabric should also allow unlimited redundant paths for 100% fabric availability

● Memory interfaces have started following the same trend, with standards like the Hybrid Memory Cube encapsulating the memory interface inside the memory complex and exposing a high-level SERDES interface

● Since all major I/O standards also use SERDES-based links, the physical fabric connections can be based on a unified SERDES/serial link standard (electrical or optical)
– A combination of direct links and routed links can accommodate the latency-sensitive and latency-tolerant portions of the compute stack

● With the slowing down of Moore's law, massive core counts are the only way to increase computing power
– In multi-core architectures too, specialization is the key to increasing efficiency, leading to heterogeneous architectures with disaggregated CPU complexes
– This mandates a distributed SW architecture, which in turn necessitates adaptive fabrics

Page 4: Adaptive System Fabric


Normative Architecture

[Diagram: CPU nodes and I/O nodes interconnected through a RapidIO fabric; HMC banks interconnected through an HMC fabric]

Page 5: Adaptive System Fabric


Fabric Components

● Phase 1
– Memory interconnect fabric (electrical)
– CPU + I/O interconnect fabric (electrical/optical)

● Phase 2 (if it makes sense)
– Combined fabric (most likely optical)

● The challenge in combining the fabrics is the switching element

Page 6: Adaptive System Fabric


Example Configuration

[Diagram: four SRIO switches forming the fabric, connecting three CPUs, four HMC devices, two storage processors and a NIC bank]

Page 7: Adaptive System Fabric


Memory Fabric

Page 8: Adaptive System Fabric


Hybrid memory cube

Page 9: Adaptive System Fabric


Networked DRAM

Page 10: Adaptive System Fabric


Extending HMC logic layer

Page 11: Adaptive System Fabric


HMC Connectivity

● HMC fabrics can be configured in any topology; a mesh topology is being explored here

● Average latency increases, but it is constant regardless of an individual HMC's depth in the mesh

● Latency-sensitive uses such as main memory can connect directly to the required HMC bank

● Shared memory between CPUs can tolerate greater latencies, so those memory requests can be routed through adjacent HMC nodes (a path-selection sketch follows below)
– Even this worst-case scenario is significantly more efficient than shipping data through the RapidIO fabric
– Ideally, when CPUs share an HMC complex, the SRIO fabric should be used for the control plane and the HMC fabric for the data plane
– For distant nodes, the SRIO fabric will serve as both the control and data plane
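
The fragment below is a minimal sketch of the path-selection policy implied by the bullets above: direct HMC attach for latency-sensitive traffic, the HMC mesh for nearby sharers, and SRIO for distant nodes. The threshold and the function name are assumptions made for illustration.

```c
/* Sketch of the data-path selection policy described above.
 * The hop threshold and all names are illustrative assumptions.             */
typedef enum { PATH_HMC_DIRECT, PATH_HMC_MESH, PATH_SRIO } asf_path_t;

/* hops: number of HMC mesh hops between the requesting CPU and the target
 * bank; 0 means the CPU is directly attached to that HMC.                   */
static asf_path_t select_data_path(int hops, int max_mesh_hops)
{
    if (hops == 0)
        return PATH_HMC_DIRECT;   /* latency-sensitive main memory           */
    if (hops <= max_mesh_hops)
        return PATH_HMC_MESH;     /* shared memory routed via adjacent HMCs  */
    return PATH_SRIO;             /* distant nodes: SRIO carries both the
                                     control and the data plane              */
}
```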

Page 12: Adaptive System Fabric


Compute + I/O Fabric

Page 13: Adaptive System Fabric


Serial RapidIO 3.0

● The upcoming Serial RapidIO standard is proposed as the CPU + I/O interconnect

● The 10/25G SERDES version of the standard is proposed. An ASF port will consist of 4, 8 or 16 lanes of 10/25 G each, transported over electrical or optical links.

● Extensions to the SRIO standard, if necessary:
– 10 lanes of 10G or 4 lanes of 25G muxed onto the 802.3bm standard's 100G optical link
– Extending the maximum SRIO packet size to 4 KB (TBD, based on the performance of 256-byte packets)

Page 14: Adaptive System Fabric


Fabric Topology

● No specific topology is envisaged: ASF is a switched fabric, and both SRIO and HMC allow flexible routing

● Node IDs and node-count limitations will be determined by RapidIO

● Performance and redundancy requirements will dictate the topology for specific deployments

● Switching latency for SRIO is expected to be 100 ns or lower per hop, with end-to-end application latency on the order of 1-2 µs (a rough budget is sketched below)
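
A back-of-the-envelope check of that latency claim follows. The 100 ns per-hop figure comes from the slide; the 500 ns per-endpoint overhead is an assumption added here purely to make the arithmetic concrete.

```c
/* Rough end-to-end latency budget for a switched SRIO path, using the
 * per-hop figure quoted above and an assumed endpoint overhead.             */
#include <stdio.h>

int main(void)
{
    const double hop_ns      = 100.0;   /* per-switch latency (upper bound)  */
    const double endpoint_ns = 500.0;   /* assumed controller/stack cost per end */

    for (int hops = 1; hops <= 10; hops++) {
        double total_us = (hops * hop_ns + 2 * endpoint_ns) / 1000.0;
        printf("%2d hops -> ~%.1f us end to end\n", hops, total_us);
    }
    return 0;
}
```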

Page 15: Adaptive System Fabric


Compute Nodes

Page 16: Adaptive System Fabric


Adaptivity enabled by Memory fabric

● The memory fabric allows the following:
– A self-healing memory architecture with multiple redundant memory banks, all with uniform latency
– Dynamic allocation of memory to different nodes (a minimal allocator sketch follows below)

● Distributed shared memory with uniform latency allows computation to be shifted to a node depending on its capability or the availability of resources

● This is typically feasible only with CPUs on the same card or adjacent cards, but it is still far better than DDR controllers embedded in a CPU die
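
To make "dynamic allocation of memory to different nodes" concrete, here is a minimal sketch of a fabric-wide region table owned by whatever entity manages the memory fabric. The fixed 1 GiB region size, the table layout and the names are assumptions for illustration only.

```c
/* Minimal sketch of dynamic memory-to-node allocation over a memory fabric.
 * Region size, table layout and names are illustrative assumptions.         */
#include <stdint.h>

#define ASF_REGIONS     64
#define ASF_REGION_SIZE (1ULL << 30)    /* assume 1 GiB fixed-size regions   */

typedef struct {
    uint16_t owner_node;                /* 0 = unassigned                    */
} asf_region_t;

static asf_region_t region_table[ASF_REGIONS];

/* Assign the first free region to a node; returns its fabric base address,
 * or UINT64_MAX if the fabric memory is exhausted.                          */
static uint64_t asf_alloc_region(uint16_t node)
{
    for (int i = 0; i < ASF_REGIONS; i++) {
        if (region_table[i].owner_node == 0) {
            region_table[i].owner_node = node;
            return (uint64_t)i * ASF_REGION_SIZE;
        }
    }
    return UINT64_MAX;
}
```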

Page 17: Adaptive System Fabric


Compute enabled I/O nodes

● I/O processors (NICs, storage processors) are increasingly intelligent, and an adaptive architecture requires the ability to run portions of the OS on the I/O processor
– TCP/IP offload
– File system buffer manager
– First level of key-value search in Hadoop
– This necessarily has to be dynamic, since the I/O processor cannot be pre-programmed with all possible usage scenarios
– Typically, most off-loaded code tends to be user defined (an offload sketch follows below)

● Example scenario
– An RDBMS may want custom atomic operations performed on the SSD instead of on the CPU
– Since the memory fabric allows memory sharing, only the computation has to shift to the I/O processor
– Even if the memory complex is not directly accessible, the DSM feature of the SRIO fabric still allows location transparency of the memory, albeit at the cost of latency
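
The sketch below illustrates one way dynamic offload could look: the host registers a small handler that the NIC or storage processor runs against data it already sees through the shared memory fabric. Every type and function here is hypothetical; the example handler is the "first-level key-value filter" case mentioned above.

```c
/* Hedged sketch of dynamic off-load to an I/O processor. The host registers
 * a handler; the I/O processor runs it over buffers reachable through the
 * memory fabric (or DSM over SRIO), so no bulk copy is needed.
 * All names are hypothetical.                                               */
#include <stddef.h>
#include <stdint.h>

/* Handler executed on the I/O processor. */
typedef int (*asf_offload_fn)(const void *buf, size_t len, void *arg);

struct asf_offload {
    uint16_t       ioproc_id;   /* target NIC/storage processor node         */
    asf_offload_fn fn;          /* user-defined code to run there            */
    void          *arg;         /* handler parameters, e.g. a key to match   */
};

/* Example handler: a trivial first-level key filter, run next to the SSD;
 * returns the offset of the first match or -1.                              */
static int kv_first_level_filter(const void *buf, size_t len, void *arg)
{
    const uint8_t key = *(const uint8_t *)arg;
    const uint8_t *p  = buf;
    for (size_t i = 0; i < len; i++)
        if (p[i] == key)
            return (int)i;
    return -1;
}
```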

Page 18: Adaptive System Fabric


Microkernel OS

● While nothing in the architecture prevents an OS like Linux from being used, most of the advantages accrue only when a microkernel is used

● A microkernel disaggregates compute into a set of collaborating processes communicating through IPC, so:
– Location transparency of functional components is assured
– This in turn allows components to migrate to the most suitable nodes; effectively, a virtual topology is layered on top of the physical topology
– A nice bonus is redundancy of functionality, since failover is automatically achieved through location transparency
– The memory fabric allows IPC to be achieved with zero copies (sketched below)

● The size and complexity of the OS can be changed dynamically depending on the capability of the node
– Nodes can be added or deleted at will
– Nodes can be heterogeneous
– Trusted nodes at the most secure level of the system TCB can be kept small (less than 15k LOC) to allow formal verification of the OS
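
One way to picture zero-copy IPC over the memory fabric is a message descriptor that carries only a fabric address and length; the payload itself never moves because both ends see the same fabric address space. The layout below is an assumption, not a defined microkernel ABI.

```c
/* Sketch of a zero-copy IPC descriptor over the memory fabric.
 * Field layout and names are assumptions for illustration.                  */
#include <stdint.h>

typedef struct {
    uint64_t fabric_addr;   /* payload location in the shared HMC fabric     */
    uint32_t length;        /* payload size in bytes                         */
    uint16_t sender;        /* sending node/process identifier               */
    uint16_t tag;           /* service-specific message type                 */
} asf_ipc_msg_t;

/* Send = enqueue this 16-byte descriptor on the receiver's ring; the
 * receiver then reads the payload in place through the memory fabric.       */
```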

Page 19: Adaptive System Fabric


Shakti S Class Processor

● An experimental version of the server variant of IIT-M's Shakti processor family will support ASF natively

● All standard ports will be based on 25 Gbps SERDES
– Each port will consist of a maximum of 16 SERDES lanes; smaller port sizes can be used, subject to the availability of controller blocks
– Each port can be configured as an HMC or SRIO port
– One port will be dedicated as an HMC port and one as an SRIO port to ensure minimal memory and I/O connectivity

● Note the absence of PCIe or Ethernet controllers on-chip
– These are expected to be provided on separate node cards

Page 20: Adaptive System Fabric


Shakti-S Architecture

[Diagram: core cluster of 2-16 CPUs including L2 cache banks, plus an L3 bank, connected through a crossbar/NoC to SRIO (1-4) and HMC (1-4) controllers, each side feeding a SERDES bank labelled 256 lanes]

Page 21: Adaptive System Fabric


SRIO based CPU-CPU interconnect

● Where a lack of memory-fabric connectivity prevents nodes from sharing memory, ASF provides two fall-back options
– Application-driven DSM, provided as standard by the RapidIO fabric
– An ASF extension to RapidIO's DSM mechanism to allow transparent cache-coherent links between nodes
● For low node counts, a MOESIF-like protocol is proposed
● For higher node counts, a directory-based scheme is planned (an illustrative directory entry is sketched below)

● IIT-M's SHAKTI processors will support this in their S profile
– The SRIO controller will be linked directly to the cache-coherency port of the CPU so that no SW stack is needed in the data path
– An extension to the ISA is also being examined to allow message passing without SW stack overhead
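
For the directory-based scheme, a per-line directory entry might look roughly like the structure below. The field widths (a 64-node sharer bitmap, a 16-bit owner) are assumptions for illustration; the actual ASF/SHAKTI format is not defined by the slides.

```c
/* Illustrative directory entry for the higher-node-count coherence scheme.
 * Field widths and names are assumptions, not a defined format.             */
#include <stdint.h>

enum line_state { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_MODIFIED };

typedef struct {
    uint64_t sharers;   /* one bit per node (up to 64 nodes) holding the line */
    uint16_t owner;     /* node currently allowed to write the line           */
    uint8_t  state;     /* enum line_state                                    */
} dir_entry_t;

/* On a remote miss, the home node consults the entry and issues invalidations
 * or forwards directly from the coherence port over SRIO, with no software
 * in the data path.                                                          */
```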

Page 22: Adaptive System Fabric


Storage Nodes (see the Lightstor proposal for more details)

Page 23: Adaptive System Fabric


Lightstor Standard

● The storage nodes will use the proposed Lightstor standard, a storage fabric standard scalable to hundreds of terabits/sec of aggregate bandwidth

● The proposed standard's key features are
– Clear separation of the logical storage layers from the physical storage layers
– Virtual channels with flow control and QoS support
– An extensible, declarative Storage Configuration Language (SCL) to specify the fabric's virtual topology, storage behaviour such as failover/RAID/replication, and QoS parameters to define SLAs
– A clear separation of control and data planes at the architecture and fabric level, with appropriate virtual channel support

● The standard will be based on extensions to the T10 and NVMe command sets

Page 24: Adaptive System Fabric


Normative Storage Architecture

[Diagram: appliance processors (file system, object store, RAID, fail-over) and storage processors (new storage API, security) attached to an SRIO fabric via a switch (TBD); each storage processor drives 32/64-channel NAND controllers and their NAND banks; a control processor, which can be dedicated or run on a host or appliance processor, manages the SRIO fabric]

Page 25: Adaptive System Fabric


Control Processor

– The control plane of Lightstor is a collection of Control Processors
● The control-processing function can be hosted on a dedicated machine, on a host, or on an appliance processor; the preferred configuration is a dedicated machine
● In keeping with Lightstor's philosophy of redundancy, multiple Control Processors can be configured with replicated metadata
● Control commands will run on dedicated virtual channels with guaranteed QoS

– Lightstor will specify a standard set of APIs (T10 + significant extensions); vendors can extend this API

– APIs and behaviour will be specified for
● Redundancy/fail-over – snapshots, replication, copy
● Storage pools/volumes, enclosure, migration, provisioning, service class/QoS
● Security – key management, capabilities (experimental); will leverage the Storage Processor's encryption capabilities
● Control Processor app security – all SCL apps will be digitally signed and distributed to the data-plane components using a public/private key mechanism
– The standard will specify the security characteristics of the OS and the HW components (symmetric keys: 256-bit AES; public keys: 2048-bit RSA or elliptic curve)
– HW support for secure boot and verification of signed applications will be required

Page 26: Adaptive System Fabric


Control Processor Applications

● Control Processor applications are written in the Storage Configuration Language

● Applications will have two components (see the sketch after this list)
– a configuration component that specifies features like topology, failover scheme, RAID levels etc.
– a dynamic component (essentially a collection of agent code) that is invoked by various components of the system based on event triggers

● Data-plane components will have a control-plane agent running in a protected VM to execute this agent code

● Control Processor applications typically run in a decoupled mode
– the data-plane components are initialized with the policies and agent code specified in the SCL application
– the control plane's proxy in the data-plane component is responsible for enforcing these policies and handling exceptions
– provision is made for raising exceptions to a Control Processor, but this is not expected to be the normal behaviour
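
The C rendering below is only a way to visualise the two-part structure of an SCL application (static configuration plus event-driven agent code pushed into the data plane); the SCL itself is declarative and its syntax is not defined here. Event names, types and the example agent are assumptions.

```c
/* Sketch of a Control Processor application decomposed into a configuration
 * part and an event-driven agent, as described above. Names and types are
 * illustrative assumptions, not the SCL.                                     */
#include <stdint.h>

typedef enum { EV_DRIVE_FAIL, EV_QOS_VIOLATION, EV_CAPACITY_LOW } scl_event_t;

typedef int (*scl_agent_fn)(scl_event_t ev, void *ctx);

struct scl_app {
    /* configuration component: topology, failover scheme, RAID level, ...   */
    uint8_t      raid_level;
    uint8_t      replicas;
    /* dynamic component: agent code invoked on event triggers               */
    scl_agent_fn on_event;
};

/* Example agent: handle a drive failure in the data plane, escalate the rest
 * to the Control Processor (the exceptional path).                           */
static int example_agent(scl_event_t ev, void *ctx)
{
    (void)ctx;
    return (ev == EV_DRIVE_FAIL) ? 0 /* rebuild locally */
                                 : 1 /* raise to the Control Processor */;
}
```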

Page 27: Adaptive System Fabric


Storage Data Plane

Page 28: Adaptive System Fabric


Storage Processor – Virtualized Addressing

● The Storage Processor will provide an SSD-optimized API
● Specific optimizations for file systems, caching systems, key-value based systems and database systems
● A virtual storage API with a unified global address space (a possible address split is sketched below)
– A unified address range across multiple storage processors, achieved using distributed metadata (host + SP, or host only; final architecture TBD)
● This gives client applications access to a single-level store and removes the need to deal with various page-mapping issues
● Not involving the host is less optimal than running the FTL completely on the host, but it allows multiple hosts to share a global address space
– This makes shared-disk architectures trivial to implement
– Node-based access control and distributed lock manager support are natural extensions
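
One simple way to realise a unified global address is to reserve the top bits for a storage-processor identifier and the remainder for a local address, with distributed metadata mapping SP ids to machines. The 8/56-bit split below is an assumption for illustration.

```c
/* Sketch of a unified global storage address split into a storage-processor
 * id and a local address. The bit split is an assumption.                    */
#include <stdint.h>

#define SP_ID_BITS 8                        /* up to 256 storage processors   */
#define LOCAL_BITS (64 - SP_ID_BITS)

static inline uint16_t sp_of(uint64_t gaddr)
{
    return (uint16_t)(gaddr >> LOCAL_BITS);
}

static inline uint64_t local_of(uint64_t gaddr)
{
    return gaddr & ((1ULL << LOCAL_BITS) - 1);
}

/* A read then becomes: look up sp_of(gaddr) in the distributed metadata and
 * issue the request for local_of(gaddr) to that storage processor.           */
```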

Page 29: Adaptive System Fabric


Storage API – Distributed FTL

● Distributed FTL (host + storage processor)
– TRIM-like functionality becomes more effective, since decisions like garbage collection can be taken at the right layer of the storage processing stack
– Allows mitigation of bulk erasure latencies
– Application-controlled wear leveling
– Log-structured allocation strategy (sketched below)
– Allows coupling of the caching and FTL strategies for efficient operation
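
A log-structured allocation strategy, at its simplest, appends every write at the log head and turns full blocks into garbage-collection candidates. The sketch below shows only that core idea; the geometry and names are assumptions, not the proposed FTL.

```c
/* Minimal log-structured allocation sketch for the distributed FTL.
 * Block geometry and names are illustrative assumptions.                     */
#include <stdint.h>

#define PAGES_PER_BLOCK 256

struct ftl_log {
    uint32_t active_block;   /* block currently being filled                  */
    uint32_t next_page;      /* next free page within the active block        */
};

/* Returns the physical page for a new write, advancing the log head. */
static uint64_t ftl_append(struct ftl_log *log)
{
    if (log->next_page == PAGES_PER_BLOCK) {   /* block full: seal it and     */
        log->active_block++;                   /* (in a real FTL) queue it    */
        log->next_page = 0;                    /* for garbage collection      */
    }
    return (uint64_t)log->active_block * PAGES_PER_BLOCK + log->next_page++;
}
```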

Page 30: Adaptive System Fabric


Storage API – Native data store support

– Atomic storage API – allows delegation of transactional writes to the storage layer instead of doing them sub-optimally at the application layer (a possible API shape is sketched below)
● Atomic batching of multiple I/Os
● FTL layers optimized for atomic writes
– Initially a log-structured FTL is envisaged, but other schemes can be added

– Key-Value API – allows optimized implementation of Hadoop-like systems that rely on key-value stores
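
The prototype below suggests one possible shape for the atomic-batch call: either every write in the batch becomes durable or none does, which is exactly what a log-structured FTL can commit with a single metadata update. The names and types are assumptions, not a defined command set.

```c
/* Hypothetical shape of the atomic-batch API described above.
 * Names and types are assumptions.                                           */
#include <stddef.h>
#include <stdint.h>

struct asf_write {
    uint64_t    gaddr;    /* target in the unified storage address space      */
    const void *data;
    size_t      len;
};

/* All-or-nothing submission of a batch of writes: returns 0 when the whole
 * batch commits, -1 when it aborts and no write is visible.                  */
int asf_atomic_batch(const struct asf_write *writes, size_t count);
```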

Page 31: Adaptive System Fabric


Storage API - Other

● The proposed API will also support the following functionality
– User-defined commands (the limits and context of UDCs are a major area of concern)
– Performance – write caching, NVRAM/DRAM support, QoS/priority
– Failover – RAID, with a separate RAID scheme for block-level metadata
– Management – SSD configuration, cache, environment/enclosure, redundancy
– Security – encryption (AES), authentication, user-defined encryption

Page 32: Adaptive System Fabric


Storage API – hosting OS code

● The OS storage layer can be partly hosted on the Storage Processor
– A microkernel-like distributed server strategy
● The Hybrid Memory Cube can be used to share DRAM with the host

Page 33: Adaptive System Fabric


Appliance Processor

● The Appliance Processor hosts standard storage appliance functionality and exposes it through a standardized interface
– It is envisaged that storage functionality like RDBMS storage layers, file systems and backup/replication can be delegated to the appliance processor
– Since the API is standardized, users can mix and match appliance processors from multiple vendors

● Standard T10, file (CIFS, NFS) and object-store level commands
● LightStor extensions
– User-defined commands
– Performance – easy clustering semantics for scale-out, mirroring
– Redundancy/fail-over – snapshots, replication, copy
– Management (client side) – cluster/redundancy, storage pools/volumes, enclosure, migration, provisioning, service class/QoS
– Security (client side) – key management, capabilities (experimental); will leverage the Storage Processor's encryption capabilities
– Storage optimization – compression, de-duplication (TBD – partly to be done on the Storage Processor)

Page 34: Adaptive System Fabric


I/O Virtualization

Page 35: Adaptive System Fabric


Virtualization

● ASF provides standardized I/O virtualization
– A virtualized fabric is provided by the enhancements proposed to SRIO (the changes are not anticipated to be major, since message passing is easily virtualizable)

● Existing host-side virtualization can continue to be used, but it can now leverage ASF's virtualization features to reduce CPU load
– In systems like ASF, virtualization probably belongs more in the fabric layer than in the individual nodes

Page 36: Adaptive System Fabric


Storage Virtualization

Page 37: Adaptive System Fabric


Implementation

Page 38: Adaptive System Fabric


Staging

● The first versions of the ASF system will be built using commodity components, to leverage off-the-shelf SRIO parts and to complete reference implementations of the Lightstor standard

● Later variants will first use FPGA versions of the Shakti processor to test the memory fabric, followed by ASIC versions of the processor

Page 39: Adaptive System Fabric


Collaboration

● We are looking for academic and industry partners to take the ASF effort forward

● All of IIT-M's designs – HDL code for the processors, interconnect and SSD controllers; schematics/Gerbers for the reference HW; the microkernel OS and storage system SW – will be released as open source
– Please see www.bitbucket.org/riselab for further details

Page 40: Adaptive System Fabric


Architecture experiments with HMC

- For a CPU with fully virtual caches, it would make sense to shift the MMU to the HMC, so that an HMC block can provide a section of total system memory to the CPUs attached to it. This works well in single-address-space OSs, which can have fully virtual caches.

So in a sense, the HMC boots first, sets up a virtual memory region, and the CPUs then attach to these regions (sketched below).

- Once you have logic inside the DRAM, you can try out all kinds of in-memory functions, such as data prefetch and virus scanning.
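
The boot order suggested above could look roughly like the two-step interface below: the HMC logic layer carves out a region first, and CPUs then attach to it. This is a thought-experiment sketch only; every name and signature here is hypothetical.

```c
/* Sketch of the "HMC boots first, CPUs attach later" experiment.
 * All names and signatures are hypothetical.                                 */
#include <stdint.h>

typedef struct {
    uint64_t base;      /* start of the region in the single address space    */
    uint64_t size;
} hmc_region_t;

/* Step 1 (runs on the HMC logic layer at boot): carve out a virtual region
 * of system memory and publish it on the memory fabric.                      */
hmc_region_t hmc_create_region(uint64_t size);

/* Step 2 (runs on each CPU): attach to the published region; with fully
 * virtual caches the CPU keeps no local translation state for it, since the
 * MMU function lives in the HMC.                                             */
int cpu_attach_region(uint16_t cpu_id, const hmc_region_t *region);
```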