Silicon Operating System for Large Scale Heterogeneous Cores and its FPGA Implementation

Huang, Xiang
NCKU SoC & ASIC Lab, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Tel: (06)2757575 62400 2825; Office: 6F, 95602
Web address: http://j92a21b.ee.ncku.edu.tw/broad/index.html

Abstract (1/3)

Grand challenge applications demand high-performance supercomputing clusters.

A competent node architecture is critical to meeting the requirements of varied, computationally demanding applications. This accentuates the need for heterogeneous multicore node architectures in supercomputing clusters, paving the way for the novel concept of Simultaneous Multiple Application (SMAPP) execution without space or time sharing.


Abstract (2/3)

The OS is the other side of the coin in attaining exa-flop performance in supercomputing clusters. Because conventional OSs are software driven, their performance becomes a bottleneck: they must handle the complexities of mapping and scheduling different applications in parallel across the underlying nodes. In this context, it is desirable to make the kernel of the OS completely hardware based. Further, simultaneous multiple application execution without space or time sharing calls for a parallel, hierarchy-based multi-host system.

Hence the Silicon Operating System [SILICOS], a hardware design of an OS for supercomputing clusters that meets these demands, was developed at Waran Research Foundation [WARFT].


Abstract (3/3)

This thesis analyses the architecture and design of SILICOS in greater depth.

The SILICOS architecture is integrated with the Warft India Many Core [WIMAC] simulator, a clock-driven, cycle-accurate simulator.


1. Origin and History (1/10)

The execution of Simultaneous Multiple Application (SMAPP) without space or time sharing will be a major step towards attaining exa-flop computing. Some advantages of SMAPP are:

* Enhanced resource utilization, because the number of independent instructions available for execution grows substantially as the number of applications and their problem sizes increase.
* Cost effectiveness, since multiple applications run on a single cluster.
* Increased performance, since conventional space sharing and time sharing are eliminated.

SMAPP is supported by heterogeneous multicore node architectures based on the CUBEMACH (CUstom Built hEterogeneous Multi-Core ArCHitectures) design paradigm [2]. The CUBEMACH design paradigm achieves low power yet high performance.


1. Origin and History (2/10)

Figure 1: Concept of SMAPP


1. CUBEMACH (3/10)

The CUBEMACH design paradigm aims at creating high-performance, low-power and cost-effective heterogeneous multicore architectures capable of executing a wide range of applications without space or time sharing [1].

The use of hardwired Algorithm Level Functional Units (ALFUs) [3] and their corresponding backbone instruction set architecture, also called the Algorithm Level Instruction Set Architecture (ALISA), increases performance because far fewer instructions are generated and hence far fewer memory fetches are needed.


1. Algorithm Level Functional Unit (4/10)

Why ALFU and why not ALU?

* ALFUs handle higher-order computations by processing blocks of data in a single operation, where a set of ALUs would need many operations to execute the same computation: 1 ALFU instruction = several ALU instructions.
* ALFU-based cores are shown to offer better performance at reduced power compared to ALU-based cores [3].
* ALISA is a superset of other instruction sets, such as vector instructions, CISC and VLIW, which are used in various multi-core/many-core processors.
* A single ALISA instruction encompasses the data dependencies associated with several equivalent ALU instructions and helps minimize the number of cache misses.
* Parallel issue of ALISA instructions poses a major challenge to compilers and cannot be handled by a purely software-based compiler; hence we have resorted to a hardware-based compiler.
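The relation "1 ALFU instruction = several ALU instructions" can be made concrete with a small counting sketch. It assumes a hypothetical matrix-multiply ALFU (the slides do not name specific ALFUs) and counts only scalar multiplies and adds on the ALU side, ignoring loads, stores and loop overhead.

```python
def alu_instruction_count(n: int) -> int:
    """Scalar multiply and add instructions for an n x n matrix product."""
    multiplies = n ** 3            # one multiply per (row, column, k) triple
    additions = n * n * (n - 1)    # n - 1 adds to sum each output element
    return multiplies + additions


def alfu_instruction_count(n: int) -> int:
    """Assumption: a hypothetical matrix-multiply ALFU takes both blocks in one instruction."""
    return 1


for n in (4, 16, 64):
    print(n, alu_instruction_count(n), alfu_instruction_count(n))
```

Under these assumptions the gap grows cubically with the block size, which is the intuition behind the reduced instruction generation and memory fetches claimed for ALISA.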


1. Customizable Compiler On Silicon (5/10)

The Compiler-On-Silicon (COS) [4] is a hardware-based compiler that is easily customized to suit different CUBEMACH architectures for different classes of applications.

The Compiler-On-Silicon is organized as a two-stage hierarchy:

* the Primary Compiler On Silicon
* the Secondary Compiler On Silicon

The hardware-based dependency analyzer in the COS is the key to increasing the rate of instruction generation.


1. Customizable Compiler On Silicon (6/10)

Figure 3: Hierarchical Architecture of Compiler on Silicon


1. On Core Network Architecture (7/10)

The CUBEMACH architecture uses a novel, cost-effective on-chip network called the On Core Network (OCN).

The OCN is hierarchical: a Sub-Local Router serves a group of ALFUs (a population), and a Local Router handles communication across populations. Populations of ALFUs together form a core, and Global Routers establish communication across cores.
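The OCN hierarchy described above can be illustrated with a minimal sketch that picks the highest router level a message must cross. The `AlfuAddress` type and its (core, population, alfu) fields are assumptions made for illustration; the slides do not define an addressing scheme.

```python
from typing import NamedTuple


class AlfuAddress(NamedTuple):
    """Hypothetical hierarchical address of one ALFU."""
    core: int
    population: int
    alfu: int


def router_level(src: AlfuAddress, dst: AlfuAddress) -> str:
    """Return the highest OCN router level needed to carry a message from src to dst."""
    if src == dst:
        return "none"          # same ALFU, nothing to route
    if src.core != dst.core:
        return "global"        # different cores
    if src.population != dst.population:
        return "local"         # same core, different populations
    return "sub-local"         # same population


a = AlfuAddress(core=0, population=1, alfu=2)
print(router_level(a, AlfuAddress(0, 1, 5)))  # same population
print(router_level(a, AlfuAddress(0, 3, 0)))  # same core, other population
print(router_level(a, AlfuAddress(2, 0, 0)))  # other core
```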


1. Customizable Compiler On Silicon (8/10)

Figure 4: Hierarchical OCN Architecture of single core


1. Silicon Operating System (9/10)

The OS of present-day supercomputers is managed by a stripped-down kernel running on the nodes. Core OS functions such as process scheduling, memory management, I/O handling and exception handling are handled by this kernel.

This level of operation at the cluster suffices for parallel execution of applications. With SMAPP execution without space or time sharing, however, the communication complexity is huge and must be managed by efficient mapping strategies.

The Silicon Operating System (SILICOS), a hardware design of the OS for supercomputing clusters that meets these demands, was developed at WARFT [1].


2. Overview of Linux Kernels (1/8)

The Linux kernel abstracts and mediates access to all hardware resources, including the CPU.

One important aspect of the Linux kernel is its support for multitasking. Each process can act as if it were alone in the system, with exclusive access to memory and other hardware. The kernel provides this facility by running the processes concurrently, giving each process an equal share of the hardware resources, and maintaining inter-process security.


2. Overview of Linux Kernels (2/8)

The Linux kernel, as described by Ivan T. Bowman [8], is composed of five main subsystems:

* the Process Scheduler (SCHED)
* the Memory Manager (MM)
* the Virtual File System (VFS)
* the Network Interface (NET)
* the Inter-Process Communication (IPC) subsystem

This thesis analyses the architecture and design of SILICOS in greater depth.


2. Loop Unroller and Dependency Analyser (3/8)

* Dependencies across application libraries play a major role in allocating libraries to the underlying nodes.
* Information on dependent libraries needs to be passed to the process scheduler so that it can schedule efficiently and extract maximum work from the underlying nodes.
* In addition, complex applications may contain loops across dependent libraries; these loops need to be unrolled in order to identify the execution time of each iteration and to schedule those libraries.
* A dependency analyzer is therefore needed to extract the dependencies and execution time of each dependent library and to perform loop unrolling.


2. Loop Unroller and Dependency Analyser (4/8)

* The graph traversal unit traverses the dependency graph and extracts the dependent libraries. The information it produces is recorded in the library detail table.
* The loop unroller unit is an integral part of the dependency analyzer. It unrolls loops by replicating libraries according to the loop index value, and so greatly assists the dependency analyzer in time stamp generation.
* Once the dependencies across the libraries have been extracted, the information is used to generate the time stamps of child libraries.
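The three steps above (loop unrolling, graph traversal, time stamp generation) can be sketched in software as follows. This is only an illustration of the idea, not the hardware design: the data layout, the `name#i` naming of unrolled copies, and the rule that a child's time stamp is the latest finish time of its parents are all assumptions.

```python
from collections import deque


def unroll(libs, loop_counts):
    """Replicate looped libraries once per iteration (loop counts assumed >= 1).

    libs: {name: (exec_time, [parent names])}; loop_counts: {name: iterations}.
    """
    last_copy = {name: f"{name}#{n - 1}" for name, n in loop_counts.items()}
    unrolled = {}
    for name, (exec_time, parents) in libs.items():
        # Children of a looped library wait for its final iteration.
        parents = [last_copy.get(p, p) for p in parents]
        if name not in loop_counts:
            unrolled[name] = (exec_time, parents)
            continue
        for i in range(loop_counts[name]):
            # Iteration 0 keeps the original parents; later ones follow the previous copy.
            copy_parents = parents if i == 0 else [f"{name}#{i - 1}"]
            unrolled[f"{name}#{i}"] = (exec_time, copy_parents)
    return unrolled


def time_stamps(libs):
    """Earliest start time of each library, found by traversing the dependency graph."""
    children = {name: [] for name in libs}
    pending = {}
    for name, (_, parents) in libs.items():
        pending[name] = len(parents)
        for parent in parents:
            children[parent].append(name)
    start = {name: 0 for name in libs}
    ready = deque(name for name, count in pending.items() if count == 0)
    while ready:
        name = ready.popleft()
        finish = start[name] + libs[name][0]
        for child in children[name]:
            start[child] = max(start[child], finish)
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    return start


# Hypothetical application: "solve" is a loop that runs three times.
libs = {"load": (2, []), "solve": (3, ["load"]), "report": (1, ["solve"])}
stamps = time_stamps(unroll(libs, {"solve": 3}))
print(stamps)
```

In this example the unrolled copies `solve#0`, `solve#1` and `solve#2` start at times 2, 5 and 8, so `report` receives time stamp 11.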


2. ISA of the Dependency Analyzer (5/8)

Figure 12: Overall Architecture of Dependency Analyzer


2. Design of Hardware Based Programmable Scheduler for SMAPP (6/8)

The existing scheduler is not programmable, so new scheduling heuristics cannot be programmed into it.

Hence the scheduler needs to be made adaptive, so that the user can choose the scheduling heuristic.

The optimization techniques adopted for our scheduler are:

* Game Theory-Simulated Annealing (GT-SA) based scheduling
* Ant Colony Optimization based scheduling


2. Design of Hardware Based Programmable Scheduler for SMAPP (7/8)

Game Theory-Simulated Annealing based Scheduling

* The GT-SA based approach takes the communication and computation complexity of the nodes as its cost function.
* By varying the parameters of the cluster system, an optimal value of the cost function for the nodes in the secondary host is reached.
* The scheduler unit schedules libraries to the underlying nodes while keeping the computation and communication complexity (cost function) of the node plane in its optimized state.
* The GT-SA based scheduler unit compares the current state of the cost function with a next state obtained by varying system parameters such as the available load and the queue length of buffers.
* The unit also accepts poorer next states, according to a probability equation, so as not to get stuck in a local minimum.
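The compare-and-accept loop described above can be sketched with plain simulated annealing. The slides do not give the probability equation, so the standard Metropolis criterion exp(-delta / T) is assumed here; the cost function, cooling schedule and input data are likewise hypothetical, and the game-theoretic part of GT-SA is not modelled.

```python
import math
import random


def cost(assignment, exec_time, comm, n_nodes):
    """Assumed cost: heaviest node load plus traffic between libraries on different nodes."""
    load = [0.0] * n_nodes
    for lib, node in enumerate(assignment):
        load[node] += exec_time[lib]
    traffic = sum(w for (a, b), w in comm.items() if assignment[a] != assignment[b])
    return max(load) + traffic


def anneal(exec_time, comm, n_nodes, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Return (best assignment, its cost, cost of the random starting assignment)."""
    rng = random.Random(seed)
    current = [rng.randrange(n_nodes) for _ in exec_time]
    current_cost = cost(current, exec_time, comm, n_nodes)
    initial_cost = current_cost
    best, best_cost = list(current), current_cost
    temperature = t0
    for _ in range(steps):
        # Next state: move one randomly chosen library to a randomly chosen node.
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_nodes)
        candidate_cost = cost(candidate, exec_time, comm, n_nodes)
        delta = candidate_cost - current_cost
        # Accept improvements always, and poorer states with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temperature *= alpha  # geometric cooling
    return best, best_cost, initial_cost


# Hypothetical workload: five libraries, two nodes, pairwise communication weights.
exec_time = [4, 3, 2, 2, 1]
comm = {(0, 1): 2, (1, 2): 1, (3, 4): 3}
best, best_cost, initial_cost = anneal(exec_time, comm, n_nodes=2)
print(best, best_cost, initial_cost)
```

Accepting some poorer states early, while the temperature is high, is what lets the search leave a local minimum; as the temperature falls the loop behaves more and more like a greedy descent.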


2. Design of Hardware Based Programmable Scheduler for SMAPP (8/8)

Ant Colony Optimization based Scheduling

* This scheduling algorithm adopts the way ants find the shortest path to food by depositing and following pheromone [21].
* Application libraries that are to be mapped onto a distant node need to travel along the shortest path across the nodes to reach the destination node.
* The host can use this scheduling unit to choose the path to a destination node or to broadcast. The unit constantly updates information on the distances between nodes and on the shortest paths, which can be used to reduce the communication complexity across the nodes.
* Thus, based on the network topology and the traffic in the network, the optimized path to a particular node is identified.
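The path selection described above can be sketched with a basic ant colony search over a small node graph. The topology, link weights and parameters (`evaporation`, `beta`, number of ants) are assumptions for illustration; the slides do not specify the update rules used in the hardware unit, so a textbook Ant System update is used instead.

```python
import random


def ant_colony_shortest_path(graph, src, dst, n_iter=30, n_ants=10,
                             evaporation=0.5, beta=2.0, seed=0):
    """Search for a short src -> dst path; graph is {node: {neighbour: link cost}}."""
    rng = random.Random(seed)
    pheromone = {(u, v): 1.0 for u in graph for v in graph[u]}
    best_path, best_len = None, float("inf")
    for _ in range(n_iter):
        tours = []
        for _ in range(n_ants):
            path, length = [src], 0.0
            while path[-1] != dst:
                u = path[-1]
                options = [v for v in graph[u] if v not in path]
                if not options:
                    break  # dead end: this ant is discarded
                # Prefer links with more pheromone and lower cost.
                weights = [pheromone[(u, v)] * (1.0 / graph[u][v]) ** beta
                           for v in options]
                v = rng.choices(options, weights=weights)[0]
                length += graph[u][v]
                path.append(v)
            if path[-1] == dst:
                tours.append((path, length))
                if length < best_len:
                    best_path, best_len = path, length
        # Evaporate old pheromone, then let each successful ant reinforce its path.
        for edge in pheromone:
            pheromone[edge] *= 1.0 - evaporation
        for path, length in tours:
            for u, v in zip(path, path[1:]):
                pheromone[(u, v)] += 1.0 / length
    return best_path, best_len


# Hypothetical four-node cluster; link weights model distance or traffic.
graph = {
    0: {1: 1, 2: 2},
    1: {0: 1, 2: 1, 3: 1},
    2: {0: 2, 1: 1, 3: 2},
    3: {1: 1, 2: 2},
}
path, length = ant_colony_shortest_path(graph, src=0, dst=3)
print(path, length)
```

Raising a link's weight to reflect congestion changes both the heuristic term and the pheromone deposited on paths through it, which is one way the traffic-aware behaviour mentioned above could be modelled.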


3. Xilinx Virtex FPGA Family

Xilinx Virtex family FPGAs are used to prototype the SILICOS architecture.

The Xilinx Virtex 7 FPGA kit, the latest in the Virtex family, has 2,000,000 logic cells, 68 Mb of RAM and 3,600 DSP slices, providing higher bandwidth and support for programming parallel processing logic into the FPGA.
