Silicon Operating System for Large Scale Heterogeneous Cores
and its FPGA Implementation
Huang, Xiang
Department of Electrical Engineering, National Cheng Kung University
Tainan, Taiwan, R.O.C.
(06)2757575 62400 2825, Office: , 6F,95602
Email: [email protected]
Address: http://j92a21b.ee.ncku.edu.tw/broad/index.html
NCKU SoC & ASIC Lab * Huang, Xiang,
Abstract (1/3)
Grand challenge applications demand high-performance supercomputing
clusters to satisfy their requirements.
A competent node architecture in supercomputing clusters is
critical to meeting the requirements of varied, computationally
demanding applications. This accentuates the need for heterogeneous
multicore node architectures in supercomputing clusters, paving the
way for the novel concept of Simultaneous Multiple Application
(SMAPP) execution without space-time sharing.
Abstract (2/3)
The OS is the other side of the coin in attaining exaflop
performance in supercomputing clusters. Because conventional OSs
are software driven, their performance becomes a bottleneck: they
carry the complexity of mapping and scheduling different
applications in parallel across the underlying nodes. In this
context, it is preferable to make the kernel of the OS completely
hardware based. Further, simultaneous multiple application
execution without space-time sharing calls for a parallel,
hierarchy-based multi-host system.
Hence a hardware design of an OS for supercomputing clusters,
known as the Silicon Operating System [SILICOS], was developed at
the Waran Research Foundation [WARFT] to meet these demands.
Abstract (3/3)
This thesis analyses the architecture and design of SILICOS in
greater depth.
The SILICOS architecture is integrated with the WARFT India Many
Core [WIMAC] simulator, a clock-driven, cycle-accurate simulator.
1. Origin and History (1/10)
The execution of Simultaneous Multiple Applications (SMAPP)
without space-time sharing will be a major step towards attaining
exaflop computing.
Some positives of SMAPP are:
- Enhanced resource utilization, due to a large-scale increase in
  the number of independent instructions available for execution
  as the number of applications and their problem sizes increase.
- Cost effectiveness, since multiple applications run in a single
  cluster.
- Increased performance, since conventional space sharing and time
  sharing are eliminated.
SMAPP is supported by heterogeneous multicore node architectures
based on the CUBEMACH (CUstom Built hEterogeneous Multi-Core
ArCHitectures) design paradigm [2]. The CUBEMACH design paradigm
achieves low power yet high performance.
1. Origin and History (2/10)
Figure 1: Concept of SMAPP
1. CUBEMACH (3/10)
The CUBEMACH design paradigm aims at creating high-performance,
low-power and cost-effective heterogeneous multicore architectures
capable of executing a wide range of applications without space or
time sharing [1].
The use of hardwired Algorithm Level Functional Units (ALFUs) [3]
and their corresponding backbone instruction set architecture,
also called the Algorithm Level Instruction Set Architecture
(ALISA), increases performance because far fewer instructions are
generated and hence far fewer memory fetches are needed.
1. Algorithm Level Functional Unit (4/10)
Why ALFU and why not ALU?
- ALFUs handle higher-order computations by processing blocks of
  data in a single operation, where a set of ALUs would need many
  operations to execute the same computation.
- 1 ALFU instruction = several ALU instructions.
- ALFU-based cores have been shown to offer better performance at
  reduced power compared to ALU-based cores [3].
- ALISA is a superset of other instruction sets, such as vector
  instructions, CISC and VLIW, which are used in various
  multi-core/many-core processors.
- A single ALISA instruction encompasses the data dependencies
  associated with several equivalent ALU instructions and helps
  minimize the number of cache misses.
- Parallel issue of ALISA instructions poses a major challenge to
  compilers and cannot be handled by a purely software-based
  compiler; hence we have resorted to a hardware-based compiler.
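To make the "one ALFU instruction replaces several ALU instructions" point concrete, the sketch below counts instructions for an n x n matrix multiply under a deliberately simple cost model. The model is an assumption made for illustration: one ALU instruction per scalar multiply or add, and a single hypothetical ALFU-level matrix-multiply instruction. The function names are not part of ALISA.

```python
def alu_instruction_count(n: int) -> int:
    """Scalar instructions for an n x n matrix multiply (assumed model).

    Each of the n * n output elements needs n multiplies and n - 1 adds.
    """
    return n * n * (n + (n - 1))


def alfu_instruction_count(n: int) -> int:
    """Assumed model: one block-level instruction covers the whole multiply."""
    return 1


def reduction_factor(n: int) -> float:
    """ALU instructions replaced by one hypothetical ALFU instruction."""
    return alu_instruction_count(n) / alfu_instruction_count(n)


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, alu_instruction_count(n), reduction_factor(n))
```

Under this model the gap grows roughly with n cubed, which is the intuition behind fewer instruction fetches per unit of work.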
1. Customizable Compiler On Silicon (5/10)
The Compiler-On-Silicon (COS) [4] is a hardware-based compiler
that is easily customized to suit different CUBEMACH architectures
for different classes of applications.
Compiler-On-Silicon is made up of a two-stage hierarchy:
- The Primary Compiler On Silicon.
- The Secondary Compiler On Silicon.
The hardware-based dependency analyzer in COS is the key to
increasing the rate of instruction generation.
1. Customizable Compiler On Silicon (6/10)
Figure 3: Hierarchical Architecture of Compiler on Silicon
1. On Core Network Architecture (7/10)
The CUBEMACH architecture uses a novel, cost-effective on-chip
network called the On Core Network (OCN).
The OCN is hierarchical: a Sub-Local Router serves a group of
ALFUs (a population), and a Local Router handles communication
across populations. Populations of ALFUs form a core, and Global
Routers establish communication across cores.
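The router hierarchy can be sketched as a lookup on where two ALFUs sit. This is a minimal illustration, assuming each ALFU is addressed by a hypothetical (core, population, alfu) triple and that global routers connect cores; neither the address format nor the function name comes from the OCN design itself.

```python
from typing import NamedTuple


class AlfuAddress(NamedTuple):
    """Hypothetical ALFU address: which core, which population, which unit."""
    core: int
    population: int
    alfu: int


def router_level(src: AlfuAddress, dst: AlfuAddress) -> str:
    """Return the highest router level a message from src to dst must cross."""
    if src == dst:
        return "none"          # same ALFU, nothing to route
    if src.core != dst.core:
        return "global"        # different cores
    if src.population != dst.population:
        return "local"         # same core, different populations
    return "sub-local"         # same population


if __name__ == "__main__":
    a = AlfuAddress(core=0, population=1, alfu=2)
    b = AlfuAddress(core=0, population=3, alfu=0)
    print(router_level(a, b))
```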
1. On Core Network Architecture (8/10)
Figure 4: Hierarchical OCN Architecture of single core
1. Silicon Operating System (9/10)
The OS of present-day supercomputers is managed by a stripped
kernel present in the nodes. Core OS functionalities such as
process scheduling, memory management, I/O handling and exception
handling are handled by this stripped kernel.
This level of operation at the cluster suffices for parallel
execution of applications. With SMAPP without space-time sharing,
however, the communication complexity involved is huge and needs
to be managed by efficient mapping strategies.
The hardware design of an OS for supercomputing clusters that
meets these demands, known as the Silicon Operating System
(SILICOS), was developed at WARFT [1].
2. Overview of Linux Kernels (1/8)
The Linux kernel abstracts and mediates access to all hardware
resources, including the CPU.
One important aspect of the Linux kernel is its support for
multitasking. Each process can act as though it has exclusive use
of memory and other hardware. The kernel is responsible for
providing this facility by running the processes concurrently,
giving each process a fair share of hardware resources, and
maintaining inter-process security.
2. Overview of Linux Kernels (2/8)
The Linux kernel, as described by Ivan T. Bowman [8], is composed
of five main subsystems:
- the Process Scheduler (SCHED)
- the Memory Manager (MM)
- the Virtual File System (VFS)
- the Network Interface (NET)
- the Inter-Process Communication (IPC) subsystem
This thesis analyses the architecture and design of SILICOS in
greater depth.
2. Loop Unroller and Dependency Analyser (3/8)
- Dependencies across application libraries play a major role in
  the allocation of libraries to the underlying nodes.
- Information on dependent libraries needs to be passed on to the
  process scheduler for efficient scheduling, thus extracting
  maximum work from the underlying nodes.
- In addition, complex applications may contain loops across the
  dependent libraries. These loops need to be unrolled in order to
  identify the execution time of each iteration and to schedule
  those libraries.
- A dependency analyzer is therefore needed to extract the
  dependencies and execution time of each dependent library and to
  perform loop unrolling.
2. Loop Unroller and Dependency Analyser (4/8)
- The graph traversal unit traverses the dependency graph and
  extracts the dependent libraries. Its output is written to the
  library detail table.
- The loop unroller unit is an integral part of the dependency
  analyzer. It unrolls loops by replicating the libraries according
  to the loop index value, and thereby assists the dependency
  analyzer in time stamp generation.
- Once the dependencies across the libraries have been extracted,
  this information is used to generate the time stamps of child
  libraries.
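The three steps above (loop unrolling, graph traversal, time stamp generation) can be sketched in software as follows. This is an illustrative model only, not the hardware design: the `unroll` and `time_stamps` names, the `name#i` naming of replicated libraries, the rule that each loop iteration depends on the previous one, and the definition of a time stamp as the earliest start time are all assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple

# name -> (execution_time, loop_count, parent library names)
Library = Tuple[int, int, List[str]]
# unrolled name -> (execution_time, parent library names)
Unrolled = Dict[str, Tuple[int, List[str]]]


def unroll(libs: Dict[str, Library]) -> Unrolled:
    """Replicate each looped library once per iteration.

    Assumption: iteration i depends on iteration i - 1, the first
    iteration inherits the original parents, and children depend on
    the last iteration.
    """
    last = {name: f"{name}#{count - 1}" if count > 1 else name
            for name, (_, count, _) in libs.items()}
    unrolled: Unrolled = {}
    for name, (time, count, parents) in libs.items():
        resolved = [last[p] for p in parents]
        if count <= 1:
            unrolled[name] = (time, resolved)
            continue
        for i in range(count):
            deps = resolved if i == 0 else [f"{name}#{i - 1}"]
            unrolled[f"{name}#{i}"] = (time, deps)
    return unrolled


def time_stamps(graph: Unrolled) -> Dict[str, int]:
    """Earliest start time of each library, via a topological traversal."""
    children: Dict[str, List[str]] = {n: [] for n in graph}
    pending = {n: len(deps) for n, (_, deps) in graph.items()}
    for n, (_, deps) in graph.items():
        for d in deps:
            children[d].append(n)
    start = {n: 0 for n in graph}
    ready = deque(n for n, k in pending.items() if k == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        finish = start[n] + graph[n][0]
        for c in children[n]:
            # A child can start only after its slowest parent finishes.
            start[c] = max(start[c], finish)
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    if visited != len(graph):
        raise ValueError("dependency graph still contains a cycle")
    return start


if __name__ == "__main__":
    libs = {"A": (3, 1, []), "B": (2, 3, ["A"]), "C": (4, 1, ["A", "B"])}
    print(time_stamps(unroll(libs)))
```

In the example, library B loops three times, so it is replicated as B#0, B#1 and B#2, and C cannot start until B#2 has finished.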
2. ISA of the Dependency Analyzer (5/8)
Figure 12: Overall Architecture of Dependency Analyzer
2. Design of Hardware Based Programmable Scheduler for SMAPP (6/8)
The existing scheduler is not programmable, so new scheduling
heuristics cannot be programmed into it.
Hence the scheduler needs to be made adaptive, so that the user
can choose the scheduling heuristic.
The optimization techniques we have adopted for our scheduler
are:
- Game Theory-Simulated Annealing (GT-SA) based Scheduling
- Ant Colony Optimization based Scheduling
2. Design of Hardware Based Programmable Scheduler for SMAPP (7/8)
Game Theory-Simulated Annealing based Scheduling
- The communication and computation complexity of the nodes are
  taken as the cost function in the GT-SA based approach.
- By varying the parameters of the cluster system, an optimal cost
  function of the nodes in the secondary host is achieved.
- This scheduler unit schedules libraries to the underlying nodes
  while keeping the computation and communication complexity (the
  cost functions) of the node plane in its optimized state.
- The GT-SA based scheduler unit compares the current state of the
  cost function with a next state obtained by varying system
  parameters such as the available load and the queue length of
  buffers.
- The unit also accepts worse next states, according to a
  probability equation, so as not to get stuck in a local minimum.
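The accept-or-reject step described above can be sketched with a plain simulated annealing loop. This covers only the annealing half and leaves the game-theoretic part unmodelled. The slides do not give the probability equation, so the standard Metropolis rule exp(-delta / T) is assumed here, and the cost model (heaviest node load plus a count of dependent libraries placed on different nodes) and all names are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple


def cost(mapping: Dict[str, str], load: Dict[str, float],
         edges: List[Tuple[str, str]], comm_weight: float = 1.0) -> float:
    """Assumed cost: heaviest node load plus weighted cross-node dependencies."""
    node_load: Dict[str, float] = {}
    for lib, node in mapping.items():
        node_load[node] = node_load.get(node, 0.0) + load[lib]
    comm = sum(1 for a, b in edges if mapping[a] != mapping[b])
    return max(node_load.values()) + comm_weight * comm


def accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis rule (assumed): always take improvements, and take a
    worse state with probability exp(-delta / temperature)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def anneal(load: Dict[str, float], edges: List[Tuple[str, str]],
           nodes: List[str], steps: int = 2000, t0: float = 5.0,
           cooling: float = 0.995, seed: int = 0):
    """Search for a low-cost library-to-node mapping."""
    rng = random.Random(seed)
    libs = sorted(load)
    current = {lib: rng.choice(nodes) for lib in libs}
    current_cost = cost(current, load, edges)
    best, best_cost = dict(current), current_cost
    temperature = t0
    for _ in range(steps):
        # Next state: move one randomly chosen library to a random node.
        candidate = dict(current)
        candidate[rng.choice(libs)] = rng.choice(nodes)
        candidate_cost = cost(candidate, load, edges)
        if accept(candidate_cost - current_cost, temperature, rng):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        temperature = max(temperature * cooling, 1e-6)
    return best, best_cost


if __name__ == "__main__":
    load = {"L1": 4.0, "L2": 3.0, "L3": 2.0, "L4": 1.0}
    edges = [("L1", "L2"), ("L3", "L4")]
    print(anneal(load, edges, ["n0", "n1"]))
```

At high temperature the loop accepts many worse mappings, which lets it leave a local minimum; as the temperature cools it behaves more and more like a greedy search.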
2. Design of Hardware Based Programmable Scheduler for SMAPP (8/8)
Ant Colony Optimization based Scheduling
- This scheduling algorithm adopts the way ants use pheromone
  trails to find the shortest path to food [21].
- Here, application libraries to be mapped onto a distant node
  need to traverse the shortest path across the nodes to reach the
  destination node.
- The host can use this scheduling unit to choose the path to a
  destination node, or to broadcast.
- The unit constantly updates information about the distances
  between nodes and the shortest paths, which can be used to
  reduce the communication complexity across the nodes.
- Thus, based on the network topology and the traffic in the
  network, optimized paths to a particular node are identified.
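A software sketch of pheromone-guided path selection is given below. It is a textbook-style ant colony loop, not the scheduler unit's actual logic: the graph representation, the parameters (`alpha`, `beta`, `evaporation`, number of ants) and the function names are all assumptions, and link weights stand in for whatever distance or traffic measure the unit tracks.

```python
import random
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: link distance}


def path_length(graph: Graph, path: List[str]) -> float:
    """Sum of link distances along a path."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


def walk(graph: Graph, pheromone: Dict[Tuple[str, str], float],
         src: str, dst: str, rng: random.Random,
         alpha: float = 1.0, beta: float = 2.0) -> Optional[List[str]]:
    """One ant walks from src to dst, or returns None at a dead end."""
    path = [src]
    while path[-1] != dst:
        here = path[-1]
        options = [n for n in graph[here] if n not in path]
        if not options:
            return None
        # Prefer links with more pheromone and shorter distance.
        weights = [pheromone[(here, n)] ** alpha
                   * (1.0 / graph[here][n]) ** beta for n in options]
        path.append(rng.choices(options, weights=weights)[0])
    return path


def aco_shortest_path(graph: Graph, src: str, dst: str, ants: int = 20,
                      iterations: int = 30, evaporation: float = 0.5,
                      seed: int = 0) -> Optional[List[str]]:
    """Return the best path found; it is not guaranteed to be optimal."""
    rng = random.Random(seed)
    pheromone = {(a, b): 1.0 for a in graph for b in graph[a]}
    best: Optional[List[str]] = None
    for _ in range(iterations):
        walks = (walk(graph, pheromone, src, dst, rng) for _ in range(ants))
        paths = [p for p in walks if p is not None]
        for edge in pheromone:
            # Evaporate, with a floor so no link becomes unreachable.
            pheromone[edge] = max(pheromone[edge] * (1.0 - evaporation), 1e-6)
        for p in paths:
            length = path_length(graph, p)
            for a, b in zip(p, p[1:]):
                pheromone[(a, b)] += 1.0 / length  # shorter paths deposit more
            if best is None or length < path_length(graph, best):
                best = p
    return best


if __name__ == "__main__":
    graph = {
        "host": {"n1": 1.0, "n2": 4.0},
        "n1": {"host": 1.0, "n2": 1.0, "n3": 5.0},
        "n2": {"host": 4.0, "n1": 1.0, "n3": 1.0},
        "n3": {"n1": 5.0, "n2": 1.0},
    }
    print(aco_shortest_path(graph, "host", "n3"))
```

Because pheromone is re-deposited every iteration, changing a link weight between iterations (for example to reflect traffic) would shift the preferred path, which is the property the slide relies on.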
3. Xilinx Virtex FPGA Family
Xilinx Virtex family FPGAs are being used to prototype the
SILICOS architecture.
The Xilinx Virtex 7 FPGA kit, the latest in the Virtex family,
provides 2,000,000 logic cells, 68 Mb of RAM and 3,600 DSP slices.
This gives higher bandwidth and helps in programming parallel
processing logic into the FPGA kit.