Hardware-based CIL-machine
Nizhniy Novgorod State University, Russia
Laboratory of Physical Fundamentals and Technologies of Wireless Communications
Reporter: Maxim Shuralev
Head of the project: Dr. Alexey Umnov
Mar 27, 2015
Slide 2
Hardware CIL processor project team
Hardware: Maxim Shuralev, Maxim Sokolov, Dmitry Mordvinov (NNSU, Wireless Lab)
Software, workloads and tools: Andrey Eltsov (NNSU, Wireless Lab), Roman Mitin, Sergey Lyalin, Sergey Galkin, Ilia Golubev (NNSU, IT Lab)
Support: Dmitry Golovachev, Svetlana Surova, Elena Pankratova (NNSU, Wireless Lab)
Consultants: Aliaksei Chapyzhenka (Intel), Dmitry Ragozin (Intel), Sergey Chernyshov (Nizhniy Novgorod State Technology University)
Head of Wireless Lab: Alexey Umnov
Slide 3
Agenda
Introduction
Architecture of the CIL processor
Description of the DSP core
Description of the CIL core
Speed-up features of the CIL core:
• a metainformation cache
• a hardware stack
• a hardware type control engine
Garbage collector implementation
Example of DSP workload for the processor
Development board for processor implementation
HW Implementation results
Software support & libraries
Conclusion and comparison
Slide 4
Introduction
Port of the .NET engine to an energy-efficient, low-power mobile platform
Advantages and disadvantages of a stack-based CIL engine:
• the maximum execution speed of CIL instructions cannot exceed one instruction per clock
• a stack engine is the simplest way to execute machine code, as instruction decoding and the processor structure are very simple
• limited ability for parallel instruction execution
• low complexity and low power consumption
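The simplicity of a stack engine can be illustrated with a minimal software sketch. This is not the hardware design from the slides: the function name `run_stack_program` and the tuple encoding of instructions are assumptions made for illustration, and only a handful of real CIL mnemonics (`ldc.i4`, `add`, `sub`, `mul`) are handled.

```python
def run_stack_program(program):
    """Execute (opcode, *operands) tuples on an evaluation stack.

    Hypothetical encoding for illustration only; integer wraparound
    at 32 bits is not modeled.
    Returns the final stack and the number of instructions executed.
    """
    stack = []
    executed = 0
    for instr in program:
        opcode = instr[0]
        if opcode == "ldc.i4":              # push an integer constant
            stack.append(instr[1])
        elif opcode in ("add", "sub", "mul"):
            right = stack.pop()             # top of stack
            left = stack.pop()              # under-top-of-stack
            if opcode == "add":
                stack.append(left + right)
            elif opcode == "sub":
                stack.append(left - right)
            else:
                stack.append(left * right)
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
        executed += 1                       # strictly one instruction at a time
    return stack, executed


# (2 + 3) * 4 expressed as stack code
program = [("ldc.i4", 2), ("ldc.i4", 3), ("add",), ("ldc.i4", 4), ("mul",)]
result, count = run_stack_program(program)
```

Decoding is a single lookup on the opcode, and every instruction implicitly addresses the top of the stack, which is why the decoder stays small but instructions cannot easily be issued in parallel.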
Slide 5
Introduction: application target and target market
.NET is intended for different Web-oriented services, distributed business databases, online transactions, CRM system support, etc.
The CIL processor is not supposed to compete with desktop processors and PDAs on performance, but it is great for the mobile market and the digital home!
The target is end-user specialized and oriented toward:
MOBILE DEVICES, Web terminals, Web browsers, interactive TV, HOUSE CONTROL SYSTEMS
Slide 6
Introduction: requirements for the CIL processor
• Execute the .NET (CIL) code directly
.NET is native code
• Consume low power from the power supply
Mobile low-power devices
• Effectively handle DSP tasks
New generation of interactive multimedia mobile devices
Slide 7
Architecture
High-level structure of the CIL processor implementation
Programmer's model
[Layer diagram:]
Hardware CIL processor
DSP kernel, CIL instruction decoder, CIL metainformation support
DSP library: codecs, protocols, software-defined radio, modems, multimedia processing libraries
Standard CIL class libraries, custom CIL class libraries, other performance libraries
CIL application
Slide 8
Architecture
High-level hardware structure of the CIL processor
Hardware structure
Main data bus (X)
Main arithmetic unit
Secondary data bus (Y)
X data bus address generation unit
X-space address bus
Y-space address bus
Y data bus address generation unit
Instruction and data cache unit, system control unit
CIL meta-information caches
CIL instruction decoder
native DSP set instruction decoder
Slide 9
Architecture
Why DSP-based?
Is it a waste of development time or a necessity for the digital home?
The CIL processor is an excellent solution for the digital home
Pro:
• We have a firmware layer for executing very complex CIL instructions
• 5-10x higher performance in multimedia applications
Contra:
• Increased development time
• We only need to implement the "standard" CIL set, not a DSP
Slide 10
Architecture
Why DSP-based?
Hardware implementation
Pro:
• Effective & low-power computational kernel
• Good mapping "CIL instruction -> DSP instruction"
• Low power consumption in multimedia tasks
• Similar technology to the existing and efficient ARM/Java Jazelle
Contra:
• Only serial instruction execution (as we have a stack-based CIL instruction set and do not want to use superscalar techniques)
Slide 11
Architecture
Why DSP-based?
• "2-in-1": 2 native instruction sets on-board
• Complex CIL instructions (e.g. type hierarchy checks and safety checks) are simply implemented in firmware as DSP instructions
• 5x-10x speed improvement for DSP workloads
• Low overhead in terms of extra transistors on-chip
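The split between directly mapped CIL instructions and complex ones handled in firmware could be modeled roughly as below. This is only a sketch: the DSP operation names (`dsp_add`, ...) and firmware routine names (`fw_castclass`, ...) are hypothetical placeholders, not the project's actual instruction set; `castclass` and `isinst` are the real CIL opcodes for type hierarchy checks.

```python
# Hypothetical mapping tables: names on the right are illustrative only.
DIRECT_MAP = {          # simple CIL opcode -> single native DSP operation
    "add": "dsp_add",
    "sub": "dsp_sub",
    "mul": "dsp_mul",
}
FIRMWARE_MAP = {        # complex CIL opcode -> firmware routine in DSP code
    "castclass": "fw_castclass",
    "isinst": "fw_isinst",
}


def decode(opcode):
    """Return ('direct' | 'firmware', target) for a CIL opcode."""
    if opcode in DIRECT_MAP:
        return ("direct", DIRECT_MAP[opcode])
    if opcode in FIRMWARE_MAP:
        return ("firmware", FIRMWARE_MAP[opcode])
    raise ValueError(f"unsupported opcode: {opcode}")


paths = [decode(op) for op in ("add", "castclass")]
```

Under this assumption, only the small direct table has to live in the hardware decoder, while everything in the firmware table reuses the DSP instruction set that is already on the chip.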
Slide 12
Description of DSP core units
[Block diagram of the DSP core, with three parts labeled ALU, AGU-1 and AGU-2.]
ALU: data memory vector input register and program memory vector input register; shifters; cross-bar switch units; two 16*16 multipliers; temporary product registers 1 and 2; ALU adder; special functional unit; MUXes; accumulator register file with stack control (A0 is the top of stack in CIL mode, A1 is the under-top-of-stack in CIL, ... An); saturation units; connections to the data memory bus and the program memory bus
AGU-1 and AGU-2: immediate value or standard increment/decrement value; index register file (2-4-8 registers); MUX; adder; pointer register file; output to the address bus; start-of-circular-buffer pointer registers (2-4 registers); end-of-circular-buffer pointer registers (2-4 registers); comparator; DEMUX
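The circular-buffer registers and the comparator in the address generation unit suggest modulo addressing. A minimal sketch of how such an address generator could behave is given below; the class name `CircularAGU`, the inclusive end address, post-increment semantics and the positive-only step are all assumptions, not details taken from the slides.

```python
class CircularAGU:
    """Toy model of post-increment addressing inside a circular buffer."""

    def __init__(self, start, end):
        if end < start:
            raise ValueError("end must not be below start")
        self.start = start      # start-of-circular-buffer pointer register
        self.end = end          # end-of-circular-buffer pointer (inclusive, assumed)
        self.pointer = start    # current pointer register

    def next_address(self, step=1):
        """Return the current address, then advance the pointer by `step`."""
        length = self.end - self.start + 1
        if not 0 < step <= length:
            raise ValueError("step must be between 1 and the buffer length")
        address = self.pointer
        advanced = self.pointer + step
        if advanced > self.end:             # comparator detects the overrun
            advanced -= length              # wrap back into the buffer
        self.pointer = advanced
        return address


agu = CircularAGU(0x100, 0x103)
addresses = [agu.next_address() for _ in range(6)]
```

With this kind of addressing a filter delay line can be walked indefinitely without any explicit bounds check in the inner loop.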
Slide 13
Description of CIL core
[Block diagram of the CIL core: program memory (PM), instruction fetch unit, DSP instruction decoder, execution unit, register file, CIL instruction buffer, CIL decoder, CIL instructions pipeline, type generation and check unit, type registers, metainformation cache memory, typed stack, data memory (DM). One path is marked stage 1, stage 2, stage 3; the other is marked stage 1, stage 2 and other stages, last stage.]
In CIL execution mode, the programmer sees an exact implementation of the ECMA-335 standard CIL engine
Slide 14
Speed up features of CIL core
Metainformation cache
[Block diagram labels: information access index; RAM fetch, using index LSBs as address in RAM; index-key comparator; Yes/No output to a MUX; inputs from other cache lines; selected block; table in main RAM]
•Constant table
•String table
•Method table
•Class field table
•Type table
•Smart array table
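The lookup path described on this slide (index LSBs select a cache line, a comparator checks the stored key, a miss falls back to the table in main RAM) can be sketched in software as follows. The class `MetaCache`, the direct-mapped organization with a single way, and the refill-on-miss policy are assumptions for illustration; the slide mentions other cache lines feeding the MUX, so the real design may well differ.

```python
class MetaCache:
    """Toy direct-mapped cache in front of a metainformation table."""

    def __init__(self, table, lines=8):
        if lines <= 0 or lines & (lines - 1):
            raise ValueError("lines must be a power of two")
        self.table = table              # stands in for the table in main RAM
        self.mask = lines - 1           # selects the index LSBs
        self.keys = [None] * lines      # stored index keys
        self.values = [None] * lines    # cached entries
        self.hits = 0
        self.misses = 0

    def lookup(self, index):
        slot = index & self.mask            # index LSBs address the cache RAM
        if self.keys[slot] == index:        # index-key comparator: "Yes"
            self.hits += 1
            return self.values[slot]
        self.misses += 1                    # comparator: "No", go to main RAM
        value = self.table[index]
        self.keys[slot] = index             # refill the cache line
        self.values[slot] = value
        return value


method_table = {i: f"method_{i}" for i in range(32)}
cache = MetaCache(method_table, lines=8)
first = cache.lookup(3)     # miss, filled from the table
second = cache.lookup(3)    # hit
cache.lookup(11)            # same slot as 3 (11 & 7 == 3), evicts it
cache.lookup(3)             # miss again
```

The same structure could front any of the tables listed above, with one cache instance per table or a table identifier folded into the key.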
Slide 15
Speed up features of CIL core
Hardware typed stack
[Block diagram labels: stack memory with a memory cell tag; stack pointer; register file with type tags; metainformation cache table; instruction; operand type checking unit; exception; type setting unit; new type; immediate type tag]
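A tagged stack with operand type checking might behave like the sketch below. The names `TypedStack` and `TypeCheckError`, the tag strings, and the rule that both operands of `add` must carry the same numeric tag are simplifying assumptions; ECMA-335 permits some mixed operand combinations that are not modeled here.

```python
class TypeCheckError(Exception):
    """Stands in for the hardware exception raised by the type checking unit."""


NUMERIC_TAGS = {"int32", "int64", "F"}   # simplified tag set (assumption)


class TypedStack:
    def __init__(self):
        self.cells = []                  # each cell is (value, type tag)

    def push(self, value, tag):
        self.cells.append((value, tag))  # type setting unit writes the tag

    def pop(self):
        return self.cells.pop()

    def add(self):
        """Check operand tags first, then replace both operands by the sum."""
        if len(self.cells) < 2:
            raise TypeCheckError("stack underflow")
        (left, left_tag), (right, right_tag) = self.cells[-2], self.cells[-1]
        if left_tag != right_tag or left_tag not in NUMERIC_TAGS:
            raise TypeCheckError(f"cannot add {left_tag} and {right_tag}")
        del self.cells[-2:]
        self.push(left + right, left_tag)   # result inherits the operand tag


stack = TypedStack()
stack.push(2, "int32")
stack.push(3, "int32")
stack.add()
total = stack.pop()
```

Because the check runs before any cell is modified, a failed check leaves the stack intact for the exception handler.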
Slide 16
Garbage collector
Automatic memory management
• Division of objects into "big" and "small"
• A generational garbage collector with two generations for "small" objects
• A separate area of memory for "big" objects
[Diagram: generation 1, generation 0, large heap]
A special coprocessor based on a reduced DSP kernel may be used to process garbage collector tasks
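The heap layout above can be sketched as a toy allocator. Everything specific here is an assumption: the `Heap` class, the 1024-byte threshold separating "big" from "small" objects, and the shortcut of passing the set of live object names instead of tracing references from roots.

```python
LARGE_THRESHOLD = 1024   # hypothetical size limit in bytes for "small" objects


class Heap:
    def __init__(self):
        self.gen0 = []    # newly allocated small objects
        self.gen1 = []    # small objects that survived a collection
        self.large = []   # separate area for big objects

    def allocate(self, name, size):
        obj = {"name": name, "size": size}
        if size >= LARGE_THRESHOLD:
            self.large.append(obj)
        else:
            self.gen0.append(obj)
        return obj

    def collect_gen0(self, live_names):
        """Promote live generation-0 objects to generation 1, drop the rest.

        `live_names` replaces real reachability tracing in this sketch.
        Returns the number of objects freed.
        """
        survivors = [o for o in self.gen0 if o["name"] in live_names]
        freed = len(self.gen0) - len(survivors)
        self.gen1.extend(survivors)
        self.gen0 = []
        return freed


heap = Heap()
heap.allocate("a", 16)
heap.allocate("b", 32)
heap.allocate("frame_buffer", 4096)
freed = heap.collect_gen0({"a"})
```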
Slide 17
Example of DSP workload
Our CIL processor is an excellent target for multimedia applications
Slide 18
Development board
• Virtex-4 FPGA chip
• 64 MBytes DDR SDRAM
• 100 MHz clock oscillator
• Expansion bus up to 32 I/O lines
• Stereo AC97 audio codec
• RS-232 serial port
• LCD display for debugging messages
• VGA output (50 MHz 24-bit video DAC)
• PS/2 mouse and PS/2 keyboard connectors
• System ACE™ configuration controller, access to external flash cards
• 10/100/1000 Mbit Ethernet transceiver for networking
• USB interface chip
• Xilinx XC95144XL CPLD for FPGA configuration
• Xilinx XCF32P Platform Flash configuration
• JTAG configuration port for design loading or remote debugging from PC
Only 495 USD
Slide 19
Development board
Testing process for processor cores
The C++ model is a full-scale counterpart of the Verilog HDL model
The C++ model serves as the reference model
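One common way to use a reference model is to compare its execution trace with the trace produced by the HDL simulation. The slides do not describe the team's actual procedure, so the sketch below is only an assumed workflow; the function `first_mismatch` and the trace format (one comparable record per step) are hypothetical.

```python
def first_mismatch(reference_trace, dut_trace):
    """Return the first step at which two traces differ, or None if equal.

    Each trace is a sequence of per-step records (for example register
    snapshots); a shorter trace counts as a mismatch at its end.
    """
    for step, (expected, actual) in enumerate(zip(reference_trace, dut_trace)):
        if expected != actual:
            return step
    if len(reference_trace) != len(dut_trace):
        return min(len(reference_trace), len(dut_trace))
    return None


reference = [("A0", 2), ("A0", 5), ("A0", 20)]
simulated = [("A0", 2), ("A0", 6), ("A0", 20)]
bad_step = first_mismatch(reference, simulated)
```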
Slide 20
Implementation results
The ALU consumes most of the FPGA resources
The DSP core uses only a small part of the Virtex-4 LX25, and the CIL processor implementation takes only up to 5500 cells (~35%) of our Virtex-4 FPGA (without optimizations)

Spartan-3
Unit | Slices | Slice Flip-Flops | 4-input LUTs | Maximum frequency, MHz
AGU-1 | 331 | 220 | 548 | N/A
AGU-2 | 385 | 320 | 543 | N/A
ALU | 4368 | 587 | 7917 | N/A
Decoder | 1227 | 60 | 2139 | N/A
DSP | 5365 | 628 | 9508 | 46.9

Virtex-4
Unit | Slices | Slice Flip-Flops | 4-input LUTs | Maximum frequency, MHz
AGU-1 | 300 | 200 | 560 | 228
AGU-2 | 300 | 200 | 560 | 228
ALU | 4216 | 593 | 8056 | 55.4
Decoder | 1319 | 40 | 2303 | 971
DSP | 4981 | 628 | 9191 | 77.8
Slide 21
Implementation results
[Figures: main ALU unit structure; Bit Manipulation Unit (a part of the ALU unit); whole DSP kernel]
Slide 22
Implementation results
Moderate detail-level structure of the implemented CIL processor
[Block diagram. Units shown:]
.NET instruction decoder
Exception decoder
DSP DAU signal mapper
Stack control decoder
DSP AGU signal mapper
Type operation decoder
Prefetch operation decoder
Type check decoder
Pipeline starter
DSP core DAU units (accumulator registers, ALU, adder unit and BMU)
X-bus FPGA internal memory
Y-bus FPGA internal memory
External 16-bit video memory
1-stage DSP decoder
X-memory prefetch unit
.NET complex instruction pipeline
Pipeline table memory ROM
Pipeline automaton
Pipeline signal mapper
Interrupt controller
Exception mapper
Firmware: exception & interrupt handlers
Meta-information cache memory
CIL prefetch X-memory unit
Metainformation cache access controller
Y-memory .NET RAM unit
Fast internal stack memory (X-Bus)
Fast internal stack tag memory (X-Bus)
Stack block transfer controller and address generator
DSP core X-bus addressing units (including XAU registers)
DSP core Y-bus addressing units (including YAU registers)
Type checking unit
Type setting unit
Meta-information cache blocks
Meta-information cache to meta-information memory transfer controller
Meta-information exception generator
CIL pre-decoder unit
Slide 23
Software support
• Exception microcode: complex CIL instruction implementation in DSP code
• Class library may be ported from PC
• Supporting system libraries: I/O, memory management
• Multimedia libraries for the DSP core
• User applications
• Just-in-time compiler for CIL code, if necessary
• Compiler: we are using a retargeted GCC version
• Assembler / disassembler: retargetable utilities used with the compiler; they are specially tuned for the CIL core
• Linker
• Hardware and software codesign suite (compiler, assembler, disassembler, Verilog instruction decoder generator)
Slide 24
Conclusion & comparison
Comparison with an ARM-based software .NET engine for embedded systems (www.dotnetcpu.com)

Hardware-based CIL-machine | ARM-based .NET execution engine
80-100 MHz FPGA implementation | 27 MHz
1-2 CIL operations per cycle (40-50 million CIL operations per second); hardware execution for basic CIL operations; hardware-assisted stack implementation | 450,000 CIL operations per second; interpreted CIL operations execution
50x faster than interpreted execution | 50x slower than hardware execution of basic operations
hardware type control | software type control
garbage collector may be implemented as a hardware coprocessor or "intelligent" memory | software garbage collector
hardware meta-information cache | software meta-information processing
DSP core with two memory spaces | ARM core
2 multiply-accumulate instructions and 2 ALU operations per cycle = up to 4 instructions per cycle | 1 ALU operation per cycle
DSP core power consumption is 3-4x less than the ARM core | ARM core power consumption is 3-4x more than the DSP core
Slide 25
Conclusion & comparison
1. The CIL processor is not only a software concept: it may be successfully implemented in hardware
2. Our dual architecture, a CIL processor based on a DSP core, enables multimedia applications with low power consumption, so the CIL processor may be successfully used for the digital home and digital entertainment
3. CIL typed engines are implemented in hardware, which greatly reduces the overhead of run-time type checking
4. The hardware CIL implementation greatly outperforms non-optimized software implementations (in performance and power consumption)
Slide 26
Project participants
Slide 27
Acknowledgments
We express our gratitude to:
Microsoft Corporation for the grant, which allowed us to bring people from different faculties of Nizhniy Novgorod State University together into one team and develop our hardware solution
The Laboratory of Physical Fundamentals and Technologies of Wireless Communications, Nizhniy Novgorod State University, which is supported by Intel Corporation, for help during our research activities
Special thanks to Aliaksei Chapyzhenka, Intel Corp., for spending his time advising us on hardware architectures
Slide 28