Hardware-based CIL-machine

Nizhniy Novgorod State University, Russia

Laboratory of Physical Fundamentals and Technologies of Wireless Communications

Reporter: Maxim Shuralev

[email protected]

Head of the project: Dr. Alexey Umnov

[email protected]


Slide 2

Hardware CIL processor project team

Hardware: Maxim Shuralev, Maxim Sokolov, Dmitry Mordvinov (NNSU, Wireless Lab)

Software, workloads and tools: Andrey Eltsov (NNSU, Wireless Lab), Roman Mitin, Sergey Lyalin, Sergey Galkin, Ilia Golubev (NNSU, IT Lab)

Support: Dmitry Golovachev, Svetlana Surova, Elena Pankratova (NNSU, Wireless Lab)

Consultants: Aliaksei Chapyzhenka (Intel), Dmitry Ragozin (Intel), Sergey Chernyshov (Nizhniy Novgorod State Technology University)

Head of Wireless Lab: Alexey Umnov

Slide 3

Agenda

Introduction

Architecture of the CIL processor

Description of the DSP core

Description of the CIL core

Speed-up features of the CIL core:

• a metainformation cache

• a hardware stack

• a hardware type control engine

Garbage collector implementation

Example of DSP workload for the processor

Development board for processor implementation

HW Implementation results

Software support & libraries

Conclusion and comparison

Slide 4

Introduction

Port of the .NET engine to an energy-efficient, low-power mobile platform

Advantages and disadvantages of a stack-based CIL engine:

• the maximum execution speed cannot exceed one CIL instruction per clock

• a stack engine is the simplest way to execute machine code, because both instruction decoding and the processor structure are very simple

• limited ability to execute instructions in parallel

• low complexity and low power consumption
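
To make the stack-engine points above concrete, here is a minimal Python sketch of a stack-based interpreter. The opcode names follow CIL mnemonics (`ldc.i4`, `add`, `mul`), but the tuple encoding and the `run` function are illustrative assumptions, not the processor's actual decoder.

```python
def run(program):
    """Execute a list of (opcode, operand) tuples on an evaluation stack.

    Hypothetical encoding for illustration only; real CIL is a byte stream.
    """
    stack = []
    for opcode, operand in program:
        if opcode == "ldc.i4":        # push an integer constant
            stack.append(operand)
        elif opcode == "add":         # pop two values, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif opcode == "mul":         # pop two values, push their product
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return stack


# (2 + 3) * 4
program = [
    ("ldc.i4", 2),
    ("ldc.i4", 3),
    ("add", None),
    ("ldc.i4", 4),
    ("mul", None),
]
result = run(program)
```

Every instruction consumes the stack state left by the previous one. That is why decoding stays simple while parallel execution is hard to extract.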

Slide 5

Introduction

Application target and target market

.NET is intended for various Web-oriented services, distributed business databases, online transactions, CRM system support, etc.

The CIL processor is not supposed to compete with desktop processors and PDAs on performance, but it is great for the mobile market and the digital home!

The target is end-user specialized and oriented toward:

MOBILE DEVICES, Web-terminals, Web-browsers, interactive TV, HOUSE CONTROL SYSTEMS

Slide 6

Introduction

Requirements for the CIL processor

• Execute the .NET (CIL) code directly

.NET is native code

• Consume little power from the power supply

Mobile low-power devices

• Handle DSP tasks effectively

New generation of interactive multimedia mobile devices

Slide 7

Architecture

High-level structure of the CIL processor implementation

Programmer's model (layers shown in the diagram):

Hardware CIL processor: DSP kernel, CIL instruction decoder, CIL metainformation support

DSP library: codecs, protocols, software-defined radio, modems, multimedia processing libraries

Standard CIL class libraries, custom CIL class libraries, other performance libraries

CIL application

Slide 8

Architecture

High-level hardware structure of the CIL processor

Hardware structure (block diagram labels):

Main data bus (X)

Main arithmetic unit

Secondary data bus (Y)

X data bus address generation unit

X-space address bus

Y-space address bus

Y data bus address generation unit

Instruction and data cache unit, system control unit

CIL meta-information caches

CIL instruction decoder

Native DSP instruction set decoder

Slide 9

Architecture

Why DSP-based?

Is it a waste of development time or a necessity for the digital home?

As the CIL processor is an excellent solution for the digital home:

Pro:

• We have a firmware layer for executing very complex CIL instructions

• 5-10x higher performance in multimedia applications

Contra:

• Increased development time

• Only the “standard” CIL set needs to be implemented, not DSP

Slide 10

Architecture

Why DSP-based?

Hardware implementation

Pro:

• Effective, low-power computational kernel

• Good “CIL instruction -> DSP instruction” mapping

• Low power consumption in multimedia tasks

• Technology similar to the existing and efficient ARM/Java Jazelle

Contra:

• Only serial instruction execution (since the CIL instruction set is stack-based and we do not want to use superscalar techniques)

Slide 11

Architecture

Why DSP-based?

• “2-in-1”: two native instruction sets on board

• Complex CIL instructions (e.g. type hierarchy checks and safety checks) are simply implemented in firmware as DSP instructions

• 5x-10x speed improvement for DSP workloads

• Low overhead in terms of extra on-chip transistors

Slide 12

Description of DSP core units

[Figure: DSP core block diagram. Data path (ALU): data memory and program memory vector input registers, shifters, cross-bar switch unit, two 16*16 multipliers with temporary product registers 1 and 2, ALU, adder, special functional unit, MUXes, saturation units, and the accumulator register file with stack control (A0 is the top of stack in CIL mode, A1 is the under-top-of-stack in CIL, ..., An), connected to the data memory bus and the program memory bus. Address generation units (AGU-1, AGU-2): immediate value or standard increment/decrement value, index register file (2-4-8 registers), MUX, adder, pointer register file, output to the address bus; circular addressing uses start-of-circular-buffer and end-of-circular-buffer pointer registers (2-4 registers each), MUX/DEMUX and a comparator.]
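
The circular-buffer registers and comparator named in the address generation units can be illustrated with a small Python model. This is a sketch under assumptions: the class name `CircularAGU`, the inclusive end pointer, and the post-modify behaviour are illustrative choices, not details taken from the slides.

```python
class CircularAGU:
    """Toy model of one address generation unit with circular addressing."""

    def __init__(self, start, end):
        self.start = start    # start-of-circular-buffer pointer register
        self.end = end        # end-of-circular-buffer pointer register (assumed inclusive)
        self.pointer = start  # pointer register that drives the address bus

    def next_address(self, step=1):
        """Return the current address, then post-modify the pointer by step."""
        address = self.pointer
        size = self.end - self.start + 1
        # Comparator and wrap-around: keep the pointer inside [start, end].
        self.pointer = self.start + (self.pointer - self.start + step) % size
        return address


agu = CircularAGU(start=0x100, end=0x103)
addresses = [agu.next_address() for _ in range(6)]
```

With a four-word buffer the sequence wraps back to the start address after the fourth access. This is the access pattern that FIR filters and similar DSP kernels rely on.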

Slide 13

Description of CIL core

[Figure: CIL core block diagram. Blocks: program memory (PM), instruction fetch unit, DSP instruction decoder, execution unit, register file, CIL instruction buffer, CIL decoder, CIL instructions pipeline, type generation and check unit, type registers, metainformation cache memory, typed stack, data memory (DM). Two rows of pipeline stage labels: stage 1, stage 2, stage 3; and stage 1, stage 2 and other stages, last stage.]

In CIL execution mode, the programmer sees an exact implementation of the ECMA-335 standard CIL engine

Slide 14

Speed up features of CIL core

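Metainformation cache

[Figure: metainformation cache lookup. An information access index is used for a RAM fetch, with the index LSBs as the address in RAM; an index-key comparator produces a Yes/No signal; a MUX selects among the selected block, the outputs from other cache lines, and the table in main RAM.]

Cached tables: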

•Constant table

•String table

•Method table

•Class field table

•Type table

•Smart array table
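
The lookup path described for the metainformation cache (index LSBs as RAM address, then an index-key comparison) can be sketched in Python. This models a direct-mapped cache as a simplifying assumption; the class `MetaCache`, the `line_bits` parameter and the sample method table are hypothetical.

```python
class MetaCache:
    """Direct-mapped cache in front of a metainformation table in main RAM."""

    def __init__(self, table, line_bits=4):
        self.table = table                       # stands in for the table in main RAM
        self.mask = (1 << line_bits) - 1         # selects the index LSBs
        self.lines = [None] * (1 << line_bits)   # each line holds (key, value)
        self.hits = 0
        self.misses = 0

    def lookup(self, index):
        slot = index & self.mask                 # index LSBs address the cache RAM
        line = self.lines[slot]
        if line is not None and line[0] == index:   # index-key comparator says "Yes"
            self.hits += 1
            return line[1]
        self.misses += 1
        value = self.table[index]                # slow path: fetch from main RAM
        self.lines[slot] = (index, value)        # fill the cache line
        return value


method_table = {i: f"method_{i}" for i in range(64)}
cache = MetaCache(method_table, line_bits=4)
first = cache.lookup(5)     # miss, filled from the table
second = cache.lookup(5)    # hit
third = cache.lookup(21)    # 21 & 0xF == 5, so this replaces the entry for index 5
```

Storing the full index as the key lets the comparator detect when two table entries collide on the same cache line.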

Slide 15

Speed up features of CIL core

Hardware typed stack

[Figure: hardware typed stack. Blocks: stack memory with a memory cell tag per cell, stack pointer, register file with type tags, metainformation cache table, operand type checking unit, type setting unit. Signals: instruction, exception, new type, immediate type tag.]
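
A software analogy for the typed stack: every cell carries a type tag, operands are checked before an operation executes, and a mismatch raises an exception. This is a simplified sketch. The tag names and the "tags must match exactly" rule are assumptions, and the ECMA-335 rules for mixed operand types are richer than this.

```python
class TypeCheckError(Exception):
    """Stands in for the hardware exception raised on a type mismatch."""


class TypedStack:
    def __init__(self):
        self.cells = []   # each cell is (type_tag, value)

    def push(self, tag, value):
        self.cells.append((tag, value))   # type setting: store the tag with the value

    def pop(self):
        return self.cells.pop()

    def add(self):
        """Add the two top cells after checking that their type tags agree."""
        tag_b, b = self.pop()
        tag_a, a = self.pop()
        if tag_a != tag_b:                # operand type checking
            raise TypeCheckError(f"add: {tag_a} vs {tag_b}")
        self.push(tag_a, a + b)           # the result inherits the operand tag


stack = TypedStack()
stack.push("int32", 2)
stack.push("int32", 3)
stack.add()
top = stack.cells[-1]

bad = TypedStack()
bad.push("int32", 1)
bad.push("float64", 1.5)
try:
    bad.add()
    mismatch_detected = False
except TypeCheckError:
    mismatch_detected = True
```

In hardware the tag comparison can run alongside the arithmetic, which is where the saving over a software type check would come from.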

Slide 16

Garbage collector

Automatic memory management

• Division of objects into “big” and “small”

• A generational garbage collector with two generations for “small” objects

• A separate area of memory for “big” objects

[Figure: heap layout with generation 1, generation 0 and the large heap]

A special coprocessor based on a reduced DSP kernel may be used for processing garbage collector tasks
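
The heap organization above can be sketched as an allocation policy in Python. The size threshold, the class `Heap` and the rule that survivors of a generation 0 collection are promoted to generation 1 are assumptions based on common generational designs; the slides do not state them.

```python
LARGE_OBJECT_THRESHOLD = 1024   # bytes; hypothetical cutoff between "small" and "big"


class Heap:
    def __init__(self):
        self.gen0 = []         # young "small" objects
        self.gen1 = []         # "small" objects that survived a collection
        self.large_heap = []   # "big" objects, kept in a separate area

    def allocate(self, name, size):
        """Place big objects in the large heap and small ones in generation 0."""
        target = self.large_heap if size >= LARGE_OBJECT_THRESHOLD else self.gen0
        target.append(name)

    def collect_gen0(self, live):
        """Minor collection: promote live generation 0 objects, drop the rest."""
        self.gen1.extend(obj for obj in self.gen0 if obj in live)
        self.gen0 = []


heap = Heap()
heap.allocate("point", 16)
heap.allocate("temp_string", 40)
heap.allocate("frame_buffer", 64 * 1024)
heap.collect_gen0(live={"point"})
```

Keeping big objects out of the generations avoids copying them during minor collections.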

Slide 17

Example of DSP workload

Our CIL processor is an excellent target for multimedia applications

Slide 18

Development board

• Virtex-4 FPGA chip

• 64 MBytes DDR SDRAM

• 100 MHz clock oscillator

• Expansion bus, up to 32 I/O lines

• Stereo AC97 audio codec

• RS-232 serial port

• LCD display for debugging messages

• VGA output (50 MHz 24-bit video DAC)

• PS/2 mouse and PS/2 keyboard connectors

• System ACE™ configuration controller, access to external flash cards

• 10/100/1000 Mbit Ethernet transceiver for networking

• USB interface chip

• Xilinx XC95144XL CPLD for FPGA configuration

• Xilinx XCF32P Platform Flash configuration

• JTAG configuration port for design loading or remote debugging from PC

Only 495 USD

Slide 19

Development board

Testing process for processor cores

The C++ model is a full-scale counterpart of the Verilog HDL model

The C++ model serves as the reference model

Slide 20

Implementation results

The ALU consumes most of the FPGA resources

The DSP core uses only a small part of the Virtex-4 LX25, and the CIL processor implementation takes only up to 5500 cells (~35 %) of our Virtex-4 FPGA (without optimizations)

Device    Spartan-3                                                Virtex-4
Unit      Slices  Slice Flip-Flops  4-input LUTs  Max. freq., MHz  Slices  Slice Flip-Flops  4-input LUTs  Max. freq., MHz
AGU-1        331               220           548              N/A     300               200           560              228
AGU-2        385               320           543              N/A     300               200           560              228
ALU         4368               587          7917              N/A    4216               593          8056             55.4
Decoder     1227                60          2139              N/A    1319                40          2303              971
DSP         5365               628          9508             46.9    4981               628          9191             77.8

Slide 21


[Figures: main ALU unit structure; Bit Manipulation Unit (a part of the ALU unit); whole DSP kernel]

Slide 22

Moderate detail-level structure of the implemented CIL processor

[Figure: block diagram. Block labels: .NET instruction decoder; exception decoder; DSP DAU signal mapper; stack control decoder; DSP AGU signal mapper; type operation decoder; prefetch operation decoder; type check decoder; pipeline starter; DSP core DAU units (accumulator registers, ALU, adder unit and BMU); X-bus FPGA internal memory; Y-bus FPGA internal memory; external 16-bit video memory; 1-stage DSP decoder; X-memory prefetch unit; .NET complex instruction pipeline; pipeline table memory ROM; pipeline automaton; pipeline signal mapper; interrupt controller; exception mapper; firmware: exception & interrupt handlers; meta-information cache memory; CIL prefetch X-memory unit; metainformation cache access controller; Y-memory .NET RAM unit; fast internal stack memory (X-bus); fast internal stack tag memory (X-bus); stack block transfer controller and address generator; DSP core X-bus addressing units (including XAU registers); DSP core Y-bus addressing units (including YAU registers); type checking unit; type setting unit; meta-information cache blocks; meta-information cache to meta-information memory transfer controller; meta-information exception generator; CIL pre-decoder unit.]

Slide 23

Software support

• Exception microcode – complex CIL instruction implementation in DSP code

• Class library – may be ported from PC

• Supporting system libraries – I/O, memory management

• Multimedia libraries – for the DSP core

• User applications

• Just-in-time compiler for CIL code, if necessary

• Compiler – we are using a retargeted GCC version

• Assembler / disassembler – retargetable utilities used with the compiler; they are specially tuned for the CIL core

• Linker

• Hardware and software codesign suite (compiler, assembler, disassembler, Verilog instruction decoder generator)

Slide 24

Conclusion & comparison

Comparison with an ARM-based software .NET engine for embedded systems (www.dotnetcpu.com)

Hardware-based CIL-machine | ARM-based .NET execution engine

80-100 MHz FPGA implementation | 27 MHz

1-2 CIL operations per cycle (40-50 million CIL operations per second); hardware execution of basic CIL operations; hardware-assisted stack implementation | 450,000 CIL operations per second; interpreted execution of CIL operations

50x faster than interpreted execution | 50x slower than hardware execution of basic operations

Hardware type control | Software type control

Garbage collector may be implemented as a hardware coprocessor or “intelligent” memory | Software garbage collector

Hardware meta-information cache | Software meta-information processing

DSP core with two memory spaces | ARM core

2 multiply-accumulate instructions and 2 ALU operations per cycle = up to 4 instructions per cycle | 1 ALU operation per cycle

DSP core power consumption is 3-4x lower than that of the ARM core | ARM core power consumption is 3-4x higher than that of the DSP core

Slide 25

Conclusion & comparison

1. A CIL processor is not only a software concept – it can be successfully implemented in hardware

2. Our dual architecture, a CIL processor based on a DSP core, enables multimedia applications with low power consumption, so the CIL processor may be successfully used for the digital home and digital entertainment

3. CIL typed engines are implemented in hardware, which greatly reduces the overhead of run-time type checking

4. The hardware CIL implementation greatly outperforms non-optimized software implementations (in both performance and power consumption)

Slide 26

Project participants

Slide 27

We express our gratitude to:

Microsoft Corporation, for the grant that allowed us to bring people from different faculties of Nizhny Novgorod State University together into one team and develop our hardware solution

The Laboratory of Physical Foundations and Technologies of Wireless Communications, Nizhny Novgorod State University, which is supported by Intel Corporation, for help during our research activities

Special thanks to Aliaksei Chapyzhenka, Intel Corp., for spending his time advising us on hardware architectures

