
© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:1

Embedded Processors and CPU Cores

EE8205: Embedded Computer Systems http://www.ee.ryerson.ca/~courses/ee8205/

Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan

Electrical and Computer Engineering Ryerson University

Overview
• Basic CPU Operation
• Embedded CPUs
• Nios-II CPU Organization
• Nios-II Cores
• MIPS32 4K CPU Core

Text by M. Wolf: part of Chapters/Sections 2.1, 2.2, 2.3 and 3.1-3.5


Processor Architecture

von Neumann Architecture

• Memory holds data and instructions.
• Central processing unit (CPU) fetches instructions from memory.
• Separation of CPU and memory distinguishes a programmable computer.
• CPU registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc.

[Figure: CPU and memory connected by address and data lines. The PC supplies address 200 to memory, and the IR holds the instruction ADD r5, r1, r3.]


Harvard Architecture

• Harvard architecture cannot use self-modifying code.
• Harvard allows two simultaneous memory fetches.
• Most DSPs use Harvard architecture for streaming data:
  - greater memory bandwidth
  - more predictable bandwidth

[Figure: CPU with PC connected to a separate data memory and program memory, each through its own address and data lines.]


Instruction Execution Process

• Instruction Fetch: Reads the next instruction into the instruction register (IR). The instruction address is in the program counter (PC).
• Instruction Interpretation: Decodes the op-code, gets the required operands and routes them to the ALU.
• Sequencing: Determines the address of the next instruction and loads it into the PC.
• Execution: Generates control signals of the ALU for execution.
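The four steps above can be sketched as a small simulation. This is only an illustrative model: the tuple-based instruction format, the tiny op-code set (LOAD, ADD, JUMP, HALT) and the memory contents are assumptions made for the example, not part of any real instruction set.

```python
# Toy von Neumann machine: one memory holds both instructions and data.
# The instruction format and op-code set are hypothetical.

def run(memory, registers, pc=0):
    """Run until HALT and return the address of the HALT instruction."""
    while True:
        ir = memory[pc]                # fetch: instruction at PC goes into IR
        op, *operands = ir             # interpretation: decode op-code, get operands
        next_pc = pc + 1               # sequencing: default is the next address
        if op == "ADD":                # execution: ALU adds two source registers
            dst, src1, src2 = operands
            registers[dst] = registers[src1] + registers[src2]
        elif op == "LOAD":             # execution: copy a memory word into a register
            dst, addr = operands
            registers[dst] = memory[addr]
        elif op == "JUMP":             # sequencing: branch target replaces PC + 1
            (next_pc,) = operands
        elif op == "HALT":
            return pc
        else:
            raise ValueError(f"unknown op-code {op!r}")
        pc = next_pc                   # load the next address into the PC


memory = {
    200: ("LOAD", "r1", 300),
    201: ("LOAD", "r3", 301),
    202: ("ADD", "r5", "r1", "r3"),
    203: ("HALT",),
    300: 7,                            # data words share the same memory
    301: 5,
}
registers = {"r1": 0, "r3": 0, "r5": 0}
final_pc = run(memory, registers, pc=200)
print(registers["r5"], final_pc)
```

Because instructions and data live in the same dictionary, the sketch also shows the von Neumann property that a single memory serves both.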


System Organization

Memory and I/O having Separate Bus

Memory Bus

I/O Bus


Memory Mapped Peripherals


Single Accumulator Architecture


Cache Memory

• Cache: Expensive but very fast memory directly connected to the CPU, interacting with slower but much larger main memory.

• Invisible to the OS and user programs, but interacts with the memory management hardware.

• Processor first checks if the addressed word is in the cache.
• If the word is not found in the cache, a block of memory containing the word is moved to the cache.
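The check-then-fill behaviour can be sketched as follows. The slides do not say how blocks are placed in the cache, so this example assumes a direct-mapped organization; the block size, the number of lines and the class name are all illustrative choices.

```python
# Direct-mapped cache model (an assumed organization, for illustration only).
BLOCK_WORDS = 4      # words per block (assumption)
NUM_LINES = 8        # number of cache lines (assumption)


class Cache:
    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.lines = {}              # line index -> (tag, list of words)
        self.hits = 0
        self.misses = 0

    def read(self, address):
        block_number = address // BLOCK_WORDS
        offset = address % BLOCK_WORDS
        index = block_number % NUM_LINES
        tag = block_number // NUM_LINES
        line = self.lines.get(index)
        if line is not None and line[0] == tag:
            self.hits += 1           # addressed word is already in the cache
        else:
            self.misses += 1         # move the whole containing block into the cache
            start = block_number * BLOCK_WORDS
            line = (tag, self.main_memory[start:start + BLOCK_WORDS])
            self.lines[index] = line
        return line[1][offset]


main_memory = list(range(1000, 1256))          # 256 words of main memory
cache = Cache(main_memory)
values = [cache.read(a) for a in (16, 17, 18, 19, 16, 48)]
print(values, cache.hits, cache.misses)
```

The first read of address 16 misses and brings in words 16 to 19, so the next four reads hit. Address 48 maps to the same line with a different tag, so it misses and replaces that block.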

[Figure: CPU, cache controller, cache and main memory, linked by address and data lines.]


Separate Data and Instruction Caches

[Figure: the processor and its registers send instruction addresses to an instruction cache and data addresses to a data cache. The caches hold copies of instructions and copies of data from a memory spanning addresses 00..00 to FF..FF (base 16).]


A Unified Instruction and Data Cache

[Figure: the processor and its registers send one address stream to a single cache that holds copies of both instructions and data from a memory spanning addresses 00..00 to FF..FF (base 16).]


CPU Pipelining

Improve performance by increasing instruction throughput. Ideal speedup is equal to number of stages in the pipeline.

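[Figure: three-stage pipeline (Fetch, Decode, Execute) plotted against time. The 1st instruction (address PC), 2nd instruction (address PC+1) and 3rd instruction (address PC+2) occupy successive stages in overlapping time slots.]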
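The ideal-speedup claim can be checked with a short calculation. The sketch assumes that every stage takes exactly one clock cycle and that no hazards stall the pipeline; the function names are made up for this example.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction passes through every stage before the next one starts.
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    # The first instruction needs n_stages cycles to fill the pipeline;
    # after that, one instruction completes per cycle.
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    return (unpipelined_cycles(n_instructions, n_stages)
            / pipelined_cycles(n_instructions, n_stages))


print(pipelined_cycles(3, 3))   # the three overlapped instructions in the figure
print(speedup(3, 3))
print(speedup(1000, 3))         # approaches the number of stages for long runs
```

For three instructions the speedup is only 9/5, because filling the pipeline dominates. As the instruction count grows, the ratio approaches the number of stages, which is the ideal speedup.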


CPU Pipelining

What makes pipelining easy?
• All instructions are the same length.
• Few instruction formats.
• Memory operands appear only in loads and stores.

What makes pipelining hard?
• Structural hazards: a single memory
• Control hazards: branch instructions
• Data hazards: an instruction depends on a previous instruction


CPU Power Consumption

• Most modern CPUs are designed with power consumption in mind to some degree.

• Power vs. energy:
  - Heat depends on power consumption.
  - Battery life depends on energy consumption.

Power Saving Strategies
• Reduce power supply voltage.
• Run at lower clock frequency.
• Disable the CPU functional units with control signals when they are not in use.
• Disconnect parts from power supply when not in use.


RISC Architecture

RISC: Reduced Instruction Set Computer
CISC: Complex Instruction Set Computer

RISC Features
• RISC instruction set attributes include a load/store model of execution. (CISC has a memory/register model.)
• A non-destructive triadic register file that provides a distinct and highly efficient data preservation model. (CISC has a destructive accumulator/register file.)
• Normalized fixed-length instructions. (CISC has variable-length contextual-field instructions.)

Load/Store Model of Execution
• The only instructions that can access main memory are load and store instructions.
• All the other instructions operate on internal register files.
• Increases concurrency, raising performance by de-coupling the loading and storing operations from data processing operations.


Non-Destructive Model

The non-destructive architecture (data-preserving model) is fundamental to minimizing load/store traffic.

For example, in an addition instruction:
RISC Model: AR <= BR + CR

A combined load/store and non-destructive register model provides a dramatic boost in RISC performance:
• It minimizes the load/store traffic from/to memory.
• It de-couples load/store and processing operations.
• It allows optimizing compilers to fill the stall slots.


Leading RISC Architectures

SPARC, MIPS, ARM, ARM Cortex, Nios-II, µBlaze

SPARC: Scalable Processor Architecture
• The first open RISC architecture
• Created by Sun Microsystems in response to the declining performance enhancement of CISC microprocessors
• RISC work is based on Patterson's research at UC Berkeley

MIPS32 4Kx based soft processor core.
• Supported by 5 semiconductor licensees: Cypress, Texas Instruments, LSI Logic, Fujitsu, and Bit
• A flexible, scalable RISC architecture. Current implementations are in CMOS, ECL and GaAs.

A complete architecture having:
• Integer unit, Floating point unit
• Memory Management and Cache Controller
• Multiprocessing


Processor Operation Modes

User mode
• A user program is running.
• Certain instructions are not allowed.
• Memory mapping (base and bound) is enabled.

Supervisor mode
• The operating system is running.
• All instructions are allowed.
• Memory mapping (base and bound) is disabled.

A single PSW (processor status word) bit selects between the two modes. For instance:
PSW-bit = 1 for Supervisor mode
PSW-bit = 0 for User mode


Interrupts

A computer program has only two ways to determine the conditions that exist in internal and external circuits.
• One method uses software instructions that jump to a subroutine on some flag status.
• The second method responds to hardware signals called interrupts that force the program to call interrupt-handling subroutines.
• Interrupts take processor time only when action is required.
• The processor can respond to an external event much faster by using interrupts.

Programming microcomputers and microcontrollers using interrupts is called real-time programming. Interrupts are often the only way in which real-time programming can be done successfully.


Interrupt based I/O

Computers permit I/O modules to INTERRUPT a CPU during its normal operation.
• I/O module asserts an interrupt request on the control bus.
• CPU transfers control to an interrupt handler routine.
• Interrupt handler is generally part of the operating system.

Interrupts
• Allow the processor to execute other instructions while an I/O operation is in progress.
• An interrupt is the suspension of processing, caused by an event external to the processor, in such a way that the computation can be resumed.
• Improve processing efficiency.


Instruction Cycle with Interrupts

The CPU checks for interrupts at the end of each instruction and executes the interrupt handler if required.

The interrupt handler program identifies the nature/source of an interrupt and performs whatever actions are needed.
• It takes over control after the interrupt.
• Control is transferred back to the interrupted program, which resumes execution from the point of interruption.
• The point of interruption can occur anywhere in a program.
• The state of the program is saved. (PC + PSW + relevant registers + …)

[Figure: a user program is interrupted after instruction i; control passes to the interrupt handler and then returns to instruction i+1.]


Interrupt Processing

Hardware:
1. Device controller or other system h/w issues an interrupt.
2. Processor finishes execution of current instruction.
3. Processor signals acknowledgment of interrupt.
4. Processor pushes PSW and PC onto control stack.
5. Processor loads new PC value based on the interrupt.

Software:
6. Save remainder of process state information.
7. Process interrupt.
8. Restore process state information.
9. Restore old PSW, PC, etc.
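The sequence of finishing the current instruction, pushing PSW and PC, loading a new PC, running the handler and restoring state can be modelled in a few lines. This is a toy sketch: the Processor class, the vector table, the single register and the "timer" source are hypothetical names chosen for the example.

```python
class Processor:
    """Toy processor that checks for interrupts after each instruction."""

    def __init__(self, vector_table):
        self.pc = 0
        self.psw = 0
        self.registers = {"r1": 5}
        self.control_stack = []
        self.vector_table = vector_table   # interrupt source -> handler address
        self.pending = []                  # interrupt requests not yet serviced
        self.log = []

    def step(self):
        # The current instruction always finishes before an interrupt is taken.
        self.log.append(f"execute {self.pc}")
        self.pc += 1
        if self.pending:
            self.service(self.pending.pop(0))

    def service(self, source):
        # Hardware part: push PSW and PC, then load the new PC value.
        self.control_stack.append((self.psw, self.pc))
        self.pc = self.vector_table[source]
        # Software part: save remaining state, process, restore state.
        saved = dict(self.registers)
        self.registers["r1"] = 99          # the handler is free to use registers
        self.log.append(f"handle {source} at {self.pc}")
        self.registers = saved
        # Restore old PSW and PC so the program resumes where it left off.
        self.psw, self.pc = self.control_stack.pop()


cpu = Processor({"timer": 1000})
cpu.step()
cpu.pending.append("timer")                # a device raises an interrupt
cpu.step()
cpu.step()
print(cpu.log)
```

The log shows the handler running between instructions 1 and 2, and the interrupted program continuing with its registers unchanged.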


Multiple Interrupts (Sequential Order)
• Disable interrupts to complete the interrupting task at hand.
• Additional interrupts remain pending until interrupts are enabled. Then interrupts are considered in order.
• After completing the interrupt handler routine, the processor checks for additional interrupts.


Multiple Interrupts (Nested)

• Higher priority interrupts cause lower-priority interrupts to wait. • A lower-priority interrupt handler is interrupted.

For example, when input arrives from a communication line, it needs to be absorbed quickly to make room for additional inputs.

Interrupt the interrupting task!


Microcontrollers and Embedded CPUs

Microcontrollers are available in 4 to 32-bit word sizes. 8-bit or 16-bit micro-controllers are widely used.

Main/Popular Embedded CPUs and Microcontrollers

Model | Pins:I/O | RAM | ROM | Counter/Timer | Remarks
Intel 8051 | 40:32 | 128 | 4K | 2 | 128K External, Serial port
Motorola 6811 | 52:40 | 256 | 8K | 2 | A/D, Watch Dog timer, Serial Port
ARM7 CPUs, TMS470-ARM | Soft/HW | 8-64KB | 64KB-1MB | IP | 16/32-bit RISC, DMA, Watch Dog timer, CAN, I2C
Intel 80C196 | 68:40 | 232 | 8K | 2 | 16-bit, 64K External, Serial, A/D, WD
Intel 80960 (i960) | 132 | 512 Ins Cache | -- | 20MHz Clock | 32-bit bus, FPU, Interrupt control, No I/O on-chip
MIPS32 4K | Soft core | I/D Cache | -- | IPs | 32-bit RISC, EJTAG and On-chip bus
MC68360 QUICC | 240/241 | | | 4 or 16-bit | 32-bit, WD, Fault-tolerant, 4-Ethernet, 2 serial, Integrated Com Controller
Nios II | Soft core | On-chip RAM | | IPs | 32-bit RISC, Soft Core for Embedded CPUs on FPGA
ARM11, Cortex-MPCore | Soft/HW | Cache | | IPs | 64-bit Databus, Harvard and support multiple cores on a chip


NIOS-II: A Soft-Core CPU

An Embedded CPU for Altera FPGAs

• 32-bit instruction set, data path, and address space
• Harvard architecture
• 32 general-purpose registers
• 32 external interrupt sources
• Single-instruction 32 × 32 multiply & divide
• Dedicated instructions for computing 64-bit and 128-bit products of multiplication
• Floating-point instructions for single-precision operations
• Single-instruction barrel shifter
• Hardware-assisted debug module enabling processor start, stop, step and trace under integrated development environment (IDE) control
• Configurable high-performance CPU cores


Nios-II Soft Processor Core


Memory and I/O Organization

• Instruction master port: An Avalon master port that connects to instruction memory via Avalon switch fabric

• Instruction cache: Fast cache memory internal to the Nios II core

• Data master port: An Avalon master port that connects to data memory and peripherals via Avalon switch fabric

• Data cache: Fast cache memory internal to the Nios II core

• Tightly coupled instruction or data memory port: An interface to fast memory outside the Nios II core


Memory and I/O Organization

[Figure legend: M = Avalon bus master port, S = Avalon bus slave port]


Nios II Data Bus

Nios II, with its Harvard architecture, supports separate instruction and data buses. The data bus is implemented as a 32-bit Avalon master port and connects to both memory and peripheral components.

Data bus performs two functions:

• Read data from memory or a peripheral when the processor executes a load instruction.

• Write data to memory or a peripheral when the processor executes a store instruction.

Memory-mapped I/O access

Both data memory and peripherals are mapped into the address space of the data master port.
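A small model can make memory-mapped access concrete: the same load and store operations reach either memory or a peripheral, depending only on the address. The address map below (base addresses, sizes and component names) is entirely hypothetical, since the real map is chosen at system generation time.

```python
# Hypothetical address map: (base address, size in bytes, component name).
ADDRESS_MAP = [
    (0x0000_0000, 0x1000, "on-chip RAM"),
    (0x0000_1000, 0x0020, "UART"),
    (0x0000_1020, 0x0010, "timer"),
]


def decode(address):
    """Return (component name, offset) for an address on the data bus."""
    for base, size, name in ADDRESS_MAP:
        if base <= address < base + size:
            return name, address - base
    raise ValueError(f"no component mapped at {address:#x}")


class DataMaster:
    """Toy data master port: loads and stores are routed by address alone."""

    def __init__(self):
        self.storage = {name: {} for _, _, name in ADDRESS_MAP}

    def store(self, address, value):
        name, offset = decode(address)
        self.storage[name][offset] = value & 0xFFFF_FFFF   # 32-bit data bus

    def load(self, address):
        name, offset = decode(address)
        return self.storage[name].get(offset, 0)


bus = DataMaster()
bus.store(0x0000_0ffc, 1234)        # lands in memory
bus.store(0x0000_1004, 0x41)        # same operation, lands in a peripheral register
print(decode(0x0000_1004), bus.load(0x0000_1004), bus.load(0x0000_0ffc))
```

No special I/O instructions appear in the sketch; the address decoder alone decides whether a store reaches RAM or a peripheral.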


Nios-II Instruction Bus

The instruction bus connects only to memory components. It is implemented as a 32-bit Avalon master port.
• Fetches instructions to be executed by the processor
• No memory write
• Always retrieves 32 bits of data

Shared Memory for Instructions and Data
• Instruction and data master ports can share a single memory that contains both instructions and data.
• The overall Nios II processor system might present a single, shared instruction and data bus to the outside world.


Nios-II Memory Support

The Nios II architecture supports tightly coupled memory that provides low-latency access to on-chip memory.

Cache Memory

• The Nios II architecture supports cache memories on both the instruction master port (instruction cache) and the data master port (data cache).

• Cache memory resides on-chip.
• A Nios II core might include one, both, or no cache memories.
• Optimal cache configuration is application specific.

For example: If a Nios II system includes only fast, on-chip memory, then an instruction or data cache is unlikely to offer any benefit.


Memory Address Map

The address map for memories and peripherals in a Nios II processor system is design dependent. Specify the address map at system generation time.

Three addresses are part of the CPU core:
• Reset address
• Exception address
• Break handler address

Reset Address
● We can select the memory module where the reset code (boot loader) resides, and the location of the reset address.
● Normally, a nonvolatile memory module is selected for the reset code.


Memory Address Map

Exception Address
● We can select the memory module where the exception vector resides, and its location.
● In a typical system, you select a low-latency memory module for the exception code.

Break Location
● For Nios II cores containing a JTAG debug module.
● Memory Module is always the JTAG debug module. Offset is fixed at 0x20, and Address is determined by the base address of the JTAG debug module.
● You cannot modify any of the Break Location fields.


NIOS Exception & Interrupt Control

Exception Controller
A simple, non-vectored exception controller with single-address exception handling.

Integral Interrupt Controller
• 32 level-sensitive interrupt request (IRQ) inputs, irq0 through irq31.
• Software can enable and disable any interrupt source individually through the ienable control register.
• Software can also enable and disable interrupts globally using the PIE bit of the status control register.
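The two levels of masking can be expressed as simple bit operations. In this sketch the 32 IRQ inputs and the ienable register are modelled as 32-bit integers with bit n standing for irq n. Placing PIE at bit 0 of the status register is an assumption of the example, and the rule of dispatching the lowest-numbered pending IRQ first is a hypothetical software policy, not something the slides specify.

```python
PIE_BIT = 0x1   # assumed position of the PIE bit in the status register


def pending_interrupts(irq_inputs, ienable):
    """IRQ lines that are both asserted and individually enabled."""
    return irq_inputs & ienable & 0xFFFF_FFFF


def interrupt_taken(irq_inputs, ienable, status):
    """An interrupt is taken only if PIE is set and something is pending."""
    return bool(status & PIE_BIT) and pending_interrupts(irq_inputs, ienable) != 0


def lowest_pending_irq(irq_inputs, ienable):
    """Hypothetical dispatch policy: pick the lowest-numbered pending IRQ."""
    pending = pending_interrupts(irq_inputs, ienable)
    if pending == 0:
        return None
    return (pending & -pending).bit_length() - 1


irq_inputs = (1 << 3) | (1 << 7)    # irq3 and irq7 are asserted
ienable = 1 << 7                    # only irq7 is enabled
print(hex(pending_interrupts(irq_inputs, ienable)))
print(interrupt_taken(irq_inputs, ienable, status=PIE_BIT))
print(interrupt_taken(irq_inputs, ienable, status=0))
print(lowest_pending_irq(irq_inputs, ienable))
```

Since the controller is non-vectored, a single exception handler would have to perform a scan like this one in software to find out which source to service.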


A Typical Nios II System


JTAG Debug Module

The Nios II architecture supports a JTAG debug module. Host-based tools communicate with the JTAG debug module for:
• Downloading programs to memory
• Starting and stopping execution
• Setting breakpoints and watch points
• Analyzing registers and memory
• Collecting real-time execution trace data

The debug module connects to the JTAG circuitry in an FPGA. On the processor side, it connects to signals inside the processor core.

It has non-maskable control over the processor under test. Its functionality can be reduced, or removed altogether.

All system resources visible to the processor in supervisor mode are available to the debug module.


JTAG Debug Module Levels

(2) Not including the dedicated JTAG pins on the Altera FPGA.


Instruction Set Format

Three types of instruction format: I-type, R-type, and J-type.

The I-type instruction-word format contains an immediate value embedded within the instruction word. I-type instruction words contain:
● A 6-bit opcode field OP
● Two 5-bit register fields A and B
● A 16-bit immediate data field IMM16

In most cases, fields A and IMM16 specify the source operands, and field B specifies the destination register. IMM16 is considered signed except for logical operations and unsigned comparisons. I-type instructions include arithmetic and logical operations such as addi and andi, branch operations, and load and store operations.


Nios-II Instruction Formats

R-type instructions contain:

● A 6-bit opcode field OP
● Three 5-bit register fields A, B, and C
● An 11-bit opcode-extension field OPX

In most cases, fields A and B specify the source operands, and field C specifies the destination register. R-type instructions include arithmetic and logical operations.

J-type instructions contain:
● A 6-bit opcode field and a 26-bit immediate data field
● The only J-type instruction is call.
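The field widths listed above add up to 32 bits in every format (for I-type, 5 + 5 + 16 + 6), which a pack/unpack sketch makes easy to verify. The slides give only the widths, so the field order used here (A in the top bits, then B, then IMM16, with OP in the lowest 6 bits) is an assumption of this example, and the opcode value 0x04 is just an illustrative number.

```python
def encode_i_type(op, a, b, imm16):
    """Pack an I-type word, assuming the layout A | B | IMM16 | OP."""
    if not (0 <= op < 64 and 0 <= a < 32 and 0 <= b < 32):
        raise ValueError("OP is 6 bits; A and B are 5 bits each")
    if not -0x8000 <= imm16 <= 0xFFFF:
        raise ValueError("IMM16 must fit in 16 bits")
    return (a << 27) | (b << 22) | ((imm16 & 0xFFFF) << 6) | op


def decode_i_type(word, signed=True):
    """Unpack an I-type word into (op, a, b, imm16)."""
    op = word & 0x3F
    imm16 = (word >> 6) & 0xFFFF
    b = (word >> 22) & 0x1F
    a = (word >> 27) & 0x1F
    if signed and imm16 & 0x8000:
        imm16 -= 0x10000            # sign-extend, as for arithmetic operations
    return op, a, b, imm16


# Source register A = r2, destination register B = r3, immediate = -1.
word = encode_i_type(op=0x04, a=2, b=3, imm16=-1)
print(hex(word))
print(decode_i_type(word))                  # immediate read as signed
print(decode_i_type(word, signed=False))    # immediate read as unsigned
```

The `signed` flag mirrors the rule that IMM16 is treated as signed except for logical operations and unsigned comparisons.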


Nios-II Processor Cores Three Nios II cores are available:

Nios II/f “fast” core for high performance. It presents the most configuration options allowing us to fine-tune the processor performance.

Nios II/s “standard” core is designed for small size while maintaining performance.

Nios II/e “economy” core is designed to achieve the smallest possible core size. This core has a limited feature set, and many settings are not available when the Nios II/e core is selected.


Nios-II Multiply & Divide Settings

Nios II cores offer different multiply and divide options. Choose the option that best balances embedded multiplier or logic element (LE) usage against performance.

The Hardware Multiply setting provides the following options:
● Include embedded multipliers (e.g., the DSP blocks in Stratix devices) in the arithmetic logic unit (ALU).
● Include LE-based multipliers in the ALU.
● Omit multiply hardware and perform multiply operations in software.

The Hardware Divide setting includes LE-based divide hardware in the ALU that achieves much greater performance than emulated software divide operations.


Nios-II Core Performance and Features

(1) DMIPS performance for the Nios II/s and Nios II/f cores depends on the hardware multiply option. (2) Using the fastest hardware multiply option, and targeting a Stratix II FPGA in the fastest speed grade.


Nios-II Processor Core Features

(3) Multiply and shift performance depends on which hardware multiply option is used.


The Nios II/f Core
● Separate instruction and data caches.
● Can access up to 2 GBytes of external address space.
● Supports optional tightly coupled memory for instructions and data.
● Employs a 6-stage pipeline to achieve maximum DMIPS/MHz.
● Performs dynamic branch prediction.
● Provides hardware multiply, divide, and shift options to improve arithmetic performance.
● Supports the addition of custom instructions.
● Supports the JTAG debug module.
● Supports optional JTAG debug module enhancements, including hardware breakpoints and real-time trace.


Nios II/f Core

● The size of each Nios II/f cache is user-definable, between 512 bytes and 64 Kbytes.
● It supports the bit-31 cache bypass method for accessing I/O on the data master port.
● Addresses are 31 bits wide to accommodate the bit-31 cache bypass method.
● The Nios II/f core employs a 6-stage pipeline.
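The link between the 31-bit address width, the 2 GByte address space and the bypass bit can be shown with a few bit operations. This is a sketch under the assumption that setting bit 31 of a data address requests an uncached access while the remaining 31 bits select the location; the function name is made up for the example.

```python
BYPASS_BIT = 1 << 31                 # bit 31 of a 32-bit data address
ADDRESS_MASK = BYPASS_BIT - 1        # the remaining 31 address bits


def split_address(address):
    """Return (bypass_cache, physical_address) for a 32-bit data address."""
    bypass_cache = bool(address & BYPASS_BIT)
    physical_address = address & ADDRESS_MASK
    return bypass_cache, physical_address


print(split_address(0x0000_1000))    # normal, cached access
print(split_address(0x8000_1000))    # same location, cache bypassed
print(BYPASS_BIT // 1024**3)         # size of a 31-bit address space in GBytes
```

Both addresses refer to the same location, which is why only 31 bits remain for addressing and why the external address space is limited to 2 GBytes.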


Nios II/s Core

Nios II/s “standard” core is designed for small core size. On-chip logic and memory resources are conserved at the expense of execution performance.
● Uses approximately 20% less logic than Nios II/f
● Its execution performance also drops by roughly 40%

Main design goals include:
● Do not cripple performance for the sake of size.
● Remove hardware features that have the highest ratio of resource usage to performance impact.
● Optimal core for cost-sensitive, medium-performance applications.

Target applications are those with large amounts of code and/or data, such as systems running an operating system, where performance is not the highest priority.


Overview: Nios II/s
● Has an instruction cache (512 bytes to 64 Kbytes) but no data cache
● Can access up to 2 GBytes of external address space
● Supports optional tightly coupled memory for instructions
● Performs static branch prediction
● Does not support bit-31 data cache bypass
● Provides hardware multiply, divide, and shift options to improve arithmetic performance
● Supports the addition of custom instructions
● Supports the JTAG debug module
● Supports optional JTAG debug module enhancements, including hardware breakpoints and real-time trace
● Employs a 5-stage pipeline of Fetch, Decode, Execute, Memory and Writeback


Nios II/e Core

Nios II/e “economy” core is designed to achieve the smallest possible core size.

Singular design goal: Reduce resource utilization, while maintaining compatibility with the Nios II instruction set architecture.

The core is roughly half the size of the standard core, but the performance is substantially lower.


Overview: Nios II/e Core
● Executes at most one instruction per six clock cycles
● Can access up to 2 GBytes of external address space
● Supports the addition of custom instructions
● Supports the JTAG debug module
● Does not provide hardware support for any of the potential unimplemented instructions
● Has no instruction cache or data cache
● Does not perform branch prediction
● Employs dedicated shift circuitry to perform shift and rotate operations


Nios-II Programming Model

General-Purpose Registers
● Some registers have names recognized by the assembler. The zero register (r0) always returns the value 0.
● The ra register (r31) holds the return address used by procedure calls and is implicitly accessed by call and ret instructions.
● C and C++ compilers use a common procedure-call convention, assigning specific meaning to registers r1 through r23 and r26 through r28.

Control Registers
● Six 32-bit control registers
● These registers are accessed differently than the general-purpose registers.


Nios II Registers


Nios-II CPU Operating Modes

The Nios II processor has two operating modes:

Normal Mode
● System and application code execute in normal mode.
● Registers bt (r25), ba (r30) & bstatus (ctl2) are not available.

Debug Mode
● Software debugging tools use it to implement breakpoints and watch-points.
● System and application code never execute in debug mode.

Changing Modes
● The processor starts in normal mode after reset.
● It enters debug mode only as directed by debugging tools.
● System and application code have no control over when the processor enters debug mode.
● The processor returns to its prior state on exiting from debug mode.


MIPS32 4K CPU Soft Core

Main blocks of the core are: Execution Unit, Multiply-Divide Unit (MDU), System Control Coprocessor (CP0), Memory Management Unit (MMU), Cache Controller, Bus I/F Unit (BIU), I-Cache, D-Cache, Enhanced JTAG Controller, Power Management


MIPS32 4K Features

● 32-bit Address and Data Paths
● Programmable Cache Sizes (0 to 16 Kbytes)
● Support for ScratchPad RAM: Max 20-bit index (1M address)
● Supports Multiply-Divide Unit: 32x16 multiply per clock, 32x32 multiply every other clock
● Power Control
● Supports EJTAG debug module: Test Access Port (TAP) facilitates high-speed download of application code
● Employs a 5-stage pipeline of Instruction, Execute, Memory, Align/Accum & Writeback
● CP0 is responsible for: virtual-to-physical address translation, cache protocols, exception control system, processor’s diagnostics, operating mode selection (kernel vs. user mode) and enabling/disabling of interrupts.
