Arm

Processor Architecture and Advanced RISC Machine

Prof. Anish Goel

Processor Architecture and ARM

von Neumann/Princeton Architecture

Memory holds both data and instructions.

The central processing unit (CPU) fetches instructions from memory. The separation of CPU and memory distinguishes the stored-program computer.

CPU registers: program counter (PC), instruction register (IR), general-purpose registers, etc.

von Neumann machines are also known as stored-program computers.

Named after John von Neumann, who wrote the First Draft of a Report on the EDVAC (Electronic Discrete Variable Automatic Computer), 1945.


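CPU + Memory

[Figure: the CPU holds the PC (value 200) and the IR (ADD r5,r1,r3) and is connected to memory by address and data lines; memory location 200 holds the instruction ADD r5,r1,r3.]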


Harvard Architecture

[Figure: the CPU, with its PC, is connected to a data memory and to a separate program memory, each through its own address and data lines.]


Princeton Arch. vs. Harvard Arch.

From the programmer's perspective, general-purpose computers appear to be Princeton machines.

However, modern high-performance CPUs are, at their heart, frequently designed as Harvard architectures, with added hardware outside the CPU to create the appearance of a Princeton design.

A pure Harvard machine can't use self-modifying code, because instructions are not reachable through the data memory.

Harvard allows two simultaneous memory fetches.

Most DSPs use Harvard architecture for streaming data: greater memory bandwidth; more predictable bandwidth.


Instruction Set Architecture (ISA)

Instruction set architecture (ISA): the interface between hardware and software.
  Expresses the operations visible to the programmer or compiler writer.
  The portion of the computer visible to software.

The ISA is supported by:
  Organization of programmable storage
  Data types & data structures: encodings & representations
  Addressing modes for data and instructions
  Instruction formats
  Instruction/opcode set
  Exceptional conditions


Application Considerations in ISA Design

Desktops
  Emphasis on performance with integer and floating-point (FP) data types.
  Little regard for program size or power consumption.

Servers
  Primarily used for databases, file servers, web applications & multi-user time-sharing.
  Performance on integers & character strings is important. However, FP instructions are present in virtually every server processor.

Embedded systems
  Emphasis on low cost, low power, and small code size.
  Some instruction types, e.g., FP, may be optional to reduce chip costs.


Processor Internal Storage vs. ISA

The type of internal storage serves as the most basic differentiation between ISAs.

Classes of ISAs based on operand storage type:
  Stack architecture: operands implicitly on top of stack
  Accumulator architecture: 1 operand implicitly in accumulator
  General-purpose register (GPR) architecture: only explicit operands
  Extended accumulator or special-purpose register architecture: restrictions on using special registers


General-Purpose Register Architecture

Three possible choices for GPR arch.:
  Register-memory architecture: memory access can be part of any instruction (memory operands)
  Register-register (load-store) architecture: only load & store instructions can access memory
    Almost all designs after 1980
  Memory-memory architecture: all operands in memory
    Not normally available nowadays


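ISA Classification

Based on internal storage in processor.

[Figure: four processor/memory datapaths, each with an ALU inside the processor and memory (Mem) outside it: (a) Stack, with the top of stack (TOS) feeding the ALU; (b) Accumulator; (c) Register-memory; (d) Register-register.]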


Comparison of ISAs

Code sequences for "C=A+B":

Stack      Accumulator   Register-memory   Register-register
Push A     Load A        Load R1, A        Load R1, A
Push B     Add B         Add R3, R1, B     Load R2, B
Add        Store C       Store R3, C       Add R3, R1, R2
Pop C                                      Store R3, C

Implicit operands in Stack/Accumulator arch.
Less flexibility of execution order in Stack arch.
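The stack and accumulator sequences for C=A+B can be traced with a toy interpreter. This is only a sketch in Python: the tuple encoding of instructions, the function names, and the dictionary used as memory are assumptions made for illustration, not part of any real ISA.

```python
def run_stack(program, mem):
    """Stack machine: operands are implicitly on top of the stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(mem[args[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "pop":
            mem[args[0]] = stack.pop()
    return mem


def run_accumulator(program, mem):
    """Accumulator machine: one operand is implicitly the accumulator."""
    acc = 0
    for op, name in program:
        if op == "load":
            acc = mem[name]
        elif op == "add":
            acc += mem[name]
        elif op == "store":
            mem[name] = acc
    return mem


# The two leftmost columns of the comparison, encoded as tuples.
stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
acc_prog = [("load", "A"), ("add", "B"), ("store", "C")]

stack_result = run_stack(stack_prog, {"A": 2, "B": 3, "C": 0})
acc_result = run_accumulator(acc_prog, {"A": 2, "B": 3, "C": 0})
```

Both runs leave A+B in C; the difference lies only in where the operands are held while the addition happens.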


Instruction Set Complexity

Depends on:
  Number of instructions
  Instruction formats
  Data formats
  Addressing modes
  General-purpose registers (number and size)
  Flow-control mechanisms (conditionals, exceptions)

Instruction set characteristics:
  Fixed vs. variable length.
  Addressing modes.
  Number of operands.
  Types of operands.


CISC vs. RISC

Complex instruction set computer (CISC):
  many addressing modes;
  can directly operate on operands in memory;
  many operations;
  variable instruction length.
  Examples: Intel x86 microprocessors and compatibles

Reduced instruction set computer (RISC):
  load/store: operands in memory must first be loaded into a register before any operation;
  fixed instruction length (in general);
  pipelinable instructions.
  Examples: ARM, MIPS, Sun SPARC, PowerPC, …


Exploit ILP: Superscalar vs. VLIW

A RISC pipeline executes one instruction per clock cycle (usually).

Superscalar machines, based on complex hardware design, issue/execute multiple instructions per clock cycle.
  Faster execution.
  More variability in execution times.
  More expensive CPU.

VLIW machines rely on a sophisticated compiler to identify instruction-level parallelism (ILP) and statically schedule parallel instructions.


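Finding Parallelism

Independent operations can be performed in parallel:
    ADD r0, r0, r1
    ADD r2, r2, r3
    ADD r6, r4, r0

Register renaming:
    ADD r10, r0, r1
    ADD r11, r2, r3
    ADD r12, r4, r10

[Figure: dataflow graph in which r0+r1 and r2+r3 are computed in parallel, producing r0 and r2; the new r0 is then added to r4 to produce r6.]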
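One way to see which of these ADDs are independent is to look for read-after-write (RAW) dependencies. The sketch below is illustrative only: the `(dest, src1, src2)` tuple encoding and the function name are assumptions, and it ignores the write-after-read and write-after-write hazards that register renaming is meant to remove.

```python
def raw_dependencies(instrs):
    """Return (producer, consumer) index pairs with a read-after-write dependency.

    Each instruction is (dest, src1, src2), as in ADD dest, src1, src2.
    """
    deps = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for i, (dest, *srcs) in enumerate(instrs):
        for src in srcs:
            if src in last_writer:
                deps.append((last_writer[src], i))
        last_writer[dest] = i
    return deps


original = [("r0", "r0", "r1"), ("r2", "r2", "r3"), ("r6", "r4", "r0")]
renamed = [("r10", "r0", "r1"), ("r11", "r2", "r3"), ("r12", "r4", "r10")]

original_deps = raw_dependencies(original)
renamed_deps = raw_dependencies(renamed)
```

In both sequences only the third ADD depends on the first, so the first two can issue together. Renaming keeps that true dependency and gives every result its own destination register.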


Order of Execution

In-order:
  Instructions are issued/executed in the program order.
  Machine stops issuing instructions when the next instruction can't be dispatched.

Out-of-order:
  Instructions are eligible for issue/execution once source operands become available.
  Machine will change order of instructions to keep dispatching.
  Substantially faster but also more complex.


What is VLIW?

VLIW: very long instruction word.
A VLIW instruction consists of several operations to be executed in parallel.
Parallel function units with shared register file:

[Figure: instruction decode and memory drive several parallel function units, all of which share one register file.]


VLIW Cluster

Organized into clusters to accommodate available register bandwidth:

[Figure: a row of clusters: cluster, cluster, cluster, ...]


VLIW and Compilers

VLIW requires considerably more sophisticated compiler technology than traditional architectures: the compiler must be able to extract parallelism to keep the instructions full.

Many VLIWs have good compiler support.

Contemporary VLIW processors: TriMedia media processors by NXP (formerly Philips Semiconductors), SHARC DSP by Analog Devices, C6000 DSP family by Texas Instruments, and STMicroelectronics ST200 family based on the Lx architecture.


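Static Scheduling

[Figure: a dependence graph of operations a through g (labeled "expressions") is statically scheduled into VLIW instruction words (labeled "instructions") holding the groups "a b e", "f c", and "d g", with nop filling the unused slots.]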


Limits in VLIW

VLIW (at least the original forms) has several shortcomings that precluded it from becoming mainstream:

VLIW instruction sets are not backward compatible between implementations. As wider implementations (more execution units) are built, the instruction set for the wider machines is not backward compatible with older, narrower implementations.

Load responses from a memory hierarchy that includes CPU caches and DRAM do not arrive after a deterministic delay. This makes static scheduling of load instructions by the compiler very difficult.


EPIC

EPIC = Explicitly parallel instruction computing.

Used in the Intel/HP Merced (IA-64) machine.

Incorporates several features to allow the machine to find and exploit increased parallelism.
  Each group of multiple software instructions is called a bundle.
  Each bundle carries information indicating whether its set of operations is depended upon by the subsequent bundle.
  A speculative load instruction is used as a type of data prefetch.
  A check load instruction also aids speculative loads by checking that a load was not dependent on a previous store.


IA-64 Instruction Format

Instructions are bundled with a tag to indicate which instructions can be executed in parallel:

[Figure: one 128-bit bundle: tag | instruction 1 | instruction 2 | instruction 3]
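The 128-bit bundle can be modeled as plain bit fields. The field widths below (a 5-bit tag, called the template, in the low bits, followed by three 41-bit instruction slots) come from the Itanium architecture rather than from this slide, and the function names are made up for illustration.

```python
TEMPLATE_BITS = 5   # the "tag": which slots may execute in parallel
SLOT_BITS = 41      # width of each of the three instruction slots


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == 3
    word = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < (1 << SLOT_BITS)
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    ]
    return template, slots


bundle = pack_bundle(0x10, [1, 2, 3])
```

The widths add up exactly: 5 + 3 × 41 = 128 bits.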


Assembly Language

One-to-one with instructions (more or less).

Basic features:
  One instruction per line.
  Labels provide names for addresses (usually in first column).
  Instructions often start in later columns.
  Columns run to end of line.


ARM Instruction Set

  ARM versions.
  ARM assembly language.
  ARM programming model.
  ARM data operations.
  ARM flow of control.


ARM Versions

The ARM architecture has been extended over several versions.
Latest version (when these slides were prepared): ARM11.
We will concentrate on ARM7.


ARM Assembly Language Example

Fairly standard assembly language:

label1  ADR r4,c
        LDR r0,[r4]     ; a comment
        ADR r4,d
        LDR r1,[r4]
        SUB r0,r0,r1    ; comment

The first register operand is the destination.


Pseudo-ops

Some assembler directives don't correspond directly to instructions:
  Define current address.
  Reserve storage.
  Constants.


ARM Instruction Set Format

[Figure: ARM instruction set formats, from the ARM710T datasheet.]


ARM Data Types Word is 32 bits long.

Word can be divided into four 8-bit bytes.

ARM addresses can be 32 bits long.

Address refers to byte. Address 4 starts at byte 4.

Can be configured at power-up as either little- or big-endian mode.



Endianness

Endianness: the ordering of bytes within a larger object, e.g., a word, i.e., how a large object is stored in memory.

The 68000 is a big-endian processor.

[Figure: four bytes (0, 1, 2, 3) stored at addresses 0x00..10 through 0x00..13 in a memory spanning 0x00..00 to 0xffffffff, and their arrangement in a register under Big Endian and Little Endian ordering.]
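The two byte orders can be seen by serializing one 32-bit word both ways. This is a small Python sketch using the standard `struct` module; the example value is arbitrary.

```python
import struct

value = 0x00010203

# Big-endian: the most significant byte sits at the lowest address.
big = struct.pack(">I", value)

# Little-endian: the least significant byte sits at the lowest address.
little = struct.pack("<I", value)

# Reading big-endian bytes as if they were little-endian scrambles the word.
misread = struct.unpack("<I", big)[0]
```

Here `big` holds the bytes 00 01 02 03 in address order and `little` holds 03 02 01 00, which is why the same memory contents yield different register values depending on the configured mode.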


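ARM Programming Model

[Figure: the register set r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, and r15 (PC), plus the 32-bit CPSR (bits 31 to 0) with the N, Z, C, V flags in its top bits.]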


The Program Status Registers (CPSR and SPSRs)

[Figure: 32-bit status register layout. Bits 31-28: N Z C V. Bits 7-5: I F T. Bits 4-0: Mode.]

* Condition Code Flags
  Copies of the ALU status flags (latched if the instruction has the "S" bit set).
  N = Negative result from ALU flag.
  Z = Zero result from ALU flag.
  C = ALU operation Carried out
  V = ALU operation oVerflowed

* Interrupt Disable bits.
  I = 1, disables the IRQ.
  F = 1, disables the FIQ.

* T Bit (Architecture v4T only)
  T = 0, Processor in ARM state
  T = 1, Processor in Thumb state

* Mode Bits
  M[4:0] define the processor mode.


Processor Modes

The ARM has six operating modes (the number in parentheses is the M[4:0] value in decimal):
  User (16): unprivileged mode under which most tasks run
  FIQ (17): entered when a high-priority (fast) interrupt is raised
  IRQ (18): entered when a low-priority (normal) interrupt is raised
  Supervisor (19): entered on reset and when a Software Interrupt instruction is executed
  Abort (23): used to handle memory access violations
  Undef (27): used to handle undefined instructions

ARM Architecture Version 4 adds a seventh mode:
  System (31): privileged mode using the same registers as user mode
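The status register fields and mode numbers can be tied together by decoding a CPSR value. This is a sketch under the standard ARM bit positions (N, Z, C, V in bits 31 to 28; I, F, T in bits 7, 6, 5; mode in bits 4 to 0); the function name and the example value are chosen only for illustration.

```python
MODES = {
    16: "User", 17: "FIQ", 18: "IRQ", 19: "Supervisor",
    23: "Abort", 27: "Undef", 31: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its flag, control, and mode fields."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,
        "F": (cpsr >> 6) & 1,
        "T": (cpsr >> 5) & 1,
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }


# Example: Z and C set, IRQ and FIQ disabled, ARM state, mode bits = 19.
example = decode_cpsr(0x600000D3)
```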


Condition Flags

Flag               Logical Instruction                Arithmetic Instruction
Negative (N='1')   No meaning                         Bit 31 of the result has been set. Indicates a negative number in signed operations.
Zero (Z='1')       Result is all zeroes               Result of operation was zero
Carry (C='1')      After Shift operation, '1' was     Result was greater than 32 bits
                   left in carry flag
oVerflow (V='1')   No meaning                         Result was greater than 31 bits. Indicates a possible corruption of the sign bit in signed numbers.
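For an arithmetic instruction, the four flags can be derived from the 33-bit sum. The following is a minimal Python model of a flag-setting 32-bit addition; the function name `adds` and the dictionary result are illustrative choices, not an ARM API.

```python
MASK32 = 0xFFFFFFFF


def adds(a, b):
    """Model a 32-bit ADDS: return (result, flags) with flags N, Z, C, V."""
    a &= MASK32
    b &= MASK32
    full = a + b              # may need 33 bits
    result = full & MASK32
    flags = {
        "N": result >> 31,                 # bit 31 of the result
        "Z": int(result == 0),             # result is all zeroes
        "C": int(full > MASK32),           # result was greater than 32 bits
        # Signed overflow: both operands share a sign that the result lacks.
        "V": ((a ^ result) & (b ^ result)) >> 31,
    }
    return result, flags


overflow_case = adds(0x7FFFFFFF, 1)   # largest positive value plus one
carry_case = adds(0xFFFFFFFF, 1)      # unsigned wrap-around to zero
```

The first case sets N and V (the sign bit was corrupted) but not C. The second sets Z and C but not V, since in signed terms it is -1 + 1 = 0.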


Conditional Execution

Most instruction sets only allow branches to be executed conditionally.

However, by reusing the condition evaluation hardware, ARM effectively increases the number of instructions.
  All instructions contain a condition field which determines whether the CPU will execute them.
  Non-executed instructions soak up 1 cycle.
    Still have to complete the cycle so as to allow fetching and decoding of following instructions.

This removes the need for many branches, which stall the pipeline (3 cycles to refill).
  Allows very dense in-line code, without branches.
  The time penalty of not executing several conditional instructions is frequently less than the overhead of the branch or subroutine call that would otherwise be needed.


The Condition Field

[Figure: the condition field (Cond) occupies bits 31-28 of the 32-bit instruction word.]

0000 = EQ - Z set (equal)
0001 = NE - Z clear (not equal)
0010 = HS / CS - C set (unsigned higher or same)
0011 = LO / CC - C clear (unsigned lower)
0100 = MI - N set (negative)
0101 = PL - N clear (positive or zero)
0110 = VS - V set (overflow)
0111 = VC - V clear (no overflow)
1000 = HI - C set and Z clear (unsigned higher)
1001 = LS - C clear or Z set (unsigned lower or same)
1010 = GE - N set and V set, or N clear and V clear (>=)
1011 = LT - N set and V clear, or N clear and V set (<)
1100 = GT - Z clear, and either N set and V set, or N clear and V clear (>)
1101 = LE - Z set, or N set and V clear, or N clear and V set (<=)
1110 = AL - always
1111 = NV - reserved.
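The condition table can be written as one predicate over the four flags. This Python sketch uses only the mnemonics listed above (HS and LO are aliases of CS and CC); the function name `cond_passed` is an assumption for illustration.

```python
def cond_passed(cond, n, z, c, v):
    """Return True if condition `cond` holds for the given N, Z, C, V flags."""
    checks = {
        "EQ": z, "NE": not z,
        "CS": c, "CC": not c,
        "MI": n, "PL": not n,
        "VS": v, "VC": not v,
        "HI": c and not z,
        "LS": (not c) or z,
        "GE": n == v,
        "LT": n != v,
        "GT": (not z) and n == v,
        "LE": z or n != v,
        "AL": True,
    }
    return bool(checks[cond])


# Flags left by CMP r0, r1 with r0 = 3 and r1 = 5 (3 - 5 = -2):
# N = 1, Z = 0, C = 0 (a borrow occurred), V = 0.
lt_after_cmp = cond_passed("LT", n=1, z=0, c=0, v=0)
ge_after_cmp = cond_passed("GE", n=1, z=0, c=0, v=0)
```

The signed conditions compare N with V rather than testing N alone, so they stay correct even when the subtraction overflows.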


Using and updating the Condition Field

To execute an instruction conditionally, simply postfix it with the appropriate condition.

For example, an add instruction takes the form:
    ADD r0,r1,r2     ; r0 = r1 + r2 (ADDAL)

To execute this only if the zero flag is set:
    ADDEQ r0,r1,r2   ; If zero flag set then...
                     ; ... r0 = r1 + r2

By default, data processing operations do not affect the condition flags (apart from the comparisons, where this is the only effect). To cause the condition flags to be updated, the S bit of the instruction needs to be set by postfixing the instruction (and any condition code) with an "S".

For example, to add two numbers and set the condition flags:
    ADDS r0,r1,r2    ; r0 = r1 + r2
                     ; ... and set flags


Data processing Instructions

Largest family of ARM instructions, all sharing the same instruction format.

Contains:
  Arithmetic operations
  Comparisons (no results - just set condition codes)
  Logical operations
  Data movement between registers

Remember, this is a load / store architecture.
  These instructions only work on registers, NOT memory.

They each perform a specific operation on one or two operands.
  First operand always a register - Rn
  Second operand sent to the ALU via barrel shifter.

We will examine the barrel shifter shortly.


Arithmetic Operations

Operations are:
  ADD   operand1 + operand2
  ADC   operand1 + operand2 + carry
  SUB   operand1 - operand2
  SBC   operand1 - operand2 + carry - 1
  RSB   operand2 - operand1
  RSC   operand2 - operand1 + carry - 1

Syntax:
  <Operation>{<cond>}{S} Rd, Rn, Operand2

Examples:
  ADD r0, r1, r2
  SUBGT r3, r3, #1
  RSBLES r4, r5, #5
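The six formulas can be checked with a small model that wraps each result to 32 bits. This is a Python sketch; the function name and the string opcodes are illustrative, and flag updates are deliberately left out.

```python
MASK32 = 0xFFFFFFFF


def arm_arith(op, operand1, operand2, carry=0):
    """Evaluate one arithmetic operation as defined in the list above."""
    results = {
        "ADD": operand1 + operand2,
        "ADC": operand1 + operand2 + carry,
        "SUB": operand1 - operand2,
        "SBC": operand1 - operand2 + carry - 1,
        "RSB": operand2 - operand1,
        "RSC": operand2 - operand1 + carry - 1,
    }
    return results[op] & MASK32   # wrap to a 32-bit register value


# RSB with a zero second operand negates a register: 0 - 5 in two's complement.
negated = arm_arith("RSB", 5, 0)

# SBC subtracts one extra when the carry flag is clear (a borrow is pending).
sbc_no_borrow = arm_arith("SBC", 10, 3, carry=1)
sbc_borrow = arm_arith("SBC", 10, 3, carry=0)
```

The "+ carry - 1" term is what lets SBC and RSC chain multi-word subtractions: a clear carry from the lower word means a borrow must be taken from the upper word.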


Multiplication Instructions

The basic ARM provides two multiplication instructions.

Multiply
    MUL{<cond>}{S} Rd, Rm, Rs        ; Rd = Rm * Rs

Multiply Accumulate - does addition for free
    MLA{<cond>}{S} Rd, Rm, Rs, Rn    ; Rd = (Rm * Rs) + Rn

Restrictions on use:
  Rd and Rm cannot be the same register.
    Can be avoided by swapping Rm and Rs around. This works because multiplication is commutative.
  Cannot use PC.
  These will be picked up by the assembler if overlooked.

Operands can be considered signed or unsigned.
  Up to the user to interpret correctly.


Comparisons

The only effect of the comparisons is to UPDATE THE CONDITION FLAGS. Thus no need to set the S bit.

Operations are:
  CMP   operand1 - operand2, but result not written
  CMN   operand1 + operand2, but result not written
  TST   operand1 AND operand2, but result not written
  TEQ   operand1 EOR operand2, but result not written

Syntax:
  <Operation>{<cond>} Rn, Operand2

Examples:
  CMP r0, r1
  TSTEQ r2, #5


Logical Operations

Operations are:
  AND   operand1 AND operand2
  EOR   operand1 EOR operand2
  ORR   operand1 OR operand2
  BIC   operand1 AND NOT operand2 [i.e. bit clear]

Syntax:
  <Operation>{<cond>}{S} Rd, Rn, Operand2

Examples:
  AND r0, r1, r2
  BICEQ r2, r3, #7
  EORS r1, r3, r0


Data Movement

Operations are:
  MOV   operand2
  MVN   NOT operand2

Note that these make no use of operand1.

Syntax:
  <Operation>{<cond>}{S} Rd, Operand2

Examples:
  MOV r0, r1
  MOVS r2, #10
  MVNEQ r1, #0


The Barrel Shifter The ARM doesn’t have actual shift instructions.

Instead it has a barrel shifter which provides a mechanism to carry out shifts as part of other instructions.

So what operations does the barrel shifter support?



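Barrel Shifter - Left Shift

Logical Shift Left (LSL)

Shifts left by the specified amount (multiplies by powers of two), e.g.
    LSL #5 = multiply by 32

[Figure: bits move left through the Destination; the last bit shifted out goes to the carry flag (CF) and zeros are shifted in on the right.]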


Barrel Shifter - Right Shifts

Logical Shift Right (LSR)

• Shifts right by the specified amount (divides by powers of two), e.g.
    LSR #5 = divide by 32

[Figure: zeros (...0) are shifted in at the left of the Destination; the last bit shifted out on the right goes to the carry flag (CF).]

Arithmetic Shift Right (ASR)

• Shifts right (divides by powers of two) and preserves the sign bit, for 2's complement operations, e.g.
    ASR #5 = divide by 32

[Figure: the sign bit is shifted in at the left of the Destination; the last bit shifted out on the right goes to the carry flag (CF).]


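Barrel Shifter - Rotations

Rotate Right (ROR)

• Similar to an ASR but the bits wrap around as they leave the LSB and appear as the MSB, e.g.
    ROR #5

• Note the last bit rotated is also used as the Carry Out.

[Figure: bits leaving the right end of the Destination re-enter at the left end and are also copied to the carry flag (CF).]

Rotate Right Extended (RRX)

• This operation uses the CPSR C flag as a 33rd bit.

• Rotates right by 1 bit. Encoded as ROR #0.

[Figure: rotate right through carry: the carry flag (CF) enters the left end of the Destination and the bit leaving the right end becomes the new CF.]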
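The five barrel shifter operations can be modeled on 32-bit values as follows. This is an illustrative Python sketch: the function names are assumptions, shift amounts are taken to be in the range 0 to 31, and only `rrx` returns a carry out.

```python
MASK32 = 0xFFFFFFFF


def lsl(x, n):
    """Logical shift left: zeros enter on the right."""
    return (x << n) & MASK32


def lsr(x, n):
    """Logical shift right: zeros enter on the left."""
    return (x & MASK32) >> n


def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter on the left."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32          # reinterpret as a negative two's complement value
    return (x >> n) & MASK32


def ror(x, n):
    """Rotate right: bits leaving the LSB reappear at the MSB."""
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def rrx(x, carry_in):
    """Rotate right by one bit through carry; returns (result, carry_out)."""
    x &= MASK32
    return (carry_in << 31) | (x >> 1), x & 1
```

For instance, `lsl(1, 5)` gives 32, while `asr(0xFFFFFFC0, 5)` (that is, -64 shifted right by 5) gives 0xFFFFFFFE, the 32-bit encoding of -2, where `lsr` would instead produce a large positive number.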


Barrel Shifter

Barrel shifter: a hardware device that can shift or rotate a data word by any number of bits in a single operation.

It is implemented like a multiplexer: each output can be connected to any input depending on the shift distance.


Using the Barrel Shifter: the Second Operand

[Figure: Operand 1 goes directly to the ALU; Operand 2 passes through the Barrel Shifter before reaching the ALU, which produces the Result.]

* Immediate value
  • 8 bit number
  • Can be rotated right through an even number of positions.
  • Assembler will calculate rotate for you from constant.

* Register, optionally with shift operation applied.
  Shift value can be either:
  • 5 bit unsigned integer
  • Specified in bottom byte of another register.
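The calculation the assembler performs for an immediate can be sketched as a search over the even rotations. This Python example is illustrative: `encode_immediate` is a hypothetical helper, and it returns the first encoding found rather than any particular canonical choice.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, n):
    """Rotate a 32-bit value left by n bits (the inverse of rotate right)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def encode_immediate(value):
    """Return (imm8, rotate) with value == imm8 rotated right by `rotate`.

    `rotate` is even and in 0..30. Returns None if no such encoding exists.
    """
    value &= MASK32
    for rotate in range(0, 32, 2):
        imm8 = rol32(value, rotate)   # undo a rotate right of `rotate` bits
        if imm8 <= 0xFF:
            return imm8, rotate
    return None
```

Values such as 0xFF000000 or 0x3FC fit (an 8-bit pattern at an even bit position), while 0x101 does not, because its set bits span nine positions. A constant like that has to be built in more than one instruction or loaded from memory.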


Second Operand : Shifted Register

The amount by which the register is to be shifted is contained in either:
  the immediate 5-bit field in the instruction
    NO OVERHEAD
    Shift is done for free - executes in single cycle.
  the bottom byte of a register (not PC)
    Then takes extra cycle to execute.
    ARM doesn't have enough read ports to read 3 registers at once.
    Then same as on other processors where shift is separate instruction.

If no shift is specified then a default shift is applied: LSL #0, i.e. barrel shifter has no effect on value in register.


Second Operand : Using a Shifted Register

Using a multiplication instruction to multiply by a constant means first loading the constant into a register and then waiting a number of internal cycles for the instruction to complete.

A better solution can often be found by using some combination of MOVs, ADDs, SUBs and RSBs with shifts.
  Multiplications by a constant equal to a ((power of 2) ± 1) can be done in one cycle.

Example: r0 = r1 * 5
Example: r0 = r1 + (r1 * 4)
    ADD r0, r1, r1, LSL #2

Example: r2 = r3 * 105
Example: r2 = r3 * 15 * 7
Example: r2 = r3 * (16 - 1) * (8 - 1)
    RSB r2, r3, r3, LSL #4   ; r2 = r3 * 15
    RSB r2, r2, r2, LSL #3   ; r2 = r2 * 7
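The two shift-and-add sequences can be checked against ordinary multiplication. The Python helpers below are an illustrative model of `ADD Rd, Rn, Rm, LSL #k` and `RSB Rd, Rn, Rm, LSL #k`; their names are assumptions.

```python
MASK32 = 0xFFFFFFFF


def add_shift(rn, rm, shift):
    """ADD Rd, Rn, Rm, LSL #shift  ->  Rn + (Rm << shift)."""
    return (rn + (rm << shift)) & MASK32


def rsb_shift(rn, rm, shift):
    """RSB Rd, Rn, Rm, LSL #shift  ->  (Rm << shift) - Rn."""
    return ((rm << shift) - rn) & MASK32


def times5(r1):
    return add_shift(r1, r1, 2)      # r1 + r1*4


def times105(r3):
    r2 = rsb_shift(r3, r3, 4)        # r3*16 - r3 = r3*15
    return rsb_shift(r2, r2, 3)      # r2*8 - r2  = r2*7
```

Each helper corresponds to a single data processing instruction, so a multiply by 105 takes two instructions and needs no constant in a register.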


ARM Load/Store Instructions

LDR, LDRH, LDRB : load (word, half-word, byte)
STR, STRH, STRB : store (word, half-word, byte)

Addressing modes:
  register indirect : LDR r0,[r1]
  with second register : LDR r0,[r1,-r2]
  with constant : LDR r0,[r1,#4]


ARM ADR Pseudo-op

Cannot refer to an address directly in an instruction.
Generate value by performing arithmetic on PC.
ADR pseudo-op generates instruction required to calculate address:
    ADR r1,FOO


Additional addressing modes

Base-plus-offset addressing:
    LDR r0,[r1,#16]
  Loads from location r1+16.

Auto-indexing increments base register:
    LDR r0,[r1,#16]!
  Adds 16 to r1, then loads r0 from the new r1.

Post-indexing fetches, then does offset:
    LDR r0,[r1],#16
  Loads r0 from r1, then adds 16 to r1.
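The difference between the three modes is when, and whether, the base register changes. This Python sketch models memory as a dictionary from address to word and registers as a dictionary from name to value; both representations and the function names are assumptions made for illustration.

```python
def ldr_offset(mem, regs, rd, rn, offset):
    """LDR rd,[rn,#offset]: load from rn+offset, base register unchanged."""
    regs[rd] = mem[regs[rn] + offset]


def ldr_pre_indexed(mem, regs, rd, rn, offset):
    """LDR rd,[rn,#offset]!: update the base first, then load from it."""
    regs[rn] += offset
    regs[rd] = mem[regs[rn]]


def ldr_post_indexed(mem, regs, rd, rn, offset):
    """LDR rd,[rn],#offset: load from the base, then update it."""
    regs[rd] = mem[regs[rn]]
    regs[rn] += offset


mem = {100: 11, 116: 22}

regs_a = {"r0": 0, "r1": 100}
ldr_offset(mem, regs_a, "r0", "r1", 16)        # r0 = 22, r1 stays 100

regs_b = {"r0": 0, "r1": 100}
ldr_pre_indexed(mem, regs_b, "r0", "r1", 16)   # r0 = 22, r1 becomes 116

regs_c = {"r0": 0, "r1": 100}
ldr_post_indexed(mem, regs_c, "r0", "r1", 16)  # r0 = 11, r1 becomes 116
```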


Example: C Assignments

C:
    x = (a + b) - c;

Assembler:
    ADR r4,a        ; get address for a
    LDR r0,[r4]     ; get value of a
    ADR r4,b        ; get address for b, reusing r4
    LDR r1,[r4]     ; get value of b
    ADD r3,r0,r1    ; compute a+b
    ADR r4,c        ; get address for c
    LDR r2,[r4]     ; get value of c
    SUB r3,r3,r2    ; complete computation of x
    ADR r4,x        ; get address for x
    STR r3,[r4]     ; store value of x


Example: C Assignment

C:
    y = a*(b+c);

Assembler:
    ADR r4,b        ; get address for b
    LDR r0,[r4]     ; get value of b
    ADR r4,c        ; get address for c
    LDR r1,[r4]     ; get value of c
    ADD r2,r0,r1    ; compute partial result
    ADR r4,a        ; get address for a
    LDR r0,[r4]     ; get value of a
    MUL r2,r2,r0    ; compute final value for y
    ADR r4,y        ; get address for y
    STR r2,[r4]     ; store y


Example: C Assignment C:

z = (a << 2) | (b & 15);

Assembler:

ADR r4,a ; get address for a

LDR r0,[r4] ; get value of a

MOV r0,r0,LSL #2 ; perform shift

ADR r4,b ; get address for b

LDR r1,[r4] ; get value of b

AND r1,r1,#15 ; perform AND

ORR r1,r0,r1 ; perform OR

ADR r4,z ; get address for z

STR r1,[r4] ; store value for z



ARM Flow of Control

All operations can be performed conditionally, testing CPSR:
  EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE

Branch operation:
    B #100

Can be performed conditionally.


Example: if Statement C:

if (a < b) { x = 5; y = c + d; } else x = c - d;

Assembler:

; compute and test condition

ADR r4,a ; get address for a

LDR r0,[r4] ; get value of a

ADR r4,b ; get address for b

LDR r1,[r4] ; get value for b

CMP r0,r1 ; compare a < b

BGE fblock ; if a >= b, branch to false block



if Statement, cont'd.

; true block
        MOV r0,#5       ; generate value for x
        ADR r4,x        ; get address for x
        STR r0,[r4]     ; store x
        ADR r4,c        ; get address for c
        LDR r0,[r4]     ; get value of c
        ADR r4,d        ; get address for d
        LDR r1,[r4]     ; get value of d
        ADD r0,r0,r1    ; compute y
        ADR r4,y        ; get address for y
        STR r0,[r4]     ; store y
        B after         ; branch around false block

; false block
fblock  ADR r4,c        ; get address for c
        LDR r0,[r4]     ; get value of c
        ADR r4,d        ; get address for d
        LDR r1,[r4]     ; get value for d
        SUB r0,r0,r1    ; compute c-d
        ADR r4,x        ; get address for x
        STR r0,[r4]     ; store value of x
after   ...


Conditional Instruction Implementation

; compute and test condition
        ADR r4,a        ; get address for a
        LDR r0,[r4]     ; get value of a
        ADR r4,b        ; get address for b
        LDR r1,[r4]     ; get value for b
        CMP r0,r1       ; compare a < b

; true block
        MOVLT r0,#5     ; generate value for x
        ADRLT r4,x      ; get address for x
        STRLT r0,[r4]   ; store x
        ADRLT r4,c      ; get address for c
        LDRLT r0,[r4]   ; get value of c
        ADRLT r4,d      ; get address for d
        LDRLT r1,[r4]   ; get value of d
        ADDLT r0,r0,r1  ; compute y
        ADRLT r4,y      ; get address for y
        STRLT r0,[r4]   ; store y

; false block
        ADRGE r4,c      ; get address for c
        LDRGE r0,[r4]   ; get value of c
        ADRGE r4,d      ; get address for d
        LDRGE r1,[r4]   ; get value for d
        SUBGE r0,r0,r1  ; compute c-d
        ADRGE r4,x      ; get address for x
        STRGE r0,[r4]   ; store value of x


Example: switch Statement C:

switch (test) { case 0: … break; case 1: … }

Assembler:

ADR r2,test ; get address for test

LDR r0,[r2] ; load value for test

ADR r1,switchtab ; load address for switch table

LDR r1,[r1,r0,LSL #2] ; index switch table

switchtab DCD case0

DCD case1

...



Example: FIR filter C:

for (i=0, f=0; i<N; i++)

f = f + c[i]*x[i];

Assembler:

; loop initiation code

MOV r0,#0 ; use r0 for i

MOV r8,#0 ; use separate index for arrays

ADR r2,N ; get address for N

LDR r1,[r2] ; get value of N

MOV r2,#0 ; use r2 for f

ADR r3,c ; load r3 with base of c

ADR r5,x ; load r5 with base of x



FIR filter, cont'd.

; loop body

loop LDR r4,[r3,r8] ; get c[i]

LDR r6,[r5,r8] ; get x[i]

MUL r4,r4,r6 ; compute c[i]*x[i]

ADD r2,r2,r4 ; add into running sum

ADD r8,r8,#4 ; add one word offset to array index

ADD r0,r0,#1 ; add 1 to i

CMP r0,r1 ; exit?

BLT loop ; if i < N, continue

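The structure of the assembly loop can be mirrored line by line in Python to check the arithmetic. In this sketch the variable names map to the registers used above (r0 for i, r8 for the byte offset, r2 for f). Dividing the byte offset by 4 to index a Python list is a modeling assumption standing in for word-sized memory accesses.

```python
def fir(c, x, n):
    """Mirror of the assembly loop: the test sits at the bottom of the body."""
    i = 0        # r0: loop counter
    offset = 0   # r8: byte offset into both arrays
    f = 0        # r2: running sum
    while True:
        coeff = c[offset // 4]    # LDR r4,[r3,r8]  ; get c[i]
        sample = x[offset // 4]   # LDR r6,[r5,r8]  ; get x[i]
        f += coeff * sample       # MUL then ADD into the running sum
        offset += 4               # ADD r8,r8,#4    ; one word further
        i += 1                    # ADD r0,r0,#1
        if not i < n:             # CMP r0,r1 / BLT loop
            break
    return f


result = fir([1, 2, 3], [4, 5, 6], 3)
```

Because the comparison comes after the body, the loop always runs at least once. It matches the C for loop only when N is at least 1.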


ARM Subroutine Linkage

Branch and link instruction:
    BL foo

Copies the return address (the address of the instruction after the BL) to r14.

To return from subroutine:
    MOV r15,r14


Summary

All instructions are 32 bits long.
Load/store architecture
  Data processing instructions act only on registers.
  Specific memory access instructions with powerful auto-indexing addressing modes.
Most instructions operate in a single cycle.
  Some multi-register operations take longer.
All instructions can be executed conditionally.
