ARM Introduction & Instruction Set Architecture
Aleksandar Milenkovic
E-mail: [email protected]
Web: http://www.ece.uah.edu/~milenka
Outline
Ø ARM Architecture
Ø ARM Organization and Implementation
Ø ARM Instruction Set
Ø Thumb Instruction Set
Ø Architectural Support for System Development
Ø ARM Processor Cores
Ø Memory Hierarchy
Ø Architectural Support for Operating Systems
Ø ARM CPU Cores
Ø Embedded ARM Applications

ARM History
Ø ARM – Acorn RISC Machine (1983 – 1985)
  § Acorn Computers Limited, Cambridge, England
Ø ARM – Advanced RISC Machine, 1990
  § ARM Limited, 1990
  § ARM has been licensed to many semiconductor manufacturers

ARM’s visible registers
Ø User level
  § 15 GPRs, PC, CPSR (current program status register)
Ø Remaining registers are used for system-level programming and for handling exceptions
[Figure: register banks by processor mode]
  usable in user mode: r0 to r14, r15 (PC), CPSR
  system modes only:
    fiq mode: r8_fiq to r14_fiq, SPSR_fiq
    svc mode: r13_svc, r14_svc, SPSR_svc
    abort mode: r13_abt, r14_abt, SPSR_abt
    irq mode: r13_irq, r14_irq, SPSR_irq
    undefined mode: r13_und, r14_und, SPSR_und

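The banking shown in the register figure can be sketched as a lookup from (mode, register number) to the physical register that is actually accessed. This is a minimal illustration; the function name `physical_register` and the mode labels are hypothetical, chosen to match the suffixes in the figure.

```python
# Registers that each mode replaces with its own banked copy (from the figure).
BANKED = {
    "user": set(),
    "fiq": set(range(8, 15)),   # r8_fiq .. r14_fiq
    "irq": {13, 14},
    "svc": {13, 14},
    "abt": {13, 14},
    "und": {13, 14},
}

def physical_register(mode, n):
    """Return the name of the register that rN refers to in the given mode."""
    if not 0 <= n <= 15:
        raise ValueError("register number must be 0..15")
    if n in BANKED[mode]:
        return f"r{n}_{mode}"
    return f"r{n}"              # shared with user mode (r15 is the PC)

print(physical_register("fiq", 8))    # r8_fiq
print(physical_register("irq", 8))    # r8
print(physical_register("svc", 13))   # r13_svc
```

Counting the distinct names over all modes gives 31 general registers (including the PC), in addition to the CPSR and the five SPSRs.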
ARM CPSR format
Ø N (Negative), Z (Zero), C (Carry), V (oVerflow)
Ø mode – control processor mode
Ø T – control instruction set
  § T = 1 – instruction stream is 16-bit Thumb instructions
  § T = 0 – instruction stream is 32-bit ARM instructions
Ø I, F – interrupt enables (setting I disables IRQ, setting F disables FIQ)
Bit layout:
  bits 31–28: N Z C V
  bits 27–8: unused
  bit 7: I
  bit 6: F
  bit 5: T
  bits 4–0: mode

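The CPSR bit layout above translates directly into shifts and masks. The sketch below is illustrative only; `decode_cpsr` is a hypothetical helper name, and the example value is arbitrary.

```python
def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into the fields shown on the slide."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,
        "F": (cpsr >> 6) & 1,
        "T": (cpsr >> 5) & 1,
        "mode": cpsr & 0x1F,      # bits 4..0
    }

fields = decode_cpsr(0x600000D3)
print(fields)
```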
ARM memory organization
Ø Linear array of bytes numbered from 0 to 2^32 – 1
Ø Data items
  § bytes (8 bits)
  § half-words (16 bits) – always aligned to 2-byte boundaries (start at an even byte address)
  § words (32 bits) – always aligned to 4-byte boundaries (start at a byte address which is a multiple of 4)
[Figure: memory map, 32 bits wide (bit 31 to bit 0), byte addresses 0 to 23, showing byte0 to byte3 at addresses 0 to 3, half-word4, byte6, word8, half-word12, half-word14 and word16]

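The alignment rules above amount to a simple divisibility check. A minimal sketch, using the addresses labelled in the memory figure as examples (`is_aligned` is a hypothetical helper name):

```python
SIZES = {"byte": 1, "half-word": 2, "word": 4}   # sizes in bytes

def is_aligned(address, kind):
    """True if an item of the given kind may start at this byte address."""
    return address % SIZES[kind] == 0

print(is_aligned(8, "word"))        # word8 in the figure
print(is_aligned(14, "half-word"))  # half-word14 in the figure
print(is_aligned(6, "word"))        # address 6 is not a multiple of 4
```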
ARM instruction set
Ø Load-store architecture
  § operands are in GPRs
  § load/store – only instructions that operate with memory
Ø Instructions
  § Data Processing – use and change only register values
  § Data Transfer – copy memory values into registers (load) or copy register values into memory (store)
  § Control Flow
    o branch
    o branch-and-link – save return address to resume the original sequence
    o trapping into system code – supervisor calls

ARM instruction set (cont’d)
Ø Three-address data processing instructions
Ø Conditional execution of every instruction
Ø Powerful load/store multiple register instructions
Ø Ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle
Ø Open instruction set extension through coprocessor instruction set, including adding new registers and data types to the programmer’s model
Ø Very dense 16-bit compressed representation of the instruction set in the Thumb architecture

I/O system
Ø I/O is memory mapped
  § internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM’s memory map and may be read and written using the load-store instructions
Ø Peripherals may use either the normal interrupt (IRQ) or fast interrupt (FIQ) input
  § normally most interrupt sources share the IRQ input, while just one or two time-critical sources are connected to the FIQ input
Ø Some systems may include external DMA hardware to handle high-bandwidth I/O traffic

ARM exceptions
Ø ARM supports a range of interrupts, traps, and supervisor calls – all are grouped under the general heading of exceptions
Ø Handling exceptions
  § current state is saved by copying the PC into r14_exc and CPSR into SPSR_exc (exc stands for exception type)
  § processor operating mode is changed to the appropriate exception mode
  § PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception
  § instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
  § return: restore the user registers and then restore PC and CPSR atomically

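The entry and return steps listed above can be sketched as operations on a small state dictionary. The slide only gives the vector range (0x00 to 0x1C); the individual vector addresses and modes below follow the standard ARM vector table and are an addition here. The model is deliberately simplified: it ignores return-address offsets and interrupt masking, and the names `enter_exception` and `return_from_exception` are hypothetical.

```python
# exception kind -> (exception mode suffix, vector address); standard ARM table
VECTORS = {
    "reset":          ("svc", 0x00),
    "undefined":      ("und", 0x04),
    "swi":            ("svc", 0x08),
    "prefetch_abort": ("abt", 0x0C),
    "data_abort":     ("abt", 0x10),
    "irq":            ("irq", 0x18),
    "fiq":            ("fiq", 0x1C),
}

def enter_exception(state, kind):
    mode, vector = VECTORS[kind]
    new = dict(state)
    new[f"r14_{mode}"] = state["pc"]       # save PC in r14_exc
    new[f"SPSR_{mode}"] = state["cpsr"]    # save CPSR in SPSR_exc
    new["cpsr"] = {**state["cpsr"], "mode": mode}
    new["pc"] = vector                     # force PC to the vector address
    return new

def return_from_exception(state):
    mode = state["cpsr"]["mode"]
    new = dict(state)
    # Restore PC and CPSR together, as the slide requires.
    new["pc"], new["cpsr"] = state[f"r14_{mode}"], state[f"SPSR_{mode}"]
    return new

user = {"pc": 0x8000, "cpsr": {"flags": 0b0110, "mode": "user"}}
handler = enter_exception(user, "irq")
print(hex(handler["pc"]), handler["cpsr"]["mode"])
```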
ARM cross-development toolkit
Ø Software development
  § tools developed by ARM Limited
  § public domain tools (ARM back end for gcc C compiler)
Ø Cross-development
  § tools run on a different architecture from the one for which they produce code
[Figure: toolkit structure. C source goes through the C compiler and asm source through the assembler to produce .aof files; the linker combines these with C libraries and object libraries into an .axf file; ARMsd debugs it on the ARMulator (system model) or on a development board]

Outline
Ø ARM Architecture
Ø ARM Assembly Language Programming
Ø ARM Organization and Implementation
Ø ARM Instruction Set
Ø Architectural Support for High-level Languages
Ø Thumb Instruction Set
Ø Architectural Support for System Development
Ø ARM Processor Cores
Ø Memory Hierarchy
Ø Architectural Support for Operating Systems
Ø ARM CPU Cores
Ø Embedded ARM Applications

ARM Instruction Set
Ø Data Processing Instructions
Ø Data Transfer Instructions
Ø Control flow Instructions

Data Processing Instructions
Ø Classes of data processing instructions
  § Arithmetic operations
  § Bit-wise logical operations
  § Register-movement operations
  § Comparison operations
Ø Operands: 32 bits wide; there are 3 ways to specify operands
  § come from registers
  § the second operand may be a constant (immediate)
  § shifted register operand
Ø Result: 32 bits wide, placed in a register
  § long multiply produces a 64-bit result

ARM shift operations
Ø LSL – Logical Shift Left
Ø LSR – Logical Shift Right
Ø ASR – Arithmetic Shift Right
Ø ROR – Rotate Right
Ø RRX – Rotate Right Extended by 1 place
[Figure: 32-bit operand (bit 31 to bit 0) for each shift. LSL #5 and LSR #5 fill the vacated bits with 00000; ASR #5 fills with 1s for a negative operand and 0s for a positive operand; ROR #5 wraps the low bits round to the top; RRX rotates right by one place through the carry flag C]

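The five shift operations can be modelled on 32-bit values as follows. This is a behavioural sketch for shift amounts of 1 to 31, not a description of the hardware; the function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    return (x << n) & MASK32                 # zeros enter from the right

def lsr(x, n):
    return (x & MASK32) >> n                 # zeros enter from the left

def asr(x, n):
    x &= MASK32
    if x & 0x80000000:                       # negative: replicate the sign bit
        x -= 1 << 32
    return (x >> n) & MASK32

def ror(x, n):
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def rrx(x, carry_in):
    """Rotate right by one place through carry; returns (result, carry_out)."""
    x &= MASK32
    return (carry_in << 31) | (x >> 1), x & 1

print(hex(asr(0xF000000F, 5)), hex(asr(0x7000000F, 5)))
```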
Setting the condition codes
Ø Any DPI can set the condition codes (N, Z, V, and C)
  § for all DPIs except the comparison operations a specific request must be made
  § at the assembly language level this request is indicated by adding an `S` to the opcode
  § Example: 64-bit addition (r3-r2 := r1-r0 + r3-r2)
      ADDS r2, r2, r0 ; carry out to C
      ADC  r3, r3, r1 ; ... add into high word
Ø Arithmetic operations set all the flags (N, Z, C, and V)
Ø Logical and move operations set N and Z
  § they preserve V, and either preserve C when there is no shift operation or set C according to the shift operation (the last bit shifted out)

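The ADDS/ADC pair in the example can be modelled by an adder that returns the flags along with the result. A minimal sketch; `adds` and `add64` are hypothetical helper names, and the register comments show which instruction each call stands for.

```python
MASK32 = 0xFFFFFFFF

def adds(a, b, carry_in=0):
    """32-bit add returning (result, flags), as an S-suffixed arithmetic DPI would."""
    a &= MASK32
    b &= MASK32
    total = a + b + carry_in
    result = total & MASK32
    flags = {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": total >> 32,                           # carry out of bit 31
        "V": ((a ^ result) & (b ^ result)) >> 31,   # signed overflow
    }
    return result, flags

def add64(lo_a, hi_a, lo_b, hi_b):
    lo, flags = adds(lo_a, lo_b)            # ADDS r2, r2, r0
    hi, _ = adds(hi_a, hi_b, flags["C"])    # ADC  r3, r3, r1
    return lo, hi

print(add64(0xFFFFFFFF, 0, 1, 0))
```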
Multiplies
Ø Example (Multiply, Multiply-Accumulate)
    MUL r4, r3, r2     ; r4 := [r3 x r2] <31:0>
    MLA r4, r3, r2, r1 ; r4 := [r3 x r2 + r1] <31:0>
Ø Note
  § least significant 32 bits are placed in the result register, the rest are ignored
  § immediate second operand is not supported
  § result register must not be the same as the first source register
  § if the `S` bit is set, V is preserved and C is rendered meaningless
Ø Example (r0 = r0 x 35)
    ADD r0, r0, r0, LSL #2 ; r0’ = r0 x 5
    RSB r0, r0, r0, LSL #3 ; r0’’ = 7 x r0’

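The multiply-by-35 example works because 35 = 5 x 7, and both factors are one shift away from a power of two. The sketch below assumes the second step is a reverse subtract of r0 from r0 shifted left by 3, which is what the comment r0’’ = 7 x r0’ implies. `times35`, `mul` and `mla` are hypothetical names.

```python
MASK32 = 0xFFFFFFFF

def mul(rm, rs):
    return (rm * rs) & MASK32               # MUL: keep bits <31:0> only

def mla(rm, rs, rn):
    return (rm * rs + rn) & MASK32          # MLA: multiply-accumulate

def times35(r0):
    r0 = (r0 + (r0 << 2)) & MASK32          # ADD r0, r0, r0, LSL #2  -> r0 x 5
    r0 = ((r0 << 3) - r0) & MASK32          # RSB r0, r0, r0, LSL #3  -> 7 x (r0 x 5)
    return r0

print(times35(3), mul(3, 35))
```

Both routes agree modulo 2^32, so the shift-and-add sequence is a drop-in replacement for a multiply by the constant.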
Data transfer instructions
Ø Single register load and store instructions
  § transfer of a data item (byte, half-word, word) between ARM registers and memory
Ø Multiple register load and store instructions
  § enable transfer of large quantities of data
  § used for procedure entry and exit, to save/restore workspace registers, to copy blocks of data around memory
Ø Single register swap instructions
  § allow exchange between a register and memory in one instruction
  § used to implement semaphores to ensure mutual exclusion

Multiple register data transfers
Note: any subset (or all) of the registers may be transferred with a single instruction
Note: the order of registers within the list is insignificant
Note: including r15 in the list will cause a change in the control flow
Ø Stack organizations
  § FA – full ascending
  § EA – empty ascending
  § FD – full descending
  § ED – empty descending

Multiple register transfer addressing modes
[Figure: four memory diagrams, each with addresses 0x1000, 0x100c and 0x1018 marked, showing where r0, r1 and r5 are stored and the initial (r9) and updated (r9’) base pointer for:
  STMIA r9!, {r0,r1,r5}
  STMIB r9!, {r0,r1,r5}
  STMDA r9!, {r0,r1,r5}
  STMDB r9!, {r0,r1,r5}]

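The four addressing modes differ only in where the block starts relative to the base and in the direction the base moves. The sketch below assumes the base r9 starts at 0x100c, which is consistent with the addresses marked in the figure but is not stated in the extracted text. The lowest-numbered register always goes to the lowest address. `stm` is a hypothetical helper name.

```python
def stm(mode, base, regs):
    """Return ({address: register}, updated_base) for STM<mode> base!, {regs}."""
    regs = sorted(regs)                     # register-list order is insignificant
    size = 4 * len(regs)
    if mode == "IA":                        # increment after
        start, new_base = base, base + size
    elif mode == "IB":                      # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":                      # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":                      # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError(mode)
    layout = {start + 4 * i: f"r{r}" for i, r in enumerate(regs)}
    return layout, new_base

for mode in ("IA", "IB", "DA", "DB"):
    layout, r9 = stm(mode, 0x100C, [0, 1, 5])
    print(mode, {hex(a): r for a, r in layout.items()}, hex(r9))
```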
The mapping between the stack and block copy views

                     | Ascending       | Descending
                     | Full  | Empty   | Full  | Empty
Increment | Before   | STMIB |         |       | LDMIB
          |          | STMFA |         |       | LDMED
          | After    |       | STMIA   | LDMIA |
          |          |       | STMEA   | LDMFD |
Decrement | Before   |       | LDMDB   | STMDB |
          |          |       | LDMEA   | STMFD |
          | After    | LDMDA |         |       | STMDA
          |          | LDMFA |         |       | STMED

Control flow instructions

Branch    | Interpretation             | Normal uses
B, BAL    | Unconditional, Always      | Always take this branch
BEQ       | Equal                      | Comparison equal or zero result
BNE       | Not equal                  | Comparison not equal or non-zero result
BPL       | Plus                       | Result positive or zero
BMI       | Minus                      | Result minus or negative
BCC, BLO  | Carry clear, Lower         | Arithmetic operation did not give carry-out; unsigned comparison gave lower
BCS, BHS  | Carry set, Higher or same  | Arithmetic operation gave carry-out; unsigned comparison gave higher or same
BVC       | Overflow clear             | Signed integer operation; no overflow occurred
BVS       | Overflow set               | Signed integer operation; overflow occurred
BGT       | Greater than               | Signed integer comparison gave greater than
BGE       | Greater or equal           | Signed integer comparison gave greater or equal
BLT       | Less than                  | Signed integer comparison gave less than
BLE       | Less or equal              | Signed integer comparison gave less than or equal
BHI       | Higher                     | Unsigned comparison gave higher
BLS       | Lower or same              | Unsigned comparison gave lower or same

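Each branch condition in the table is a test on the N, Z, C and V flags. The flag expressions below are the standard ARM condition definitions; they are not listed on the slide and are added here for illustration. `cmp_flags` and `branch_taken` are hypothetical names.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags from comparing a with b (a - b is computed, the result discarded)."""
    a &= MASK32
    b &= MASK32
    total = a + (~b & MASK32) + 1
    result = total & MASK32
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": total >> 32,                        # 1 means no borrow
        "V": ((a ^ b) & (a ^ result)) >> 31,     # signed overflow
    }

CONDITIONS = {
    "AL": lambda f: True,
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "PL": lambda f: f["N"] == 0,
    "MI": lambda f: f["N"] == 1,
    "CC": lambda f: f["C"] == 0,
    "CS": lambda f: f["C"] == 1,
    "VC": lambda f: f["V"] == 0,
    "VS": lambda f: f["V"] == 1,
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
}
CONDITIONS["LO"] = CONDITIONS["CC"]              # alternative mnemonics
CONDITIONS["HS"] = CONDITIONS["CS"]

def branch_taken(cond, flags):
    return CONDITIONS[cond](flags)

flags = cmp_flags(0xFFFFFFFF, 1)   # unsigned: higher; signed: -1 < 1
print(branch_taken("HI", flags), branch_taken("LT", flags))
```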
Conditional execution
Ø Conditional execution to avoid branch instructions used to skip a small number of non-branch instructions
Ø Example

stage and r15 = PC + 4?
Ø incompatibilities between 3-stage and 5-stage implementations => unacceptable
Ø to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

Data processing instruction datapath activity (Ex)
[Figure: datapath with address register, incrementer, register bank (Rd, Rn, Rm, PC), multiplier, data out, data in and instruction pipe. (a) register – register operations; (b) register – immediate operations, with the immediate taken from instruction bits [7:0]]
Ø Reg-Reg
  § Rd = Rn op Rm
  § r15 = AR + 4; AR = AR + 4
Ø Reg-Imm
  § Rd = Rn op Imm
  § r15 = AR + 4; AR = AR + 4

STR (store register) datapath activity (Ex1, Ex2)
[Figure: datapath with address register, incrementer, register bank, shifter, ALU, multiplier, data out, data in and instruction pipe. (a) 1st cycle – compute address, using Rn and instruction bits [11:0] through lsl #0, ALU = A / A + B / A - B; (b) 2nd cycle – store data & auto-index, using Rn and Rd, ALU = A + B / A - B, with byte selection on data out]
Ø Compute address (Ex1)
  § AR = Rn op Disp
  § r15 = AR + 4
Ø Store data (Ex2)
  § AR = PC
  § mem[AR] = Rd<x:y>
  § if autoindexing => Rn = Rn +/- 4

The first two (of three) cycles of a branch instruction
[Figure: datapath with address register, incrementer, register bank, shifter, ALU, multiplier, data out, data in and instruction pipe. (a) 1st cycle – compute branch target, using PC and instruction bits [23:0] through lsl #2, ALU = A + B; (b) 2nd cycle – save return address, PC copied to R14, ALU = A]
Third cycle: apply a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch.

ARM Implementation
Ø Datapath
  § RTL (Register Transfer Level)
Ø Control unit
  § FSM (Finite State Machine)

2-phase non-overlapping clock scheme
Ø Most ARMs do not operate on edge-sensitive registers
Ø Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Ø Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2
[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle]

ARM datapath timing
Ø Register read
  § Register read buses – dynamic, precharged during phase 2
  § During phase 1 selected registers discharge the read buses, which become valid early in phase 1
Ø Shift operation
  § second operand passes through barrel shifter
Ø ALU operation
  § ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  § ALU processes the operands during phase 2, producing the valid output towards the end of the phase
  § the result is latched in the destination register at the end of phase 2

ARM datapath timing (cont’d)
[Figure: timing diagram over phase 1 and phase 2, showing register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, and the precharge that invalidates the buses]
Minimum Datapath Delay = Register read time
  + Shifter delay
  + ALU delay
  + Register write set-up time
  + Phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder
Ø Carry logic: use CMOS AOI (And-Or-Invert) gate
Ø Even bits use the circuit shown below
Ø Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around

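The ARM2 ALU logic for one result bit
Ø ALU functions
  § data operations (add, sub, ...)
  § address computations for memory accesses
  § branch target computations
  § bit-wise logical operations
  § ...
[Figure: ALU logic for one result bit, with NA bus and NB bus inputs, function select lines fs0 to fs5, carry logic with G and P signals, and the ALU bus output]
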
ARM2 ALU function codes

fs5 fs4 fs3 fs2 fs1 fs0 | ALU output
 0   0   0   1   0   0  | A and B
 0   0   1   0   0   0  | A and not B
 0   0   1   0   0   1  | A xor B
 0   1   1   0   0   1  | A plus not B plus carry
 0   1   0   1   1   0  | A plus B plus carry
 1   1   0   1   1   0  | not A plus B plus carry
 0   0   0   0   0   0  | A
 0   0   0   0   0   1  | A or B
 0   0   0   1   0   1  | B
 0   0   1   0   1   0  | not B
 0   0   1   1   0   0  | zero

The ARM6 carry-select adder scheme
Ø Compute sums of various fields of the word for carry-in of zero and carry-in of one
Ø Final result is selected by using the correct carry-in value to control a multiplexor
[Figure: adders on the fields a,b[3:0] up to a,b[31:28] each produce s and s+1 (+, +1); multiplexors controlled by the carry c select sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]
Worst case: O(log2[word width]) gates long
Note: Be careful! Fan-out on some of these gates is high, so direct comparison with previous schemes is not applicable.

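The carry-select idea can be sketched functionally: for each field, both candidate sums are formed, and the incoming carry picks one. The field boundaries below follow the sum[3:0], sum[7:4], sum[15:8], sum[31:16] labels in the figure. The loop is a sequential model of selection logic that the hardware evaluates in parallel, and `carry_select_add` is a hypothetical name.

```python
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]      # (lowest bit, width) per field

def carry_select_add(a, b, carry_in=0):
    """32-bit add built from per-field sums for carry-in 0 and carry-in 1."""
    result, carry = 0, carry_in
    for low, width in FIELDS:
        mask = (1 << width) - 1
        fa, fb = (a >> low) & mask, (b >> low) & mask
        sum0 = fa + fb                           # candidate for carry-in = 0
        sum1 = fa + fb + 1                       # candidate for carry-in = 1
        chosen = sum1 if carry else sum0         # the multiplexor
        result |= (chosen & mask) << low
        carry = chosen >> width                  # carry into the next field
    return result, carry

print(carry_select_add(0xFFFFFFFF, 1))
```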
The ARM6 ALU organization
Ø Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

The cross-bar switch barrel shifter
Ø Shifter delay is critical since it contributes directly to the datapath cycle time
Ø Cross-bar switch matrix (32 x 32)
Ø Principle for 4x4 matrix
[Figure: 4x4 switch matrix with inputs in[0] to in[3] and outputs out[0] to out[3]; the diagonals are labelled no shift, right 1, right 2, right 3, left 1, left 2 and left 3]

The cross-bar switch barrel shifter (cont’d)
Ø Precharged logic is used => each switch is a single NMOS transistor
Ø Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
Ø For rotate right, the right shift diagonal is enabled together with the complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
Ø Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately

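The 4x4 principle can be modelled by enabling one or two diagonals of the switch matrix. The sketch assumes index 0 is the least significant bit and that ‘right k’ connects out[j] to in[j+k]; these conventions and the names `crossbar`, `to_bits` and `from_bits` are illustrative assumptions.

```python
WIDTH = 4

def to_bits(value):
    return [(value >> i) & 1 for i in range(WIDTH)]      # index 0 = bit 0

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def crossbar(value, diagonals):
    """Apply the enabled diagonals, e.g. [("right", 1)] or [("right", 1), ("left", 3)]."""
    inputs = to_bits(value)
    outputs = [0] * WIDTH                                # precharged to logic 0
    for direction, k in diagonals:
        for j in range(WIDTH):
            i = j + k if direction == "right" else j - k
            if 0 <= i < WIDTH:                           # a switch exists here
                outputs[j] = inputs[i]
    return from_bits(outputs)

print(bin(crossbar(0b1011, [("right", 1)])))                  # logical shift right by 1
print(bin(crossbar(0b1011, [("right", 1), ("left", 3)])))     # rotate right by 1
```

Outputs that no enabled switch reaches keep their precharged 0, which is where the zero fill comes from.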
Multiplier design
Ø All ARMs apart from the first prototype have included support for integer multiplication
  § older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
  § recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
Ø Low cost implementation
  § use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle
  § use early termination to stop the iterations when there are no more ones in the multiply register

The 2-bit multiplication algorithm, Nth cycle
Ø Control settings for the Nth cycle of the multiplication
Ø Use existing shifter and ALU + additional hardware
  § dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth’s algorithm control logic (overhead is a few per cent on the area of ARM core)

Carry-in | Multiplier | Shift         | ALU   | Carry-out
0        | x0         | LSL #2N       | A + 0 | 0
0        | x1         | LSL #2N       | A + B | 0
0        | x2         | LSL #(2N + 1) | A – B | 1
0        | x3         | LSL #2N       | A – B | 1
1        | x0         | LSL #2N       | A + B | 0
1        | x1         | LSL #(2N + 1) | A + B | 0
1        | x2         | LSL #2N       | A – B | 1
1        | x3         | LSL #2N       | A + 0 | 1

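The control table can be exercised with a small simulation: each cycle consumes two multiplier bits, looks up the shift and ALU operation, and passes a carry to the next cycle. This is a sketch under assumptions: the result is kept to 32 bits, the loop stops early once the remaining multiplier bits and the carry are both zero, and the names `BOOTH_TABLE` and `mul_two_bits_per_cycle` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# (carry_in, multiplier digit) -> (extra shift, ALU sign, carry_out), from the table.
# ALU sign: +1 means A + B, -1 means A - B, 0 means A + 0.
BOOTH_TABLE = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}

def mul_two_bits_per_cycle(b, multiplier):
    """Return (low 32 bits of b * multiplier, number of cycles used)."""
    acc, carry, cycles = 0, 0, 0
    multiplier &= MASK32
    for n in range(16):                           # at most 16 two-bit digits
        if multiplier == 0 and carry == 0:
            break                                 # early termination
        digit = multiplier & 3
        extra, sign, carry = BOOTH_TABLE[(carry, digit)]
        acc = (acc + sign * (b << (2 * n + extra))) & MASK32
        multiplier >>= 2                          # two-bits-per-cycle shift register
        cycles += 1
    return acc, cycles

print(mul_two_bits_per_cycle(7, 35))
```

A carry left over after the sixteenth digit would have weight 2^32, so it does not affect a 32-bit result.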
High speed multiplication
Ø Where multiplication performance is very important, more hardware resources must be dedicated
  § in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP) – DSP programs are typically multiplication intensive
Ø Use intermediate results which include partial sums and partial carries
  § Carry-save adders are used for this
Ø These two binary results are added together at the end of multiplication
  § The main ALU is used for this

Carry-propagate (a) and carry-save (b) adder structures
Ø Carry-propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
Ø Carry-save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)
[Figure: (a) four full adders (inputs A, B, Cin; outputs Cout, S) with each Cout feeding the next Cin; (b) four full adders (inputs A, B, Cin; outputs Cout, S) with no carry chain between them]

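A carry-save step is a bitwise full adder with no carry propagation: the three inputs go in, and a sum word and a carry word come out. The sketch below adds one partial product per step and leaves the single carry-propagate addition to the end. It is a functional illustration only, and the names are hypothetical.

```python
def carry_save_add(partial_sum, partial_carry, b):
    """One carry-save layer: three inputs in, redundant (sum, carry) pair out."""
    s = partial_sum ^ partial_carry ^ b
    c = ((partial_sum & partial_carry) | (partial_sum & b) | (partial_carry & b)) << 1
    return s, c

def multiply_with_csa(a, b, bits=32):
    s = c = 0                                    # partial sum and carry cleared
    for i in range(bits):
        if (b >> i) & 1:
            s, c = carry_save_add(s, c, a << i)  # accumulate one partial product
    return s + c                                 # final carry-propagate addition

print(carry_save_add(0b1011, 0b0110, 0b0111), multiply_with_csa(13, 11))
```

The invariant is that sum + carry always equals the total of everything added so far, so only the last step needs a full adder with carry propagation.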
ARM high-speed multiplier organization
Ø CSA has 4 layers of adders each handling 2 multiplier bits => multiply 8 bits per clock cycle
Ø Partial sum and carry are cleared at the beginning or initialized to accumulate a value
Ø Multiplier is shifted right 8 bits per cycle in the ‘Rs’ register
Ø Partial sum and carry are rotated right 8 bits per cycle
Ø Performance: up to 4 clock cycles (early termination is possible)
Ø Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)