1
ARM Architecture & NEON
Ian Rickards
Stanford University 28 Apr 2010
2
A brief history of ARM
First ARM prototype came alive on 26-Apr-1985: 3um technology, 24800 transistors, 50mm2, consumed 120mW of power
Acorn’s commercial ARM2 processor: 8MHz, 26-bit addressing, 3-stage pipeline
ARM founded in October 1990 as a separate company (Apple had 43% stake)
ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994
3
ARM 25 years later: Cortex-A9 MP
1-4 way MP with optimized MESI
16KB, 32KB, 64KB I & D caches
128KB-8MB L2
Multi-issue, Speculation, Renaming, OOO
High performance FPU option
NEON SIMD option
Thumb-2
AXI bus
Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)
40G “Low Power” macro: ~5mm2, 800MHz, 0.5W
40G “High Performance” macro: ~7mm2 2GHz (typ), 2W
4
Cortex-A9 Processor Microarchitecture
Introduces out-of-order instruction issue and completion
Register renaming to enable execution speculation
Non-blocking memory system with load-store forwarding
Fast loop mode in instruction pre-fetch to lower power consumption
5
Cortex-A9 MPCore Multicore Structure
[Figure: block diagram. Four Cortex-A9 CPUs, each with FPU/NEON, TRACE, Instruction Cache and Data Cache. Snoop Control Unit (SCU) with Cache-2-Cache Transfers, Snoop Filtering, Generalized Interrupt Control and Distribution, Timers and Accelerator Coherence Port. Advanced Bus Interface Unit. L2 Cache Controller (PL310), 128K-8MB. Primary AMBA 3 64-bit interface, optional 2nd interface with address filtering.]
Configurable Between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
Design flexibility over memory throughput and latency
Secure and Virtualization aware interrupt and IPI communications
Coherent access to processor caches from accelerators and DMA
Hardware Coherence for Cache, MMU and TLB maintenance operations
Flexible configuration and power-aware interrupt controller
6
Hard Macro Configuration and Floorplan
Osprey configuration includes level 2 cache controller and Cortex A9 integration level
Top level includes Coresight PTM, CTI and CTM
Implementation using r1p1 version of Cortex A9
Dual core
32k I$ and D$
NEON present on both cores
PTM interface present
128 interrupts
ACP present
Two AXI master ports
Level 2 cache memories external (interface exposed)
Elba top level floorplan
falcon_cpu floorplan
7
Why is ARM looking at “G” processes?
“G” can achieve around double the clock frequency of “LP”
Active power is lower on “G” than “LP”
Example, Push 40LP to 800MHz, to compare with 800MHz MID macro
The estimated LP numbers correlate to an accelerated implementation of an A8
G is close in terms of power if lowered to same performance as on LP.
G can scale much higher in terms of performance than LP.
Key requirement is “run and power off” quickly
8
Understanding power
Fundamental power parameters
Average power => battery life
Thermal power => sustained power @ max performance
[Figure: power over time for a traditional LP process and for the 40G process (Osprey, 2-3x faster clock). Workloads shown: GUI updates, music, web page render, each followed by power off.]
9
Power Domains
HiP and MID macros have the same power domains
Both use distributed coarse-grain power switches
Power plan for CPUs is symmetric
A9 core and its L1 are power gated in lockstep
Note that all power domains are only ON or OFF; there is no hardware retention mode
Software routine enables retention to RAM
[Figure: “Osprey macro” power domains (A9_PL310, A9_PL310_noram): Data Engine 0; Data Engine 1; A9 CORE 0 + 32K I/D; A9 CORE 1 + 32K I/D; SCU + PL310_noram; PTM/Debug; L2 Cache RAM 512/1024KB]
10
Single-thread Coremarks/MHz
Single-thread performance is key for GUI based applications
[Figure: bar chart of Coremarks/MHz, axis 0.00 to 3.00]
Processors: 74K, 1004K, Cortex-A8, Cortex-A9, Atom
Values: 2.30, 2.33, 2.72, 2.95, 1.85
11
Floating Point Performance
Intel
12
Higher Flash ActionScript performance from A9
13
ARM Architecture evolution
Some not-entirely-RISC features
LDM / STM
Full predicated execution (ADDEQ r0, r1, r2)
Carefully designed with customer/partner input considering gatecount
Thumb
16-bit instruction set (mostly using r0-r7) selected for compiler requirements
Design goals: performance from 16-bit wide ROM, codesize
Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)
Beneficial today – better performance from small caches
Jazelle
CPU mode allows direct execution of Java bytecodes
~60% of Java bytecodes directly executed by datapath
Top of Java stack stored in registers
Widely used in Nokia & DoCoMo handsets
…
14
Dummies’ guide to Si implementation
Basic Fab tech
65nm, 40nm, 32nm, 28nm, etc.
G vs. LP technology
40G is 0.9V process, 40LP is 1.1V process
Much lower leakage with LP, but half the performance
Intermediate “LPG” from TSMC too! Island of G within LP
Vt’s – each Vt requires additional mask step
HVt – lower leakage, but slower
RVt – regular Vt
LVt – faster, but high leakage esp. at high temperature
Cell library track size
9-track, 12-track, 15-track (bigger => more powerful)
Backed off implementation vs. pushed implementation
High-K metal Gate
Clock gating
Well biasing…
15
ARM Architecture Evolution
[Figure: Key Technology Additions by Architecture Generation]
Generations: V5, V6, V7 A&R, V7 M (cores shown: ARM9, ARM10, ARM11)
Additions shown: Jazelle®, VFPv2, SIMD, Thumb®-2, TrustZone™, NEON™ Adv SIMD, VFPv3, Thumb-EE, Thumb-2 Only
Captions: Improved Media and DSP; Low Cost MCU; Execution Environments: Improved memory use
16
What is NEON?
NEON is a wide SIMD data processing architecture
Extension of the ARM instruction set
32 registers, 64 bits wide (dual view as 16 registers, 128 bits wide)
NEON instructions perform “Packed SIMD” processing
Registers are considered as vectors of elements of the same data type
Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single-precision float
Instructions perform the same operation in all lanes
[Figure: source registers Dn and Dm, each divided into elements; the operation is applied lane by lane to produce the elements of destination register Dd]
17
Data Types
NEON natively supports a set of common data types
Integer and Fixed-Point: 8-bit, 16-bit, 32-bit and 64-bit
32-bit Single-precision Floating-point
Data types are represented using a bit-size and format letter

Size   Any   Integer   Signed   Unsigned   Polynomial   Float
8      .8    .I8       .S8      .U8        .P8
16     .16   .I16      .S16     .U16       .P16
32     .32   .I32      .S32     .U32                    .F32
64     .64   .I64      .S64     .U64
18
Registers
NEON provides a 256-byte register file
Distinct from the core registers
Extension to the VFPv2 register file (VFPv3)
Two explicitly aliased views
32 x 64-bit registers (D0-D31)
16 x 128-bit registers (Q0-Q15)
Enables register trade-off
Vector length
Available registers
Also uses the summary flags in the VFP FPSCR
Adds a QC integer saturation summary flag
No per-lane flags, so ‘carry’ handled using wider result (16bit+16bit -> 32-bit)
[Figure: register aliasing. Q0 = D0:D1, Q1 = D2:D3, ..., Q15 = D30:D31]
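The aliasing between the two register views can be modelled in portable C. This is only a sketch of the layout, not NEON code: the `neon_regfile` union and the helpers `q_low_d_index` and `demo_alias` are hypothetical names introduced here for illustration, and the model assumes nothing beyond the Qn = D(2n):D(2n+1) mapping described above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the 256-byte NEON register file. */
typedef union {
    uint8_t  bytes[256];
    uint64_t d[32];      /* D0-D31: 32 x 64-bit view  */
    uint8_t  q[16][16];  /* Q0-Q15: 16 x 128-bit view */
} neon_regfile;

/* Index of the D register that holds the low half of Qn.
 * The high half is the next D register. */
int q_low_d_index(int n)
{
    return 2 * n;
}

/* Fill D3 through the D view, then read one byte back through the Q view. */
uint8_t demo_alias(void)
{
    neon_regfile rf;
    memset(&rf, 0, sizeof rf);
    memset(&rf.d[3], 0xAB, sizeof rf.d[3]);  /* D3 is the upper half of Q1 */
    return rf.q[1][8];                       /* first byte of Q1's upper half */
}
```

The trade-off named on the slide follows directly: code that uses Q1 as one 128-bit vector gives up D2 and D3 as independent 64-bit registers.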
19
Vectors and Scalars
Registers hold one or more elements of the same data type
Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
A register, data type combination describes a vector of elements
Some instructions can reference individual scalar elements
Scalar elements are referenced using the array notation Vn[x]
Array ordering is always from the least significant bit
[Figure: 64-bit Dn (bits 63..0) and 128-bit Qn (bits 127..0) registers viewed as vectors, e.g. D0 as I64, Dn as 2 x S32, Qn as 16 x S8, Q0 as 4 x F32; the scalar elements of Q0 are Q0[0], Q0[1], Q0[2], Q0[3]]
20
NEON in Audio
FFT: 256-point, 16-bit signed complex numbers
FFT is a key component of AAC, Voice/pattern recognition etc.
Hand optimized assembler in both cases
Extreme example: FFT in ffmpeg: 12x faster
C code -> handwritten asm
Scalar -> vector processing
Unpipelined FPU -> pipelined NEON single precision FPU
FFT time (Cortex-A8 500MHz, actual silicon)
No NEON (v6 SIMD asm):     15.2 us
With NEON (v7 NEON asm):    3.8 us (x 4.0 performance)
21
How to use NEON
OpenMAX DL library
Library of common codec components and signal processing routines
Status: Released on http://www.arm.com/products/esd/openmax_home.html
Vectorizing Compilers
Exploit NEON SIMD automatically with existing source code
Status: Released (in RVDS 3.1 Professional and later)
Status: Codesourcery 2007q3 gcc and later
C Intrinsics
C function call interface to NEON operations
Supports all data types and operations supported by NEON
Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
Assembler
For those who really want to optimize at the lowest level
Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)
22
For NEON instruction reference
Official NEON instruction Set reference is “Advanced SIMD” in
ARM Architecture Reference Manual v7 A & R edition
Available to partners & www.arm.com request system
23
ARM RVDS & gcc vectorising compiler
armcc generates better NEON code
(gcc can use Q-regs with ‘-mvectorize-with-neon-quad’ )
Source (test.c):
int a[256], b[256], c[256];
void foo(void) {
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = b[i] + c[i];
    }
}

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c
|L1.16|
    VLD1.32  {d0,d1},[r0]!
    SUBS     r3,r3,#1
    VLD1.32  {d2,d3},[r1]!
    VADD.I32 q0,q0,q1
    VST1.32  {d0,d1},[r2]!
    BNE      |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c
.L2:
    add      r1, r0, ip
    add      r3, r0, lr
    add      r2, r0, r4
    add      r0, r0, #8
    cmp      r0, #1024
    fldd     d7, [r3, #0]
    fldd     d6, [r2, #0]
    vadd.i32 d7, d7, d6
    fstd     d7, [r1, #0]
    bne      .L2
24
Intrinsics
Include intrinsics header file
#include <arm_neon.h>
Use special NEON data types which correspond to D and Q registers, e.g.
int8x8_t D-register containing 8x 8-bit elements
int16x4_t D-register containing 4x 16-bit elements
int32x4_t Q-register containing 4x 32-bit elements
Use special intrinsics versions of NEON instructions
vin1 = vld1q_s32(ptr);
vout = vaddq_s32(vin1, vin2);
vst1q_s32(ptr, vout);
Strongly typed!
Use vreinterpret_s16_s32( ) to change the type
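The intrinsics above can be put together into a complete loop. The following is a minimal sketch rather than tuned code: the function name `add_s32` is a hypothetical helper, the NEON path is compiled only when the compiler defines `__ARM_NEON` (or the older `__ARM_NEON__`), and a scalar loop covers the tail and non-NEON builds.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n 32-bit integers. */
void add_s32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four lanes per iteration using the Q-register intrinsics. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(dst + i, vaddq_s32(va, vb));  /* pointer first, then value */
    }
#endif

    /* Scalar tail; this is the whole loop when NEON is not available.
     * Unsigned arithmetic gives the same wrap-around as VADD.I32. */
    for (; i < n; i++) {
        dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
    }
}
```

Because the types are strongly typed, passing an `int16x8_t` to `vaddq_s32` is a compile error; that is where the `vreinterpret` family comes in.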
25
NEON in opensource
Bluez – official Linux Bluetooth protocol stack
NEON sbc audio encoder
Pixman (part of cairo 2D graphics library)
Compositing/alpha blending
X.Org, Mozilla Firefox, fennec, & Webkit browsers
e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
ffmpeg – libavcodec
LGPL media player used in many Linux distros
NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
NEON Audio: AAC, Vorbis, WMA
x264 – Google Summer Of Code 2009
GPL H.264 encoder – e.g. for video conferencing
Android – NEON optimizations
Skia library, S32A_D565_Opaque 5x faster using NEON
Available in Google Skia tree from 03-Aug-2009
Eigen2 linear algebra library
Ubuntu 09.04 – supports NEON
NEON versions of critical shared-libraries
26
Many different levels of parallelism
Multi-issue parallelism
NEON SIMD parallelism
Multi-core parallelism
27
ffmpeg (libavcodec) performance
git.ffmpeg.org
snapshot 21-Sep-09
YouTube HQ video decode
480x270, 30fps
Including AAC audio
Real silicon measurements
OMAP3 Beagleboard
ARM A9TC
NEON ~2x overall performance
28
Scalability with SMP on Cortex-A9
29
NEON optimization example
30
Skia library S32A_D565_Opaque
Size   Reference C   Google v6 asm   NEON asm   RVDS
60     100%          128%            24%        64%
64     100%          128%            22%        68%
68     100%          127%            23%        63%
980    100%          73%             23%        58%
986    100%          73%             23%        58%
31
Processing code
vmovn.u16 d4, q12
vshr.u16 q11, q12, #5
vshr.u16 q10, q12, #6+5
vmovn.u16 d5, q11
vmovn.u16 d6, q10
vshl.u8 d4, d4, #3
vshl.u8 d5, d5, #2
vshl.u8 d6, d6, #3
vmovl.u8 q14, d31
vmovl.u8 q13, d31
vmovl.u8 q12, d31
vmvn.8 d30, d3
vmlal.u8 q14, d30, d6
vmlal.u8 q13, d30, d5
vmlal.u8 q12, d30, d4
vshr.u16 q8, q14, #5
vshr.u16 q9, q13, #6
vaddhn.u16 d6, q14, q8
vshr.u16 q8, q12, #5
vaddhn.u16 d5, q13, q9
vqadd.u8 d6, d6, d0
vaddhn.u16 d4, q12, q8
vqadd.u8 d6, d6, d0
vqadd.u8 d5, d5, d1
vqadd.u8 d4, d4, d2
vshll.u8 q10, d6, #8
vshll.u8 q3, d5, #8
vshll.u8 q2, d4, #8
vsri.u16 q10, q3, #5
vsri.u16 q10, q2, #11
32
Cortex-A8 TRM
33
Quick review of NEON instructions
34
Multiple 1-Element Structure Access
VLD1, VST1 provide standard array access
An array of structures containing a single component is a basic array
List can contain 1, 2, 3 or 4 consecutive registers
Transfer multiple consecutive 8, 16, 32 or 64-bit elements
VST1.16 {D3,D4}, [R1]
[Figure: D3 = x0 x1 x2 x3 and D4 = x4 x5 x6 x7 are stored as x0..x7 at [R1], [R1]+2, ..., [R1]+14]
VLD1.16 {D7}, [R4], R3
[Figure: x0..x3 at [R4], [R4]+2, [R4]+4, [R4]+6 are loaded into D7; R4 is then incremented by R3]
35
Addition: Basic
NEON supports various useful forms of basic addition
Normal Addition - VADD, VSUB
Floating-point
Integer (8-bit to 64-bit elements)
64-bit and 128-bit registers
Long Addition - VADDL, VSUBL
Promotes both inputs before operation
Signed/unsigned (8-bit to 32-bit source elements)
Wide Addition - VADDW, VSUBW
Promotes one input before operation
Signed/unsigned (8-bit to 32-bit source elements)
VADD.I16 D0, D1, D2
VSUB.F32 Q7, Q1, Q4
VADD.I8 Q15, Q14, Q15
VSUB.I64 D0, D30, D5
VADDW.U8 Q1, Q7, D8
VSUBW.S16 Q8, Q1, D5
VADDL.U16 Q1, D7, D8
VSUBL.S32 Q8, D1, D5
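The difference between the normal, long and wide forms is easiest to see in a scalar model. The sketch below is plain C, not NEON code; the `model_*` function names are hypothetical and each function mimics one 8-lane instruction on arrays instead of registers.

```c
#include <stdint.h>

/* Model of VADD.I8 Dd, Dn, Dm: 8 lanes, each result wraps modulo 256. */
void model_vadd_i8(uint8_t d[8], const uint8_t n[8], const uint8_t m[8])
{
    for (int i = 0; i < 8; i++) {
        d[i] = (uint8_t)(n[i] + m[i]);
    }
}

/* Model of VADDL.U8 Qd, Dn, Dm: both inputs are promoted to 16 bits
 * before the add, so no carry is lost. */
void model_vaddl_u8(uint16_t d[8], const uint8_t n[8], const uint8_t m[8])
{
    for (int i = 0; i < 8; i++) {
        d[i] = (uint16_t)((uint16_t)n[i] + (uint16_t)m[i]);
    }
}

/* Model of VADDW.U8 Qd, Qn, Dm: only the second input is promoted;
 * the first is already 16 bits wide. The result wraps modulo 65536. */
void model_vaddw_u8(uint16_t d[8], const uint16_t n[8], const uint8_t m[8])
{
    for (int i = 0; i < 8; i++) {
        d[i] = (uint16_t)(n[i] + m[i]);
    }
}
```

With inputs 200 and 100 in a lane, the normal form yields 44 while the long form yields 300. The wide form suits accumulation, adding narrow data into a running 16-bit total.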
36
Example – adding all lanes
Input in Q0 (D0 and D1): 8x u16 input values
VPADDL.U16 Q0, Q0
Now Q0 contains 4x u32 values (with 15 headroom bits)
VPADD.U32 D0, D0, D1
Reducing/folding operation needs 1 bit of headroom
Now D0 contains 2x u32 values
VPADDL.U32 D0, D0
Now D0 contains the total as a single u64 value
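The three-instruction reduction can be checked with a scalar model. This is a sketch in plain C under the assumption that the input is eight u16 lanes in Q0; the function name `model_sum_u16x8` is hypothetical, and each step is labelled with the instruction it stands in for.

```c
#include <stdint.h>

/* Scalar model of summing all eight u16 lanes of Q0. */
uint64_t model_sum_u16x8(const uint16_t in[8])
{
    uint32_t q[4];  /* Q0 after VPADDL.U16 Q0, Q0: adjacent pairs, widened */
    uint32_t d[2];  /* D0 after VPADD.U32 D0, D0, D1 */

    for (int i = 0; i < 4; i++) {
        q[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
    }

    /* VPADD.U32 D0, D0, D1: lane 0 folds the old D0, lane 1 folds D1.
     * No widening here, which is why a spare bit of headroom is needed. */
    d[0] = q[0] + q[1];
    d[1] = q[2] + q[3];

    /* VPADDL.U32 D0, D0: final pair, widened to 64 bits. */
    return (uint64_t)d[0] + (uint64_t)d[1];
}
```

Even the worst case of eight lanes of 65535 sums to 524280, which needs only 20 bits, so the non-widening middle step cannot overflow.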
37
Exercise 2 - summing a vector
[Figure: tree of pairwise additions folding the elements of D0 and D1 down to a single sum]
38
Some NEON clever features
39
Data Movement: Table Lookup
Uses byte indexes to control byte look up in a table
Table is a list of 1, 2, 3 or 4 adjacent registers
VTBL: out-of-range indexes generate 0 result
VTBX: out-of-range indexes leave destination unchanged
VTBL.8 D0, {D1, D2}, D3
{D1,D2} (table, bytes 0-15):  a b c d e f g h i j k l m n o p
D3 (indexes):                 3 0 8 26 13 8 4 11
D0 (result):                  d a i 0  n  i e l
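The lookup rule, including the different out-of-range behaviour of VTBL and VTBX, can be written as a scalar model. The sketch below is plain C with hypothetical function names (`model_vtbl8`, `model_vtbx8`); the table length is 8, 16, 24 or 32 bytes depending on how many D registers are listed.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of VTBL.8 Dd, {table}, Dm: out-of-range indexes produce 0. */
void model_vtbl8(uint8_t dst[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
    }
}

/* Model of VTBX.8 Dd, {table}, Dm: out-of-range indexes leave the
 * destination lane unchanged. */
void model_vtbx8(uint8_t dst[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++) {
        if (idx[i] < table_len) {
            dst[i] = table[idx[i]];
        }
    }
}
```

Fed the slide's table (bytes a to p) and indexes 3, 0, 8, 26, 13, 8, 4, 11, the VTBL model gives d, a, i, 0, n, i, e, l, where index 26 is beyond the 16-byte table and so yields 0.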
40
Element Load Store Instructions
All treat memory as an array of structures (AoS)
SIMD registers are treated as structure of arrays (SoA)
Enables interleaving/de-interleaving for efficient SIMD processing
Transfer up to 256 bits in a single instruction
Three forms of Element Load Store instructions are provided
Forms distinguished by type of register list provided
Multiple Structure Access e.g. {D0, D1}
Single Structure Access e.g. {D0[2], D1[2]}
Single Structure Load to all lanes e.g. {D0[], D1[]}
[Figure: memory layout x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 ..., where each x, y or z is an element and each (x, y, z) group is a 3-element structure]
41
Multiple 2-Element Structure Access
VLD2, VST2 provide access to multiple 2-element structures
List can contain 2 or 4 registers
Transfer multiple consecutive 8, 16, or 32-bit 2-element structures
VLD2.16 {D2,D3}, [R1]
[Figure: memory at [R1] holds x0 y0 x1 y1 x2 y2 x3 y3 (offsets +0 to +14); after the load D2 = x0 x1 x2 x3 and D3 = y0 y1 y2 y3]
VLD2.16 {D0,D1,D2,D3}, [R3]!
[Figure: memory at [R3] holds x0 y0 x1 y1 ... x7 y7 (offsets +0 to +30); after the load D0 = x0..x3, D1 = x4..x7, D2 = y0..y3, D3 = y4..y7, and R3 is updated by the writeback (!)]
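The de-interleaving done by the two-register form of VLD2.16, and the matching interleave done by VST2.16, can be expressed as a scalar model. This is a sketch in plain C with hypothetical function names; arrays stand in for the D registers and for memory.

```c
#include <stdint.h>

/* Model of VLD2.16 {Dx, Dy}, [Rn]: memory holds x0 y0 x1 y1 x2 y2 x3 y3;
 * the x components land in one register and the y components in the other. */
void model_vld2_16(uint16_t x[4], uint16_t y[4], const uint16_t mem[8])
{
    for (int i = 0; i < 4; i++) {
        x[i] = mem[2 * i];
        y[i] = mem[2 * i + 1];
    }
}

/* Model of VST2.16 {Dx, Dy}, [Rn]: the inverse, interleaving on store. */
void model_vst2_16(uint16_t mem[8], const uint16_t x[4], const uint16_t y[4])
{
    for (int i = 0; i < 4; i++) {
        mem[2 * i]     = x[i];
        mem[2 * i + 1] = y[i];
    }
}
```

This is the array-of-structures to structure-of-arrays conversion: after the load, one SIMD instruction can process all x values (for example all left-channel audio samples) without any shuffling.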
42
Multiple 3/4-Element Structure Access
VLD3/4, VST3/4 provide access to 3 or 4-element structures
Lists contain 3/4 registers; optional space for building 128-bit vectors
Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures
VST3.16 {D3,D4,D5}, [R1]
[Figure: D3 = x0 x1 x2 x3, D4 = y0 y1 y2 y3, D5 = z0 z1 z2 z3 are stored interleaved as x0 y0 z0 x1 y1 z1 ... x3 y3 z3 at [R1] (offsets +0 to +22)]
VLD3.16 {D0,D2,D4}, [R1]!
[Figure: x0 y0 z0 x1 y1 z1 ... x3 y3 z3 at [R1] (offsets +0 to +22) are de-interleaved into x0..x3, y0..y3 and z0..z3 in the listed registers, which are spaced two apart; R1 is updated by the writeback (!)]
43
Logical
NEON supports bitwise logical operations
VAND, VBIC, VEOR, VORN, VORR
Bitwise logical operation
Independent of data type
64-bit and 128-bit registers
VBIT, VBIF, VBSL
Bitwise multiplex operations
Insert True, Insert False, Select
3 versions overwrite different registers
64-bit and 128-bit registers
Used with masks to provide selection
VAND D0, D0, D1
VORR Q0, Q1, Q15
VEOR Q7, Q1, Q15
VORN D15, D14, D1
VBIC D0, D30, D2
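VBIT D1, D0, D2
[Figure: each bit of D0 is inserted into D1 where the corresponding bit of mask D2 is 1; the other bits of D1 are unchanged]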
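The three bitwise multiplex operations differ only in which operand acts as the mask and which register is overwritten. The sketch below models each on a 64-bit D register in plain C; the `model_*` names are hypothetical, and each function returns the new value of the destination Dd.

```c
#include <stdint.h>

/* VBIT Dd, Dn, Dm (Insert True): take the bit from Dn where Dm is 1,
 * keep the bit of Dd elsewhere. */
uint64_t model_vbit(uint64_t d, uint64_t n, uint64_t m)
{
    return (n & m) | (d & ~m);
}

/* VBIF Dd, Dn, Dm (Insert False): take the bit from Dn where Dm is 0,
 * keep the bit of Dd elsewhere. */
uint64_t model_vbif(uint64_t d, uint64_t n, uint64_t m)
{
    return (d & m) | (n & ~m);
}

/* VBSL Dd, Dn, Dm (Select): Dd holds the mask on entry; pick the bit
 * from Dn where the mask is 1 and from Dm where it is 0. */
uint64_t model_vbsl(uint64_t d, uint64_t n, uint64_t m)
{
    return (n & d) | (m & ~d);
}
```

A typical use is to produce the mask with a compare instruction, which sets each lane to all ones or all zeros, and then select between two candidate results per lane without branching.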
44
Alignment hints on NEON load/store
NEON data load/store: VLDn/VSTn
Full unaligned support for NEON data access
Instruction contains an ‘alignment hint’ which permits implementations to be faster when the address is aligned and the hint is specified.
Usage: base address specified as [<Rn>:<align>]
Note: it is a programming error to specify a hint but use an incorrectly aligned address
Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs
ARM ARM uses “@” but this is not recommended in source code
GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”
Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)
VLD1.8 {D0}, [R1:64]
VLD1.8 {D0,D1}, [R4:128]!
VLD1.8 {D0,D1,D2,D3}, [R7:256], R2
45
Dual issue [Cortex-A8 only]
NEON can dual issue two NEON instructions in the following circumstances
No register operand/result dependencies
NEON data processing (ALU) instruction
NEON load/store or NEON byte permute instruction or MRC/MCR
VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX
Also can dual-issue NEON with ARM instructions
VLD1.8 {D0}, [R1]!
VMLAL.S8 Q2, D3, D2
VLD1.8 {D0}, [R1]!
SUBS r12, r12, #1
VEXT.8 D0, D1, D2, #1
SUBS r12, r12, #1
46
Thank you!
ARM Architecture has evolved with a balance of pure RISC and customer-driven input
NEON offers a clean architecture targeted at compiler code generation, offering
Unaligned access
Structure load/store operations
Dual D-register/Q-register view to optimize register bank
Balance of performance vs. gatecount
Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications