Lect.10.arm soc.4 neon

1

ARM Architecture & NEON

Ian Rickards

Stanford University 28 Apr 2010

2

A brief history of ARM

First ARM prototype came alive on 26-Apr-1985: 3um technology, 24800 transistors, 50mm2, consumed 120mW of power

Acorn’s commercial ARM2 processor: 8-MHz, 26-bit addressing, 3-stage pipeline

ARM founded in October 1990 as a separate company (Apple had 43% stake)

ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994

3

ARM 25 years later: Cortex-A9 MP

1-4 way MP with optimized MESI

16KB, 32KB, 64KB I & D caches

128KB-8MB L2

Multi-issue, Speculation, Renaming, OOO

High performance FPU option

NEON SIMD option

Thumb-2

AXI bus

Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)

40G “Low Power” macro: ~5mm2, 800MHz, 0.5W

40G “High Performance” macro: ~7mm2 2GHz (typ), 2W

4

Cortex-A9 Processor Microarchitecture

Introduces out-of-order instruction issue and completion

Register renaming to enable execution speculation

Non-blocking memory system with load-store forwarding

Fast loop mode in instruction pre-fetch to lower power consumption

5

Cortex-A9 MPCore Multicore Structure

[Block diagram: one to four Cortex-A9 CPUs, each with FPU/NEON, instruction cache, data cache and TRACE. Shared blocks: Snoop Control Unit (SCU) with cache-to-cache transfers and snoop filtering, generalized interrupt control and distribution, timers, Accelerator Coherence Port, Advanced Bus Interface Unit.]

L2 Cache Controller (PL310) 128K-8MB

Primary AMBA 3 64bit Interface

Optional 2nd I/F with Address Filtering

Configurable Between 1 and 4 CPUs with optional NEON and/or Floating-point Unit

Design flexibility over memory throughput and latency

Secure and Virtualization aware interrupt and IPI communications

Coherent access to processor caches from accelerators and DMA

Hardware Coherence for Cache, MMU and TLB maintenance operations

Flexible configuration and power-aware interrupt controller

6

Hard Macro Configuration and Floorplan

Osprey configuration includes level 2 cache controller and Cortex A9 integration level

Top level includes Coresight PTM, CTI and CTM

Implementation using r1p1 version of Cortex A9

Dual core

32k I$ and D$

NEON present on both cores

PTM interface present

128 interrupts

ACP present

Two AXI master ports

Level 2 cache memories external (interface exposed)

[Figures: Elba top level floorplan; falcon_cpu floorplan]

7

Why is ARM looking at “G” processes?

“G” can achieve around double the MHz of “LP”

Active power is lower on “G” than “LP”

Example, Push 40LP to 800MHz, to compare with 800MHz MID macro

The estimated LP numbers correlate to an accelerated implementation of an A8

G is close in terms of power if lowered to same performance as on LP.

G can scale much higher in terms of performance than LP.

Key requirement is “run and power off” quickly

8

Understanding power

Fundamental power parameters

Average power => battery life

Thermal power => sustained power @ max performance

[Figure: three power-versus-time plots. Labels: Traditional LP process; 40G process; Osprey; 2-3x faster clock; workloads GUI updates, music, web page render; power off between bursts of activity.]

9

Power Domains

HiP and MID macros have same power domains

Both use distributed coarse grain power switches

Power plan for CPUs is symmetric

A9 core and its L1 is power gated in lockstep

Note that all power domains are only ON or OFF, there is no hardware retention mode

Software routine enables retention to RAM

[Figure: power domain diagram. Labels: “Osprey macro”; A9_PL310; A9_PL310_noram; Data Engine 0; Data Engine 1; A9 CORE 0 + 32K I/D; A9 CORE 1 + 32K I/D; SCU + PL310_noram; PTM/Debug; L2 Cache RAM 512/1024KB.]

10

Single-thread Coremarks/MHz

Single-thread performance is key for GUI based applications

[Bar chart: single-thread CoreMark/MHz, axis from 0.00 to 3.00. Processors as listed: 74K, 1004K, Cortex-A8, Cortex-A9, Atom. Values as listed: 2.30, 2.33, 2.72, 2.95, 1.85.]

11

Floating Point Performance

Intel

12

Higher Flash Actionscript from A9

13

ARM Architecture evolution

Some not-entirely-RISC features

LDM / STM

Full predicated execution (ADDEQ r0, r1, r2)

Carefully designed with customer/partner input considering gatecount

Thumb

16-bit instruction set (mostly using r0-r7) selected for compiler requirements

Design goals: performance from 16-bit wide ROM, codesize

Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)

Beneficial today – better performance from small caches

Jazelle

CPU mode allows direct execution of Java bytecodes

~60% of Java bytecodes directly executed by datapath

Top of Java stack stored in registers

Widely used in Nokia & DoCoMo handsets

…

14

Dummies’ guide to Si implementation

Basic Fab tech

65nm, 40nm, 32nm, 28nm, etc.

G vs. LP technology

40G is 0.9V process, 40LP is 1.1V process

Much lower leakage with LP, but half the performance

Intermediate “LPG” from TSMC too! Island of G within LP

Vt’s – each Vt requires additional mask step

HVt – lower leakage, but slower

RVt – regular Vt

LVt – faster, but high leakage esp. at high temperature

Cell library track size

9-track, 12-track, 15-track (bigger => more powerful)

Backed off implementation vs. pushed implementation

High-K metal Gate

Clock gating

Well biasing…

15

ARM Architecture Evolution

[Figure: key technology additions by architecture generation (V5, V6, V7 A&R, V7 M; cores ARM9, ARM10, ARM11). Labels: Jazelle®, VFPv2, SIMD, Thumb®-2, TrustZone™, VFPv3, NEON™ Adv SIMD, Thumb-EE, Thumb-2 Only; Improved Media and DSP; Low Cost MCU; Execution Environments: Improved memory use.]

16

What is NEON?

NEON is a wide SIMD data processing architecture

Extension of the ARM instruction set

32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)

NEON Instructions perform “Packed SIMD” processing

Registers are considered as vectors of elements of the same data type

Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float

Instructions perform the same operation in all lanes

[Figure: source registers Dn and Dm are split into elements; the operation is applied to each pair of corresponding elements (a lane) and the results are written to the elements of destination register Dd]
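As a rough illustration of "the same operation in all lanes", the following is a scalar C model of what an instruction such as VADD.I16 Dd, Dn, Dm computes. It is a sketch, not NEON code: the function name `model_vadd_i16` and the array representation of a D register are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { D_LANES_16 = 4 };  /* a 64-bit D register holds four 16-bit lanes */

/* Scalar model of VADD.I16 Dd, Dn, Dm: the same add in every lane.
   Lanes are independent, so a carry never crosses a lane boundary;
   each lane wraps modulo 2^16. */
void model_vadd_i16(uint16_t dd[D_LANES_16],
                    const uint16_t dn[D_LANES_16],
                    const uint16_t dm[D_LANES_16])
{
    for (int i = 0; i < D_LANES_16; i++) {
        dd[i] = (uint16_t)(dn[i] + dm[i]);
    }
}

void model_vadd_i16_example(void)
{
    const uint16_t dn[D_LANES_16] = { 1, 2, 3, 0xFFFF };
    const uint16_t dm[D_LANES_16] = { 10, 20, 30, 1 };
    uint16_t dd[D_LANES_16];

    model_vadd_i16(dd, dn, dm);
    assert(dd[0] == 11 && dd[1] == 22 && dd[2] == 33);
    assert(dd[3] == 0);  /* this lane wraps; its neighbours are unaffected */
}
```

On real hardware the four additions happen in one instruction; the loop here only spells out the per-lane semantics.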

17

Data Types

NEON natively supports a set of common data types

Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit

32-bit Single-precision Floating-point

Data types are represented using a bit-size and format letter

8-bit: .S8 .U8 .P8 .I8 .8 (Signed, Unsigned Integers; Polynomials)

16-bit: .S16 .U16 .P16 .I16 .16 (Signed, Unsigned Integers; Polynomials)

32-bit: .S32 .U32 .F32 .I32 .32 (Signed, Unsigned Integers; Floats)

64-bit: .S64 .U64 .I64 .64 (Signed, Unsigned Integers)

18

Registers

NEON provides a 256-byte register file

Distinct from the core registers

Extension to the VFPv2 register file (VFPv3)

Two explicitly aliased views

32 x 64-bit registers (D0-D31)

16 x 128-bit registers (Q0-Q15)

Enables register trade-off

Vector length

Available registers

Also uses the summary flags in the VFP FPSCR

Adds a QC integer saturation summary flag

No per-lane flags, so ‘carry’ handled using wider result (16bit+16bit -> 32-bit)

[Figure: aliased register views. Q0 overlays D0 and D1, Q1 overlays D2 and D3, ..., Q15 overlays D30 and D31.]
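The two aliased views can be pictured with a C union. This is only a mental model of the register file layout, not something a program can use to reach the real registers; the type name `neon_regfile_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 256-byte NEON register file.
   Both members are views of the same storage. */
typedef union {
    uint64_t d[32];     /* D0..D31: 32 x 64-bit registers */
    uint64_t q[16][2];  /* Q0..Q15: q[n][0] is D(2n), q[n][1] is D(2n+1) */
} neon_regfile_model;

void neon_regfile_example(void)
{
    neon_regfile_model rf = { .d = { 0 } };

    assert(sizeof rf == 256);

    rf.d[2] = 0x1111111111111111u;  /* write D2 */
    rf.d[3] = 0x2222222222222222u;  /* write D3 */

    /* Q1 is the pair {D2, D3}: D2 is the low half, D3 the high half. */
    assert(rf.q[1][0] == 0x1111111111111111u);
    assert(rf.q[1][1] == 0x2222222222222222u);
}
```

The trade-off named on the slide follows directly: code that uses Q registers gets longer vectors but only 16 names, while code that uses D registers gets 32 names for shorter vectors.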

19

Vectors and Scalars

Registers hold one or more elements of the same data type

Vn can be used to reference either a 64-bit Dn or 128-bit Qn register

A register, data type combination describes a vector of elements

Some instructions can reference individual scalar elements

Scalar elements are referenced using the array notation Vn[x]

Array ordering is always from the least significant bit

[Figure: example element layouts for 64-bit Dn (bits 63..0) and 128-bit Qn (bits 127..0) registers holding I64, S32, S8 and F32 elements; the scalars of an F32 vector in Q0 are referenced as Q0[0], Q0[1], Q0[2], Q0[3]]

20

NEON in Audio

FFT: 256-point, 16-bit signed complex numbers

FFT is a key component of AAC, Voice/pattern recognition etc.

Hand optimized assembler in both cases

Extreme example: FFT in ffmpeg: 12x faster

C code -> handwritten asm

Scalar -> vector processing

Unpipelined FPU -> pipelined NEON single precision FPU

FFT time, Cortex-A8 500MHz, actual silicon:

No NEON (v6 SIMD asm): 15.2 us

With NEON (v7 NEON asm): 3.8 us (x 4.0 performance)

21

How to use NEON

OpenMAX DL library

Library of common codec components and signal processing routines

Status: Released on http://www.arm.com/products/esd/openmax_home.html

Vectorizing Compilers

Exploits NEON SIMD automatically with existing source code

Status: Released (in RVDS 3.1 Professional and later)

Status: Codesourcery 2007q3 gcc and later

C Intrinsics

C function call interface to NEON operations

Supports all data types and operations supported by NEON

Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)

Assembler

For those who really want to optimize at the lowest level

Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

22

For NEON instruction reference

Official NEON instruction Set reference is “Advanced SIMD” in

ARM Architecture Reference Manual v7 A & R edition

Available to partners & www.arm.com request system

23

ARM RVDS & gcc vectorising compiler

armcc generates better NEON code

(gcc can use Q-regs with ‘-mvectorize-with-neon-quad’ )

test.c:

int a[256], b[256], c[256];

void foo(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = b[i] + c[i];
    }
}

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

|L1.16|
    VLD1.32 {d0,d1},[r0]!
    SUBS r3,r3,#1
    VLD1.32 {d2,d3},[r1]!
    VADD.I32 q0,q0,q1
    VST1.32 {d0,d1},[r2]!
    BNE |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

.L2:
    add r1, r0, ip
    add r3, r0, lr
    add r2, r0, r4
    add r0, r0, #8
    cmp r0, #1024
    fldd d7, [r3, #0]
    fldd d6, [r2, #0]
    vadd.i32 d7, d7, d6
    fstd d7, [r1, #0]
    bne .L2

24

Intrinsics

Include intrinsics header file

#include <arm_neon.h>

Use special NEON data types which correspond to D and Q registers, e.g.

int8x8_t D-register containing 8x 8-bit elements

int16x4_t D-register containing 4x 16-bit elements

int32x4_t Q-register containing 4x 32-bit elements

Use special intrinsics versions of NEON instructions

vin1 = vld1q_s32(ptr);

vout = vaddq_s32(vin1, vin2);

vst1q_s32(ptr, vout);

Strongly typed!

Use vreinterpret_s16_s32( ) to change the type
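A minimal sketch of how these intrinsics might be combined into a complete function. The helper name `add_s32_arrays` and the requirement that the length be a multiple of 4 are assumptions made for this example; the plain C loop is a fallback so that the same file also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = a[i] + b[i] for n elements; n must be a multiple of 4
   (hypothetical helper, no tail handling). */
void add_s32_arrays(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    assert(n % 4 == 0);
#if HAVE_NEON
    for (size_t i = 0; i < n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);        /* load 4 x s32 into a Q register */
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(dst + i, vaddq_s32(va, vb));  /* pointer first, then vector */
    }
#else
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];                   /* scalar fallback */
    }
#endif
}

void add_s32_arrays_example(void)
{
    const int32_t a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    const int32_t b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
    int32_t c[8];

    add_s32_arrays(c, a, b, 8);
    assert(c[0] == 11 && c[7] == 88);
}
```

Because the intrinsic types are strongly typed, passing an int16x8_t where an int32x4_t is expected is a compile error rather than a silent reinterpretation.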

25

NEON in open source

Bluez – official Linux Bluetooth protocol stack

NEON sbc audio encoder

Pixman (part of cairo 2D graphics library)

Compositing/alpha blending

X.Org, Mozilla Firefox, fennec, & Webkit browsers

e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON

ffmpeg – libavcodec

LGPL media player used in many Linux distros

NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora

NEON Audio: AAC, Vorbis, WMA

x264 – Google Summer Of Code 2009

GPL H.264 encoder – e.g. for video conferencing

Android – NEON optimizations

Skia library, S32A_D565_Opaque 5x faster using NEON

Available in Google Skia tree from 03-Aug-2009

Eigen2 linear algebra library

Ubuntu 09.04 – supports NEON

NEON versions of critical shared-libraries

26

Many different levels of parallelism

Multi-issue parallelism

NEON SIMD parallelism

Multi-core parallelism

27

ffmpeg (libavcodec) performance

git.ffmpeg.org

snapshot 21-Sep-09

YouTube HQ video decode

480x270, 30fps

Including AAC audio

Real silicon measurements

OMAP3 Beagleboard

ARM A9TC

NEON ~2x overall

performance

28

Scalability with SMP on Cortex-A9

29

NEON optimization example

30

Skia library S32A_D565_Opaque

Size   Reference C   Google v6 asm   NEON asm   RVDS
60     100%          128%            24%        64%
64     100%          128%            22%        68%
68     100%          127%            23%        63%
980    100%          73%             23%        58%
986    100%          73%             23%        58%

31

Processing code

vmovn.u16 d4, q12

vshr.u16 q11, q12, #5

vshr.u16 q10, q12, #6+5

vmovn.u16 d5, q11

vmovn.u16 d6, q10

vshl.u8 d4, d4, #3

vshl.u8 d5, d5, #2

vshl.u8 d6, d6, #3

vmovl.u8 q14, d31

vmovl.u8 q13, d31

vmovl.u8 q12, d31

vmvn.8 d30, d3

vmlal.u8 q14, d30, d6



vshr.u16 q8, q14, #5

vshr.u16 q9, q13, #6

vaddhn.u16 d6, q14, q8

vshr.u16 q8, q12, #5


vqadd.u8 d6, d6, d0


vqadd.u8 d6, d6, d0

vqadd.u8 d5, d5, d1

vqadd.u8 d4, d4, d2

vshll.u8 q10, d6, #8

vshll.u8 q3, d5, #8

vshll.u8 q2, d4, #8

vsri.u16 q10, q3, #5

vsri.u16 q10, q2, #11

32

Cortex-A8 TRM

33

Quick review of NEON instructions

34

Multiple 1-Element Structure Access

VLD1, VST1 provide standard array access

An array of structures containing a single component is a basic array

List can contain 1, 2, 3 or 4 consecutive registers

Transfer multiple consecutive 8, 16, 32 or 64-bit elements

VST1.16 {D3,D4}, [R1]

[Figure: D3 = x0 x1 x2 x3 and D4 = x4 x5 x6 x7 are stored as x0..x7 at consecutive 16-bit locations [R1], +2, +4, ..., +14]

VLD1.16 {D7}, [R4], R3

[Figure: x0..x3 at [R4], +2, +4, +6 are loaded into D7; R4 is then incremented by R3]

35

Addition: Basic

NEON supports various useful forms of basic addition

Normal Addition - VADD, VSUB

Floating-point

Integer (8-bit to 64-bit elements)

64-bit and 128-bit registers

Long Addition - VADDL, VSUBL

Promotes both inputs before operation

Signed/unsigned (8-bit to 32-bit source elements)

Wide Addition - VADDW, VSUBW

Promotes one input before operation

Signed/unsigned (8-bit to 32-bit source elements)

VADD.I16 D0, D1, D2

VSUB.F32 Q7, Q1, Q4

VADD.I8 Q15, Q14, Q15

VSUB.I64 D0, D30, D5

VADDW.U8 Q1, Q7, D8

VSUBW.S16 Q8, Q1, D5

VADDL.U16 Q1, D7, D8

VSUBL.S32 Q8, D1, D5
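The difference between the long and wide forms can be shown with a scalar C model. The function names `model_vaddl_u8` and `model_vaddw_u8` are made up for this sketch; they mirror VADDL.U8 Qd, Dn, Dm and VADDW.U8 Qd, Qn, Dm respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Model of VADDL.U8 Qd, Dn, Dm: both 8-bit inputs are promoted to
   16 bits before the add, so the result cannot overflow. */
void model_vaddl_u8(uint16_t qd[8], const uint8_t dn[8], const uint8_t dm[8])
{
    for (int i = 0; i < 8; i++) {
        qd[i] = (uint16_t)((uint16_t)dn[i] + (uint16_t)dm[i]);
    }
}

/* Model of VADDW.U8 Qd, Qn, Dm: only the second input is promoted;
   the first is already 16 bits wide. Each lane wraps modulo 2^16. */
void model_vaddw_u8(uint16_t qd[8], const uint16_t qn[8], const uint8_t dm[8])
{
    for (int i = 0; i < 8; i++) {
        qd[i] = (uint16_t)(qn[i] + dm[i]);
    }
}

void model_long_wide_example(void)
{
    uint8_t dn[8], dm[8], more[8];
    uint16_t sum[8], acc[8];

    for (int i = 0; i < 8; i++) { dn[i] = 200; dm[i] = 100; more[i] = 255; }

    model_vaddl_u8(sum, dn, dm);
    assert(sum[0] == 300);      /* an 8-bit VADD.I8 would wrap to 44 */

    model_vaddw_u8(acc, sum, more);
    assert(acc[7] == 555);      /* keep accumulating 8-bit data at 16 bits */
}
```

This is the "carry handled using wider result" idea: since there are no per-lane flags, headroom comes from widening the lanes instead.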

36

Example – adding all lanes

Input in Q0 (D0 and D1): 8x u16 input values

VPADDL.U16 Q0, Q0

Now Q0 contains 4x u32 values (with 15 headroom bits)

VPADD.U32 D0, D0, D1

Reducing/folding operation needs 1 bit of headroom

VPADDL.U32 D0, D0
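The three-instruction reduction (VPADDL.U16 Q0, Q0, then VPADD.U32 D0, D0, D1, then VPADDL.U32 D0, D0) can be followed step by step in a scalar C model. The function name `model_sum_u16x8` is an assumption for this sketch, and the intermediate arrays stand in for the register contents after each instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of summing the eight u16 lanes of Q0. */
uint64_t model_sum_u16x8(const uint16_t q0[8])
{
    /* VPADDL.U16 Q0, Q0: add adjacent pairs, widening u16 -> u32.
       Each result needs at most 17 bits, leaving 15 bits of headroom. */
    uint32_t q0_u32[4];
    for (int i = 0; i < 4; i++) {
        q0_u32[i] = (uint32_t)q0[2 * i] + q0[2 * i + 1];
    }

    /* VPADD.U32 D0, D0, D1: D0 is lanes 0..1 of Q0, D1 is lanes 2..3.
       Pairwise add without widening; uses 1 bit of the headroom. */
    uint32_t d0[2];
    d0[0] = q0_u32[0] + q0_u32[1];
    d0[1] = q0_u32[2] + q0_u32[3];

    /* VPADDL.U32 D0, D0: final pair, widened to a single u64. */
    return (uint64_t)d0[0] + d0[1];
}

void model_sum_example(void)
{
    const uint16_t v[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    assert(model_sum_u16x8(v) == 36);
}
```

Even with every input lane at 0xFFFF the total is 8 x 65535 = 524280, which fits comfortably in the 32-bit intermediate lanes.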

37

Exercise 2 - summing a vector

[Figure: tree of pairwise additions reducing the lanes of D0 and D1 to a single sum]

38

Some NEON clever features

39

Data Movement: Table Lookup

Uses byte indexes to control byte look up in a table

Table is a list of 1,2,3 or 4 adjacent registers

VTBL : out of range indexes generate 0 result

VTBX : out of range indexes leave destination unchanged

VTBL.8 D0, {D1, D2}, D3

[Figure: table {D1,D2} holds bytes a to p; D3 holds indexes 3 0 8 26 13 8 4 11; result D0 = d a i 0 n i e l, where the out of range index 26 gives 0]
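The lookup rule can be modelled in a few lines of scalar C. The function name `model_vtbl8` and its `keep_on_miss` flag (0 for VTBL behaviour, 1 for VTBX behaviour) are inventions of this sketch; the data reuses the a..p table and the index values from the slide, with the lane order assumed to run left to right.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of VTBL.8 / VTBX.8 Dd, {table}, Dm.
   table_regs is the number of D registers in the list (1..4). */
void model_vtbl8(uint8_t dd[8], const uint8_t *table, size_t table_regs,
                 const uint8_t index[8], int keep_on_miss)
{
    assert(table_regs >= 1 && table_regs <= 4);
    const size_t table_bytes = 8 * table_regs;

    for (int i = 0; i < 8; i++) {
        if (index[i] < table_bytes) {
            dd[i] = table[index[i]];   /* in range: look the byte up */
        } else if (!keep_on_miss) {
            dd[i] = 0;                 /* VTBL: out of range gives 0 */
        }                              /* VTBX: leave dd[i] unchanged */
    }
}

void model_vtbl8_example(void)
{
    static const char letters[] = "abcdefghijklmnop";   /* {D1,D2} */
    const uint8_t *table = (const uint8_t *)letters;
    const uint8_t index[8] = { 3, 0, 8, 26, 13, 8, 4, 11 };
    uint8_t dd[8] = { '?', '?', '?', '?', '?', '?', '?', '?' };

    model_vtbl8(dd, table, 2, index, 0);                /* VTBL */
    assert(dd[0] == 'd' && dd[1] == 'a' && dd[2] == 'i' && dd[3] == 0);
    assert(dd[4] == 'n' && dd[5] == 'i' && dd[6] == 'e' && dd[7] == 'l');

    dd[3] = '?';
    model_vtbl8(dd, table, 2, index, 1);                /* VTBX */
    assert(dd[3] == '?');                               /* index 26 is a miss */
}
```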

40

Element Load Store Instructions

All treat memory as an array of structures (AoS)

SIMD registers are treated as structure of arrays (SoA)

Enables interleaving/de-interleaving for efficient SIMD processing

Transfer up to 256-bits in a single instruction

Three forms of Element Load Store instructions are provided

Forms distinguished by type of register list provided

Multiple Structure Access e.g. {D0, D1}

Single Structure Access e.g. {D0[2], D1[2]}

Single Structure Load to all lanes e.g. {D0[], D1[]}

[Figure: memory holding 3-element structures x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 ...; each value is one element]
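The AoS-to-SoA conversion performed by a structure load can be sketched as a scalar C model. The names `model_vld3_16` and `model_vst3_16` are hypothetical; they mimic what VLD3.16 {D0,D1,D2}, [Rn] and VST3.16 {D0,D1,D2}, [Rn] do with four 3-element structures of 16-bit values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of VLD3.16 {D0,D1,D2}, [Rn]: memory holds x0 y0 z0 x1 y1 z1 ...
   and the load de-interleaves it into one register per component. */
void model_vld3_16(uint16_t d0[4], uint16_t d1[4], uint16_t d2[4],
                   const uint16_t mem[12])
{
    for (int i = 0; i < 4; i++) {
        d0[i] = mem[3 * i];       /* x components */
        d1[i] = mem[3 * i + 1];   /* y components */
        d2[i] = mem[3 * i + 2];   /* z components */
    }
}

/* Model of VST3.16 {D0,D1,D2}, [Rn]: the inverse, re-interleaving. */
void model_vst3_16(uint16_t mem[12], const uint16_t d0[4],
                   const uint16_t d1[4], const uint16_t d2[4])
{
    for (int i = 0; i < 4; i++) {
        mem[3 * i]     = d0[i];
        mem[3 * i + 1] = d1[i];
        mem[3 * i + 2] = d2[i];
    }
}

void model_vld3_example(void)
{
    const uint16_t aos[12] = { 10, 20, 30,  11, 21, 31,
                               12, 22, 32,  13, 23, 33 };
    uint16_t x[4], y[4], z[4], back[12];

    model_vld3_16(x, y, z, aos);
    assert(x[0] == 10 && x[3] == 13);
    assert(y[0] == 20 && z[3] == 33);

    model_vst3_16(back, x, y, z);
    assert(memcmp(back, aos, sizeof aos) == 0);   /* round trip */
}
```

Once the data is in SoA form, one SIMD instruction can process all x values (or all y, or all z) together, which is why the de-interleave is folded into the load.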

41

Multiple 2-Element Structure Access

VLD2, VST2 provide access to multiple 2-element structures

List can contain 2 or 4 registers

Transfer multiple consecutive 8, 16, or 32-bit 2-element structures

VLD2.16 {D2,D3}, [R1]

[Figure: memory at [R1], +2, ..., +14 holds x0 y0 x1 y1 x2 y2 x3 y3; after the load D2 = x0 x1 x2 x3 and D3 = y0 y1 y2 y3]

VLD2.16 {D0,D1,D2,D3}, [R3]!

[Figure: memory at [R3], +2, ..., +30 holds x0 y0 x1 y1 ... x7 y7; after the load D0 = x0 x1 x2 x3, D1 = x4 x5 x6 x7, D2 = y0 y1 y2 y3, D3 = y4 y5 y6 y7; "!" writes the updated address back to R3]

42

Multiple 3/4-Element Structure Access

VLD3/4, VST3/4 provide access to 3 or 4-element structures

Lists contain 3/4 registers; optional space for building 128-bit vectors

Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures

VST3.16 {D3,D4,D5}, [R1]

[Figure: D3 = x0 x1 x2 x3, D4 = y0 y1 y2 y3, D5 = z0 z1 z2 z3 are stored interleaved as x0 y0 z0 x1 y1 z1 ... x3 y3 z3 at [R1], +2, ..., +22]

VLD3.16 {D0,D2,D4}, [R1]!

[Figure: memory at [R1], +2, ..., +22 holds x0 y0 z0 x1 y1 z1 ... x3 y3 z3; the x, y and z elements are de-interleaved into the three listed registers, which are spaced two apart; "!" writes the updated address back to R1]

43

Logical

NEON supports bitwise logical operations

VAND, VBIC, VEOR, VORN, VORR

Bitwise logical operation

Independent of data type


VBIT, VBIF, VBSL

Bitwise multiplex operations

Insert True, Insert False, Select

3 versions overwrite different registers


Used with masks to provide selection

VAND D0, D0, D1

VORR Q0, Q1, Q15

VEOR Q7, Q1, Q15

VORN D15, D14, D1

VBIC D0, D30, D2

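VBIT D1, D0, D2

[Figure: bits of D0 are inserted into D1 wherever the corresponding bit of the mask D2 is 1; the other bits of D1 are unchanged]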
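The three multiplex operations differ only in which register is overwritten and which one acts as the mask. A scalar C model on 64-bit values makes this concrete; the function names are made up for this sketch, and each returns the new value of the destination register Dd.

```c
#include <assert.h>
#include <stdint.h>

/* VBSL Dd, Dn, Dm: Dd holds the mask; select Dn where mask is 1, Dm where 0. */
uint64_t model_vbsl(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dn & dd) | (dm & ~dd);
}

/* VBIT Dd, Dn, Dm: insert bits of Dn into Dd where mask Dm is 1 (true). */
uint64_t model_vbit(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dn & dm) | (dd & ~dm);
}

/* VBIF Dd, Dn, Dm: insert bits of Dn into Dd where mask Dm is 0 (false). */
uint64_t model_vbif(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dd & dm) | (dn & ~dm);
}

void model_bitselect_example(void)
{
    const uint64_t mask = UINT64_C(0x00000000FFFFFFFF);
    const uint64_t a    = UINT64_C(0xAAAAAAAAAAAAAAAA);
    const uint64_t b    = UINT64_C(0x5555555555555555);

    assert(model_vbit(a, b, mask) == UINT64_C(0xAAAAAAAA55555555));
    assert(model_vbif(a, b, mask) == UINT64_C(0x55555555AAAAAAAA));
    assert(model_vbsl(mask, UINT64_C(0x1111111111111111),
                            UINT64_C(0x2222222222222222))
           == UINT64_C(0x2222222211111111));
}
```

A typical use is branch-free selection: a compare instruction produces an all-ones or all-zeros mask per lane, and one of these instructions then picks between two candidate results.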

44

Alignment hints on NEON load/store

NEON data load/store: VLDn/VSTn

Full unaligned support for NEON data access

Instruction contains ‘alignment hint’ which permits implementations to be faster when address is aligned and hint is specified.

Usage: base address specified as [<Rn>:<align>]

Note it is a programming error to specify hint, but use incorrectly aligned address

Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs

ARM ARM uses “@” but this is not recommended in source code

GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”

Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)

VLD1.8 {D0}, [R1:64]

VLD1.8 {D0,D1}, [R4:128]!

VLD1.8 {D0,D1,D2,D3}, [R7:256]!, R2
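Since specifying a hint with a misaligned address is a programming error, code that sets up the pointers may want to check the alignment first, for example in debug builds. A minimal sketch in C, assuming a hypothetical helper `is_aligned_to_bits` that takes the hint value in bits as written in the instruction:

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if p satisfies an alignment hint of the given size.
   bits is the hint as written in the instruction: 64, 128 or 256. */
int is_aligned_to_bits(const void *p, unsigned bits)
{
    assert(bits == 64 || bits == 128 || bits == 256);
    return ((uintptr_t)p % (bits / 8u)) == 0;
}

void alignment_example(void)
{
    _Alignas(32) uint8_t buf[64] = { 0 };   /* C11: 32-byte (256-bit) aligned */

    assert(is_aligned_to_bits(buf, 256));       /* safe for [Rn:256] */
    assert(is_aligned_to_bits(buf + 8, 64));    /* safe for [Rn:64]  */
    assert(!is_aligned_to_bits(buf + 8, 128));  /* must not use :128 */
}
```

Without the hint the access is still legal at any address; the hint only trades that flexibility for potentially faster timing.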

45

Dual issue [Cortex-A8 only]

Two NEON instructions can dual issue in the following circumstances

No register operand/result dependencies

One is a NEON data processing (ALU) instruction

The other is a NEON load/store or NEON byte permute instruction or MRC/MCR: VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX

Also can dual-issue NEON with ARM instructions

VLD1.8 {D0}, [R1]!

VMLAL.S8 Q2, D3, D2

VLD1.8 {D0}, [R1]!

SUBS r12, r12, #1

VEXT.8 D0, D1, D2, #1

SUBS r12, r12, #1

46

Thank you!

ARM Architecture has evolved with a balance of pure RISC and customer driven input

NEON offers a clean architecture targeted at compiler code generation, offering

Unaligned access

Structure load/store operations

Dual D-register/Q-register view to optimize register bank

Balance of performance vs. gatecount

Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications
