1 ARMv7-A Architecture Overview David Brash Architecture Program Manager, ARM Ltd.
Nov 18, 2014
1
ARMv7-A Architecture
Overview
David Brash
Architecture Program Manager, ARM Ltd.
2
ARM Architecture roadmap
Announced Aug-2010:
• Large PhysAddr Extn
• PA => 40-bits
•Virtualization Extn
(v8 builds on these)
3
ARMv7-A registers
R0
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R13 (SP)
R14 (LR)
R0 R0 R0 R0 R0R0
SP_svc
LR_svc
SP_irq
LR_irq
SP_und
LR_und
SP_fiq
LR_fiq
SP_abt
LR_abt
R8_fiq
R9_fiq
R10_fiq
R11_fiq
R12_fiq
SP_hyp
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R1
R2
R3
R4
R5
R6
R7
LR_(BL)
CPSR bits• Flags: NZCVQ
• IT[7:0] status bits
• GE[3:0] Adv SIMD flags
• J (Jazelle)
• T (Thumb)
• E (endian)
• I, A, F masks
• M[4:0] (mode)
SPSR_svc SPSR_abt SPSR_und SPSR_irq SPSR_hyp
ELR_hyp
SPSR_fiq
CPSR
4
Classic ARM MMU
32-bit physical address space
2-level translation tables
Pointed to by TTBR0 (user mappings) and TTBR1 (kernel mappings
but with restrictions to the user/kernel memory split)
32-bit page table entries
1st level contains 4096 entries (4 pages for PGD)
1MB section per entry or
Pointer to a 2nd level table
Implementation-defined 16MB supersections
2nd level contains 256 entries pointing to 4KB page each
1KB per 2nd level page table
ARMv6/v7 introduced TEX remapping
Memory type becomes a 3-bit index
5
Classic ARM MMU (cont’d)
Other features
XN (eXecute Never) bit
Different memory types: Normal (cacheable and non-cacheable),
Device, Strongly Ordered
Shareability attributes for SMP systems
ASID-tagged TLB (ARMv6 onwards)
Avoids TLB flushing at context switch
8-bit ASID value assigned to an mm_struct
Dynamically allocated (there can be more than 256 processes)
6
Classic ARM MMU Limitations
Only 32-bit physical address space
Growing market requiring more than 4GB physical address space
(both RAM and peripherals)
Supersections can be used to allow up to 40-bit addresses using
16MB sections (implementation-defined feature)
Prior to ARMv6, not a direct link between access permissions
and Linux PTE bits
Simplified permission model introduced with ARMv6 but not used by
Linux
2nd level page table does not fill a full 4K page
ARM Linux workarounds
Separate array for the Linux PTE bits
1st level entry consists of two 32-bit locations pointing to 2KB 2nd level
page table entries
7
The ARMv7-A Virtualization Extensions
Popek and Goldberg summarized the concept in 1974:
Equivalence/Fidelity
A program running under the hypervisor should exhibit a behaviour
essentially identical to that demonstrated when running on an
equivalent machine directly.
Resource control / Safety
The hypervisor should be in complete control of the virtualized
resources.
Efficiency/Performance
A statistically dominant fraction of machine instructions must be
executed without hypervisor intervention.
"Formal Requirements for Virtualizable Third Generation Architectures". Communications of the ACM 17
8
ARM hypervisor support philosophy
Virtual machine (VM) scheduling and resource sharing
New Hyp mode for Hypervisor execution
Minimise Hypervisor intervention for “routine” GuestOS tasks
Guest OS page table management
Interrupt control
Guest OS Device Drivers
Syndrome support for trapping key instructions
GuestOS load/store emulation
Privileged control instructions
System instructions (MRS, MSR) to read/write key registers
Virtualized ID register management
9
ARM virtualization extension - modes
A new privilege layer – hypervisor mode
Guest OS kernel given same privilege structure as for a non-
virtualized environment
Can run the same instructions
Hypervisor mode has higher privilege
VMM can control a wide range of OS accesses to hardware
User Mode
(Non-privileged)
Supervisor Mode
(Privileged)
Hyp Mode
(More Privileged)
Guest Operating System1
App2App1
Guest Operating System2
App2App1
Virtual Machine Monitor (VMM) or
Hypervisor
10
Hypervisor mode and security extensions
Hypervisor mode applies to normal world
Secure world supports a single virtual machine
Monitor mode controls transition between worlds
Guest Operating System1
App2App1
Guest Operating System2
App2App1
Virtual Machine Monitor (VMM) or
Hypervisor
TrustZone Monitor
Secure World OS
Trusted App2Trusted App1
11
ARMv7-A Virtualization - key features 1
Trap and control support (HYP mode)
Rich set of trap options (TLB/cache ops, ID groups, instructions)
Syndrome register support
EC: exception class (instr type, I/D abort into/within HYP)
IL: instruction length (0 == 16-bit; 1 == 32-bit)
ISS: instruction specific syndrome (instr fields, reason code, ...)
3
1
2
6
2
5
2
4
0
EC I
L
ISS
12
ARMv7-A Virtualization - key features 2
Dedicated Exception Link Register (ELR)
Stores preferred return address on exception entry
New instruction – ERET – for exception return from HYP mode
Other modes overload exception model and procedure call LRs
R14 used by exception entry BL and BLX instructions
Address translation
13
Two layers of address translation
Stage 1 translation
owned by guest OS
Virtual address map Intermediate physical
address map
Real System Physical
address map
Stage 2 translation
owned by the VMM
Hardware has 2-stage
memory translation
Tables from Guest OS
translate VA to IPA
Second set of tables from
VMM translate IPA to PA
Allows aborts to be routed to
appropriate software layer
14
LPAE – Stage 1 (VA => {I}PA)
Managed by the OS
64-bit descriptors, 512 entries per table
4KB table size == 4KB page size
1 or 2 Translation Table Base Registers
3 levels of table supported
up to 9 address bits per level
2 bits at Level 1
9 bits at Levels 2 and 3
Page Table Entry:
32-b
it V
irtu
al
Addre
ss
TTBR1 space
TTBR0space
Not mapped
(Fault)
232- TTCR.T1SZ
232-TTCR.T0SZ
0
232
T1SZ ≤ T0SZ
If T1SZ == T0SZ == 0 then TTBR0 always used
31:30 VA Bits <29:21> VA Bits <20:12> VA Bits <11:0>
L1 Level 2 table index Level 3 table (page) index Page offset address
6
3
5
2
4
0
3
0
1
2
2 1 0
Upper attributes SBZ Address out SBZ Lower attributes
Valid
Block/TableBlock/page
• Memory type
• Cache policy
• Shareability, AF, nG, NS flags
• Access permissions
•Table override bits (NS, XN, PXN, AP[1:0])
• Block/Page XN, PXN bits
• Block/Page 16x entry contiguous hint
• IGNORED bits
15
LPAE – Stage 2 (IPA => PA)
Managed by the VMM / hypervisor
Same table walk scheme as stage 1, now up to 40-bit input address:
2x contiguous 4KB tables allowed at L1; support 240 address space
Up to 1-16 contiguous tables allowed at L2; support 230 - 234 address space
1x Translation Table Base register (VTBR)
Page table entry:
6
3
5
2
4
0
3
0
1
2
2 1 0
Upper attributes SBZ Address out SBZ Lower attributes
Valid
Block/Table
Block / page only
• XN bit
• 16x entry contiguous hint
• IGNORED bits
Block/page
• Memory type
• Cache policy
• Shareability, AF bit
• R/W Access permissions
3
9
3
4
3
0
2
1
1
2
0
IPA Bits <39:30> IPA Bits <29:21> IPA Bits <20:12> IPA Bits <11:0>
IPA Bits <11:0> IPA Bits <33;21> IPA Bits <20:12> IPA Bits <11:0>
16
Linux Support for
ARM LPAE
Catalin Marinas
ELC Europe 2011
17
ARM LPAE Features
40-bit physical addresses (1TB)
40-bit intermediate physical addresses (guest PA space)
3-level translation tables
Pointed to by TTBR0 (user mappings) and TTBR1 (kernel mappings)
Not as restrictive on user/kernel memory split (can use 3:1)
With 1GB kernel mapping, the 1st level is skipped
64-bit entries in each level
1st level contains 4 entries (stage 1 translation)
1GB section or
Pointer to 2nd level table
2nd level contains 512 entries (4KB in total)
2MB section or
Pointer to 3rd level
18
ARM LPAE Features (cont’d)
3rd level contains 512 entries (4KB)
Each addressing a 4KB range
Possibility to set a contiguity flag for 16 consecutive pages
LDRD/STRD (64-bit load/store) instructions are atomic on
ARM processors supporting LPAE
Only the simplified page permission model is supported
No kernel RW and user RO combination
Domains are no longer present (they have already been
removed in ARMv7 Linux)
Additional spare bits to be used by the OS
Dedicated bits for user, read-only and access flag (young)
settings
19
ARM LPAE Features (cont’d)
ASID is part of the TTBR0 register
Simpler context switching code (no need to deal with speculative TLB
fetching with the wrong ASID)
The Context ID register can be used solely for debug/trace
Additional permission control
PXN – Privileged eXecute Never
SCTLR.WXN, SCTLR.UWXN – prevent execution from writable
locations (the latter only for user accesses)
APTable – restrict permissions in subsequent page table levels
XNTable, PXNTable – override XN and PXN bits in subsequent page
table levels
New registers for the memory region attributes
MAIR0, MAIR1 – 32-bit Memory Attribute Indirection Registers
8 memory types can be configured at a time
20
ARM LPAE Features
1GB block
Table ptr
1st level
table
VA[31:30]TTBRx
4KB
2MB block
Table ptr
2nd level
table
VA[29:21]
4KB
PTE
PTE
3rd level
table
VA[20:12]
4KB
64-bit entry 64-bit entry64-bit entry
PGD PMD PTE
21
Linux and ARM LPAE
Linux + ARM LPAE has the same memory layout as the
classic MMU implementation
Described in Documentation/arm/memory.txt
0..TASK_SIZE – user space
PAGE_OFFSET-16M..PAGE_OFFSET-2M – module space
PAGE_OFFSET-2M..PAGE_OFFSET – highmem mappings
Highmem is already supported by the classic MMU
Memory beyond 4G is only accessible via highmem
Page mapping functions use pfn (32-bit variable, PAGE_SHIFT == 12,
maximum 44-bit physical addresses)
Original ARM kernel port assumed 32-bit physical addresses
LPAE redefines phys_addr_t, dma_addr_t as u64 (generic typedefs)
New ATAG_MEM64 defined for specifying the 64-bit memory layout
22
Linux and ARM LPAE (cont’d)
Hard-coded assumptions about 2 levels of page tables
PGDIR_(SHIFT|SIZE|MASK) references converted to PMD_*
swapper_pg_dir extended to cover both 1st and 2nd levels of
page tables
1 page for PGD (only 4 entries used for stage 1 translations)
4 pages for PMD
init_mm.pgd points to swapper_pg_dir
TTBR0 used for user mappings and always points to PGD
TTBR1 used for kernel mapping:
3:1 split – TTBR1 points to 4th page of PMD (2 levels only, 1GB)
Classic MMU does not allow the use of TTBR1 for the 3:1 split
2:2 split – TTBR1 points to 3rd PGD entry
1:3 split – TTBR1 points to 1st PGD entry
23
Linux and ARM LPAE (cont’d)
The page table definitions have been separated into
pgtable*-2level.h and pgtable*-3level.h files
Few PTE bits shared between classic and LPAE definitions
Negated definitions of L_PTE_EXEC and L_PTE_WRITE to match the
corresponding hardware bits
Memory types are the same and they represent an index in the TEX
remapping registers (PRRR/NMRR or MAIR0/MAIR1)
The proc-v7.S file has been duplicated into proc-v7lpae.S
Different register setup for TTBRx and TEX remapping (MAIRx)
Simpler cpu_v7_set_pte_ext (1:1 mapping between hardware and
software PTE bits)
Simpler cpu_v7_switch_mm (ASID switched with TTBR0)
Current ARM code converted to pgtable-nopud.h
Not using „nopmd‟ with classic MMU
24
Linux and ARM LPAE (cont’d)
Lowmem is mapped using 2MB sections in the 2nd level table
PGD and PMD only allocated from lowmem
PTE tables can be allocated from highmem
Exception handling
Different IFSR/DFSR registers structure and exception numbering –
arch/arm/mm/fault.c modified accordingly
Error reporting includes PMD information as well
PGD allocation/freeing
Kernel PGD entries copied to the new PGD during pgd_alloc()
Modules and pkmap entries added to the PMD during fault handling
Identity mapping (PA == VA)
Required for enabling or disabling the MMU – secondary CPU
booting, CPU hotplug, power management, kexec
25
Linux and ARM LPAE (cont’d)
Uses pgd_alloc() and pgd_free()
When PHYS_OFFSET > PAGE_OFFSET, kernel PGD entries may be
overridden
swapper_pg_dir entries marked with an additional bit –L_PGD_SWAPPER
pgd_free() skips such entries during clean-up
26
Current Status and Future
Initial development done on software models
Tested on real hardware (FPGA and silicon)
Parts of the LPAE patch set already in mainline
Mainly preparatory patches, not core functionality
Aiming for full support in Linux 3.3
Hardware supporting LPAE
ARM Cortex-A15, Cortex-A7 processors
TI OMAP5 (dual-core ARM Cortex-A15)
Other developments
KVM support for Cortex-A15 – implemented by Christoffer Dall at
Columbia University
Uses the Virtualisation extensions together with the LPAE stage 2
translations
27
Reference
ARM Architecture Reference Manual rev C
Currently beta, not publicly available yet
Specifications publicly available on ARM Infocenter
http://infocenter.arm.com/
ARM architecture -> Reference Manuals -> ARMv7-AR LPA
Virtualisation Extensions
Linux patches – ARM architecture development tree
Hosts the latest ARM architecture developments before patches are
merged into the mainline kernel
git://github.com/cmarinas/linux.git
When the kernel.org accounts are back
git://git.kernel.org/pub/scm/linux/cmarinas/linux-arm-arch.git
28
ARM Architecture
System Features
29
Interrupt Controller
ARM standardising on the “Generic Interrupt Controller” [GIC]
Supports the ARMv7 Security Extension
Interrupt groups support (Nonsecure) IRQ and (Secure) FIQ split
Distributor and cpu interface functionality
GICv2 adds support for a virtualized cpu interface
VMM manages physical interface & queues entries for each VM.
GuestOS (VM) configures, acknowledges, processes and completes
interrupts directly. Traps to VMM where necessary.
Hypervisor can virtualize FIQs from the physical (Nonsecure, IRQ)
interface.
30
Interrupt Handling – GICv2
Secure handler
(monitor mode)
or
OS FIQ handler
OS IRQ handler
GuestOS IRQ handler
GuestOS FIQ handler
VirtualCpu InterfaceN
• cpu-specific control + state
• Interrupt ACK / END (retire)
Virtual Grp0: FIQ
Virtual Grp1: IRQ
Hypervisor – virtualized distributor
• Grp1 physical => Grp1/Grp0 virtual
Virtual interrupt management
• interrupt queues (list registers)
Virtualization support
* ARM Architecture Security Extns
NEW
Distributor - control and configuration
• Assignment (≤8 cpu interfaces)
• Interrupt grouping (Grp0, Grp1)
Grp0: (Secure*, FIQ)
Grp1: IRQ always
Cpu Interface0
• cpu-specific control + state
• Interrupt ACK / END (retire)
Grp0: (Secure*, FIQ)
Grp1: IRQ always
GICv1
Cpu InterfaceN
• cpu-specific control + state
• Interrupt ACK / END (retire)
• software
• private (per cpu) peripheral
• shared peripheral
Interrupt sources:
31
Generic Timer
Shared “always on” counter
Fast read access for reliable
distribution of time.
≥ ComparedValue or ≤ 0 timer
event configuration support
Maskable interrupts
Offset capability for virtual time
Event stream support
Hypervisor trap support
SoC
Counter
Power
Controller
CPU Cluster1
CPU0
Timer0
CPU1
Timer1
Counter Interface
Memory Interconnect and Memory Controller
Shared Cache
$ $
Timer Distribution
Bus
Peripheral Bus
Always Powered
Power DomainGIC
CPU Cluster0
CPU0
Timer0
CPU1
Timer1
Counter Interface
Shared Cache
$ $
GIC
Syste
m e
ve
nts
Implementation example:
32
System MMU
Local Coherence Bus (no snooping on bus)
“Event pulse” enabling next cycle notifications
System MMU architecture translation options from pre-set tables No Translation
VA -> IPA (Stage 1) or IPA->PA (Stage 2) only
VA->PA (Stage1 and Stage2)
Can share page tables with ARM cores – relate contexts to VMIDs/ASIDs
L1
I-cache
L1
D-cache
Core0
L1
I-cache
L1
D-cache
CoreN
L2 cache
System interconnect
DMA
System MMU System Memory
33
Quad
Cortex-A15
Quad
Cortex-A15
AMBA 4 Cache Coherent Interconnect
CCI-400
I/O
device
MMU-400
DMC-400
Dynamic Memory Controller
AXI Network
Interconnect
Slaves Slaves
AXI Network
Interconnect
LCDVideo
DDR3/
LPDDR2
DDR3/
LPDDR2
PHY
GIC-400Mali
Graphics
PHY
MMU-400 MMU-400
Completing the SoCIPA
PA
Intermediate Physical Address
(IPA)
The address the OS believes
it‟s accessing
Physical Address (PA)
The actual address the VMM
assigns
34
To summarize: ARMv7-A additions
A natural progression of well established features applied to
an architecture long associated with low power.
„Classic‟ uses and solutions will apply to ARM markets:
Low cost/low power server opportunities
Application OS + RTOS/microkernel smart mobile devices
Feature split by „Manufacturers‟ versus User‟ VM environment
Automotive, home, ...
Opportunities for new use cases?
Today‟s entrepreneurs/ideas
=> tomorrow‟s major businesses
35
Thank you