1
Power Optimization Through Many-Core Multiprocessing
Delivering High Performance in a Low Power World
ChipEx2012
Haydn Povey
Marketing Director – Implementation & Security
ARM Processor Division
May 2, 2012
2
Billions of Connected Devices
Performance expectations continue to increase exponentially but power
efficiency and scalability are becoming formidable challenges
ABI Research, IDC, Gartner and ARM forecasts
Form Factor                TAM (m), 2015
Mobile Phones              1,750
Media players              300
Mobile Computers           750
Desktop PCs                150
Digital TV/STB             500
Automotive Infotainment    100
Other*                     450
Total                      4 billion
*Includes PND, photo-frames, etc
3
Historic Technology Drivers
Up to 1980s: Mainframes/mini; driver: Functionality
1990s: The PC; driver: Functionality / $
2000s: Notebooks; driver: Functionality / (Power × $)
2010s: Mobile Computing; driver: Functionality / (Energy × $)
5
Limitations with Multiprocessing
Cost of offering the peak single thread performance on each CPU quickly exceeds chassis thermal limits
System and software bottlenecks limit overall scalability
Single die integration offered some roadmap
6
Evolution to Many-Core
Base theorem
Simpler and smaller processor designs require exponentially less energy to accomplish the same amount of compute as a more complex and larger processor design.
“Approximate rule of thumb”
To increase performance by 50% you double the power and area cost of the processor design
Quickly reaches the point of diminishing returns
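The rule of thumb can be worked through numerically. The sketch below is illustrative only: it assumes the rule compounds (each doubling of power and area gives another 1.5x), which implies performance grows as budget raised to log2(1.5), and it assumes a perfectly parallel workload for the many-core case. The function names are hypothetical.

```python
import math

# Rule of thumb from the slide: doubling power and area buys about 50% more
# single-thread performance. Assumption of this sketch: the rule compounds,
# so perf = budget ** log2(1.5).
EXPONENT = math.log2(1.5)  # about 0.585


def single_core_perf(budget: float) -> float:
    """Relative performance of one core built with `budget` units of power/area."""
    return budget ** EXPONENT


def many_core_perf(budget: float, cores: int) -> float:
    """Aggregate performance of `cores` equal cores sharing `budget`.

    Idealized: assumes the workload splits perfectly across all cores.
    """
    return cores * single_core_perf(budget / cores)


if __name__ == "__main__":
    for budget in (1, 2, 4, 8):
        print(f"budget {budget}: one big core {single_core_perf(budget):.2f}, "
              f"{budget} unit cores {many_core_perf(budget, budget):.2f}")
```

Under these assumptions a budget of 2 buys 1.5x from one larger core but 2.0x from two unit cores, and the gap widens as the budget grows. The many-core figure is an upper bound that holds only if the software can actually be split.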
7
Challenge of Many-Core
Many-core definition
Use ‘lots’ of smaller, more efficient processors to achieve a higher aggregate performance than can be reached through multiprocessing
Smaller processors cannot execute the same single thread in the same time as a higher performance processor, so they can’t execute existing applications effectively
Many threads cannot easily be decomposed into simpler, smaller tasks that would benefit from multiprocessing on the smaller processors
Software development challenge
8
Software Data Decomposition
Split a large quantity of DATA into smaller chunks that can be operated on in parallel
Each data item is independent
[Figure: TASK blocks mapped onto a row of CPUs]
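A minimal sketch of data decomposition, assuming a hypothetical per-chunk operation (sum of squares) and a thread pool standing in for the CPUs. The function names and the choice of four workers are illustrative and do not come from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split `data` into at most `n_chunks` contiguous, near-equal slices."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """Hypothetical per-chunk work. No chunk depends on any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Apply the same operation to every chunk, then combine the partial results."""
    chunks = split_into_chunks(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)
```

For example, `parallel_sum_of_squares(list(range(10)))` returns 285. In standard CPython, threads share the global interpreter lock, so this shows the structure rather than a real speed-up; `ProcessPoolExecutor` offers the same `map` interface when the chunks should run on separate cores.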
9
Software Task Decomposition
Functionally independent tasks can be executed
concurrently
Each task item is functionally independent
[Figure: groups of TASK blocks mapped onto a row of CPUs]
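Task decomposition can be sketched in the same way, except that each worker runs a different function instead of the same function on different data. The three task functions below are hypothetical placeholders chosen only because they do not depend on one another.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical, functionally independent tasks: none reads another's output.
def decode_audio():
    return "audio decoded"


def refresh_display():
    return "display refreshed"


def poll_network():
    return "network polled"


def run_concurrently(tasks, workers=4):
    """Submit every task at once and collect each result under its name."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(run_concurrently({
        "audio": decode_audio,
        "display": refresh_display,
        "network": poll_network,
    }))
```

Because the tasks share no data, no ordering or locking is needed between them; that property is what makes this form of decomposition straightforward to spread across cores.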
10
Example: Real-Time Video Encoding
Functional Block Partitioning
Functional blocks are serially dependent but temporally independent
Distribute different functional blocks across available processors
Split into defined functional threads
Uses passing of data blocks between threads to allocate work
Requires code changes and fine tuning
(Simplified MPEG encoding functional block diagram)
Blocks shown: Analogue Video Sampling, Remove Inter-Frame Redundancy, Remove Intra-Frame Redundancy, Quantise Samples, Run-Length Compress, Buffer Store, Motion Compensation
[Figure: functional blocks scheduled on CPU0 to CPU3 along a TIME axis]
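The functional partitioning described on this slide, where each stage runs in its own thread and data blocks are passed from one stage to the next, can be sketched with queues. This is a toy model, not an encoder: the two stage functions are simplified stand-ins for the quantise and run-length blocks, and all names are hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the block stream


def quantise(block):
    """Stand-in for the quantise stage: coarsen each sample."""
    return [x // 4 for x in block]


def run_length(block):
    """Stand-in for run-length compression: list of (value, count) pairs."""
    runs = []
    for value in block:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def stage(fn, inbox, outbox):
    """Apply `fn` to every block from `inbox` and pass the result downstream."""
    while True:
        block = inbox.get()
        if block is _STOP:
            outbox.put(_STOP)
            return
        outbox.put(fn(block))


def run_pipeline(blocks, stage_fns):
    """Run one thread per stage; while stage 2 handles block N, stage 1 can start N+1."""
    queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for t in threads:
        t.start()
    for block in blocks:
        queues[0].put(block)
    queues[0].put(_STOP)
    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Each block still passes through the stages in order (the serial dependency), but different blocks occupy different stages at the same moment, which is where the overlap across processors comes from.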
11
Strategy Focus: The Thermal Wall
SoC sustained power is limited in mobile devices by thermals:
1.5 W to 2 W with low-cost PoP and stacked memories
3 W without stacked memories
[Figure: Power versus Time, with the following levels and annotations]
Un-managed Max Power (@Tjmax)
Burst for responsiveness (e.g. Browsing)
Sustained performance (e.g. HD Video Record, Gaming)
Power Optimised Low End (e.g. e-Mail, Voice, MP3)
T >= Tjmax, Tskin
Managed Sustained Power
Tj >= Tmax; Tj < Tmax
“Opportunistic Residency”
Responsiveness is a must
Complex active management is needed
12
Applying Nominal Use Case
Typical Day for Smartphone User
90 min voice calling
60 min email / social networking
30 min reading web
50 min Angry Birds / other gaming
90 min jogging while listening to music and logging GPS co-ordinates
10 min video recording
7 hrs sleep with music alarm clock
OS typically executing ~28 active processes
Apps synching in background
13
Use Case Measurements
14
Use Case Conclusion
Profiled CPU States
State         Minutes    % of CPU Active
Deep Sleep    1186       n/a
200 MHz       154        60%
500 MHz       69         27%
800 MHz       18         7%
1000 MHz      4          2%
1200 MHz      10         4%
If the phone was ARM big.LITTLE™ enabled...
Active CPU time
12% big
88% LITTLE
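The percentages in the profiled table follow directly from the minutes column, as the short check below shows. The slides do not say how the 12% / 88% split was derived; the 800 MHz cutoff used in `share_at_or_above` is an assumption made here for illustration, and the function names are hypothetical.

```python
DEEP_SLEEP_MIN = 1186
# Minutes spent at each active frequency (MHz), taken from the profiled table.
ACTIVE_MIN = {200: 154, 500: 69, 800: 18, 1000: 4, 1200: 10}


def active_share(minutes_by_freq):
    """Percentage of active CPU time spent at each frequency."""
    total = sum(minutes_by_freq.values())
    return {freq: 100 * minutes / total for freq, minutes in minutes_by_freq.items()}


def share_at_or_above(minutes_by_freq, threshold_mhz):
    """Percentage of active time at or above `threshold_mhz` (assumed cutoff)."""
    total = sum(minutes_by_freq.values())
    high = sum(m for f, m in minutes_by_freq.items() if f >= threshold_mhz)
    return 100 * high / total


if __name__ == "__main__":
    for freq, pct in active_share(ACTIVE_MIN).items():
        print(f"{freq} MHz: {pct:.0f}% of active time")
    print(f"at or above 800 MHz: {share_at_or_above(ACTIVE_MIN, 800):.1f}%")
```

The active states total 255 minutes, and the rounded shares reproduce the table (60%, 27%, 7%, 2%, 4%). With the assumed 800 MHz cutoff, about 12.5% of active time falls in the higher states, which is close to the 12% "big" figure on the slide without necessarily being how it was calculated.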
15
big.LITTLE Processing
Multiprocessing Capable
Many-core Benefits
16
“big” Processor – Cortex-A15
ARM Cortex™-A15 Processor
3.5+ DMIPS/MHz
1-4 core MPCore™ configurable
Advanced Capabilities
Full ARMv7-A architecture
Thumb®-2, TrustZone®, VFP, NEON™
Virtualization, large address extensions
AMBA® 4 ACE™ coherency
High Performance
Targeting 1.5 GHz mobile implementation on 28nm
Hard Macro Quad-core Implementation @ 2 GHz on 28HPM process
17
“LITTLE” Processor – Cortex-A7
ARM Cortex-A7 Processor
“LITTLE” to Cortex-A15 “big”
1-4 core MPCore configurable
Same Architectural Capabilities
Full ARMv7-A architecture
Thumb-2, TrustZone, VFP, NEON
Virtualization, large address extensions
AMBA 4 ACE Coherency
ISA identical to Cortex-A15 processor
High Performance
Up to 1.2 GHz for mobile implementation on 28nm
18
Comparison of big.LITTLE Pipelines
19
Performance Comparison
20
Power Efficiency Comparison
21
Software Use Models
big.LITTLE Task Migration – One CPU active
Migrate between Cortex-A15 and Cortex-A7 depending on performance requirements
big.LITTLE MP – Both CPUs can be active
Allocate threads that need high performance to Cortex-A15
Allocate threads that don’t require high performance to Cortex-A7 for best energy efficiency
AMBA 4 hardware coherency between Cortex-A15 and Cortex-A7
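The two use models can be contrasted with a toy policy sketch. This is not ARM's switcher software or any operating system's scheduler: the thresholds, the load and demand metrics (fractions between 0 and 1), and the function names are all hypothetical, chosen only to show the difference between moving all work to one cluster and placing threads individually.

```python
UP_THRESHOLD = 0.85    # hypothetical: load above which work moves to the big core
DOWN_THRESHOLD = 0.30  # hypothetical: load below which work returns to LITTLE


def migration_target(current, load):
    """Task Migration model: only one cluster is active, so pick which one.

    The gap between the two thresholds avoids bouncing between clusters
    when the load hovers around a single cutoff.
    """
    if current == "LITTLE" and load > UP_THRESHOLD:
        return "big"
    if current == "big" and load < DOWN_THRESHOLD:
        return "LITTLE"
    return current


def allocate_threads(thread_demand, high_perf_cutoff=0.5):
    """MP model: both clusters are active, so place each thread by its demand."""
    placement = {"big": [], "LITTLE": []}
    for name, demand in thread_demand.items():
        cluster = "big" if demand >= high_perf_cutoff else "LITTLE"
        placement[cluster].append(name)
    return placement


if __name__ == "__main__":
    print(migration_target("LITTLE", 0.9))
    print(allocate_threads({"game_render": 0.9, "email_sync": 0.1}))
```

In both models the hardware coherency noted on the slide is what keeps the data consistent when work moves or runs across the two processor types; the sketch only covers the placement decision.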
22
Task Migration Mechanics
23
CCI-400 Cache Coherent Interconnect
CCI-400 2+3 (x3)
2 full AMBA 4 ACE slave interfaces
+3 ACE-Lite I/O coherent slave interfaces
x3 master interfaces
CCI interfaces:
AMBA 4 ACE and ACE-Lite manage all coherency, shareability and barriers
AMBA 4 compliant, 128-bit single layer at up to ½ Cortex-A15 frequency
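The width and clock ratio above give a rough upper bound on per-interface bandwidth. The sketch assumes one 128-bit data transfer per interconnect clock on a single data channel and ignores protocol overhead. The Cortex-A15 frequencies passed in are the 1.5 GHz and 2 GHz figures quoted earlier in the deck, and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(width_bits, cpu_freq_ghz, ratio=0.5):
    """Upper-bound GB/s on one data channel.

    Assumes one full-width transfer per interconnect clock, with the
    interconnect running at `ratio` times the CPU frequency.
    """
    bytes_per_transfer = width_bits / 8
    interconnect_ghz = cpu_freq_ghz * ratio
    return bytes_per_transfer * interconnect_ghz


if __name__ == "__main__":
    for cpu_ghz in (1.5, 2.0):
        print(f"Cortex-A15 at {cpu_ghz} GHz: "
              f"{peak_bandwidth_gb_s(128, cpu_ghz):.1f} GB/s peak per channel")
```

Since the slide says "up to" half the processor frequency, real implementations may clock the interconnect lower, and sustained throughput will be below this figure.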
[Figure: system block diagram centred on the CoreLink™ CCI-400 Cache Coherent Interconnect, 128 bit @ up to 0.5 Cortex-A15 frequency]
Blocks shown: Quad Cortex-A15, Quad Cortex-A7, GIC-400, Mali-T604 Graphics, coherent I/O device, LCD, DMA, MMU-400, ADB-400, NIC-400 (two instances), DMC-400, DDR3/2 and LPDDR2/3 PHYs, other slaves
Interface labels: ACE, ACE-Lite, ACE-Lite + DVM, AXI4, 128b
NIC-400 configuration labels: AXI4/AXI3/AHB/APB and AXI4/AXI3/AHB
24
Summary
Multiprocessing enables today’s applications to scale and grow while maintaining single thread performance
Nicely addresses the multi-tasking of stacked usage scenarios
Many-core brings the energy advantages of simpler and smaller processors, but with the challenges of software complexity and a lack of backwards compatibility with respect to single thread performance
big.LITTLE processing, as delivered by the ARM Cortex-A15 and Cortex-A7, offers both the performance and compatibility advantages of multiprocessing and the power efficiency and scalability advantages of many-core processing