1
Many-core back to the future
Matt Horsnell
ARM Research and Development
2
Outline
Introduction
How we got to multi-core
Focus on multi-core
Evolution to many-core?
Predicting the future microprocessor?
Why I believe it’s a great time to be a computer architect
Wrap-up
3
About me
MEng Computer Science, 2003
Final year project: multi-core processors (David May)
PhD Computer Science, 2007.
“A Chip Multi-Cluster architecture with locality aware task-distribution.”
Research Associate, 2008. Object-based Hardware Transactional Memory.
Software Engineer, 2010. Clock-tree synthesis and clock-concurrent optimization.
Research and Development
Architecture and micro-architecture.
4
ARM
The leading silicon IP company
Leading 32 bit RISC architecture
Leading Physical IP
~8Bn ARM chips shipped in 2011
Design and license CPUs, GPUs and associated IP
> 920 CPU licences
> 275 licensees
~1500 designs/year use ARM physical IP
2400 people world-wide >30 locations
LSE (FTSE100) and NASDAQ, >£10Bn
*Image from FastCompany’s 50 most innovative companies 2011 (#12)
5
ARM – Business model
Technology: energy-efficient chips
2-3 years to design a processor
2-3 designs introduced per year
Range of design points
Range of end-markets
Partners select a design
License fee to access design
3-4 years integrating into SoC
Royalty fee per chip
20+ years reuse
6
ARM - 1mm3 to 1km3
*NSF and University of Wisconsin-Madison
*University of Michigan
[Diagram: spectrum of ARM-based systems from 1mm3 to 1km3, 10c to $1000]
2002: Ubiquitous Environments
8.75mm3 system (University of Michigan): 2 x solar cells, 0.18µm CMOS near-threshold Cortex™-M3, 12µAh Li-ion battery; processor, SRAM and PMU
Cortex-M0; 65¢ (University of Michigan)
Examples: NXP Cortex-M0, Fujitsu Calypso Cortex-R4, Samsung Exynos Cortex-A15 (Samsung Galaxy S3, Google Nexus 10), ARM Cortex-A57
7
ARM - Partnership Momentum
920 Licenses, 300 Partners, ~50% Shipping
ARM is growing into new markets and product categories
ARM® Architecture is the number one by volume
2002 1 Billion Cumulative
2010 25+ Billion Cumulative
2020 150+ Billion Cumulative
8
ARM – Research and Development
Focus on technologies with impact 10+ years from now
the product pipeline is typically 7 years
Always interesting and challenging work
research ideas become products
we can influence the whole industry
9
The state of micro-architecture, or how we got to MULTI-CORE
10
Moore’s Law
doubling the number of transistors, economically placed on a chip, every 2 years
doubling performance every 2 years
[Chart: feature size vs. year, 1985-2010, scaling from 1.5 um down to 32 nm; Intel 4004: 2,300 transistors at 108 kHz; Intel Pentium: 3.1M transistors at 66 MHz; Intel Core i7: 1.4B transistors at 3 GHz]
*data sourced from http://cpudb.stanford.edu/
11
Dennard Scaling
1974 Robert Dennard at IBM “MOSFETs continue to function as voltage-controlled switches while all key figures of merit such as layout density, operating speed, and energy efficiency improve provided geometric dimensions, voltages, and doping concentrations are consistently scaled to maintain the same electric field.”
30% dimension shrink 50% area shrink
40% performance increase
maintain a 30% supply voltage drop
~50% power reduction
40% faster, 2x transistors, constant power (worked through in the sketch below)
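To make these bullets concrete, here is a minimal sketch of the classic Dennard arithmetic, assuming the textbook 0.7x linear shrink per generation (illustrative figures, not data from the slides):

```python
# Hedged sketch of classic Dennard scaling for one process generation.
# Assumes the canonical 0.7x linear shrink; numbers are illustrative only.
k = 0.7                      # linear dimension scale factor (30% shrink)

area    = k * k              # ~0.49 -> ~50% area shrink (2x transistors per area)
freq    = 1 / k              # gate delay scales with dimensions -> ~1.4x frequency
voltage = k                  # supply voltage scaled to keep the electric field constant
cap     = k                  # capacitance per device also shrinks

# Dynamic power per device: P = C * V^2 * f
power_per_device = cap * voltage**2 * freq        # ~0.49x -> "~50% power reduction"
power_density    = power_per_device / area        # ~1.0x  -> constant power density

print(f"area {area:.2f}x, freq {freq:.2f}x, "
      f"power/device {power_per_device:.2f}x, power density {power_density:.2f}x")
# -> area 0.49x, freq 1.43x, power/device 0.49x, power density 1.00x
```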
12
Frequency
[Chart: processor frequency (MHz, log scale) vs. year, 1985-2010: roughly a 100x increase]
*data sourced from http://cpudb.stanford.edu/
13
Performance vs. Dennard
[Chart: normalized performance (log scale, 1-10000) vs. feature size, 1.5 um to 32 nm; annotations mark a ~2 orders-of-magnitude gain against the <2 orders expected from Dennard scaling alone]
*data sourced from http://cpudb.stanford.edu/
14
Micro-architecture gains
On-die cache, pipelining
Super-scalar, OOO-speculative
Deep pipeline, replay, trace-cache
Back to non-deep pipeline
[Chart: performance increase (X) from micro-architecture per era, roughly in the 1-4x range]
* Borkar et. al, The future of the microprocessor, ACM Comms 2011.
15
Pollack’s rule
[Chart: normalized performance vs. normalized core area, both on log scales (1-1000 area, 1-100 performance): performance grows roughly as the square root of core area; see the sketch below]
*data sourced from http://cpudb.stanford.edu/
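As a quick illustration of Pollack's rule (single-core performance growing roughly with the square root of normalized core area), a minimal sketch with purely illustrative numbers:

```python
import math

def pollack_performance(norm_area: float) -> float:
    """Pollack's rule: single-core performance ~ sqrt(normalized core area)."""
    return math.sqrt(norm_area)

# Quadrupling core area roughly doubles performance; 100x area gives only ~10x.
for area in (1, 4, 16, 100):
    print(f"area {area:4d}x -> performance ~{pollack_performance(area):.1f}x")
```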
16
End of Dennard Scaling
Classic scaling ended at 130nm
Innovation needed to keep driving Moore’s Law
Materials and Lithography
Parameter (scale factor = a)    Classic Scaling   Current Scaling
Dimensions                      1/a               1/a
Voltage                         1/a               ~1
Current                         1/a               1/a
Capacitance (A/t)               1/a               >1/a
Power/Circuit (V·I)             1/a²              1/a
Power Density (V·I/A)           1                 a
Circuit Delay                   1/a               ~1
17
Power
infeasible to economically dissipate heat beyond 800mW/mm2
[Chart: power density (mW/mm², log scale) vs. year, 1985-2010, with "hot plate" and "nuclear reactor" reference lines]
*data sourced from http://cpudb.stanford.edu/
18
Memory
Typically 100s of cycles for a memory access to DRAM
Hidden to date by caches and bandwidth
DRAM density increases with Moore’s Law
Speed increases far slower
[Chart: CPU clocks per DRAM access latency vs. year, 1980-2010, rising from ~1 to several hundred]
* Borkar et. al, The future of the microprocessor, ACM Comms 2011.
19
Cache
energy concerns and inefficient μ-arch led to more cache for efficiency
cache sizes increased slowly
decreasing die area given to $
most transistors core μ-arch
[Charts: on-die cache size (KB, log scale) and on-die cache as a % of total die area (0-60%), across process nodes from 1um to 65nm]
* Borkar et. al, The future of the microprocessor, ACM Comms 2011.
20
Trend Summary
Process scaling continues to follow Moore’s Law
ITRS suggests 7nm in 2024, Intel 5nm ~2020
Power budget remains constant
Frequency
frequency increases stopped in 2005
20 years of frequency increases beyond Dennard scaling increased power
Micro-architectural techniques
ILP speculation increases power
practical pipelining limits: shrinking stages to ever fewer FO4 delays increases power
Design complexity
only ~√ performance gained for added complexity (Pollack's rule)
The memory wall remains, hence ever larger caches
21
A focus on MULTI-CORE
22
Multi-core
Performance increases drive the microprocessor industry
performance enables new applications
performance enables new markets
performance enables new form factors
Power fundamentally prevents more frequency scaling
performance must come from parallelism
ILP already exploited in single core microprocessors
Must exploit thread-level parallelism*
Multiple cores on the same die
gave the industry a new road map
Moore’s Law now applied to doubling cores every 2 years
23
MPCore Power
Single CPU
unused processors ‘turned off’
1CPU @ 260MHz, consumes ~160mW
Dual-CPU (same MHz, same Vt)
Same workload
single-threaded, concurrency from OS only
could lower MHz (V² power saving)
Lower power in dual-CPU at same MHz
Reduced context switching
Increase in cache effectiveness
With threaded code, MP offers more performance at lower MHz (see the sketch below)
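A rough sketch of the "V² power saving" argument, assuming dynamic power scales as C·V²·f and that supply voltage can be lowered roughly in line with frequency; the numbers are illustrative, not MPCore measurements:

```python
def dynamic_power(cores: int, freq: float, volt: float, cap: float = 1.0) -> float:
    """Relative dynamic power: P ~ n_cores * C * V^2 * f (all values normalized)."""
    return cores * cap * volt**2 * freq

# Single core at full frequency/voltage vs. two cores at ~60% frequency,
# assuming voltage can track frequency downwards (DVFS).
single = dynamic_power(cores=1, freq=1.0, volt=1.0)
dual   = dynamic_power(cores=2, freq=0.6, volt=0.7)

print(f"single-core: {single:.2f}, dual-core: {dual:.2f}")
# -> the dual-core pair burns ~0.59x the power for comparable aggregate
#    throughput, provided the workload is threaded enough to use both cores.
```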
24
ARM MPCORE
ARM roadmap introduced ARM11 MPCore 2003, shipped 2005.
ARM-Cortex A-class MPCore roadmap : A8, A9, A5, A15
[Diagram: two quad Cortex-A15 MPCore clusters, each with processor coherency (SCU) and up to 4MB L2 cache, connected over 128-bit AMBA 4 to the CoreLink CCI-400 Cache Coherent Interconnect; an MMU-400 System MMU attaches I/O-coherent devices]
25
Cortex-A15 MPCore
Cortex-A15 uses a similar MP model to the Cortex-A9/A5
Data side cache coherence maintained by Snoop Control Unit (SCU)
Two ACE master ports and optional ACP slave interface supporting coherent data transfers to non-cached external devices.
Integrated GIC Interrupt controller and timer-watchdog units
Tightly integrated L2 cache
[Diagram: quad-core cluster (labelled with Cortex-A5/A15 cores); each core contains STB, DPU, PFU, DCU, Main TLB, BIU, ICU, data/instruction µTLBs, ETM, v7 Debug, APB and CP15 blocks; the Snoop Control Unit (with duplicate tags, per-core timers, GIC and ACP) joins the cores to the integrated L2 cache and the AXI/ACE interfaces; IRQ[n] feeds the GIC]
26
Snoop Control Unit
Self-sizing intelligent block that 'joins' multiple CPUs together
Manages the impact of, and control over, support for the coherence protocol
Supports cache-to-cache transfers and monitors for migratory lines
Arbitrates multiple CPUs across 1 or 2 load-balanced system AXI buses
Manages adaptive power-down and interfaces with the system power controller
Decodes and manages private peripheral accesses
Runs at CPU frequency
27
New Capabilities in the Cortex-A15
Full compatibility with the Cortex-A9/A5/A8
Supporting the ARMv7 Architecture
Addition of Virtualization Extension (VE)
Run multiple OS binary instances simultaneously
Isolates multiple work environments and data
Supporting Large Physical Addressing Extensions (LPAE)
Ability to use up to 1TB of physical memory
With AMBA 4 System Coherency (AMBA-ACE)
Other cached devices can be coherent with processor
Many-core multiprocessor scalability
Basis of concurrent big.LITTLE Processing
28
Evolution to MANY-CORE?
29
Evolution to Many-Core
Base theorem
Simpler and smaller processor designs require far less energy to accomplish the same amount of compute than a more complex and larger processor design.
“Approximate rule of thumb” held within ARM
To increase performance 50% you double the power and area cost of the processor design
Quickly reaches the point of diminishing returns (see the sketch below)
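A small sketch of why this rule of thumb hits diminishing returns, taking the "2x power and area per +50% performance" figure at face value:

```python
# Illustrative compounding of the rule of thumb: each +50% single-thread
# performance step costs roughly 2x power and 2x area.
perf, power, area = 1.0, 1.0, 1.0
for step in range(1, 5):
    perf  *= 1.5
    power *= 2.0
    area  *= 2.0
    print(f"step {step}: perf {perf:5.2f}x  power {power:5.1f}x  "
          f"area {area:5.1f}x  perf/W {perf / power:.2f}")
# Energy efficiency (perf/W) drops by ~25% per step, which is why a cluster of
# simpler cores wins whenever the work can be parallelised.
```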
30
Micro-architecture trade-offs
[Chart: power vs. performance trade-off: high leakage power at low performance points, high switching power at high performance points; curves for a big core, a small core and an ideal high-dynamic-range core]
31
Strategy Focus: The Thermal Wall
SoC sustained power is limited in mobile devices by thermals:
1.5W to 2W with low-cost PoP and stacked memories
3W without stacked memories
[Chart: power vs. time, showing un-managed max power (@Tjmax), managed sustained power, bursts for responsiveness (e.g. browsing), sustained performance (e.g. HD video record, gaming), a power-optimised low end (e.g. e-mail, voice, MP3), and "opportunistic residency" while Tj < Tmax]
Responsiveness is a must
Complex active management is needed
32
Use Case
Typical day for heavy smartphone user
90 mins voice calls
60 mins email
30 mins reading web
30 mins watching HW-accelerated video
50 mins playing Angry Birds or similar
90 mins jogging, listening to MP3s and logging GPS coordinates
10 mins video recording/photo capture
7 hrs sleep with music alarm clock
OS typically executing ~28 active processes
background synchronizing
33
Use Case Measurements
34
Use Case Conclusion
35
Multiprocessing Capable Many-core Benefits
36
ARM’s “big” processor
Cortex-A15 Processor
announced September 2010
1-4 Core MP configurable
Dual-cores shipped Oct’12
Advanced Capabilities
Full ARMv7A architecture
Thumb®-2, Trustzone®, VFP, Neon
Virtualization, LPAE
AMBA® 4 ACE™ Coherency
High Performance
Up to 1.5GHz for mobile on 28nm
Shipped Oct '12: Samsung Exynos 5250 at 1.7GHz
Products: Nexus 10, Samsung Chromebook
37
ARM “LITTLE” Processor
Cortex-A7 Processor
announced October 2011
1-4 Core MP configurable
Same Advanced Capabilities
Full ARMv7A architecture
Thumb®-2, Trustzone®, VFP, Neon
Virtualization, LPAE
AMBA® 4 ACE™ Coherency
ISA identical to Cortex-A15
High Performance
Up to 1.2GHz in mobile
Cortex-A7 & Cortex-A15 test chip taped out in Q4 '11
Performance efficiency exceeded expectations
38
Comparison of big.LITTLE Pipelines
Cortex-A7 Pipeline
Focused on energy efficiency
8-11 stages, in-order, limited dual-issue
Cortex-A15 Pipeline
Focused on efficient peak performance
15+ stages, out-of-order, multi-issue
39
Size Matters
Large silicon area costs:
Less die per wafer
Higher yield impact from silicon imperfections
Higher leakage power whenever power applied
Typically contains more transistors, raising dynamic switching power
Not necessarily providing increased performance if gates are required to support architectural complexity rather than instruction execution
Single-core Cortex-A7 (incl. NEON, FPU, 32kB L1): 0.45 mm² in 28nm, shown against a device with comparable performance
ARM's LITTLE processor
40
Performance Comparison
41
Power Efficiency Comparison
42
Extending DVFS
DVFS sweep over entire operational voltage range of ARM’s first big.LITTLE processor pair
43
Software Use Models
big.LITTLE switching – one CPU active
switch between A15 and A7 depending on performance requirements
big.LITTLE MP – both CPUs can be active (see the affinity sketch below)
allocate threads that need high performance to the A15
allocate threads that don't need high performance, but benefit from the best energy efficiency, to the A7
AMBA 4 hardware coherency between A15 and A7
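A hedged illustration of the big.LITTLE MP allocation idea using Linux CPU affinity from user space; the CPU numbering (0-3 = Cortex-A7, 4-7 = Cortex-A15) is an assumption made only for this sketch, and real deployments rely on the kernel scheduler rather than manual pinning:

```python
import os

# Assumed topology for this sketch only: LITTLE (Cortex-A7) cores 0-3,
# big (Cortex-A15) cores 4-7. Real platforms expose the topology via
# /sys/devices/system/cpu/ and normally let the scheduler decide.
LITTLE_CPUS = {0, 1, 2, 3}
BIG_CPUS = {4, 5, 6, 7}

def run_on(cpus):
    """Restrict the calling process/thread to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)

# A latency-critical phase gets the big cluster...
run_on(BIG_CPUS)
# ...while background work is happy on the energy-efficient LITTLE cluster.
run_on(LITTLE_CPUS)
```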
44
Predicting the future MICROPROCESSOR
45
A word on prediction…
1989 – Microprocessors circa 2000 – IEEE Spectrum, Gelsinger, P., Intel.
2000 50M transistors, 250MHz
no. of transistors 20% over, frequency 2x under, performance 4-8x under
1996 – The future of micro-processors – IEEE Micro, Yu, A., Intel.
2006 350M transistors, 4GHz
no. of transistors 2x over, frequency 5% over
predicted the power wall without breakthrough voltage scaling
understandably missed the trend to mobile computing
2005 – The future of micro-processors – ACM Queue, Olukotun et. al, Stanford.
no predictions just a discussion of CMPs
2011 – The future of micro-processors – Comms. ACM, Borkar et. al, Intel.
Moore’s law continues
μ-arch goes beyond homogeneous parallelism, exploit heterogeneity, exploit custom logic
software must be able to take advantage of it
Various – by 2010-40 we’ll have 100, 1000, 10,000, 100,000 cores
46
Evolution of Mobile Performance
[Diagram: cumulative evolution of mobile performance: MHz → Architecture → U-Architecture → MP → Multi-Performance → GPGPU → Heterogeneity / Domain-Specific Off-load (ARM CPUs, GPU, HW accelerators) → Cloud → Future?]
47
Near Future Smartphone
[Diagram: near-future smartphone system built from an applications subsystem, graphics & composition subsystem, image processing/recognition, H.265 video decode & encode, 5G wireless baseband, packet processing subsystem, display, power manager and memory interface; headline figures on the slide: 22MP imaging, 4K external display, native 1080p display at 120fps, 240fps capture, 4K video, 5G wireless at peak 2.5Gb/s and average 400Mb/s, further link peaks of 10Gb/s and 1Gb/s, 16GB memory at 100GB/s, 512GB storage, 13Wh battery]
48
System scaling
           Today        Near Future   Increase   Notes
Cellular   20 Mbps      400 Mbps      20X
Wifi       300 Mbps     10 Gbps       30X
Display    720P         4K            17X        assumes constant complexity
Video      720P H.264   4K H.265      34X-102X   H.265 is 2-4X more complex
Battery    5.7 W        13 W          2.2X
Compute    ?            ?             ??X

Input: 20-30X increase; Output: 17X - 68X increase
How does compute scale in a power and thermal constrained environment?
49
Dark Silicon

                                        45nm (2008)   22nm (2014)   11nm (2020)
Area^-1                                 1             4             16
Peak freq                               1             1.6           2.4
Power                                   1             1             0.6
Exploitable Si (in 45nm power budget)   —             (4 x 1)^-1 = 25%   (16 x 0.6)^-1 = 10%

Source: ITRS 2008
Lack of power scaling severely limits the complexity of systems! (the arithmetic is sketched below)
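The "exploitable silicon" row works out as follows; a minimal sketch using the ITRS 2008 factors from the table above:

```python
# Exploitable silicon under a fixed (45nm) power budget: transistor density
# rises (Area^-1) while per-device power only partly falls, so the fraction
# of the chip that can run at full speed shrinks - the rest is dark silicon.
nodes = {
    "45nm (2008)": {"density": 1,  "power": 1.0},
    "22nm (2014)": {"density": 4,  "power": 1.0},
    "11nm (2020)": {"density": 16, "power": 0.6},
}

for node, f in nodes.items():
    exploitable = 1.0 / (f["density"] * f["power"])
    print(f"{node}: exploitable Si ~{exploitable:.0%}")
# -> 100%, 25%, ~10%
```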
50
The Many-core wall
[Slide reproduces the first page of Esmaeilzadeh et al., ISCA'11, including "Figure 1: Overview of the models and the methodology". The study combines a device scaling model (DevM), a core scaling model (CorM, Pareto frontiers of area/performance and power/performance fitted from SPECmark data) and a multicore scaling model (CmpM, covering CPU-like and GPU-like designs in symmetric, asymmetric, dynamic and composed topologies), then searches the design space under area, power and PARSEC benchmark constraints. Key findings: over five technology generations only a 7.9x average speedup is possible using ITRS scaling; at 22nm ~21% of the chip will be dark and at 8nm over 50% will go unused; neither CPU-like nor GPU-like multicores reach the expected speedups, so radical microarchitectural innovation is needed to keep pace with Moore's Law.]
* Esmaeilzadeh et al., ISCA'11
51
The Many-core wall?
Esmaeilzadeh et al. predict the end of multi-core scaling at the 16nm node – as early as 2014
results shown under both ITRS scaling and conservative scaling assumptions
* Esmaeilzadeh et al., ISCA'11
52
Increasing Heterogeneity
big.LITTLE
How far can the specialization of micro-architecture improve energy efficiency within a common instruction set architecture?
GP/GPU
Exposing the compute capability of GPU through a general purpose language
OpenCL available on mobile parts (Samsung Chromebook, Nexus 10)
HSA foundation
My biggest question....
How can the benefits of homogeneity in a programming environment be maintained with this increasing heterogeneity?
53
ASIC vs. General Purpose
ASIC efficiency far greater than general-purpose processors [Hameed et al, ISCA'10]
100-1000x more energy efficient
50x more performance
Difficult to identify targets in the general case
beyond the obvious candidates (audio, video, crypto, packet)
what granularity to target?
QsCores – target multiple general-purpose computations [Venkatesh et. al, MICRO'11]
Integration
how to offload, how to handle contention
static compile time target, or dynamic
raises the software bar even higher…
54
a word on software…
Finding thread level parallelism is hard.
Performance portability
ensuring optimal performance becomes exponentially harder
does this finally mandate runtimes and virtual machines?
* Blake et. al, ISCA'10
55
and there’s more…
Fault tolerance
how to design for and overcome hard or soft faults
Leakage avoidance
system designed for power off
software designed to enable more power off
Bandwidth
feeding more and more cores requires huge off-chip bandwidth
power and latency concerns
56
Why I believe it's an interesting time to be a COMPUTER ARCHITECT
57
Energy proportional computing
Reality is a finite (fixed) energy budget
need to re-evaluate architecture and implementation
90/10 rule becomes 10x10
special purpose function acceleration
Energy proportional computing becomes the goal
in the near term multi-core will likely become many-core
many-core will certainly be heterogeneous
identifying accelerators becomes necessary
Software agnostic
performance portable, interfaces, parallel code
A new era of rapid dynamics within computer architecture: "solutions" will change at a much quicker tempo
58
Something cool – transaction elimination
Bandwidth to memory = Power
Loosely speaking, the power budget for a mobile GPU is ~1W
150pJ to read/write a byte from memory
2x32 LPDDR2 peaks at 4-8GB/s
150 pJ/byte × 8 GB/s ≈ 1.2 W
Mali-T604 transaction elimination
compute a checksum/hash for each completed tile
write out tile only if checksum changes
trades off extra compute for a reduction in bandwidth (see the sketch below)
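A minimal software sketch of the idea; the real Mali-T604 mechanism is implemented in hardware, and the tile size and CRC choice here are assumptions made only for illustration:

```python
import zlib

TILE_BYTES = 16 * 16 * 4      # assumed 16x16 RGBA tile, for illustration only
prev_crc = {}                 # last written checksum per tile index

def flush_tile(tile_index, pixels, write_to_dram):
    """Write the tile to memory only if its contents changed since last frame."""
    crc = zlib.crc32(pixels)
    if prev_crc.get(tile_index) == crc:
        return False          # eliminated: identical tile, no DRAM traffic
    prev_crc[tile_index] = crc
    write_to_dram(tile_index, pixels)
    return True
```

For largely static screen content (UI, video with still regions), many tiles repeat between frames, so the extra checksum work buys a large cut in external memory traffic and therefore power.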
59
Memory – transaction elimination
* http://blogs.arm.com/multimedia/780-how-low-can-you-go-building-low-power-low-bandwidth-arm-mali-gpus/
60
Conclusion
Power constrained micro-architecture is challenging long held design principles
Although processes will continue to scale, it may not be economically sound to keep increasing transistors on a chip
unless they can be put to good work
The multi-core era is already heterogeneous
more heterogeneity – not just in the compute
over-provisioning of transistors makes accelerators likely
Innovation to reduce power at all levels in the micro-architecture
compute, interconnect, software, silicon
61
Fin
Questions?
Always looking for good candidates http://arm.com/about/careers/index.php
ARM University program http://arm.com/support/university/index.php
Cortex-A programming guide* http://bit.ly/WGADUc
62
BACK-UP
63
FO4 delay
[Chart: FO4 delays per cycle vs. year, 1985-2010 (y-axis 0-140)]