1
Many-core back to the future
Matt Horsnell
ARM Research and Development
2
Outline
Introduction
How we got to multi-core
Focus on multi-core
Evolution to many-core?
Predicting the future microprocessor?
Why I believe it’s a great time to be a computer architect
Wrap-up
3
About me
MEng Computer Science, 2003
Final year project: multi-core processors (David May)
PhD Computer Science, 2007.
“A Chip Multi-Cluster architecture with locality aware task-distribution.”
Research Associate, 2008. Object-based Hardware Transactional Memory.
Software Engineer, 2010. Clock-tree synthesis and clock-concurrent optimization.
Research and Development
Architecture and micro-architecture.
4
ARM
The leading silicon IP company
Leading 32 bit RISC architecture
Leading Physical IP
~8Bn ARM chips shipped in 2011
Design and license CPUs, GPUs and associated IP
> 920 CPU licences
> 275 licensees
~1500 designs/year use ARM physical IP
2400 people world-wide >30 locations
LSE (FTSE100) and NASDAQ, >£10Bn
*Image from FastCompany’s 50 most innovative companies 2011 (#12)
5
ARM – Business model
Technology: energy-efficient chips
2-3 years to design a processor
2-3 designs introduced per year
Range of design points
Range of end-markets
Partners select a design
License fee to access design
3-4 years integrating into SoC
Royalty fee per chip
20+ years reuse
6
ARM - 1mm3 to 1km3
*NSF and University of Wisconsin-Madison
*University of Michigan
[Diagram: spectrum of ARM-based systems from 1mm3 to 1km3, 10c to $1000]
2002: Ubiquitous Environments
8.75mm3 system (University of Michigan): 2 x solar cells, 0.18µm CMOS near-threshold Cortex™-M3, 12µAh Li-ion battery; processor, SRAM and PMU
Cortex-M0; 65¢ (University of Michigan)
Examples: NXP Cortex-M0, Fujitsu Calypso Cortex-R4, Samsung Exynos Cortex-A15 (Samsung Galaxy S3, Google Nexus 10), ARM Cortex-A57
7
ARM - Partnership Momentum
920 Licenses, 300 Partners, ~50% Shipping
ARM is growing into new markets and product categories
ARM® Architecture is the number one by volume
2002 1 Billion Cumulative
2010 25+ Billion Cumulative
2020 150+ Billion Cumulative
8
ARM – Research and Development
Focus on technologies with impact 10+ years from now
the product pipeline is typically 7 years
Always interesting and challenging work
research ideas become products
we can influence the whole industry
9
The state of micro-architecture, or how we got to MULTI-CORE
10
Moore’s Law
doubling the number of transistors, economically placed on a chip, every 2 years
doubling performance every 2 years
[Chart: feature size vs. year, 1985-2010, scaling from 1.5 um down to 32 nm; Intel 4004: 2,300 transistors at 108 kHz; Intel Pentium: 3.1M transistors at 66 MHz; Intel Core i7: 1.4B transistors at 3 GHz]
*data sourced from http://cpudb.stanford.edu/
11
Dennard Scaling
1974 Robert Dennard at IBM “MOSFETs continue to function as voltage-controlled switches while all key figures of merit such as layout density, operating speed, and energy efficiency improve provided geometric dimensions, voltages, and doping concentrations are consistently scaled to maintain the same electric field.”
30% dimension shrink 50% area shrink
40% performance increase
maintain a 30% supply voltage drop
~50% power reduction
40% faster, 2x transistors, constant power (worked through in the sketch below)
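To make these bullets concrete, here is a minimal sketch of the classic Dennard arithmetic, assuming the textbook 0.7x linear shrink per generation (illustrative figures, not data from the slides):

```python
# Hedged sketch of classic Dennard scaling for one process generation.
# Assumes the canonical 0.7x linear shrink; numbers are illustrative only.
k = 0.7                      # linear dimension scale factor (30% shrink)

area    = k * k              # ~0.49 -> ~50% area shrink (2x transistors per area)
freq    = 1 / k              # gate delay scales with dimensions -> ~1.4x frequency
voltage = k                  # supply voltage scaled to keep the electric field constant
cap     = k                  # capacitance per device also shrinks

# Dynamic power per device: P = C * V^2 * f
power_per_device = cap * voltage**2 * freq        # ~0.49x -> "~50% power reduction"
power_density    = power_per_device / area        # ~1.0x  -> constant power density

print(f"area {area:.2f}x, freq {freq:.2f}x, "
      f"power/device {power_per_device:.2f}x, power density {power_density:.2f}x")
# -> area 0.49x, freq 1.43x, power/device 0.49x, power density 1.00x
```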
12
Frequency
[Chart: processor frequency (MHz, log scale) vs. year, 1985-2010: roughly a 100x increase]
*data sourced from http://cpudb.stanford.edu/
13
Performance vs. Dennard
[Chart: normalized performance (log scale, 1-10000) vs. feature size, 1.5 um to 32 nm; annotations mark a ~2 orders-of-magnitude gain against the <2 orders expected from Dennard scaling alone]
*data sourced from http://cpudb.stanford.edu/
14
Micro-architecture gains
On-die cache, pipelining
Super-scalar, OOO-speculative
Deep pipeline, replay, trace-cache
Back to non-deep pipeline
[Chart: performance increase (X) from micro-architecture per era, roughly in the 1-4x range]
* Borkar et. al, The future of the microprocessor, ACM Comms 2011.
15
Pollack’s rule
[Chart: normalized performance vs. normalized core area, both on log scales (1-1000 area, 1-100 performance): performance grows roughly as the square root of core area; see the sketch below]
*data sourced from http://cpudb.stanford.edu/
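As a quick illustration of Pollack's rule (single-core performance growing roughly with the square root of normalized core area), a minimal sketch with purely illustrative numbers:

```python
import math

def pollack_performance(norm_area: float) -> float:
    """Pollack's rule: single-core performance ~ sqrt(normalized core area)."""
    return math.sqrt(norm_area)

# Quadrupling core area roughly doubles performance; 100x area gives only ~10x.
for area in (1, 4, 16, 100):
    print(f"area {area:4d}x -> performance ~{pollack_performance(area):.1f}x")
```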
16
End of Dennard Scaling
Classic scaling ended at 130nm
Innovation needed to keep driving Moore’s Law
Materials and Lithography
Parameter (scale factor = a)    Classic Scaling   Current Scaling
Dimensions                      1/a               1/a
Voltage                         1/a               ~1
Current                         1/a               1/a
Capacitance (A/t)               1/a               >1/a
Power/Circuit (V·I)             1/a²              1/a
Power Density (V·I/A)           1                 a
Circuit Delay                   1/a               ~1
17
Power
infeasible to economically dissipate heat beyond 800mW/mm2
[Chart: power density (mW/mm², log scale) vs. year, 1985-2010, with "hot plate" and "nuclear reactor" reference lines]
*data sourced from http://cpudb.stanford.edu/
18
Memory
Typically 100s of cycles for a memory access to DRAM
Hidden to date by caches and bandwidth
DRAM density increases with Moore’s Law
Speed increases far slower
[Chart: CPU clocks per DRAM access latency vs. year, 1980-2010, rising from ~1 to several hundred]
* Borkar et. al, The future of the microprocessor, ACM Comms 2011.
19
Cache
energy concerns and inefficient μ-arch led to more cache for efficiency
cache sizes increased slowly
decreasing die area given to $
most transistors core μ-arch
[Charts: on-die cache size (KB, log scale) and on-die cache as a % of total die area (0-60%), across process nodes from 1um to 65nm]
* Borkar et. al, The future of the microprocessor, ACM Comms 2011.
20
Trend Summary
Process scaling continues to follow Moore’s Law
ITRS suggests 7nm in 2024, Intel 5nm ~2020
Power budget remains constant
Frequency
frequency increases stopped in 2005
20 years of frequency increases beyond Dennard scaling increased power
Micro-architectural techniques
ILP speculation increases power
practical pipelining limits: shrinking stages to ever fewer FO4 delays increases power
Design complexity
only ~√ performance gained for added complexity (Pollack's rule)
The memory wall remains, hence ever larger caches
21
A focus on MULTI-CORE
22
Multi-core
Performance increases drive the microprocessor industry
performance enables new applications
performance enables new markets
performance enables new form factors
Power fundamentally prevents more frequency scaling
performance must come from parallelism
ILP already exploited in single core microprocessors
Must exploit thread-level parallelism*
Multiple cores on the same die
gave the industry a new road map
Moore’s Law now applied to doubling cores every 2 years
23
MPCore Power
Single CPU
unused processors ‘turned off’
1CPU @ 260MHz, consumes ~160mW
Dual-CPU (same MHz, same Vt)
Same workload
single-threaded, concurrency from OS only
could lower MHz (V² power saving)
Lower power in dual-CPU at same MHz
Reduced context switching
Increase in cache effectiveness
With threaded code, MP offers more performance at lower MHz (see the sketch below)
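A rough sketch of the "V² power saving" argument, assuming dynamic power scales as C·V²·f and that supply voltage can be lowered roughly in line with frequency; the numbers are illustrative, not MPCore measurements:

```python
def dynamic_power(cores: int, freq: float, volt: float, cap: float = 1.0) -> float:
    """Relative dynamic power: P ~ n_cores * C * V^2 * f (all values normalized)."""
    return cores * cap * volt**2 * freq

# Single core at full frequency/voltage vs. two cores at ~60% frequency,
# assuming voltage can track frequency downwards (DVFS).
single = dynamic_power(cores=1, freq=1.0, volt=1.0)
dual   = dynamic_power(cores=2, freq=0.6, volt=0.7)

print(f"single-core: {single:.2f}, dual-core: {dual:.2f}")
# -> the dual-core pair burns ~0.59x the power for comparable aggregate
#    throughput, provided the workload is threaded enough to use both cores.
```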
24
ARM MPCORE
ARM roadmap introduced ARM11 MPCore 2003, shipped 2005.
ARM-Cortex A-class MPCore roadmap : A8, A9, A5, A15
[Diagram: two quad Cortex-A15 MPCore clusters, each with processor coherency (SCU) and up to 4MB L2 cache, connected over 128-bit AMBA 4 to the CoreLink CCI-400 Cache Coherent Interconnect; an MMU-400 System MMU attaches I/O-coherent devices]
25
Cortex-A15 MPCore
Cortex-A15 uses a similar MP model to the Cortex-A9/A5
Data side cache coherence maintained by Snoop Control Unit (SCU)
Two ACE master ports and optional ACP slave interface supporting coherent data transfers to non-cached external devices.
Integrated GIC Interrupt controller and timer-watchdog units
Tightly integrated L2 cache
[Diagram: quad-core cluster (labelled with Cortex-A5/A15 cores); each core contains STB, DPU, PFU, DCU, Main TLB, BIU, ICU, data/instruction µTLBs, ETM, v7 Debug, APB and CP15 blocks; the Snoop Control Unit (with duplicate tags, per-core timers, GIC and ACP) joins the cores to the integrated L2 cache and the AXI/ACE interfaces; IRQ[n] feeds the GIC]
26
Snoop Control Unit
Self-sizing intelligent block that 'joins' multiple CPUs together
Manages the impact of, and control over, support for the coherence protocol
Supports cache-to-cache transfers and monitors for migratory lines
Arbitrates multiple CPUs across 1 or 2 load-balanced system AXI buses
Manages adaptive power-down and interfaces with the system power controller
Decodes and manages private peripheral accesses
Runs at CPU frequency
27
New Capabilities in the Cortex-A15
Full compatibility with the Cortex-A9/A5/A8
Supporting the ARMv7 Architecture
Addition of Virtualization Extension (VE)
Run multiple OS binary instances simultaneously
Isolates multiple work environments and data
Supporting Large Physical Addressing Extensions (LPAE)
Ability to use up to 1TB of physical memory
With AMBA 4 System Coherency (AMBA-ACE)
Other cached devices can be coherent with processor
Many-core multiprocessor scalability
Basis of concurrent big.LITTLE Processing
28
Evolution to MANY-CORE?
29
Evolution to Many-Core
Base theorem
Simpler and smaller processor designs require far less energy to accomplish the same amount of compute than a more complex and larger processor design.
“Approximate rule of thumb” held within ARM
To increase performance 50% you double the power and area cost of the processor design
Quickly reaches the point of diminishing returns (see the sketch below)
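A small sketch of why this rule of thumb hits diminishing returns, taking the "2x power and area per +50% performance" figure at face value:

```python
# Illustrative compounding of the rule of thumb: each +50% single-thread
# performance step costs roughly 2x power and 2x area.
perf, power, area = 1.0, 1.0, 1.0
for step in range(1, 5):
    perf  *= 1.5
    power *= 2.0
    area  *= 2.0
    print(f"step {step}: perf {perf:5.2f}x  power {power:5.1f}x  "
          f"area {area:5.1f}x  perf/W {perf / power:.2f}")
# Energy efficiency (perf/W) drops by ~25% per step, which is why a cluster of
# simpler cores wins whenever the work can be parallelised.
```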
30
Micro-architecture trade-offs
[Chart: power vs. performance trade-off: high leakage power at low performance points, high switching power at high performance points; curves for a big core, a small core and an ideal high-dynamic-range core]
31
Strategy Focus: The Thermal Wall
SoC sustained power is limited in mobile devices by thermals:
1.5W to 2W with low-cost PoP and stacked memories
3W without stacked memories
[Chart: power vs. time, showing un-managed max power (@Tjmax), managed sustained power, bursts for responsiveness (e.g. browsing), sustained performance (e.g. HD video record, gaming), a power-optimised low end (e.g. e-mail, voice, MP3), and "opportunistic residency" while Tj < Tmax]
Responsiveness is a must
Complex active management is needed
32
Use Case
Typical day for heavy smartphone user
90 mins voice calls
60 mins email
30 mins reading web
30 mins watching HW-accelerated video
50 mins playing Angry Birds or similar
90 mins jogging, listening to MP3s and logging GPS coordinates
10 mins video recording/photo capture
7 hrs sleep with music alarm clock
OS typically executing ~28 active processes
background synchronizing
33
Use Case Measurements
34
Use Case Conclusion
35
Multiprocessing Capable Many-core Benefits
36
ARM’s “big” processor
Cortex-A15 Processor
announced September 2010
1-4 Core MP configurable
Dual-cores shipped Oct’12
Advanced Capabilities
Full ARMv7A architecture
Thumb®-2, Trustzone®, VFP, Neon
Virtualization, LPAE
AMBA® 4 ACE™ Coherency
High Performance
Up to 1.5GHz for mobile on 28nm
Shipped Oct '12: Samsung Exynos 5250 at 1.7GHz
Products: Nexus 10, Samsung Chromebook
37
ARM “LITTLE” Processor
Cortex-A7 Processor
announced October 2011
1-4 Core MP configurable
Same Advanced Capabilities
Full ARMv7A architecture
Thumb®-2, Trustzone®, VFP, Neon
Virtualization, LPAE
AMBA® 4 ACE™ Coherency
ISA identical to Cortex-A15
High Performance
Up to 1.2GHz in mobile
Cortex-A7 & Cortex-A15 test chip taped out in Q4 '11
Performance efficiency exceeded expectations
38
Comparison of big.LITTLE Pipelines
Cortex-A7 Pipeline
Focused on energy efficiency
8-11 stages, in-order, limited dual-issue
Cortex-A15 Pipeline
Focused on efficient peak performance
15+ stages, out-of-order, multi-issue
39
Size Matters
Large silicon area costs:
Less die per wafer
Higher yield impact from silicon imperfections
Higher leakage power whenever power applied
Typically contains more transistors, raising dynamic switching power
Not necessarily providing increased performance if gates are required to support architectural complexity rather than instruction execution
Single-core Cortex-A7 (incl. NEON, FPU, 32kB L1): 0.45 mm² in 28nm, shown against a device with comparable performance
ARM's LITTLE processor
40
Performance Comparison
41
Power Efficiency Comparison
42
Extending DVFS
DVFS sweep over entire operational voltage range of ARM’s first big.LITTLE processor pair
43
Software Use Models
big.LITTLE switching – one CPU active
switch between A15 and A7 depending on performance requirements
big.LITTLE MP – both CPUs can be active (see the affinity sketch below)
allocate threads that need high performance to the A15
allocate threads that don't need high performance, but benefit from the best energy efficiency, to the A7
AMBA 4 hardware coherency between A15 and A7
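A hedged illustration of the big.LITTLE MP allocation idea using Linux CPU affinity from user space; the CPU numbering (0-3 = Cortex-A7, 4-7 = Cortex-A15) is an assumption made only for this sketch, and real deployments rely on the kernel scheduler rather than manual pinning:

```python
import os

# Assumed topology for this sketch only: LITTLE (Cortex-A7) cores 0-3,
# big (Cortex-A15) cores 4-7. Real platforms expose the topology via
# /sys/devices/system/cpu/ and normally let the scheduler decide.
LITTLE_CPUS = {0, 1, 2, 3}
BIG_CPUS = {4, 5, 6, 7}

def run_on(cpus):
    """Restrict the calling process/thread to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)

# A latency-critical phase gets the big cluster...
run_on(BIG_CPUS)
# ...while background work is happy on the energy-efficient LITTLE cluster.
run_on(LITTLE_CPUS)
```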
44
Predicting the future MICROPROCESSOR
45
A word on prediction…
1989 – Microprocessors circa 2000 – IEEE Spectrum, Gelsinger, P., Intel.
2000 50M transistors, 250MHz
no. of transistors 20% over, frequency 2x under, performance 4-8x under
1996 – The future of micro-processors – IEEE Micro, Yu, A., Intel.
2006 350M transistors, 4GHz
no. of transistors 2x over, frequency 5% over
predicted the power wall without breakthrough voltage scaling
understandably missed the trend to mobile computing
2005 – The future of micro-processors – ACM Queue, Olukotun et. al, Stanford.
no predictions just a discussion of CMPs
2011 – The future of micro-processors – Comms. ACM, Borkar et. al, Intel.
Moore’s law continues
μ-arch goes beyond homogeneous parallelism, exploit heterogeneity, exploit custom logic
software must be able to take advantage of it
Various – by 2010-40 we’ll have 100, 1000, 10,000, 100,000 cores
46
Evolution of Mobile Performance
[Diagram: cumulative evolution of mobile performance: MHz → Architecture → U-Architecture → MP → Multi-Performance → GPGPU → Heterogeneity / Domain-Specific Off-load (ARM CPUs, GPU, HW accelerators) → Cloud → Future?]
47
Near Future Smartphone
[Diagram: near-future smartphone system built from an applications subsystem, graphics & composition subsystem, image processing/recognition, H.265 video decode & encode, 5G wireless baseband, packet processing subsystem, display, power manager and memory interface; headline figures on the slide: 22MP imaging, 4K external display, native 1080p display at 120fps, 240fps capture, 4K video, 5G wireless at peak 2.5Gb/s and average 400Mb/s, further link peaks of 10Gb/s and 1Gb/s, 16GB memory at 100GB/s, 512GB storage, 13Wh battery]
48
System scaling
           Today        Near Future   Increase   Notes
Cellular   20 Mbps      400 Mbps      20X
Wifi       300 Mbps     10 Gbps       30X
Display    720P         4K            17X        assumes constant complexity
Video      720P H.264   4K H.265      34X-102X   H.265 is 2-4X more complex
Battery    5.7 W        13 W          2.2X
Compute    ?            ?             ??X

Input: 20-30X increase; Output: 17X - 68X increase
How does compute scale in a power and thermal constrained environment?
49
Dark Silicon

                                        45nm (2008)   22nm (2014)   11nm (2020)
Area^-1                                 1             4             16
Peak freq                               1             1.6           2.4
Power                                   1             1             0.6
Exploitable Si (in 45nm power budget)   —             (4 x 1)^-1 = 25%   (16 x 0.6)^-1 = 10%

Source: ITRS 2008
Lack of power scaling severely limits the complexity of systems! (the arithmetic is sketched below)
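The "exploitable silicon" row works out as follows; a minimal sketch using the ITRS 2008 factors from the table above:

```python
# Exploitable silicon under a fixed (45nm) power budget: transistor density
# rises (Area^-1) while per-device power only partly falls, so the fraction
# of the chip that can run at full speed shrinks - the rest is dark silicon.
nodes = {
    "45nm (2008)": {"density": 1,  "power": 1.0},
    "22nm (2014)": {"density": 4,  "power": 1.0},
    "11nm (2020)": {"density": 16, "power": 0.6},
}

for node, f in nodes.items():
    exploitable = 1.0 / (f["density"] * f["power"])
    print(f"{node}: exploitable Si ~{exploitable:.0%}")
# -> 100%, 25%, ~10%
```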
50
The Many-core wall
[Slide reproduces the first page of Esmaeilzadeh et al., ISCA'11, including "Figure 1: Overview of the models and the methodology". The study combines a device scaling model (DevM), a core scaling model (CorM, Pareto frontiers of area/performance and power/performance fitted from SPECmark data) and a multicore scaling model (CmpM, covering CPU-like and GPU-like designs in symmetric, asymmetric, dynamic and composed topologies), then searches the design space under area, power and PARSEC benchmark constraints. Key findings: over five technology generations only a 7.9x average speedup is possible using ITRS scaling; at 22nm ~21% of the chip will be dark and at 8nm over 50% will go unused; neither CPU-like nor GPU-like multicores reach the expected speedups, so radical microarchitectural innovation is needed to keep pace with Moore's Law.]
* Esmaeilzadeh et al., ISCA'11
51
The Many-core wall?
Esmaeilzadeh et al. predict the end of multi-core scaling at the 16nm node – as early as 2014
results shown under both ITRS scaling and conservative scaling assumptions
* Esmaeilzadeh et al., ISCA'11
52
Increasing Heterogeneity
big.LITTLE
How far can the specialization of micro-architecture improve energy efficiency within a common instruction set architecture?
GP/GPU
Exposing the compute capability of GPU through a general purpose language
OpenCL available on mobile parts (Samsung Chromebook, Nexus 10)
HSA foundation
My biggest question....
How can the benefits of homogeneity in a programming environment be maintained with this increasing heterogeneity?
53
ASIC vs. General Purpose
ASIC efficiency far greater than general-purpose processors [Hameed et al, ISCA'10]
100-1000x more energy efficient
50x more performance
Difficult to identify targets in the general case
beyond the obvious candidates (audio, video, crypto, packet)
what granularity to target?
QsCores – target multiple general-purpose computations [Venkatesh et. al, MICRO'11]
Integration
how to offload, how to handle contention
static compile time target, or dynamic
raises the software bar even higher…
54
a word on software…
Finding thread level parallelism is hard.
Performance portability
ensuring optimal performance becomes exponentially harder
does this finally mandate runtimes and virtual machines?
* Blake et. al, ISCA'10
55
and there’s more…
Fault tolerance
how to design for and overcome hard or soft faults
Leakage avoidance
system designed for power off
software designed to enable more power off
Bandwidth
feeding more and more cores requires huge off-chip bandwidth
power and latency concerns
56
Why I believe it's an interesting time to be a COMPUTER ARCHITECT
57
Energy proportional computing
Reality is a finite (fixed) energy budget
need to re-evaluate architecture and implementation
90/10 rule becomes 10x10
special purpose function acceleration
Energy proportional computing becomes the goal
in the near term multi-core will likely become many-core
many-core will certainly be heterogeneous
identifying accelerators becomes necessary
Software agnostic
performance portable, interfaces, parallel code
A new era of rapid dynamics within computer architecture: "solutions" will change at a much quicker tempo
58
Something cool – transaction elimination
Bandwidth to memory = Power
Loosely speaking, the power budget for a mobile GPU is ~1W
150pJ to read/write a byte from memory
2x32 LPDDR2 peaks at 4-8GB/s
150 pJ/byte × 8 GB/s ≈ 1.2 W
Mali-T604 transaction elimination
compute a checksum/hash for each completed tile
write out tile only if checksum changes
trades off extra compute for a reduction in bandwidth (see the sketch below)
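A minimal software sketch of the idea; the real Mali-T604 mechanism is implemented in hardware, and the tile size and CRC choice here are assumptions made only for illustration:

```python
import zlib

TILE_BYTES = 16 * 16 * 4      # assumed 16x16 RGBA tile, for illustration only
prev_crc = {}                 # last written checksum per tile index

def flush_tile(tile_index, pixels, write_to_dram):
    """Write the tile to memory only if its contents changed since last frame."""
    crc = zlib.crc32(pixels)
    if prev_crc.get(tile_index) == crc:
        return False          # eliminated: identical tile, no DRAM traffic
    prev_crc[tile_index] = crc
    write_to_dram(tile_index, pixels)
    return True
```

For largely static screen content (UI, video with still regions), many tiles repeat between frames, so the extra checksum work buys a large cut in external memory traffic and therefore power.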
59
Memory – transaction elimination
* http://blogs.arm.com/multimedia/780-how-low-can-you-go-building-low-power-low-bandwidth-arm-mali-gpus/
60
Conclusion
Power constrained micro-architecture is challenging long held design principles
Although processes will continue to scale, it may not be economically sound to keep increasing transistors on a chip
unless they can be put to good work
The multi-core era is already heterogeneous
more heterogeneity – not just in the compute
over-provisioning of transistors makes accelerators likely
Innovation to reduce power at all levels in the micro-architecture
compute, interconnect, software, silicon
61
Fin
Questions?
Always looking for good candidates http://arm.com/about/careers/index.php
ARM University program http://arm.com/support/university/index.php
Cortex-A programming guide* http://bit.ly/WGADUc
62
BACK-UP
63
FO4 delay
[Chart: FO4 delays per cycle vs. year, 1985-2010 (y-axis 0-140)]