Memory Centric High Performance Computing Panel
Jonathan Beard
11 November 2018
© 2018 Arm Limited
Trends
End of Frequency Scaling
End of Moore’s Law?
Power keeps going up
More cores/socket
data source: https://goo.gl/bb6wZW
Specialization drives performance… but at a cost
Adapted/modified from original figure courtesy of Dilip Vasudevan (LBL)
[Figure: Towards Extreme Heterogeneity. A timeline of system architectures, from 5 years ago through today and 0-3 years out to 3-10 years out: homogeneous CPUs on a memory interface and bus; then CPUs plus a GPU/DSP; then CPUs, a GPU/DSP, and several fixed-function accelerators (ACC); and finally CPUs with L1 caches alongside CMOS accelerators, NVM, a PoP NVM interface, a cnFET accelerator, PCIe, stacked SRAM, and active memory. Today's mobile/client parts already sit partway along this curve, and the cost to build/adopt/run grows from low through high toward extreme as heterogeneity increases.]
More cores
[Scatter plot: average cores per socket versus year, 1980-2010, y-axis roughly 1 to 100 on a log scale.]
The sunk cost
• More wires
The preventable cost
• Data movement
• Programmer burden
data source: https://goo.gl/bb6wZW
Energy of Data Movement
Cost of obtaining 8 B from DRAM for a single DP FLOP
[Bar chart, picojoules per operation, for: DP FLOP, register, L1D cache access, L2 cache access, 20 mm of interconnect, DDR4. Cumulative energy: 20, 22, 132, 388, 888, 2088 pJ; near-memory accelerator opportunity: 20, 22, 132, 132, 132, 1442 pJ.]
>30% energy savings per operation if caches can be cut for some applications
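Taking the chart's totals at face value, and assuming the 1442 pJ series is the end-to-end near-memory path while 2088 pJ is the path through the full cache hierarchy to DDR4, the saving works out to 1 − 1442/2088 ≈ 0.31, consistent with the >30% figure above.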
Relativity
Locality is, from a certain point of view.
[Figure: a core with its L1, L2, and L3 caches on one side and a processing element (PE) in storage-class memory (SCM*) on the other; Data X sits in SCM, a long distance from the core but a short distance from the PE.]
Observation: manipulating distance by moving the processor can expand locality and decrease latency for some workloads.
*Storage Class Memory - NVM
Relativity
Locality is, from a certain point of view.
[Figure: the same core/L1/L2/L3 versus PE-in-SCM diagram, now annotated with access latencies to Data X.]
Observation: manipulating distance by moving the processor can expand locality and decrease latency for some workloads.
184 cycles from the core versus 100 cycles from the near-memory PE: a ~45% reduction in latency if there is no reuse.
*Storage Class Memory - NVM
Relativity
Application examples where reuse doesn't matter
[Figure: Bandwidth Scaling. Achieved bandwidth (MB/s, 0 to 5000) for host cores versus near-HBM tiny cores, at 1, 2, and 4 threads, on four benchmarks: linked-list, B-tree, hash table, and degree centrality.]
[Figure: gem5 simulation configuration. Big out-of-order cores with L1/L2 caches on a coherent network with a shared L3 and a DDR3 memory controller; HBM with tiny cores 1-4 attached through a crossbar to the memory controllers.]
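The four benchmarks above are dominated by pointer chasing with essentially no reuse: each element is visited once, so caches mostly add latency and energy rather than bandwidth. A minimal sketch of that access pattern (illustrative C, not the benchmark code used in the gem5 runs):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Each node occupies a full 64-byte cache line, but a traversal only ever
 * touches the first 16 bytes of it.                                        */
struct node {
    struct node *next;     /* 8 B: pointer followed to the next node        */
    uint64_t     key;      /* 8 B: the only payload the walk consumes       */
    char         pad[48];  /* 48 B: fetched with the line, never used       */
};

/* Dependent-load chain: every step waits for the previous miss, and no line
 * is touched twice, so there is nothing for a cache to reuse.              */
static uint64_t list_walk(const struct node *head)
{
    uint64_t sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->key;
    return sum;
}

int main(void)
{
    enum { N = 1000 };
    struct node *nodes = calloc(N, sizeof *nodes);
    if (!nodes) return 1;
    for (int i = 0; i < N; ++i) {
        nodes[i].key  = (uint64_t)i;
        nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
    }
    printf("sum = %llu\n", (unsigned long long)list_walk(&nodes[0]));
    free(nodes);
    return 0;
}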
Memory Access Latency
Processor-to-memory speed (clock cycles) ratio
[Figure: parallelism required to maintain the same time to solution, plotted against memory latency relative to the core.]
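One way to read this chart is through Little's law: to hold time to solution constant as memory latency (measured in core clock cycles) grows, the amount of memory-level parallelism in flight must grow in proportion, roughly

    parallelism needed ≈ latency (core cycles) × request rate (requests per cycle)

so a tenfold increase in relative latency demands roughly tenfold more outstanding operations.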
Scalable parallelism
[Figure: code running on a CPU, connected over the bus to CMOS accelerators 1 and 2 sitting with NVM and a cache; one side of the bus is annotated "thousands of mini-threads" and the other "tens of mini-threads".]
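If the intent is that the host sustains only tens of threads while near-memory engines can absorb thousands of tiny, independent tasks, the decomposition might look like the hedged sketch below, where acc_enqueue stands in for a hypothetical near-memory offload call (here it simply runs the task inline so the sketch is self-contained):

#include <stdio.h>
#include <stddef.h>

enum { N_MINI_TASKS = 4096 };   /* thousands of mini-threads' worth of work */

/* Hypothetical offload hook: on a system like the one sketched above this
 * would hand a tiny unit of work to a thread sitting next to the memory.   */
static void acc_enqueue(void (*task)(size_t), size_t arg)
{
    task(arg);   /* stub: execute inline on the host */
}

/* Each mini-task is small and independent: one bucket, one chain, one
 * subtree. Independence is what lets thousands of them run at once.        */
static void mini_task(size_t id)
{
    (void)id;    /* placeholder body */
}

int main(void)
{
    for (size_t i = 0; i < N_MINI_TASKS; ++i)
        acc_enqueue(mini_task, i);
    printf("enqueued %d mini-tasks\n", N_MINI_TASKS);
    return 0;
}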
There's no such thing as processing in memory
Challenge: breaking the semantic barrier
The biggest problem with processing in memory is that we call it processing in memory.
These are just accelerators…
Interface standards
Explosion of interfaces… can we make them boring?
[Figure: a CPU running code, with an L1 cache, connected over the bus to CMOS accelerators 1, 2, and 3 and NVM; a single common function interface fronts all accelerators.]
Unlock innovation on both sides of the interface! Minimize software disruption, maximize innovation pace.
Challenge
• Make it easy to build accelerators while making it easy to program / debug them
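What a "boring" common interface might look like in software: one descriptor format and a tiny set of entry points that every accelerator implements, so adding a new device does not mean a new driver and compiler stack. This is only a sketch of the idea; the names and fields are assumptions, not an existing Arm or industry standard:

#include <stdint.h>
#include <stddef.h>

/* One descriptor format shared by every accelerator behind the interface. */
struct acc_desc {
    uint32_t opcode;   /* which accelerator function to run                 */
    uint32_t flags;    /* e.g. coherent vs. non-coherent completion         */
    uint64_t src;      /* virtual address of the input                      */
    uint64_t dst;      /* virtual address of the output                     */
    uint64_t len;      /* bytes to process                                  */
    uint64_t tag;      /* completion handle returned to software            */
};

/* Every accelerator, CMOS or otherwise, exposes the same three entry
 * points; software targets the interface, hardware innovates behind it.   */
struct acc_ops {
    int (*submit)(void *dev, const struct acc_desc *d);   /* enqueue work   */
    int (*poll)  (void *dev, uint64_t tag);               /* completed yet? */
    int (*fence) (void *dev);                             /* order / flush  */
};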
Interface standards
[Figure: code running on a CPU attached to the bus; data moves among L1, L2, L3, the memory interface, and SCM, annotated with the time to offload and the question of how long the data will stay put.]
Data never rests.
Challenge
• Locality-based targeting: weigh the time to offload against how long the data will stay put (see the decision sketch below).
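A locality-based targeting policy could be as simple as comparing where the data lives now, how long it is expected to stay there, and how long the offload itself takes. A hedged sketch of such a heuristic; the enum, the parameters, and the decision rule are all illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Where does the data currently live? (illustrative levels) */
enum data_home { HOME_L1, HOME_L2, HOME_L3, HOME_SCM };

/* Offload to a near-memory engine only when the data is already far away
 * AND will not be pulled back by the host soon, so shipping the task is
 * cheaper than shipping the data.                                          */
bool should_offload(enum data_home home,
                    uint64_t expected_residency_ns,  /* how long the data stays put */
                    uint64_t offload_cost_ns)        /* time to ship + launch task  */
{
    if (home == HOME_L1 || home == HOME_L2)  /* data is already local: use the core */
        return false;
    return expected_residency_ns > offload_cost_ns;
}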
Lost in translation
We need virtualization!
[Figure: a virtually addressed task is handed over the bus, through an IOMMU, to CMOS accelerators 1, 2, and 3 sitting with NVM and an L1 cache.]
This works for a few coarse-grained accelerators.
Imagine a processor of 64 cores sharing only a few of these.
The IOMMU is effectively a TLB shared across many high-throughput cores. Is this a problem?
Challenge
• Increase translation reach for all cores
• Give all cores access to robust virtualization infrastructure
Data Movement
We solved one problem, but created another…
[Figure: Data X starts in L3 at time 0, moves to L2 by time 5, and is loaded into L1 by the core at time 7; at each step the full cache line moves, with most bytes wasted and only a few used.]
[Chart: L1D cache line utilized versus cache line wasted, by application: GUPS 80%, CoMD 50%, mcb 40%, LULESH 20%, DGEMM 10%.]
Linked list: 12.5% utilization. Given just L1/L2 traffic, we moved an extra 112 bytes! (8 useful bytes out of each 64-byte line is 12.5%; dragging the line through both L2 and L1 moves 2 × 56 = 112 bytes that are never used.)
Challenge
• Data layout transformation is critical (a sketch follows below)
• Where to transform, how to program, and how best to virtualize are the biggest questions…
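One concrete layout transformation behind the first bullet: if a kernel touches only one 8-byte field of a 64-byte record, packing that field densely turns a 12.5%-utilized cache line into a fully utilized one. A sketch under that assumption; the record type and function are illustrative:

#include <stdlib.h>
#include <stddef.h>

/* Array-of-structures: a traversal that only reads `key` still drags the
 * whole 64-byte element through L2 and L1.                                 */
struct particle { double key; double x, y, z, vx, vy, vz, m; };

/* Structure-of-arrays view of the same data: a 64-byte line now carries
 * 8 useful keys instead of 1.                                              */
double *pack_keys(const struct particle *p, size_t n)
{
    double *keys = malloc(n * sizeof *keys);
    if (!keys) return NULL;
    for (size_t i = 0; i < n; ++i)
        keys[i] = p[i].key;   /* gather: the step worth doing near memory   */
    return keys;
}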
Keeping dark bandwidth coherent
Hmmm… maybe there's a better way.
[Figure: four CPUs and CMOS accelerators 1, 2, and 3, with NVM and an L1 cache, share a bus; every transfer involves a request broadcast / ack / send / ack / complete coherence exchange.]
• Dark Bandwidth blows up transfers: moving one cache line actually moves far more data than that!
• With every move comes coherence traffic, often lots of it.
• Synchronization also takes time…
[Figure: the same four CPUs and accelerators, now communicating through a queueing accelerator on the bus using push and stash/pop operations instead of the coherence exchange.]
Challenge
• Improve the performance and transparency of communications between all processing elements.
• Do we really need coherence? (Likely not always.)
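The queueing-accelerator idea replaces implicit coherence traffic with explicit, one-way transfers: a producer pushes, a consumer pops, and only the payload moves. A minimal single-threaded sketch of those semantics (a real implementation would be hardware-assisted and concurrent, with the appropriate memory ordering; names are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define QCAP 1024                       /* power-of-two ring capacity */

struct spsc_queue {
    uint64_t buf[QCAP];
    uint64_t head, tail;                /* consumer advances head, producer advances tail */
};

/* push: the only data crossing the bus is the 8-byte payload, not a
 * broadcast / ack / send / ack / complete coherence exchange.              */
static bool q_push(struct spsc_queue *q, uint64_t v)
{
    if (q->tail - q->head == QCAP) return false;   /* full  */
    q->buf[q->tail % QCAP] = v;
    q->tail++;
    return true;
}

static bool q_pop(struct spsc_queue *q, uint64_t *v)
{
    if (q->tail == q->head) return false;          /* empty */
    *v = q->buf[q->head % QCAP];
    q->head++;
    return true;
}

int main(void)
{
    static struct spsc_queue q;         /* zero-initialized ring */
    uint64_t v;
    q_push(&q, 42);
    return (q_pop(&q, &v) && v == 42) ? 0 : 1;
}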
System Architecture
How to make it boring
[Figure: CMOS accelerators 1, 2, and 3, with NVM and an L1 cache, sit behind a queueing accelerator on the bus; communication is accelerated asynchronous dataflow via stash/pop.]
[Figure, continued: code on a CPU feeds a task queue / scheduler and a programmable gather/scatter engine, annotated with thousands of mini-threads.]
Accelerators virtualized, with near-bare-metal performance.
Easy programming using both dataflow and standard procedural styles within the same system.
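As one reading of "both styles in the same system": stages written in a dataflow style (consume a stream, produce a stream) can be wired together by ordinary procedural host code, and a scheduler like the one sketched above could later map those stages onto near-memory accelerators. A minimal sketch under that assumption; the pipeline and names are illustrative:

#include <stdio.h>
#include <stdint.h>

#define N 8

/* Dataflow style: each stage only consumes its input and produces its
 * output; it knows nothing about where it runs.                            */
static void stage_square(const uint64_t *in, uint64_t *out, int n)
{
    for (int i = 0; i < n; ++i) out[i] = in[i] * in[i];
}

static void stage_sum(const uint64_t *in, int n, uint64_t *total)
{
    *total = 0;
    for (int i = 0; i < n; ++i) *total += in[i];
}

int main(void)
{
    /* Procedural style: the host simply wires the stages together; a task
     * queue / scheduler could instead stream data between them.            */
    uint64_t a[N], b[N], total;
    for (int i = 0; i < N; ++i) a[i] = (uint64_t)i;
    stage_square(a, b, N);
    stage_sum(b, N, &total);
    printf("sum of squares 0..%d = %llu\n", N - 1, (unsigned long long)total);
    return 0;
}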
Grand Challenges
[Figure: the 3-10 year extreme-heterogeneity system again: CPUs with L1 caches, CMOS accelerators 1, 2, and 3, NVM, a PoP NVM interface, a non-CMOS accelerator, PCIe, stacked SRAM, and active memory on the bus; inset, a simpler node with a CPU, bus, CMOS accelerators 1-3, NVM, and an L1 cache.]
1: Efficiency of data movement: logic is cheap, movement is expensive. Future systems must capitalize on both data with reuse and streaming data; Dark Bandwidth must be avoided.
2: Multiple drivers / compilers / software stacks are multi-million-dollar efforts for each vendor. Will developers even adopt? Reducing cost is a huge disruptor!
3: Communications / scalability of cores is not good with current coherence methods, but specialization and more cores are the post-Moore future.
4: Virtualization and translation for accelerators are an afterthought at the moment; extending the virtual memory model eases programming, but can we do more?