Page 1: DNN Accelerator Architectures

1

DNN Accelerator Architectures

ISCA Tutorial (2019). Website: http://eyeriss.mit.edu/tutorial.html

Joel Emer, Vivienne Sze, Yu-Hsin Chen

Page 2: DNN Accelerator Architectures

2

Highly-Parallel Compute Paradigms

Temporal Architecture (SIMD/SIMT)

Spatial Architecture (Dataflow Processing)

[Figure: Temporal architecture: centralized Control, Register File, and Memory Hierarchy feeding a 4×4 array of ALUs. Spatial architecture: a 4×4 array of ALUs under the Memory Hierarchy, with direct ALU-to-ALU dataflow.]

Page 3: DNN Accelerator Architectures

3

Memory Access is the Bottleneck

[Figure: each MAC* requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum)]

* multiply-and-accumulate

Page 4: DNN Accelerator Architectures

4

Memory Access is the Bottleneck

[Figure: the same MAC datapath, with every read and write going to DRAM]

* multiply-and-accumulate

Worst Case: all memory reads and writes are DRAM accesses

• Example: AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required (4 DRAM accesses per MAC: 3 reads + 1 write)
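As a sanity check on the slide's numbers, here is a minimal C snippet (my own illustration): the factor of 4 comes from the MAC datapath above, which needs 3 reads and 1 write per MAC.

#include <stdio.h>

int main(void) {
    // Each MAC performs 3 DRAM reads (filter weight, fmap activation,
    // partial sum) and 1 DRAM write (updated partial sum).
    const long macs = 724000000L;      // AlexNet MACs [NIPS 2012]
    const long per_mac = 3 + 1;        // 3 reads + 1 write
    printf("worst-case DRAM accesses: %ldM\n",
           macs * per_mac / 1000000);  // prints 2896M
    return 0;
}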

Page 5: DNN Accelerator Architectures

5

Leverage Local Memory for Data Reuse

[Figure: local memory (Mem) inserted between DRAM and the ALU on both the MAC read and write paths]

Extra levels of local memory hierarchy: smaller, but faster and more energy-efficient

Page 6: DNN Accelerator Architectures

6

Types of Data Reuse in DNN

Convolutional Reuse
- CONV layers only (sliding window)
- Reuse: activations and filter weights

[Figure: a filter sliding over an input fmap]

Page 7: DNN Accelerator Architectures

7

Types of Data Reuse in DNN

Convolutional Reuse
- CONV layers only (sliding window)
- Reuse: activations and filter weights

Fmap Reuse
- CONV and FC layers
- Reuse: activations

[Figure: filters 1 and 2 applied to the same input fmap]

Page 8: DNN Accelerator Architectures

8

Types of Data Reuse in DNN

Convolutional Reuse
- CONV layers only (sliding window)
- Reuse: activations and filter weights

Fmap Reuse
- CONV and FC layers
- Reuse: activations

Filter Reuse
- CONV and FC layers (batch size > 1)
- Reuse: filter weights

[Figure: the same filter applied to input fmaps 1 and 2]

Page 9: DNN Accelerator Architectures

9

Types of Data Reuse in DNN

(Recap of the three reuse types above: convolutional, fmap, and filter reuse.)

If all data reuse is exploited, DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case)

Page 10: DNN Accelerator Architectures

10

Leverage Parallelism for Higher Performance

[Figure: multiple ALUs, each with local memory (Mem), performing MACs in parallel between the DRAM read and write paths]

Page 11: DNN Accelerator Architectures

11

Leverage Parallelism for Spatial Data Reuse

[Figure: the same parallel ALUs, now also passing data directly from ALU to ALU for spatial reuse]

Page 12: DNN Accelerator Architectures

12

Spatial Architecture for DNN

[Figure: DRAM feeding a Global Buffer (100s – 1000s kB), which feeds a 4×4 array of Processing Elements (PEs); each PE has an ALU with a Reg File (1 – 10 kB) and local control]

On-Chip Network (NoC):
• Global Buffer to PE
• PE to PE

Page 13: DNN Accelerator Architectures

13

Multi-Level Low-Cost Data Access

The data to run a MAC at the ALU can be fetched from any level of the hierarchy: DRAM, the Global Buffer (100s – 1000s kB), a neighboring PE over the NoC (100s – 1000s PEs), or the local RF (1s – 10s kB).

Normalized Energy Cost*

  RF     → ALU   1× (Reference)
  PE     → ALU   2×
  Buffer → ALU   6×
  DRAM   → ALU   200×

* measured from a commercial 65nm process

Page 14: DNN Accelerator Architectures

14

Multi-Level Low-Cost Data Access

(Same normalized energy costs as the previous slide: RF 1×, PE 2×, Buffer 6×, DRAM 200×.)

A Dataflow is required to maximally exploit data reuse with the low-cost memory hierarchy and parallelism
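To make the cost gap concrete, here is a minimal C sketch (the access counts are hypothetical; the per-access costs are the normalized 65nm numbers from the table) that weights accesses at each level by their energy cost:

#include <stdio.h>

// Normalized energy per access (from the table above).
#define COST_RF     1.0   // local register file
#define COST_NOC    2.0   // neighboring PE over the NoC
#define COST_BUF    6.0   // global buffer
#define COST_DRAM 200.0   // off-chip DRAM

int main(void) {
    // Hypothetical access counts for some mapping of a layer.
    long rf = 1000000000L, noc = 50000000L, buf = 20000000L, dram = 1000000L;
    double e = rf*COST_RF + noc*COST_NOC + buf*COST_BUF + dram*COST_DRAM;
    // Here DRAM is <0.1% of the accesses but ~14% of the energy,
    // which is why dataflows work to keep data out of DRAM.
    printf("normalized energy: %.2e (DRAM share: %.1f%%)\n",
           e, 100.0 * dram * COST_DRAM / e);
    return 0;
}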

Page 15: DNN Accelerator Architectures

15

Dataflow Taxonomy

• Output Stationary (OS)

• Weight Stationary (WS)

• Input Stationary (IS)

[Chen et al., ISCA 2016]

Page 16: DNN Accelerator Architectures

16

Output Stationary (OS)

• Minimize partial sum R/W energy consumption
  − maximize local accumulation

• Broadcast/multicast filter weights and reuse activations spatially across the PE array

[Figure: Global Buffer feeding a row of PEs; activations and weights stream in while psums P0–P7 stay stationary, one per PE]
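A minimal functional sketch of one cycle of this row (my illustration; N_PE and os_step are not from the slide): the psums stay resident in the PEs, a weight is broadcast to all PEs, and each PE receives the activation its output needs.

#define N_PE 8   // one PE per stationary psum P0–P7 in the figure

void os_step(int P[N_PE], const int act[N_PE], int weight) {
    for (int p = 0; p < N_PE; p++)   // spatial: all PEs fire together
        P[p] += act[p] * weight;     // psum accumulates in place
}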

Page 17: DNN Accelerator Architectures

17

OS Example: ShiDianNao

[Figure: top-level architecture and PE architecture; weights and activations stream in while psums stay local to the PEs]

[Du et al., ISCA 2015]

Page 18: DNN Accelerator Architectures

18

OS Example: ENVISION

[Figure: architecture; weights and activations stream into the array]

[Moons et al., VLSI 2016, ISSCC 2017]

Page 19: DNN Accelerator Architectures

19

Variants of Output Stationary

          # Output Channels   # Output Activations   Notes
  OSA     Single              Multiple               Targeting CONV layers
  OSB     Multiple            Multiple
  OSC     Multiple            Single                 Targeting FC layers

[Figure: the parallel output region of each variant within the E × F × M output volume]

Page 20: DNN Accelerator Architectures

20

OS Example

[Figure: 8 filters (each 2×2×2) and a 3×3×2 input fmap produce a 2×2×8 output fmap; filter 1 is overlaid on the input fmap]

Page 21: DNN Accelerator Architectures

21

OS Example

Cycle through input fmap and weights (hold psum of output fmap)

[Figure: the first filter position produces an incomplete partial sum for the first output]

Page 22: DNN Accelerator Architectures

22

OS Example

Cycle through input fmap and weights (hold psum of output fmap)

[Figure: the filter window steps to the next input position, accumulating into the same output psum]

Page 23: DNN Accelerator Architectures

23

OS Example

Cycle through input fmap and weights (hold psum of output fmap)

[Figure: accumulation continues across the filter window]

Page 24: DNN Accelerator Architectures

24

OS Example

Cycle through input fmap and weights (hold psum of output fmap)

[Figure: accumulation continues]

Page 25: DNN Accelerator Architectures

25

OS Example

Cycle through input fmap and weights (hold psum of output fmap)

[Figure: accumulation continues]

Page 26: DNN Accelerator Architectures

26

OS Example

Cycle through input fmap and weights (hold psum of output fmap)

[Figure: the next output position begins as a new incomplete partial sum]

Page 27: DNN Accelerator Architectures

27

1-D Convolution – Output Stationary

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

for (e = 0; e < E; e++)
  for (r = 0; r < R; r++)
    O[e] += I[e+r] * W[r];

† Assuming: ‘valid’ style convolution

No constraints on loop permutations!
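For reference, a self-contained, runnable C version of this loop nest (the sizes H = 12, R = 4 are borrowed from the reference-pattern slide later in the deck; the data values are arbitrary):

#include <stdio.h>

#define H 12            // input size
#define R 4             // filter size
#define E (H - R + 1)   // output size, 'valid' convolution

int main(void) {
    int I[H], W[R], O[E] = {0};
    for (int h = 0; h < H; h++) I[h] = h;   // example inputs
    for (int r = 0; r < R; r++) W[r] = 1;   // box filter

    // Output stationary: each O[e] is held while r cycles.
    for (int e = 0; e < E; e++)
        for (int r = 0; r < R; r++)
            O[e] += I[e+r] * W[r];

    for (int e = 0; e < E; e++) printf("%d ", O[e]);  // 6 10 14 ...
    printf("\n");
    return 0;
}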

Page 28: DNN Accelerator Architectures

28

1-D Convolution

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

for (r = 0; r < R; r++)
  for (e = 0; e < E; e++)
    O[e] += I[e+r] * W[r];

† Assuming: ‘valid’ style convolution

Page 29: DNN Accelerator Architectures

29

Weight Stationary (WS)

• Minimize weight read energy consumption
  − maximize convolutional and filter reuse of weights

• Broadcast activations and accumulate psums spatially across the PE array

[Figure: Global Buffer feeding a row of PEs; activations stream in and psums stream out while weights W0–W7 stay stationary, one per PE]

Page 30: DNN Accelerator Architectures

30

WS Example: nn-X (NeuFlow)

A 3×3 2D Convolution Engine

[Figure: weights held in a 3×3 array; activations stream through and psums accumulate along the array]

[Farabet et al., ICCV 2009]

Page 31: DNN Accelerator Architectures

31

WS Example: NVDLA (simplified)

Released Sept 29, 2017

[Figure: Global Buffer feeding a PE array of M x C MACs. Image source: Nvidia, http://nvdla.org]

Page 32: DNN Accelerator Architectures

32

WS Example: NVDLA (simplified)

[Figure: the PE array holds weights for different output channels (columns, with psums P000–P700) and different input channels (rows, with input fmap activations I000–I200); each input activation is broadcast along its row, and each column accumulates one output fmap psum]
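Functionally, one step of this array computes the following (a sketch; M, C, and nvdla_step are illustrative names, not NVDLA's actual interface):

#define M 8   // output channels: columns of the array in the figure
#define C 3   // input channels: rows of the array in the figure

// One array step: weights W[m][c] stay resident in the PEs, each
// input activation I[c] is broadcast along its row, and each
// column accumulates one psum P[m] for its output channel.
void nvdla_step(int P[M], const int I[C], const int W[M][C]) {
    for (int m = 0; m < M; m++)
        for (int c = 0; c < C; c++)
            P[m] += I[c] * W[m][c];
}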

Page 33: DNN Accelerator Architectures

33

WS Example: NVDLA (simplified)

Cycle through input and output fmap (hold weights)

[Figure: 8 filters (2×2×2), a 3×3×2 input fmap, and a 2×2×8 output fmap; first input/output position]

Page 34: DNN Accelerator Architectures

34

WS Example: NVDLA (simplified)

Cycle through input and output fmap (hold weights)

[Figure: the array steps to the next input/output position while the weights stay resident]

Page 35: DNN Accelerator Architectures

35

WS Example: NVDLA (simplified)

Cycle through input and output fmap (hold weights)

[Figure: stepping continues through the input and output fmaps]

Page 36: DNN Accelerator Architectures

36

WS Example: NVDLA (simplified)

Cycle through input and output fmap (hold weights)

[Figure: stepping continues]

Page 37: DNN Accelerator Architectures

37

WS Example: NVDLA (simplified)

Load new weights

[Figure: after all input/output positions are covered, the next set of weights is loaded into the array]

Page 38: DNN Accelerator Architectures

38

WS Example: NVDLA (simplified)

Cycle through input and output fmap (hold weights)

[Figure: the cycle repeats with the new weights]

Page 39: DNN Accelerator Architectures

39

Taxonomy: More Examples

• Weight Stationary (WS)
  [Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [ISAAC, ISCA 2016] [PRIME, ISCA 2016] [TPU, ISCA 2017]

• Output Stationary (OS)
  [ShiDianNao, ISCA 2015] [Peemen, ICCD 2013] [Gupta, ICML 2015] [Moons, VLSI 2016] [Thinker, VLSI 2017]

Page 40: DNN Accelerator Architectures

40

Input Stationary (IS)

• Minimize activation read energy consumption
  − maximize convolutional and fmap reuse of activations

• Unicast weights and accumulate psums spatially across the PE array

[Figure: Global Buffer feeding a row of PEs; weights stream in and psums stream out while input activations I0–I7 stay stationary, one per PE]

Page 41: DNN Accelerator Architectures

41

IS Example: SCNN

• Used for sparse CNNs
  – A sparse CNN is one where many of the weights are zero
  – Activations also have sparsity from ReLU

[Parashar et al., ISCA 2017]

Page 42: DNN Accelerator Architectures

42

1-D Convolution – Weight Stationary

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

for (r = 0; r < R; r++)
  for (e = 0; e < E; e++)
    O[e] += I[e+r] * W[r];

† Assuming: ‘valid’ style convolution

How can we implement input stationary with no input index?

Page 43: DNN Accelerator Architectures

43

1-D Convolution – Input Stationary

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

for (h = 0; h < H; h++)
  for (r = 0; r < R; r++)
    O[h-r] += I[h] * W[r];

† Assuming: ‘valid’ style convolution

Beware: h-r must be >= 0 and < E
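A minimal sketch of the guarded version (assuming the H, R, E of the slide and a zero-initialized O[]); the bounds check implements the "beware" note directly:

// Input stationary: each I[h] is held while all R weights that
// touch it are applied; psum indices outside [0, E) are skipped.
for (int h = 0; h < H; h++)
    for (int r = 0; r < R; r++)
        if (h - r >= 0 && h - r < E)
            O[h-r] += I[h] * W[r];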

Page 44: DNN Accelerator Architectures

44

Reference Patterns of Different Dataflows

Page 45: DNN Accelerator Architectures

45

Single PE Setup

[Figure: a single PE containing a MAC, fed by a Weight Buffer, an Input Activation Buffer, and an Output Activation Buffer]

Page 46: DNN Accelerator Architectures

46

Output Stationary – Reference Pattern

Layer Shape:
- H = 12
- R = 4
- E = 9

for (e = 0; e < E; e++)
  for (r = 0; r < R; r++)
    O[e] += I[e+r] * W[r];

[Figure: reference index vs. cycle for Weights, Inputs, and Outputs]

Page 47: DNN Accelerator Architectures

47

Output Stationary – Reference Pattern

Observations:
- Single output is reused many times (R)

Page 48: DNN Accelerator Architectures

48

Output Stationary – Reference Pattern

Observations:
- Single output is reused many times (R)
- All weights reused repeatedly

Page 49: DNN Accelerator Architectures

49

Output Stationary – Reference Pattern

Observations:
- Single output is reused many times (R)
- All weights reused repeatedly
- Sliding window of inputs (size = R)

Page 50: DNN Accelerator Architectures

50

Buffer Data Accesses – Weights

for (e = 0; e < E; e++)
  for (r = 0; r < R; r++)
    O[e] += I[e+r] * W[r];

                 OS
  MACs           E*R
  Weight Reads   E*R
  Input Reads
  Output Reads
  Output Writes

Page 51: DNN Accelerator Architectures

51

Buffer Data Accesses – Inputs

for (e = 0; e < E; e++)
  for (r = 0; r < R; r++)
    O[e] += I[e+r] * W[r];

                 OS
  MACs           E*R
  Weight Reads   E*R
  Input Reads    E*R
  Output Reads
  Output Writes

Page 52: DNN Accelerator Architectures

52

Buffer Data Accesses – Outputs

for (e = 0; e < E; e++)
  for (r = 0; r < R; r++)
    O[e] += I[e+r] * W[r];

                 OS
  MACs           E*R
  Weight Reads   E*R
  Input Reads    E*R
  Output Reads   0
  Output Writes  E

(Psums accumulate locally in the PE, so each output is never read from the buffer and is written back only once.)

Page 53: DNN Accelerator Architectures

53

1-D Convolution – Weight Stationary

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

for (r = 0; r < R; r++)
  for (e = 0; e < E; e++)
    O[e] += I[e+r] * W[r];

† Assuming: ‘valid’ style convolution

Page 54: DNN Accelerator Architectures

54

Weight Stationary – Reference Pattern

Observations:
- Single weight is reused many times (E)
- Large sliding window of inputs (size = E)
- Fixed window of outputs (size = E)

Page 55: DNN Accelerator Architectures

55

L1 Weight Stationary – Costs

for (r = 0; r < R; r++)
  for (e = 0; e < E; e++)
    O[e] += I[e+r] * W[r];

                 OS    WS
  MACs           E*R   E*R
  Weight Reads   E*R   R
  Input Reads    E*R   E*R
  Output Reads   0     E*R
  Output Writes  E     E*R

Page 56: DNN Accelerator Architectures

56

1-D Convolution – Input Stationary

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

for (h = 0; h < H; h++)
  for (r = 0; r < R; r++)
    O[h-r] += I[h] * W[r];

† Assuming: ‘valid’ style convolution

Beware: h-r must be >= 0 and < E

Page 57: DNN Accelerator Architectures

57

Input Stationary – Reference Pattern

Observations:
- Inputs used repeatedly (R times)
- Weights reused in large window (size = R)
- Sliding window of outputs (size = R)

Page 58: DNN Accelerator Architectures

58

Minimum Costs

                 OS    WS    IS    Min
  MACs           E*R   E*R   E*R   E*R
  Weight Reads   E*R   R     E*R   R
  Input Reads    E*R   E*R   E     E
  Output Reads   0     E*R   E*R   0
  Output Writes  E     E*R   E*R   E

Assume: H ≈ E
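The table can be reproduced by simulation. Below is a self-contained C sketch (my own, not from the tutorial) that counts buffer accesses for the three loop orders, assuming one register per operand inside the PE; the slide's E*R entries absorb low-order terms such as E*(R-1):

#include <stdio.h>
#include <string.h>

#define H 12
#define R 4
#define E (H - R + 1)

// Buffer-access counters: the buffer is touched only when a PE
// register must be refilled (or an evicted psum written back).
typedef struct { int macs, w_rd, i_rd, o_rd, o_wr; } Cost;

static void mac(Cost *c, int e, int r,
                int *rw, int *ri, int *ro, int *started) {
    c->macs++;
    if (*rw != r)     { c->w_rd++; *rw = r; }
    if (*ri != e + r) { c->i_rd++; *ri = e + r; }
    if (*ro != e) {
        if (*ro >= 0)   c->o_wr++;   // write back the evicted psum
        if (started[e]) c->o_rd++;   // re-read a partially-done psum
        started[e] = 1;
        *ro = e;
    }
}

static Cost run(char dataflow) {
    Cost c = {0};
    int rw = -1, ri = -1, ro = -1, started[E];
    memset(started, 0, sizeof started);
    switch (dataflow) {
    case 'O':                              // output stationary
        for (int e = 0; e < E; e++)
            for (int r = 0; r < R; r++)
                mac(&c, e, r, &rw, &ri, &ro, started);
        break;
    case 'W':                              // weight stationary
        for (int r = 0; r < R; r++)
            for (int e = 0; e < E; e++)
                mac(&c, e, r, &rw, &ri, &ro, started);
        break;
    case 'I':                              // input stationary
        for (int h = 0; h < H; h++)
            for (int r = 0; r < R; r++)
                if (h - r >= 0 && h - r < E)
                    mac(&c, h - r, r, &rw, &ri, &ro, started);
        break;
    }
    c.o_wr++;                              // flush the final psum
    return c;
}

int main(void) {
    const char *df = "OWI";
    for (int k = 0; k < 3; k++) {
        Cost c = run(df[k]);
        printf("%cS: MACs=%2d W_rd=%2d I_rd=%2d O_rd=%2d O_wr=%2d\n",
               df[k], c.macs, c.w_rd, c.i_rd, c.o_rd, c.o_wr);
    }
    return 0;
}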

Page 59: DNN Accelerator Architectures

59

Intermediate Buffering

[Figure: the PE's L0 Weights, L0 Inputs, and L0 Outputs buffers are backed by larger L1 Weights, L1 Inputs, and L1 Outputs buffers]

Page 60: DNN Accelerator Architectures

60

1-D Convolution – Buffered

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

// Level 1
for (e1 = 0; e1 < E1; e1++)
  for (r1 = 0; r1 < R1; r1++)
    // Level 0
    for (e0 = 0; e0 < E0; e0++)
      for (r0 = 0; r0 < R0; r0++)
        O[e1*E0+e0] += I[e1*E0+e0 + r1*R0+r0] * W[r1*R0+r0];

† Assuming: ‘valid’ style convolution

Note E and R are factored so:
  E0*E1 = E
  R0*R1 = R
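A quick self-contained check (the factorings E0 = 3, R0 = 2 are my own choice) that the tiled nest computes the same outputs as the flat one:

#include <stdio.h>

#define H 12
#define R 4
#define E (H - R + 1)   // 9
#define E0 3            // tile sizes: E0*E1 = E, R0*R1 = R
#define E1 (E / E0)
#define R0 2
#define R1 (R / R0)

int main(void) {
    int I[H], W[R], O_flat[E] = {0}, O_tiled[E] = {0};
    for (int h = 0; h < H; h++) I[h] = h + 1;
    for (int r = 0; r < R; r++) W[r] = r + 1;

    for (int e = 0; e < E; e++)                 // flat reference
        for (int r = 0; r < R; r++)
            O_flat[e] += I[e+r] * W[r];

    for (int e1 = 0; e1 < E1; e1++)             // Level 1
        for (int r1 = 0; r1 < R1; r1++)
            for (int e0 = 0; e0 < E0; e0++)     // Level 0
                for (int r0 = 0; r0 < R0; r0++)
                    O_tiled[e1*E0+e0] += I[e1*E0+e0 + r1*R0+r0] * W[r1*R0+r0];

    int ok = 1;
    for (int e = 0; e < E; e++) ok &= (O_flat[e] == O_tiled[e]);
    printf("tiled == flat: %s\n", ok ? "yes" : "no");   // prints "yes"
    return 0;
}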

Page 61: DNN Accelerator Architectures

61

Buffer sizes

• Level 0 buffer size is the volume needed in each Level 1 iteration.

• Level 1 buffer size is the volume that must be preserved and re-delivered in future (usually successive) Level 1 iterations.

// Level 1
for (e1 = 0; e1 < E1; e1++)
  for (r1 = 0; r1 < R1; r1++)
    // Level 0
    for (e0 = 0; e0 < E0; e0++)
      for (r0 = 0; r0 < R0; r0++)
        O[e1*E0+e0] += I[e1*E0+e0 + r1*R0+r0] * W[r1*R0+r0];

A legal assignment of loop limits will fit into the hardware’s buffer sizes
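For this nest, the volumes touched by one Level 1 iteration (fixed e1, r1) follow directly from the index expressions; a minimal sketch of the resulting Level 0 buffer requirements (my reading, not stated on the slide):

// Per Level 1 iteration, Level 0 touches:
//   weights: W[r1*R0 .. r1*R0 + R0-1]               -> R0 elements
//   outputs: O[e1*E0 .. e1*E0 + E0-1]               -> E0 elements
//   inputs:  I[e1*E0 + r1*R0 .. e1*E0+E0-1 + r1*R0+R0-1]
//            (a sliding window)                     -> E0 + R0 - 1 elements
int l0_weight_buf = R0;
int l0_output_buf = E0;
int l0_input_buf  = E0 + R0 - 1;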

Page 62: DNN Accelerator Architectures

62

Spatial PEs

[Figure: shared Weight, Input, and Output buffers feeding two parallel processing elements, PE0 and PE1]

Page 63: DNN Accelerator Architectures

63

1-D Convolution – Spatial

[Figure: Weights (size R) * Inputs (size H) = Outputs (size E = H-R+1†)]

int I[H]; // Input activations
int W[R]; // Filter weights
int O[E]; // Output activations

// Level 1
for (r1 = 0; r1 < R1; r1++)
  for (e1 = 0; e1 < E1; e1++)
    // Level 0
    spatial-for (r0 = 0; r0 < R0; r0++)
      spatial-for (e0 = 0; e0 < E0; e0++)
        O[e1*E0+e0] += I[e1*E0+e0 + r1*R0+r0] * W[r1*R0+r0];

† Assuming: ‘valid’ style convolution

Note:
• E0*E1 = E
• R0*R1 = R
• R0*E0 <= #PEs
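Here spatial-for denotes loops unrolled in hardware rather than in time: all R0*E0 (r0, e0) pairs execute in the same cycle, one per PE. A per-PE view of the Level 0 body (a sketch; the pe numbering and pe_step are illustrative, with E0/R0 as defined above):

// PE with id pe = r0*E0 + e0 performs one MAC per Level 1 step.
void pe_step(int pe, int e1, int r1,
             const int I[], const int W[], int O[]) {
    int r0 = pe / E0;   // position in the r0 dimension
    int e0 = pe % E0;   // position in the e0 dimension
    // Across PEs with the same e0 (different r0), the += to the same
    // O[...] is a spatial psum accumulation (e.g., an adder tree).
    O[e1*E0 + e0] += I[e1*E0 + e0 + r1*R0 + r0] * W[r1*R0 + r0];
}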

Page 64: DNN Accelerator Architectures

64

Summary of DNN Dataflows

• Minimizing data movement is the key to achieving high energy efficiency for DNN accelerators

• Dataflow taxonomy:
  – Output Stationary: minimize movement of psums
  – Weight Stationary: minimize movement of weights
  – Input Stationary: minimize movement of inputs

• A loop nest provides a compact way to describe various properties of a dataflow, e.g., data tiling in multi-level storage and spatial processing.