A Programmable Embedded Microprocessor for Bit-scalable In-memory Computing. Hongyang Jia ([email protected]), H. Valavi, Y. Tang, J. Zhang, N. Verma. HotChips 2019
Page 1: A Programmable Embedded Microprocessor for Bit-scalable In ...

A Programmable Embedded Microprocessor for Bit-scalable In-memory Computing

Hongyang Jia ([email protected]),

H. Valavi, Y. Tang, J. Zhang, N. Verma

HotChips 2019

Page 2: A Programmable Embedded Microprocessor for Bit-scalable In ...

Digital Acceleration

• 10-100× gains in compute energy efficiency & speed

[Figure: Performance/Watt (scale 1, 5, 10, 50) for CPU, GPU, and Accelerator (TPU). Source: https://cloud.google.com]

[Figure: Typical Instruction Energy Breakdown. Programmability overhead (instruction/operand fetch/decode): >90%; Compute: <10%.]

• … BUT, not data movement & memory accessing

[Figure: Energy per Access of a 64b Word (pJ) vs. Memory Size (D), 45nm CMOS, with multiply energies for reference:]

MULT (INT4): 0.1pJ
MULT (INT8): 0.3pJ
MULT (INT32): 3pJ
MULT (FP32): 5pJ

Page 3: A Programmable Embedded Microprocessor for Bit-scalable In ...

Focusing on Embedded Memory (SRAM)

• Even models for Edge AI can be large

Data Type   Application            Model              # Params
Vision      Image classification   ResNet-50          26M
Vision      Object detection       SSD300             24M
Language    Keyword spotting       Deep Speech 2      38M
Language    Machine translation    Base Transformer   50M

(courtesy IBM)

• … BUT, reducing bit precision makes it feasible to store models on chip

28nm: <20mm2 embedded SRAM
16nm: <12mm2 embedded SRAM

Page 4: A Programmable Embedded Microprocessor for Bit-scalable In ...

Outline

1. Basics of In-Memory Computing (IMC)
   • Amortizing data movement

2. IMC challenges
   • Compute SNR due to analog operation

3. High-SNR capacitor-based (charge-domain) IMC

4. Programmable heterogeneous IMC architecture
   • Bit-scalability, integration in memory space, near-memory data path

5. Prototype measurements
   • Chip measurements, SW libraries, neural-network demonstrations

Page 5: A Programmable Embedded Microprocessor for Bit-scalable In ...

Typical Ways to Amortize Data Movement

• Specialized (memory-compute integrated) architectures

$$\begin{bmatrix} c_1 \\ \vdots \\ c_M \end{bmatrix} = \begin{bmatrix} a_{1,1} & \dots & a_{1,N} \\ \vdots & \ddots & \vdots \\ a_{M,1} & \cdots & a_{M,N} \end{bmatrix} \begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix}$$

• Reuse accessed data for compute operations: $\vec{c} = A \times \vec{b}$

[Figure: Reuse in Time (Memory feeding a Processing Element (PE); one Data Access followed by Comp. 0, Comp. 1, Comp. 2) and Reuse in Space (Memory feeding Processor 0 and Processor 1). Labels: Compute, Data Access, Compute Intensity.]
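As a rough illustration of why reuse matters for the matrix-vector product above, the sketch below counts operand fetches per multiply-accumulate (MAC). The function name `compute_intensity` and the simple counting model (one fetch per operand word, no caches) are assumptions made for illustration; they are not taken from the slides.

```python
def compute_intensity(m_rows, n_cols, reuse_b=True):
    """MACs per operand word fetched for c = A x b (illustrative counting model)."""
    macs = m_rows * n_cols
    a_fetches = m_rows * n_cols  # every matrix element is read exactly once
    # Without reuse, b is re-fetched for every output row; with reuse, once in total.
    b_fetches = n_cols if reuse_b else m_rows * n_cols
    return macs / (a_fetches + b_fetches)


no_reuse = compute_intensity(1024, 1024, reuse_b=False)
with_reuse = compute_intensity(1024, 1024, reuse_b=True)
print(f"MACs per fetch, no reuse of b: {no_reuse:.3f}")
print(f"MACs per fetch, b reused:      {with_reuse:.3f}")
```

Under this model, even full reuse of the vector leaves roughly one matrix-element fetch per MAC, which is the access cost that memory-compute integration tries to amortize.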

Page 6: A Programmable Embedded Microprocessor for Bit-scalable In ...

In-Memory Computing (IMC) to Amortize Data Movement

$$\vec{c} = A \times \vec{b} \;\Rightarrow\; \begin{bmatrix} c_1 \\ \vdots \\ c_M \end{bmatrix} = \begin{bmatrix} a_{1,1} & \dots & a_{1,N} \\ \vdots & \ddots & \vdots \\ a_{M,1} & \cdots & a_{M,N} \end{bmatrix} \begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix}$$

[Figure: SRAM column (BL/BLB, PRE) storing $a_{1,1} = -1$ … $a_{1,N} = +1$, with inputs $b_1$ … $b_N$ driving the rows. Two operating modes are compared: SRAM Mode (row-by-row), giving a digital output, and IMC Mode (row parallel), giving an analog output. Annotations: (big), (small).]

→ Critical tradeoff: energy/throughput vs. SNR

[J. Zhang, VLSI’16] [J. Zhang, JSSC’17]

Page 7: A Programmable Embedded Microprocessor for Bit-scalable In ...

Limited by Analog Computation

• Use analog circuits to ‘fit’ computation in bit cells
  ⟶ Compute SNR limited by circuit non-idealities (nonlinearity, variation, noise)

[Figure: bit-cell circuit (WL, WLDAC, BL/BLB, MA, MD) and plot of Bit-cell Current (μA, 0 to 80) vs. WL Voltage (V, 0 to 1.2) from a 10k-pt Monte Carlo sim., annotated with Nonlinearity and Variation.]

Page 8: A Programmable Embedded Microprocessor for Bit-scalable In ...

Where does IMC Stand Today?

[Figure, two log-scale scatter plots distinguishing IMC from Not IMC designs:
 (1) Normalized Throughput (GOPS/mm2) vs. Energy Efficiency (TOPS/W);
 (2) On-chip Memory Size (kB) vs. Energy Efficiency (TOPS/W).]

Designs plotted:
Zhang, VLSI’16, 130nm
Chen, ISSCC’16, 65nm
Moons, ISSCC’17, 28nm
Shin, ISSCC’17, 65nm
Ando, VLSI’17, 65nm
Yin, VLSI’17, 65nm
Bankman, ISSCC’18, 28nm
Biswas, ISSCC’18, 65nm
Gonug, ISSCC’18, 65nm
Khwa, ISSCC’18, 65nm
Lee, ISSCC’18, 65nm
Jiang, VLSI’18, 65nm
Valavi, VLSI’18, 65nm
Yuan, VLSI’18, 65nm

• Potential for 10× higher energy efficiency & throughput

• … BUT, limited scalability (size, workloads, architectures)

Page 9: A Programmable Embedded Microprocessor for Bit-scalable In ...

Moving to High-SNR Analog Computation

• Charge-domain computation based on capacitors
  ⟶ capacitances set by geometric parameters, well controlled in advanced nodes

8T Multiplying Bit Cell (M-BC)
1. Digital multiplication
2. Analog accumulation

Two modes:
o XNOR: $O^n_{x,y,z} = IA_{x,y,z} \oplus W^n_{i,j,z}$
o AND: $O^n_{x,y,z} = IA_{x,y,z} \times W^n_{i,j,z}$ (i.e., keep $IA_{x,y,z}$ high)

Metal cap. above bit cell
o 1.8× area of 6T cell
o ~10× smaller than equal digital circuit

[Figure: M-BC schematic with signals WL, BL, BLb, $IA_{x,y,z}$, $IAb_{x,y,z}$, $W^n_{i,j,z}$, $Wb^n_{i,j,z}$, and output $O^n_{x,y,z}$; XNOR/AND outputs of cells such as $W^n_{1,1,1}$ with inputs $IA_{1,1,1}$, $IA_{1,2,1}$ accumulate into Pre-activation $PA_n$.]

[H. Valavi, VLSI’18]
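The two M-BC modes can be illustrated with an idealized software model: each bit cell forms a 1-b product, and the column sum stands in for the charge-domain accumulation. This is a minimal sketch under stated assumptions. The function name `mbc_column`, the 0/1 encoding of -1/+1 in XNOR mode, and the noise-free sum are illustrative choices, not the authors' circuit behavior.

```python
import numpy as np


def mbc_column(ia_bits, w_bits, mode="XNOR"):
    """Idealized column: per-cell 1-b multiply, then noise-free accumulation."""
    ia = np.asarray(ia_bits, dtype=np.int64)
    w = np.asarray(w_bits, dtype=np.int64)
    if mode == "XNOR":
        # Assumed encoding: bit 1 means +1, bit 0 means -1; output is 1 when signs agree.
        o = 1 - (ia ^ w)
    elif mode == "AND":
        # Unsigned 1-b multiply.
        o = ia & w
    else:
        raise ValueError(f"unknown mode: {mode}")
    return int(o.sum())  # stands in for charge sharing across cell capacitors


ia = [1, 0, 1, 1]
w = [1, 1, 0, 1]
agree = mbc_column(ia, w, "XNOR")
signed_dot = 2 * agree - len(ia)  # map the count of agreements back to a +/-1 dot product
and_sum = mbc_column(ia, w, "AND")
print(agree, signed_dot, and_sum)
```

In XNOR mode the count of agreeing cells maps linearly to the signed dot product, which is why a single accumulated value per column is enough for binarized networks.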

Page 10: A Programmable Embedded Microprocessor for Bit-scalable In ...

Previous Demonstration: 2.4Mb, 64-tile IMC

                    Moons,      Bang,       Ando,       Bankman,            Valavi,
                    ISSCC’17    ISSCC’17    VLSI’17     ISSCC’18            VLSI’18
Technology          28nm        40nm        65nm        28nm                65nm
Area (mm2)          1.87        7.1         12          6                   17.6
Operating VDD       1           0.63-0.9    0.55-1      0.8/0.8 (0.6/0.5)   0.94/0.68/1.2
Bit precision       4-16b       6-32b       1b          1b                  1b
On-chip Mem.        128kB       270kB       100kB       328kB               295kB
Throughput (GOPS)   400         108         1264        400 (60)            18,876
TOPS/W              10          0.384       6           532 (772)           866

• 10-layer CNN demos for MNIST/CIFAR-10/SVHN at energies of 0.8/3.55/3.55 μJ/image

• Equivalent performance to software implementation

[H. Valavi, VLSI’18]

Page 11: A Programmable Embedded Microprocessor for Bit-scalable In ...

Need Programmable Heterogeneous Architectures

• Matrix-vector multiply is only 70-90% of operations
  ⟶ IMC must integrate in programmable, heterogeneous architectures

[Figure labels: General matrix multiply with many elements (IMC); Single/few-word operands (traditional, near-mem. acceleration).]

[B. Fleischer, VLSI’18]

Page 12: A Programmable Embedded Microprocessor for Bit-scalable In ...

Programmable IMC

1. Interfaces to standard processor memory space
2. Digital near-mem. accelerator (element compute)
3. Bit scalability from 1 to 8 bits

[Figure: chip block diagram. CPU (RISC-V), Program Memory (128 kB), Data Memory (128 kB), Bootloader, DMA, Timers, GPIO, UART (Tx/Rx), Ext. Mem. I/F (to E2PROM, to DRAM Controller), and Config. Regs., connected by a 32b AXI Bus and a 32b APB Bus. External interface widths are labeled 8 (data), 13 (addr.), and 32 (data/addr.).]

Compute-In-Memory Unit (CIMU)
• 590 kb
• 16 banks

[Figure: CIMU internals. Memory Read/Write I/F (32b); w2b Reshaping Buffer; Input-Vector Generator (x); Row Decoder / WL Drivers; Compute-In-Memory Array (CIMA) of bit cells holding A; ADCs (8b); Near-Mem. Data Path (32b); output f(y = A x).]

Page 13: A Programmable Embedded Microprocessor for Bit-scalable In ...

Bit-Parallel/Bit-Serial (BP/BS) Multi-bit IMC

[Figure: CIMA columns with N=2304 rows. Matrix elements $a_{0,0}$, $a_{1,0}$, … occupy columns [BA-1:0] (bit parallel); input-vector elements $x_0$[Bx-1:0] … $x_{N-1}$[Bx-1:0] (up to $x_{2303}$) are applied one bit at a time (bit serial), e.g., 0-1-0, 0-0-0, 1-1-0. Each column is digitized by an 8-b SAR ADC (15|18% energy|area overhead).]

Max. Dynamic Range: 2305
Dynamic Range: 256

BX : Input-vector bit precision (E.g., 3b)
BA : Matrix bit precision (E.g., 3b)

[Figure: SQNR (dB) vs. BA (2 to 8), in three panels for Bx=2, Bx=4, Bx=8, each with curves for N=2304, 2000, 1500, 1000, 500, 255.]

• SQNR different than standard INT compute
  - rounding effects are well modeled
  - SQNR is high at precisions of interest
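The BP/BS scheme can be sketched in software as follows: matrix bits are spread across columns, input bits are applied one per cycle, each column sum is quantized by an ADC, and the partial results are recombined with binary weights. This is a minimal sketch, assuming unsigned operands (AND mode), an ideal uniform 8-b ADC spanning the full column range, and floating-point recombination. The function name `bpbs_mvm` and these modeling choices are illustrative, not the chip's exact arithmetic.

```python
import numpy as np


def bpbs_mvm(A, x, BA, BX, adc_bits=8):
    """Approximate y = A x with bit-parallel matrix bits and bit-serial input bits."""
    A = np.asarray(A, dtype=np.int64)  # shape (outputs, N), entries in [0, 2**BA)
    x = np.asarray(x, dtype=np.int64)  # shape (N,), entries in [0, 2**BX)
    n_rows = A.shape[1]                # N: number of accumulated bit cells per column
    max_code = 2 ** adc_bits - 1
    y = np.zeros(A.shape[0], dtype=np.float64)
    for i in range(BA):                # bit-parallel: one column per matrix bit
        a_bit = (A >> i) & 1
        for j in range(BX):            # bit-serial: one input bit per cycle
            x_bit = (x >> j) & 1
            col = a_bit @ x_bit        # 1-b products accumulated; integer in [0, N]
            if n_rows > max_code:
                # Column range (N+1 levels) exceeds the ADC range: quantize, then rescale.
                code = np.round(col * max_code / n_rows)
                col = code * n_rows / max_code
            y += col * 2.0 ** (i + j)  # binary-weighted recombination
    return y


rng = np.random.default_rng(0)
A = rng.integers(0, 16, size=(4, 2304))
x = rng.integers(0, 16, size=2304)
approx = bpbs_mvm(A, x, BA=4, BX=4)
exact = A @ x
print("max relative error:", np.max(np.abs(approx - exact) / exact))
```

When N is at most 255, the assumed ADC resolves every column level and the result matches integer compute exactly; for N=2304 the column has 2305 possible levels against 256 ADC codes, which is the source of the quantization error the SQNR curves describe.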

Page 14: A Programmable Embedded Microprocessor for Bit-scalable In ...

Word-to-bit (w2b) Reshaping Buffer

Highly configurable connection network
Sequenced data for bit-serial compute
Circular interface for configurable convolutional striding
Sparsity-proportional energy savings

(Bx : Input-vector bit precision, E.g., 2b)

[Figure: 32b input words are written into register files Reg. File<0> … Reg. File<7>, built from 8b registers indexed <0> … <23>; 96b and 72b paths with Shifting (conv.); bit-sequenced outputs (bit 0, bit 1 of each element) go to the Sparsity/AND-logic Controller.]
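A software analogue of the word-to-bit reshaping is to split multi-bit input elements into bit planes that can be applied one per cycle. The sketch below is an illustration only: the helper names, the LSB-first ordering, and the use of the nonzero-bit fraction as a proxy for sparsity-dependent activity are assumptions, not details given on the slide.

```python
import numpy as np


def words_to_bit_planes(x, BX):
    """Split unsigned BX-bit elements into BX bit planes, LSB first (assumed order)."""
    x = np.asarray(x, dtype=np.int64)
    return [(x >> j) & 1 for j in range(BX)]


def active_fraction(plane):
    """Fraction of nonzero bits in a plane: a rough proxy for switching activity."""
    return float(np.mean(plane))


x = [0, 3, 2, 0, 1, 0, 0, 2]
planes = words_to_bit_planes(x, BX=2)
for j, plane in enumerate(planes):
    print(f"bit {j}: {plane.tolist()} active={active_fraction(plane):.3f}")
```

Summing the planes with binary weights recovers the original elements, so the reshaping is lossless; only the order in which bits reach the array changes.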

Page 15: A Programmable Embedded Microprocessor for Bit-scalable In ...

Data-transfer Analysis

[Figure: Cycle Count vs. BA (1, 2, 4, 8) in four panels (Bx=1, Bx=2, Bx=4, Bx=8), plotting Nx, NCIMU, and Ny, with regions marked "CIMU Underutilization".]

Nx: number of cycles to transfer x
NCIMU: number of cycles for CIMU compute
Ny: number of cycles to transfer y
BX: bit precision of x elements
BA: bit precision of A elements

(33k cycles to load the whole CIMU memory)

Page 16: A Programmable Embedded Microprocessor for Bit-scalable In ...

Near-memory Datapath (NMD)

[Figure: datapath blocks: 8b ADC, Cross-column mux’ing, BPBS Buffering, Local Scale, Local Offset, Local Exp., Global Exp., Global Offset, ReLU Unit, Non-linear functions. Labeled bit widths: 8b, 9b, 9b, 11b, 19b, 32b, 32b.]

Page 17: A Programmable Embedded Microprocessor for Bit-scalable In ...

Prototype

[Die photo, 3mm × 4.5mm: 4×4 CIMA Tiles, ADC, Near-mem. Datapath, DMEM, PMEM, CPU, DMA etc., Test Struct.]

Technology (nm)             65
CIMU Area (mm2)             8.5
VDD (V)                     1.2|0.85
On-chip mem. (kB)           74
Bit precision (b)           1-8
Thru.put (1b-GOPS/mm2)      0.26|0.10
Energy Eff. (1b-TOPS/W)     192|400

• Recent work has moved to advanced CMOS nodes
  ⟶ Observe energy/density scaling like digital, while maintaining analog precision

Page 18: A Programmable Embedded Microprocessor for Bit-scalable In ...

Column Computation

[Figure: 8b ADC Output Code (0 to 250) vs. Ideal Pre-activation Value (0 to 5000). Error bars show std. deviation over 256 CIMA columns.]

Summary

Tech. (nm)          65
VDD (V)             1.2 | 0.7/0.85
FCLK (MHz)          100 | 40
Total area (mm2)    13.5

Energy Breakdown @VDD = 1.2V | 0.7V (P/D MEM, Reshap. Buf), 0.85V (rest)

CPU (pJ/instru)                 52 | 26
DMA (pJ/32b-transfer)           13.5 | 7.0
Reshap. Buf.¹ (pJ/32b-input)    35 | 12
CIMA¹ (pJ/column)               20.4 | 9.7
ADC¹ (pJ/column)                3.56 | 1.79
Dig. Datapath¹ (pJ/output)      14.7 | 8.3

¹ Breakdown within CIMU accelerator

Page 19: A Programmable Embedded Microprocessor for Bit-scalable In ...

Characterization and Demonstrations

Multi-bit Matrix-Vector Multiplication

[Figure: SQNR (dB) vs. BA (2 to 8) for Bx=2 and Bx=4, DX=1152, comparing Bit-true Sim. and Measured.]

[Figure: Compute Value vs. Data Index (0 to 80) for Bx=2, BA=2 and Bx=4, BA=4, comparing Bit True Sim. and Measured.]

Neural-Network Demonstrations: CIFAR-10 Image Classification

                               Network A                      Network B
                               (4/4-b activations/weights)    (1/1-b activations/weights)
Accuracy of chip (vs. ideal)   92.4% (vs. 92.7%)              89.3% (vs. 89.8%)
Energy/10-way Class.¹          105.2 μJ                       5.31 μJ
Throughput¹                    23 images/sec.                 176 images/sec.

Neural Network Topology (listed identically for Network A and Network B):
L1: 128 CONV3 – Batch norm.
L2: 128 CONV3 – POOL – Batch norm.
L3: 256 CONV3 – Batch norm.
L4: 256 CONV3 – POOL – Batch norm.
L5: 256 CONV3 – Batch norm.
L6: 256 CONV3 – POOL – Batch norm.
L7-8: 1024 FC – Batch norm.
L9: 10 FC – Batch norm.

Page 20: A Programmable Embedded Microprocessor for Bit-scalable In ...

Development board

[Photo: development board, with a connection labeled "To Host Processor".]

Page 21: A Programmable Embedded Microprocessor for Bit-scalable In ...

Application-mapping Flows

[Figure: User Design / Deployment Flow, spanning Training to Inference and Host Processor to On-Chip.
 Training: NN Libs. (Keras etc.) with Int. Quant. and Chip Quant., taking Training Data and producing Params.
 Dev. Flow: Dev. SDK (Python/Matlab) with NN Layers, Chip Config., and MVM on the Chip.
 Deployment Flow: Imp. SDK (Embedded C) with NN Layers, Chip Config., and Runtime Ctrl.]

Page 22: A Programmable Embedded Microprocessor for Bit-scalable In ...

Training/Inference Libraries

1. Deep-learning Training Libraries (Keras)

Standard Keras libs:

    Dense(units, ...)
    Conv2D(filters, kernel_size, ...)
    ...

Custom libs (INT/CHIP quant.):

    QuantizedDense(units, nb_input=4, nb_weight=4, chip_quant=True, ...)
    QuantizedConv2D(filters, kernel_size, nb_input=4, nb_weight=4, chip_quant=True, ...)
    ...
    QuantizedDense(units, nb_input=4, nb_weight=4, chip_quant=False, ...)
    QuantizedConv2D(filters, kernel_size, nb_input=4, nb_weight=4, chip_quant=False, ...)
    ...

2. Deep-learning Inference Libraries (Python, MATLAB, C)

High-level network build (Python):

    chip_mode = True
    outputs = QuantizedConv2D(inputs, weights, biases, layer_params)
    outputs = BatchNormalization(inputs, layer_params)
    ...

Function calls to chip (Python):

    chip.load_config(num_tiles, nb_input=4, nb_weight=4)
    chip.load_weights(weights2load)
    chip.load_image(image2load)
    outputs = chip.image_filter()

Embedded C:

    chip_command = get_uart_word();
    chip_config();
    load_weights(); load_image();
    image_filter(chip_command);
    read_dotprod_result(image_filter_command);

Page 23: A Programmable Embedded Microprocessor for Bit-scalable In ...

Conclusions

Matrix-vector multiplies (MVMs) are a little different from other computations
⟶ high-dimensionality operands lead to data movement / memory accessing

Capacitor-based IMC enables high-SNR analog compute
⟶ enables scale and robust functional specification of IMC for architectural design

Programmable IMC is demonstrated (with supporting SW libraries)
⟶ physical IMC tradeoffs will drive specialized mapping/virtualization algorithms

Recent work has moved to advanced nodes
⟶ energy and density scaling similar to standard digital logic

Acknowledgements: funding provided by ADI, DARPA, NRO

Page 24: A Programmable Embedded Microprocessor for Bit-scalable In ...

Backup


Page 25: A Programmable Embedded Microprocessor for Bit-scalable In ...

IMC Tradeoffs

CONSIDER: Accessing $D$ bits of data associated with computation, from array with $D^{1/2}$ columns ⨉ $D^{1/2}$ rows.

[Figure: Conventional: Memory ($D^{1/2} \times D^{1/2}$ array) separate from Computation. IMC: Memory & Computation combined ($D^{1/2} \times D^{1/2}$ array).]

Metric      Conventional    In-memory
Bandwidth   $1/D^{1/2}$     $1$
Latency     $D$             $1$
Energy      $D^{3/2}$       $\sim D$
SNR         $1$             $\sim 1/D^{1/2}$

• IMC benefits energy/delay at cost of SNR

• SNR-focused systems design is critical (circuits, architectures, algorithms)
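To get a feel for the scaling relations above, the sketch below evaluates them for a given number of bits D. Only the proportionalities are taken from the slide. The function name `imc_scaling`, the dictionary layout, and setting every constant factor to 1 are assumptions, so only the ratios between the two columns are meaningful.

```python
import math


def imc_scaling(D):
    """Relative scaling of conventional vs. in-memory access (constants set to 1)."""
    root = math.sqrt(D)
    conventional = {"bandwidth": 1 / root, "latency": D, "energy": D ** 1.5, "snr": 1.0}
    in_memory = {"bandwidth": 1.0, "latency": 1.0, "energy": float(D), "snr": 1 / root}
    return conventional, in_memory


conv, imc = imc_scaling(2 ** 20)  # e.g., a 1024 x 1024 array
print("energy ratio (conventional / IMC):", conv["energy"] / imc["energy"])
print("SNR ratio (IMC / conventional):   ", imc["snr"] / conv["snr"])
```

For a 1024 × 1024 array the energy gain and the SNR penalty both scale with the array dimension, which illustrates why the slide pairs the energy/delay benefit with an SNR cost.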

Page 26: A Programmable Embedded Microprocessor for Bit-scalable In ...

Algorithmic Co-design(?)

• Chip-specific weight tuning

[Figure: a Classifier Trainer produces WEAK classifier 1, WEAK classifier 2, …, WEAK classifier K over a Feature 1 / Feature 2 space; their decisions are combined by a Weighted Voter.]

[Z. Wang, TVLSI’15]
[Z. Wang, TCAS-I’15]
[S. Gonu., ISSCC’18]

• Chip-generalized weight tuning

[Figure: Training passes Parameters $\theta(x, G, \mathcal{L})$ to Inference; labels $G$, $g_i$.]

$$L = |y - \hat{y}(x, \theta)|^2$$
$$L = |y - \hat{y}(x, \theta, G)|^2$$

E.g.: BNN Model (applied to CIFAR-10)

[Figure: Accuracy (10 to 100) vs. Normalized MRAM cell standard dev. (1 to 10).]

[B. Zhang, ICASSP 2019]

Page 27: A Programmable Embedded Microprocessor for Bit-scalable In ...

M-BC Layout

[Figure: layouts of the 6T cell (6T area: 1.0 A.U.) and the M-BC (M-BC area: 1.8 A.U.), with routing for WL, BL, BLb, $IA_{x,y,z}$, $IAb_{x,y,z}$, VDD, GND, and GND (shield).]

E.g., MOM-capacitor matching (130nm), shown for 2fF, 1fF, and 0.5fF capacitors [H. Omran, TCAS-I’16]

• >14-b capacitor matching across M-BCs

• >14k IMC rows for matching-limited SNR

[H. Valavi, VLSI’18]

Page 28: A Programmable Embedded Microprocessor for Bit-scalable In ...

IMC as a Spatial Architecture

Operation        Digital-PE Energy (fJ)   Bit-cell Energy (fJ)
Storage          250                      50 (Storage, Multiplication,
Multiplication   100                          and Accumulation combined)
Accumulation     200
Communication    40                       5
Total            590                      55

Assume:
• 1k dimensionality
• 4-b multiplies
• 45nm CMOS

[Figure: bit cells holding $a_{11}[3]$ … $a_{11}[0]$ and $a_{12}[3]$ … $a_{12}[0]$, driven by $b_{11}[3]$ and $b_{21}[3]$, with PRE on each column; column outputs $c_{11}$ weighted by $2^3$, $2^2$, $2^1$, $2^0$.]
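The totals in the energy table can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the table. Treating the single 50 fJ bit-cell entry as covering storage, multiplication, and accumulation together is an assumption, although it is consistent with the stated total of 55 fJ.

```python
# Per-operation energies in fJ, copied from the table (1k dimensionality, 4-b, 45nm).
digital_pe_fj = {
    "storage": 250,
    "multiplication": 100,
    "accumulation": 200,
    "communication": 40,
}
# Assumed grouping: one 50 fJ entry spans storage, multiplication, and accumulation.
bit_cell_fj = {
    "storage+multiplication+accumulation": 50,
    "communication": 5,
}

digital_total = sum(digital_pe_fj.values())
bit_cell_total = sum(bit_cell_fj.values())
ratio = digital_total / bit_cell_total
print(f"digital PE: {digital_total} fJ, bit cell: {bit_cell_total} fJ, ratio: {ratio:.1f}x")
```

The roughly tenfold ratio matches the order of magnitude of the energy-efficiency potential quoted earlier in the talk.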

Page 29: A Programmable Embedded Microprocessor for Bit-scalable In ...

Application mapping

• IMC engines must be ‘virtualized’
  ⟶ IMC amortizes MVM costs, not weight loading
  ⟶ Need new mapping algorithms (physical tradeoffs very diff. than digital engines)

[Figure: filters $W^n_{i,j,z}$ (N - I⨉J⨉Z filters) applied to $IA_{x,y,z}$ (X⨉Y⨉Z input activations) to produce (output activations).]

Activation Accessing:
• $E_{DRAM \to IMC/4\text{-bit}}$: 40pJ
• Reuse: $N \times I \times J$ (10-20 lyrs)
• $E_{MAC,4\text{-b}}$: 50fJ

Weight Accessing:
• $E_{DRAM \to IMC/4\text{-bit}}$: 40pJ
• Reuse: $X \times Y$
• $E_{MAC,4\text{-b}}$: 50fJ

[Figure: Memory Bound and Compute Bound regions, with the boundary at Reuse ≈ 1k. Ex.: X⨉Y sets reuse of filter weights.]
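The "Reuse ≈ 1k" boundary can be related to the two energies quoted on the slide with a simple amortization model. This is a sketch under stated assumptions: the function names and the additive model (MAC energy plus DRAM-to-IMC transfer energy divided by the reuse count) are illustrative, and the slide's memory-bound/compute-bound boundary may have been derived differently.

```python
E_DRAM_TO_IMC_PJ = 40.0  # energy to move one 4-bit element from DRAM to IMC (slide value)
E_MAC_FJ = 50.0          # energy of one 4-b MAC (slide value)


def energy_per_mac_fj(reuse):
    """Assumed model: MAC energy plus transfer energy amortized over `reuse` MACs."""
    return E_MAC_FJ + E_DRAM_TO_IMC_PJ * 1000.0 / reuse


def break_even_reuse():
    """Reuse at which amortized transfer energy equals the MAC energy."""
    return E_DRAM_TO_IMC_PJ * 1000.0 / E_MAC_FJ


print("break-even reuse:", break_even_reuse())
for reuse in (10, 100, 800, 10000):
    print(f"reuse={reuse:>5}: {energy_per_mac_fj(reuse):.1f} fJ per MAC")
```

Under this model the break-even reuse is 800, the same order as the ≈1k boundary on the slide. Below it, data movement dominates the energy per MAC; above it, compute does.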