Hardware Efficiency
in Neuromorphic Computing:
Devices, Circuits, and Algorithms
Yu (Kevin) Cao, School of Electrical, Computer and Energy Engineering
Arizona State University
Acknowledgement: Jae-sun Seo, Shimeng Yu, Sarma Vrudhula, Visar Berisha (ASU); Maxim Bazhenov (UCSD); Jieping Ye (UM)
2016 SIGDA DASS

Neuromorphic Computing On-a-chip: Challenges and Needs
Algorithm: Inhibition and Noise
– Motif of feedforward inhibition
– MNIST: >95% accuracy, 3X saving in network size
Summary
From Data to Information
Big data generated: useful if tagged and analyzed
Big gap in information analysis!
[Chart: fractions of big data generated that are tagged and analyzed; IDC, December 2012]
Success of Machine Learning
A top-down approach: better for digital IC
– Pros: mathematical, accurate, scalable
– Cons: big data, heavy computing, off-line learning
Hardware Implementation
Today learning is usually in the data center (cloud)
– Big data
– Power hungry
– Network issues
– Data security
Edge computing (fog): novel hardware/algorithms needed
– Local to the sensor, real-time (e.g., 30 frames/s), reliable, low-power
– On-line, personalized learning with continuous data
Hardware Acceleration
Training / learning: computationally very expensive
– Involves many parallel operations (data fetch, matrix/vector products, etc.), not suited to a sequential architecture
– 1.83 minutes to process feature extraction of one HD image with an 8-core 3.4 GHz CPU, using sparse coding
10^3 – 10^5 speedup required to achieve real-time, on-line training of HD images at 30 frames/second
– Conventional hardware is inadequate
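As a quick check of that requirement using the numbers above: 1.83 minutes per HD image is about 110 s, while 30 frames/second leaves only about 33 ms per frame, so

speedup ≈ (1.83 × 60 s) / (1/30 s) ≈ 3.3 × 10^3

for a single stream; multiple streams, larger images, or repeated training passes push the requirement toward 10^5.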
GPU: 10 – 30X
FPGA: 10 – 50X
ASIC: 10^2 – 10^3X
Beyond CMOS: >10^3X
Resistive Cross-point Array
Analog memory to emulate the fully connected synapses
[Figure: sparse coding flow — original image → image patch X (100) → dictionary D (1000 x 100) → extracted feature Z (1000, sparse)]
[Figure: cross-point array — RRAM/SRAM cells R_ij as synapse weights, with CMOS periphery circuits for input/output neurons; input voltage V_i, output current I_j]
I_j = Σ_i (1/R_ij)·V_i
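A minimal software sketch of this read operation: each column current is the conductance-weighted sum of the row voltages, I_j = Σ_i (1/R_ij)·V_i. The array dimensions follow the sparse-coding example in the figure; the resistance and voltage ranges are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 100, 1000                          # image patch X (100) -> feature Z (1000)
R = rng.uniform(1e5, 1e6, size=(n_in, n_out))    # cell resistances R_ij (assumed range, ohms)
G = 1.0 / R                                      # conductances encode the dictionary D

V_in = rng.uniform(0.0, 0.5, size=n_in)          # row voltages encoding the input patch X

# Ohm's law + Kirchhoff's current law on each column: I_j = sum_i G_ij * V_i
I_out = G.T @ V_in
print(I_out.shape)                               # (1000,) -> one current per output neuron
```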
Synaptic Device
A multi-level memory cell to represent the synapse weight
CMOS option: multi-bit transposable SRAM

Metrics | Desired Targets | PCM | RRAM
Device Dimension | <10 nm | ~20 nm | ~10 nm
Programming Voltage | <1 V | <3 V | <3 V
Programming Speed | <1 μs | ~50 ns | ~10 ns
Energy Consumption | <10 fJ/spike | ~10 pJ/spike | ~100 fJ/spike
Multi-level States | >100 | ~100 | ~30
Dynamic Range | >5 | >100 | >100
RRAM: Switching Dynamics
On top of CMOS, at the cross point; non-volatile
Cell conductance (1/R, or G) represents the weight D
G is tuned by the write voltage V_w and the pulse number (timing)
Issues: variability, non-linearity, process integration
[S. H. Jo et al., Nano Letters 2009]
Circuits for the Algorithm
All cells are DC connected, different from the memory
The values of Z and X (or r) are represented by the number of voltage pulses; D by the RRAM conductance
[Figure: cross-point array — input neurons (X or r) drive the rows with V_r,i, output neurons (Z) drive the columns with V_Z,j; conductances G_ij store the dictionary D; read currents I_r,i and I_Z,j; separate read and write phases for r and Z]
Task | Operations
D·Z | I_r,i = Σ_j G_ij·V_Z,j
D^T·r | I_Z,j = Σ_i G_ij·V_r,i
D update | ΔG_ij = η·r_i·Z_j
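A software sketch of how these three tasks map onto the same conductance array (the pulse-count encoding of Z and r is abstracted into plain vectors; the sizes and the learning rate η are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_z = 100, 1000                             # input (X or r) and output (Z) neurons
G = rng.uniform(1e-6, 1e-5, size=(n_x, n_z))     # conductances G_ij storing the dictionary D

Z = rng.poisson(2.0, size=n_z).astype(float)     # output activity (number of pulses)
r = rng.normal(0.0, 1.0, size=n_x)               # residual at the input neurons

# D * Z : drive the columns with V_Z, read the row currents
I_r = G @ Z                                      # I_r,i = sum_j G_ij * V_Z,j

# D^T * r : drive the rows with V_r, read the column currents
I_Z = G.T @ r                                    # I_Z,j = sum_i G_ij * V_r,i

# D update : local outer-product rule on each cell
eta = 1e-3                                       # learning rate (assumed value)
G += eta * np.outer(r, Z)                        # delta G_ij = eta * r_i * Z_j
```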
Read: Integrate-and-Fire
A current-to-digital converter, operating as the integrate-and-fire neuron model
[Figure: read circuit — the read current I_r,i (or I_Z,j), 0 – 12 μA, is integrated on C_col (C_row); each threshold crossing of V_in fires V_spike, resets the node to V_reset, and increments an 8-bit spike counter; an ATB block uses counter bits Q[5] – Q[7]]
[Plots: V_in and V_spike waveforms vs. time (ns); number of output pulses vs. read current (0 – 12 μA), with and without ATB]
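A behavioral model of this read: the current is integrated on the column (row) capacitance, and every threshold crossing fires a spike, resets the node, and increments the counter, so the spike count digitizes the current. The capacitance, thresholds, and read window below are assumed values, not the circuit's actual design parameters.

```python
def integrate_and_fire(i_in, c_int=1e-12, v_th=0.53, v_reset=0.50,
                       t_read=50e-9, dt=1e-11):
    """Count output spikes for a constant read current (behavioral model).

    i_in    : read current I_r,i or I_Z,j in amperes (e.g., 0 - 12 uA)
    c_int   : integration capacitance C_col / C_row (assumed)
    v_th    : comparator threshold; v_reset : reset level (assumed)
    t_read  : read window; dt : simulation time step
    """
    v, spikes, t = v_reset, 0, 0.0
    while t < t_read:
        v += i_in * dt / c_int        # integrate the current on the capacitor
        if v >= v_th:                 # threshold crossing: fire and reset
            spikes += 1
            v = v_reset
        t += dt
    return spikes                     # digital output, e.g., an 8-bit spike count

# Larger read current -> more spikes in the same read window
print(integrate_and_fire(6e-6), integrate_and_fire(1e-6))   # e.g., 10 and 1
```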
Write: SRDP
Write RRAM through the spiking rate between input (X or r) and output (Z) neurons
– Z value sets the time window to write
– r value sets the pulse number (firing rate)
– Weight (D): 6 bits (64 levels); output (Z): 4 bits
– On/off ratio needs to be > 25
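A behavioral sketch of this rate-dependent write: the output value Z opens the write window, the input rate r determines how many programming pulses land inside it, and each pulse moves a quantized 6-bit conductance by one step. The time unit and rate scale are assumptions for illustration.

```python
LEVELS = 64                                # 6-bit weight: 64 conductance levels
Z_MAX = 15                                 # 4-bit output value Z

def srdp_write(g_level, r_rate, z_value, t_unit=10e-9, step=1):
    """Rate-dependent write of one RRAM cell (behavioral).

    g_level : current conductance level, 0 .. 63
    r_rate  : input firing rate in pulses/second (sets the pulse count)
    z_value : 0 .. 15, output value (sets the length of the write window)
    """
    window = min(z_value, Z_MAX) * t_unit          # Z opens the write window
    n_pulses = int(round(r_rate * window))         # r sets the pulses inside it
    return min(LEVELS - 1, g_level + step * n_pulses)

# Higher input rate or larger Z -> more pulses -> larger weight change
print(srdp_write(g_level=10, r_rate=1e8, z_value=8))    # -> 18
```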
[Figure: dictionary array with a dummy column of devices at minimum conductance; Z inputs drive the columns D_i-1·Z, D_i·Z, D_i+1·Z]
Solution: spatial redundancy (dummy column) to compensate for the non-zero off-state conductance
Realistic Device Properties (2)
[Plot: learning accuracy (40 – 90%) vs. number of training samples (10k – 60k) — realistic (with resistive synaptic device) vs. ideal (software)]
[Plot: ΔConductance (%) vs. number of writes (0 – 1000) — decay in RRAM write (habituation)]
Nonlinear, noisy, poor endurance (habituation in programming)
These hardware problems (variations, unreliable synapses) and performance demands (real-time, on-line learning, mobile operation) co-exist in biological cortical and sensory systems!
A bio-plausible solution: robust, low-power, accurate, on-line
[S. Yu, et al., IEDM 2015]
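The non-ideal behavior above can be captured with a simple behavioral model: each write pulse moves the conductance by a step that shrinks as the device approaches its maximum (non-linearity) and as the accumulated write count grows (habituation/decay), with cycle-to-cycle noise on top. The functional form and constants are assumptions for illustration, not the measured device model from the reference.

```python
import numpy as np

def write_pulse(g, n_writes, rng, g_min=1.0, g_max=100.0,
                step0=2.0, decay=5e-4, noise=0.05):
    """One potentiating write pulse on a non-ideal synaptic device (behavioral).

    g        : current conductance (arbitrary units, within the on/off range)
    n_writes : writes the cell has already seen (drives the habituation decay)
    """
    headroom = (g_max - g) / (g_max - g_min)       # non-linear: saturates near g_max
    fatigue = np.exp(-decay * n_writes)            # habituation: update decays with use
    dg = step0 * headroom * fatigue
    dg *= 1.0 + noise * rng.standard_normal()      # cycle-to-cycle write noise
    return float(np.clip(g + dg, g_min, g_max))

rng = np.random.default_rng(2)
g, trace = 1.0, []
for n in range(1000):
    g = write_pulse(g, n, rng)
    trace.append(g)
print(round(trace[99], 1), round(trace[999], 1))   # early vs. late conductance
```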
RHINO: A Biomimetic Solution
Inspired by the olfactory system in insects and a network motif that is common in biological processing
[Figure: insect olfactory system — Antennal Lobe (AL), Mushroom Body (MB) with ~15,000 Kenyon Cells (KCs), ~100 Lateral Horn Interneurons (LHIs)] [Nature Review, 2007]
Network Structure and Rules
Rewarding for associative (supervised) learning
Inhibition to speed up the formation of sparsity
Habituation (decay in learning rate) to achieve convergence
STDP/SRDP rules with rewarding to update the W's
Constructive role of noise and habituation
No global operations (normalization, etc.)
[Network: Input (X), 28 x 28 → Output (E), 2000; Inhibition (I), 100; Classifier (C); Reward]
Training Procedure
Initialization
– W_X2E and W_X2I are initialized randomly, with 50% connectivity; W_I2E is initialized uniformly
Training through global feedback from C, no local iteration
Training is full-image based, mainly feedforward
[Flow: Initialize → compute reward; train W_E2C → train excitation W_X2E and W_X2I → train inhibition W_I2E]
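A high-level sketch of this procedure: one mainly feedforward pass per image with a global reward from C, and no local iteration. Layer sizes and the 50% connectivity follow the slides; the rate-based updates below are simplified stand-ins for the STDP/SRDP rules, and the learning rate is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(3)
N_X, N_E, N_I, N_C = 28 * 28, 2000, 100, 10          # layer sizes from the slides

# Initialization: random W_X2E and W_X2I with 50% connectivity, uniform W_I2E
mask_XE = rng.random((N_X, N_E)) < 0.5
mask_XI = rng.random((N_X, N_I)) < 0.5
W = {
    "X2E": rng.random((N_X, N_E)) * mask_XE,
    "X2I": rng.random((N_X, N_I)) * mask_XI,
    "I2E": np.full((N_I, N_E), 0.1),
    "E2C": rng.random((N_E, N_C)) * 0.01,
}

def train_one_image(x_spikes, label, lr=1e-3):
    """One feedforward pass with global reward feedback (schematic)."""
    i_act = W["X2I"].T @ x_spikes                              # feedforward inhibition
    e_act = np.maximum(W["X2E"].T @ x_spikes - W["I2E"].T @ i_act, 0.0)
    c_out = W["E2C"].T @ e_act
    reward = 1.0 if np.argmax(c_out) == label else -1.0        # compute reward

    W["E2C"][:, label] += lr * reward * e_act                  # train W_E2C
    W["X2E"] += lr * reward * np.outer(x_spikes, e_act) * mask_XE   # train excitation
    W["X2I"] += lr * reward * np.outer(x_spikes, i_act) * mask_XI
    W["I2E"] += lr * np.outer(i_act, e_act)                    # train inhibition
    return reward

# One MNIST image, rate-coded as 0 - 50 spikes per pixel
x = rng.integers(0, 51, size=N_X).astype(float)
print(train_one_image(x, label=3))
```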
Demonstration: MNIST
MNIST for handwriting recognition
– Data represented by 0 – 50 spikes
– Full image, 28 x 28
– No pooling or normalization
– 50% connectivity of W_X2E and W_X2I
[Network: X: 28 x 28, E: 2000, I: 100, C: 10]
[Plot: vs. number of training images (0k – 60k), without inhibition vs. with inhibition]
[Plot: accuracy (82 – 96%) vs. number of training images (0 – 60k) — RHINO vs. sparse coding vs. no feedforward inhibition]
Neuron Firing Rate
Homeostatic balance, which controls overfiring of the output

Sparsity and Noise for Accuracy
Sparsity under thresholding: an appropriate range is necessary
Initial randomness: without noise, learning cannot start
Habituation, similar to the learning rate, is critical for convergence
[Plot: accuracy (%) vs. number of training images (k), for 2.5% – 20% of firing neurons]
[Plot: accuracy (75 – 90%) vs. number of training images (0K – 8K), for settings from 10% to 100%]
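One simple way to read "sparsity under thresholding" (an illustrative assumption about the mechanism, not code from the talk): the firing threshold of the E layer sets what fraction of output neurons spike, the quantity swept from 2.5% to 20% in the plot above.

```python
import numpy as np

rng = np.random.default_rng(4)
e_input = rng.normal(0.0, 1.0, size=2000)        # net input to the 2000 E neurons

def firing_fraction(threshold):
    """Fraction of E neurons whose input exceeds a global firing threshold."""
    return float(np.mean(e_input > threshold))

for th in (0.5, 1.0, 1.5, 2.0):
    print(th, round(100 * firing_fraction(th), 1), "% firing")   # ~31%, 16%, 7%, 2%
```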
Size Reduction
With 100 I neurons, the network size of E is reduced by 3X at the same accuracy of 95%
The mechanism is similar to the residual net [Microsoft, 2015]
[Figure: network size with inhibition (E + I) vs. without inhibition (E only); olfactory motif — AL drives KCs with excitation while LHIs provide feedforward inhibition]
Results Comparison

Reference | Input | Data format and precision | Learning rules | Number of neurons | Number of parameters | Number of images | Accuracy
Mushroom body | 28x28 | Spike | Rewarded STDP | 50000 | 5E5 | 60000 | 87%