Hardware Efficiency
in Neuromorphic Computing:
Devices, Circuits, and Algorithms
Yu (Kevin) Cao, School of Electrical, Computer and Energy Engineering
Arizona State University
Acknowledgement: Jae-sun Seo, Shimeng Yu, Sarma Vrudhula, Visar Berisha (ASU); Maxim Bazhenov (UCSD); Jieping Ye (UM)
2016 SIGDA DASS

Neuromorphic Computing On-a-chip: Challenges and Needs
Algorithm: Inhibition and Noise
– Motif of feedforward inhibition
– MNIST: >95% accuracy, 3X saving in network size
Summary
From Data to Information
Big data generated: useful if tagged and analyzed
Big gap in information analysis!
[Chart: fractions of big data generated that are tagged and analyzed; IDC, December 2012]
Success of Machine Learning
A top-down approach: better for digital IC
– Pros: mathematical, accurate, scalable
– Cons: big data, heavy computing, off-line learning
Hardware Implementation
Today learning is usually in the data center (cloud)
– Big data
– Power hungry
– Network issues
– Data security
Edge computing (fog): novel hardware/algorithms needed
– Local to the sensor, real-time (e.g., 30 frames/s), reliable, low-power
– On-line, personalized learning with continuous data
Hardware Acceleration
Training / learning: computationally very expensive
– Involves many parallel operations (data fetch, matrix/vector products, etc.), not suited to a sequential architecture
– 1.83 minutes to process feature extraction of one HD image with an 8-core 3.4 GHz CPU, using sparse coding
10^3 – 10^5 speedup required to achieve real-time, on-line training of HD images at 30 frames/second
– Conventional hardware is inadequate
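As a quick check of that requirement using the numbers above: 1.83 minutes per HD image is about 110 s, while 30 frames/second leaves only about 33 ms per frame, so

speedup ≈ (1.83 × 60 s) / (1/30 s) ≈ 3.3 × 10^3

for a single stream; multiple streams, larger images, or repeated training passes push the requirement toward 10^5.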
GPU: 10 – 30X
FPGA: 10 – 50X
ASIC: 10^2 – 10^3X
Beyond CMOS: >10^3X
Resistive Cross-point Array
Analog memory to emulate the fully connected synapses
[Figure: sparse coding flow — original image → image patch X (100) → dictionary D (1000 x 100) → extracted feature Z (1000, sparse)]
[Figure: cross-point array — RRAM/SRAM cells R_ij as synapse weights, with CMOS periphery circuits for input/output neurons; input voltage V_i, output current I_j]
I_j = Σ_i (1/R_ij)·V_i
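A minimal software sketch of this read operation: each column current is the conductance-weighted sum of the row voltages, I_j = Σ_i (1/R_ij)·V_i. The array dimensions follow the sparse-coding example in the figure; the resistance and voltage ranges are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 100, 1000                          # image patch X (100) -> feature Z (1000)
R = rng.uniform(1e5, 1e6, size=(n_in, n_out))    # cell resistances R_ij (assumed range, ohms)
G = 1.0 / R                                      # conductances encode the dictionary D

V_in = rng.uniform(0.0, 0.5, size=n_in)          # row voltages encoding the input patch X

# Ohm's law + Kirchhoff's current law on each column: I_j = sum_i G_ij * V_i
I_out = G.T @ V_in
print(I_out.shape)                               # (1000,) -> one current per output neuron
```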
Synaptic Device
A multi-level memory cell to represent the synapse weight
CMOS option: multi-bit transposable SRAM

Metrics | Desired Targets | PCM | RRAM
Device Dimension | <10 nm | ~20 nm | ~10 nm
Programming Voltage | <1 V | <3 V | <3 V
Programming Speed | <1 μs | ~50 ns | ~10 ns
Energy Consumption | <10 fJ/spike | ~10 pJ/spike | ~100 fJ/spike
Multi-level States | >100 | ~100 | ~30
Dynamic Range | >5 | >100 | >100
RRAM: Switching Dynamics
On top of CMOS, at the cross point; non-volatile
Cell conductance (1/R, or G) represents the weight D
G is tuned by the write voltage V_w and the pulse number (timing)
Issues: variability, non-linearity, process integration
[S. H. Jo et al., Nano Letters 2009]
Circuits for the Algorithm
All cells are DC connected, different from the memory
The values of Z and X (or r) are represented by the number of voltage pulses; D by the RRAM conductance
[Figure: cross-point array — input neurons (X or r) drive the rows with V_r,i, output neurons (Z) drive the columns with V_Z,j; conductances G_ij store the dictionary D; read currents I_r,i and I_Z,j; separate read and write phases for r and Z]
Task | Operations
D·Z | I_r,i = Σ_j G_ij·V_Z,j
D^T·r | I_Z,j = Σ_i G_ij·V_r,i
D update | ΔG_ij = η·r_i·Z_j
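A software sketch of how these three tasks map onto the same conductance array (the pulse-count encoding of Z and r is abstracted into plain vectors; the sizes and the learning rate η are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_z = 100, 1000                             # input (X or r) and output (Z) neurons
G = rng.uniform(1e-6, 1e-5, size=(n_x, n_z))     # conductances G_ij storing the dictionary D

Z = rng.poisson(2.0, size=n_z).astype(float)     # output activity (number of pulses)
r = rng.normal(0.0, 1.0, size=n_x)               # residual at the input neurons

# D * Z : drive the columns with V_Z, read the row currents
I_r = G @ Z                                      # I_r,i = sum_j G_ij * V_Z,j

# D^T * r : drive the rows with V_r, read the column currents
I_Z = G.T @ r                                    # I_Z,j = sum_i G_ij * V_r,i

# D update : local outer-product rule on each cell
eta = 1e-3                                       # learning rate (assumed value)
G += eta * np.outer(r, Z)                        # delta G_ij = eta * r_i * Z_j
```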
Read: Integrate-and-Fire
A current-to-digital converter, operating as the integrate-and-fire neuron model
[Figure: read circuit — the read current I_r,i (or I_Z,j), 0 – 12 μA, is integrated on C_col (C_row); each threshold crossing of V_in fires V_spike, resets the node to V_reset, and increments an 8-bit spike counter; an ATB block uses counter bits Q[5] – Q[7]]
[Plots: V_in and V_spike waveforms vs. time (ns); number of output pulses vs. read current (0 – 12 μA), with and without ATB]
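A behavioral model of this read: the current is integrated on the column (row) capacitance, and every threshold crossing fires a spike, resets the node, and increments the counter, so the spike count digitizes the current. The capacitance, thresholds, and read window below are assumed values, not the circuit's actual design parameters.

```python
def integrate_and_fire(i_in, c_int=1e-12, v_th=0.53, v_reset=0.50,
                       t_read=50e-9, dt=1e-11):
    """Count output spikes for a constant read current (behavioral model).

    i_in    : read current I_r,i or I_Z,j in amperes (e.g., 0 - 12 uA)
    c_int   : integration capacitance C_col / C_row (assumed)
    v_th    : comparator threshold; v_reset : reset level (assumed)
    t_read  : read window; dt : simulation time step
    """
    v, spikes, t = v_reset, 0, 0.0
    while t < t_read:
        v += i_in * dt / c_int        # integrate the current on the capacitor
        if v >= v_th:                 # threshold crossing: fire and reset
            spikes += 1
            v = v_reset
        t += dt
    return spikes                     # digital output, e.g., an 8-bit spike count

# Larger read current -> more spikes in the same read window
print(integrate_and_fire(6e-6), integrate_and_fire(1e-6))   # e.g., 10 and 1
```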
Write: SRDP
Write RRAM through the spiking rate between input (X or r) and output (Z) neurons
– Z value sets the time window to write
– r value sets the pulse number (firing rate)
– Weight (D): 6 bits (64 levels); output (Z): 4 bits
– On/off ratio needs to be > 25
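A behavioral sketch of this rate-dependent write: the output value Z opens the write window, the input rate r determines how many programming pulses land inside it, and each pulse moves a quantized 6-bit conductance by one step. The time unit and rate scale are assumptions for illustration.

```python
LEVELS = 64                                # 6-bit weight: 64 conductance levels
Z_MAX = 15                                 # 4-bit output value Z

def srdp_write(g_level, r_rate, z_value, t_unit=10e-9, step=1):
    """Rate-dependent write of one RRAM cell (behavioral).

    g_level : current conductance level, 0 .. 63
    r_rate  : input firing rate in pulses/second (sets the pulse count)
    z_value : 0 .. 15, output value (sets the length of the write window)
    """
    window = min(z_value, Z_MAX) * t_unit          # Z opens the write window
    n_pulses = int(round(r_rate * window))         # r sets the pulses inside it
    return min(LEVELS - 1, g_level + step * n_pulses)

# Higher input rate or larger Z -> more pulses -> larger weight change
print(srdp_write(g_level=10, r_rate=1e8, z_value=8))    # -> 18
```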
[Figure: dictionary array with a dummy column of devices at minimum conductance; Z inputs drive the columns D_i-1·Z, D_i·Z, D_i+1·Z]
Solution: spatial redundancy (dummy column) to compensate for the non-zero off-state conductance
Realistic Device Properties (2)
[Plot: learning accuracy (40 – 90%) vs. number of training samples (10k – 60k) — realistic (with resistive synaptic device) vs. ideal (software)]
[Plot: ΔConductance (%) vs. number of writes (0 – 1000) — decay in RRAM write (habituation)]
Nonlinear, noisy, poor endurance (habituation in programming)
These hardware problems (variations, unreliable synapses) and performance demands (real-time, on-line learning, mobile operation) co-exist in biological cortical and sensory systems!
A bio-plausible solution: robust, low-power, accurate, on-line
[S. Yu, et al., IEDM 2015]
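The non-ideal behavior above can be captured with a simple behavioral model: each write pulse moves the conductance by a step that shrinks as the device approaches its maximum (non-linearity) and as the accumulated write count grows (habituation/decay), with cycle-to-cycle noise on top. The functional form and constants are assumptions for illustration, not the measured device model from the reference.

```python
import numpy as np

def write_pulse(g, n_writes, rng, g_min=1.0, g_max=100.0,
                step0=2.0, decay=5e-4, noise=0.05):
    """One potentiating write pulse on a non-ideal synaptic device (behavioral).

    g        : current conductance (arbitrary units, within the on/off range)
    n_writes : writes the cell has already seen (drives the habituation decay)
    """
    headroom = (g_max - g) / (g_max - g_min)       # non-linear: saturates near g_max
    fatigue = np.exp(-decay * n_writes)            # habituation: update decays with use
    dg = step0 * headroom * fatigue
    dg *= 1.0 + noise * rng.standard_normal()      # cycle-to-cycle write noise
    return float(np.clip(g + dg, g_min, g_max))

rng = np.random.default_rng(2)
g, trace = 1.0, []
for n in range(1000):
    g = write_pulse(g, n, rng)
    trace.append(g)
print(round(trace[99], 1), round(trace[999], 1))   # early vs. late conductance
```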
RHINO: A Biomimetic Solution
Inspired by the olfactory system in insects and a network motif that is common in biological processing
[Figure: insect olfactory system — Antennal Lobe (AL), Mushroom Body (MB) with ~15,000 Kenyon Cells (KCs), ~100 Lateral Horn Interneurons (LHIs)] [Nature Review, 2007]
Network Structure and Rules
Rewarding for associative (supervised) learning
Inhibition to speed up the formation of sparsity
Habituation (decay in learning rate) to achieve convergence
STDP/SRDP rules with rewarding to update the W's
Constructive role of noise and habituation
No global operations (normalization, etc.)
[Network: Input (X), 28 x 28 → Output (E), 2000; Inhibition (I), 100; Classifier (C); Reward]
Training Procedure
Initialization
– W_X2E and W_X2I are initialized randomly, with 50% connectivity; W_I2E is initialized uniformly
Training through global feedback from C, no local iteration
Training is full-image based, mainly feedforward
[Flow: Initialize → compute reward; train W_E2C → train excitation W_X2E and W_X2I → train inhibition W_I2E]
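A high-level sketch of this procedure: one mainly feedforward pass per image with a global reward from C, and no local iteration. Layer sizes and the 50% connectivity follow the slides; the rate-based updates below are simplified stand-ins for the STDP/SRDP rules, and the learning rate is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(3)
N_X, N_E, N_I, N_C = 28 * 28, 2000, 100, 10          # layer sizes from the slides

# Initialization: random W_X2E and W_X2I with 50% connectivity, uniform W_I2E
mask_XE = rng.random((N_X, N_E)) < 0.5
mask_XI = rng.random((N_X, N_I)) < 0.5
W = {
    "X2E": rng.random((N_X, N_E)) * mask_XE,
    "X2I": rng.random((N_X, N_I)) * mask_XI,
    "I2E": np.full((N_I, N_E), 0.1),
    "E2C": rng.random((N_E, N_C)) * 0.01,
}

def train_one_image(x_spikes, label, lr=1e-3):
    """One feedforward pass with global reward feedback (schematic)."""
    i_act = W["X2I"].T @ x_spikes                              # feedforward inhibition
    e_act = np.maximum(W["X2E"].T @ x_spikes - W["I2E"].T @ i_act, 0.0)
    c_out = W["E2C"].T @ e_act
    reward = 1.0 if np.argmax(c_out) == label else -1.0        # compute reward

    W["E2C"][:, label] += lr * reward * e_act                  # train W_E2C
    W["X2E"] += lr * reward * np.outer(x_spikes, e_act) * mask_XE   # train excitation
    W["X2I"] += lr * reward * np.outer(x_spikes, i_act) * mask_XI
    W["I2E"] += lr * np.outer(i_act, e_act)                    # train inhibition
    return reward

# One MNIST image, rate-coded as 0 - 50 spikes per pixel
x = rng.integers(0, 51, size=N_X).astype(float)
print(train_one_image(x, label=3))
```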
Demonstration: MNIST
MNIST for handwriting recognition
– Data represented by 0 – 50 spikes
– Full image, 28 x 28
– No pooling or normalization
– 50% connectivity of W_X2E and W_X2I
[Network: X: 28 x 28, E: 2000, I: 100, C: 10]
[Plot: vs. number of training images (0k – 60k), without inhibition vs. with inhibition]
[Plot: accuracy (82 – 96%) vs. number of training images (0 – 60k) — RHINO vs. sparse coding vs. no feedforward inhibition]
Neuron Firing Rate
Homeostatic balance, which controls overfiring of the output

Sparsity and Noise for Accuracy
Sparsity under thresholding: an appropriate range is necessary
Initial randomness: without noise, learning cannot start
Habituation, similar to the learning rate, is critical for convergence
[Plot: accuracy (%) vs. number of training images (k), for 2.5% – 20% of firing neurons]
[Plot: accuracy (75 – 90%) vs. number of training images (0K – 8K), for settings from 10% to 100%]
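One simple way to read "sparsity under thresholding" (an illustrative assumption about the mechanism, not code from the talk): the firing threshold of the E layer sets what fraction of output neurons spike, the quantity swept from 2.5% to 20% in the plot above.

```python
import numpy as np

rng = np.random.default_rng(4)
e_input = rng.normal(0.0, 1.0, size=2000)        # net input to the 2000 E neurons

def firing_fraction(threshold):
    """Fraction of E neurons whose input exceeds a global firing threshold."""
    return float(np.mean(e_input > threshold))

for th in (0.5, 1.0, 1.5, 2.0):
    print(th, round(100 * firing_fraction(th), 1), "% firing")   # ~31%, 16%, 7%, 2%
```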
Size Reduction
With 100 I neurons, the network size of E is reduced by 3X at the same accuracy of 95%
The mechanism is similar to the residual net [Microsoft, 2015]
[Figure: network size with inhibition (E + I) vs. without inhibition (E only); olfactory motif — AL drives KCs with excitation while LHIs provide feedforward inhibition]
Results Comparison

Reference | Input | Data format and precision | Learning rules | Number of neurons | Number of parameters | Number of images | Accuracy
Mushroom body | 28x28 | Spike | Rewarded STDP | 50000 | 5E5 | 60000 | 87%