Page 1
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
Angshuman Parashar, Minsoo Rhu*, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally
* Now at POSTECH (Pohang University of Science and Technology), South Korea
Page 2
© NVIDIA 2017
Motivation
Page 3
Convolution (dense)
Inner product between every input pixel and every filter coefficient
Sliding window is intuitive and maps to a reasonable hardware implementation
[Figure: Input activation maps * Filters (Weights) = Output activation maps]
Page 4
Convolution (dense)
Inner product between every input pixel and every filter coefficient
Sliding window is intuitive and maps to a reasonable hardware implementation
[Figure: sliding-window step; highlighted entries mark useless MUL ops]
Page 8
Convolution with Sparsity
Most operand values are zero
Static sparsity: pruned network weights are set to '0' during training*
[Figure: sparse weights * input activations = ?]
* Han et al., "Learning Both Weights and Connections for Efficient Neural Network", NIPS 2015
Page 9
Convolution with Sparsity
Most operand values are zero
Dynamic sparsity: negative-valued activations are clamped to '0' (ReLU) during inference*
[Figure: sparse input activations * weights = ?]
* Albericio et al., "CNVLUTIN: Ineffectual-Neuron-Free Deep Neural Network Computing", ISCA 2016
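The two sparsity sources above can be sketched in NumPy. This is a minimal illustration: the pruning threshold and tensor shapes are made up, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Static sparsity: magnitude-based pruning zeroes small weights during
# training (the 0.8 threshold here is purely illustrative).
weights = rng.standard_normal((3, 3, 64))
weights[np.abs(weights) < 0.8] = 0.0

# Dynamic sparsity: ReLU clamps negative activations to zero at inference.
activations = np.maximum(rng.standard_normal((56, 56, 64)), 0.0)

print("weight density:    ", np.count_nonzero(weights) / weights.size)
print("activation density:", np.count_nonzero(activations) / activations.size)
```

Both mechanisms leave a large fraction of zero operands, which the dense sliding-window dataflow still multiplies.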
Page 10
Convolution with Sparsity
Most operand values are zero
Sliding-window-based convolution is wasteful
Fraction of non-zero (NZ) activations and weights is roughly 20-50% per layer
[Chart: per-layer density of activations and weights for VGGNet; y-axis density, 0 to 1.2]
Page 11
Convolution with Sparsity
Most operand values are zero
Sliding-window-based convolution is wasteful
Fraction of non-zero (NZ) activations and weights is roughly 20-50% per layer
[Figure: sparse activations (dynamic sparsity) * sparse weights (static sparsity) = ?]
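The waste can be estimated with a quick back-of-the-envelope calculation: a multiply is useful only when both operands are non-zero, so (assuming uncorrelated zeros, which is an approximation) the useful fraction is roughly the product of the two densities.

```python
# Useful-MUL fraction for a few density points in the slides' 20-50% range.
for d_act, d_wgt in [(0.5, 0.5), (0.4, 0.3), (0.2, 0.2)]:
    useful = d_act * d_wgt
    print(f"act {d_act:.0%} x wgt {d_wgt:.0%} -> "
          f"{useful:.0%} useful, {1 - useful:.0%} wasted MULs")
```

Even at the dense end of the range, roughly three quarters of the multiplies in a naive dataflow are wasted.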
Page 12
Convolution with Sparsity
Most operand values are zero
Sliding-window-based convolution is wasteful
Fraction of non-zero (NZ) activations and weights is roughly 20-50% per layer
[Figure: sparse sliding-window convolution; highlighted entries mark useless MUL ops]
Page 16
Motivation
CNN inference is often performed in power-limited environments
Our goal: a sparsity-optimized CNN accelerator for high energy efficiency
Page 17
Possible Solutions
Page 18
Option 1: Leverage Dense CNN Design
Employ a pair of bit-masks to track non-zero weights and activations
[Figure: NZ activations * NZ weights (3x3 sliding window) = ?]
Page 21
Option 1: Leverage Dense CNN Design
Employ a pair of bit-masks to track non-zero weights and activations
[Figure, CLK#0: the NZ bitmasks for activations and weights are ANDed per lane; 2 of the 4 vector ALU slots perform useful MULs]
Page 22
Option 1: Leverage Dense CNN Design
Employ a pair of bit-masks to track non-zero weights and activations
[Figure, CLK#1: only 1 of the 4 vector ALU slots performs a useful MUL]
Page 23
Option 1: Leverage Dense CNN Design
Employ a pair of bit-masks to track non-zero weights and activations
[Figure, CLK#2: 0 of the 4 vector ALU slots performs a useful MUL]
Challenge: find enough useful MULs to fully populate the vector ALUs
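The bit-mask scheme above can be sketched for a single vector cycle. The mask values are illustrative, not taken from the figure.

```python
# One 4-wide vector ALU cycle; a '1' marks a non-zero operand in that lane.
act_mask = [1, 1, 0, 1]
wgt_mask = [1, 0, 0, 1]

# Per-lane AND gate: a MUL is useful only when BOTH operands are non-zero.
useful = [a & w for a, w in zip(act_mask, wgt_mask)]
print(f"{sum(useful)} useful MULs out of {len(useful)} ALU slots")
```

The masks only tell us which lanes are wasted; they do not by themselves supply extra useful work to refill the idle lanes, which is the challenge noted above.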
Page 25
SCNN Intuition & Approach
Page 26
26© NVIDIA 2017
Intuition behind SCNNForget the sliding windows based convolution
All NZ activations must (at some point in time) be multiplied by all NZ weights
Holds true for convolution stride ‘1’
* =
y
z
x
?
Page 33
Intuition behind SCNN
Forget sliding-window-based convolution
All NZ activations must (at some point in time) be multiplied by all NZ weights
Holds true for convolution stride '1'
[Figure: each NZ weight (x, then y, then z) is multiplied by every NZ activation, scattering its products across the output map]
Page 36
The SCNN approach
Cartesian-product (i.e., all-to-all) based convolution
Assuming a convolution stride of '1':
Minimum # of MULs: Cartesian product of the NZ activations and the NZ weights
[Figure: NZ weights (x, y, z) * NZ activations (a, b, c, d, e, f) = ?]
Page 37
The SCNN approach
Cartesian-product (i.e., all-to-all) based convolution
Assuming a convolution stride of '1':
Minimum # of MULs: Cartesian product of the NZ activations and the NZ weights
[Figure: weights compressed to the NZ values (x, y, z); activations compressed to the NZ values (a, b, c, d, e, f)]
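One possible compressed-sparse encoding, sketched with hypothetical values: keep only the NZ entries plus the coordinates needed to place each one (this is an illustration, not the exact on-chip format).

```python
import numpy as np

# A toy sparse tensor and its compressed form.
dense = np.array([[0, 3, 0],
                  [2, 0, 0],
                  [0, 0, 5]])

coords = np.argwhere(dense)        # (row, col) of each NZ entry, row-major
values = dense[dense != 0]         # compressed NZ values, same order

print(values.tolist())             # [3, 2, 5]
print(coords.tolist())             # [[0, 1], [1, 0], [2, 2]]
```

The frontend then needs only the short `values` stream to drive the multipliers, with `coords` used later to place each product.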
Page 38
The SCNN approach
Cartesian-product (i.e., all-to-all) based convolution
Assuming a convolution stride of '1':
Minimum # of MULs: Cartesian product of the NZ activations and the NZ weights
[Figure: the PE frontend multiplies all pairs (x*a, y*a, z*a, x*b, y*b, z*b, ...); a scatter network routes each product to the PE backend for accumulation]
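The Cartesian-product dataflow can be checked against a sliding-window reference in a few lines of NumPy. This is a functional sketch of the frontend/backend split under assumed unit stride, not the hardware.

```python
import numpy as np

def conv2d_dense(act, wgt):
    """Reference unit-stride sliding-window convolution ('valid' extent)."""
    H, W = act.shape
    R, S = wgt.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(act[i:i + R, j:j + S] * wgt)
    return out

def conv2d_cartesian(act, wgt):
    """SCNN-style sketch: multiply the Cartesian product of NZ activations
    and NZ weights, then scatter each product to the output coordinate it
    feeds (products falling outside the 'valid' extent are dropped)."""
    H, W = act.shape
    R, S = wgt.shape
    out = np.zeros((H - R + 1, W - S + 1))
    nz_act = list(zip(*np.nonzero(act)))     # coordinates of NZ activations
    nz_wgt = list(zip(*np.nonzero(wgt)))     # coordinates of NZ weights
    for x, y in nz_act:                      # all-to-all multiply (frontend)
        for r, s in nz_wgt:
            i, j = x - r, y - s              # output coordinate of product
            if 0 <= i < out.shape[0] and 0 <= j < out.shape[1]:
                out[i, j] += act[x, y] * wgt[r, s]   # accumulate (backend)
    return out

rng = np.random.default_rng(1)
act = rng.standard_normal((6, 6)) * (rng.random((6, 6)) < 0.4)  # ~40% dense
wgt = rng.standard_normal((3, 3)) * (rng.random((3, 3)) < 0.4)
print(np.allclose(conv2d_dense(act, wgt), conv2d_cartesian(act, wgt)))  # True
```

Note that the multiply loop touches only NZ operands: the MUL count equals the size of the Cartesian product of the two compressed streams.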
Page 39
SCNN Architecture
Page 40
SCNN Architecture
2D spatially arranged array of processing elements (PEs)
Page 41
SCNN Architecture
Workload distribution
Weights are broadcast to all PEs
Every PE holds a copy of all the NZ weights of the CNN model
Page 42
SCNN Architecture
Workload distribution
Each PE is allocated a partial volume of the input and output activations
Input and output activations stay local to the PE
[Figure: input and output activation maps tiled across PE-0..PE-3; Filters (Weights) are shared by all PEs]
Page 43
SCNN Architecture
Halo resolution
Output halos: PE-0 computes (A x B), but the result must be accumulated in PE-1 (at X)
[Figure: activation A lies in PE-0's input tile and weight B is broadcast, yet the output coordinate X falls in PE-1's output tile]
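Output halos can be illustrated with a toy coordinate-ownership model. The 2x2 PE grid matches the figure, but the tile size and the coordinates are made up, so the particular PE pair differs from the slide's (A x B) example.

```python
# Each PE owns a TILE x TILE region of the coordinate space (illustrative).
TILE = 4

def owner_pe(i, j):
    """PE index (0..3) that owns coordinate (i, j) in a 2x2 PE grid."""
    return (i // TILE) * 2 + (j // TILE)

# Weights are broadcast, so the PE holding an NZ activation computes the
# product -- but the product's output coordinate (x - r, y - s) can land in
# a neighboring PE's output tile when the activation sits near a tile edge.
x, y = 4, 1        # NZ activation held by PE-2
r, s = 2, 0        # NZ weight coordinate (same copy in every PE)
i, j = x - r, y - s
print(f"PE-{owner_pe(x, y)} computes the product, "
      f"PE-{owner_pe(i, j)} must accumulate it")
```

Products whose computing PE and owning PE differ are exactly the halo traffic that the inter-PE channel on the next slide resolves.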
Page 44
SCNN Architecture
Inter-PE communication channel for halo resolution
[Figure: PE with weight, IA, and OA buffers; links to the NE and SE neighbors carry the halo-resolution traffic]
Page 45
SCNN PE microarchitecture
(Compressed-sparse frontend) + (scattered-dense backend)
[Figure: PE datapath; the frontend consumes the compressed input-activation (I) and filter (F) streams, and the backend accumulates the scattered products]
Page 47
Evaluation
Page 48
Evaluation
Methodology
Network models and input activations: trained → pruned → retrained models using Caffe
Area & power: SystemC → Catapult HLS → Verilog RTL → synthesis of an SCNN PE
Performance & energy:
Performance model for cycle-level simulation of SCNN
Analytical model for design-space exploration (dataflows, sparse vs. dense)
Page 49
Evaluation
Architecture configurations
DCNN: operates solely on dense weights and activations
DCNN-opt: DCNN with (de)compression of activations + ALU power-gating
SCNN: sparsity-optimized CNN accelerator
Page 50
Performance
Dense vs. sparse
[Chart: per-layer performance comparison for VGGNet]
* Network-wide improvement over DCNN: AlexNet (2.37x), GoogLeNet (2.19x)
Page 51
Energy consumption
Dense vs. sparse
[Chart: per-layer energy comparison for VGGNet]
* Network-wide improvement over DCNN: AlexNet (2.13x), GoogLeNet (2.51x)
Page 52
Related Work
Qualitative comparison to prior work

Architecture              Sparse optimizations               Convolution dataflow
                          Weights         Activations
DaDianNao [ASPLOS '14]    -               -                  (Variant of) sliding window
Eyeriss [ISCA '16]        -               Power-gating       (Variant of) sliding window
CNVLUTIN [ISCA '16]       -               Zero-skipping      (Variant of) sliding window
Cambricon-X [MICRO '16]   Zero-skipping   -                  (Variant of) sliding window
SCNN                      Zero-skipping   Zero-skipping      Cartesian product
Page 53
Follow-up questions?
Contact the authors
Technical leads:
SCNN architecture & sparse models: Minsoo Rhu
TimeLoop (CNN analytical model): Angshuman Parashar
Power & area modeling: Rangharajan Venkatesan
Page 54
Conclusion
SCNN: a compressed-sparse CNN accelerator
Novel Cartesian-product-based convolution operation
Simple/compressed/sparse PE frontend
Scatter/dense PE backend
Superior performance and energy efficiency
On average, 2.7x higher performance than a dense CNN architecture
On average, 2.3x higher energy efficiency than a dense CNN architecture