Scaling Deep Neural Network Accelerator Performance
Pierre Paulin, Director of R&D
18 February 2020
CMC AI Workshop
© 2019 Synopsys, Inc. 2
Outline
• Deep Neural Network Trends
• EV7x Processor and DNN Engine Overview
– Specialized DNN accelerator
– Local optimization of data movement
– Local data compression of coefficients and feature-maps
• Advanced Bandwidth Optimization Techniques
– DMA broadcast of coefficients and feature-maps
– Multi-level layer fusion
– Multi-level tiling across memory hierarchy
Deep Neural Network Trends
Trends in Convolutional Neural Network Topologies
Trend 1: Reduced Computational Requirements
Trend 2: Reduced Model Size
Trend 3: Reduced Data Reuse and Parallelism
Trend 4: Feature-map Bandwidth Becomes Dominant
Examples: MobileNet, DenseNet
Trend 1: Reduced Computational Requirements
[Chart: nearly 100X reduction in computational requirements]
Trend 2: Reduced Model Sizes
[Four scatter plots (2012, 2014, 2016, 2018): Top-1 accuracy (%, 50–85) vs. model size (million weights, log scale) – newer models reach comparable accuracy with far fewer weights]
Trend 3: Reduced Data Reuse and Parallelism
Example: Depthwise Separable Kernels used in MobileNet V2
[Diagram: a traditional 3x3 convolution (64→64 channels) vs. a depthwise separable block – Conv 1x1 (64→256), DW Conv 3x3, Conv 1x1 (256→64), with a residual add]
Traditional convolution: high computation, high data reuse, high parallelism
Depthwise separable convolution: low computation, low data reuse, low parallelism
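The compute gap between the two structures above can be made concrete with a toy MAC count. This is an illustrative sketch; the feature-map size and channel counts are assumptions for the example, not EV7x specifics:

```python
def conv2d_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution
    (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise k x k convolution followed by a
    pointwise 1x1 convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes channels
    return depthwise + pointwise

# Hypothetical layer: 56x56 feature map, 64 -> 64 channels, 3x3 kernel
std = conv2d_macs(56, 56, 64, 64, 3)
sep = depthwise_separable_macs(56, 56, 64, 64, 3)
print(round(std / sep, 2))  # 7.89 -- nearly 8x fewer MACs
```

The flip side of fewer MACs is that each coefficient and activation fetched from memory is reused fewer times, which is exactly why bandwidth, not compute, becomes the bottleneck.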
Trend 4: Feature-map Bandwidth Becomes Dominant
Example: DenseNet and Multilayer DenseNet
[Diagram: within one dense block, each layer receives the concatenated outputs of all preceding layers, unlike a traditional sequential stack]
More connections between layers → more bandwidth for feature-maps
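A small counting model shows why dense connectivity inflates feature-map traffic: each layer reads everything produced before it. The block shape below (6 layers, 64 input channels, growth rate 32) is a hypothetical example, not taken from the slide:

```python
def dense_block_fmap_reads(num_layers, k0, growth):
    """Total feature-map channels read inside one dense block.
    Layer i consumes the concatenation of the block input and all
    previous layer outputs: k0 + i * growth channels."""
    return sum(k0 + i * growth for i in range(num_layers))

print(dense_block_fmap_reads(6, 64, 32))  # 864 channels read
# A plain sequential stack of 6 layers, each reading only its
# predecessor's 64 channels, would read just 6 * 64 = 384.
```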
Scaling Performance with Bandwidth Constraints
[Chart: achievable TOPS (10–100) vs. number of processing units (1 PU to N PUs), bounded by DRAM bandwidth – LPDDR4 16 GB/s, LPDDR5 22.4 GB/s, HBM2 (50%) 128 GB/s]
• Bandwidth reduction has a direct impact on performance and power
• Over 50% of SoC power is DRAM access
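The chart above can be read as a simple roofline-style bound: sustained throughput is the lesser of the compute peak and what DRAM can feed. This is a sketch model; the arithmetic-intensity figure (MACs per DRAM byte) is an assumption for illustration:

```python
def sustained_tops(peak_tops, bandwidth_gbs, macs_per_byte):
    """Roofline-style bound on sustained throughput (sketch model).
    1 MAC = 2 ops, so bandwidth * intensity * 2 gives GOP/s;
    divide by 1000 to get TOPS."""
    memory_bound_tops = bandwidth_gbs * macs_per_byte * 2 / 1000
    return min(peak_tops, memory_bound_tops)

# Hypothetical 35-TOPS engine at an assumed 100 MACs per DRAM byte:
print(sustained_tops(35.0, 16.0, 100))   # 3.2  -- LPDDR4-bound
print(sustained_tops(35.0, 128.0, 100))  # 25.6 -- HBM2-class bandwidth
```

Under this model, adding processing units past the memory-bound point buys nothing, which is why the bandwidth-reduction techniques in the rest of the deck translate directly into usable TOPS.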
Embedded Vision Processor Outline
• Deep Neural Network Trends
– Accuracy and Functionality
• EV7x Processor Family Overview
• DNN Engine
– Specialized DNN accelerator
– Local optimization of data movement
– Local data compression of coefficients and feature-maps
• Advanced Bandwidth Optimization Techniques
– DMA broadcast of coefficients and feature-maps
– Multi-level layer fusion
– Multi-level tiling across memory hierarchy
EV7x Processor and DNN Engine Overview
EV7x Vision Processor IP with 35 TOPS Performance
• Addresses market requirements for full range of
vision applications: always-on IoT, augmented
reality, autonomous driving…
• Faster neural network accelerator executes all
graphs, including the latest, most complex graphs
• Enhanced vision engine for low-power, high-performance
vision, Simultaneous Localization and Mapping (SLAM)
and DSP algorithms
• Architectural changes and power gating techniques
reduce power consumption
• High-bandwidth encryption protects coefficients
and biometric data
• Automatic graph partitioning using MetaWare EV
for improved performance, bandwidth, latency
14,080 MAC Engine Made Possible with Better Utilization, Bandwidth & Power
[Block diagram: Synopsys DesignWare ARC EV7x Processor –
Vision Engine: 1, 2 or 4 VPU configurations; each VPU has a 512-bit vector DSP, a 32-bit scalar unit, a VFPU, and VCCM/cache
DNN Accelerator: 880 to 14,080 MAC configurations; 2D convolutions, fully connected layers, activations
Shared infrastructure: shared memory, closely coupled memories, DMA, coherency, AXI interfaces, AES encryption, trace, power management, sync & debug]
[MetaWare EV Development Toolkit: OpenCL™ C and C/C++ development tools; OpenCV and OpenVX™ libraries & runtime; simulators and virtual platforms; DNN mapping tools]
EV6x/7x Scalable DNN Engine for
Deep Learning-based Vision
- High performance, low power and low area
- Fully programmable
[Image: semantic segmentation example – pixels labeled car, sky, building]
DNN Accelerator Supports Up to 35 TOPS For All DNN Applications
• Deep Neural Network Engine supports
– Convolutional Neural Networks (CNN)
– Batched Recurrent Neural Networks (RNN)
• EV7x max performance
– Up to 14,080 multiply-accumulators per engine
• Improved utilization increases MAC efficiency
– Higher MAC utilization for 1x1 and 3x3
convolutions
– Increased support for non-linear functions
(PReLU, ReLU6, Maxout, Sigmoid, Tanh, …)
• Architectural enhancements improve
bandwidth, accuracy and power
0.1 to 35 TOPS to Address All Vision Applications
[Block diagram: DNN Accelerator, 880 to 14,080 MAC configurations, with DMA and closely coupled memories; executes 2D convolutions, fully connected layers and activations]
Graph Mapping: Support of Multiple CNN Frameworks
• Support new graph
frameworks via ONNX-based
interoperability
– ONNX export utilities being
made available for numerous
frameworks
• Neutral intermediate representation
– Integrates the union of Caffe,
Tensorflow and ONNX features
[Diagram: CNN Graph Mapping Tool – Caffe, Tensorflow, ONNX and future framework features converge on a common EV High Level I/R, which is mapped onto the EV Processor (vision CPU + DNN engine)]
Preliminary – Subject to Change
The Bandwidth Challenge
Scaling Performance with Bandwidth Constraints: 1 cluster to N clusters
[Chart: achievable TOPS vs. cluster count, bounded by DRAM bandwidth – LPDDR4 16 GB/s, LPDDR5 22.4 GB/s, HBM2 128 GB/s]
• Bandwidth reduction has a direct impact on performance and power
• Over 50% of SoC power is DRAM access
EV DNN Bandwidth Improvement Features
• Coefficient Pruning
– Coefficients with a zero value are skipped/counted
• Feature Map Compression
– Lossless runtime compression and decompression
of feature maps to external memory
• Multi-level Layer Fusion
– Merging multiple folded layers into single
primitives reduces feature-map bandwidth
• Optimized Handling of Coefficients and Feature-maps
– Sharing of common data across slices to minimize
the bandwidth of coefficient and feature-map loading
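The feature-map compression feature can be illustrated with a minimal lossless zero-run encoder: post-ReLU activations are often sparse, so collapsing runs of zeros cuts external-memory traffic without losing information. This is a sketch of the idea only, not the actual EV hardware scheme:

```python
def zero_run_length_encode(values):
    """Lossless sketch of feature-map compression: non-zero values
    are kept as literals, runs of zeros collapse to a (0, run_length)
    pair. Real hardware formats differ; this only shows the idea."""
    out, i = [], 0
    while i < len(values):
        if values[i] == 0:
            run = 0
            while i < len(values) and values[i] == 0:
                run += 1
                i += 1
            out.append((0, run))
        else:
            out.append(values[i])
            i += 1
    return out

fmap = [5, 0, 0, 0, 7, 0, 0, 3]      # post-ReLU maps are often sparse
print(zero_run_length_encode(fmap))  # [5, (0, 3), 7, (0, 2), 3]
```

Decompression is the exact inverse, so the scheme is lossless; the compression ratio scales with activation sparsity.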
DNN Engine
[Diagram: a Conv 3x3 layer split across four slices; each slice contains a convolution unit (2D conv ALU, AGUs, CC MEMs) and a classification unit (1D conv ALU, AGUs, CC MEMs)]
Feature-map partitioning: split each layer over multiple slices
• Higher throughput – up to 4X
• Lower latency – up to 4X – due to parallel processing of a layer
• Significant bandwidth reduction
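Splitting one layer over multiple slices can be sketched as a row partition with halo rows, so each slice computes its share of a 3x3 convolution's output independently. The partitioning scheme and helper below are hypothetical, for illustration only:

```python
def partition_rows(height, num_slices, halo=1):
    """Split a feature map's rows across slices for a 3x3 convolution.
    Each slice needs `halo` extra rows at interior tile edges so it
    can produce its output rows without talking to its neighbours
    (hypothetical partitioning scheme)."""
    base = height // num_slices
    tiles = []
    for s in range(num_slices):
        start = s * base
        stop = height if s == num_slices - 1 else (s + 1) * base
        # extend by the halo, clamped to the feature-map boundary
        tiles.append((max(0, start - halo), min(height, stop + halo)))
    return tiles

print(partition_rows(224, 4))  # [(0, 57), (55, 113), (111, 169), (167, 224)]
```

Because the halo overlap is only one row per edge for a 3x3 kernel, the extra traffic is small relative to the 4X parallelism gained.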
Summary
• Opposing CNN Graph Trends
– Reduced compute requirements and model size
– Reduced data reuse and parallelism
– Feature-map bandwidth becomes dominant
• Synopsys DNN Engine
– Specialized DNN accelerator
– Local optimization of data movement
– Local data compression of coefficients and feature-maps
• Advanced Bandwidth Optimization Techniques
– Optimized handling of coefficients and feature-maps
– Multi-level layer fusion and tiling
• Improved scalability, lower power
– 10 TOPS/W (7 nm)
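The layer-fusion point in the summary can be quantified with a toy traffic model: when two back-to-back layers are fused, the intermediate feature map stays in on-chip memory instead of making a round trip to DRAM. The shapes and the 1-byte-per-activation figure are illustrative assumptions:

```python
def fused_dram_traffic_bytes(h, w, c_in, c_mid, c_out, fused):
    """DRAM feature-map traffic for two back-to-back conv layers,
    assuming 1 byte per activation (sketch model). Unfused, the
    intermediate map is written out and read back; fused, it stays
    in on-chip memory."""
    in_bytes = h * w * c_in
    mid_bytes = h * w * c_mid
    out_bytes = h * w * c_out
    if fused:
        return in_bytes + out_bytes
    return in_bytes + 2 * mid_bytes + out_bytes  # write + read of mid

# Hypothetical pair: 112x112 maps, 32 -> 192 -> 32 channels
print(fused_dram_traffic_bytes(112, 112, 32, 192, 32, fused=False))
print(fused_dram_traffic_bytes(112, 112, 32, 192, 32, fused=True))
# 7x less DRAM feature-map traffic when fused
```

The saving grows with the width of the intermediate layer, which is why fusion pays off most on expansion-heavy blocks like those in MobileNet V2.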
Thank You