Scaling Deep Neural Network Accelerator Performance

Pierre Paulin, Director of R&D
18 February 2020, CMC AI Workshop
Transcript
Page 1:

Pierre Paulin, Director of R&D

18 February 2020

CMC AI Workshop

Scaling Deep Neural Network Accelerator Performance

Page 2:

Outline

• Deep Neural Network Trends

• EV7x Processor and DNN Engine Overview

– Specialized DNN accelerator

– Local optimization of data movement

– Local data compression of coefficient and feature-maps

• Advanced Bandwidth Optimization Techniques

– DMA broadcast of coefficients and feature-maps

– Multi-level layer fusion

– Multi-level tiling across memory hierarchy


Page 3:

Deep Neural Network Trends

Page 4:

Trends in Convolutional Neural Network Topologies

Trend 1: Reduced Computational Requirements

Trend 2: Reduced Model Size

Trend 3: Reduced Data Reuse and Parallelism

Trend 4: Feature-map Bandwidth Becomes Dominant

Examples: MobileNet, DenseNet

Page 5:

Trend 1: Reduced Computational Requirements

[Chart: nearly 100X reduction in computational requirements.]

Page 6:

Trend 2: Reduced Model Sizes

[Four scatter plots, one per year (2012, 2014, 2016, 2018), each plotting Top-1 accuracy (%, 50 to 85) against model size (million weights, log scale up to 100); over the years, comparable accuracy is reached with far fewer weights.]

Page 7:

Trend 3: Reduced Data Reuse and Parallelism

Example: Depthwise Separable Kernels used in MobileNet V2

[Diagram, two panels: "Traditional 1x1 Convolution" (High Computation, High Data Reuse, High Parallelism) and "Depth-wise Separable 3x3 Convolution" built from Conv 1x1 → DW Conv 3x3 → Conv 1x1 with a residual add, channel widths 64 → 256 → 256 → 64 (Low Computation, Low Data Reuse, Low Parallelism).]
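To put rough numbers on this, here is a small Python sketch comparing MAC counts for a plain 3x3 convolution at the block's internal width against the MobileNet V2-style separable bottleneck; the 56x56 spatial size is an illustrative assumption, while the channel widths follow the diagram.

```python
# Sketch: MAC counts, standard 3x3 convolution vs. depthwise separable
# bottleneck (Conv 1x1 -> DW Conv 3x3 -> Conv 1x1), MobileNet V2 style.
# The 56x56 feature-map size is an assumption; channels follow the slide.

def macs_standard_conv(h, w, c_in, c_out, k=3):
    # Each output pixel of each of the c_out filters needs k*k*c_in MACs.
    return h * w * c_out * (k * k * c_in)

def macs_separable_block(h, w, c_in, c_mid, c_out, k=3):
    expand  = h * w * c_mid * c_in      # 1x1 expansion convolution
    dw      = h * w * c_mid * (k * k)   # 3x3 depthwise: one filter per channel
    project = h * w * c_out * c_mid     # 1x1 projection convolution
    return expand + dw + project

h = w = 56
std = macs_standard_conv(h, w, c_in=256, c_out=256)             # ~1.85G MACs
sep = macs_separable_block(h, w, c_in=64, c_mid=256, c_out=64)  # ~110M MACs
print(f"standard 3x3: {std:,} MACs")
print(f"separable:    {sep:,} MACs ({std / sep:.1f}x fewer)")
```

Note where the separable block's cost sits: the two 1x1 convolutions dominate and the 3x3 depthwise stage is almost free, which is why these graphs offer far less data reuse and parallelism for a wide MAC array to exploit.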

Page 8:

Trend 4: Feature-map Bandwidth Becomes Dominant

Example: DenseNet and Multilayer DenseNet

[Diagram: connections within one dense block, contrasted with a traditional layer chain.]

More Connections between Layers → More Bandwidth for Feature-maps
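Counting feature-map reads makes the bandwidth effect concrete; a toy Python sketch, where the unit "one layer's output map" is an illustrative simplification:

```python
# Sketch: relative feature-map read traffic, plain chain vs. dense block.
# Counts are in units of "one layer's output feature map"; illustrative only.

def chain_reads(num_layers):
    # A plain chain: each layer reads only its immediate predecessor.
    return num_layers

def dense_reads(num_layers):
    # A dense block: layer i reads the concatenation of all i earlier
    # outputs, so total reads grow as num_layers * (num_layers + 1) / 2.
    return sum(i for i in range(1, num_layers + 1))

for n in (4, 8, 12):
    print(f"{n:2d} layers: chain reads {chain_reads(n):3d} maps, "
          f"dense block reads {dense_reads(n):3d} maps")
```

Chain traffic grows linearly with depth while dense-block traffic grows quadratically, which is why feature-map bandwidth, rather than coefficient bandwidth, becomes the bottleneck.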

Page 9:

Trends in Convolutional Neural Network Topologies

Trend 1: Reduced Computational Requirements

Trend 2: Reduced Model Size

Trend 3: Reduced Data Reuse and Parallelism

Trend 4: Feature-map Bandwidth Becomes Dominant

Examples: MobileNet, DenseNet

Page 10:

Scaling Performance with Bandwidth Constraints

[Chart: achievable performance in TOPS versus number of processing units (1 PU to N PUs), under DRAM bandwidth limits of LPDDR4 (16 GB/s), LPDDR5 (22.4 GB/s), and HBM2 at 50% efficiency (128 GB/s).]

2

• Bandwidth reduction has a direct impact on performance and power

• Over 50% of SoC power is DRAM access
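The ceiling in the chart can be approximated with a simple roofline-style model; a sketch using the slide's bandwidth figures, with the achievable ops per DRAM byte as an illustrative assumption:

```python
# Sketch: roofline-style bound on throughput under a DRAM bandwidth cap.
# Bandwidths are the slide's figures; OPS_PER_BYTE is an assumed level of
# on-chip reuse, not a measured EV7x number.

PEAK_TOPS = 35.0        # EV7x maximum configuration
OPS_PER_BYTE = 400.0    # assumed ops executed per byte fetched from DRAM

memories = {"LPDDR4": 16.0, "LPDDR5": 22.4, "HBM2 (50%)": 128.0}  # GB/s

for name, gbps in memories.items():
    bw_bound = gbps * 1e9 * OPS_PER_BYTE / 1e12   # TOPS the DRAM can feed
    print(f"{name:11s}: bandwidth bound {bw_bound:6.1f} TOPS, "
          f"achievable {min(PEAK_TOPS, bw_bound):5.1f} TOPS")
```

Under these assumptions only HBM2 keeps a 35 TOPS configuration compute-bound; every bandwidth-reduction technique in the following slides effectively raises the ops-per-byte figure and moves the knee of this curve.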

Page 11:

Embedded Vision Processor Outline

• Deep Neural Network Trends

– Accuracy and Functionality

• EV7x Processor Family Overview

• DNN Engine

– Specialized DNN accelerator

– Local optimization of data movement

– Local data compression of coefficient and feature-maps

• Advanced Bandwidth Optimization Techniques

– DMA broadcast of coefficients and feature-maps

– Multi-level layer fusion

– Multi-level tiling across memory hierarchy


Page 12:

EV7x Processor and DNN Engine Overview

Page 13:

EV7x Vision Processor IP with 35 TOPS Performance

• Addresses market requirements for a full range of vision applications: always-on IoT, augmented reality, autonomous driving…

• Faster neural network accelerator executes all graphs, including the latest, most complex graphs

• Enhanced vision engine for low-power, high-performance vision, Simultaneous Localization and Mapping (SLAM) and DSP algorithms

• Architectural changes and power gating techniques reduce power consumption

• High-bandwidth encryption protects coefficients and biometric data

• Automatic graph partitioning using MetaWare EV for improved performance, bandwidth and latency

14,080 MAC Engine Made Possible with Better Utilization, Bandwidth & Power

[Block diagram: Synopsys DesignWare ARC EV7x Processor. Vision Engine in 1, 2 or 4 VPU configurations, each VPU with a 512-bit vector DSP, a 32-bit scalar unit, a VFPU, and VCCM/cache. DNN Accelerator in 880 to 14,080 MAC configurations for 2D convolutions, fully connected layers, and activations. Shared infrastructure: trace, power management, sync & debug, AXI interfaces, DMA with coherency, shared memory, closely coupled memories, and AES encryption. MetaWare EV Development Toolkit: OpenCL™ C and C/C++ development tools, OpenCV and OpenVX™ libraries & runtime, simulators and virtual platforms, and DNN mapping tools.]

Page 14:

EV6x/7x Scalable DNN Engine for Deep Learning-based Vision

- High performance, low power and low area

- Fully programmable

[Demo image: street scene with detected objects labeled car, sky, and building.]

Page 15:

DNN Accelerator Supports Up to 35 TOPS For All DNN Applications

• Deep Neural Network Engine supports
– Convolutional Neural Networks (CNN)
– Batched Recurrent Neural Networks (RNN)

• EV7x max performance
– Up to 14,080 multiply-accumulators per engine

• Improved utilization increases MAC efficiency
– Higher MAC utilization for 1x1 and 3x3 convolutions
– Increased support for non-linear functions (PReLU, ReLU6, Maxout, Sigmoid, Tanh, …)

• Architectural enhancements improve bandwidth, accuracy and power

0.1 to 35 TOPS to Address All Vision Applications

[Diagram: DNN Accelerator in 880 to 14,080 MAC configurations with DMA and closely coupled memories, handling 2D convolutions, fully connected layers, and activations.]

Page 16:

Graph Mapping: Support of Multiple CNN Frameworks

• Support for new graph frameworks via ONNX-based interoperability
– ONNX export utilities are being made available for numerous frameworks (see the export sketch below)

• Neutral intermediate representation
– Integrates the union of Caffe, Tensorflow and ONNX features

[Diagram (preliminary, subject to change): Caffe features, Tensorflow features, ONNX features, and future frameworks, with their common features, map into the EV High Level I/R of the CNN Graph Mapping Tool, which targets the EV Processor (vision CPU and DNN engine).]
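As a concrete example of the framework-to-ONNX leg of this flow, a minimal sketch using PyTorch's standard exporter; the toy model and file name are hypothetical, and the downstream EV graph mapping step is not shown:

```python
# Sketch: exporting a framework graph to the ONNX interchange format that
# a mapping tool like MetaWare EV's can consume. Toy model, illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(                 # stand-in for a real CNN
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

dummy = torch.randn(1, 3, 224, 224)    # example input fixes tensor shapes
torch.onnx.export(model, dummy, "toy_cnn.onnx",
                  input_names=["image"], output_names=["logits"])
```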

Page 17:

The Bandwidth Challenge

Page 18:

Scaling Performance with Bandwidth Constraints: 1 Cluster vs. N Clusters

• Bandwidth reduction has a direct impact on performance and power

• Over 50% of SoC power is DRAM access

[Chart repeated from Page 10: achievable TOPS for 1 to N clusters under LPDDR4 (16 GB/s), LPDDR5 (22.4 GB/s), and HBM2 (128 GB/s) bandwidths.]

Page 19:

EV DNN Bandwidth Improvement Features

• Coefficient Pruning
– Coefficients with a zero value are skipped/counted

• Feature Map Compression
– Lossless runtime compression and decompression of feature maps to external memory (see the first sketch below)

• Multi-level Layer Fusion
– Merging multiple folded layers into single primitives reduces feature-map bandwidth (see the second sketch below)

• Optimized Handling of Coefficients and Feature-maps
– Sharing of common data across slices to minimize coefficient and feature-map loading bandwidth
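First sketch: zero run-length coding, one simple lossless scheme of the kind the feature-map compression bullet describes; it exploits the sparsity that ReLU-style activations create. This illustrates the principle only and is not the EV engine's actual format.

```python
# Sketch: zero run-length encoding of a ReLU-sparse feature map.
# Lossless round trip; illustrative, not the EV DNN engine's format.

def zrle_encode(values):
    """Encode a flat feature map as (zero_run_length, nonzero_value) pairs."""
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    if run:
        out.append((run, None))  # trailing zeros carry no value
    return out

def zrle_decode(pairs):
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v is not None:
            out.append(v)
    return out

fmap = [0, 0, 3, 0, 0, 0, 7, 1, 0, 0]   # post-ReLU activations are often sparse
enc = zrle_encode(fmap)
assert zrle_decode(enc) == fmap          # lossless round trip
print(enc)  # [(2, 3), (3, 7), (0, 1), (2, None)]
```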

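Second sketch: layer fusion with tiling in miniature. Two convolution layers are evaluated strip by strip, so the intermediate feature map lives only in local memory and never travels to DRAM. A 1-D stand-in with an all-ones kernel, purely illustrative:

```python
# Sketch: fusing two conv layers with output tiling, so the intermediate
# feature map y1 = conv3(x) is never materialized in full. Illustrative
# 1-D version; the EV engine's actual primitives are not shown.
import numpy as np

def conv3(x):
    """Valid 1-D 3-tap convolution (all-ones kernel) as a stand-in layer."""
    return np.convolve(x, np.ones(3), mode="valid")

def fused_tiled(x, tile=4):
    """Compute conv3(conv3(x)) in output tiles, keeping y1 local per tile."""
    n_out = len(x) - 4                 # two valid 3-tap convs shrink by 4
    out = np.empty(n_out)
    for lo in range(0, n_out, tile):
        hi = min(lo + tile, n_out)
        # Output rows [lo, hi) of layer 2 need layer-1 rows [lo, hi+2),
        # which in turn need input rows [lo, hi+4): produce just those.
        y1_tile = conv3(x[lo:hi + 4])  # local intermediate only
        out[lo:hi] = conv3(y1_tile)
    return out

x = np.arange(20.0)
assert np.allclose(fused_tiled(x), conv3(conv3(x)))  # matches unfused result
```

Multi-level fusion applies the same recurrence across more than two layers and more than one level of the memory hierarchy, trading a little recomputation at tile borders for a large cut in external feature-map traffic.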

Page 20:

DNN Engine

[Diagram: the DNN engine's slices; each slice pairs a convolution unit (2D convolution ALU, AGUs, closely coupled memories) with a classification unit (1D convolution ALU, AGUs, closely coupled memories).]

Feature-map partitioning: split each layer over multiple slices (see the sketch below)

• Higher throughput – up to 4X

• Lower latency – up to 4X – due to parallel processing of a layer

• Significant bandwidth reduction

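A sketch of the row-partitioning idea: with a 3x3 kernel, each slice needs a one-row halo from its neighbours to compute its share of the output independently. Sizes and slice count are illustrative assumptions:

```python
# Sketch: splitting one layer's output rows across N slices, with one-row
# halos so each slice can run a 3x3 convolution independently.

def partition_rows(height, num_slices, halo=1):
    """Yield (input_row_range, output_row_range) per slice for a 3x3 conv."""
    base = height // num_slices
    for s in range(num_slices):
        out_lo = s * base
        out_hi = height if s == num_slices - 1 else out_lo + base
        in_lo = max(0, out_lo - halo)        # extend downward for the kernel
        in_hi = min(height, out_hi + halo)   # extend upward for the kernel
        yield (in_lo, in_hi), (out_lo, out_hi)

for in_rng, out_rng in partition_rows(height=56, num_slices=4):
    print(f"slice reads rows {in_rng}, writes rows {out_rng}")
```

Each slice's working set shrinks by roughly the slice count, which is where the throughput, latency, and bandwidth gains in the bullets come from.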

Page 21:

Summary

• Opposing CNN Graph Trends
– Reduced compute requirements and model size
– Reduced data reuse and parallelism
– Feature-map bandwidth becomes dominant

• Synopsys DNN Engine
– Specialized DNN accelerator
– Local optimization of data movement
– Local data compression of coefficients and feature-maps

• Advanced Bandwidth Optimization Techniques
– Optimized handling of coefficients and feature-maps
– Multi-level layer fusion and tiling

• Improved scalability, lower power
– 10 TOPS/W (7 nm)


[Chart repeated from Page 10: achievable TOPS versus DRAM bandwidth.]

Page 22:

Thank You