DaVinci: A Scalable Architecture for Neural Network Computing
Heng Liao, Jiajin Tu, Jing Xia, Xiping Zhou
2019-07
Ubiquitous AI computation

• AI computation is the key element enabling intelligence in physical devices: Control, Process, Transfer, Store ……
• Applications span roughly a 10^6 performance range, plotted as Computation (TOPS) vs. Memory Footprint (GB): smart toys, IP cameras, smartphones, IoT, robotics, industrial embedded AI, drones, autonomous driving, smart city, intelligent surveillance, cloud AI inference, model training, AutoML, meta-learning, algorithm discovery, up to general artificial intelligence.
• Scalability across this entire range is therefore the central requirement.
Ubiquitous AI computation: one scalable architecture

[Chart: DaVinci core variants on the same Computation (TOPS) vs. Memory Footprint (GB) axes: D-Tiny (wearables), D-Lite (phones), D-Mini (surveillance, wireless), DaVinci Max (edge training up to cloud training).]
Rich Variety of Computing Architectures in the Huawei Portfolio
• Wide range of performance & efficiency
• CPU: General purpose
• GPU: Graphics
• NPU: DNN
• ISP: Camera sensor pipeline
• DSP: Camera post processing, AR
• VPU: Vision Processing Unit
• NP: Network Processor
• Each category represents a different PPA curve
Target: Search for Optimal PPA in Design Space
[Chart: PPA curves per architecture category, Performance vs. Power.]
Architecture Overview of DaVinci
Building Blocks and their Computation Intensity

| N  | N²   | N³     |
|----|------|--------|
| 1  | 1    | 1      |
| 2  | 4    | 8      |
| 4  | 16   | 64     |
| 8  | 64   | 512    |
| 16 | 256  | 4096   |
| 32 | 1024 | 32768  |
| 64 | 4096 | 262144 |

DaVinci combines a 1D Scalar Unit (full flexibility), a 2D Vector Unit (rich & efficient operations), and a 3D Matrix Unit (high computation intensity). A numeric sketch of this scaling follows.
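To make the intensity argument concrete, here is a tiny self-contained sketch (Python; the operand-count model is our own simplification, using the unit sizes from this slide):

```python
# Ops per cycle for 1D/2D/3D datapaths of dimension N, as in the table above.
for n in [1, 2, 4, 8, 16, 32, 64]:
    print(f"N={n:2d}: 1D={n:3d}  2D={n*n:5d}  3D={n**3:7d} MACs/cycle")

# Computation intensity of the 3D unit: an NxNxN matrix multiply performs
# N^3 MACs while loading only 2*N^2 input operands, so MACs-per-operand
# grows as N/2. For N=16 (the DaVinci cube): 4096 MACs per cycle from 512
# operands, i.e. 8 MACs per operand loaded, vs. 0.5 for a scalar MAC.
n = 16
print(n**3 / (2 * n * n))  # -> 8.0
```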
Density comparison (area normalized to 12 nm):

|               | GPU + Tensor core | DaVinci AI core + SRAM |
|---------------|-------------------|------------------------|
| Area          | 5.2 mm²           | 13.2 mm²               |
| Compute power | 1.7 TOPS FP16     | 8 TOPS FP16            |
DaVinci Core

• Cube: 4096 (16³) FP16 MACs + 8192 INT8 MACs
• Vector: 2048-bit INT8/FP16/FP32 vector unit with special functions (activation functions, NMS (non-maximum suppression), ROI, SORT)
• Explicit memory hierarchy, managed by the MTE (Memory Transfer Engine)
[Block diagram: DaVinci core, 16³ cube configuration. Recoverable detail:
• 16³ Cube fed by L0 Buffer A (64 KB) and L0 Buffer B (64 KB), accumulating into a 16² FP32 accumulator and L0 Buffer C (256 KB), with FP16→FP32 / FP32→FP16 conversion and ReLU on the output path.
• 8×16 Vector Unit with 2048-bit load/store to a 256 KB Unified Buffer.
• 1 MB L1 buffer with DMAC; the MTE performs img2col, transpose, and decompression on the L1→L0 path.
• Scalar Unit / AGU / mask generation with GPRs, SPRs, and a 32 KB I-cache; instruction dispatch through a PSQ into separate Cube / Vector / MTE queues with event synchronization.
• Bus widths: L0 load 4096-bit, L1 load 2048-bit, vector L/S 2048-bit, L2 access 2×1024-bit, via the BIU and a system-control config port.]
A minimal model of the cube's dataflow follows.
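As an illustration of that dataflow, here is a minimal NumPy model of one cube step (a sketch of the diagram above, not the real ISA; tile shapes follow the 16³ configuration):

```python
import numpy as np

N = 16  # cube dimension: one step multiplies two 16x16 FP16 tiles

a = np.random.rand(N, N).astype(np.float16)   # tile staged in L0 Buffer A
b = np.random.rand(N, N).astype(np.float16)   # tile staged in L0 Buffer B
acc = np.zeros((N, N), dtype=np.float32)      # the 16^2 FP32 accumulator

# One cube step: 16^3 = 4096 FP16 MACs, accumulated at FP32 precision.
acc += a.astype(np.float32) @ b.astype(np.float32)

# Larger matmuls are tiled along K: the MTE streams successive 16x16 tiles
# from L1 into L0A/L0B while partial sums stay resident in the accumulator;
# the FP32 result is converted back to FP16 on writeback (ReLU optional).
```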
Micro Architecture Configurations

| Core Version | Cube Ops/cycle | Vector Ops/cycle | L0/L1 Bus Width (bits) | L2 Bandwidth |
|--------------|----------------|------------------|------------------------|--------------|
| DaVinci Max  | 8192           | 256              | A: 8192, B: 2048       | 910: 3 TB/s ÷ 32 cores; 610: 2 TB/s ÷ 8; 310: 192 GB/s ÷ 2 |
| DaVinci Lite | 4096           | 128              | A: 8192, B: 2048       | 38.4 GB/s    |
| DaVinci Tiny | 512            | 32               | A: 2048, B: 512        | none         |

Sizing rationale per column:
• Cube ops/cycle sets the performance baseline.
• Vector ops/cycle is chosen to minimize vector-bound phases.
• L0 bus width matches the execution units so it never bottlenecks them.
• L1 bus width is provisioned to ensure it is not a bound.
• L2 bandwidth is scarce and limited by the NoC; mappings should avoid being bound on it where possible (a per-core budget check follows).
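A quick sanity check of the per-core L2 budget implied by the last column (simple division; figures taken from the table):

```python
# Per-core share of the shared L2 bandwidth listed above: total / core count.
configs = {
    "Ascend 910 (Max)": (3e12, 32),   # 3 TB/s shared by 32 cores
    "Ascend 610":       (2e12, 8),    # 2 TB/s shared by 8 cores
    "Ascend 310":       (192e9, 2),   # 192 GB/s shared by 2 cores
}
for chip, (bw, cores) in configs.items():
    print(f"{chip}: {bw / cores / 1e9:6.1f} GB/s of L2 bandwidth per core")
# ~94 / 250 / 96 GB/s per core: orders of magnitude below L0/L1 bandwidth,
# which is why mappings must avoid being L2-bound wherever possible.
```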
Resource Matching: Vector

[Charts: cube-cycle / vector-cycle ratio per layer for BERT (tasks 0 through 240), MobileNet, and ResNet50 (conv1 through res5c_br2a). The ratio varies by more than an order of magnitude across layers and networks.]
• Balance compute between Cube and Vector by overlapping Cube computation time with Vector computation time.
• Carefully allocate the number of MACs in the Cube and in the Vector unit.
• Support multiple matrix-times-vector operations in the Cube.
• Expand the width of the data bus between the L1 feature-map buffer and the Cube.
A sketch of this balance check follows.
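Here is a minimal sketch of the cube/vector balance analysis behind those charts (hypothetical layer shapes; ops-per-cycle figures from the DaVinci Max configuration table):

```python
# Estimate cube vs. vector cycles for conv + activation layers and check
# whether the vector unit is the bottleneck when both run overlapped.
CUBE_MACS_PER_CYCLE = 4096
VECTOR_OPS_PER_CYCLE = 256

def layer_cycles(h, w, cin, cout, k):
    cube_cycles = (h * w * cin * cout * k * k) / CUBE_MACS_PER_CYCLE
    vec_cycles = (h * w * cout) / VECTOR_OPS_PER_CYCLE  # one activation/output
    return cube_cycles, vec_cycles

layers = {"1x1 conv, 56x56x64 -> 64":   (56, 56, 64, 64, 1),
          "3x3 conv, 28x28x128 -> 128": (28, 28, 128, 128, 3)}
for name, shape in layers.items():
    c, v = layer_cycles(*shape)
    print(f"{name}: cube {c:8.0f} cy, vector {v:6.0f} cy, ratio {c / v:5.1f}")
# A ratio >> 1 means the vector work hides entirely under cube time; ratios
# near 1 (common in MobileNet and BERT layers) drive the vector MAC budget.
```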
Resource Matching: Memory Hierarchy

DaVinci carefully balances the memory hierarchy so that bandwidth does not become a bottleneck at key locations.

Examples:
• Reduce the DDR bandwidth requirement by reusing data within L1, L0A, and L0B.
• Provide asymmetric bandwidth according to the nature of the computation:
  • L1→L0A bandwidth >> L1→L0B bandwidth, because the feature-map extent W×H can be much larger than the number of output channels (see the sketch below).
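A sketch of why the A path gets the wider bus (hypothetical conv layer; our own traffic model): per pass over one output tile, the cube consumes far more feature-map data (W×H spatial positions, via L0A) than weight data (Cout filters, via L0B).

```python
# Bytes streamed into L0A vs. L0B for one pass over a conv output tile.
def pass_traffic_bytes(h, w, cin, cout, k, elem=2):    # FP16 elements
    fmap = h * w * cin * k * k * elem     # img2col'd activations -> L0A
    weights = k * k * cin * cout * elem   # filter tile -> L0B
    return fmap, weights

fm, wt = pass_traffic_bytes(h=56, w=56, cin=64, cout=64, k=3)
print(f"L0A traffic {fm / 2**20:.1f} MiB vs L0B traffic {wt / 2**10:.0f} KiB "
      f"({fm / wt:.0f}x)")
# With W*H = 3136 >> Cout = 64, activation traffic dominates by ~49x,
# consistent with giving the L1->L0A path the much wider bus (A: 8192-bit
# vs B: 2048-bit in the configuration table).
```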
More Challenges of DaVinci
Overview of the DSA Developer Stack

| Stack level                                                        | GPU            | NPU (DaVinci) |
|--------------------------------------------------------------------|----------------|---------------|
| Instruction Set Architecture                                       |                |               |
| Level 1 compiler (intrinsic C, architecture-defined programming)   | CUDA / OpenCL  | CCE C         |
| Level 1 library (written by experts)                               | cuDNN / cuBLAS | CCE Lib       |
| Level 2 compiler (parallel/kernel programming model)               |                | TIK           |
| Level 2 library (written by skilled programmers)                   |                | TIK LIB       |
| Level 3 compiler (mathematical programming model)                  | TVM / XLA      | TBE           |
| Level 3 library (written by novice programmers)                    |                | TBE LIB       |
Challenge 1: How to Enable Parallelism with a Single Thread

[Diagram: the in-order scalar pipe dispatches scalar instructions 0 through 7 while forwarding MTE, Vector, and Cube instructions to their own queues; the timeline (steps 0 through 9) shows MTE Instr 0/1, Vector Instr 0/1, and Cube Instr 0 executing concurrently, serialized only by data dependencies, resource limits, explicit sync events, and a global barrier.]
• Programmers are comfortable with sequential code.
• DaVinci's C-like programming interface (CCE) lets the programmer control this parallelism explicitly, as sketched below.
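To show the contract this exposes, here is a toy Python model of one dependent chain dispatched across the three queues (not the real CCE API; the instruction strings and event names are illustrative):

```python
# Toy model of explicit parallelism under a single scalar thread: the scalar
# pipe dispatches everything in order, but each instruction only queues on
# its own unit; set_flag/wait_flag events express the true dependencies.
program = [
    ("MTE",    "load tile0: L1 -> L0A/L0B"),
    ("MTE",    "set_flag(EV0)"),           # tile0 now resident in L0
    ("CUBE",   "wait_flag(EV0)"),
    ("CUBE",   "matmul(tile0)"),
    ("CUBE",   "set_flag(EV1)"),           # result in the accumulator
    ("VECTOR", "wait_flag(EV1)"),
    ("VECTOR", "relu(result0)"),
]
for unit, instr in program:   # scalar pipe: in-order dispatch, no stalls
    print(f"dispatch -> {unit:6s}: {instr}")
# After dispatch, MTE/CUBE/VECTOR proceed concurrently; only the two events
# (and any global barrier) serialize them -- parallelism from one thread.
```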
Solution with Multi-threading?

What about supporting hardware multi-threading?
• The code in each thread is sequential.
• The Cube is a shared resource between threads.
• It carries a hardware cost.
How does it work: TIK

• Typical sequential DaVinci code is a combination of nested FOR loops.
• Software multi-threading can be added to any FOR-loop body (iterator kernel); the programmer's view stays a simple loop, as in the sketch below.
[Figure: programmer's view of multi-threading: loop iterations spread across threads.]
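A minimal TIK-style kernel sketch (TIK is the Python-embedded DSL named in the stack above; the tensor shapes here are made up, and exact intrinsic names and parameters vary across CANN versions, so treat the vector call in particular as an assumption):

```python
from te import tik

tik_instance = tik.Tik()
src = tik_instance.Tensor("float16", (128, 128), name="src", scope=tik.scope_gm)
dst = tik_instance.Tensor("float16", (128, 128), name="dst", scope=tik.scope_gm)

# thread_num=2 is the "software multi-thread on a FOR-loop body": TIK
# double-buffers the body so iteration i+1's data movement overlaps
# iteration i's compute -- no explicit synchronization in user code.
with tik_instance.for_range(0, 128, thread_num=2) as i:
    buf = tik_instance.Tensor("float16", (128,), name="buf", scope=tik.scope_ubuf)
    tik_instance.data_move(buf, src[i, 0], 0, 1, 8, 0, 0)   # GM -> unified buffer
    tik_instance.vec_relu(128, buf, buf, 1, 8, 8)           # vector-unit ReLU
    tik_instance.data_move(dst[i, 0], buf, 0, 1, 8, 0, 0)   # unified buffer -> GM

tik_instance.BuildCCE(kernel_name="relu_rows", inputs=[src], outputs=[dst])
```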
Advanced Compiler Techniques

• Architecture-independent DSL → C → binary lowering process.
• The traversal order over the tensor data set determines the data-reuse factor.
• There are millions of legitimate mappings from the kernel work space onto the tensor data set.
• The compiler must find a near-optimal mapping to bridge the ~2,000x memory bandwidth gap (a toy version of this search follows).
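A toy version of that search (our own simplified cost model, not Huawei's compiler): enumerate matmul tilings that fit the 1 MB L1 buffer and rank them by DRAM traffic.

```python
from itertools import product

# C[M,N] += A[M,K] @ B[K,N] in FP16; one DaVinci core has a 1 MB L1 buffer.
M = N = K = 1024
ELEM = 2                 # bytes per FP16 element
L1_BYTES = 1 << 20

def dram_traffic(tm, tn, tk):
    # Output-stationary model: each A tile is re-fetched once per column of
    # N-tiles, each B tile once per row of M-tiles, C is written once.
    return (M * K * (N // tn) + K * N * (M // tm) + M * N) * ELEM

mappings = [(tm, tn, tk)
            for tm, tn, tk in product([16, 32, 64, 128, 256], repeat=3)
            if (tm * tk + tk * tn + tm * tn) * ELEM <= L1_BYTES]
best = min(mappings, key=lambda t: dram_traffic(*t))
print(f"{len(mappings)} legal mappings; best tile {best} -> "
      f"{dram_traffic(*best) / 2**20:.0f} MiB of DRAM traffic, vs "
      f"{dram_traffic(16, 16, 16) / 2**20:.0f} MiB for the worst tile")
```

Even this toy space has 125 legal mappings with a ~14x traffic spread; adding loop orders, multi-level tiling, and img2col layouts is what pushes real kernels to millions of legitimate mappings.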
Putting All This Together

[Stack diagram, top to bottom: user program (AI model description) in PyTorch / TensorFlow / MindSpore → Graph Engine (L2: architecture-independent) → CANN (L1: architecture-dependent) with operator library, user-defined operators, and task scheduler → DaVinci cores.]

• Users program AI models in the frameworks they already know.
• The operator library is extended only when necessary.
• Tasks execute on a single node or over a network cluster.
DaVinci AI Cores in SoCs

[Block diagrams: DaVinci AI cores integrated across several SoC families. Recoverable detail:
• Mobile AP SoC: DaVinci AI cores integrated alongside the application processor's CPU clusters.
• AI Inference SoC (Ascend 310): 2 DaVinci AI cores; 8 A55 CPUs in a DSU cluster; 8 MB on-chip buffer; FHD video codec and JPEG/PNG codec; DMA engine; LPDDR4x 64-bit to an external LPDDR4 chip; PCIe 3.0 controller (RC/EP, 1 to 4 lanes) to an x86/Arm host or as a PCIe EP device; USB 3.0 device; Gigabit Ethernet; SPI flash; UART / I2C / SPI / GPIO, etc.
• Automotive SoC: DaVinci AI cores on a 512-bit CHIE interconnect with A55 CPU clusters, a task scheduler (TS), a 3 MB last-level cache, DMA engine, a low-power M3 core, and LPDDR4x 64-bit to an external LPDDR4 chip.
• AI Training SoC (Ascend 910): the "Vitruvian" compute die carries 32 DaVinci AI cores (grouped 0-7, 8-15, 16-23, 24-31) on a CHIE mesh NoC, Taishan CPU clusters (MP4), a TS subsystem, DVPP subsystems 0-3, a 32 MB on-chip buffer, four HBM 2.0 stacks each with its own DMA engine, plus DDR4 DIMM channels with an L3 cache. The companion "Nimbus V3" I/O die provides the network subsystem, HAC subsystem, PCIe subsystem (to an x86/Arm host or an FPGA-extended DaVinci board), Hydra subsystem (CCIX), IO subsystem, and IMU.
• Wireless SoC: DaVinci cores sit on shared mesh/ring buses next to CPU clusters with L3 cache and a DDR queue manager, DSP clusters with shared memory, AVP clusters, LDPC/Polar, FFT, and CGRA accelerators, and IO/SerDes/Ethernet.]
Alleviating the Memory Wall and the Slowdown of Moore's Law

Memory Wall & I/O Wall

| Level                       | Bandwidth | Ratio to EU | Challenges |
|-----------------------------|-----------|-------------|------------|
| Execution engine (512 TOPS) | 2048 TB/s | 1           | Faster EUs can be built, but there is no way to feed them data |
| L0 memory                   | 2048 TB/s | 1/1         | Very wide datapath, hard to do scatter-gather; inner-loop data reuse |
| L1 memory                   | 200 TB/s  | 1/10        | Intra-kernel data reuse |
| L2 memory                   | 20 TB/s   | 1/100       | Inter-kernel data reuse |
| HBM memory                  | 1 TB/s    | 1/2000      | HBM size limits the memory footprint |
| Intra-node bandwidth        | 50 GB/s   | 1/40,000    | Scale-up nodes increase the memory footprint but are severely bandwidth constrained |
| Inter-node bandwidth        | 10 GB/s   | 1/200,000   | Model parallelism across nodes is severely bandwidth constrained |

Compiler innovation targets the on-chip levels (L0 through HBM); cluster innovation targets the intra-node and inter-node levels. Worked numbers follow.
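The arithmetic behind the table (ratios as listed; a minimal sketch assuming the execution engine consumes operands at the L0 rate):

```python
# Reuse factor each level must supply so a 512 TOPS engine stays fed.
EU_DEMAND = 2048e12          # operand bandwidth consumed by the EU (row 1)
levels = {"L0": 2048e12, "L1": 200e12, "L2": 20e12, "HBM": 1e12,
          "intra-node": 50e9, "inter-node": 10e9}
for name, bw in levels.items():
    print(f"{name:10s}: data fetched here must be reused ~{EU_DEMAND / bw:,.0f}x")
# HBM -> ~2,048x: every byte from HBM must feed ~2000 operand uses on-chip.
# This is the "2,000x memory bandwidth gap" the compiler's tiling bridges.
```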
Technology challenges: Why do we need 3DIC?

• 3DIC can help alleviate the memory wall, the I/O wall, and the logic wall.
• The search for new transistors can help with the power and thermal walls.
• Architecture innovation must reach new ground despite all of these challenges.

The five walls around 3DIC:
• Memory wall: bounded by memory bandwidth & capacity.
• Logic wall: process scaling cannot satisfy the need for more die area (many chips have already reached full reticle size in the 7+ process).
• Interface wall: interface I/O speed does not scale with the process node.
• Power wall: the Dennard-scaling slowdown imposes power limits.
• Thermal wall: ultimate system performance is thermally limited.
AI Training SoC: Logic + 3D-SRAM + 12 HBM

[Package cross-section: two AI SoC dies plus an IO die sit on an interposer and substrate, flanked by 12 HBM2E stacks; the 3D-SRAM dies are stacked below the SoC dies.]

• Customized HBM2E with two stacks per site to increase HBM bandwidth.
• Large 3D-SRAM serves as the AI-core cache.
Mobile AP: LoL + MoL (logic-on-logic + memory-on-logic)

[Package evolution diagrams: HBM stacks beside the logic die on the package board today, moving toward 3D DRAM (3DM) stacked directly on logic.]

• Step 1: one logic die + 3D DRAM (3DM) + PoP LPDDR.
• Step 2: two logic dies + 3D DRAM + PoP LPDDR.
• Step 3: multi-layer 3D DRAM (removing the PoP LPDDR) + multi-layer logic die.
Physical Design of DaVinci AI Chips

Ascend 310 (Ascend-Mini): high power efficiency
• FP16: 8 TeraFLOPS; INT8: 16 TeraOPS
• 16-channel H.264/H.265 video decode; 1-channel H.264/H.265 video encode
• Architecture: DaVinci; process: 12 nm; power: 8 W

Ascend 910 (Ascend-Max): high computing density
• FP16: 256 TeraFLOPS; INT8: 512 TeraOPS
• 128-channel H.264/H.265 video decode
• Architecture: DaVinci; process: 7+ nm EUV; power: 350 W
Ascend Architecture: Comparison of Computing Density

| Chip          | Process     | Die area    | Peak compute        |
|---------------|-------------|-------------|---------------------|
| Tesla V100    | 12 nm       | 416 mm²     | 120 TFLOPS FP16     |
| Ascend 910    | 7+ nm       | 182.4 mm²   | 256 TFLOPS FP16     |
| Google TPU v3 | undisclosed | undisclosed | 105 TFLOPS bfloat16 |
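The density arithmetic implied by those numbers (simple division over the figures in the table):

```python
# Computing density in TFLOPS per mm^2 (FP16), from the comparison above.
chips = {"Tesla V100 (12 nm)": (120, 416.0), "Ascend 910 (7+ nm)": (256, 182.4)}
density = {name: tflops / area for name, (tflops, area) in chips.items()}
for name, d in density.items():
    print(f"{name}: {d:.2f} TFLOPS/mm^2")
print(f"ratio: {density['Ascend 910 (7+ nm)'] / density['Tesla V100 (12 nm)']:.1f}x")
# ~0.29 vs ~1.40 TFLOPS/mm^2, roughly 4.9x: partly the 7+ nm node, partly
# the area efficiency of the 3D cube datapath.
```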
Ascend 910 AI Server

| Feature             | AI Server Spec                                           |
|---------------------|----------------------------------------------------------|
| Configuration       | 8 × DaVinci (Ascend 910) module + 2 × Xeon CPU + 24 DIMM |
| Performance         | 2 PFLOPS per chassis; 256 TFLOPS per AI module           |
| Memory              | 24 DIMMs, up to 1.5 TB                                   |
| Storage             | 6 × 2.5-inch NVMe (24 TB) + 2 × 2.5-inch SAS/SATA (RAID 1) |
| Interfaces          | 8 × 100G fiber; 4 × PCIe IO                              |
| Power               | 6000 W                                                   |
| Ambient temperature | 5-35 °C                                                  |

[Photo: x86 node plus DaVinci node with 8 × Ascend 910.]
Ascend 910 Cluster

• 2048 nodes × 256 TFLOPS = 512 PetaFLOPS.
• 1024 to 2048-node clusters, built from AI servers of 8 Ascend 910s each, joined by board-level interconnect and a cluster network.
Ascend 910 Die Shot

• Eight dies in total are integrated in the package; two dummy dies are added to ensure mechanical uniformity.
• Total silicon area: 456 + 168 + 4×96 + 2×110 = 1228 mm².

| Die                 | Area         |
|---------------------|--------------|
| Vitruvian (compute) | 456 mm²      |
| Nimbus (I/O)        | 168 mm²      |
| HBM × 4             | 96 mm² each  |
| Dummy die × 2       | 110 mm² each |
Ascend 910 Floorplan

• A mesh NoC connects the 32 DaVinci cores in the Ascend 910 compute die.
• The NoC provides 128 GB/s of read bandwidth plus 128 GB/s of write bandwidth per core.
• Inter-chip connections: 3 × 240 Gbps HCCS ports for NUMA connections; 2 × 100 Gbps RoCE interfaces for networking.
• Per-core units in 7 nm: the 8-TFLOPS Tensor (cube) unit occupies 1.6 mm × 1.2 mm; the 128-GFLOPS Vector unit occupies 0.9 mm × 1 mm.
[Annotated die plot; dimension callouts include 31.25 mm, 14.6 mm, 1.9 mm, and 3 mm.]
Ascend 910 NoC

• 1024-bit, 2 GHz NoC mesh; topology: 6 rows × 4 columns.
• Access bandwidth to the on-chip L2 cache: 4 TB/s.
• Access bandwidth to off-chip HBM: 1.2 TB/s.
• NoC bandwidth is fairly shared among the DaVinci cores.
Ascend 310 Die Shot

• Die size: 10.65 mm × 9.8 mm, 12 nm process.
• Per-core units: Tensor (cube) 3.08 mm × 1.73 mm; Vector 1.6 mm × 1.35 mm.
Kunpeng vs. Ascend

[Die shots: Kunpeng 920 alongside Ascend 310 and Ascend 910.]
Future: DaVinci 3D-SRAM Floorplan

[3D-SRAM architecture diagram: logic die stacked on a memory die over the interposer/substrate; BPV and TDV/TSV vias through the dielectric; ~20 µm microbumps.]
More Challenges…

• Generalized AutoML.
• Efficiency for reinforcement learning and GNNs?
• Generalized methods for data / model / pipeline parallelism.
• How do we unify data precision?
• Finding the sweet-spot architecture:
  • big chips vs. small chips
  • dense vs. sparse
  • out-of-memory vs. near-memory vs. in-memory computing
Copyright © 2018 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements, including, without limitation, statements regarding future financial and operating results, the future product portfolio, and new technology. A number of factors could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Such information is therefore provided for reference purposes only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.

Bring digital to every person, home and organization for a fully connected, intelligent world.
Thank you.