Page 1: Harvard SoC Designs

Paul Whatmough

Arm ML Research Lab

[email protected]

Harvard SoC Designs

Page 2: Harvard SoC Designs

Paul Whatmough

Arm ML Research Lab

[email protected]

Harvard SoC Designs

…how to win friends and influence people with successful Arm-based SoC tape outs

Page 3: Harvard SoC Designs

Why build test chips in academia? ("If I just do simulation, I can write twice as many papers…")

Circuits research

• Measured test chip essential for tier-1 publication

• Often need to characterize devices and passives

Architecture research

• Understand the whole stack, soup to nuts

• Know you are solving a real problem

• Add real impact to your work; not just another academic paper

All models are wrong, some are useful

• Many things are hard to simulate convincingly

• Data to build models to design better circuits

Training for researchers

• Build a deep understanding of real computers

• Problem solving, team work, time management

• Extremely valuable depth of experience

Page 4: Harvard SoC Designs

Harvard / Arm collaboration on ML hardware

Harvard VLSI-Architecture group

• vlsiarch.eecs.harvard.edu

Harvard NLP group

• nlp.seas.harvard.edu

[Figure: collaboration diagram spanning Algorithms, Computer Architecture, and VLSI Circuits: David Brooks, Gu-Yeon Wei, Paul Whatmough (Arm), Sasha Rush (Harvard NLP); also labeled: RoboBees]

Page 5: Harvard SoC Designs

Deep learning hardware for IoT

Next-generation UI/UX

• Small form-factor without a big screen or input device

• Speech detection/recognition/synthesis, gaze detection, biometric authentication (e.g. face detection)

• Personalize interface and predict user decision-making

Efficient DNN inference hardware on embedded devices

• Privacy, latency and energy issues with transmitting data

• Efficiency to handle large compute and storage demand

• But, still care about programmer efficiency and flexibility

Page 6: Harvard SoC Designs

Hardware architecture specialization

What is the gap in energy efficiency and flexibility?

• Measured results comparing energy efficiency across compute sub-systems

How can we allow hardware accelerators and CPUs to share data efficiently?

• Demonstrate lightweight L2$ coherent data sharing via accelerator coherency port (ACP)

[Figure: flexibility vs. efficiency spectrum spanning CPU, GPU, NPU, and single-function hardware]

Page 7: Harvard SoC Designs

Recent tape outs at Harvard

Page 8: Harvard SoC Designs

SM2 – 28nm DNN ENGINE

Programmable DNN Classifier for IoT

• Parallelism/reuse: 8-way SIMD, 10X data reuse @ 128b/cycle BW

• Small data types: 8-bit weights, 30% energy reduction

• Sparse activation data: 4X higher throughput and 4X lower energy (zero-skipping sketched below)

• Algorithmic resilience: +50% throughput or 30% energy reduction

[Whatmough et al., ISSCC ’17, JSSC ’18]

[Figure: sparse FC-DNN pipeline: node list / sparse index, activation SRAM, operand load from weight SRAM, 8-way fixed-point MAC SIMD unit with bias/ReLU activation, AGU and node SRAM, interlock/stall signals]
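
A minimal sketch of the zero-skipping idea behind the sparse pipeline above: ReLU leaves many activations exactly zero, so all the MAC work for those inputs can be skipped (the slide credits sparsity with the 4X throughput/energy gains). This is illustrative C, not the chip's RTL; names and widths are assumptions.

#include <stdint.h>
#include <stddef.h>

/* One fully-connected output node with zero-skipping.
 * act: input activations; w: 8-bit weights (per the chip's data type). */
void sparse_fc_node(const int16_t *act, size_t n_in,
                    const int8_t *w, int32_t bias, int16_t *out) {
    int32_t acc = bias;
    for (size_t i = 0; i < n_in; i++) {
        if (act[i] == 0) continue;              /* skip all work for zero inputs */
        acc += (int32_t)act[i] * (int32_t)w[i];
    }
    *out = acc > 0 ? (int16_t)acc : 0;          /* ReLU; saturation omitted */
}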

Page 9: Harvard SoC Designs

SM2 – System architecture

[Figure: SM2 28nm SoC test chip block diagram. A Cortex-M0 subsystem (I-MEM 64KB, D-MEM 64KB, timers, watchdog, UARTs, GPIO, BIST, SYSCTL) on 32b AHB and 32b APB buses bridges to a 128b AXI accelerator subsystem (DNN Engine with 1MB W-MEM). I/O includes USB, scan, and GPIO; clocking is DCO-based (FCLK/HCLK/OSC, RTC, async crossings). Separate supply domains: VSOC, VDCO, VACC (accelerator logic), VMEMP (SRAM periphery), VMEMC (SRAM core).]

[Whatmough et al., ISSCC ’17, JSSC ’18]

Page 10: Harvard SoC Designs

SM3 – 16nm DNN ENGINE v2

Test Chip Summary

• Technology: 16nm FinFET
• On-chip model size: 1 MB
• Layer support: FC-DNN
• Voltage range: 0.4 – 1 V
• Frequency range: 57 – 1360 MHz
• Peak efficiency (16-bit): 750 GOPS/W
• MNIST energy / accuracy: 151 nJ / 98.51%
• Resilience features: Razor + adaptive clock

[Figure: 2.5 mm × 2.5 mm die photo (weight SRAMs, accelerator SRAM, Arm M0, DCO, noise generator) and an energy-per-inference comparison chart: 2.44 (analog SNN, 65nm), 1.81 (digital SNN, 28nm), 1.01 (digital DNN, 28nm), 0.75 (digital DNN, 16nm), highlighting a 2.38X improvement]

[Whatmough et al., Hot Chips '17; Lee et al., ESSCIRC '18, JSSC '19]

Page 11: Harvard SoC Designs

SM3 – IoT application demos

Labeled Faces in the Wild (FACE)

Keyword Spotting (KWS)

Human Activity Recognition (HAR)

• Opportunity (HAR1): Detect 18 gestures w/ 7 sensors

• Smartphone (HAR2): Detect basic activities w/ smartphone inertial measurement unit (IMU)

• PAMAP2: Detect activities of daily living w/ 4 sensors

• Daphnet Freezing of Gait: Detect freezing incidents for Parkinson’s patients

Handwritten digit classification (MNIST)

[Kodali et al., ICCD’17]

Page 12: Harvard SoC Designs

SMIV – 16nm SoC to evaluate ML hardware

[Figure: SMIV SoC block diagram. A 64-bit Arm Cortex-A53 CPU cluster with 2MB L2 exposes an Accelerator Coherency Port (ACP) to four cache-coherent datapath accelerators (ACC0–ACC3); 4 banks × 1MB SRAM and an eFPGA (2×2 EFLX4K tiles) sit on a 128-bit NIC-400 interconnect alongside DRAM and PCIe; an always-on 32-bit Cortex-M0 cluster (FC ENGINE + SRAM, peripherals, ThinLink IO bridge) hangs off a 32-bit AHB and a 64-bit NIC-400 interconnect.]

[Whatmough et al., Hot Chips '18, VLSI '19]

Page 13: Harvard SoC Designs

SMIV – GEMM performance across accelerators

Dual-Core CPU (A53)

• C/C++ or BLAS library (see the GEMM sketch at the end of this page)

• 18.9 GOPS/W (1x)

Dual-Core CPU with SIMD enabled (A53)

• Assembly or BLAS library

• 58.7 GOPS/W (3.1x)

4-Tile Embedded FPGA (eFPGA)

• Verilog/VHDL/HLS, reconfigurable

• 312 GOPS/W (16.5x)

Quad-Core Datapath Accelerator (CCA)

• Verilog/VHDL/HLS, fixed functionality

• >1 TOPS/W (54.9x)

[Chart: measured GEMM efficiency: 18.9 GOPS/W and 58.7 GOPS/W (software programmable), 312.4 GOPS/W (reconfigurable), 1.04 TOPS/W (fixed-function)]

[Whatmough et al., Hot Chips '18, VLSI '19]
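
As a concrete illustration of the CPU baseline above, a minimal CBLAS GEMM call in C. This is a hedged sketch: it assumes some CBLAS implementation (e.g. OpenBLAS) is installed and linked, and the matrix sizes are arbitrary placeholders.

#include <cblas.h>
#include <stdlib.h>

int main(void) {
    const int M = 256, N = 256, K = 256;
    float *A = malloc(sizeof(float) * M * K);    /* M x K */
    float *B = malloc(sizeof(float) * K * N);    /* K x N */
    float *C = calloc((size_t)M * N, sizeof(float));
    for (int i = 0; i < M * K; i++) A[i] = 1.0f;
    for (int i = 0; i < K * N; i++) B[i] = 0.5f;

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    free(A); free(B); free(C);
    return 0;
}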

Page 14: Harvard SoC Designs

SMIV – Gem5-Aladdin model

Leverages the gem5-Aladdin framework

Compares different I/O interfaces for attaching the accelerator to the CPU, across a variety of DNN workloads running on an NVDLA-like accelerator

Shows that ACP, with its direct interface to the CPU's L2 cache, can improve energy-delay product over using DMA (see the EDP sketch below)

Open questions: multiple accelerators and other system-level effects; accelerators that want to access coherent memory typically also want a local cache.
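
For reference, energy-delay product (EDP) is simply energy times runtime, so a design improves it by saving energy, time, or both. A tiny C sketch of the metric; the DMA/ACP numbers are made-up placeholders, not SMIV measurements.

#include <stdio.h>

int main(void) {
    double dma_t = 1.00, dma_e = 1.00;   /* normalized runtime and energy */
    double acp_t = 0.85, acp_e = 0.90;   /* hypothetical ACP improvement  */
    printf("DMA EDP: %.3f\n", dma_t * dma_e);   /* 1.000 */
    printf("ACP EDP: %.3f\n", acp_t * acp_e);   /* 0.765 */
    return 0;
}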

C. SoC-Accelerator Interfacing

There have been a few publications over the years that investigated SoC-accelerator interfacing. Sadri et al. performed an early investigation into the performance of the ARM Accelerator Coherency Port on the Xilinx Zynq platform and found that when the CPU and accelerator communicate over ACP, they can cooperatively perform an FIR filter on images about 20% faster and consume 20% less energy than by using DMA [37]. However, this study only looked at simple filters on images, and the physical platform limited the ACP bandwidth to 2GB/s, which is far too low for contemporary DNN workloads. Moreau et al. also used the ACP interface when building SNNAP because of its lower round-trip latency [38], but they also used the Zynq platform and would be equally bandwidth-limited. Recent work by Shao et al. found that software-based cache-coherency management for DMA interfaces contributes a significant amount of overhead to the end-to-end runtime of an accelerated workload, and the so-called serial data arrival effect limits how much speedup can be gained from additional datapath parallelism [22]. However, their cache explorations also assume the presence of a private accelerator cache.

IV. EXPERIMENTAL INFRASTRUCTURE

In this section, we will describe the SoC that forms the basis for our studies and our experimental methodology. We will describe the accelerator architecture, the networks that are studied, the software that implements them, and the performance and energy modeling infrastructure.

A. Baseline System Architecture

Figure 1 shows the baseline SoC used in this paper. The microarchitectural parameters of the SoC components are listed in Table II. The SoC consists of an out-of-order CPU cluster with 64KB of private L1 instruction and data caches and 2MB of shared L2 last-level cache. The main system interconnect is a 256-bit full-duplex coherent crossbar. The CPUs are clocked at 2.5GHz, while the system bus and LLC run at half processor speed, providing up to 40GB/s of interconnect bandwidth. The system is backed by 4GB of LP-DDR4 DRAM running at 1600MHz, providing up to 25.6GB/s of bandwidth.

This paper is not about designing a new accelerator, but rather about extracting more performance/watt from an existing design without changing its core datapath. In this paper, we use an accelerator design inspired by the Nvidia Deep Learning Accelerator (NVDLA) [39]. We choose this design because it is an open-source project backed by a major hardware vendor. As shown on the left side of Figure 1, the design is based around a convolution accelerator consisting of 8 PEs.

[Fig. 1: The baseline SoC used in this paper. Two CPUs with private L1 caches share a 2MB L2 on a coherent system bus, backed by an LPDDR4 memory controller and 4GB of LPDDR4 DRAM. The accelerator comprises 8 PEs (PE0–PE7) fed by banked input and weight scratchpads (B0–B7) and an 8-bank output scratchpad, with 8 MACC arrays (16-bit inputs/weights, 32-bit accumulation) and control logic; it attaches to the system via both a DMA/IO interface and the ACP.]

Component | Parameters
CPU Core | Out-of-order x86 @2.5GHz, 8-µop issue width, 192-entry ROB
L1 Cache | 64KB i-cache & d-cache, 4-way associative, 64B cacheline, LRU, 2-cycle access latency
L2 Cache | 2MB, 16-way, LRU, 20-cycle access latency
DRAM | LP-DDR4 @1600MHz, 4GB, 4 channels, 25.6GB/s
System Bus | 256-bit fully-coherent crossbar @1.25GHz, 40GB/s

TABLE II: SoC microarchitectural parameters
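
The bandwidth figures in Table II follow directly from the stated clocks and widths; here is a quick back-of-the-envelope check in C (the x16 LPDDR4 channel width is my assumption, chosen to be consistent with the 25.6GB/s total):

#include <stdio.h>

int main(void) {
    /* System bus: 256-bit crossbar at 1.25GHz (half the 2.5GHz CPU clock). */
    double bus_gbs = (256.0 / 8.0) * 1.25;   /* 32 B/cycle * 1.25 GHz = 40 GB/s */
    /* DRAM: LP-DDR4 at a 1600MHz I/O clock transfers on both edges (3200 MT/s);
     * assuming x16 channels (2 bytes per transfer), 4 channels total. */
    double dram_gbs = 3.2 * 2.0 * 4.0;       /* 3.2 GT/s * 2 B * 4 ch = 25.6 GB/s */
    printf("bus: %.1f GB/s, dram: %.1f GB/s\n", bus_gbs, dram_gbs);
    return 0;
}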

Each PE operates on a different output feature map and contains a 32-way MACC unit for spatial reduction along the channel dimension. The dataflow is described in Figure 2. In total, there are 256 multipliers. Inputs and weights are 16-bit fixed point; outputs are accumulated in 32-bit fixed point and reduced to 16-bit before being written to the scratchpad. In the emerging vernacular commonly used to describe DNN dataflows, this dataflow is L0 weight-stationary (weights are reused at the register level within a MACC array) and L1 input/output-stationary (on every cycle, inputs are reread from the SRAM and outputs are accumulated in-place in the SRAM). It is backed by three distinct SRAMs, one each for inputs, weights, and outputs. We only model the core datapath and dataflow of NVDLA; other specific features like its convolution buffer are not modeled.
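
To make the dataflow concrete, here is a minimal C loop-nest sketch of one accumulation step, reconstructed from the description above (my reading of the text, not NVDLA RTL): 8 PEs each own one output feature map and hold a 32-long weight vector in registers (L0 weight-stationary), while outputs accumulate in-place in the scratchpad (L1 output-stationary).

#include <stdint.h>

#define PES   8    /* PEs, one output feature map each: 8 x 32 = 256 multipliers */
#define LANES 32   /* MACC width: reduction across 32 input channels */

void macc_step(const int16_t in[LANES],       /* one input-pixel channel slice  */
               const int16_t w[PES][LANES],   /* weights held in PE registers   */
               int32_t acc[PES]) {            /* partial sums, in-place in SRAM */
    for (int pe = 0; pe < PES; pe++) {        /* parallel across PEs in hardware */
        int32_t sum = 0;
        for (int c = 0; c < LANES; c++)       /* 32-way multiply-accumulate tree */
            sum += (int32_t)in[c] * (int32_t)w[pe][c];
        acc[pe] += sum;                       /* inputs reread each cycle; outputs
                                                 accumulate in the output SRAM   */
    }
}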

In addition to convolution, our accelerator also supports inner products by mapping them onto the convolution hardware. To reduce memory bandwidth requirements, the accelerator supports reading sparse compressed fully-connected weights, but they are first decompressed into the internal scratchpads before running the inner product. The compression format is based on the CSC format described by Han et al. [40]. Weight compression for convolutional layers is not supported, since much research in the machine learning and architecture communities has observed that they are harder to prune than fully-connected layers [23], [40]. Max and average pooling and batch normalization are natively supported, as are a range of activation functions.
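
A hedged sketch of the decompression step: CSC-style storage keeps only the nonzero weights, plus relative row offsets (runs of skipped zeros) and per-column pointers, loosely following Han et al. [40]. Field widths and names here are illustrative assumptions.

#include <stdint.h>
#include <string.h>

void csc_decompress(const int16_t *vals,     /* nonzero weight values               */
                    const uint8_t *rel_row,  /* zeros skipped before each value     */
                    const uint32_t *col_ptr, /* first-nonzero index per column,
                                                cols+1 entries                      */
                    int rows, int cols,
                    int16_t *dense) {        /* rows*cols output, column-major      */
    memset(dense, 0, sizeof(int16_t) * (size_t)rows * cols);
    for (int c = 0; c < cols; c++) {
        int r = 0;
        for (uint32_t k = col_ptr[c]; k < col_ptr[c + 1]; k++) {
            r += rel_row[k];                 /* skip the encoded run of zeros */
            dense[(size_t)c * rows + r] = vals[k];
            r++;
        }
    }
}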

The accelerator runs at 1GHz. It is attached to the system bus by a 256-bit bus for DMA operations. To support delegated coherency, we also add a special port called the Accelerator Coherency Port (ACP).

[Xi et al., in review]

Page 15: Harvard SoC Designs

CHIPKIT: Tutorial on Agile Research Test Chips @MICRO’19

microarch.org/micro52/program/workshops.html#chipkit

Page 16: Harvard SoC Designs

CHIPKIT: Tutorial on Agile Research Test Chips @MICRO’19

Outline

• Research test chips: fabrication routes, process technologies, project planning

• Test chip architectures: CPUs, peripherals, memories, interconnects and frameworks

• Design methodologies for custom blocks: Verilog, SystemVerilog, HLS and beyond

• Physical design flow: linting, synthesis, place and route, DRC/LVS, timing closure

• Bring-up and test: packaging, PCBs, clocking, testing flows

Supporting materials

• CHIPKIT – a collection of generators and RTL glue for rapid SoC design

• SM2 – a simple Arm Cortex-M0 based scaffold for rapid (and functional!) test chips

Page 17: Harvard SoC Designs

Arm research enablement offerings

SoC HW/SW co-development with DesignStart

• DesignStart Eval – Cortex-M0/M3-based systems, evaluation with obfuscated RTL

• DesignStart Pro Academic – Cortex-M0/M3-based systems, RTL for SoC design

Compute systems modelling and architecture exploration

• gem5 – CPU and system modelling

IP Building blocks*

• Design IP – CPUs, Interconnects, peripherals

• Physical IP – Standard cells, Memory compilers, POP IP

www.arm.com/resources/research/enablement

*Any logic Arm IP that is not part of DesignStart will be provided on a case by case basis, depending on the research project scope, objectives and alignment with Arm research agenda

Page 18: Harvard SoC Designs

Acknowledgments

Harvard University

• Faculty: Gu-Yeon Wei, David Brooks, Sasha Rush

• Many talented PhD students and post-docs

• Generous sponsors

People, papers, software

• vlsiarch.eecs.harvard.edu

• nlp.seas.harvard.edu

P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks and G.-Y. Wei, IEEE Int. Solid-State Circuits Conf. (ISSCC), 2017

P. N. Whatmough, S. K. Lee, D. Brooks and G.-Y. Wei, IEEE Journal of Solid-State Circuits (JSSC), 2018

S. K. Lee, P. N. Whatmough, N. Mulholland, P. Hansen, D. Brooks and G.-Y. Wei, IEEE European Solid-State Circuits Conf. (ESSCIRC), 2018

S. K. Lee, P. N. Whatmough, D. Brooks and G.-Y. Wei, IEEE Journal of Solid-State Circuits (JSSC), 2019

P. N. Whatmough, S. K. Lee, M. Donato, H.-C. Hsueh, S. L. Xi, U. Gupta, L. Pentecost, G. Ko, D. Brooks and G.-Y. Wei, Symposium on VLSI Circuits (VLSI), 2019

Page 19: Harvard SoC Designs

Confidential © 2017 Arm Limited

Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다! धन्यवाद!