Scaling-up Edge Computing with PULP
A many-core Platform for Micropower in-Sensor Analytics
Southampton 19.01.2018
Davide Rossi 1, Antonio Pullini 2, Igor Loi 1, Davide Schiavone 2, Francesco Conti 1, Florian Glaser 1, Florian Zaruba 2, Stefan Mach 2, Giovanni Rovere 2, Germain Haugou 2, Manuele Rusci 1, Alessandro Capotondi 1, Giuseppe Tagliavini 1, Daniele Palossi 2, Andrea Marongiu 1,2, Fabio Montagna 1, Victor Javier Kartsch Morinigo 1, Simone Benatti 1, Lei Li 2, Renzo Andri, Lukas Cavigelli, Eric Flamand 2, Frank K. Gürkaynak 2, Andreas Kurth, Pirmin Vogel, Luca Benini 1,2
1 Department of Electrical, Electronic and Information Engineering
2 Integrated Systems Laboratory
3x Cost reduction if data volume is reduced by 95%
||
NSP on MCUs?
High performance MCUs
Low-Power MCUs
Courtesy of J Pineda, NXP + Updates
1 pJ/OP: InceptionV3 @ 2 fps in 10 mW
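The budget on this slide can be checked with simple arithmetic: 10 mW at 2 fps leaves 5 mJ per frame, and at 1 pJ/OP that pays for about 5 GOP per frame. A minimal sketch of that calculation follows; the function names are illustrative and not from the talk.

```python
def energy_per_frame_j(power_w: float, fps: float) -> float:
    """Energy available for each frame at a given average power and frame rate."""
    return power_w / fps


def ops_per_frame(power_w: float, fps: float, joules_per_op: float) -> float:
    """Operations affordable per frame at a given energy cost per operation."""
    return energy_per_frame_j(power_w, fps) / joules_per_op


# Figures quoted on the slide: 10 mW, 2 fps, 1 pJ/OP.
budget = ops_per_frame(power_w=10e-3, fps=2.0, joules_per_op=1e-12)
print(f"{budget / 1e9:.1f} GOP per frame")
```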
||
Parallel Ultra-Low Power
||
Outline
Near Threshold Multiprocessing
Non-Von Neumann Accelerators
Aggressive Approximation
From Frame-based to Event-based Processing
Outlook and Conclusion
||
Minimum Energy Operation
[Figure: energy per cycle (nJ) versus logic Vcc / memory Vcc (V), with curves for total energy, leakage energy and dynamic energy; 32nm CMOS, 25°C; annotated 4.7X.]
Source: Vivek De, INTEL – Date 2013
Near-Threshold Computing (NTC):
1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
3. Manage leakage + process/temperature variations in NT!!!
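Why a minimum-energy voltage exists can be illustrated with a toy model: dynamic energy falls as C·V², while leakage energy per cycle rises as the clock slows down near threshold. The sketch below uses an alpha-power-law frequency model and a constant leakage current. Every parameter value is a placeholder chosen for illustration, not fitted to the Intel data.

```python
def energy_per_cycle(v, c_eff=1e-9, i_leak=50e-3, v_th=0.3, alpha=1.5, k=2e9):
    """Return (dynamic, leakage, total) energy per cycle in joules at supply v.

    All parameter values are illustrative placeholders, not measured data.
    """
    freq = k * (v - v_th) ** alpha / v   # alpha-power-law clock frequency [Hz]
    e_dyn = c_eff * v * v                # switching energy C * V^2
    e_leak = i_leak * v / freq           # leakage power integrated over one cycle
    return e_dyn, e_leak, e_dyn + e_leak


def minimum_energy_point(v_lo=0.35, v_hi=1.2, steps=851):
    """Grid-search the supply voltage that minimizes total energy per cycle."""
    best_v, best_e = None, float("inf")
    for i in range(steps):
        v = v_lo + (v_hi - v_lo) * i / (steps - 1)
        total = energy_per_cycle(v)[2]
        if total < best_e:
            best_v, best_e = v, total
    return best_v, best_e


v_opt, e_opt = minimum_energy_point()
print(f"minimum-energy supply is about {v_opt:.2f} V in this toy model")
```

With these placeholder values the minimum sits a little above the threshold voltage, and running at the nominal 1.2 V costs several times more energy per cycle. Recovering the lost clock frequency is what the parallel execution in point 2 is for.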
||
Near-Threshold Multiprocessing
[Figure: PULP cluster block diagram. N cores (PE0 ... PEN-1), each a 4-stage RISC-V RV32IMC+; instruction cache banks I$B0 ... I$Bk; DEMUX; shared L1 data memory (TCDM + T&S, banks MB0 ... MBM) with atomic variables; tightly coupled DMA and hardware synchronizer; peripherals and external memory.]
D. Rossi et al., "Energy-Efficient Near-Threshold Parallel Computing: The PULPv2 Cluster," in IEEE Micro, Sep./Oct. 2017.
||
Near-Threshold Multiprocessing
NT but parallel: max. energy efficiency when active + strong PM for (partial) idleness
1..8 PEs per cluster, 1..32 clusters
PGAS machine
PMCA-managed IOMMU
||
ULP (NT) Bottleneck: Memory
"Standard" 6T SRAMs:
  High VDDMIN
  Bottleneck for energy efficiency
  >50% of energy can go here!!!
Near-Threshold SRAMs (8T):
  Lower VDDMIN
  Area/timing overhead (25%-50%)
  High active energy
  Low technology portability
Standard Cell Memories:
  Wide supply voltage range
  Lower read/write energy (2x - 4x)
  High technology portability
  Major area overhead: 4x, 2.7x with controlled placement
[Figure: 256x32 6T SRAMs vs. SCM, annotated 2x-4x.]
A. Teman et al., "Power, Area, and Performance Optimization of Standard Cell Memory Arrays Through Controlled Placement," in ACM TODAES, May 2016.
||
I$: a Look Into ‘Real Life’ Applications
Issues:
1) Area overhead of SCMs (4Kb/core not affordable...)
2) Capacity misses (with small caches)
3) Jumps due to runtime (e.g. OpenMP, OpenCL) and other function calls
Applications on PULP:
  SHORT JUMP: LOOP BASED APPLICATIONS
  LONG JUMP: LIBRARY BASED APPLICATIONS
Survey of State of The Art: existing ULP processors with latch-based I$
  REISC (ESSCIRC 2011): 64b
  Sleepwalker (ISSCC 2012): 128b
  Bellevue (ISCAS 2014): 128b
SCM-BASED I$ IMPROVES EFFICIENCY BY ~2X ON SMALL BENCHMARKS, BUT…
||
Shared I$
Shared instruction cache
  OK for data parallel execution model
  Not OK for task parallel execution model, or very divergent parallel threads
Architectures
  SP: single-port banks connected through a read-only interconnect
    Pros: Low area overhead
    Cons: Timing pressure, contention
  MP: multi-ported banks
    Pros: High efficiency
    Cons: Area overhead (several ports)
Results
  Up to 40% better performance than private I$
  Up to 30% better energy efficiency
  Up to 20% better energy*area efficiency
I. Loi et al., "The Quest for Energy-Efficient I$ Design in Ultra-Low-Power Clustered Many-Cores," in IEEE Transactions on
Extending RISC-V for NSP
1. HW loops and post-modified LD/ST
2. Bit manipulations
3. Packed-SIMD ALU operations with dot product
4. Rounding and normalization
5. Shuffle operations for vectors
Small power and area overhead
RISC-V core versions:
  V1: baseline RISC-V RV32IMC
  V2: HW loops, post-modified load/store, MAC
  V3: SIMD 2/4 + dot product + shuffling, bit manipulation unit, lightweight fixed point
M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," in IEEE TVLSI, Oct. 2017.
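To make the packed-SIMD dot product concrete, the sketch below emulates a 4x8-bit signed dot product with accumulation on 32-bit words. It is a behavioural illustration only: the helper names are invented here, the accumulator is not wrapped to 32 bits, and it does not reproduce the actual RI5CY instruction encoding or semantics.

```python
def unpack_i8x4(word: int) -> list:
    """Split a 32-bit word into four signed 8-bit lanes, lowest byte first."""
    lanes = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        lanes.append(byte - 256 if byte >= 128 else byte)
    return lanes


def pack_i8x4(lanes: list) -> int:
    """Pack four signed 8-bit values into one 32-bit word, lowest byte first."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def sdotp_i8x4(acc: int, a: int, b: int) -> int:
    """Add the dot product of two packed 4x8-bit vectors to acc (hypothetical helper)."""
    return acc + sum(x * y for x, y in zip(unpack_i8x4(a), unpack_i8x4(b)))


a = pack_i8x4([1, -2, 3, -4])
b = pack_i8x4([5, 6, -7, 8])
print(sdotp_i8x4(10, a, b))  # four multiplies and four adds in one step
```

One such operation replaces four scalar multiply-accumulate steps, which is where the gain from the V3 extensions on 8-bit data comes from.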
DVFS            | no               | yes             | yes
I$              | 4kB SRAM private | 4kB SCM private | 4kB SCM shared
DSP Extensions  | no               | no              | yes
HW Synchronizer | no               | no              | yes
2.6 pJ/OP
||
PULPv3 Demo Board
  PULPv3 board featuring voltage regulator (down to 0.5V)
  Peltier element to heat up / cool the samples
  TEC controller to drive the Peltier element
  Embecosm MAGEEC Energy Monitoring Shield for power measurements
  JTAG programmer for loading code on PULPv3
  Host laptop to show plots
>2x leakage reduction @ 70°C
Extended operating range for zero-margin design (i.e. signoff in typical corner)
||
Full application with PULP
Is PULP really needed, or is processing light enough for a ULP MCU?
Case study: real-time seizure detection
  23 EEG channels
  Implemented on PULPv3
  6x speedup on 8 processors
Near-threshold operation and efficient architecture
+ Parallelization and workload distribution
= Sub-mW operation (months of battery lifetime), real-time even under msec constraints
Room for ↑↑ ExG channels (>128) and sensor fusion
70x lower energy than Ambiq MCU
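The reported 6x speedup on 8 processors corresponds to a parallel efficiency of 75%. If one additionally assumes that Amdahl's law describes the workload (an assumption made here, not a claim from the talk), the implied serial fraction is just under 5%. A small sketch:

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup: float, cores: int) -> float:
    """Serial fraction f that yields `speedup` on `cores` under Amdahl's law.

    Solves speedup = 1 / (f + (1 - f) / cores) for f.
    """
    return (cores / speedup - 1.0) / (cores - 1.0)


print(f"efficiency: {parallel_efficiency(6.0, 8):.0%}")
print(f"implied serial fraction: {amdahl_serial_fraction(6.0, 8):.3f}")
```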
||
Not only NT: GAP8 (55nm TSMC) & Mr. Wolf (45nm TSMC)
PULP’s commercial “big brother”, TSMC55 (available in March)
[Figure: GAP8 SoC block diagram, AQFN84 package. Cluster: RISC-V cores #0 ... #7, shared I$, TCDM banks #0 ... #M-1 behind the TCDM interconnect, DMA, HWCE, cluster peripherals, cluster bus, DC FIFOs and bridge towards the SoC. SoC: FC subsystem, L2, ROM, SoC bus, SoC control, uDMA, peripheral interconnect, SoC APB, AXI-to-MEM and AXI-to-APB with MPU filters, debug units, PMU, DC/DC, RTC, BOR, FLL x 2, pad mux. I/O: SPIM x 2, Q-SPIM, SPI S, I2C x 2, I2S x 2, UART, GPIO, PWM x 4, 10b parallel (CPI), LVDS I/Q, Serial I/Q, Hyper-Bus, JTAG.]
||
Recovering More Silicon Efficiency
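[Figure: energy-efficiency scale in GOPS/W with ticks 1, 3, 6 and >100, spanning CPU (SW, general-purpose computing), GPGPU (mixed, throughput computing) and HW IP (HW); the distance between them is labelled "Accelerator Gap". Annotations: ULP parallel computing; <1 pJ/OP.]
Closing The Accelerator Efficiency Gap with Agile Customization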
||
Heterogeneous PULP Cluster
||
Heterogeneous PULP SoC: Fulmine
Fulmine: Hardware Convolutional Engine (HWCE) in the Cluster
~6.9 mm2 PULP chip, 1/2016
  4 cores
  1 HWCRYPT accelerator
  1 HWCE for 3D conv layers
  64kB L1, 192kB L2
  QSPI (M/S), I2C, I2S, UART
~1514 kGE for the cluster
  232 kGE for the CNN HWCE
Crypto Engine: Secure Analytics
[Figure: HWCE block diagram with CTRL, line buffer, scalable-precision SOP units and shared-memory interface.]
F. Conti et al., "An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics," in IEEE TCAS-I, Sept. 2017.
||
Heterogeneous PULP CNN Performance
[Figure: two log-scale bar charts, PERFORMANCE [Mop/s] and ENERGY EFFICIENCY [Mop/s/W], comparing SW (1 core, 4 cores, vectorized) with HWCE (16b, 8b and 4b weights). Annotations: 218x, 84x, 5x.]
Cluster performance and energy efficiency on a 64x64 CNN layer (5x5 conv)
Scaled to ST FD-SOI 28nm @ Vdd=0.6V, f=115MHz
13 GOPS, 3250 GOPS/W
0.3 pJ/OP
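The three headline numbers are consistent with one another: 3250 GOPS/W is the reciprocal of about 0.31 pJ/OP, and 13 GOPS at that efficiency implies roughly 4 mW of power. A short check, with illustrative helper names:

```python
def pj_per_op(gops_per_watt: float) -> float:
    """Convert an efficiency in GOPS/W to an energy cost in pJ per operation."""
    return 1e12 / (gops_per_watt * 1e9)


def power_mw(gops: float, gops_per_watt: float) -> float:
    """Power in mW implied by a throughput and an efficiency."""
    return gops / gops_per_watt * 1e3


print(f"{pj_per_op(3250):.2f} pJ/OP")
print(f"{power_mw(13, 3250):.1f} mW")
```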
||
SoP-Unit Optimization
Image Mapping (3x3, 5x5, 7x7)
Equivalent for 7x7 SoP
[Figure: SoP unit fed by the ImageBank and the FilterBank.]
1 MAC Op = 2 Op (1 Op for the “sign-reverse”, 1 Op for the add).
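The operation count above can be illustrated with a sum of products whose weights are restricted to +1 and -1: the multiply degenerates into an optional sign reversal, followed by an add. This is a software sketch of the counting convention only, not a model of the hardware datapath, and the function name is invented here.

```python
def binary_weight_sop(pixels, weights):
    """Sum of products with weights restricted to +1 / -1.

    Returns (result, op_count), counting 2 ops per MAC:
    one for the sign reversal and one for the add.
    """
    acc = 0
    ops = 0
    for pixel, weight in zip(pixels, weights):
        if weight not in (1, -1):
            raise ValueError("binary weights must be +1 or -1")
        term = -pixel if weight < 0 else pixel  # sign-reverse replaces the multiply
        acc += term                             # add
        ops += 2
    return acc, ops


result, ops = binary_weight_sop([3, 5, 2], [1, -1, -1])
print(result, ops)
```

Under this convention a single 7x7 window is 49 MACs, or 98 operations.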
||
Origami, YodaNN vs. Human
Type                 | Analog (bio) | Q2.9 Precision | Q2.9 Precision | Binary-Weight
Network              | human        | ResNet-34      | ResNet-18      | ResNet-18
Top-1 error [%]      | -            | 21.53          | 30.7           | 39.2
Top-5 error [%]      | 5.1          | 5.6            | 10.8           | 17.0
Hardware             | Brain        | Origami        | Origami        | YodaNN
Energy-eff. [uJ/img] | 100,000(*)   | 1086           | 543            | 31
The «energy-efficient AI» challenge (e.g. Human vs. IBM Watson)
Game over for humans also in energy-efficient vision? Not yet!
  Numbers above assume on-chip storage (1MB binary weights)
  Object recognition is a very basic task!
*Pbrain = 10W, 10% of the brain used for vision, trained human @ 10img/sec
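The starred figure for the human follows directly from the footnote: 10 W x 10% / 10 img/s = 0.1 J, i.e. 100,000 uJ per image. The same arithmetic as a sketch, with parameter names chosen here for illustration:

```python
def brain_energy_uj_per_image(p_brain_w=10.0, vision_fraction=0.10, images_per_s=10.0):
    """Energy per recognized image in microjoules, using the footnote's assumptions."""
    return p_brain_w * vision_fraction / images_per_s * 1e6


human = brain_energy_uj_per_image()
yodann = 31.0  # uJ/img, from the table
print(f"human: {human:.0f} uJ/img, ratio to YodaNN: {human / yodann:.0f}x")
```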
||
Back to System-Level
Event-driven computation, which occurs only when relevant events are detected by the sensor
Event-based sensor interface to minimize IO energy (vs. frame-based interface)
Mixed-signal event triggering with a ULP imager with internal AMS processing capability
[Figure: GrainCam (mixed-signal event-based imager) connected to PULPv3 (digital parallel processor).]
Smart Visual Sensor idle most of the time (nothing interesting to see)
A Neuromorphic Approach for doing nothing VERY well
||
GrainCam Readout
Readout modes:
  IDLE: read out the counter of asserted pixels
  ACTIVE: send out the addresses of asserted pixels (address-coded representation), in raster scan order
Event-based sensing: output frame data bandwidth depends on the external context activity
[Figure: frame-based readout versus event-based readout, the latter as a list of pixel addresses {x0, y0}, {x1, y1}, {x2, y2}, {x3, y3}, ..., {xN-1, yN-1}.]
Ultra Low Power Consumption e.g. 10-20uW @10fps
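The two readout modes can be sketched on a binary frame: IDLE returns only the number of asserted pixels, while ACTIVE returns their addresses in raster-scan order, so the output size tracks scene activity rather than frame size. The data layout below (a list of rows, (x, y) tuples) is an assumption for illustration and not the sensor's actual wire format.

```python
def idle_readout(frame):
    """IDLE mode: only the count of asserted pixels is read out."""
    return sum(1 for row in frame for pixel in row if pixel)


def active_readout(frame):
    """ACTIVE mode: (x, y) addresses of asserted pixels in raster-scan order."""
    return [(x, y)
            for y, row in enumerate(frame)
            for x, pixel in enumerate(row)
            if pixel]


frame = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
]
print(idle_readout(frame))    # number of events in this frame
print(active_readout(frame))  # one address per event, row by row
```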
||
Power Management
[Figure: power-management timeline for GrainCam and PULP over the uDMA interface. GrainCam: IDLE, switching to ACTIVE when #events > threshold, then READOUT. PULP: Deep Sleep, then Data Transfer on the wake-up event and Processing on the data event, then Deep Sleep again. Power annotations: 10 μW @ 10 fps, 7 μW, 10-20 μW @ 10 fps, 2.88 mW active power @ 0.55V, 81MHz.]
M. Rusci et al., "A Sub-mW IoT-Endnode for Always-On Visual Monitoring and Smart Triggering," in IEEE IoT Journal, 2017.
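The triggering rule and its effect on average power can be sketched as follows. The sketch assumes that the 7 μW annotation is the processor's deep-sleep power and that 2.88 mW is its active power; the duty cycle and the function names are hypothetical.

```python
def next_sensor_state(num_events: int, threshold: int) -> str:
    """Imager stays IDLE unless the event count exceeds the threshold."""
    return "ACTIVE" if num_events > threshold else "IDLE"


def average_power_w(duty_cycle: float, p_active_w: float = 2.88e-3,
                    p_sleep_w: float = 7e-6) -> float:
    """Average processor power for a given fraction of time spent active.

    Default power values are taken from the slide annotations under the
    assumption stated above; duty_cycle is a free parameter.
    """
    return duty_cycle * p_active_w + (1.0 - duty_cycle) * p_sleep_w


print(next_sensor_state(num_events=12, threshold=8))
print(f"{average_power_w(0.01) * 1e6:.1f} uW at 1% activity")
```

At low activity the average is dominated by the sleep power, which is why doing nothing efficiently matters as much as the active-mode efficiency.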
||
PULP: An Open Source Parallel Computing Platform
PULP Hardware and Software released under Solderpad License
Compiler Infrastructure
Processor & Hardware IPs
Virtualization Layer
Programming Model
Low-Power Silicon Technology
Started in 2013
||
RISC-V cores available in PULP
32 bit:
  Low cost cores: Zero-riscy (RV32-ICM), Micro-riscy (RV32-CE)
  Core with DSP enhancements: RI5CY (RV32-ICMX), with SIMD, HW loops, bit manipulation, fixed point
  Floating-point capable core: RI5CY + FPU (RV32-ICMFX)
64 bit:
  Linux capable core: Ariane (RV64-ICM)
Core size annotations (kGE): 11, 18, 40, 200, 70
||
PULP Open-Source Release and External Contributions
Releases
  February 2016: First release of PULPino, our single-core microcontroller
  May 2016: Toolchain and compiler for our RISC-V implementation (RI5CY), DSP extensions
  August 2017: PULPino updates, new cores Zero-riscy and Micro-riscy, FPU, toolchain updates
  End of 2017: PULPino v2 + SDK and Virtual Platform, smart peripherals (uDMA ready), support for Verilator, new event unit
  February 2018: PULPissimo, ARIANE, PULP
Community Contributions
  June 2017: Porting of Verilator and BEEBS benchmarks to PULPino, https://github.com/embecosm/ri5cy
  September 2017: Porting of ARM CMSIS to PULPino, https://github.com/misaleh/CMSIS-DSP-PULPino
  November 2017: Numerous bug fixes to RiscV in PULPino, https://github.com/pulp-platform/riscv
  December 2017: STING: Open-Source Verification Environment for PULPino, http://valtrix.in/programming/running-sting-on-pulpino
||
Poseidon: enter 22FDX
First 22FDX test-chip of the labs (taped out Jan 12th)
Goals:
  Explore performance/energy efficiency:
    In the low-power domain: LVT
    In the high-performance domain: SLVT
  Explore the voltage range of the technology (LVT + SLVT)
  Explore body biasing
Test SoCs:
  Quentin: ultra-low-power 32-bit MCU + XNOR accelerator
  Kerbin: high-performance (1GHz) 64-bit processor
  Hyperdrive: Binary Convolutional Networks accelerator
Chip architecture:
  Signal pads are multiplexed among the three macros (only one active at the same time)
  Body bias pads shared among all macros
  Power supply pads are private for each SoC, with independent logic, memory arrays and memory periphery supplies for accurate power measurements
[Figure: Poseidon chip layout with the QUENTIN, KERBIN and HYPERDRIVE macros.]
||
Conclusion
Near-sensor processing
  Energy efficiency requirements: pJ/OP and below
  Technology scaling alone is not doing the job for us
  Ultra-low power architecture and circuits are needed
CNNs and heavy DSP functions can be squeezed into a mW envelope
  Non-von-Neumann acceleration
  Very robust to low precision computations (deterministic and statistical)
  fJ/OP is in sight!
More than CNN is needed (e.g. non-linear functions, online optimization)
Open Source HW & SW approach: innovation ecosystem