FPGA Smart Dust John McAllister Institute of Electronics, Communications and Information Technology (ECIT), Queen’s University Belfast [email protected]
Oct 05, 2020
FPGA Then & Now
Then… Virtex-II
Multipliers, Look-Up Tables, Block RAM
Then… VHDL / Verilog + Constraints/Directives
Synthesis (Synplify, XST)
Place and Route (ISE/Vivado)
Now… Virtex UltraScale
DSP Slices, Look-Up Tables, Block RAM
Now… C/C++ / SystemC + Constraints/Directives
High Level Synthesis Tool (Vivado)
VHDL / Verilog
The HLS Advantage
VHDL/Verilog
C/C++/SystemC
Why HLS?
Design abstraction
Design productivity
Design time
Control of results
Performance or efficiency
Fewer things for the designer to manage
Reduced from 100s or 1000s to 10s
FPGA Compute
[Figure: bar chart with two series, LUT MACs and DSP48E1 MACs, for devices V585T, V1500T, V2000T, VX330T, VX415T, VX485T, VX550T, VX690T, VX980T, VX1140T, VH290T, VH580T, VH870T; vertical axis from 0.00 to 100.00.]
An Alternative
Software
Constraints/Directives
Architectural Synthesis
Compilation
Processing Elements
FPGA Processors
Vector coprocessor
Conventional Soft Processors (e.g. MicroBlaze, Nios, MIPS, LEON)
Lean Processors (e.g. iDEA)
‘Smart Dust’ Processing Elements
90 LUTs
Scaling Up
Pile 'em Up
Point-to-Point FIFO Connection
[Diagram: an Interface Controller and repeated groups of processing elements (PE PE PE ...), each group shown with an SPU.]
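As an illustration of the point-to-point FIFO idea, here is a minimal functional sketch in Python. It is not the architecture from the slides: the `PE` class, the `run_pipeline` helper and the unbounded FIFOs are assumptions made purely for illustration, and the model is not cycle-accurate.

```python
from collections import deque


class PE:
    """Hypothetical processing element: one input FIFO, one output FIFO."""

    def __init__(self, func, inbox, outbox):
        self.func = func
        self.inbox = inbox
        self.outbox = outbox

    def step(self):
        """Process at most one token; return True if any work was done."""
        if not self.inbox:
            return False
        self.outbox.append(self.func(self.inbox.popleft()))
        return True


def run_pipeline(funcs, tokens):
    """Chain one PE per function, linked by point-to-point FIFOs (assumed unbounded)."""
    fifos = [deque(tokens)] + [deque() for _ in funcs]
    pes = [PE(f, fifos[i], fifos[i + 1]) for i, f in enumerate(funcs)]
    busy = True
    while busy:
        # The list forces every PE to step once per iteration.
        busy = any([pe.step() for pe in pes])
    return list(fifos[-1])


if __name__ == "__main__":
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

Each PE only ever touches its own two FIFOs, which is the property that lets many small elements be piled up without a shared bus in this toy model.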
Tree Search
Preprocessing (QR Decomposition)
When It Worked
Tree Search
Preprocessing (QR Decomposition)
108
As Good As Custom RTL Circuits

Realisation                      Throughput (Mbps)   DSP48e   LUTs     BRAM
FPE                              502.5               144      16,601   0
Barbero & Thompson, ICC '08      600                 160      13,197   49
Qi & Chakrabarti, SiPS '10       200                 64       18,893   12
Wu & Masera, Euromicro DSD '10   27.7                0        6,587    0
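One way to read the table is to normalise throughput by resource cost. The short Python sketch below does that arithmetic using the figures from the table. The two metrics (Mbps per thousand LUTs and Mbps per DSP48e) are an assumption introduced here for illustration, not metrics quoted on the slides.

```python
# Figures copied from the comparison table; the metric choice is illustrative.
designs = {
    "FPE": {"mbps": 502.5, "dsp48e": 144, "luts": 16601, "bram": 0},
    "Barbero & Thompson, ICC '08": {"mbps": 600.0, "dsp48e": 160, "luts": 13197, "bram": 49},
    "Qi & Chakrabarti, SiPS '10": {"mbps": 200.0, "dsp48e": 64, "luts": 18893, "bram": 12},
    "Wu & Masera, Euromicro DSD '10": {"mbps": 27.7, "dsp48e": 0, "luts": 6587, "bram": 0},
}


def mbps_per_kilo_lut(design):
    """Throughput per 1000 LUTs."""
    return design["mbps"] / (design["luts"] / 1000.0)


def mbps_per_dsp(design):
    """Throughput per DSP48e slice, or None when no DSP slices are used."""
    if design["dsp48e"] == 0:
        return None
    return design["mbps"] / design["dsp48e"]


if __name__ == "__main__":
    for name, d in designs.items():
        per_dsp = mbps_per_dsp(d)
        dsp_text = "n/a" if per_dsp is None else f"{per_dsp:.2f}"
        print(f"{name}: {mbps_per_kilo_lut(d):.1f} Mbps/kLUT, {dsp_text} Mbps/DSP48e")
```

Neither ratio accounts for BRAM, where the FPE row uses none, so a single figure of merit should be treated with caution.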
When It Didn’t: Low Compute/Communication Ratio
Low Compute/Data Access Ratio Is A Problem
[Figure panels: SIMD FFT, MIMD FFT]
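To make the ratio concrete, the sketch below computes an idealised compute/data-access ratio for a radix-2 FFT. It assumes the conventional estimate of 5·N·log2(N) real operations for an N-point complex FFT, and that every input and output word is moved exactly once. A dense N×N matrix multiplication (2·N³ operations, three N×N matrices moved) is included only as a point of comparison. The per-PE ratios behind the slides may be defined differently.

```python
import math


def fft_ops_per_word(n):
    """Idealised real operations per real word moved for an n-point complex FFT."""
    ops = 5 * n * math.log2(n)   # conventional radix-2 operation estimate
    words = 4 * n                # n complex inputs + n complex outputs, 2 reals each
    return ops / words


def matmul_ops_per_word(n):
    """Idealised operations per word moved for an n x n matrix product."""
    ops = 2 * n ** 3             # n^3 multiplies and n^3 additions
    words = 3 * n ** 2           # read A and B, write C
    return ops / words


if __name__ == "__main__":
    for n in (64, 128, 256, 512):
        print(n, fft_ops_per_word(n))
    print(1024, matmul_ops_per_word(1024))
```

Under these assumptions the FFT ratio grows only with log2(N), so each word fetched supports little arithmetic, and any per-word communication overhead weighs heavily on the total.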
The Issue
Streaming Processing Elements
Stream Processing
The Effect of Streaming
[Figure panels: SIMD FFT, MIMD FFT]
When It Didn’t: Large Data Objects
1024² Matrix-Matrix Multiplication
CIF Full Search Motion Estimation
Token Processing
Block Memory Access & Zero-Overhead Repeat
Dramatic Reductions in No. of Instructions
1024² Matrix-Matrix Multiplication

Class   FPE      sFPE   δ (%)
ALU     32768    32     -99.9
COMM    2048     6      -99.7
CTRL    559      4      -99.7
NOP     0        4
Total   32375    54     -99.8

CIF Full Search Motion Estimation

Class   FPE      sFPE   δ (%)
ALU     268353   26     -99.9
COMM    2467     14     -99.4
CTRL    12582    12     -99.9
NOP     1026     6      -99.6
Total   284428   58     -99.9
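The toy model below illustrates why a repeat instruction combined with block memory access can shrink a stored program so sharply. It is a sketch under stated assumptions: the instruction names (`LD`, `MAC`, `RPT`, `MAC_BLK`) and their class labels are hypothetical and are not the sFPE instruction set. The kernel is a simple dot product.

```python
from collections import Counter


def unrolled_program(n):
    """Straight-line code: one operand fetch and one multiply-accumulate per element."""
    prog = []
    for i in range(n):
        prog.append(("COMM", "LD", i))      # fetch a[i] and b[i] into registers
        prog.append(("ALU", "MAC", None))   # acc += reg_a * reg_b
    return prog


def repeat_program(n):
    """Same work with a hypothetical repeat instruction and auto-incrementing access."""
    return [
        ("CTRL", "RPT", n),          # repeat the next instruction n times
        ("ALU", "MAC_BLK", None),    # MAC on a[ptr], b[ptr], then ptr += 1
    ]


def execute(prog, a, b):
    """Interpret either program and return the accumulated dot product."""
    acc, ptr, reg_a, reg_b, pc = 0, 0, 0, 0, 0
    while pc < len(prog):
        _, op, arg = prog[pc]
        if op == "LD":
            reg_a, reg_b = a[arg], b[arg]
        elif op == "MAC":
            acc += reg_a * reg_b
        elif op == "RPT":
            pc += 1  # the repeated instruction follows RPT
            if prog[pc][1] != "MAC_BLK":
                raise ValueError("this toy model only repeats MAC_BLK")
            for _ in range(arg):
                acc += a[ptr] * b[ptr]
                ptr += 1
        else:
            raise ValueError(f"unknown op {op}")
        pc += 1
    return acc


def class_counts(prog):
    """Count stored instructions per class."""
    return Counter(cls for cls, _, _ in prog)


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [5, 6, 7, 8]
    print(execute(unrolled_program(4), a, b), execute(repeat_program(4), a, b))
    print(len(unrolled_program(1024)), len(repeat_program(1024)))
```

Both programs perform the same arithmetic, but the stored instruction count of the repeat version stays constant as the vector grows, while the unrolled version grows linearly with it.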
1024² Matrix Multiplication
[Figure: four bar charts comparing sFPE, FPE, VEGAS and VENICE; panels: ×10⁶/s, LUTs, DSP48e, BRAM.]
Full Search Motion Estimation
[Figure: four bar charts comparing sFPE, FPE, VIPERS, VEGAS and VENICE; panels: Frames/s, LUTs, DSP48e, BRAM.]
FFTs
[Figure: four bar charts comparing sFPE, Spiral and Xilinx for FFT sizes 64, 128, 256 and 512; panels: Frames/s, LUTs, DSP48e, BRAM.]
Summary
Goal: productivity gain with performance/cost benefit.
One instance: multicores, requiring a designer to handle tens of components.
HLS undermines a key reason for using FPGA.
Domain-specific, configurable and programmable RTL components?
Are there others?
Good performance/cost, much greater productivity.