Page 1
LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism
Tong Geng1,2, Tianqi Wang1, Chunshu Wu1, Chen Yang1, Shuaiwen Leon Song2, Ang Li2, Martin Herbordt1
1 Boston University, 2 Pacific Northwest National Laboratory
Page 2
CNN: the most popular algorithm in ML
▪ Widely used in computer vision tasks such as object classification, detection, and recognition
Page 3
Birth of BNN
▪ Limitations of CNN
▪ High computation and memory intensity
▪ Long latency
▪ Quantization path: 32-bit floating point -> 8-bit fixed point -> binary
▪ Computation and memory access are no longer intensive
▪ Potentially much lower latency
▪ Accuracy is still lower than full-precision CNNs, but it keeps improving
Page 4
DNN: BNN vs. CNN
✓ Easier Computation: floating-point multiply-add (FMA) operations ➔ single-bit XNOR/POPCOUNT (see the sketch below)
✓ Less Storage: floating-point parameters and activations ➔ a single bit each
✓ Energy Efficient: ideal for edge devices
“Accelerating Neural Networks with Binary Arithmetic,” https://ai.intel.com/accelerating-neural-networks-binary-arithmetic.
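To make the arithmetic concrete, here is a minimal C sketch (ours, not from the paper) of how a binary dot product replaces FMAs: with ±1 values bit-packed into a word (+1 → 1, −1 → 0), a single XNOR marks the positions where weight and activation agree, and a POPCOUNT converts the agreement count into the signed result.

#include <stdint.h>

/* Minimal sketch (hypothetical, 64 inputs wide): binarized values
 * are bit-packed, +1 -> 1 and -1 -> 0. XNOR marks positions where
 * weight and activation agree; the signed dot product is
 * (#agreements - #disagreements) = 2*popcount - 64.              */
static inline int bnn_dot64(uint64_t w, uint64_t a)
{
    uint64_t agree = ~(w ^ a);          /* bitwise XNOR            */
    return 2 * (int)__builtin_popcountll(agree) - 64;  /* GCC/Clang builtin */
}

One 64-bit XNOR plus one POPCOUNT thus stands in for 64 FMAs, which is where the compute and storage savings above come from.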
Page 5
BNN Inference on Different Platforms
• Hardware utilization when accelerating AlexNet inference:
• Batch size of 1:
× GPU: ~1%. CPU: ~6%.
• Batch size of 10:
× GPU: ~7%. CPU: ~10%.
✓ FPGA: >60%: millions of one-bit ALUs on a single device.
Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh, and Debbie Marr. 2016. Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC. In Proc. 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 77–84.
Page 6
Challenges
• Challenges in achieving truly low-latency inference on FPGAs:
1. The critical Normalization Layer (NL) uses full-precision floating point (i.e., 2 FP MUL/DIV + 3 FP ADD/SUB).
2. Existing works process layers sequentially, so their latencies accumulate with no overlap.
3. Optimal designs for all layers must be configured on the FPGA simultaneously, with no reconfiguration.
Page 8
Intra-layer Fusion
• Addressing the 1st challenge: intra-layer fusion (see the sketch after this list)
• 3-to-1 fusion: ACT, NL, and BL are fused into a single comparison layer.
• Simplified computation: 2 comparisons from ACT and BL plus 5 floating-point operations from NL become one integer comparison.
• Less storage: 4 floating-point variables in NL become 1 integer.
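A minimal sketch of the fusion, with hypothetical names: solving gamma·(x − mu)/sigma + beta ≥ 0 for the integer POPCOUNT sum x turns the four batch-norm floats into one precomputed integer threshold, with the comparison direction flipped when gamma is negative.

#include <math.h>

/* Sketch: fuse normalization + binarization into one integer test.
 * sign(gamma*(x - mu)/sigma + beta) == +1  iff  x >= t (gamma > 0)
 *                                         iff  x <= t (gamma < 0)
 * where t = mu - beta*sigma/gamma is precomputed offline.         */
typedef struct { int threshold; int flip; } FusedCmp;

static FusedCmp fuse_norm_binarize(float gamma, float beta,
                                   float mu, float sigma)
{
    float t = mu - beta * sigma / gamma;   /* 4 floats -> 1 integer */
    FusedCmp c;
    c.flip = (gamma < 0.0f);
    c.threshold = c.flip ? (int)floorf(t) : (int)ceilf(t);
    return c;
}

static int fused_activation(int x, FusedCmp c)    /* returns +1/-1 */
{
    int fire = c.flip ? (x <= c.threshold) : (x >= c.threshold);
    return fire ? +1 : -1;
}

At inference time the five floating-point operations collapse into the single integer comparison inside fused_activation.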
Page 9
Intra-layer Fusion for Networks with Shortcuts
• ResNet: the same fusion applies to networks with shortcut connections.
[Figure: (A) original ResNet block (CONV1/2/3 with POP and NORM stages) vs. (B) optimized ResNet block, in which each NORM-plus-binarization stage is simplified into a Threshold Lookup (TL).]
Page 10
Challenges
• Challenges in achieving truly low-latency inference on FPGAs:
1. The critical Normalization Layer (NL) uses full-precision floating point (i.e., 2 FP MUL/DIV + 3 FP ADD/SUB).
2. Existing works process layers sequentially, so their latencies accumulate with no overlap.
3. Optimal designs for all layers must be configured on the FPGA simultaneously, with no reconfiguration.
Page 11
Inter-layer Fusion
• Addressing the 2nd challenge: inter-layer fusion (see the sketch below)
• Fine-grained pipelining fuses the CONV layers and the 1st FC layer.
• An image is processed according to its data dependencies.
• Layers are processed in parallel ➔ their latencies overlap.
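A toy C sketch of the scheduling idea (hypothetical sizes, not the actual hardware pipeline): for a K×K convolution with stride 1, layer l+1 can compute its output row r as soon as layer l has produced input rows 0 … r+K−1, so both layers are busy in the same cycle instead of running back-to-back.

#include <stdio.h>

/* Toy model of fine-grained inter-layer pipelining: each step,
 * layer l emits one feature-map row; layer l+1 (KxK, stride 1)
 * starts as soon as its first K input rows exist, so the two
 * layers overlap and their latencies are no longer additive.   */
int main(void)
{
    const int H = 8, K = 3;                 /* hypothetical sizes */
    for (int t = 0; t < H; t++) {
        printf("t=%d: layer l -> row %d", t, t);
        if (t >= K - 1)                     /* dependency ready   */
            printf(" | layer l+1 -> row %d", t - K + 1);
        printf("\n");
    }
    return 0;
}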
Page 12
Challenges
• Challenges in achieving truly low-latency inference on FPGAs:
1. The critical Normalization Layer (NL) uses full-precision floating point (i.e., 2 FP MUL/DIV + 3 FP ADD/SUB).
2. Existing works process layers sequentially, so their latencies accumulate with no overlap.
3. Optimal designs for all layers must be configured on the FPGA simultaneously, with no reconfiguration.
Page 13
Workload Balancing
• Each layer's workload: NIC × NOC × K × K
• NIC = number of input channels
• NOC = number of output channels
• K = kernel size
• Terms: PIC, POC, SIC, SOC
• PIC = parallelism across input channels
• POC = parallelism across output channels
• SIC = input channels processed sequentially
• SOC = output channels processed sequentially
• Match the data production and consumption rates of adjacent layers by adjusting PIC, POC, SIC, and SOC (see the sketch below).
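A minimal sketch of the rate-matching arithmetic, under simplifying assumptions (unit stride, pooling ignored, SIC = NIC/PIC and SOC = NOC/POC): with PIC × POC MACs running in parallel, one output pixel costs SIC × SOC × K × K cycles, and adjacent layers are balanced when these per-pixel costs agree.

#include <stdio.h>

/* Sketch: per-output-pixel cycle count of a layer when PIC*POC
 * MACs work in parallel and the remaining channels are folded
 * sequentially (SIC = NIC/PIC, SOC = NOC/POC).                 */
static long cycles_per_pixel(long NIC, long NOC, long K,
                             long PIC, long POC)
{
    long SIC = NIC / PIC, SOC = NOC / POC;
    return SIC * SOC * K * K;
}

int main(void)
{
    /* Hypothetical layer shapes, chosen only for illustration. */
    long a = cycles_per_pixel(128, 128, 3, 32, 16);  /* 288 cyc */
    long b = cycles_per_pixel(128, 256, 3, 32, 32);  /* 288 cyc */
    printf("A: %ld cyc/px, B: %ld cyc/px -> %s\n", a, b,
           a == b ? "balanced" : "unbalanced");
    return 0;
}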
Page 14
Parameterized Architecture
• Addressing the 3rd challenge: a parameterized architecture
• The architecture is flexible enough to support load balancing.
• All layers are fully configured on the FPGA at once, using model parallelism.
Abbreviations: PE: Processing Element; MAC: Multiply-Accumulate Unit; TB: Threshold Buffer; SWM: Shared Weight Memory; VPB: Vertical Pooling Buffer; HPB: Horizontal Pooling Buffer; SIDSR: Shared Input Data Shift Registers; CP: Control Processor
Parameters: K: filter kernel size; W: width of the input feature map; P: pooling kernel size
[Figure: parameterized layer architecture — the SIDSR and the SWM bank sets feed #PIC XNOR units and a POPCOUNT tree inside each of the #POC PEs; results pass through the TB threshold comparison and accumulator, then the VPB/HPB pooling engines, all coordinated by the CP.]
Page 15
Evaluation: Latency Reduction from Intra-layer Fusion
[Figure: normalized latency breakdown (0–100%) across components XNOR, POPCOUNT, BN & SUM-FP, COMPARE, and OTHER, showing the reduction from intra-layer fusion.]
Page 16
Evaluation: Single-Layer Latency vs. Whole-Network Latency
[Figure: latency in μs (0–400) for each layer CONV1–CONV13 and FC1, and for the whole VGG network.]
Page 17
Evaluation: Hardware Resource Saving from Workload Balancing
Page 18
Evaluation: BNN Inference Latency using LP-BNN
Page 19
LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism