1/21
Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
Chen Huang and Frank Vahid
Dept. of Computer Science and Engineering, University of California, Riverside, USA
{chuang,vahid}@cs.ucr.edu
This work was supported in part by NSF CNS-1016792
Chen Huang UC Riverside
2/21
Outline
Haar-feature based object detection algorithm
Custom design space exploration: Feature mapping problem
Experimental results
3/21
Haar-feature based object detection algorithm
Original image (X axis: 0 to 320, Y axis: 0 to 240) and scaled images
A 20x20 sub-window moves across each image
(320 – 20) * (240 – 20) = 66,000 sub-windows
Faces detected on different scales
Face found
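The sub-window count above can be reproduced with a few lines of Python. This is only a sketch: the function names and the scaling factor of 1.25 are assumptions for illustration, since the slide states only the image size, the 20x20 window and the 66,000 figure.

```python
def count_subwindows(width, height, window=20):
    """Sub-window positions, counted as on the slide: (W - w) * (H - w)."""
    if width < window or height < window:
        return 0
    return (width - window) * (height - window)


def scaled_sizes(width, height, factor=1.25, window=20):
    """Yield successively smaller image sizes while the window still fits.

    The factor 1.25 is an assumed value, not one given in the slides.
    """
    w, h = width, height
    while w >= window and h >= window:
        yield w, h
        w, h = int(w / factor), int(h / factor)


original = count_subwindows(320, 240)
total = sum(count_subwindows(w, h) for w, h in scaled_sizes(320, 240))
print(original)  # 66000
```

Summing over all scaled images (`total`) shows why the per-sub-window work has to be cheap: every scale adds its own set of positions.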
4/21
Face detection in sub-window
Facial Haar features are evaluated on the 20 x 20 sub-window: Pass or Fail
Calculate Haar-feature value: Pixel_Sum(Rect_W) – Pixel_Sum(Rect_B)
Constant time Pixel_Sum calculation
Integral image: stores Pixel sum of Rect (from top-left corner to this point)
Original image:
1 1 1
1 1 1
1 1 1
Integral image:
1 2 3
2 4 6
3 6 9
Need 4 corner values of R1: P1 (top-left), P2 (top-right), P3 (bottom-left), P4 (bottom-right)
Pixel_Sum(R1) = P4 - P2 - P3 + P1 = 4
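A minimal Python sketch of the integral image and the four-corner Pixel_Sum follows. The function names and the extra zero row and column (which avoid special cases at the image border) are assumptions, not part of the slides. With the 3x3 image of ones, the lower-right 2x2 rectangle gives 9 - 3 - 3 + 1 = 4, which is one placement of R1 consistent with the slide's result.

```python
def integral_image(img):
    """Integral image with one extra zero row and column at the top and left."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii


def pixel_sum(ii, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 from 4 corner values."""
    p1, p2 = ii[top][left], ii[top][right]
    p3, p4 = ii[bottom][left], ii[bottom][right]
    return p4 - p2 - p3 + p1


def haar_value(ii, rect_w, rect_b):
    """Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B); rects are (top, left, bottom, right)."""
    return pixel_sum(ii, *rect_w) - pixel_sum(ii, *rect_b)


image = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
ii = integral_image(image)
inner = [row[1:] for row in ii[1:]]   # [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(pixel_sum(ii, 1, 1, 3, 3))      # 4
```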
5/21
Cascade decision process
Frontal-face has 2000 features
Divided into multiple stages
S1 (2 features) -> pass -> S2 (5 features) -> pass -> S3 (16 features) -> pass -> … -> S22 (212 features) -> pass -> Face detected
Fail -> Reject
Failing any stage rejects the current sub-window
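The early-reject behaviour of the cascade can be sketched as follows. The pass rule (sum of feature values compared against a stage threshold) is the usual Viola-Jones convention and is an assumption here, as are the toy features and thresholds. The slides state only that failing any stage rejects the sub-window.

```python
def run_cascade(stages, window):
    """Return (accepted, features_evaluated) for one sub-window.

    stages: list of (features, stage_threshold); each feature is a callable.
    """
    evaluated = 0
    for features, stage_threshold in stages:
        stage_sum = 0
        for feature in features:
            stage_sum += feature(window)
            evaluated += 1
        if stage_sum < stage_threshold:   # assumed pass rule
            return False, evaluated       # reject: later stages never run
    return True, evaluated


def make_stage(n, threshold):
    """Hypothetical stage of n identical toy features."""
    return ([lambda w: w["score"]] * n, threshold)


# Stage sizes 2, 5 and 16 follow the first three stages on the slide.
stages = [make_stage(2, 2), make_stage(5, 5), make_stage(16, 16)]
print(run_cascade(stages, {"score": 1}))  # (True, 23)
print(run_cascade(stages, {"score": 0}))  # (False, 2)
```

Most sub-windows are background, so most of them leave after the first small stages. That is why stage sizes grow along the cascade.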
6/21
Algorithm FPGA implementation
Blocks on the FPGA (block diagram):
Video in
Frame grabber
Image scaler
Buffer controller
Integral image (20 x 20 sub-window)
Classifier (Haar feature calculation/decision)
Rectangle drawer
Video out (objects in rectangles)
7/21
Integral image and Classifier
(Same block diagram as on slide 6, with the integral image and classifier expanded)
Integral Image Buffer: 20 x 20 17-bit register file
Data delivery: corner values a1 a2 a3 a4, b1 b2 b3 b4, c1 c2 c3 c4 from the buffer to the classifier
Classifier datapath: three Rect sums, multiply by constant (-1, x2, x2, x3), + (Feature sum)
Feature sum is compared with the Feature threshold (>); a mux selects Left value or Right value as the Feature value
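The classifier datapath can be approximated in software as below. Several details are assumptions made for illustration: that a1..a4 correspond to the corners P1..P4 of one rectangle, that a feature sum below the threshold selects the left value (the direction of the mux is not readable from the slide), and that pixels are 8 bits wide. Under that last assumption a 20 x 20 integral-image entry is at most 102,000, which needs 17 bits.

```python
def rect_sum(corners):
    """Rectangle sum from its four corner values (a1, a2, a3, a4)."""
    a1, a2, a3, a4 = corners
    return a4 - a2 - a3 + a1


def feature_value(rects, weights, threshold, left_value, right_value):
    """Weighted sum of up to three rect sums, then a threshold-driven mux."""
    feature_sum = sum(w * rect_sum(r) for r, w in zip(rects, weights))
    # Assumed convention: below the threshold selects the left value.
    return left_value if feature_sum < threshold else right_value


# Largest possible entry for 8-bit pixels (assumed pixel width).
MAX_ENTRY = 20 * 20 * 255
BITS_NEEDED = MAX_ENTRY.bit_length()   # 17

a = (1, 3, 3, 9)     # rect sum 4
b = (0, 0, 0, 0)     # rect sum 0
c = (0, 0, 0, 0)     # unused third rectangle
print(feature_value([a, b, c], (-1, 2, 0), 0, "left", "right"))  # left
```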
8/21
Communication bottleneck
General communication architecture
Each classifier port reads the 20 x 20 Integral image through a 400-to-1 mux
400-to-1 17-bit MUX: 2300 LUTs
12 MUXes: 27,600 LUTs, 40% of Virtex5 110T (69,120 LUTs)
Drawbacks:
Does not scale well for multiple classifiers
Wire congestion problem
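The LUT figures above make the scaling problem easy to quantify. The sketch below assumes that the 12 muxes belong to a single classifier (one per corner value a1..c4) and that cost grows linearly with the number of classifiers. Both are assumptions, as are the names.

```python
LUTS_PER_MUX = 2300          # 400-to-1 17-bit mux, from the slide
PORTS_PER_CLASSIFIER = 12    # assumed: one mux per corner value
DEVICE_LUTS = 69120          # Virtex5 110T, from the slide


def general_arch_luts(num_classifiers):
    """LUTs spent on 400-to-1 muxes alone in the general architecture."""
    return num_classifiers * PORTS_PER_CLASSIFIER * LUTS_PER_MUX


for n in (1, 2, 3):
    luts = general_arch_luts(n)
    print(n, luts, f"{luts / DEVICE_LUTS:.0%}")
```

Under these assumptions, three classifiers would already need more mux LUTs than the device provides.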
9/21
Custom communication architecture for multi-classifier
Multiple Classifiers (CF1 CF2 CF3 CF4) read the Integral image through a 400-1 mux
Feature number assigned to each classifier:
CF1  CF2  CF3  CF4
1    2    3    4
5    6    7    8
9    10   11   12
13   14   15   16
10/21
Custom communication architecture for multi-classifier
Multiple Classifiers (CF1 CF2 CF3 CF4) read the Integral image through custom muxes
Ports shown: CF1_port1, CF2_port9, CF3_port7, CF4_port2
Mux sizes shown: 24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux
Feature number assigned to each classifier:
CF1  CF2  CF3  CF4
1    2    3    4
5    6    7    8
9    10   11   12
13   14   15   16
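A port only needs to reach the integral-image entries used by the features mapped to its classifier, which is why the custom muxes are far smaller than 400-to-1. The sketch below computes per-port mux sizes from a mapping. The data layout and names are assumptions. The toy numbers reuse the Classifier 1 example from the code generation slide (features 1, 4, 66, 3 needing entries 5, 24, 46, 92), and the pairing of each feature with one entry is hypothetical.

```python
def port_mux_sizes(mapping, feature_entries, ports):
    """Mux input count for every (classifier, port) pair.

    mapping: classifier name -> list of feature ids
    feature_entries: feature id -> tuple with one integral-image index per port
    """
    sizes = {}
    for classifier, features in mapping.items():
        for port in range(ports):
            needed = {feature_entries[f][port] for f in features}
            sizes[(classifier, port)] = len(needed)
    return sizes


# Single-port toy example; the feature-to-entry pairing is assumed.
mapping = {"CF1": [1, 4, 66, 3]}
feature_entries = {1: (5,), 4: (24,), 66: (46,), 3: (92,)}
sizes = port_mux_sizes(mapping, feature_entries, ports=1)
print(sizes)  # {('CF1', 0): 4}
```

Features that share an entry on the same port do not add mux inputs, so the mapping of features to classifiers directly determines wiring cost.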
11/21
Feature mapping problem
Mapping 26 features into 4 Classifiers
Cascade: Stage 1 -> pass -> Stage 2 -> pass -> … -> Stage n -> pass -> Object found; Fail at any stage -> Reject
Stage and feature:
Stage     CF1  CF2  CF3  CF4
Stage 1   1    2    3    4
          5
Stage 2   6    7    8    9
          10   11   12
Stage 3   13   14   15   16
          17   18   19   20
          21   22   23   24
          25   26
12/21
Feature mapping problem
Mapping 26 features into 4 Classifiers
#possible mappings grows exponentially with #features
Simulated Annealing neighbor: Swap, Migrate
Objective: Min (Total stage delay * Total wire number)
Total stage delay: Performance
Total wire number: Size
1 million iterations (30 min)
Stage and feature:
Stage     CF1  CF2  CF3  CF4
Stage 1   1    2    3    4
          5
Stage 2   6    7    8    9
          10   11   12
Stage 3   13   14   15   16
          17   18   19   20
          21   22   23   24
          25   26
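The search can be sketched as a small simulated annealing loop with the two neighbor moves named on the slide. It is not the authors' implementation. The definitions of stage delay (features on the busiest classifier per stage) and wire number (distinct integral-image entries per classifier), the linear cooling schedule, and the random entry data are all assumptions.

```python
import math
import random


def cost(mapping, stage_of, entries_of, num_classifiers):
    """Total stage delay * total wire number (both defined by assumption)."""
    load = {}
    for feature, cf in mapping.items():
        key = (stage_of[feature], cf)
        load[key] = load.get(key, 0) + 1
    stage_delay = {}
    for (stage, _cf), n in load.items():
        # Classifiers work in parallel: a stage waits for its busiest one.
        stage_delay[stage] = max(stage_delay.get(stage, 0), n)
    wires = [set() for _ in range(num_classifiers)]
    for feature, cf in mapping.items():
        wires[cf].update(entries_of[feature])
    return sum(stage_delay.values()) * sum(len(w) for w in wires)


def neighbor(mapping, num_classifiers, rng):
    """Swap two features' classifiers, or migrate one feature."""
    new = dict(mapping)
    features = list(new)
    if rng.random() < 0.5:
        a, b = rng.sample(features, 2)
        new[a], new[b] = new[b], new[a]
    else:
        new[rng.choice(features)] = rng.randrange(num_classifiers)
    return new


def anneal(stage_of, entries_of, num_classifiers, iterations=2000, t0=50.0, seed=0):
    rng = random.Random(seed)
    current = {f: rng.randrange(num_classifiers) for f in stage_of}
    current_cost = cost(current, stage_of, entries_of, num_classifiers)
    best, best_cost = current, current_cost
    for i in range(iterations):
        temp = t0 * (1 - i / iterations) + 1e-9   # assumed linear cooling
        cand = neighbor(current, num_classifiers, rng)
        cand_cost = cost(cand, stage_of, entries_of, num_classifiers)
        delta = cand_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


# 26 features in 3 stages as on the slide; the entries are random placeholders.
stage_of = {f: 1 if f <= 5 else 2 if f <= 12 else 3 for f in range(1, 27)}
data_rng = random.Random(1)
entries_of = {f: data_rng.sample(range(400), 12) for f in stage_of}
best, best_cost = anneal(stage_of, entries_of, num_classifiers=4)
```

The product objective trades performance against size. Spreading a stage's features evenly lowers delay, while grouping features that share entries lowers the wire count.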
13/21
Automatic VHDL code generation
Feature mapping for Classifier 1: 1, 4, 66, 3 (needs entry: 5, 24, 46, 92)
Integral Image entries 5, 24, 46, 92 feed the MUX of Classifier 1
Scheduling: BRAM provides Select to the MUX (figure values: 1 2 3 4 and 24 5 92 46)
Generated code:
Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
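A generator for such instantiations can be very small. The sketch below is written in Python and reproduces the line shown on the slide from a list of integral-image entries. The function name and parameters are hypothetical, and the slides do not say which language the actual generator uses. Note that `select` is a reserved word in VHDL, so a real design would probably use a different signal name such as `sel`.

```python
def mux_instance(label, entries, select="select", out="dout"):
    """Emit one VHDL mux instantiation for the given integral-image entries."""
    ports = ", ".join(f"II({e})" for e in entries)
    return f"{label}: mux{len(entries)} port map({ports}, {select}, {out});"


line = mux_instance("Mux1", [5, 24, 46, 92])
print(line)
# Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
```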