Transcript
Slide 1
Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation
Hadi Afshar, Philip Brisk, Paolo Ienne
EPFL
30 April 2009
Slide 2
Fixed Block Size Motion Estimation
- Less compression
- Few motion vectors
[Figure labels: Current Frame, Reference Frame, MB, MV]
MV: Motion Vector; MB: Macro Block
Slide 3
Variable Block Size Motion Estimation
- More compression
- More motion vectors
- More computation
[Figure labels: Current Frame, Reference Frame, MB, MV]
MV: Motion Vector; MB: Macro Block
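For readers new to the terms on these two slides, the following is a minimal sketch of SAD-based full-search block matching, the computation the rest of the talk accelerates. It is an illustration only, not the authors' implementation: the function names `sad` and `best_motion_vector`, the list-of-rows frame layout and the `search` parameter are all assumptions made for this example.

```python
def sad(cur, ref, cx, cy, rx, ry, w, h):
    """Sum of absolute differences between the w x h block of `cur` at
    (cx, cy) and the w x h block of `ref` at (rx, ry).
    Frames are lists of rows of integer pixel values (assumed layout)."""
    total = 0
    for dy in range(h):
        for dx in range(w):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def best_motion_vector(cur, ref, cx, cy, w, h, search):
    """Full search: try every displacement within +/- `search` pixels and
    return (min_sad, (mvx, mvy)). Candidates outside `ref` are skipped.
    Returns None if no candidate fits inside the reference frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for mvy in range(-search, search + 1):
        for mvx in range(-search, search + 1):
            rx, ry = cx + mvx, cy + mvy
            if rx < 0 or ry < 0 or rx + w > width or ry + h > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, w, h)
            if best is None or cost < best[0]:
                best = (cost, (mvx, mvy))
    return best
```

With a fixed block size, `w` and `h` never change and each macroblock gets one motion vector. Variable block size motion estimation repeats the search for every sub-block shape, which is where the extra motion vectors and the extra computation come from.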
Slide 4
Systolic Arrays and Motion Estimation
- Data is shared, low memory bandwidth
[Figure labels: Current Frame, Reference Frame, MB, MV; PE 0, PE 1, PE 2, PE n; Memory, FF, Comparator, Regfile; Pixel(s), Ref., CS, ABS 1, ABS 4, Pix, Ref]
Slide 5
Systolic Arrays for VBSME
[Figure labels: PE 0, PE 1, PE 2, PE n; Memory, FF, 16-pixel Regfile, Comparator; SAD MERGE TREE (adders); SAD BUS NETWORK, REUSE UNIT, Regfile, adder; Primitive Blocks]
- Yap TCAS 2004
- Song IEICE 2006
- Chen TCAS 2006
- Li FPT 2006
Slide 6
Outline
- Proposed Design Approach
- Array Organization
- Processing Element Design
- Scheduling
- Related Work
- Case Study: H.264 VBSME
- Experimental Results
- VLSI Implementation
- FPGA Implementation
- Conclusion
Slide 7
Proposed Approach
Basics:
- Each PE is augmented with a comparator unit in addition to the reuse unit
- Each PE computes the SADs of all sub-blocks within the MB against one specific reference MB
- Each PE runs one clock cycle ahead of its neighbouring PE
- Different PEs compute SADs of the same MB against different reference MBs
Slide 8
Proposed Approach
        T_i         T_i+1       T_i+2       T_i+3       T_i+4
PE 0    SAD B0,R0   SAD B1,R1   SAD B2,R2   SAD B3,R3   SAD B4,R4
PE 1                SAD B0,R1   SAD B1,R2   SAD B2,R3   SAD B3,R4
PE 2                            SAD B0,R2   SAD B1,R3   SAD B2,R4
[Figure labels: B0, B1, B2; R0, R1, R2, R3, R4]
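The staggered timing on this slide can be sketched as a small schedule generator. The indexing rule used here, that PE k at cycle T_i+t works on the pair (B with index t-k, R with index t), is inferred from the SAD pairs listed on the slide and is not stated by the authors; the function name `staggered_schedule` is hypothetical.

```python
def staggered_schedule(num_pes, num_cycles):
    """Return {pe: [entry per cycle]}, where each entry is a
    (block_index, reference_index) pair or None while the PE is still idle.

    Assumed rule (inferred from the slide): at cycle t, PE k computes
    SAD(B[t - k], R[t]), so each PE starts one cycle after the previous one
    and all PEs see the same reference index in a given cycle."""
    table = {}
    for pe in range(num_pes):
        row = []
        for t in range(num_cycles):
            block = t - pe
            row.append((block, t) if block >= 0 else None)
        table[pe] = row
    return table


if __name__ == "__main__":
    for pe, row in staggered_schedule(3, 5).items():
        cells = ["   --   " if e is None else f"B{e[0]},R{e[1]}  " for e in row]
        print(f"PE {pe}:", " ".join(cells))
```

Under this reading, every PE consumes the same reference index in a given cycle, which fits the earlier remark that data is shared and memory bandwidth stays low.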
Slide 9
Proposed Approach
[Figure: PE 0, PE 1 and PE 2 feeding a MIN comparison over cycles T_i to T_i+4. SADs shown: S B0,R0, S B1,R1, S B2,R2; S B0,R1, S B1,R2, S B2,R3; S B0,R2, S B1,R3, S B2,R4]
Slide 10
Array Organization
[Figure labels: Memory, FF, Comparator, SAD BUS NETWORK; PE 0, PE 1, PE n, each with a REUSE UNIT and Compare; Min SAD Register File]
- MIN SADs move along the chain and are stored in the regfile
- Each PE must compute more than one search region
- (# of PEs) < (# of search regions)
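A minimal software sketch of the chained minimum described on this slide, for a single sub-block. It assumes that search positions are handed to the PEs in consecutive rounds of N and that each PE compares its own SAD against the running minimum it receives before forwarding the smaller one; both the round-robin assignment and the name `search_with_chain` are assumptions for illustration, not details taken from the slides.

```python
def search_with_chain(sads_by_position, num_pes):
    """sads_by_position: one SAD per search position for a single sub-block.
    Returns (min_sad, position) as it would end up in the min-SAD register
    file, or None if the input is empty.

    Because there are fewer PEs than search positions, the positions are
    processed in rounds of `num_pes` (assumed assignment)."""
    regfile_min = None  # best (sad, position) kept across rounds
    for start in range(0, len(sads_by_position), num_pes):
        running = regfile_min  # value entering the chain at PE 0
        for pe in range(num_pes):
            pos = start + pe
            if pos >= len(sads_by_position):
                break  # last round may be only partly filled
            candidate = (sads_by_position[pos], pos)
            # Each PE's comparator keeps the smaller SAD and passes it on.
            if running is None or candidate[0] < running[0]:
                running = candidate
        regfile_min = running  # value leaving the chain is stored
    return regfile_min
```

The strict `<` keeps the earliest position on ties; a hardware design could just as well prefer another tie-break.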
Slide 11
PE Design
[Figure labels: CU output(s) of previous PE; CS, ABS 1, ABS 4, Pix, Ref, adder; FB, CU, RU, MIN Reg, Regfile]
Slide 12
PE Design Optimization
- To minimize the size of the RU register file, each PE should compare and transfer computed SADs ASAP
- Parallel comparators are required when multiple SADs are produced in the same cycle
- Transfer Rate (B: # of sub-blocks within MB; T: # of cycles required to compute MB SADs)
Slide 13
PE Design Optimization
- To minimize the size of the RU register file, each PE should compare and transfer computed SADs ASAP
- Parallel comparators are required when multiple SADs are produced in the same cycle
- Transfer Rate (B: # of sub-blocks within MB; T: # of cycles required to compute MB SADs)
- Uniform generation of B sub-blocks within T cycles reduces the RU regfile
- Regular workflow simplifies the controller
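The transfer-rate expression itself is not preserved in this transcript. As a hedged back-of-the-envelope aid only: if a PE produces B sub-block SADs over T cycles, it has to hand off B/T SADs per cycle on average, and with perfectly uniform generation the number of comparators working in parallel would be the ceiling of that ratio. The sketch below computes both quantities; the function names are hypothetical, and the ceiling rule is an assumption of this example rather than the authors' formula.

```python
import math


def average_transfer_rate(num_sub_blocks, num_cycles):
    """Average number of SADs a PE must compare and forward per cycle."""
    return num_sub_blocks / num_cycles


def comparators_needed(num_sub_blocks, num_cycles):
    """Parallel comparators needed if SADs are generated uniformly
    (assumption: at most ceil(B / T) SADs complete in any one cycle)."""
    return math.ceil(num_sub_blocks / num_cycles)


if __name__ == "__main__":
    # B = 41 and T = 64 are the H.264 case-study values quoted later in the talk.
    print(average_transfer_rate(41, 64), comparators_needed(41, 64))
```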
Slide 14
SAD Scheduling
- Primitive SADs
  - Computations need to be distributed over T cycles
- Non-primitive SADs
  - A SAD is generated as soon as its building SADs are ready
  - Proper scheduling frees SAD registers for other generated building SADs
- We propose a zig-zag pattern for reuse
  - Also helps to distribute SAD computations evenly
Slide 15
SAD Scheduling
Slide 16
Related Work
VLSI H.264 VBSME
- Yap [TCAS 2004]: 1-D array with SAD bus network
- Song [IEICE 2006]: 1-D array with SAD bus network
- Chen [TCAS 2006]: 2-D array with SAD merge tree, used for HDTV applications
FPGA H.264 VBSME
- Wei [2003]: 1-D array with SAD bus network
- Lopez [ISCAS 2005]: 1-D array using SRAMs with SAD bus network
- Li [FPT 2006]: Bit-serial architecture with SAD merge tree
Slide 17
Case Study: H.264 VBSME
- MB = 16x16 pixels, B = 41 sub-blocks, 4x4 primitive blocks
- 4 PEs
- Each PE computes 4 pixel SADs in each cycle
- Search range: 16x16 pixels for each pixel
- T = 64 cycles for each MB
- Four identical and regular 16-cycle workflows
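The figure of 41 sub-blocks follows from the H.264 partition sizes of a 16x16 macroblock: sixteen 4x4, eight 8x4, eight 4x8, four 8x8, two 16x8, two 8x16 and one 16x16. The sketch below derives all 41 SADs for one candidate position from the sixteen primitive 4x4 SADs. It sums the primitives directly for clarity; a reuse unit would build larger blocks from already merged smaller ones, and the zig-zag ordering from the talk is not modelled here. The names `PARTITIONS` and `merge_sads` are hypothetical.

```python
# Partition sizes in units of 4x4 primitive blocks: name -> (width, height).
PARTITIONS = {
    "4x4": (1, 1), "8x4": (2, 1), "4x8": (1, 2), "8x8": (2, 2),
    "16x8": (4, 2), "8x16": (2, 4), "16x16": (4, 4),
}


def merge_sads(prim):
    """prim: 4x4 grid (list of 4 rows of 4 ints) holding the SAD of each
    primitive 4x4 block for one candidate position.
    Returns {partition name: [SAD per sub-block, in raster order]}."""
    merged = {}
    for name, (bw, bh) in PARTITIONS.items():
        sads = []
        for top in range(0, 4, bh):
            for left in range(0, 4, bw):
                sads.append(sum(prim[r][c]
                                for r in range(top, top + bh)
                                for c in range(left, left + bw)))
        merged[name] = sads
    return merged


if __name__ == "__main__":
    example = [[r * 4 + c for c in range(4)] for r in range(4)]
    result = merge_sads(example)
    print(sum(len(v) for v in result.values()), "sub-block SADs")
```

The cycle count on the slide is consistent with simple arithmetic: a 16x16 macroblock has 256 pixels, and at 4 pixel SADs per cycle a PE needs 256 / 4 = 64 cycles, which also matches four workflows of 16 cycles each.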
Slide 18
SAD Scheduling
Slide 19
Experimental Results
- H.264 VBSME modelled in VHDL
- VLSI implementations
  - Synopsys DC
  - CMOS libraries
    - 0.18 µm: 12k gates, 285 MHz
    - 0.13 µm: 18k gates, 400 MHz
- FPGA implementations
  - Altera Quartus, Xilinx ISE
  - Altera APEX and Stratix-II, Xilinx Virtex-II
Slide 20
VLSI Implementation
MB Processing Time (MBPT)
- SR: Search Range; T: MB SAD cycles; N: # of PEs
- ~20-25% reduction
Slide 21
VLSI Implementation
Gate count (k gates)
- Large area reduction
Slide 22
FPGA Implementation
Throughput (MB/sec)
- Lower throughput than best designs, but
Slide 23
FPGA Implementation
- Up to 3/4 area reduction
- Best efficiency
Slide 24
Scalability
Stratix-II
- Almost perfect scalability
Slide 25
Conclusion
- We improved scalability by redesigning the organization of the systolic array and the design of the PEs in the array
  - Very low cost design, less area and delay
- We proposed a zig-zag pattern for reusing the primitive SADs
  - Fewer registers for maintaining computed SADs
  - Very regular workflow
- This approach can be exploited by existing architectures and can also be applied to future standards with different block sizes