DSP Algorithms on FPGA, Part II: Digital Image Processing


Content

* Overview of image processing and FPGA
* Algorithm to FPGA Mapping Flow
* Nested Loop Algorithms and MODG
* Example: Motion Estimation
* Conclusion and Future Trends

Video signal in different formats

Format   Resolution (pixels)   Frame rate (f/s)   Pixel rate (Mp/s)
PAL      720*576               25                 10.4
NTSC     720*480               29.97              10.4
HDTV     1920*1080             30.0               62.2
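The pixel rates listed for each format follow from width x height x frame rate. A minimal Python sketch of that arithmetic, where the function name `pixel_rate_mps` and the `FORMATS` dictionary are illustrative only:

```python
def pixel_rate_mps(width, height, frames_per_second):
    """Return the pixel rate in megapixels per second (Mp/s)."""
    return width * height * frames_per_second / 1e6


# Resolutions and frame rates as listed for the three formats.
FORMATS = {
    "PAL": (720, 576, 25.0),
    "NTSC": (720, 480, 29.97),
    "HDTV": (1920, 1080, 30.0),
}

for name, (w, h, fps) in FORMATS.items():
    print(f"{name}: {pixel_rate_mps(w, h, fps):.1f} Mp/s")
```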

Common delivery form:
* Analog (cable)
* USB
* Firewire

Image Processing Characteristics

Need to maximize the available logic by supporting multiple N-D configurable devices

For example:

Image * [ 1 2 1 ]
        [ 2 4 2 ]
        [ 1 2 1 ]
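The example applies a 3x3 kernel (1 2 1 / 2 4 2 / 1 2 1) to an image. A minimal Python sketch of that operation follows. It assumes the image is a list of rows of integers, that only the interior pixels are computed, and that the result is divided by 16 (the sum of the kernel weights); the slide specifies none of these, so treat them as assumptions.

```python
# 3x3 kernel from the example; its weights sum to 16.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]


def convolve3x3(image, kernel=KERNEL, divisor=16):
    """Apply a 3x3 kernel to the interior of `image` (a list of rows).

    The kernel is symmetric, so convolution and correlation coincide here.
    The divisor (an assumption) normalizes the result by the kernel sum.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            acc = 0
            # Nested loops over the 3x3 neighbourhood: 9 multiply-adds per pixel.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            out_row.append(acc // divisor)
        out.append(out_row)
    return out


flat = [[5] * 4 for _ in range(4)]
print(convolve3x3(flat))
```

Every output pixel is independent of the others, which is the kind of spatial parallelism a processor array can exploit.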

Challenges

How to ...?
* Appropriate partitioning of algorithms between hardware and software
* Exploiting spatial and temporal parallelism
* Integrating the configurable computer into the software framework
* Selecting a suitable configuration strategy

How shall we deal with these challenges?

Why SRAM-Based FPGAs? (Pros)

Higher logic/storage capacity
  * Fast carry chain for adders/subtractors
  * Built-in XOR gates/LUT
  * Array of bit-parallel multipliers
  * Fast and local storage: array of SRAM blocks
  * Interconnect supports: three-state buffers/LUT
Equivalent to fine-grained reconfigurable hardware
  * Finer-grained pipelining can help preserve the performance at low power supply voltage
More mature CMOS manufacturing technology

Algorithm to FPGA Mapping Flow

[Figure: flow diagram. Box labels: Nested Do Loop; Algorithm Compilation; MODG Formulation; MODG; Space-Time Mapping; Cost Functions subject to Constraints; 1D Schedule; Proc. Array; Inter-PE Pipelined Array; Intra-PE Pipelining; New Mapping Matrix T1; Fully Pipelined Array; High-level Synthesis; HDL Synthesis; PMPR; Config. Generation.]

The Matrix Multiplication MODG

[Figure: MODG of a 3x3 matrix multiplication. Node labels: inputs a11..a33 and b11..b33; initial values c11=0, c21=0, c31=0, c12=0, c13=0; outputs c13, c23, c31, c32, c33.]

A number of different execution orders can be carried out to achieve the same algorithm.
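As a small illustration of this point, the Python sketch below runs the 3x3 matrix multiplication with all six orderings of the i, j, k loops and checks that they agree. The function name `matmul_in_order` and the sample matrices are assumptions made for the example.

```python
from itertools import permutations, product


def matmul_in_order(a, b, order):
    """Multiply square matrices a and b, nesting the loops in `order`.

    `order` is a permutation of ("i", "j", "k"); its first entry is the
    outermost loop. Indices are 0-based here.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for point in product(range(n), repeat=3):
        idx = dict(zip(order, point))
        i, j, k = idx["i"], idx["j"], idx["k"]
        c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
results = [matmul_in_order(a, b, order) for order in permutations("ijk")]
print(all(r == results[0] for r in results))
```

The results agree because each c[i][j] is a sum of integer products, and the order in which the terms are accumulated does not change the sum.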

Nested Do Loop Algorithms and Inter-Iteration Dependence Graph

Do i=1 to M
  Do j=1 to N
    c[i,j]=0;
    Do k=1 to K
      c[i,j] = c[i,j] + a[i,k]*b[k,j];
    EndDo k
  EndDo j
EndDo i

Dependence vectors, in $(i,j,k)^t$ coordinates:
$d_a = (0,1,0)^t$
$d_b = (1,0,0)^t$
$d_c = (0,0,1)^t$

Index space: $J^3 = \{(i,j,k)^t : 1 \le i,j,k \le 3\}$ ($M=N=K=3$)
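The dependence vectors can be checked mechanically: moving from an index point p to p + d must touch the same element of the corresponding array. The Python sketch below does this over the 3x3x3 index space. The helper names `operand` and `check_dependences` are hypothetical.

```python
from itertools import product

# Dependence vectors in (i, j, k) order, as stated for the loop nest.
DEPENDENCE = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def operand(name, i, j, k):
    """Return the array element of `name` used by iteration (i, j, k)."""
    return {"a": (i, k), "b": (k, j), "c": (i, j)}[name]


def check_dependences(n=3):
    """True if every step along d_x reuses the same element of array x."""
    for name, d in DEPENDENCE.items():
        for i, j, k in product(range(1, n + 1), repeat=3):
            q = (i + d[0], j + d[1], k + d[2])
            if max(q) > n:
                continue  # neighbour lies outside the index space
            if operand(name, i, j, k) != operand(name, *q):
                return False
    return True


print(check_dependences())
```

For example, a[i,k] does not depend on j, so it is reused along (0,1,0), while c[i,j] is accumulated along the k axis, giving (0,0,1).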

Inter-Iteration Data Dependence Graph (DG)

[Figure: 3-D dependence graph of the 3x3 matrix multiplication, with inputs a11..a33 and b11..b33, initial value c11=0, and outputs c13, c23, c31, c32, c33. Each node is a multiply-add cell (+, X) with a, b and c as inputs and outputs.]

Systolic Mapping (space-time) of Matrix Multiplication

[Figure: the 3-D DG (Dependence Graph) of the 3x3 matrix multiplication and the 2-D Processor Array derived from it. The array holds a11..a33, receives b11, b21, b31 and the initial values c11=0, c21=0, c31=0, and produces c11, c21, c31; delay elements are labelled D. The mapping is annotated with P and s.]

Systolic Mapping of Matrix Multiplication, cont.

[Figure: data flow through the 2-D processor array over successive steps. The entries a11..a33 stay in the array while the columns b11 b21 b31, b12 b22 b32 and b13 b23 b33 are applied in turn, producing C11 C21 C31, then C12 C22 C32, then C13 C23 C33 from initial values of 0. The processor array with its delay elements D is shown again.]
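A space-time mapping assigns each index point of the DG to a processing element (space) and a clock step (time). The Python sketch below illustrates the idea under two assumptions that the slides do not state explicitly: the DG is projected along the j axis, so iteration (i, j, k) runs on PE (i, k), and the schedule vector is s = (1, 1, 1), so it runs at time i + j + k.

```python
from itertools import product

# Dependence vectors of the matrix multiplication, in (i, j, k) order.
DEPENDENCES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def space_time_map(n=3, schedule=(1, 1, 1)):
    """Map each index point (i, j, k) to a (PE, time) pair.

    Assumed mapping: projection along j, so the PE is (i, k);
    time is the dot product of the schedule vector with the index point.
    """
    # Causality: data must be produced strictly before it is consumed.
    if any(dot(schedule, d) <= 0 for d in DEPENDENCES.values()):
        raise ValueError("schedule violates a data dependence")
    mapping = {}
    for i, j, k in product(range(1, n + 1), repeat=3):
        mapping[(i, j, k)] = ((i, k), dot(schedule, (i, j, k)))
    return mapping


mapping = space_time_map()
pes = {pe for pe, _ in mapping.values()}
times = {t for _, t in mapping.values()}
# No two iterations may share the same PE in the same clock step.
assert len(set(mapping.values())) == len(mapping)
print(len(pes), "PEs,", len(times), "time steps")
```

With these assumptions the 27 iterations of the 3x3 case fit on 9 PEs in 7 time steps, and every dependence connects a PE only to itself or to a direct neighbour.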

Why is Space-Time Mapping suitable for FPGAs?

* It can bridge the nested Do loop signal/image processing algorithms to the processor array implementation.
* The space-time array matches the modular and regular FPGA structure.
* The localized/pipelined interprocessor links can overcome the long programmable interconnect delay.
* The size of configuration storage can be significantly reduced because of the almost identical processing elements and interconnect structure.

Problems with Existing Design Methodologies/Tools

* The dependence graphs of many other algorithms are not uniform and must be predetermined by human designers.
* Existing methodologies
  * cannot handle these complex algorithms
  * use unrealistic cost functions (metrics)
* No built-in features of FPGAs have been incorporated.
* Longer interconnect delay in deep submicron CMOS technology
* Much lower hardware utilization due to programmable interconnect delay in FPGAs

There is another problem: speed

What is Intra-PE Pipelining?

[Figure: (a) a linear array PE0, PE1, PE2 in which c is passed from PE to PE and each PE computes c = c + a_i x b_i (i = 0, 1, 2) in one clock period. (b) the same array with each PE split into a multiplier stage, d = a_i x b_i, and an adder stage, c = c + d, with d held between the stages. The schedule of both versions is shown against CLK.]

• The figure shows a simple vector dot product array.

• The interconnect delay of FPGAs results in an even longer clock period.

• To enhance the overall throughput, intra-iteration parallelism must be exploited.

• It can be observed that the utilization of each operator is increased.

• Of course, the control mechanism is more complex.
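To make the idea concrete, the Python sketch below simulates the pipelined dot product array clock by clock. Each PE has a multiplier stage that latches d = a x b and an adder stage that latches c = c_in + d one clock later. The streaming of several vector pairs, the register names `d` and `c`, and the function name are assumptions made for the illustration, not details from the slides.

```python
def pipelined_dot_products(vectors_a, vectors_b):
    """Simulate a linear array of two-stage PEs, one PE per vector element.

    Returns (dot products in input order, number of clock cycles used).
    """
    n = len(vectors_a[0])          # number of PEs
    num = len(vectors_a)           # number of vector pairs in the stream
    d = [None] * n                 # multiplier output registers: (vector id, product)
    c = [None] * n                 # adder output registers: (vector id, partial sum)
    results = {}
    cycle = 0
    while len(results) < num:
        new_d = [None] * n
        new_c = [None] * n
        for pe in range(n):
            # Adder stage: uses the product latched in the previous clock
            # and the partial sum latched by the neighbouring PE.
            if d[pe] is not None:
                vid, prod = d[pe]
                c_in = 0 if pe == 0 else c[pe - 1][1]
                new_c[pe] = (vid, c_in + prod)
            # Multiplier stage: vector (cycle - pe) reaches this PE now.
            vid = cycle - pe
            if 0 <= vid < num:
                new_d[pe] = (vid, vectors_a[vid][pe] * vectors_b[vid][pe])
        d, c = new_d, new_c        # all registers update on the clock edge
        if c[n - 1] is not None:
            results[c[n - 1][0]] = c[n - 1][1]
        cycle += 1
    return [results[v] for v in range(num)], cycle


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
b = [[1, 1, 1], [2, 0, 1], [0, 1, 0], [3, 3, 3]]
print(pipelined_dot_products(a, b))
```

In this model the multiplier and the adder of a PE work on different vectors in the same clock, which is why operator utilization goes up, while the clock period only has to cover one operator instead of a multiply followed by an add.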

Examples of Nested Do Loop Algorithms

Motion estimation
  * One of the most time-consuming operations (tasks) in digital video compression
Stereo matching
  * Used to build disparity maps for 3D robot/computer navigation
Matrix/Vector Multiplication
  * FFT, DCT, 2D/3D graphics, etc.
2D Linear Transforms/Operations
  * 2D FFT, 2D DCT, etc.

Tennis frame 0

[Figure: previous frame.]

Tennis frame 1

[Figure: current frame.]

Motion Vectors of 8x8-Pixel Blocks

[Figure: Motion Vector Field of frame 1.]

Reconstructed Frame 1 from Frame 0 and Motion Vectors

[Figure: motion compensated frame.]

Illustration of Full Search Block Matching Motion Estimation (6-level nested Do loop)

[Figure: a block of the current frame, x, is compared with displaced blocks of the previous frame, y. Pixels are labelled by their indices (i, j); the candidate displacements shown are m = 0, 1, 2 and n = 0, 1, 2. Motion vector = (m, n).]

Exp: A Simpler PE Microarchitecture

[Figure: PE datapath built from |x-y|, +, Min, AND, Sel1, Sel2 and Reg blocks. Signals: x2(l-1,k), x2(l,k), y2(l,k), MAD(l,N2-1), Dmin(l-1,N2-1), and the output Dmin(l,N2-1) = Min(Dmin(l-1,N2-1), MAD(l,N2-1)).]

$\mathrm{MAD}(m,n) = \mathrm{MAD}(m,n) + |x(hN+i, vN+j) - y(hN+i+m-p, vN+j+n-p)|$

* Xilinx Core Generator System
* Critical path delay = 25 ns, based on Xilinx Virtex data
* 1,500-2,000 equivalent gate count
* The critical path (blue line) can be shortened further by intra-PE pipelining
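The MAD accumulation above is the body of the six-level nested loop of full search block matching. The Python sketch below is one way to write that loop nest. It assumes frames are lists of rows, that the first index of x and y selects the row, that candidates reaching outside the previous frame are skipped, and that the reported vector is the displacement (m - p, n - p); none of these conventions are fixed by the slides.

```python
def full_search(x, y, N, p):
    """Full search block matching.

    x: current frame, y: previous frame (equal-sized lists of rows).
    N: block size, p: maximum displacement in each direction.
    Returns {(h, v): (dm, dn)}, the best displacement for each block.
    """
    rows, cols = len(x), len(x[0])
    vectors = {}
    for h in range(rows // N):                  # loop 1: block row
        for v in range(cols // N):              # loop 2: block column
            best = None
            for m in range(2 * p + 1):          # loop 3: candidate offset
                for n in range(2 * p + 1):      # loop 4: candidate offset
                    r0, c0 = h * N + m - p, v * N + n - p
                    if r0 < 0 or c0 < 0 or r0 + N > rows or c0 + N > cols:
                        continue                # candidate leaves the frame
                    mad = 0
                    for i in range(N):          # loop 5: pixel in block
                        for j in range(N):      # loop 6: pixel in block
                            mad += abs(x[h * N + i][v * N + j] - y[r0 + i][c0 + j])
                    if best is None or mad < best[0]:
                        best = (mad, (m - p, n - p))
            vectors[(h, v)] = best[1]
    return vectors


# Previous frame: a ramp image. Current frame: the ramp shifted by (1, 2).
y = [[8 * r + c for c in range(8)] for r in range(8)]
x = [[y[(r + 1) % 8][(c + 2) % 8] for c in range(8)] for r in range(8)]
print(full_search(x, y, N=2, p=2)[(1, 1)])
```

The two innermost loops are what the PE datapath accumulates, and the comparison against `best` plays the role of the Min block.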

Significance of the Contributions

The MODG representation for nested Do loop algorithms
  * The actual execution is not constrained to any predetermined order.
  * keeps track of every variable instance so that there is no redundant memory access, saving I/O, bandwidth and power consumption.
  * can be automated using memory.
Without the MODG,
  * motion estimation and many other nested Do loop algorithms can be written as many different DGs,
  * a human must be involved to formulate a DG,
  * the built-in ROM/RAM of the FPGA may not be exploited, and

Significance of the Contributions, cont.

Space-Time mapping for the MODG can be applied to
  * any SRAM-based FPGA architecture
  * any coarse-grained architecture
Constraints and practical cost functions
Intra-PE pipelining
  * enhances/preserves the throughput rate in low-power mode.

Conclusion

* Users demand more communication/multimedia processing capabilities on resource-limited Internet appliances.
* Reconfigurable SOC is the ultimate solution for designing this challenging low-power/high-performance platform.
* Its success relies on the embedded high-density FPGA core as reconfigurable (programmable) accelerating hardware.
* As technology (supply voltage) scales down, logic (transistors) becomes virtually free, while the interconnect becomes the power-consuming bottleneck.
* Parallel execution of nested Do loop algorithms by an array of localized processing elements at a moderate clock frequency is a viable solution.
* It can balance the three main issues: design time, power consumption, and performance.

Future Trends

* Memory (storage) organization should be investigated, because multiple reads per clock cycle are needed to sustain such high throughput.
* The control mechanism of the entire array is one of the aspects that will determine its success.
* A given MODG may need to be partitioned so that the resulting array fits the on-chip reconfigurable FPGA core.
