Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of FPGA-based Systolic Arrays Greg Nash Reconfigurable Technology: FPGAs and.

Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of FPGA-based Systolic Arrays

Greg Nash

Reconfigurable Technology: FPGAs and Reconfigurable Processors for Computing and Communications IV

SPIE ITCom, Boston, MA, July 29, 2002

Outline

• Introduction to CAD tool, SPADE (symbolic parallel algorithm development environment)

• Design examples: matrix Lyapunov equation, discrete Fourier transform (DFT)

• Isolating useful designs (Lyapunov)

– Alignment of variables in space-time

– Non-optimal solutions

– “Low” bandwidth designs

– “Regular” designs

• Finding optimal solutions (DFT)

– Minimum latency

– Maximum throughput

Systolic Array: Matrix Multiply

[Figure: space-time mapping of the matrix multiply for variables c, d, and e; projecting along the time axis gives the systolic array]

$$c[i,j] = \sum_{k=1}^{N} d[i,k] * e[k,j], \qquad 1 \le i, j, k \le N$$
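The idea of mapping each multiply-add to a point in space-time and then projecting along the time axis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping `time = i + j + k`, `PE = (i, j)` is a common textbook choice, not necessarily one SPADE would select, and the function names `space_time_map` and `systolic_matmul` are hypothetical.

```python
from itertools import product


def space_time_map(i, j, k):
    """Assumed textbook mapping: PE position (i, j), time step i + j + k."""
    return (i, j), i + j + k


def systolic_matmul(d, e):
    """Evaluate c = d * e in space-time order; return (c, number of time steps)."""
    n = len(d)
    # Group every multiply-add (i, j, k) by the time step it is scheduled on.
    schedule = {}
    for i, j, k in product(range(n), repeat=3):
        pe, t = space_time_map(i, j, k)
        schedule.setdefault(t, []).append((pe, i, j, k))

    c = [[0] * n for _ in range(n)]
    for t in sorted(schedule):
        busy = set()
        for pe, i, j, k in schedule[t]:
            # A valid mapping never gives one PE two operations in one step.
            if pe in busy:
                raise RuntimeError(f"PE {pe} used twice at time {t}")
            busy.add(pe)
            c[i][j] += d[i][k] * e[k][j]
    return c, len(schedule)
```

With this assumed mapping each element c[i,j] stays in one PE for the whole computation, and the schedule spans 3N-2 time steps.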

Parallel Processing With Systolic Arrays

• Algorithms

– Linear algebra – graph theory – computational geometry

– String matching – sorting/searching – dynamic programming

– Discrete mathematics – number-theoretic algorithms

• Applications (real-time/embedded processing)

– Communications – seismic analysis – signal/image processing

– Adaptive processing – arithmetic arrays

• Architecture

– Simple processing elements – local interconnects – synchronous

– Fine-grained – pipelined – small local memory

– Local control – regular arrays

• Hardware

– FPGA/PLD chips – programmable connections

– Reconfigurable boards – ASICs

Altera Stratix FPGA: DFT Mapping

Systolic DFT Array

SPADE Operation

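Mathematical Algorithm → Input Code → Transformation Search → Simulator, Graphical Outputs

Transformation search: (i,k) → (S,T), one transformation per variable

$$M_y \begin{pmatrix} i \\ k \end{pmatrix} = \begin{pmatrix} S \\ T \end{pmatrix}_y, \qquad M_h \begin{pmatrix} i \\ k \end{pmatrix} = \begin{pmatrix} S \\ T \end{pmatrix}_h, \qquad M_x \begin{pmatrix} i \\ k \end{pmatrix} = \begin{pmatrix} S \\ T \end{pmatrix}_x$$

S = spatial coordinates, T = temporal coordinates, M = transformation solution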

Mathematical algorithm, A = LU (N = 3):

$$u_{12} = a_{12}/l_{11}; \quad u_{13} = a_{13}/l_{11}$$

$$u_{ij} = \Big( a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj} \Big) / l_{ii}, \qquad \text{for } j > i; \ i, j = 1, \ldots, 3$$

$$l_{11} = a_{11}; \quad l_{21} = a_{21}; \quad l_{31} = a_{31}$$

$$l_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}, \qquad \text{for } i \ge j; \ i, j = 1, \ldots, 3$$

Input code:

for i to N do
  for j to N do
    if j=1 and i>=1 and i<=N then
      l[i,j] := a[i,j];
    elif i=1 and j>1 and j<=N then
      u[i,j] := a[i,j]/l[i,i];
    fi;
    if i>=j and j>1 and i<=N then
      l[i,j] := a[i,j] - add(l[i,k]*u[k,j], k=1..j-1)
    fi;
    if j>i and i>1 and j<=N then
      u[i,j] := (a[i,j] - add(l[i,k]*u[k,j], k=1..i-1))/l[i,i]
    fi;
  od
od;
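For readers without Maple, the same recurrence can be tried out with a short Python port. This is a sketch under assumptions: the function name `crout_lu` is hypothetical, indices are 0-based, U is given an explicit unit diagonal (which the recurrence implies but the input code never assigns), and no pivoting is done, so every l[i][i] must be nonzero.

```python
def crout_lu(a):
    """Factor a square matrix as A = L U with unit-diagonal U (no pivoting)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    # Assumption: unit diagonal for U, implied by the recurrence above.
    u = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i >= j:
                # Lower-triangular entry: subtract already-known products.
                l[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(j))
            else:
                # Strictly upper entry: same sum, then divide by the pivot.
                s = sum(l[i][k] * u[k][j] for k in range(i))
                u[i][j] = (a[i][j] - s) / l[i][i]
    return l, u
```

The row-by-row loop order matches the input code: every value read on the right-hand side has already been written earlier in the same row or in a previous row.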

Algorithm Domain

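• Multiple statements of the general form

$$x(A_x I + a_x) \ \text{depends on} \ y(B_y I + b_y) \qquad \text{for all } I \in \mathcal{I}^{S}$$

e.g., $x(2i,\, j+1) = x\!\left( \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right)$

– where A_x, B_y / a_x, b_y are integer matrices/vectors, S is the dimension of the algorithm space, and the dependencies include commutative and associative operators such as min, max, and summation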
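The affine index form can be checked with a small Python sketch. The helper name `affine_index` is hypothetical and exists only for illustration; it evaluates A I + a for an integer matrix A, an integer offset vector a, and an index point I.

```python
def affine_index(A, a, I):
    """Return the index tuple A*I + a for integer matrix A and vectors a, I."""
    return tuple(
        sum(A[r][c] * I[c] for c in range(len(I))) + a[r]
        for r in range(len(A))
    )


# The slide's example x(2i, j+1): A = [[2, 0], [0, 1]], a = [0, 1].
A_x = [[2, 0], [0, 1]]
a_x = [0, 1]
print(affine_index(A_x, a_x, (3, 5)))
```

For the point (i, j) = (3, 5) this evaluates to (6, 6), that is, the element x(2i, j+1).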

SPADE Functionality

• Scheduling

• Reindexing

• Localization

• Allocation

• Constraint introduction

• Solutions

– Primary objective function: latency

– Secondary objective functions: area, regularity, bandwidth

• Automatic operation

“Time-alignment” Constraint

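Matrix-matrix multiplication:

$$c[i,j] = \sum_{k=1}^{N} d[i,k] * e[k,j], \qquad 1 \le i, j, k \le N$$

[Figure: space-time mapping and systolic array (N=6) for variables c, d, and e, with time axis $u_t$ and the labels $u_t \cdot n_e = 0$, $u_t \cdot n_d = 0$, $u_t \cdot n_c = 0$]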

Lyapunov Matrix Equation Example

• Abstract problem: find X given A (lower triangular) and B (upper triangular)

$$AX + XB = C$$

• Convert to mathematical expression

$$x(i,j) = \frac{c(i,j) - \sum_{k=1}^{i-1} a(i,k) * x(k,j) - \sum_{l=1}^{j-1} b(l,j) * x(i,l)}{a(i,i) + b(j,j)}$$

• Non-uniform recurrence equation in Maple language

for i to N do
  for j to N do
    x[i,j] := (c[i,j] - add(a[i,k]*x[k,j], k=1..i-1) - add(b[l,j]*x[i,l], l=1..j-1))/(a[i,i]+b[j,j]);
  od;
od;
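The recurrence can be sanity-checked with a direct Python port. This is a sketch with 0-based indices; the name `solve_triangular_lyapunov` is hypothetical, and the code assumes a[i][i] + b[j][j] is nonzero for every i and j.

```python
def solve_triangular_lyapunov(a, b, c):
    """Solve A X + X B = C for X, with A lower and B upper triangular."""
    n = len(c)
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = c[i][j]
            # Contributions from rows of X above row i (A is lower triangular).
            s -= sum(a[i][k] * x[k][j] for k in range(i))
            # Contributions from columns of X left of column j (B is upper triangular).
            s -= sum(b[l][j] * x[i][l] for l in range(j))
            x[i][j] = s / (a[i][i] + b[j][j])
    return x
```

Each x[i][j] depends on a whole row segment and a whole column segment of earlier results, so the number of terms grows with i and j. That growth is what makes the recurrence non-uniform.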

Non-latency Optimal Solutions

• Two minimum area, latency optimal designs (L=4N-3) found

• Four smaller area, non-optimal designs (L=4N-2) found

Space-Time View (N=6)

Minimum Bandwidth Secondary Objective Function

• Minimum area secondary objective function, x, a, and b time aligned

– 2 unique designs found

– 8 unique data flow paths

– 5 different directions

– Some PEs experience 6 different flows of data

• Minimum bandwidth secondary objective function

– Single unique minimum area design found

– Variable x placed in “center” of array

[Figures: two array designs, N=6]

Maximum Regularity Secondary Objective Function

• Desire simple orthogonal interconnection network topology with minimum number of interconnections

• Avoid time aligned variables (introduces O(N) memory per PE)

• Preference for “close” dependency relations between variables

• Four unique solutions found

• Reject

[Figures: array designs (N=6) showing the placement of variables x, a, and b]

1D DFT Design Example

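• Mathematical derivation (base-4 form)

$$Z[k] = \sum_{n=1}^{N} X[n] \, e^{-j(2\pi/N)(k-1)(n-1)}, \qquad k = 1, 2, \ldots, N \qquad (Z = CX)$$

Base-4 transformation: Z and X are arranged as 4 × N/4 arrays,

$$Z = \begin{pmatrix} Z_1 & \cdots & Z_{N/4} \\ Z_{1+N/4} & \cdots & Z_{N/2} \\ Z_{1+N/2} & \cdots & Z_{3N/4} \\ Z_{1+3N/4} & \cdots & Z_N \end{pmatrix}, \qquad X = \begin{pmatrix} X_1 & \cdots & X_{N/4} \\ X_{1+N/4} & \cdots & X_{N/2} \\ X_{1+N/2} & \cdots & X_{3N/4} \\ X_{1+3N/4} & \cdots & X_N \end{pmatrix}$$

giving (∘: element-wise product, as in the input code)

$$Y = W_M \circ (C_{M1} X), \qquad Z = C_{M2} Y^{t}$$

• SPADE input code

for j to N/4 do
  for k to N/4 do
    Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
  od;
  for k to 4 do
    Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..N/4);
  od
od;

• Desired constraints

– Minimize number of multipliers (time-align Y)

– Time-align X, Z at array boundary

– Keep coefficient matrices CM1 and CM2 internal to the array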
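The two-step structure of the SPADE input code can be checked numerically against a direct DFT. The Python sketch below is not taken from the slides and rests on several assumptions: the function names `base4_dft` and `direct_dft` are hypothetical, indices are 0-based, X and Z are stored row-major as 4 × N/4 arrays, the forward transform uses a negative exponent, and N is a multiple of 16. The last condition is what lets CM1 be written as a single N/4 × 4 matrix in this particular derivation; the matrix entries CM1[j][i] = (-i)^(j·i), CM2[k][i] = (-i)^(k·i), and WM[j][k] = exp(-2πi·j·k/N) are likewise worked out here, not quoted from the source.

```python
import cmath

# Powers of -i: the only values CM1 and CM2 take in this sketch.
UNITS = (1, -1j, -1, 1j)


def direct_dft(x):
    """Reference O(N^2) DFT, used only for comparison."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def base4_dft(x):
    """DFT via Y = WM .* (CM1 X), Z = CM2 Y^t (assumes len(x) % 16 == 0)."""
    n_pts = len(x)
    if n_pts % 16:
        raise ValueError("this sketch assumes N is a multiple of 16")
    m = n_pts // 4
    X = [[x[i * m + k] for k in range(m)] for i in range(4)]             # 4 x N/4
    CM1 = [[UNITS[(j * i) % 4] for i in range(4)] for j in range(m)]     # N/4 x 4
    CM2 = [[UNITS[(k * i) % 4] for i in range(m)] for k in range(4)]     # 4 x N/4
    WM = [[cmath.exp(-2j * cmath.pi * j * k / n_pts) for k in range(m)]
          for j in range(m)]                                             # twiddles
    # Step 1: additions only inside the sum, one twiddle multiply per Y entry.
    Y = [[WM[j][k] * sum(CM1[j][i] * X[i][k] for i in range(4))
          for k in range(m)] for j in range(m)]
    # Step 2: additions only (CM2 entries are 1, -1, i, or -i).
    Z = [[sum(CM2[k][i] * Y[j][i] for i in range(m)) for j in range(m)]
         for k in range(4)]
    return [Z[k][j] for k in range(4) for j in range(m)]
```

Only the element-wise product with WM needs general complex multiplications, one per entry of Y.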

Base-4 vs. Previous Systolic Designs

• CM1 and CM2 contain only elements from the set {1, -1, -i, i}

– CM1 X and CM2 Y^t involve only complex additions

• Twiddle factor matrix WM is of dimension N/4

– 4× fewer complex multiplies with 2× more complex adders (previous designs require one complex multiply/add per transform point)

• Takes advantage of reduced arithmetic with radix-4 butterfly, but transform length is not limited to N = r^m

$$Y = W_M \circ (C_{M1} X), \qquad Z = C_{M2} Y^{t}$$

1D DFT Systolic Design Result

• Maximum regularity secondary objective function

• Latency = 3N/4+7

• 16 designs found

• Very irregular space-time mappings

[Figures: systolic array and space-time views (N=64), showing variables X, Y, Z, CM1, and CM2]

DFT: Constraints Relaxed

• Requires either

– X/CM2 time aligned, Z/CM1 internal

– Z/CM1 time aligned, X/CM2 internal

• Minimum area secondary objective designs for 1D DFT

– Latency = N/2 + 8

– Six unique designs

– Block processing time = N/4 + 6

– Structure moderately irregular

[Figure: space-time view (N=64), showing variables X, Y, Z, CM1, CM2, IM1, and IM2]

1D DFT: Throughput Vs. Latency

• High computational efficiencies inside space-time variable mappings are necessary to achieve the best latencies

• High computational efficiency in entire space-time volume is necessary for high throughputs

• Designs need to be “stackable” in time

Latency and Throughput Optimal Designs

• Maximum regularity setting

• Two structurally different designs

– X/CM2 time aligned, Z/CM1 internal

• Latency = N/2 + 8

• Throughput = N/4 + 1

• Very regular structure

Systolic Array (N=64)

Space-time view, two DFT iterations (N=64)

2D NxN DFT Design

• N 1D “row” DFTs followed by N “column” DFTs

• 1D DFT computation by factoring, N = n1 × n2, and doing a 2D n1 × n2 DFT

• Uses both of two optimal systolic designs

– X/CM2 time aligned, Z/CM1 internal

Parameter                                  Prior Designs       Base-4 Array
Multipliers                                N^2                 N/4
Adders                                     N^2                 2N
Block Processing Time                      2N+1                N^2/2 + 6N + 18
Latency                                    4N                  N^2/2 + 25N/4 + 19
~area × throughput^-1 = (M + A/4)·CPD      5N^3/2 + 5N^2/4     3N^3/8 + 9N^2/2 + 27N/2
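The last row of the table can be reproduced from the rows above it. The Python sketch below interprets CPD as the block processing time in cycles; that reading is an assumption, chosen because it reproduces both closed forms in the table. The function and key names are hypothetical.

```python
def prior_design(n):
    """Resource counts for the prior designs, taken from the table."""
    return {"multipliers": n * n, "adders": n * n, "block_time": 2 * n + 1}


def base4_design(n):
    """Resource counts for the base-4 array, taken from the table."""
    return {"multipliers": n / 4, "adders": 2 * n,
            "block_time": n * n / 2 + 6 * n + 18}


def area_time(design):
    """(M + A/4) * CPD, with CPD assumed to be the block processing time."""
    return (design["multipliers"] + design["adders"] / 4) * design["block_time"]


for n in (16, 64, 256):
    print(n, area_time(prior_design(n)), area_time(base4_design(n)))
```

For N = 16 this gives 10560 for the prior designs and 2904 for the base-4 array, matching 5N^3/2 + 5N^2/4 and 3N^3/8 + 9N^2/2 + 27N/2.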

Systolic vs. “Pipelined” 16x16 DFT

Type         Mult   Add   Registers   ROM   RAM   t_cycle   Data/cycle
Systolic     4      32    80          16    256   mult      1.6
Pipelined†   6      32    292         24    -     mult      4

† S. Yu and E. Swartzlander, “A Pipelined Architecture for the Multidimensional DFT,” IEEE Trans. Signal Processing, Vol. 49, No. 9, Sept. 2001.

More Information

• “Automatic Generation of Systolic Array Designs For Reconfigurable Computing,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA '02), International Multiconference in Computer Science, Las Vegas, Nevada, June 24, 2002.

– General description of SPADE

– Faddeev algorithm (find CX+D, given AX=B, X is unknown)

• “Hardware Efficient Base-4 Systolic Architecture for Computing the Discrete Fourier Transform,” 2002 IEEE Workshop on Signal Processing Systems, San Diego, CA, October 16-18.

– Details of base-4 DFT designs

– Mapping to FPGAs

• www.centar.net (papers and extended viewgraphs)
