Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of
FPGA-based Systolic Arrays
Greg Nash
Reconfigurable Technology: FPGAs and Reconfigurable Processors for
Computing and Communications IV:
SPIE ITCom, Boston, MA, July 29, 2002
Outline
• Introduction to CAD tool, SPADE (symbolic parallel algorithm development environment)
A = LU
for i to N do
  for j to N do
    # first column of L, first row of U
    if j=1 and i>=1 and i<=N then
      l[i,j]:=a[i,j];
    elif i=1 and j>1 and j<=N then
      u[i,j]:=a[i,j]/l[i,i];
    fi;
    # remaining entries of L (on and below the diagonal)
    if i>=j and j>1 and i<=N then
      l[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..j-1)
    fi;
    # remaining entries of U (above the diagonal)
    if j>i and i>1 and j<=N then
      u[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..i-1))/l[i,i]
    fi;
  od
od;
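The recurrences in the SPADE input can be checked numerically with a minimal Python sketch. It is not SPADE code. It assumes 0-based indices, a unit diagonal on U (which the input code implies but never stores), and no pivoting; the names `lu_rowwise` and `matmul` are hypothetical helpers introduced only for this illustration.

```python
def lu_rowwise(a):
    """LU factorization following the same row-by-row recurrences (0-based indices).

    Assumptions: U has an implicit unit diagonal, and no pivoting is done,
    so a zero l[i][i] raises ZeroDivisionError.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for i in range(n):
        for j in range(n):
            if i >= j:
                # l[i][j] = a[i][j] - sum_{k<j} l[i][k]*u[k][j]; the sum is empty for j = 0
                l[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(j))
            else:
                # u[i][j] = (a[i][j] - sum_{k<i} l[i][k]*u[k][j]) / l[i][i]
                u[i][j] = (a[i][j] - sum(l[i][k] * u[k][j] for k in range(i))) / l[i][i]
    return l, u


def matmul(p, q):
    """Plain matrix product, used only to check that L*U reproduces A."""
    return [[sum(p[r][k] * q[k][c] for k in range(len(q))) for c in range(len(q[0]))]
            for r in range(len(p))]


a = [[4.0, 3.0], [6.0, 3.0]]
l, u = lu_rowwise(a)
print(l, u, matmul(l, u))
```

Because row i only reads rows computed earlier and columns to its left, the plain `for i / for j` order of the input code is a valid execution order for these dependencies.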
Algorithm Domain
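• Multiple statements of the general form

$$x(A_x I + a_x) \ \text{depends on} \ y(B_y I + b_y) \quad \text{for all } I \in \mathcal{I}^S$$

$$\text{e.g., } x(2i,\ j+1) = x\!\left( \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right)$$

– Where A_x, B_y / a_x, b_y are integer matrices/vectors, S is the dimension of the algorithm space, and the dependencies include commutative and associative operators such as min, max, and add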
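An affine index expression of this kind is just an integer matrix-vector product plus an offset. The short Python sketch below shows that arithmetic; the function name `affine_index` is hypothetical, and the matrix and vector are the ones from the x(2i, j+1) example as read from the slide.

```python
def affine_index(A, a, I):
    """Return A*I + a for an integer matrix A, offset vector a and index point I."""
    return [sum(A[r][c] * I[c] for c in range(len(I))) + a[r] for r in range(len(A))]


# Example reading of the slide: x(2i, j+1) uses A_x = [[2, 0], [0, 1]], a_x = [0, 1].
A_x = [[2, 0], [0, 1]]
a_x = [0, 1]

# Index points of a small 2-D algorithm space (S = 2) and the element of x each one references.
for i in range(1, 3):
    for j in range(1, 3):
        print((i, j), "->", affine_index(A_x, a_x, [i, j]))
```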
• Two minimum area, latency optimal designs (L=4N-3) found
• Four smaller area, non-optimal designs (L=4N-2) found
Space-Time View (N=6)
Minimum Bandwidth Secondary Objective Function
• Minimum area secondary objective function, x, a, and b time aligned
– 2 unique designs found
– 8 unique data flow paths
– 5 different directions
– Some PEs experience 6 different flows of data
• Minimum bandwidth secondary objective function
– Single unique minimum area design found
– Variable x placed in “center” of array
(Figures shown for N=6)
Maximum Regularity Secondary Objective Function
• Desire simple orthogonal interconnection network topology with minimum number of interconnections
• Avoid time aligned variables (introduces O(N) memory per PE)
• Preference for “close” dependency relations between variables
• Four unique solutions found
• Reject
(Figures, N=6: array views with variables x, a, and b labeled)
1D DFT Design Example
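• Mathematical derivation (base-4 form)

$$Z[k] = \sum_{n=1}^{N} X[n]\, e^{-j(2\pi/N)(k-1)(n-1)}, \quad k = 1, 2, \ldots, N$$

$$Z = CX \;\xrightarrow{\text{Base-4 Transformation}}\; Y = W_M \bullet C_{M1} X, \qquad Z = C_{M2} Y^t$$

$$Z = \begin{bmatrix} Z_{1}^{N/4} \\ Z_{1+N/4}^{N/2} \\ Z_{1+N/2}^{3N/4} \\ Z_{1+3N/4}^{N} \end{bmatrix}, \qquad X = \begin{bmatrix} X_{1}^{N/4} \\ X_{1+N/4}^{N/2} \\ X_{1+N/2}^{3N/4} \\ X_{1+3N/4}^{N} \end{bmatrix}$$

(Z and X are each partitioned into four blocks of N/4 consecutive elements; "•" in the first equation denotes element-by-element multiplication, as in the code.)

• SPADE input code
for j to N/4 do
  for k to N/4 do
    Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k],i=1..4);
  od;
  for k to 4 do
    Z[k,j] := add(CM2[k,i]*Y[j,i],i=1..N/4);
  od
od;

• Desired constraints
– Minimize number of multipliers (time-align Y)
– Time-align X, Z at array boundary
– Keep coefficient matrices CM1 and CM2 internal to the array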
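The base-4 form can be checked against a direct DFT with a small Python sketch that follows the loop structure of the SPADE input (Y from WM, CM1, and X; Z from CM2 and Y). The slides do not give the matrix entries, so the definitions used here are assumptions derived from the index mapping n = a·N/4 + b for the input and k·N/4 + j for the output (0-based): CM1[j][a] = (-i)^(j·a), CM2[k][b] = (-i)^(k·b), WM[j][b] = e^(-2πi·j·b/N). Under that mapping the derivation needs N to be a multiple of 16. The function names are hypothetical.

```python
import cmath

UNIT = [1, -1j, -1, 1j]  # successive powers of -i, the only values in CM1 and CM2


def dft_direct(x):
    """Reference O(N^2) DFT, 0-based version of the definition on the slide."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def dft_base4(x):
    """Base-4 form: Y = WM (elementwise) * (CM1 X), Z = CM2 Y^t (assumed matrix entries)."""
    n_pts = len(x)
    if n_pts % 16:
        raise ValueError("this sketch assumes N is a multiple of 16")
    m = n_pts // 4
    X = [x[a * m:(a + 1) * m] for a in range(4)]                      # 4 x N/4
    CM1 = [[UNIT[(j * a) % 4] for a in range(4)] for j in range(m)]   # N/4 x 4
    CM2 = [[UNIT[(k * b) % 4] for b in range(m)] for k in range(4)]   # 4 x N/4
    WM = [[cmath.exp(-2j * cmath.pi * j * b / n_pts) for b in range(m)]
          for j in range(m)]                                          # N/4 x N/4 twiddles
    # Y[j][b] = WM[j][b] * sum_a CM1[j][a] * X[a][b]: additions, then one multiply
    Y = [[WM[j][b] * sum(CM1[j][a] * X[a][b] for a in range(4)) for b in range(m)]
         for j in range(m)]
    # Z[k][j] = sum_b CM2[k][b] * Y[j][b]: additions only
    Z = [[sum(CM2[k][b] * Y[j][b] for b in range(m)) for j in range(m)] for k in range(4)]
    return [Z[k][j] for k in range(4) for j in range(m)]


x = [complex(n % 7, (3 * n) % 5) for n in range(32)]
err = max(abs(p - q) for p, q in zip(dft_direct(x), dft_base4(x)))
print("max abs difference:", err)
```

In this sketch the only general complex multiplications are the N/4 x N/4 twiddle products in Y; the CM1 and CM2 products reduce to sign changes and real/imaginary swaps.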
Base-4 vs. Previous Systolic Designs
• CM1 and CM2 contain only elements from the set {1, -1, -i, i}
– CM1 X and CM2 Y^t involve only complex additions
• Twiddle factor matrix WM is of dimension N/4 x N/4
– 4x fewer complex multiplies with 2x more complex adders (previous designs require one complex multiply/add per transform point)
• Takes advantage of the reduced arithmetic of the radix-4 butterfly, but transform length is not limited to N = r^m
$$Y = W_M \bullet C_{M1} X, \qquad Z = C_{M2} Y^t$$
1D DFT Systolic Design Result
• Maximum regularity secondary objective function
• Latency = 3N/4+7
• 16 designs found
• Very irregular space-time mappings
(Figures, N=64: systolic array and space-time views, with variables X, Y, Z, CM1, and CM2 labeled)
DFT: Constraints Relaxed
• Requires either
– X/CM2 time aligned, Z/CM1 internal
– Z/CM1 time aligned, X/CM2 internal
• Minimum area secondary objective designs for 1D DFT
– Latency = N/2 + 8
– Six unique designs
– Block processing time = N/4 + 6
– Structure moderately irregular
(Figure: Space-Time View, N=64, with variables X, Y, Z, CM1, CM2, IM1, and IM2 labeled)
1D DFT: Throughput Vs. Latency
• High computational efficiencies inside space-time variable mappings are necessary to achieve the best latencies
• High computational efficiency in entire space-time volume is necessary for high throughputs
• Designs need to be “stackable” in time
Latency and Throughput Optimal Designs
• Maximum regularity setting
• Two structurally different designs
– X/CM2 time aligned, Z/CM1 internal
– Z/CM1 time aligned, X/CM2 internal
• Latency = N/2 + 8
• Throughput = N/4 + 1
• Very regular structure
Systolic Array (N=64)
Space-time view, two DFT iterations (N=64)
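For a concrete sense of scale, the latency and throughput expressions quoted for the 1D DFT designs can be evaluated at N=64, the size used in the figures. The sketch below only does that arithmetic. It assumes the values are clock cycles and that the throughput figure is the number of cycles between successive transforms; neither is stated explicitly on the slides. The function and key names are hypothetical.

```python
def dft_design_figures(n_pts):
    """Evaluate the 1D DFT design expressions from the slides (units assumed: cycles)."""
    if n_pts % 4:
        raise ValueError("the expressions assume N is a multiple of 4")
    return {
        "latency, all constraints (3N/4 + 7)": 3 * n_pts // 4 + 7,
        "latency, constraints relaxed (N/2 + 8)": n_pts // 2 + 8,
        "block processing time, relaxed (N/4 + 6)": n_pts // 4 + 6,
        "throughput, latency/throughput optimal (N/4 + 1)": n_pts // 4 + 1,
    }


for name, value in dft_design_figures(64).items():
    print(f"{name}: {value}")
```

At N=64 this gives 55 and 40 for the two latencies, and 22 and 17 for the block processing time and throughput figures.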
2D NxN DFT Design
• N 1D “row” DFTs followed by N “column” DFTs
• 1D DFT computation by factoring, N = n1 * n2, and doing 2D n1 x n2 DFT
• Uses both of two optimal systolic designs
– X/CM2 time aligned, Z/CM1 internal
Type        Mult  Add  Registers  ROM  RAM  tcycle  Data/cycle
Systolic    4     32   80         16   256  mult    1.6
Pipelined†  6     32   292        24   -    mult    4

† S. Yu and E. Swartzlander, “A Pipelined Architecture for the Multidimensional DFT,” IEEE Trans. Signal Processing, Vol. 49, No. 9, Sept. 2001.
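The row-column decomposition named on this slide (1D DFTs over the rows, then 1D DFTs over the columns) can be illustrated in a few lines of Python. This is a generic numerical sketch, not the systolic or pipelined architecture compared in the table; `dft1d`, `dft2d_row_column`, and `dft2d_direct` are hypothetical names.

```python
import cmath


def dft1d(x):
    """Direct 1D DFT of a sequence."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def dft2d_row_column(a):
    """2D DFT computed as row DFTs followed by column DFTs."""
    rows = [dft1d(row) for row in a]
    n_rows, n_cols = len(rows), len(rows[0])
    cols = [dft1d([rows[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    # cols[c][k] holds output row k, column c; transpose back to row-major order
    return [[cols[c][k] for c in range(n_cols)] for k in range(n_rows)]


def dft2d_direct(a):
    """Reference 2D DFT straight from the double-sum definition."""
    n_rows, n_cols = len(a), len(a[0])
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (k1 * r / n_rows + k2 * c / n_cols))
                 for r in range(n_rows) for c in range(n_cols))
             for k2 in range(n_cols)]
            for k1 in range(n_rows)]


a = [[complex(r + 2 * c, r * c % 3) for c in range(4)] for r in range(4)]
fast, ref = dft2d_row_column(a), dft2d_direct(a)
print(max(abs(fast[r][c] - ref[r][c]) for r in range(4) for c in range(4)))
```

The same row-column idea applies when a 1D transform of length N = n1 * n2 is recast as an n1 x n2 2D problem, with the extra twiddle multiplication between the two passes that the base-4 form carries in WM.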
More Information
• “Automatic Generation of Systolic Array Designs For Reconfigurable Computing”, Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA '02), International Multiconference in Computer Science, Las Vegas, Nevada, June 24, 2002.
– General description of SPADE
– Faddeev algorithm (Find CX+D, given AX=B, X is unknown)
• “Hardware Efficient Base-4 Systolic Architecture for Computing the Discrete Fourier Transform”, 2002 IEEE Workshop on Signal Processing Systems, San Diego CA, October 16-18.
– Details of base-4 DFT designs
– Mapping to FPGAs