Introduction to FFT Processors

Chih-Wei Liu
VLSI Signal Processing Lab
Department of Electronics Engineering
National Chiao-Tung University
FFT Design
FFT
• Consists of a series of complex additions and complex multiplications
Algorithm
• Cooley-Tukey decomposition for power-of-two length FFT
Architecture
• Systematic mapping procedure
Algorithm Level
Cooley-Tukey decomposition
Radix-2, decimation-in-frequency

$$A_{2k} = \sum_{n=0}^{N-1} x_n W_N^{2nk} = \sum_{n=0}^{N/2-1} \left(x_n + x_{n+N/2}\right) W_{N/2}^{nk}$$

$$A_{2k+1} = \sum_{n=0}^{N-1} x_n W_N^{n(2k+1)} = \sum_{n=0}^{N/2-1} \left(x_n - x_{n+N/2}\right) W_N^{n}\, W_{N/2}^{nk}$$
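The radix-2 decimation-in-frequency split (sums of x_n and x_{n+N/2} feed the even outputs A_{2k}; differences multiplied by W_N^n feed the odd outputs A_{2k+1}) can be checked numerically. The following is a minimal sketch, not taken from the slides; it assumes the usual convention W_N = exp(-j2π/N), and the function names `dft` and `fft_dif_radix2` are illustrative.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dif_radix2(x):
    """Recursive radix-2 DIF FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Butterfly: sums feed the even outputs, twiddled differences the odd outputs.
    sums = [x[n] + x[n + half] for n in range(half)]
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)  # W_N^n (assumed sign)
             for n in range(half)]
    A = [0j] * N
    A[0::2] = fft_dif_radix2(sums)    # A_{2k}
    A[1::2] = fft_dif_radix2(diffs)   # A_{2k+1}
    return A

x = [complex(n, -0.5 * n) for n in range(16)]
max_error = max(abs(a - b) for a, b in zip(fft_dif_radix2(x), dft(x)))
```

Here `max_error` should be at the level of floating-point rounding, which confirms that the two half-length sums reproduce the full DFT.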
Variants based on CT algorithm
Fixed radix: Radix-2, Radix-4, Radix-8, Radix-2^2
Mixed radix: Split-radix, Radix-2/8, Radix-2/4/8
Number of additions
• Same for any mixed-radix or fixed-radix algorithm.
Number of multiplications
• Depends on the reduction of trivial multiplications.
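To illustrate how the multiplication count depends on trivial twiddle factors, the sketch below counts the non-trivial twiddle multiplications of a radix-2 DIF FFT and of a split-radix 2/4 FFT. It is an illustration only: it assumes that "trivial" means multiplication by ±1 or ±j, and both function names are hypothetical.

```python
def nontrivial_radix2(N):
    """Count twiddles W_M^n (n = 0..M/2-1) that are not +-1 or +-j, over all stages."""
    count = 0
    M = N
    while M >= 2:
        blocks = N // M               # independent length-M sub-transforms at this stage
        for n in range(M // 2):
            if (4 * n) % M != 0:      # W_M^n is trivial only when n is a multiple of M/4
                count += blocks
        M //= 2
    return count

def nontrivial_split_radix(N):
    """Count twiddles W_N^n and W_N^{3n} (n = 0..N/4-1) that are not +-1 or +-j."""
    if N < 4:
        return 0
    count = 0
    for n in range(N // 4):
        for exponent in (n, 3 * n):
            if (4 * exponent) % N != 0:
                count += 1
    # One half-length and two quarter-length sub-transforms follow.
    return count + nontrivial_split_radix(N // 2) + 2 * nontrivial_split_radix(N // 4)

counts_16 = (nontrivial_radix2(16), nontrivial_split_radix(16))
```

Under this definition of "trivial", `counts_16` evaluates to `(10, 8)` for N = 16.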
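[Figure: radix-2 DIF butterfly. Inputs x_n and x_{n+N/2}; outputs A_{2k} and A_{2k+1}; the lower branch is multiplied by -1 and then by the twiddle factor W_N^n.]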
Hence, increase additions
FFT Algorithms
Review of Radix-2^r algorithms
• DIF (decimation in frequency) and DIT (decimation in time) versions
• Radix-2 algorithm
• Radix-4 and Radix-2^2 algorithms
• Radix-8 and Radix-2^3 algorithms
• Split-radix 2/4 and split-radix 2/8
1st and 2nd stages in R2MDC (N=16)
[Figure: stages 1 and 2 of the R2MDC pipeline, with delay elements Z^-8 and Z^-4 and switch SW1. The input x(n) is split into the sample streams 7 ... 0 and 15 ... 8 and regrouped into streams of four samples for stage 2.]
[Figure: 16-point radix-2 DIF signal-flow graph, stages 1 to 4, annotated with N/2, N/4, N/8 and N/16 input pairs. Inputs x(0) ... x(15) are in natural order. Outputs appear in the order X(0), X(8), X(4), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15). Twiddle factors: W_16^0 ... W_16^7 after stage 1; W_16^0, W_16^2, W_16^4, W_16^6 after stage 2; W_16^0 and W_16^4 after stage 3.]
[Figure: data ordering in the delay-commutator at successive clock cycles.]
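The 16-point radix-2 DIF flow graph delivers its outputs in the order X(0), X(8), X(4), X(12), ..., which is the bit-reversed order of the natural indices. A small sketch of that index permutation follows; the helper name `digit_reverse` is an assumption made for illustration, and it is written for a general radix so the same function also covers base-4 digit reversal.

```python
def digit_reverse(index, radix, digits):
    """Reverse the base-`radix` digits of `index`, using `digits` digits in total."""
    result = 0
    for _ in range(digits):
        result = result * radix + index % radix   # move the lowest digit to the top
        index //= radix
    return result

# Output order of a 16-point radix-2 DIF FFT (4 binary digits).
radix2_order = [digit_reverse(p, 2, 4) for p in range(16)]

# Output order of a 16-point radix-4 DIF FFT (2 base-4 digits).
radix4_order = [digit_reverse(p, 4, 2) for p in range(16)]
```

`radix2_order` starts 0, 8, 4, 12, 2, 10, ... and `radix4_order` starts 0, 4, 8, 12, 1, 5, ..., so a final reordering pass (or reversed addressing) is needed to obtain the outputs in natural order.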
R4MDC: Radix-4 Multi-Path Delay Commutator
[Figure: 16-point radix-4 signal-flow graph in two stages. Inputs x(0) ... x(15) are in natural order. Outputs appear in the order X(0), X(4), X(8), X(12), X(1), X(5), X(9), X(13), X(2), X(6), X(10), X(14), X(3), X(7), X(11), X(15). Twiddle exponents between the stages: 0, 0, 0, 0; 0, 1, 2, 3; 0, 2, 4, 6; 0, 3, 6, 9.]
[Figure: R4MDC block diagram. Two BF4 stages with coefficient and control inputs, connected through commutators and delay lines of 12, 8, 4 samples, 3, 2, 1 samples and 1, 2, 3 samples.]
RrMDC
[Figure: general radix-r multi-path delay commutator. (a) Input stage and k-th stage, each built from a computational element with coefficient inputs, a commutator and delay lines whose lengths scale with N and r. (b) Commutator detail: outputs from the previous stage, commutator control, and paths to the next stage.]
Delay Feedback
• R2SDF
• R4SDF
• R2^2SDF

R2SDF (N=16): Radix-2 Single-Path Delay Feedback
R2SDF (N=16) vs. R4SDF (N=128)
[Figure: R4SDF pipeline of four BF4 stages; each stage has three feedback delay lines, of 64, 16, 4 and 1 samples respectively.]
Buffer Styles of Pipeline Architecture
• R2 delay-commutator (R2MDC): inefficient memory usage (50%).
• R2 delay-feedback (R2SDF): 100% memory usage.

Single PE Architecture
• A single radix-2 butterfly PE (BF_PE) with a shared-memory architecture.
[Figure: one RAM connected to one BF_PE.]
Concluding Remarks
• The split-radix algorithm has lower computational complexity than the fixed-radix algorithms. However, its butterfly operation is irregular (L-shaped).
• The pipeline architecture processes data faster than the single-PE architecture. However, the single-PE architecture is the most area-efficient, especially for long FFT/IFFT applications.
Review of Traditional FFT Design
Steps
1. Given an N-point FFT spec., choose a fixed-radix algorithm.
2. Design the radix-r butterfly, multiplier, etc.
3. Cascade log_r N stages to compute the N-point FFT.
An arbitrary radix can be used, based on the Cooley-Tukey decomposition for any composite number.
Problem of Traditional Approach
• It cannot derive an architecture for mixed-radix algorithms.
• Processing speed is no longer the critical issue.
• Chip area and power consumption dominate the design quality.
• A re-configurable FFT/IFFT architecture is necessary for various applications.
• A length-scalable and latency-specified FFT/IFFT core is necessary.

We adopt the split-radix 2/4 algorithm to realize the FFT module.
$$A_{2k} = \sum_{n=0}^{N/2-1} \left(X_n + X_{n+N/2}\right) W_{N/2}^{kn}$$

$$A_{4k+1} = \sum_{n=0}^{N/4-1} \left(X_n - jX_{n+N/4} - X_{n+N/2} + jX_{n+3N/4}\right) W_N^{n}\, W_{N/4}^{kn}$$

$$A_{4k+3} = \sum_{n=0}^{N/4-1} \left(X_n + jX_{n+N/4} - X_{n+N/2} - jX_{n+3N/4}\right) W_N^{3n}\, W_{N/4}^{kn}$$
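The split-radix 2/4 decomposition computes the even outputs A_{2k} with one half-length transform and the odd outputs A_{4k+1} and A_{4k+3} with two quarter-length transforms. The sketch below is one possible recursive rendering of that decomposition, not the processing-element design of the slides; it assumes W_N = exp(-j2π/N), and the names `dft` and `fft_split_radix` are illustrative.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_split_radix(x):
    """Recursive split-radix 2/4 DIF FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    h, q = N // 2, N // 4
    # Even outputs: one length-N/2 transform of the sums X_n + X_{n+N/2}.
    even = fft_split_radix([x[n] + x[n + h] for n in range(h)])
    t1, t3 = [], []
    for n in range(q):
        d0 = x[n] - x[n + h]              # X_n - X_{n+N/2}
        d1 = x[n + q] - x[n + 3 * q]      # X_{n+N/4} - X_{n+3N/4}
        w1 = cmath.exp(-2j * cmath.pi * n / N)       # W_N^n (assumed sign)
        w3 = cmath.exp(-2j * cmath.pi * 3 * n / N)   # W_N^{3n}
        t1.append((d0 - 1j * d1) * w1)    # input of the A_{4k+1} sub-transform
        t3.append((d0 + 1j * d1) * w3)    # input of the A_{4k+3} sub-transform
    A = [0j] * N
    A[0::2] = even
    A[1::4] = fft_split_radix(t1)
    A[3::4] = fft_split_radix(t3)
    return A

x = [complex(n % 5, (3 * n) % 7) for n in range(32)]
max_error = max(abs(a - b) for a, b in zip(fft_split_radix(x), dft(x)))
```

The mix of one radix-2 branch and two radix-4 branches is what gives the butterfly its L shape.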
The Kernel of Processing Element
[Figure: 16-point split-radix 2/4 signal-flow graph connecting the sixteen X nodes (X_0 ... X_15) and the sixteen A nodes (A_0 ... A_15). Branches carry the factors -1 and -j and the twiddle factors W_8^0, W_8^1, W_8^3 and W_16^0, W_16^1, W_16^2, W_16^3, W_16^6, W_16^9.]
Folded Butterfly Units
Compared with radix-2/radix-2^2, it halves the number of memory accesses.
[Figure: two butterfly units with multiplexers at their inputs and outputs and a feedback path.]
Storage Blocks
• We use multiple single-port memory banks to replace the multi-port memory.
• The concept of conflict-free memory (a vertex coloring problem).
[Figure: conflict graph over the data words x(0) ... x(7), colored with two colors.]
Bank0: x(0), x(3), x(5), x(6)
Bank1: x(1), x(2), x(4), x(7)
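The two-bank assignment above (Bank0 holds x(0), x(3), x(5), x(6); Bank1 holds x(1), x(2), x(4), x(7)) coincides with colouring each word by the parity of its index bits. The slides do not say how the colouring was obtained, so the parity rule below is an assumption that happens to reproduce it; `bank_of` and `butterfly_pairs` are hypothetical helper names.

```python
def bank_of(index):
    """Bank number = parity (XOR of all bits) of the data index."""
    return bin(index).count("1") % 2

def butterfly_pairs(n_points):
    """Yield the (a, b) index pair of every radix-2 butterfly in every stage."""
    span = n_points // 2
    while span >= 1:
        for a in range(n_points):
            if a & span == 0:
                yield a, a | span     # the two operands differ in exactly one bit
        span //= 2

banks = {0: [], 1: []}
for i in range(8):
    banks[bank_of(i)].append(i)

# A conflict would be a butterfly whose two operands sit in the same bank.
conflicts = [(a, b) for a, b in butterfly_pairs(8) if bank_of(a) == bank_of(b)]
```

Because the two operands of a radix-2 butterfly differ in exactly one index bit, their parities differ, so `conflicts` is empty and both operands can be read from single-port banks in the same cycle.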
Scalable Memory Address Generator
• A solution to this vertex coloring problem must exist.
• The best solution: the proposed Interleave Rotated Data Allocation (IRDA) algorithm.
[Figure: block diagram with address generators (AG) for 64-length and 16-length FFTs, an address switcher (AS), four AT blocks, memory banks RAM-0 to RAM-3, and a rotator.]
The IRDA Concept
• Conflict-free memory banks.
• Simple and length-scalable design.
• The circular shift rotator.
• Multiple-PE architecture: 2 pipeline PEs, for example.

Group 1
RAM 0   RAM 1   RAM 2   RAM 3
00      02      04      06
14      08      10      12
20      22      16      18
26      28      30      24

Group 2
RAM 4   RAM 5   RAM 6   RAM 7
01      03      05      07
15      09      11      13
21      23      17      19
27      29      31      25
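The 32-word allocation table can be regenerated with a simple rule: even indices go to group 1 and odd indices to group 2, each row holds four consecutive words of the group, and row r is circularly shifted right by r positions. This rule is inferred from the table alone and is not claimed to be the full IRDA algorithm; the function name `rotated_layout` and its parameters are hypothetical.

```python
def rotated_layout(n_words=32, banks_per_group=4):
    """Return [group1_rows, group2_rows]; each row lists the word stored in each bank."""
    words_per_row = 2 * banks_per_group        # each row pair covers 8 consecutive indices
    groups = []
    for parity in (0, 1):                      # group 1: even indices, group 2: odd indices
        rows = []
        for r in range(n_words // words_per_row):
            base = r * words_per_row + parity
            row = [base + 2 * c for c in range(banks_per_group)]
            shift = r % banks_per_group        # rotate row r right by r banks
            rows.append(row[-shift:] + row[:-shift] if shift else row)
        groups.append(rows)
    return groups

group1, group2 = rotated_layout()
```

With the default arguments, `group1[1]` is `[14, 8, 10, 12]` and `group2[3]` is `[27, 29, 31, 25]`, matching the corresponding rows of the table.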
The Cached-FFT Algorithm
Overview
1. Input data are loaded into an N-word main memory.
2. C of the N words are loaded into the cache.
3. As many butterflies as possible are computed using the data in the cache.
4. Processed data in the cache are flushed to main memory.
5. Steps 2-4 are repeated until all N words have been processed once.
6. Steps 2-5 are repeated until the FFT has been completed.
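The six steps can be sketched in software for the simplest case of two passes over main memory. This is an illustrative model under stated assumptions, not the hardware design: it assumes N = C^2 with C a power of two, a radix-2 DIF butterfly, W_N = exp(-j2π/N), and a final bit-reversal reordering; `cached_fft`, `bit_reverse` and `dft` are hypothetical names.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def cached_fft(x, cache_size):
    N, C = len(x), cache_size
    assert C * C == N and C & (C - 1) == 0, "sketch assumes N = C^2, C a power of two"
    stages_per_pass = C.bit_length() - 1             # log2(C) butterfly stages per pass
    mem = list(x)                                    # step 1: load main memory
    for pass_no in range(2):                         # step 6: repeat until the FFT is done
        for g in range(N // C):                      # step 5: until all N words are processed
            # Pass 0 gathers words C apart; pass 1 gathers C contiguous words.
            addrs = [g + C * i for i in range(C)] if pass_no == 0 \
                else [g * C + i for i in range(C)]
            cache = [mem[a] for a in addrs]          # step 2: load C words into the cache
            for s in range(stages_per_pass):         # step 3: butterflies inside the cache
                span = C >> (s + 1)                  # operand distance inside the cache
                D = N >> (pass_no * stages_per_pass + s + 1)   # distance in main memory
                for i in range(C):
                    if i & span == 0:
                        u, v = cache[i], cache[i | span]
                        w = cmath.exp(-2j * cmath.pi * (addrs[i] % D) / (2 * D))
                        cache[i] = u + v
                        cache[i | span] = (u - v) * w
            for a, value in zip(addrs, cache):       # step 4: flush the cache
                mem[a] = value
    # In-place DIF leaves the results in bit-reversed order.
    return [mem[bit_reverse(k, 2 * stages_per_pass)] for k in range(N)]

x = [complex(n % 5, (3 * n) % 7) for n in range(16)]
max_error = max(abs(a - b) for a, b in zip(cached_fft(x, 4), dft(x)))
```

Each word travels between main memory and the cache only twice, while log2(C) butterfly stages run on it per visit, which is the source of the reduced main-memory traffic.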