FFT in Hardware and Software
Feb 25, 2016
FFT in Hardware and Software
Background
• Core Algorithm• Original Algorithm, the DFT, O(n2) complexity• New Algorithm, the FFT (Fast Fourier
Transform), O(nlog2(n)) depending on implementation.
DFT Computation
• A summation over the whole input array for every single element in the output array.
• A VERY computationally inefficient algorithm to implement.
]1[][])[()(
n
jnenxnxDFTX
FFT Computation
• A much more computationally efficient algorithm
• Works using the divide and conquer principle.
• First developed by Cooley and Tukey in 1965!
DFT vs. FFT (Number of Operations)
Problem Size (N)
Standard DFT(smaller is better)
FFT(smaller is better)
% of DFT(smaller is better)
2 4 1 254 16 4 258 64 12 1916 256 32 1332 1024 80 864 4096 192 5
128 16384 448 3256 65536 1024 2512 262144 2304 11024 1048576 5120 <1
Exponential Growth of DFT(Smaller is Better)
0
200
400
600
800
1000
1200
0 200 400 600 800 1000 1200
Thou
sand
s
Problem Size
Com
puta
tions
Req
uire
d
Nearly Linear Growth of FFT(Smaller is Better)
0
1
2
3
4
5
6
0 200 400 600 800 1000
Thou
sand
s
Problem Size
Com
puta
tions
Req
uire
d
DFT vs. FFT
Percent of DFT Computation Time(Smaller is Better)
0%
5%
10%
15%
20%
25%
30%
0 200 400 600 800 1000 1200
Problem Size
Perc
ent o
f DFT
Com
puta
tion
Tim
e
FFT Butterfly Operations
• Butterfly arrangement of computations• Repeated on successive pairs of input
data• Then half as many times on alternating
pairs• Then half again as many times on every
fourth element• …
The Butterfly
• Simple operations repeated many times
-WnN
xe[n]
xo[n]
X[n]
X[n+N/2]
WnN
Nnkjnk
N eW2
8-point FFT DemonstrationThe Entire Calculation
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT Demonstration
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
Why Hardware?
• Even more speed for FFT• Extremely parallelizable• A whole layer can be done in two FPGA
clock cycles– 1 multiply cycle– 1 add cycle– (Assuming sufficient multipliers)
Hardware Problems
• Complexity• Input speed• Output speed• If the FPGA takes 24.4ns but takes 20s
to transfer the input data, what gain is there?– i.e. 24.4ns + 20s + 20s = ~40s!
Mitigation of Hardware Problems– Use a faster bus
• AMD Opteron’s Hypertransport– 20.8 GB/s (166.4 Gb/s) per Link (V. 3)– Modules that fit into an AMD 64-bit Opteron
Socket
– http://www.drccomputer.com/pages/modules.html - xilinx based module
– http://www.xtremedatainc.com/xd1000_brief.html - altera based module
Mitigation of Hardware Problems
– Put the FPGA on the die with the DSP• Need silicon vendor support• FPGA can access memory on a very wide bus (i.e.
128 bits per cycle)– Implement the entire project in FPGA
• Time consuming to program• Possibly insufficient room on the FPGA
8-point FFT DemonstrationIn Hardware
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT DemonstrationIn Hardware
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT DemonstrationIn Hardware
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
8-point FFT DemonstrationIn Hardware
x[0]x[4]x[2]x[6]x[1]x[5]x[3]x[7]
X[0]X[1]X[2]X[3]X[4]X[5]X[6]X[7]
Input Array Output
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multiplication by W factor Addition+
Why Not Software?
• Each butterfly must be done sequentially• Only slight parallelism enabled by a DSP
like the TigerSHARC• Each Butterfly can be done in 2 cycles
(after optimization).
Results of Testing• Linear Profiling of FFT Algorithm in C++
StageCycle count Time
8-point 32-point 256-point 8-point 32-point 256-point
Initialization 21 25 25 35.07ns 41.75ns 41.75ns
Computation 6922 1135 174222 1.895 s 11.559 s 290.950 s
Butterfly 91 151.97ns
Results of Testing
• Profiling of VHDL on FPGA• Butterfly takes 24.377ns to execute
– 62% is computational, 38% is routing on FPGA
Product Offerings
• Most DSP Vendors• Many FPGA Vendors (IP – Intellectual Property)• Microcontroller Vendors (i.e. Blackfin)• FFTW – The Fastest Fourier Transform in the
West• AMD Math Core Library• Intel Library• Highly Optimized for the expected hardware
Published Results• The Radix 4 version delivers a 1 K points complex
processing time of 25 microseconds at 200-MHz system speeds and uses only about 10 percent of the resources in a mid-range Stratix device. The Radix 2 is half the size of the Radix 4 and offers a 1 K points complex processing time of 50 microseconds at 200-MHz system speeds. Additional versions of the new cores are under development. [6]
FFT IP Core Published Results [7]
FFT/IFFT length
Texas InstrumentsC6713
Single 4DSP FFT core
(Smaller is Better)
Quad 4DSP FFT core
(Smaller is Better)
256 12.3µs 3.68µs 920ns
512 27.3µs 6.24µs 1.56µs
1024 60.2µs 11.4µs 2.85µs
References[1] Signals Systems and Transforms[2] James W. Cooley and John W. Tukey, "An algorithm for
the machine calculation of complex Fourier series," Math. Comput. 19, 297–301 (1965).
[3] http://www.drccomputer.com/pages/modules.html - xilinx based module
[4] http://www.xtremedatainc.com/xd1000_brief.html - altera based module
[5] http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_2353,00.html
[6] http://www.us.design-reuse.com/news/news5650.html[7] http://www.4dsp.com/fft.htm