FFT Accelerator Project
Rohit Prakash (2003CS10186), Anand Silodia (2003CS50210)
Date: February 23, 2007
Current Objectives
• Validate the number of complex multiplications
• Run the code with the Intel compiler and compare the results
– For a single run
– For multiple runs
• Tabulate all the results
• Analyse the results using VTune
Number of Complex multiplications
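• Our result
– (11/4)*n*log4(n); for n = 1024 this evaluates to 14080 (the 8960 originally reported corresponds to a factor of 7/4)
• Result reported online
– (3/4)*n*log4(n) = 3840 for n = 1024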
• The inner loop is trivial and does not require any “complex multiplications”
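As a quick check of the arithmetic, the closed form can be evaluated directly. The sketch below is Python written for illustration (the project's own code is not shown on these slides); the helper names `log4_exact` and `complex_mults` and the `per_butterfly` parameter are assumptions. It assumes n is a power of 4, so that each of the log4(n) stages performs n/4 butterflies.

```python
def log4_exact(n):
    """Return log4(n) for n a power of 4; raise ValueError otherwise."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages


def complex_mults(n, per_butterfly):
    """Multiplications for a radix-4 FFT: n/4 butterflies in each of log4(n) stages."""
    return per_butterfly * (n // 4) * log4_exact(n)


print(complex_mults(1024, 3))   # 3840
print(complex_mults(1024, 11))  # 14080
print(complex_mults(1024, 7))   # 8960 (7 per butterfly reproduces this figure)
```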
Inner loop of our Algorithm
T ← A[k+j]
U ← w*A[k+j+m/4]
V ← w*w*A[k+j+m/2]
X ← w*w*w*A[k+j+3*m/4]
A[k+j] ← T+U+V+X
A[k+j+m/4] ← T+(i)U-V-(i)X
A[k+j+2m/4] ← T-U+V-X
A[k+j+3m/4] ← T-(i)U-V+(i)X
w ← w*w_m
Total number of multiplications in this loop: 11
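The loop body above can be written out as a small Python sketch. This is not the project's code: the function name `butterfly_block_running_twiddle` is hypothetical, w_m is taken to be exp(2*pi*i/m) (an assumption that matches the signs of the butterfly), and the tally of 11 follows one plausible reading of the slide: 1 + 2 + 3 multiplications for U, V and X, four multiplications by i, and one to update w.

```python
import cmath


def butterfly_block_running_twiddle(A, k, m):
    """Apply the radix-4 butterfly loop to the block A[k:k+m], in place.

    Assumes each quarter of the block already holds an (m/4)-point transform.
    Returns the number of multiplications, counted as described above.
    """
    q = m // 4
    w_m = cmath.exp(2j * cmath.pi / m)  # assumed principal m-th root of unity
    w = 1 + 0j                          # running twiddle factor
    mults = 0
    for j in range(q):
        T = A[k + j]
        U = w * A[k + j + q]              # 1 multiplication
        V = w * w * A[k + j + 2 * q]      # 2 multiplications
        X = w * w * w * A[k + j + 3 * q]  # 3 multiplications
        A[k + j] = T + U + V + X
        A[k + j + q] = T + 1j * U - V - 1j * X      # 2 multiplications by i
        A[k + j + 2 * q] = T - U + V - X
        A[k + j + 3 * q] = T - 1j * U - V + 1j * X  # 2 multiplications by i
        w = w * w_m                       # 1 multiplication
        mults += 11
    return mults


A = [1, 2, 3, 4]
count = butterfly_block_running_twiddle(A, 0, 4)
print(count, A)
```

For a block of size 4 the loop runs once with w = 1, so the call performs a plain 4-point transform and reports 11 multiplications.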
New Inner loop of our Algorithm
• T ← A[k+j]
• U ← twiddle[k]*A[k+j+m/4]
• V ← twiddle[2*k]*A[k+j+m/2]
• X ← twiddle[3*k]*A[k+j+3*m/4]
• A[k+j] ← T+U+V+X
• A[k+j+m/4] ← T+i*U-V-i*X
• A[k+j+2m/4] ← T-U+V-X
• A[k+j+3m/4] ← T-i*U-V+i*X
Total number of multiplications in this loop: 3
(3/4)*n*log4(n) = 3840 for n = 1024
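To see the count of 3 per butterfly in context, here is a minimal Python sketch of a complete radix-4 decimation-in-time transform built around this loop. It is an illustration under stated assumptions, not the project's implementation: the names `digit_reverse_base4` and `fft_radix4` are hypothetical, the twiddle table is indexed by `j * (n / m)` (the slide's exact indexing is not recoverable), the root of unity is exp(+2*pi*i/n) to match the signs on the slide, and multiplication by i is done by swapping real and imaginary parts so that it is not counted.

```python
import cmath


def digit_reverse_base4(index, num_digits):
    """Reverse the base-4 digits of index, treated as num_digits digits wide."""
    result = 0
    for _ in range(num_digits):
        result = (result << 2) | (index & 3)
        index >>= 2
    return result


def fft_radix4(a):
    """Radix-4 DIT transform of a sequence whose length is a power of 4.

    Returns (output, twiddle_multiplications).
    """
    n = len(a)
    num_digits = 0
    while 4 ** num_digits < n:
        num_digits += 1
    if 4 ** num_digits != n:
        raise ValueError("length must be a power of 4")

    # Base-4 digit-reversal permutation of the input.
    A = [a[digit_reverse_base4(p, num_digits)] for p in range(n)]
    # Twiddle factors computed once, instead of being updated inside the loop.
    twiddle = [cmath.exp(2j * cmath.pi * t / n) for t in range(n)]

    mults = 0
    m = 4
    while m <= n:
        q = m // 4
        stride = n // m  # twiddle[j * stride] equals the j-th power of w_m
        for k in range(0, n, m):
            for j in range(q):
                T = A[k + j]
                U = twiddle[j * stride] * A[k + j + q]
                V = twiddle[2 * j * stride] * A[k + j + 2 * q]
                X = twiddle[3 * j * stride] * A[k + j + 3 * q]
                mults += 3
                # i*(x + iy) = -y + ix, so no multiplication is needed.
                iU = complex(-U.imag, U.real)
                iX = complex(-X.imag, X.real)
                A[k + j] = T + U + V + X
                A[k + j + q] = T + iU - V - iX
                A[k + j + 2 * q] = T - U + V - X
                A[k + j + 3 * q] = T - iU - V + iX
        m *= 4
    return A, mults


_, count = fft_radix4([0j] * 1024)
print(count)  # 3840
```

Precomputing the table trades n stored complex values for the removal of the running product w*w_m and of the repeated powers w*w and w*w*w.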
Stuff we tried
• Improved the “bit reversal”
– Better than the last time
• Though still inefficient (O(n log n)), it runs faster than the previous implementation
• Many faster algorithms exist
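For reference, the difference between an O(n log n) and an O(n) construction of the permutation can be sketched in Python. Neither function is the project's code, and the linear-time recurrence is a commonly used one rather than the algorithm of any particular paper; all function names are hypothetical. The sketch uses radix-2 bit reversal and assumes n is a power of 2; a radix-4 FFT would reverse base-4 digits instead.

```python
def reverse_bits(index, bits):
    """Reverse the lowest `bits` bits of index, one bit per step."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_table_nlogn(n):
    """O(n log n): reverse each of the n indices independently."""
    bits = n.bit_length() - 1
    return [reverse_bits(i, bits) for i in range(n)]


def bit_reversal_table_linear(n):
    """O(n): derive rev[i] from the already computed rev[i >> 1]."""
    bits = n.bit_length() - 1
    rev = [0] * n
    for i in range(1, n):
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (bits - 1))
    return rev


def permute(a, rev):
    """Return a reordered so that position i holds a[rev[i]]."""
    return [a[rev[i]] for i in range(len(a))]


print(permute(list(range(8)), bit_reversal_table_linear(8)))  # [0, 4, 2, 6, 1, 5, 3, 7]
```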
System Specifications
• Processor: Intel Pentium 4 CPU, 3.00 GHz
• Cache size: 1 MB
• RAM: 1 GB
• Flags supported: sse, sse2
Results
[Bar chart: User time (ms) for 1024 points (single iteration). Categories: recursive, our best, FFTW. Series: icpc, g++.]
Results
[Bar chart: User time (ms) for 1024 points (10 iterations). Categories: recursive, our best, FFTW. Series: icpc, g++.]
Results
[Bar chart: User time (ms) for 4096 points (single iteration). Categories: recursive, our best, FFTW. Series: icpc, g++.]
Results
[Bar chart: User time (ms) for 4096 points (10 iterations). Categories: recursive, our best, FFTW. Series: icpc, g++.]
Results
[Bar chart: User time (ms) for 262144 points (single iteration). Categories: recursive, our best, FFTW. Series: icpc, g++.]
Results
[Bar chart: User time (ms) for 262144 points (10 iterations). Categories: recursive, our best, FFTW. Series: icpc, g++.]
Analysis
• Results are comparable for the following reasons
– Change in bit reversal
– Number of computations
• FFTW: compile option gcc
• The code still has to be rewritten to handle an arbitrary number of points
Tabular Representation (1024 points)
Time (ms)                          Real    User    System
Recursive (single run on icpc)       11       4         2
Recursive (single run on g++)        13       6         2
Final (single run on icpc)           10       1         4
Final (single run on g++)             9       2         4
FFTW (single run on icpc)            10       3         5
FFTW (single run on g++)              9       1         5
Recursive (10 runs on icpc)          28      21         5
Recursive (10 runs on g++)           56      46         6
Final (10 runs on icpc)              10       2         4
Final (10 runs on g++)               17      10         5
FFTW (10 runs on icpc)               11       4         4
FFTW (10 runs on g++)                10       1         7
Tabular Representation (4096 points)
Time (ms)                          Real    User    System
Recursive (single run on icpc)       18      10         4
Recursive (single run on g++)        29      23         3
Final (single run on icpc)           10       3         3
Final (single run on g++)            13       5         6
FFTW (single run on icpc)            11       3         4
FFTW (single run on g++)             10       4         2
Recursive (10 runs on icpc)          96      90         3
Recursive (10 runs on g++)          221     215         5
Final (10 runs on icpc)              12       5         3
Final (10 runs on g++)               49      41         6
FFTW (10 runs on icpc)               13       4         5
FFTW (10 runs on g++)                12       4         6
Tabular Representation (262144 points)
Time (ms)                          Real    User    System
Recursive (single run on icpc)      889     779       111
Recursive (single run on g++)      1971    1835       132
Final (single run on icpc)          108      82        22
Final (single run on g++)           430     402        25
FFTW (single run on icpc)            90      60        22
FFTW (single run on g++)             87      61        22
Recursive (10 runs on icpc)        9541    8400      1138
Recursive (10 runs on g++)        21652   20493      1029
Final (10 runs on icpc)             583     556        23
Final (10 runs on g++)             3836    3811        22
FFTW (10 runs on icpc)              601     579        18
FFTW (10 runs on g++)               604     578        21
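The user times for 262144 points can be compared directly by taking ratios. The Python sketch below only restates the tabulated numbers; the dictionary layout and the helper name `ratio` are illustrative assumptions, and no new measurements are introduced.

```python
# User times in ms for 262144 points, copied from the table.
# Key: (implementation, compiler, number of runs).
user_ms = {
    ("recursive", "icpc", 1): 779, ("recursive", "g++", 1): 1835,
    ("final", "icpc", 1): 82, ("final", "g++", 1): 402,
    ("fftw", "icpc", 1): 60, ("fftw", "g++", 1): 61,
    ("recursive", "icpc", 10): 8400, ("recursive", "g++", 10): 20493,
    ("final", "icpc", 10): 556, ("final", "g++", 10): 3811,
    ("fftw", "icpc", 10): 579, ("fftw", "g++", 10): 578,
}


def ratio(impl_a, impl_b, compiler, runs):
    """User time of impl_a divided by user time of impl_b."""
    return user_ms[(impl_a, compiler, runs)] / user_ms[(impl_b, compiler, runs)]


for compiler in ("icpc", "g++"):
    for runs in (1, 10):
        print(compiler, runs,
              "recursive/final = %.2f" % ratio("recursive", "final", compiler, runs),
              "final/fftw = %.2f" % ratio("final", "fftw", compiler, runs))
```

A ratio near 1 for final/fftw means the two implementations took about the same user time in that configuration.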
VTune Analysis
• TODO
• VTune (not available)
Further Improvements
• Fast digit reversal
• Fast “twiddle compute”
• TODO:
– Comparison with the Intel Math Kernel Library
– Study the FFTW implementation
– VTune analysis
• Try the Winograd algorithm
• Code more efficiently
References
• Alan H. Karp “Bit Reversal on Uniprocessors”
• Angelo A. Yong “A better FFT Bit-reversal Algorithm”
Thank You