FFT Accelerator Project

Rohit Prakash (2003CS10186), Anand Silodia (2003CS50210)

Date: February 23, 2007

Current Objectives

• Validate the number of complex multiplications

• Run the code with the Intel compiler and compare the results
  – for a single run
  – for multiple runs

• Tabulate all the results

• Analyse these using VTune

Number of Complex multiplications

• Our result
  – (11/4) * n * log4(n) = 8960

• Result found online
  – (3/4) * n * log4(n) = 3840

• The inner loop is trivial and does not require any “complex multiplications”
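
The counts can be checked with a few lines of arithmetic. The sketch below is in Python for brevity (the project's own code is not shown in these slides) and assumes n = 1024, the value for which (3/4) * n * log4(n) equals 3840. The function names `log4_exact` and `complex_mults` are hypothetical. A radix-4 FFT has log4(n) stages of n/4 butterflies each, so c multiplications per butterfly give (c/4) * n * log4(n) in total.

```python
def log4_exact(n):
    """Return log4(n) for n an exact power of 4; raise ValueError otherwise."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages


def complex_mults(n, per_butterfly):
    """Total multiplications: per_butterfly * (n/4 butterflies) * log4(n) stages."""
    butterflies = (n // 4) * log4_exact(n)
    return per_butterfly * butterflies


# Assumption: n = 1024, as in the smallest test case of the results.
for c in (3, 7, 11):
    print(c, complex_mults(1024, c))
```

For n = 1024 this gives 3840 for 3 multiplications per butterfly, 8960 for 7, and 14080 for 11. The figure of 8960 therefore matches 7 multiplications per butterfly, which may be the count that leaves out the four multiplications by i; that reading is an inference, not something the slides state.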

Inner loop of our Algorithm

T ← A[k+j]
U ← w * A[k+j+m/4]
V ← w*w * A[k+j+m/2]
X ← w*w*w * A[k+j+3*m/4]
A[k+j] ← T + U + V + X
A[k+j+m/4] ← T + i*U - V - i*X
A[k+j+2m/4] ← T - U + V - X
A[k+j+3m/4] ← T - i*U - V + i*X
w ← w * w_m

Total number of multiplications in this loop: 11

New Inner loop of our Algorithm

• T ← A[k+j]
• U ← twiddle[k] * A[k+j+m/4]
• V ← twiddle[2*k] * A[k+j+m/2]
• X ← twiddle[3*k] * A[k+j+3*m/4]
• A[k+j] ← T + U + V + X
• A[k+j+m/4] ← T + i*U - V - i*X
• A[k+j+2m/4] ← T - U + V - X
• A[k+j+3m/4] ← T - i*U - V + i*X

Total number of multiplications in this loop: 3

(3/4) * n * log4(n) = 3840
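
To make the table-based butterfly concrete, here is a minimal Python sketch of an iterative radix-4 transform built around the same four update lines. It is not the project's code. Several details are assumptions: the input is reordered by base-4 digit reversal, w_m = exp(+2*pi*i/m) (the sign convention implied by the +i*U term), and the twiddle table holds powers of w_n and is indexed by j * (n/m) rather than by the literal `twiddle[k]` written on the slide. The names `digit_reverse4` and `radix4_fft` are hypothetical.

```python
import cmath


def digit_reverse4(index, digits):
    """Reverse the base-4 digits of index, treating it as `digits` digits long."""
    result = 0
    for _ in range(digits):
        result = (result << 2) | (index & 3)
        index >>= 2
    return result


def radix4_fft(x):
    """Return (transform, twiddle multiplication count) for len(x) a power of 4."""
    n = len(x)
    digits = 0
    while 4 ** digits < n:
        digits += 1
    if 4 ** digits != n:
        raise ValueError("length must be a power of 4")

    # Reorder the input so each stage can combine four adjacent sub-blocks.
    a = [x[digit_reverse4(i, digits)] for i in range(n)]
    # Precomputed table: twiddle[t] = w_n ** t with w_n = exp(2*pi*i/n).
    twiddle = [cmath.exp(2j * cmath.pi * t / n) for t in range(n)]

    mults = 0
    m = 4
    while m <= n:
        stride = n // m          # w_m ** j == twiddle[j * stride]
        q = m // 4
        for k in range(0, n, m):
            for j in range(q):
                t = a[k + j]
                u = twiddle[j * stride] * a[k + j + q]
                v = twiddle[2 * j * stride] * a[k + j + 2 * q]
                xx = twiddle[3 * j * stride] * a[k + j + 3 * q]
                mults += 3       # only the three table multiplications are counted
                a[k + j] = t + u + v + xx
                a[k + j + q] = t + 1j * u - v - 1j * xx
                a[k + j + 2 * q] = t - u + v - xx
                a[k + j + 3 * q] = t - 1j * u - v + 1j * xx
        m *= 4
    return a, mults


if __name__ == "__main__":
    _, count = radix4_fft([0j] * 1024)
    print(count)
```

The counter follows the slide's convention: multiplications by i are treated as free because they only swap and negate the real and imaginary parts, and the j = 0 twiddles (equal to 1) are still counted.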

Stuff we tried

• Improved the “bit reversal”
  – better than last time

• Though still inefficient (O(n log n)), it runs faster than the previous implementation

• Many faster algorithms exist
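
For reference, an O(n log n) bit reversal of the kind described here can be sketched as follows. This is an illustrative Python version, not the project's implementation: each of the n indices is reversed bit by bit in log2(n) steps, and each pair is swapped once. The names `reverse_bits` and `bit_reverse_permute` are hypothetical.

```python
def reverse_bits(index, bits):
    """Reverse the lowest `bits` bits of index, one bit per loop iteration."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reverse_permute(a):
    """Permute list `a` in place into bit-reversed order; len(a) must be a power of 2."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    bits = n.bit_length() - 1
    for i in range(n):
        r = reverse_bits(i, bits)
        if i < r:                # swap each pair only once
            a[i], a[r] = a[r], a[i]
    return a


if __name__ == "__main__":
    print(bit_reverse_permute(list(range(8))))
```

The faster algorithms mentioned above avoid recomputing each reversed index from scratch, which is where the log n factor in this version comes from.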

System Specifications

• Processor: Intel Pentium 4 CPU, 3.00 GHz

• Cache size: 1 MB

• RAM: 1 GB

• Flags supported: sse, sse2

Results

[Chart: user time (ms) for 1024 points (single iteration). Categories: recursive, our best, FFTW. Series: icpc, g++.]

Results

[Chart: user time (ms) for 1024 points (10 iterations). Series: icpc, g++.]

Results

[Chart: user time for 4096 points (single iteration). Series: icpc, g++.]

Results

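[Chart, caption missing: series icpc and g++, axis range 0 to 250; consistent with user time (ms) for 4096 points (10 iterations).]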
Results

[Chart: user time (ms) for 262144 points (single iteration). Series: icpc, g++.]

Results

[Chart, caption missing: series icpc and g++, axis range 0 to 25000; consistent with user time (ms) for 262144 points (10 iterations).]
Analysis

• Results are comparable for the following reasons
  – change in bit reversal
  – number of computations

• FFTW: compiled with gcc

• The code still has to be rewritten to handle an arbitrary number of points

Tabular Representation (1024 points)

Time (ms)

Configuration                   Real   User   System
Recursive (single run, icpc)      11      4        2
Recursive (single run, g++)       13      6        2
Final (single run, icpc)          10      1        4
Final (single run, g++)            9      2        4
FFTW (single run, icpc)           10      3        5
FFTW (single run, g++)             9      1        5
Recursive (10 runs, icpc)         28     21        5
Recursive (10 runs, g++)          56     46        6
Final (10 runs, icpc)             10      2        4
Final (10 runs, g++)              17     10        5
FFTW (10 runs, icpc)              11      4        4
FFTW (10 runs, g++)               10      1        7

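Tabular Representation (4096 points)

Time (ms)

Configuration                   Real   User   System
Recursive (single run, icpc)      18     10        4
Recursive (single run, g++)       29     23        3
Final (single run, icpc)          10      3        3
Final (single run, g++)           13      5        6
FFTW (single run, icpc)           11      3        4
FFTW (single run, g++)            10      4        2
Recursive (10 runs, icpc)         96     90        3
Recursive (10 runs, g++)         221    215        5
Final (10 runs, icpc)             12      5        3
Final (10 runs, g++)              49     41        6
FFTW (10 runs, icpc)              13      4        5
FFTW (10 runs, g++)               12      4        6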

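Tabular Representation (262144 points)

Time (ms)

Configuration                   Real    User   System
Recursive (single run, icpc)     889     779      111
Recursive (single run, g++)     1971    1835      132
Final (single run, icpc)         108      82       22
Final (single run, g++)          430     402       25
FFTW (single run, icpc)           90      60       22
FFTW (single run, g++)            87      61       22
Recursive (10 runs, icpc)       9541    8400     1138
Recursive (10 runs, g++)       21652   20493     1029
Final (10 runs, icpc)            583     556       23
Final (10 runs, g++)            3836    3811       22
FFTW (10 runs, icpc)             601     579       18
FFTW (10 runs, g++)              604     578       21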
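
The slides do not say how these times were collected; the Real/User/System split matches what the Unix `time` command reports. As a hedged illustration of the same measurement taken inside a program, the Python sketch below times a callable over a chosen number of runs. The helper name `time_runs` and the workload are hypothetical.

```python
import os
import time


def time_runs(func, runs=1):
    """Run func() `runs` times; return real, user and system time in milliseconds."""
    start_wall = time.perf_counter()
    start_cpu = os.times()
    for _ in range(runs):
        func()
    end_cpu = os.times()
    end_wall = time.perf_counter()
    return {
        "real": (end_wall - start_wall) * 1000.0,
        "user": (end_cpu.user - start_cpu.user) * 1000.0,
        "system": (end_cpu.system - start_cpu.system) * 1000.0,
    }


if __name__ == "__main__":
    # Hypothetical workload standing in for one transform call.
    workload = lambda: sum(i * i for i in range(100000))
    print(time_runs(workload, runs=1))
    print(time_runs(workload, runs=10))
```

For very short runs the user and system figures may be dominated by timer resolution and process start-up cost, which is one reason to compare single runs against repeated runs.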

VTune Analysis

• TODO

• VTune (not available)

Further Improvements

• Fast digit reversal

• Fast “twiddle compute”

• TODO:
  – Comparison with the Intel Math Kernel Library
  – Study the FFTW implementation
  – VTune analysis

• Try the Winograd algorithm

• Code more efficiently

References

• Alan H. Karp, “Bit Reversal on Uniprocessors”

• Angelo A. Yong, “A better FFT Bit-reversal Algorithm”

Thank You
