FFT Accelerator Project

Rohit Prakash (2003CS10186), Anand Silodia (2003CS50210)

Date: February 23, 2007

Current Objectives

• Validate the number of complex multiplications

• Run the code with the Intel compiler and compare the results
  – For a single run
  – For multiple runs

• Tabulate all the results

• Analyse these using VTune

Number of Complex multiplications

• Our result
  – (11/4)*n*log4(n) = 8960

• Result found on the net
  – (3/4)*n*log4(n) = 3840

• The inner loop is trivial and does not require any “complex multiplications”
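As a quick arithmetic check of these counts, the sketch below evaluates (c/4)*n*log4(n) for a given number c of complex multiplications per radix-4 butterfly. The function name is hypothetical, and n = 1024 is an assumption: it is the size at which the (3/4) formula gives 3840. Under that same assumption, 8960 corresponds to 7 multiplications per butterfly, while 11 per butterfly would give 14080.

```python
def radix4_multiplication_count(n, per_butterfly):
    """Complex multiplications in a radix-4 FFT of n = 4**s points.

    Each of the log4(n) stages runs n/4 butterflies, so the total is
    per_butterfly * (n/4) * log4(n).
    """
    stages = 0
    size = 1
    while size < n:
        size *= 4
        stages += 1
    if size != n:
        raise ValueError("n must be a power of 4")
    return per_butterfly * (n // 4) * stages


n = 1024  # assumed transform size
print(radix4_multiplication_count(n, 3))   # 3840
print(radix4_multiplication_count(n, 7))   # 8960
print(radix4_multiplication_count(n, 11))  # 14080
```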

Inner loop of our Algorithm

T ← A[k+j]
U ← w*A[k+j+m/4]
V ← w*w*A[k+j+m/2]
X ← w*w*w*A[k+j+3*m/4]
A[k+j] ← T+U+V+X
A[k+j+m/4] ← T+(i)U-V-(i)X
A[k+j+2m/4] ← T-U+V-X
A[k+j+3m/4] ← T-(i)U-V+(i)X
w ← w*w_m

Total number of multiplications in this loop: 11

New Inner loop of our Algorithm

• T ← A[k+j]
• U ← twiddle[k]*A[k+j+m/4]
• V ← twiddle[2*k]*A[k+j+m/2]
• X ← twiddle[3*k]*A[k+j+3*m/4]
• A[k+j] ← T+U+V+X
• A[k+j+m/4] ← T+i*U-V-i*X
• A[k+j+2m/4] ← T-U+V-X
• A[k+j+3m/4] ← T-i*U-V+i*X

Total number of multiplications in this loop: 3

(3/4)*n*log4(n) = 3840
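The table-driven butterfly can be sketched as a complete in-place radix-4 transform. This is a minimal illustration, not the project's code. It makes several assumptions: the twiddle table holds exp(+2πi·t/n), the sign convention under which the ±i terms of the butterfly come out as written; the input is first permuted by base-4 digit reversal; and the table is indexed by j*(n/m) for stage size m, which may differ from the slide's `twiddle[k]` indexing. All function names are hypothetical.

```python
import cmath


def digit_reverse_base4(index, digits):
    """Reverse the base-4 digits of index (digits = log4 of the length)."""
    result = 0
    for _ in range(digits):
        result = (result << 2) | (index & 3)
        index >>= 2
    return result


def radix4_fft(values):
    """Radix-4 decimation-in-time transform with a precomputed twiddle table."""
    n = len(values)
    digits = 0
    while 4 ** digits < n:
        digits += 1
    if 4 ** digits != n:
        raise ValueError("length must be a power of 4")

    # Reorder the input so that each stage can work in place.
    a = [values[digit_reverse_base4(p, digits)] for p in range(n)]
    # Assumed convention: twiddle[t] = exp(+2*pi*i*t/n).
    twiddle = [cmath.exp(2j * cmath.pi * t / n) for t in range(n)]

    m = 4
    while m <= n:
        stride = n // m      # maps w_m**j onto the size-n table
        quarter = m // 4
        for k in range(0, n, m):
            for j in range(quarter):
                # Three non-trivial complex multiplications per butterfly.
                t = a[k + j]
                u = twiddle[j * stride] * a[k + j + quarter]
                v = twiddle[2 * j * stride] * a[k + j + 2 * quarter]
                x = twiddle[3 * j * stride] * a[k + j + 3 * quarter]
                # Multiplying by i is only a swap and sign change in practice.
                a[k + j] = t + u + v + x
                a[k + j + quarter] = t + 1j * u - v - 1j * x
                a[k + j + 2 * quarter] = t - u + v - x
                a[k + j + 3 * quarter] = t - 1j * u - v + 1j * x
        m *= 4
    return a


def naive_dft(values):
    """O(n^2) reference using the same exp(+2*pi*i*j*k/n) convention."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Comparing `radix4_fft` against `naive_dft` on a short input (16 or 64 points) is a simple way to confirm the butterfly signs before timing anything.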

Stuff we tried

• Improved the “bit reversal”
  – Better than last time

• Though still inefficient (O(n log n)), it runs faster than the previous implementation

• Many faster algorithms still exist
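For reference, a straightforward O(n log n) bit-reversal permutation looks like the sketch below. It is a generic textbook version, not the project's implementation, and the function names are hypothetical. Each index is reversed bit by bit in O(log n) steps, and elements are swapped only when i < j so that no pair is swapped twice.

```python
def reverse_bits(index, bits):
    """Return index with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reverse_permute(data):
    """Permute data in place into bit-reversed order; length must be 2**k."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of 2")
    for i in range(n):
        j = reverse_bits(i, bits)
        if i < j:  # swap each pair once
            data[i], data[j] = data[j], data[i]
    return data


print(bit_reverse_permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

The faster algorithms mentioned above avoid recomputing each reversed index from scratch, for example by using small lookup tables or by deriving one reversed index from the previous one.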

System Specifications

• Processor: Intel Pentium 4 CPU, 3.00 GHz

• Cache Size: 1 MB

• RAM: 1 GB

• Flags supported : sse, sse2

Results

Figures (series: recursive, our best, FFTW):

• User time (ms) for 1024 points (single iteration)

• User time (ms) for 1024 points (10 iterations)

• User time for 4096 points (single iteration)

• User time (ms) for 262144 points (single iteration)

Analysis

• Results are comparable for the following reasons
  – Change in bit reversal
  – Number of computations

• FFTW: compiled with gcc

• The code still needs to be rewritten to handle an arbitrary number of points

Tabular Representation (1024 points)

Time (ms)                         Real    User    System
Recursive (single run on icpc)      11       4         2
Recursive (single run on g++)       13       6         2
Final (single run on icpc)          10       1         4
Final (single run on g++)            9       2         4
FFTW (single run on icpc)           10       3         5
FFTW (single run on g++)             9       1         5
Recursive (10 runs on icpc)         28      21         5
Recursive (10 runs on g++)          56      46         6
Final (10 runs on icpc)             10       2         4
Final (10 runs on g++)              17      10         5
FFTW (10 runs on icpc)              11       4         4
FFTW (10 runs on g++)               10       1         7

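Tabular Representation (4096 points)

Time (ms)                         Real    User    System
Recursive (single run on icpc)      18      10         4
Recursive (single run on g++)       29      23         3
Final (single run on icpc)          10       3         3
Final (single run on g++)           13       5         6
FFTW (single run on icpc)           11       3         4
FFTW (single run on g++)            10       4         2
Recursive (10 runs on icpc)         96      90         3
Recursive (10 runs on g++)         221     215         5
Final (10 runs on icpc)             12       5         3
Final (10 runs on g++)              49      41         6
FFTW (10 runs on icpc)              13       4         5
FFTW (10 runs on g++)               12       4         6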

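Tabular Representation (262144 points)

Time (ms)                         Real    User    System
Recursive (single run on icpc)     889     779       111
Recursive (single run on g++)     1971    1835       132
Final (single run on icpc)         108      82        22
Final (single run on g++)          430     402        25
FFTW (single run on icpc)           90      60        22
FFTW (single run on g++)            87      61        22
Recursive (10 runs on icpc)       9541    8400      1138
Recursive (10 runs on g++)       21652   20493      1029
Final (10 runs on icpc)            583     556        23
Final (10 runs on g++)            3836    3811        22
FFTW (10 runs on icpc)             601     579        18
FFTW (10 runs on g++)              604     578        21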
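To make the compiler comparison easier to read, the sketch below computes the g++/icpc ratio of user time for the 262144-point runs. The numbers are transcribed from the User row above. It assumes the columns follow the same order as in the 1024-point table (Recursive, Final, FFTW, each on icpc then g++, single run then 10 runs). The dictionary layout and function name are illustrative only.

```python
# User time (ms) for 262144 points, transcribed from the table above.
user_ms = {
    ("Recursive", "single"): {"icpc": 779, "g++": 1835},
    ("Final", "single"): {"icpc": 82, "g++": 402},
    ("FFTW", "single"): {"icpc": 60, "g++": 61},
    ("Recursive", "10 runs"): {"icpc": 8400, "g++": 20493},
    ("Final", "10 runs"): {"icpc": 556, "g++": 3811},
    ("FFTW", "10 runs"): {"icpc": 579, "g++": 578},
}


def gxx_over_icpc(times):
    """Ratio of g++ user time to icpc user time (values above 1 favour icpc)."""
    return times["g++"] / times["icpc"]


for (implementation, runs), times in user_ms.items():
    ratio = gxx_over_icpc(times)
    print(f"{implementation:9s} {runs:8s} g++/icpc = {ratio:.2f}")
```

On these figures the FFTW timings barely change between compilers, while the Recursive and Final timings are several times higher with g++.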

VTune Analysis

• TODO

• VTune (not available)

Further Improvements

• Fast digit reversal

• Fast “twiddle compute”

• TODO:
  – Comparison with Intel Math Kernel Library
  – Study FFTW implementation
  – VTune analysis

• Try the Winograd algorithm

• Code more efficiently

References

• Alan H. Karp, “Bit Reversal on Uniprocessors”

• Angelo A. Yong, “A better FFT Bit-reversal Algorithm”

Thank You
