FFT Accelerator Project Rohit Prakash(2003CS10186) Anand Silodia(2003CS50210) Date : February 23,2007

Jan 03, 2016

Page 1: FFT Accelerator Project

FFT Accelerator Project

Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)

Date: February 23, 2007

Page 2: FFT Accelerator Project

Current Objectives

• Validate the number of complex multiplications

• Run the code with the Intel compiler and compare the results
– For a single run
– For multiple runs

• Tabulate all the results

• Analyse these using VTune

Page 3: FFT Accelerator Project

Number of Complex multiplications

• Our results
– (11/4)*n*log4(n) = 8960

• Result on the net
– (3/4)*n*log4(n) = 3840

• The inner loop is trivial and does not require any “complex multiplications”
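The two formulas have the same shape: a radix-4 FFT has log4(n) stages with n/4 butterflies each, so the total is (multiplications per butterfly) × (n/4) × log4(n). The following is a minimal sketch of that arithmetic; the helper names `log4_exact` and `radix4_mults` are hypothetical and do not come from the project code. For n = 1024, three multiplications per butterfly gives 3840, which matches the slide. Eleven per butterfly evaluates to 14080, whereas the 8960 on the slide corresponds to seven per butterfly, so the slide may be counting the first figure differently; the source does not say.

```cpp
#include <cstdint>

// Number of radix-4 stages, log4(n). Assumes n is a power of 4.
inline unsigned log4_exact(std::uint64_t n) {
    unsigned stages = 0;
    while (n > 1) {
        n >>= 2;  // divide by 4
        ++stages;
    }
    return stages;
}

// Total complex multiplications for an n-point radix-4 FFT in which every
// butterfly costs `per_butterfly` multiplications:
// (n/4) butterflies per stage, log4(n) stages.
inline std::uint64_t radix4_mults(std::uint64_t n, unsigned per_butterfly) {
    return static_cast<std::uint64_t>(per_butterfly) * (n / 4) * log4_exact(n);
}
```

For example, `radix4_mults(1024, 3)` evaluates the "(3/4)*n*log4(n)" expression for the 1024-point case.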

Page 4: FFT Accelerator Project

Inner loop of our Algorithm

T ← A[k+j]
U ← w*A[k+j+m/4]
V ← w*w*A[k+j+m/2]
X ← w*w*w*A[k+j+3*m/4]
A[k+j] ← T+U+V+X
A[k+j+m/4] ← T+(i)U-V-(i)X
A[k+j+2m/4] ← T-U+V-X
A[k+j+3m/4] ← T-(i)U-V+(i)X
w ← w*w_m

Total number of multiplications in this loop: 11
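To make the count of 11 concrete, here is a sketch of one radix-4 stage written the way the loop is listed on this slide, with the running twiddle `w` and its powers recomputed on every pass. It is an illustration, not the project's code: the function name `radix4_stage_running` is hypothetical, and `w_m = exp(2*pi*i/m)` is an assumed sign convention that the slides do not state. The counter treats each multiplication by i as a full complex multiplication, as the slide's total does.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// One radix-4 stage over all blocks of size m in A. Returns the number of
// complex multiplications performed, counted as on the slide (11 per pass).
// Assumption: w_m = exp(2*pi*i/m).
std::size_t radix4_stage_running(std::vector<cd>& A, std::size_t m) {
    const double pi = std::acos(-1.0);
    const cd I(0.0, 1.0);
    const cd w_m = std::polar(1.0, 2.0 * pi / static_cast<double>(m));
    std::size_t mults = 0;
    for (std::size_t k = 0; k + m <= A.size(); k += m) {
        cd w(1.0, 0.0);
        for (std::size_t j = 0; j < m / 4; ++j) {
            const cd T = A[k + j];
            const cd U = w * A[k + j + m / 4];              // 1 multiplication
            const cd V = w * w * A[k + j + m / 2];          // 2 multiplications
            const cd X = w * w * w * A[k + j + 3 * m / 4];  // 3 multiplications
            A[k + j]             = T + U + V + X;
            A[k + j + m / 4]     = T + I * U - V - I * X;   // 2 multiplications
            A[k + j + 2 * m / 4] = T - U + V - X;
            A[k + j + 3 * m / 4] = T - I * U - V + I * X;   // 2 multiplications
            w = w * w_m;                                    // 1 multiplication
            mults += 11;
        }
    }
    return mults;
}
```

Six of the eleven multiplications come from rebuilding powers of `w`, four from multiplying by i, and one from advancing `w`.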

Page 5: FFT Accelerator Project

New Inner loop of our Algorithm

• T ← A[k+j]
• U ← twiddle[k]*A[k+j+m/4]
• V ← twiddle[2*k]*A[k+j+m/2]
• X ← twiddle[3*k]*A[k+j+3*m/4]
• A[k+j] ← T+U+V+X
• A[k+j+m/4] ← T+i*U-V-i*X
• A[k+j+2m/4] ← T-U+V-X
• A[k+j+3m/4] ← T-i*U-V+i*X

Total number of multiplications in this loop: 3

(3/4)*n*log4(n) = 3840
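A table-driven version of this loop could look like the sketch below. It rests on several assumptions that the slides do not confirm: `twiddle[t] = exp(2*pi*i*t/n)`, the table index for butterfly `j` of a size-`m` block is `j*(n/m)`, the input is first permuted into base-4 digit-reversed order, and n is a power of 4. The names `fft_radix4`, `digit_reverse4` and `mul_i` are hypothetical. Multiplication by i is done by swapping real and imaginary parts, so only the three twiddle products count as complex multiplications.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// Multiply by i without a complex multiplication: i*(a + bi) = -b + ai.
inline cd mul_i(const cd& z) { return cd(-z.imag(), z.real()); }

// Reverse the base-4 digits of x, where n = 4^digits.
inline std::size_t digit_reverse4(std::size_t x, unsigned digits) {
    std::size_t r = 0;
    for (unsigned d = 0; d < digits; ++d) {
        r = (r << 2) | (x & 3);
        x >>= 2;
    }
    return r;
}

// In-place radix-4 transform of A (size n, assumed to be a power of 4),
// unnormalized, using the positive-exponent convention.
void fft_radix4(std::vector<cd>& A) {
    const std::size_t n = A.size();
    unsigned digits = 0;
    for (std::size_t t = n; t > 1; t >>= 2) ++digits;

    // Precompute twiddle[t] = exp(2*pi*i*t/n) once.
    const double pi = std::acos(-1.0);
    std::vector<cd> twiddle(n);
    for (std::size_t t = 0; t < n; ++t)
        twiddle[t] = std::polar(1.0, 2.0 * pi * static_cast<double>(t) /
                                         static_cast<double>(n));

    // Base-4 digit-reversal permutation; each pair is swapped once.
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t r = digit_reverse4(i, digits);
        if (i < r) std::swap(A[i], A[r]);
    }

    for (std::size_t m = 4; m <= n; m *= 4) {
        const std::size_t step = n / m;  // table stride for this stage
        for (std::size_t k = 0; k < n; k += m) {
            for (std::size_t j = 0; j < m / 4; ++j) {
                const std::size_t t = j * step;
                const cd T = A[k + j];
                const cd U = twiddle[t] * A[k + j + m / 4];
                const cd V = twiddle[2 * t] * A[k + j + m / 2];
                const cd X = twiddle[3 * t] * A[k + j + 3 * m / 4];
                A[k + j]             = T + U + V + X;
                A[k + j + m / 4]     = T + mul_i(U) - V - mul_i(X);
                A[k + j + 2 * m / 4] = T - U + V - X;
                A[k + j + 3 * m / 4] = T - mul_i(U) - V + mul_i(X);
            }
        }
    }
}
```

Building the table costs n evaluations of sine and cosine once per transform size; reusing it across repeated transforms of the same size removes that cost from the timed loop.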

Page 6: FFT Accelerator Project

Stuff we tried

• Improved the “bit reversal”
– Better than last time

• Though still inefficient (O(n log n)), it runs faster than the previous implementation

• Faster algorithms still exist
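For reference, a straightforward O(n log n) bit-reversal permutation can be sketched as follows. This is a generic baseline, not the implementation used in the project, and the names `reverse_bits` and `bit_reverse_permute` are hypothetical. Each index is reversed bit by bit, which is where the log n factor comes from.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Reverse the lowest `bits` bits of x.
inline std::size_t reverse_bits(std::size_t x, unsigned bits) {
    std::size_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

// In-place bit-reversal permutation of a vector whose size is a power of two.
// Each index costs O(log n) work, so the whole pass is O(n log n).
template <typename T>
void bit_reverse_permute(std::vector<T>& a) {
    const std::size_t n = a.size();
    unsigned bits = 0;
    while ((std::size_t(1) << bits) < n) ++bits;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t r = reverse_bits(i, bits);
        if (i < r) std::swap(a[i], a[r]);  // swap each pair only once
    }
}
```

Applied to the indices 0..7, the permutation yields the order 0, 4, 2, 6, 1, 5, 3, 7.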

Page 7: FFT Accelerator Project

System Specifications

• Processor: Intel Pentium 4 CPU, 3.00 GHz

• Cache size: 1 MB

• RAM: 1 GB

• Flags supported: sse, sse2

Page 8: FFT Accelerator Project

Results

[Bar chart: user time (ms) for 1024 points (single iteration). Implementations: recursive, our best, FFTW. Compilers: icpc, g++.]
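The slides do not say how the times were collected; the Real/User/System split suggests something like the shell `time` utility. As one possible way to get comparable user-time figures from inside a program, here is a sketch using POSIX `getrusage`. The struct `CpuTimes` and the functions `cpu_times_now` and `user_time_ms` are hypothetical names introduced for illustration.

```cpp
#include <sys/resource.h>  // POSIX getrusage

// CPU time consumed so far by this process, in milliseconds.
struct CpuTimes {
    double user_ms;
    double system_ms;
};

inline CpuTimes cpu_times_now() {
    struct rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    CpuTimes t;
    t.user_ms = ru.ru_utime.tv_sec * 1000.0 + ru.ru_utime.tv_usec / 1000.0;
    t.system_ms = ru.ru_stime.tv_sec * 1000.0 + ru.ru_stime.tv_usec / 1000.0;
    return t;
}

// Run `work` `iterations` times and return the user CPU time it took (ms).
template <typename F>
double user_time_ms(F&& work, int iterations) {
    const CpuTimes before = cpu_times_now();
    for (int i = 0; i < iterations; ++i) work();
    const CpuTimes after = cpu_times_now();
    return after.user_ms - before.user_ms;
}
```

A call such as `user_time_ms(run_fft, 10)`, where `run_fft` stands for whichever transform is being measured, would correspond to the 10-iteration measurements. For runs that take only a few milliseconds, timer granularity and process start-up cost can dominate, so the single-iteration figures for small sizes should be read with that in mind.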

Page 9: FFT Accelerator Project

Results

[Bar chart: user time (ms) for 1024 points (10 iterations). Implementations: recursive, our best, FFTW. Compilers: icpc, g++.]

Page 10: FFT Accelerator Project

Results

[Bar chart: user time (ms) for 4096 points (single iteration). Implementations: recursive, our best, FFTW. Compilers: icpc, g++.]

Page 11: FFT Accelerator Project

Results

[Bar chart: user time (ms) for 4096 points (10 iterations). Implementations: recursive, our best, FFTW. Compilers: icpc, g++.]

Page 12: FFT Accelerator Project

Results

[Bar chart: user time (ms) for 262144 points (single iteration). Implementations: recursive, our best, FFTW. Compilers: icpc, g++.]

Page 13: FFT Accelerator Project

Results

[Bar chart: user time (ms) for 262144 points (10 iterations). Implementations: recursive, our best, FFTW. Compilers: icpc, g++.]

Page 14: FFT Accelerator Project

Analysis

• Results are comparable for the following reasons
– Change in bit reversal
– Number of computations

• FFTW: compiling option gcc

• The code must be rewritten to handle an arbitrary number of points

Page 15: FFT Accelerator Project

Tabular Representation (1024 points)

Time (ms)                          Real   User   System
Recursive (single run on icpc)       11      4        2
Recursive (single run on g++)        13      6        2
Final (single run on icpc)           10      1        4
Final (single run on g++)             9      2        4
FFTW (single run on icpc)            10      3        5
FFTW (single run on g++)              9      1        5
Recursive (10 runs on icpc)          28     21        5
Recursive (10 runs on g++)           56     46        6
Final (10 runs on icpc)              10      2        4
Final (10 runs on g++)               17     10        5
FFTW (10 runs on icpc)               11      4        4
FFTW (10 runs on g++)                10      1        7

Page 16: FFT Accelerator Project

Tabular Representation (4096 points)

Time (ms)                          Real   User   System
Recursive (single run on icpc)       18     10        4
Recursive (single run on g++)        29     23        3
Final (single run on icpc)           10      3        3
Final (single run on g++)            13      5        6
FFTW (single run on icpc)            11      3        4
FFTW (single run on g++)             10      4        2
Recursive (10 runs on icpc)          96     90        3
Recursive (10 runs on g++)          221    215        5
Final (10 runs on icpc)              12      5        3
Final (10 runs on g++)               49     41        6
FFTW (10 runs on icpc)               13      4        5
FFTW (10 runs on g++)                12      4        6

Page 17: FFT Accelerator Project

Tabular Representation (262144 points)

Time (ms)                          Real    User   System
Recursive (single run on icpc)      889     779      111
Recursive (single run on g++)      1971    1835      132
Final (single run on icpc)          108      82       22
Final (single run on g++)           430     402       25
FFTW (single run on icpc)            90      60       22
FFTW (single run on g++)             87      61       22
Recursive (10 runs on icpc)        9541    8400     1138
Recursive (10 runs on g++)        21652   20493     1029
Final (10 runs on icpc)             583     556       23
Final (10 runs on g++)             3836    3811       22
FFTW (10 runs on icpc)              601     579       18
FFTW (10 runs on g++)               604     578       21

Page 18: FFT Accelerator Project

VTune Analysis

• TODO

• VTune (not available)

Page 19: FFT Accelerator Project

Further Improvements

• Fast digit reversal

• Fast “twiddle compute”

• TODO:
– Comparison with the Intel Math Kernel Library
– Study the FFTW implementation
– VTune analysis

• Try the Winograd algorithm

• Code more efficiently

Page 20: FFT Accelerator Project

References

• Alan H. Karp “Bit Reversal on Uniprocessors”

• Angelo A. Yong “A better FFT Bit-reversal Algorithm”

Page 21: FFT Accelerator Project

Thank You