Introduction to FFT Processorstwins.ee.nctu.edu.tw/courses/vsp_04/handout/FFT...1 Introduction to FFT Processors Chih-Wei Liu VLSI Signal Processing Lab Department of Electronics Engineering


Introduction to FFT Processors

Chih-Wei Liu
VLSI Signal Processing Lab
Department of Electronics Engineering
National Chiao-Tung University

FFT Design

FFT
• Consists of a series of complex additions and complex multiplications

Algorithm
• Cooley-Tukey decomposition for power-of-two length FFT

Architecture
• Systematic mapping procedure


Algorithm Level
Cooley-Tukey decomposition

Radix-2, decimation-in-frequency

$$A_{2k} = \sum_{n=0}^{N-1} x_n W_N^{2nk} = \sum_{n=0}^{N/2-1} \left(x_n + x_{n+N/2}\right) W_{N/2}^{nk}$$

$$A_{2k+1} = \sum_{n=0}^{N-1} x_n W_N^{n(2k+1)} = \sum_{n=0}^{N/2-1} \left(x_n - x_{n+N/2}\right) W_N^{n}\, W_{N/2}^{nk}$$
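The radix-2 decimation-in-frequency split above can be sketched as a short recursive routine. This is a minimal illustration, assuming a power-of-two length; the function names `dft` and `fft_dif_radix2` are chosen here and do not come from the slides.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_dif_radix2(x):
    """Recursive radix-2 DIF FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Even outputs A_{2k}: half-length DFT of x_n + x_{n+N/2}.
    sums = [x[i] + x[i + half] for i in range(half)]
    # Odd outputs A_{2k+1}: half-length DFT of (x_n - x_{n+N/2}) * W_N^n.
    diffs = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
             for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif_radix2(sums)
    out[1::2] = fft_dif_radix2(diffs)
    return out
```

Interleaving the two half-length results restores natural output order; an in-place hardware implementation would instead leave the outputs in bit-reversed order.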

Variants based on CT algorithm
Fixed radix: Radix-2, Radix-4, Radix-8, Radix-2^2
Mixed radix: Split-radix, Radix-2/8, Radix-2/4/8

Number of additions
• Same for any mixed-radix or fixed-radix algorithm.

Number of multiplications
• Depends on the reduction of trivial multiplications.

[Figure: radix-2 DIF butterfly. Inputs x_n and x_{n+N/2}; outputs A_{2k} and A_{2k+1}, with a -1 branch and the twiddle factor W_N^n.]

Hence, increase additions

FFT Algorithms
Review of Radix-2^r algorithm

• DIF (decimation in frequency) and DIT (decimation in time) versions
• Radix-2 algorithm
• Radix-4 and Radix-2^2 algorithm
• Radix-8 and Radix-2^3 algorithm
• Split-radix 2/4 and Split-radix 2/8


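FFT Algorithms

DFT

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N} \equiv \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1.$$

[Figure: unit circle marking the twiddle factors W_N^0, W_N^{N/8}, W_N^{N/4}, W_N^{3N/8}, W_N^{N/2}, W_N^{5N/8}, W_N^{3N/4}, W_N^{7N/8}.]

$$W_N^{0} = -W_N^{N/2} = 1$$

$$W_N^{N/4} = -W_N^{3N/4} = -j$$

$$W_N^{N/8} = -W_N^{5N/8} = \frac{\sqrt{2}}{2}(1 - j)$$

$$W_N^{3N/8} = -W_N^{7N/8} = -\frac{\sqrt{2}}{2}(1 + j)$$

$$(a + jb)\, W_8^{1} = \frac{\sqrt{2}}{2}\left[(a + b) + j(b - a)\right]$$

$$(a + jb)\, W_8^{3} = \frac{\sqrt{2}}{2}\left[(b - a) - j(a + b)\right]$$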
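The last two identities mean that a multiplication by W_8^1 or W_8^3 costs two real multiplications by the constant √2/2 instead of a full complex multiplication. A small check of both identities, with helper names chosen here for illustration:

```python
import cmath
import math

HALF_SQRT2 = math.sqrt(2) / 2

def mul_w8_1(z):
    """(a + jb) * W_8^1 using two real multiplications by sqrt(2)/2."""
    a, b = z.real, z.imag
    return complex(HALF_SQRT2 * (a + b), HALF_SQRT2 * (b - a))

def mul_w8_3(z):
    """(a + jb) * W_8^3 using two real multiplications by sqrt(2)/2."""
    a, b = z.real, z.imag
    return complex(HALF_SQRT2 * (b - a), -HALF_SQRT2 * (a + b))

def w(n, k):
    """Twiddle factor W_n^k = exp(-j*2*pi*k/n), used as the reference."""
    return cmath.exp(-2j * cmath.pi * k / n)
```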

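FFT Algorithms
Radix-2 Algorithm

DIF Radix-2 Algorithm

$$\begin{cases}
X(2k_1) = \displaystyle\sum_{n=0}^{N/2-1} \left[x(n) + x(n + N/2)\right] W_{N/2}^{nk_1} \\[2ex]
X(2k_1+1) = \displaystyle\sum_{n=0}^{N/2-1} \left[x(n) - x(n + N/2)\right] W_N^{n}\, W_{N/2}^{nk_1}
\end{cases}
\qquad k_1 = 0, 1, \ldots, N/2-1.$$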

Butterfly of Radix-2 Algorithm

DIF Form


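FFT Algorithms
Radix-4 Algorithm

$$X(4k_1 + l) = \sum_{n=0}^{N/4-1} \left[x(n) + x(n + \tfrac{N}{4})\, W_4^{l} + x(n + \tfrac{N}{2})\, W_4^{2l} + x(n + \tfrac{3N}{4})\, W_4^{3l}\right] \times W_N^{nl}\, W_{N/4}^{nk_1}$$

$l = 0, 1, 2, 3$; $k_1 = 0 \sim N/4-1$.

Radix-2^2 Algorithm

$$\begin{aligned}
X(4k_1 + 2l_2 + l_1) &= \sum_{n=0}^{N/4-1} \left[x(n) + x(n + \tfrac{N}{4})\, W_4^{(2l_2+l_1)} + x(n + \tfrac{N}{2})\, W_4^{2(2l_2+l_1)} + x(n + \tfrac{3N}{4})\, W_4^{3(2l_2+l_1)}\right] \times W_N^{n(2l_2+l_1)}\, W_{N/4}^{nk_1} \\
&= \sum_{n=0}^{N/4-1} \left\{\left[x(n) + (-1)^{l_1} x(n + \tfrac{N}{2})\right] + (-1)^{l_2} (-j)^{l_1} \left[x(n + \tfrac{N}{4}) + (-1)^{l_1} x(n + \tfrac{3N}{4})\right]\right\} W_N^{n(2l_2+l_1)}\, W_{N/4}^{nk_1}
\end{aligned}$$

$l_1, l_2 = 0, 1$; $k_1 = 0 \sim N/4-1$.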
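As a sanity check on the radix-2^2 factorization, the sketch below evaluates the two-step form (a ±1 stage, a trivial rotation by a power of -j, then another ±1 stage) on four samples and compares it with a direct 4-point DFT. It assumes the output index is l = 2·l2 + l1, where l1 selects the first-stage sign; the function names are illustrative only.

```python
import cmath

def dft4(v):
    """Direct 4-point DFT used as the reference."""
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * l * m / 4) for m in range(4))
            for l in range(4)]

def radix22_butterfly(v):
    """4-point DFT built from two radix-2 stages and a trivial -j rotation.

    v holds x(n), x(n+N/4), x(n+N/2), x(n+3N/4); out[2*l2 + l1] is the
    bracketed term of the radix-2^2 equation before the W_N twiddle.
    """
    x0, x1, x2, x3 = v
    out = [0j] * 4
    for l2 in (0, 1):
        for l1 in (0, 1):
            sign1 = (-1) ** l1
            first = x0 + sign1 * x2           # x(n) + (-1)^l1 x(n+N/2)
            second = x1 + sign1 * x3          # x(n+N/4) + (-1)^l1 x(n+3N/4)
            rotation = ((-1) ** l2) * ((-1j) ** l1)
            out[2 * l2 + l1] = first + rotation * second
    return out
```

Only additions, subtractions and swaps of real and imaginary parts are involved, which is why the radix-2^2 butterfly keeps the multiplier count of radix-4 while using radix-2 style hardware.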

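FFT Algorithms

[Figure: radix-4 butterfly. Inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4); intermediate values a(n), a(n+N/4), a(n+N/2), a(n+3N/4); output branches l = 0, 1, 2, 3 with twiddle factors W_N^{0n}, W_N^{n}, W_N^{2n}, W_N^{3n}.]

(Data Ordering: Digit Reversed)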

FFT Algorithms

Data Ordering of Radix-4 (N=16)

Digit-reversed ordering: input x(k1 k0), taken in natural order of the base-4 digit pair (k1, k0), appears at the output as X(k0 k1).

Output order: X(0), X(4), X(8), X(12), X(1), X(5), X(9), X(13), X(2), X(6), X(10), X(14), X(3), X(7), X(11), X(15)

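FFT Algorithms

Butterfly of radix-2^2 Algorithm

[Figure: radix-2^2 butterfly. Inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4); two radix-2 stages selected by l1 and l2 (each 0 or 1) with the trivial factor W_4^1 between them; intermediate values a(n), a(n+N/4), a(n+N/2), a(n+3N/4); output twiddle factors in the order W_N^{0n}, W_N^{2n}, W_N^{n}, W_N^{3n}.]

(Data Ordering: Bit Reversed)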

FFT Algorithms

Data Ordering of Radix-2^2 (N=16)

Bit-reversed ordering: input x(k3 k2 k1 k0), taken in natural order of the binary index (bit weights 8, 4, 2, 1), appears at the output as X(k0 k1 k2 k3).

Output order: X(0), X(8), X(4), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15)
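Both output orderings for N=16 (digit-reversed for radix-4, bit-reversed for radix-2^2) are instances of reversing the digits of the index in the chosen base. A minimal sketch, with function names that are assumptions of this example:

```python
def digit_reverse(index, radix, num_digits):
    """Reverse the base-`radix` digits of `index` (num_digits digits wide)."""
    result = 0
    for _ in range(num_digits):
        index, digit = divmod(index, radix)
        result = result * radix + digit
    return result

def output_order(n, radix):
    """Frequency index found at each output position of an in-place DIF FFT."""
    num_digits, size = 0, 1
    while size < n:
        size *= radix
        num_digits += 1
    if size != n:
        raise ValueError("n must be a power of radix")
    return [digit_reverse(i, radix, num_digits) for i in range(n)]

print(output_order(16, 4))  # radix-4: digit-reversed order
print(output_order(16, 2))  # radix-2 / radix-2^2: bit-reversed order
```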

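FFT Algorithms
DIF Radix-8 Algorithm

$$\begin{aligned}
X(8k + l) &= \sum_{n=0}^{N-1} x(n)\, W_N^{n(8k+l)}
 = \sum_{n=0}^{N/8-1} \sum_{m=0}^{7} x\!\left(n + \tfrac{N}{8}m\right) W_N^{(n + mN/8)(8k+l)} \\
&= \sum_{n=0}^{N/8-1} \left[\sum_{m=0}^{7} x\!\left(n + \tfrac{N}{8}m\right) W_8^{lm}\right] W_N^{nl}\, W_{N/8}^{nk} \\
&= \sum_{n=0}^{N/8-1} \Big\{\big[x(n) + x(n + \tfrac{2N}{8})\, W_4^{l} + x(n + \tfrac{4N}{8})\, W_2^{l} + x(n + \tfrac{6N}{8})\, W_4^{3l}\big] \\
&\qquad + \big[x(n + \tfrac{N}{8}) + x(n + \tfrac{3N}{8})\, W_4^{l} + x(n + \tfrac{5N}{8})\, W_2^{l} + x(n + \tfrac{7N}{8})\, W_4^{3l}\big]\, W_8^{l}\Big\}\, W_N^{nl}\, W_{N/8}^{nk}
\end{aligned}$$

$l = 0, 1, 2, 3, 4, 5, 6, 7$; $k = 0 \sim N/8-1$.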

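FFT Algorithms
DIF Radix-2^3 Algorithm

With $l = 4l_3 + 2l_2 + l_1$:

$$\begin{aligned}
X(8k + 4l_3 + 2l_2 + l_1) &= \sum_{n=0}^{N/8-1} \Big\{\big[x(n) + x(n + \tfrac{2N}{8})\, W_4^{l} + x(n + \tfrac{4N}{8})\, W_2^{l} + x(n + \tfrac{6N}{8})\, W_4^{3l}\big] \\
&\qquad + \big[x(n + \tfrac{N}{8}) + x(n + \tfrac{3N}{8})\, W_4^{l} + x(n + \tfrac{5N}{8})\, W_2^{l} + x(n + \tfrac{7N}{8})\, W_4^{3l}\big]\, W_8^{l}\Big\}\, W_N^{nl}\, W_{N/8}^{nk} \\
&= \sum_{n=0}^{N/8-1} \Big\{\big[\big(x(n) + W_2^{l_1} x(n + \tfrac{4N}{8})\big) + W_2^{l_2} W_4^{l_1} \big(x(n + \tfrac{2N}{8}) + W_2^{l_1} x(n + \tfrac{6N}{8})\big)\big] \\
&\qquad + \big[\big(x(n + \tfrac{N}{8}) + W_2^{l_1} x(n + \tfrac{5N}{8})\big) + W_2^{l_2} W_4^{l_1} \big(x(n + \tfrac{3N}{8}) + W_2^{l_1} x(n + \tfrac{7N}{8})\big)\big]\, W_8^{(4l_3+2l_2+l_1)}\Big\}\, W_N^{n(4l_3+2l_2+l_1)}\, W_{N/8}^{nk}
\end{aligned}$$

$l_1, l_2, l_3 = 0, 1$; $k = 0 \sim N/8-1$.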

FFT Algorithms

[Figure: radix-8 butterfly. Inputs x(n), x(n+N/8), x(n+2N/8), ..., x(n+7N/8); output branches l = 0, 1, ..., 7 with twiddle factors W_N^{0n}, W_N^{1n}, ..., W_N^{7n}.]

[Figure: radix-2^3 butterfly. Same eight inputs; three radix-2 stages selected by l1, l2, l3 (each 0 or 1) with the internal factors W_4^1 and W_8^0, W_8^1, W_8^2, W_8^3; output twiddle factors in the order W_N^{0n}, W_N^{4n}, W_N^{2n}, W_N^{6n}, W_N^{1n}, W_N^{5n}, W_N^{3n}, W_N^{7n}.]

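FFT Algorithms

DIF Split-Radix 2/4 Algorithm

$$\begin{cases}
X(2k) = \displaystyle\sum_{n=0}^{N/2-1} \left[x(n) + x(n + \tfrac{2N}{4})\right] W_N^{2nk} \\[2ex]
X(4k+1) = \displaystyle\sum_{n=0}^{N/4-1} \left\{\left[x(n) - x(n + \tfrac{2N}{4})\right] - j\left[x(n + \tfrac{N}{4}) - x(n + \tfrac{3N}{4})\right]\right\} W_N^{n}\, W_N^{4nk} \\[2ex]
X(4k+3) = \displaystyle\sum_{n=0}^{N/4-1} \left\{\left[x(n) - x(n + \tfrac{2N}{4})\right] + j\left[x(n + \tfrac{N}{4}) - x(n + \tfrac{3N}{4})\right]\right\} W_N^{3n}\, W_N^{4nk}
\end{cases}$$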

In X(2k), k runs from 0 to N/2-1; in X(4k+1) and X(4k+3), k runs from 0 to N/4-1.
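The three split-radix 2/4 sums map directly onto a recursion: one half-length transform for the even outputs and two quarter-length transforms for the outputs 4k+1 and 4k+3. The sketch below assumes a power-of-two length and uses names chosen for this example; it is not the processing element described later in the slides.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_split_radix(x):
    """Recursive DIF split-radix 2/4 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    # X(2k): half-length transform of x(n) + x(n + N/2).
    even_in = [x[i] + x[i + 2 * q] for i in range(2 * q)]
    odd1_in, odd3_in = [], []
    for i in range(q):
        d02 = x[i] - x[i + 2 * q]          # x(n) - x(n + 2N/4)
        d13 = x[i + q] - x[i + 3 * q]      # x(n + N/4) - x(n + 3N/4)
        w1 = cmath.exp(-2j * cmath.pi * i / n)       # W_N^n
        w3 = cmath.exp(-2j * cmath.pi * 3 * i / n)   # W_N^{3n}
        odd1_in.append((d02 - 1j * d13) * w1)        # feeds X(4k+1)
        odd3_in.append((d02 + 1j * d13) * w3)        # feeds X(4k+3)
    out = [0j] * n
    out[0::2] = fft_split_radix(even_in)
    out[1::4] = fft_split_radix(odd1_in)
    out[3::4] = fft_split_radix(odd3_in)
    return out
```

The uneven recursion (one half-size branch, two quarter-size branches) is what gives the butterfly its irregular L shape.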

FFT Algorithms

Butterfly of Split-Radix 2/4 Algorithm

[Figure: split-radix 2/4 butterfly. Inputs x(n), x(n+N/4), x(n+2N/4), x(n+3N/4); trivial factor W_4^1; output twiddle factors W_N^{n} and W_N^{3n}.]

FFT Algorithms
Advantage of Radix-2/4 Algorithm

• Low computational complexity
• As flexible as the radix-2 algorithm
• Bit-reversed output (for normally ordered input)

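FFT Algorithms

DIF Split-Radix 2/8 Algorithm

$$\begin{cases}
X(2k) = \displaystyle\sum_{n=0}^{N/2-1} \left[x(n) + x(n + \tfrac{2N}{4})\right] W_N^{2nk} \\[2ex]
X(8k + l) = \displaystyle\sum_{n=0}^{N/8-1} \Big\{\big[x(n) + x(n + \tfrac{2N}{8})\, W_4^{l} + x(n + \tfrac{4N}{8})\, W_2^{l} + x(n + \tfrac{6N}{8})\, W_4^{3l}\big] \\
\qquad\qquad + \big[x(n + \tfrac{N}{8}) + x(n + \tfrac{3N}{8})\, W_4^{l} + x(n + \tfrac{5N}{8})\, W_2^{l} + x(n + \tfrac{7N}{8})\, W_4^{3l}\big]\, W_8^{l}\Big\}\, W_N^{nl}\, W_{N/8}^{nk}
\end{cases}$$

$l = 1, 3, 5, 7$.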

FFT Algorithms

Butterfly of Split-Radix 2/8 Algorithm

[Figure: split-radix 2/8 butterfly. Inputs x(n), x(n+N/8), x(n+2N/8), ..., x(n+7N/8); internal factors -j, -j, W_8^1 and W_8^3.]

Multiplicative Complexity
Trivial multiplications in FFT

Multiplied by
• Radix-2: ±1 removed
• Radix-4: ±1 and ±j (partially) removed
• Split-radix (2/4): ±1 and ±j removed
• Radix-8: ±1, ±j, (1±j)/√2 (partially) removed
• Radix-2/8: ±1, ±j, (1±j)/√2 removed

Radix-4 Signal Flow Graph


Split-Radix Signal Flow Graph

Multiplicative Complexity

    N   Radix-2   Radix-4   Split-Radix   Radix-8   Const. Mul   Radix-2/8   Const. Mul
    8         2         3             2         0            2           0            2
   16        10         8             8         6            4           4            6
   32        34        31            26        20            8          16           14
   64        98        76            72        48           32          44           38
  128       258       215           186       152           64         120           94
  256       642       492           456       376          128         308          214
  512      1538      1239          1082       824          384         736          494
 1024      3586      2732          2504      2104          768        1724         1126
 2048      8194      6487          5690      4792         1536        3976         2494
 4096     18434     13996         12744     10168         4096        8964         5494
 8192     40962     32087         28218     23992         8192       19952        12046
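The Radix-2 figures in the table (2, 10, 34, 98, ... for N = 8, 16, 32, 64, ...) can be cross-checked by enumerating the twiddle factors of a radix-2 DIF flow graph and discarding the trivial ones (±1 and ±j). The counting function below is an illustrative sketch, not taken from the slides; the closed form (N/2)(log2 N - 3) + 2 follows from the same enumeration.

```python
def nontrivial_radix2_mults(n):
    """Count complex multiplications by twiddles other than +-1 and +-j
    in an N-point radix-2 DIF FFT (N a power of two)."""
    count = 0
    size = n                      # sub-transform length at the current stage
    while size >= 2:
        blocks = n // size        # identical butterfly groups at this stage
        for j in range(size // 2):
            # W_size^j is trivial exactly when j is a multiple of size/4.
            if (4 * j) % size != 0:
                count += blocks
        size //= 2
    return count

for n in (8, 16, 32, 64, 128):
    print(n, nontrivial_radix2_mults(n))
```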

How to obtain regular SR FFT architecture?


Architecture Level
Mapping procedure

Systolic array techniques
• Operation scheduling, resource sharing

Pipeline architecture
• One-dimensional linear array
• Delay-feedback vs. delay-commutator

Single PE architecture
• Shared memory, single Processing Element (PE)

R2MDC: Radix-2 Multi-Path Delay Commutator

Delay Commutator or Delay-Switch-Delay

[Figure: 16-point radix-2 DIF signal flow graph (inputs x(0) to x(15), outputs in bit-reversed order X(0), X(8), X(4), X(12), ..., X(7), X(15), twiddle factors W_16^0 to W_16^7) and its vertical projection onto a four-stage pipeline (Stage 1 to Stage 4) with delay elements Z^-8, Z^-4, Z^-4, Z^-2, Z^-2, Z^-1, Z^-1 and commutator switches SW1, SW2, SW3.]

1st and 2nd stages in R2MDC (N=16)

[Figure: stages 1 and 2 of the R2MDC pipeline (delays Z^-8, Z^-4, Z^-4, switch SW1, nodes A to I). The two input paths carry samples 0 to 7 and 8 to 15; after the commutator they are regrouped into the blocks 0 to 3, 8 to 11, 4 to 7 and 12 to 15.]

[Figure: 16-point radix-2 DIF signal flow graph annotated with time steps (1) to (15). Input pairs per stage: N/2 in Stage 1, N/4 in Stage 2, N/8 in Stage 3, N/16 in Stage 4.]

R4MDC: Radix-4 Multi-Path Delay Commutator

[Figure: 16-point radix-4 signal flow graph (inputs x(0) to x(15), outputs in digit-reversed order X(0), X(4), X(8), X(12), X(1), ..., X(15), twiddle exponents 0, 1, 2, 3 / 0, 2, 4, 6 / 0, 3, 6, 9) and the two-stage R4MDC pipeline: input delays 12, 8, 4; a radix-4 butterfly (BF4) with coefficient inputs; a commutator with delays 3, 2, 1 before it and 1, 2, 3 after it; a second BF4. Each stage has its own control input.]

RrMDC

[Figure: (a) input stage and (b) k-th stage of a radix-r multi-path delay commutator. Each stage consists of a computational element with coefficient inputs and a commutator with commutator control, placed between delay lines; stage (b) takes the outputs from the previous stage and feeds the next stage.]

Delay Feedback
• R2SDF
• R4SDF
• R2^2SDF

R2SDF (N=16): Radix-2 Single-Path Delay Feedback

R2SDF (N=16) vs. R4SDF (N=128)

[Figure: R4SDF pipeline of four BF4 butterflies with feedback delay lines of length 64, 16, 4 and 1, three lines per stage.]

Buffer styles of pipeline architecture
• R2 delay-commutator (R2MDC): inefficient (50%) memory usage.
• R2 delay-feedback (R2SDF): 100% memory usage.
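To make the buffer comparison concrete, the sketch below lists the delay-line lengths of both pipeline styles. The R2MDC list follows the delays shown in the N=16 figure (Z^-8, Z^-4, Z^-4, Z^-2, Z^-2, Z^-1, Z^-1); the R2SDF list assumes one feedback buffer of length N/2, N/4, ..., 1 per stage. Function names are chosen for this example.

```python
def r2mdc_delays(n):
    """Delay-line lengths of an N-point R2MDC pipeline (N a power of two)."""
    delays = [n // 2]              # input delay in front of the first butterfly
    size = n // 4
    while size >= 1:
        delays += [size, size]     # one delay on each side of every commutator
        size //= 2
    return delays

def r2sdf_delays(n):
    """Feedback-buffer lengths of an N-point R2SDF pipeline."""
    delays = []
    size = n // 2
    while size >= 1:
        delays.append(size)        # one feedback buffer per stage
        size //= 2
    return delays

print(r2mdc_delays(16), sum(r2mdc_delays(16)))
print(r2sdf_delays(16), sum(r2sdf_delays(16)))
```

For N=16 this gives 22 delay elements for R2MDC and 15 for R2SDF, in line with the totals 3N/2 - 2 and N - 1.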

Single PE Architecture

[Figure: radix-2 shared-memory architecture with a single butterfly processing element (BF_PE) connected to one RAM.]

Concluding Remarks
• The split-radix algorithm has lower computational complexity than the fixed-radix algorithms. However, its butterfly operation is irregular (L-shaped).
• The pipeline architecture processes data faster than the single-PE architecture. However, the single-PE architecture is the most area-efficient, especially for long FFT/IFFT lengths.

Review Traditional FFT Design
Steps

1. Given an N-point FFT specification, choose a fixed-radix algorithm.
2. Design the radix-r butterfly, multiplier, etc.
3. Cascade log_r N stages to compute the N-point FFT.

An arbitrary radix can be used, based on the Cooley-Tukey decomposition for any composite number.

Problem of Traditional Approach

• It cannot derive an architecture for mixed-radix algorithms.
• Processing speed is no longer the critical issue nowadays.
• Chip area and power consumption dominate the design quality.
• A re-configurable FFT/IFFT architecture is necessary for various applications.

A length-scalable and latency-specified FFT/IFFT core is necessary.

Proposed Solution

We implement the FFT module with a single-PE architecture.

[Figure: single-PE architecture with a multiple-port memory, a pre-fetch buffer, registers, and a processing element built around a radix-r butterfly.]

Design Issue
• Sufficient performance, chip area, power consumption.
• Scalable processing element.
• Limited storage block(s).
• Efficient memory address generator.

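Algorithm Level

We adopt the split-radix 2/4 algorithm to realize the FFT module.

$$A_{2k} = \sum_{n=0}^{N/2-1} \left(X_n + X_{n+N/2}\right) \cdot W_{N/2}^{n \cdot k}$$

$$\begin{cases}
A_{4k+1} = \displaystyle\sum_{n=0}^{N/4-1} \left(X_n - j \cdot X_{n+N/4} - X_{n+N/2} + j \cdot X_{n+3N/4}\right) \cdot W_N^{n} \cdot W_{N/4}^{n \cdot k} \\[2ex]
A_{4k+3} = \displaystyle\sum_{n=0}^{N/4-1} \left(X_n + j \cdot X_{n+N/4} - X_{n+N/2} - j \cdot X_{n+3N/4}\right) \cdot W_N^{3n} \cdot W_{N/4}^{n \cdot k}
\end{cases}$$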

The Kernel of Processing Element

[Figure: 16-point split-radix signal flow graph from inputs X_0 to X_15 to outputs A_0 to A_15, with -1 and -j branches and twiddle factors W_8^0, W_8^1, W_8^3, W_16^0, W_16^1, W_16^2, W_16^3, W_16^6, W_16^9.]

Folded Butterfly Units
Compared with Radix-2/Radix-2^2, it halves the number of memory accesses.

[Figure: two butterfly units with multiplexers on their inputs and outputs and a feedback path.]

Storage Blocks

• We use multiple single-port memory banks to replace the multi-port memory.
• The concept of conflict-free memory (a vertex coloring problem).

[Figure: conflict graph on the indices 0 to 7 of an 8-point FFT, colored with two banks.]

Bank0: x(0), x(3), x(5), x(6)
Bank1: x(1), x(2), x(4), x(7)
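The two-bank split shown for the 8-point case coincides with the parity of the number of 1 bits in the index. That is an observation about this example, not a statement of how the slides derive the coloring. The sketch below checks that, under this parity rule, the two operands of every radix-2 butterfly sit in different single-port banks; the function names are hypothetical.

```python
def bank_of(index):
    """Bank number = parity of the number of 1 bits in the index."""
    return bin(index).count("1") % 2

def radix2_pairs(n):
    """All (upper, lower) operand pairs of an N-point radix-2 flow graph."""
    pairs = []
    span = n // 2
    while span >= 1:
        for i in range(n):
            if i & span == 0:
                pairs.append((i, i | span))   # indices differ in one bit only
        span //= 2
    return pairs

banks = {b: [i for i in range(8) if bank_of(i) == b] for b in (0, 1)}
print(banks)
print(all(bank_of(a) != bank_of(b) for a, b in radix2_pairs(8)))
```

Because butterfly partners differ in exactly one index bit, their parities always differ, so both operands can be read in the same cycle.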

Scalable Memory Address Generator
• There must exist a solution for such a vertex coloring problem.
• The best solution: the proposed Interleave Rotated Data Allocation (IRDA) algorithm.

[Figure: address generators (AG) for 16-length and 64-length FFTs, an address switcher (AS), four AT blocks, four memory banks RAM-0 to RAM-3, and a rotator.]

The IRDA Concept
• Conflict-free memory banks.
• Simple and length-scalable design.
• The circular shift rotator.

RAM-A  RAM-B  RAM-C  RAM-D
  00     01     02     03
  07     04     05     06
  10     11     08     09
  13     14     15     12
  19     16     17     18
  22     23     20     21
  25     26     27     24
  28     29     30     31
  34     35     32     33
  37     38     39     36
  40     41     42     43
  47     44     45     46
  49     50     51     48
  52     53     54     55
  59     56     57     58
  62     63     60     61
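The 64-entry allocation table is reproduced by a simple rule: place index i in the bank given by the sum of its base-4 digits modulo 4, so each row of four consecutive indices is circularly rotated. This rule is inferred from the table and is only an assumption about how IRDA is defined; the function names are hypothetical.

```python
def bank_of(index, radix=4):
    """Bank number = (sum of the base-`radix` digits of index) mod radix."""
    total = 0
    while index:
        index, digit = divmod(index, radix)
        total += digit
    return total % radix

def allocation_table(n, radix=4):
    """Rows of the table: row r holds indices radix*r .. radix*r + radix - 1,
    each placed in the column of its bank."""
    rows = [[None] * radix for _ in range(n // radix)]
    for i in range(n):
        rows[i // radix][bank_of(i, radix)] = i
    return rows

def is_conflict_free(n, radix=4):
    """True if the radix operands of every butterfly lie in distinct banks."""
    weight = 1
    while weight < n:
        for base in range(n):
            if (base // weight) % radix == 0:
                group = [base + d * weight for d in range(radix)]
                if len({bank_of(g, radix) for g in group}) != radix:
                    return False
        weight *= radix
    return True

for row in allocation_table(64):
    print(" ".join(f"{i:02d}" for i in row))
```

Operands of one radix-4 butterfly differ in a single base-4 digit, so their digit sums are distinct modulo 4 and no two of them share a bank.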

Length-Scalable FFT/IFFT Core

[Figure: four memory banks RAM-A to RAM-D addressed by an address generator (adder and register), connected through rotators, multiplexers and registers to two radix-2 butterfly processing elements.]

Further Performance Improvement

Multiple-PE architecture, for example 2 pipelined PEs.

Group1
RAM 0  RAM 1  RAM 2  RAM 3
  00     02     04     06
  14     08     10     12
  20     22     16     18
  26     28     30     24

Group2
RAM 4  RAM 5  RAM 6  RAM 7
  01     03     05     07
  15     09     11     13
  21     23     17     19
  27     29     31     25

The Cached-FFT Algorithm


Overview
1. Input data are loaded into an N-word main memory.
2. C of the N words are loaded into the cache.
3. As many butterflies as possible are computed using the data in the cache.
4. Processed data in the cache are flushed to main memory.
5. Steps 2-4 are repeated until all N words have been processed once.
6. Steps 2-5 are repeated until the FFT has been completed.
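The six steps can be sketched in software as a radix-2 DIF FFT whose stages are grouped into epochs: within an epoch, each group of C = 2^(stages per epoch) words is loaded into a small cache, processed through all stages of that epoch, and flushed. This is an illustrative model only; the grouping rule, the dictionary used as the cache, and all names are assumptions of this example, and the log2(N) stages are assumed to divide evenly into epochs.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def cached_fft(x, epochs):
    """Radix-2 DIF FFT computed epoch by epoch; returns (X, words_loaded)."""
    n = len(x)
    bits = n.bit_length() - 1
    if 1 << bits != n or bits % epochs != 0:
        raise ValueError("need N = 2^bits with bits divisible by epochs")
    per_epoch = bits // epochs
    memory = list(x)                                  # step 1: main memory
    loads = 0
    for e in range(epochs):                           # step 6: next epoch
        stages = range(e * per_epoch, (e + 1) * per_epoch)
        low = bits - (e + 1) * per_epoch              # lowest active index bit
        active = ((1 << per_epoch) - 1) << low
        for base in range(n):                         # step 5: every group
            if base & active:
                continue
            group = [base | (c << low) for c in range(1 << per_epoch)]
            cache = {i: memory[i] for i in group}     # step 2: load C words
            loads += len(group)
            for s in stages:                          # step 3: butterflies
                half = n >> (s + 1)
                for i in group:
                    if i & half == 0:
                        a, b = cache[i], cache[i | half]
                        w = cmath.exp(-2j * cmath.pi * (i % half) * (1 << s) / n)
                        cache[i] = a + b
                        cache[i | half] = (a - b) * w
            for i in group:                           # step 4: flush
                memory[i] = cache[i]
    # In-place DIF leaves the outputs in bit-reversed positions.
    return [memory[bit_reverse(k, bits)] for k in range(n)], loads
```

With N=64 and two epochs, each epoch covers three stages with a cache of C=8 words, and every word crosses the memory interface once per epoch (128 loads in total) instead of once per stage.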

[Figure: processor, cache and main memory, with the result returned to main memory.]

[Figure: 16-point FFT signal flow graph with inputs x_0 to x_15 in bit-reversed order, twiddle factors W on the butterfly branches, and outputs X_0 to X_15.]

N=64, E=2, Radix-2 Cached-FFT
