Tutorial

FFT - fRISCy Fourier transforms?

M R Smith

This is an applications tutorial oriented towards the practical use of the discrete Fourier transform (DFT) implemented via the fast Fourier transform (FFT) algorithm. The DFT plays an important role in many areas of digital signal processing, including linear filtering, convolution and spectral analysis. The first part of the article is a practical industrial example and takes the reader through the thought process an engineer might take as DFT familiarity is gained. If standard software packages did not provide the necessary performance, the engineer would need to port the application to specialized hardware. The second part of the tutorial discusses the theoretical concepts behind the FFT algorithm and a processor architecture suitable for high speed FFT handling. Rather than examining the standard digital signal processors (DSP) in this situation, the final section looks at how the reduced instruction set (RISC) processors perform. The Advanced Micro Devices scalar Am29050 and the super-scalar Intel i860 processors are detailed. Comparison of the DSP and RISC processors is given showing that the more generalized RISC chips do well, although changes in certain aspects of the RISC architecture would provide for considerable improvements in performance.

Keywords: FFT, digital signal processing, RISC, Am29050 processor

Many digital signal processing (DSP) algorithms are now available in application packages. The 'standard' engineer (such as myself) would approach any new DSP problem by starting with a quick glance in the user manual to get a flavour of how to solve the problem and use the package. Next would come testing with a few simple examples to ensure that the concepts were understood. This would be followed by more complex examples where the expected 'results' were known and then the algorithm would be

Department of Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, Canada. Email: smlthe@enel.ucalgary.ca. Paper received: 22 June 1992. Revised: 11 February 1993

tried on the 'real' data. The success of this approach depends on the expertise and judgement of the engineer. These factors must be tempered by the following descriptions of 'judgement' and 'experts'.

Good judgement is gained from experience. Experience is gained from bad judgement.

An expert is made of two parts: 'X' - a has-been, and 'spurt' - a drip* under pressure.

This tutorial examines the implementation of a major digital signal processing technique, the discrete Fourier transform (DFT). It is intended to provide a person unfamiliar with using the DFT with the experience to avoid any unnecessary pitfalls. The DFT plays an important role in many areas of digital signal processing, including linear filtering, correlation analysis and spectrum analysis. A major reason for its importance is the existence of efficient algorithms for computing the DFT and a thorough understanding of the problems in its application.

The first part of the article is a practical industrial example involving digital filtering to remove an unwanted signal. This section discusses the importance of data windowing and the trick of deliberate synchronous sampling to avoid problems when using the efficient fast Fourier transform (FFT) algorithm to calculate the DFT. This section would be useful for the engineer intending to use the FFT algorithm from a standard package. (An excellent source of algorithms and their theoretical background can be found in the book by Press et al.1 A diskette with (debugged) source code is available.)

If using a standard package did not provide the required performance, the engineer would need a more thorough understanding of the FFT algorithm and how current processors match the required resources for this algorithm. The second part of this tutorial provides a brief analysis of the fast Fourier transform. The characteristics needed for the efficient implementation of the FFT are discussed in terms of chip architecture in general. The architecture of specialized DSP chips (Texas Instruments

*Drip is a colloquialism for idiot.

0141-9331/93/090507-15 © 1993 Butterworth-Heinemann Ltd

Microprocessors and Microsystems Volume 17 Number 9 November 1993 507


TMS32020, 32025 and 32030†, Motorola DSP56001 and DSP96002*, and Analog Devices ADSP-2100 family§) and a number of RISC chips (the superscalar Intel i860 and the scalar Advanced Micro Devices Am29050 RISC¶) are compared.

Since the FFT performance of DSP processors is well documented in application notes, the final part of the tutorial provides a detailed analysis of the ease (and problems) of implementing DSP algorithms on RISC chips. The RISC performances are compared to those of the DSP processors.

The reason for examining the RISC chip in DSP applications is that many systems already have a high performance RISC as the main engine or a coprocessor. What has to be taken into account to get maximum (DSP) performance from these processors? In addition, there are a number of low-end RISC processors appearing on the market - the Intel RISC processor i960SA and the Advanced Micro Devices RISC Am29240 controller. Although not yet up to the top-end DSP chips in terms of performance, future variants of these chips, based around the same RISC core, may be.

INDUSTRIAL APPLICATION OF THE DFT

In this section, we discuss a simple practical application through the eyes of an engineer gaining greater familiarity with using the DFT. The first part sets the scene of how the data were gathered. The unwanted signal or 'noise' on the data has a particular frequency characteristic. This suggests that if the data were transformed into the frequency domain using a DFT, this 'noise' would appear at one particular location in the frequency spectrum. It could then be 'cut out' (filtered) and the spectrum transformed back into the original data domain. The resulting 'noiseless' data could then be analysed.

The problem

Beta Monitors and Controls Ltd. is a typical small company servicing the oil and gas industry in Alberta, Canada. One particular problem they handle on a daily basis is monitoring the performance of the heavy reciprocating compressors used in the natural gas industry. This analysis requires the determination of the compressor's effective input and exhaust strokes. This is obtained by measuring the 'pressure' as a function of 'crankshaft angle' for a complete stroke. This crankshaft angle is then converted by a non-linear transform to a 'stroke volume'. The compressor's efficiency is determined from the area under the pressure versus volume curve.

Experimental measurement technique

The pressure is obtained by attaching a transducer to the compressor (Figure 1). The transducer's output is fed

†TMS32020, TMS32025, TMS32030 are trademarks of Texas Instruments Ltd.
*DSP56001 and DSP96002 are trademarks of Motorola Ltd.
§ADSP-2100 is a trademark of Analog Devices.
¶Am29050 and Am29240 are trademarks of Advanced Micro Devices Ltd.

Figure 1 Measurement of the pressure is obtained by attaching a transducer to the heavy compressor cylinder (Figure courtesy of Beta Monitors and Controls Ltd., Calgary, Alberta, Canada)

through a low-pass anti-aliasing analogue filter before being digitized by an analogue-to-digital (A/D) converter. The analogue filter is an important part of the measurement system as it removes all frequency components (such as high frequency random noise) greater than half the digital sampling frequency. This ensures that the digitized signal accurately represents the analogue signal being converted, and avoids the problems of signal 'aliasing'. Signal aliasing is when one signal appears, on sampling, as another. For example a 7 kHz signal sampled at an 8 kHz rate will be indistinguishable from a 1 kHz signal sampled at the same rate. An unnecessary degradation of the signal-to-noise ratio occurs if high frequency noise is aliased on sampling.
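The 7 kHz/8 kHz example can be checked numerically. The short sketch below is my illustration, not code from the article; the sample rate and tone frequencies are simply the figures quoted in the text. A 7 kHz cosine sampled at 8 kHz produces exactly the same sample values as a 1 kHz cosine, which is the aliasing being described.

#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846

/* cos(2*pi*7000*n/8000) = cos(2*pi*n - 2*pi*1000*n/8000)
                         = cos(2*pi*1000*n/8000)                 */
int main(void)
{
    double fs = 8000.0;                          /* sampling rate (Hz) */
    int n;

    for (n = 0; n < 8; n++) {
        double s7k = cos(2.0 * PI * 7000.0 * n / fs);   /* 7 kHz tone  */
        double s1k = cos(2.0 * PI * 1000.0 * n / fs);   /* 1 kHz alias */
        printf("n=%d  7 kHz: %+.6f   1 kHz: %+.6f\n", n, s7k, s1k);
    }
    return 0;
}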

Actual measurements are shown in Figure 2. Although


[Figure 2 plot: '1 cycle of actual data', pressure versus crankshaft angle. Figure 3 plot: 'Spectrum of 1 cycle padded to 512 pts', magnitude versus normalized frequency, with the channel resonance indicated.]

Figure 2 The pressure as a function of the crankshaft angle is measured for a Natural Gas reciprocating compressor. There are data, compressor-related pulsations and unwanted channel noise components

Figure 3 The transform of the data from Figure 2. The noise or channel resonance frequency components are indicated

the basic curve is simple and has a high signal-to-noise ratio, the measurements are distorted by important (wanted) low frequency 'compressor related pulsations' and (unwanted) high frequency noise components. The unwanted noise arises from the transducer which is attached to the cylinder via a small channel or pipe (see Figure 1). Just as a bottle will whistle if you blow across the top, this channel will resonate during the cylinder stroke, appearing as rapid vibrations during one part of the measured cycle. Even a 2% error in the measurement of the compressor performance can mean under-production (loss of profits) or over-load (premature failure of the compressor). The problem is made more difficult by the non-linear transform from 'angle' to 'volume' which distorts the noise oscillations.

The way to remove this noise is to transform the data using the DFT into the frequency domain where its frequency components can be identified and removed. By inverse transforming the modified spectrum, it should be possible to get the data without the noise component. This data can be converted to 'volume' and analysed as required.

Attempt 1 - Bull-at-the-gate approach

The scientific computer libraries and applications packages typically include an efficient implementation of the DFT, the fast Fourier transform (FFT) algorithm, based on a data length M that is a power of 2 (M = 16, 32, ..., 512, 1024 etc.). Since the pressure versus angle data have a length of 360 points (1 cycle), it seems appropriate to pad the data with zeros to size 512 and then apply the FFT.

Transforming the original data (Figure 2) produces a spectrum (Figure 3) with the channel resonance frequency components fairly evident above a background signal. The frequency scale has been normalized (frequency/DFT-points) so that spectra from different sized DFTs can be compared. Large frequency components are displayed so that the smaller components are more easily seen. The

noise frequency components can be zeroed (filtered out) for normalized frequencies 0.12 to 0.16, and the modified spectrum transformed back into the data domain for the compressor analysis. It can be seen from the resulting data (Figure 4) that the majority of the noise oscillations have been removed by the filtering, but there are now different distortions that were not there before.
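For concreteness, the bin-zeroing step might look like the sketch below. This is my illustration rather than code from the article: the array names are placeholders, the 512-point size and the 0.12-0.16 normalized-frequency band are simply the values quoted in the text, and the mirror-image bins are also cleared so that the inverse transform of a real signal stays real (a point the article returns to later).

/* Zero the spectral bins between two normalized frequencies.
   Xre[]/Xim[] hold the M-point spectrum of the zero-padded data.   */
void notch_bins(float Xre[], float Xim[], int M, float flo, float fhi)
{
    int k;
    int klo = (int)(flo * M + 0.5);            /* 0.12 * 512 ~ bin 61 */
    int khi = (int)(fhi * M + 0.5);            /* 0.16 * 512 ~ bin 82 */

    for (k = klo; k <= khi; k++) {
        Xre[k] = 0.0f;      Xim[k] = 0.0f;     /* positive-frequency bins        */
        Xre[M - k] = 0.0f;  Xim[M - k] = 0.0f; /* mirror bins for real-valued data */
    }
}

The 360 data points would first be copied into a 512-point buffer padded with zeros, transformed by any package FFT, passed through notch_bins(), and inverse transformed.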

There are many books that will explain the problems during this simple filtering2,3; however, the following argument outlines the underlying principles. The new distortions can be understood at a number of levels. Because of the finite amount of data, there are discontinuities at its boundaries on padding with zeros*. These 'sharp edges' have frequency components all across the spectrum (the background signal of Figure 3). This means that the data is no longer confined to the low frequencies. When the noise frequency components are removed, so is a significant part of the 'spread-out' data components. On inverse transforming, the removed data components mean that the filtered signal will be incorrect, particularly near the discontinuities. In addition, since the noise components were also spread out, they are more difficult to identify and (correctly) remove.

There is also a second, less obvious problem. If discontinuities in the data lead to a range of frequencies in the spectrum, then discontinuities in the spectrum will lead to a range of original data values. When we removed the noise frequency components by setting them to zero, this created discontinuities in the spectrum which can lead to additional distortions in the filtered data. Proper application of the DFT can remove or reduce many of these artifacts.

Attempt 2 - DFT with windowing

The previous section described the problems associated with blindly applying the DFT. The difficulties are deeper

*To make the problem more obvious, the data's DC offset was deliberately increased.


applied to get the 'finite' data record lead to a wide range of frequency components - 'spectral leakage'. When we filter the noise components, we will remove some of the spread-out data frequency components. This will produce distortion in the filtered signal when an inverse transform is performed.

Windowing is a fundamental limitation to the DFT. There is no way around it; the best that you can do is to apply different windows to minimize distortion. The secret is to modify the window on your data so that the spectral leakage (side-lobes) of the window is reduced. This means that you will be able to better discern small signals (the channel resonance) in the presence of larger signals (the frequency components associated with the data edges). However, applying the window to reduce the sidelobes must be balanced against the fact that each spectral peak is widened, resulting in a loss of resolution. An excellent paper on the properties of the DFT and windows has been given by Harris.

Applying the window to reduce distortions introduces different distortions. By gathering additional data (say 2.5 cycles padded to 1024 points, Figure 6) it is possible to minimize the effect of these new distortions. A window with smoothly changing edges is applied to this extended data before calculating the DFT. This window will allow the noise frequency components to be more clearly identified and filtered. The spectrum can then be inverse transformed and the window removed.

Suppose that you have M data points, x(m): 0 \le m < M, which you intend to filter. The steps in generating the filtered signal are:

x_{windowed}(m) = x(m) \times w_{window}(m); \quad 0 \le m < M \qquad (1)

X(f) = FT[x_{windowed}(m)]; \quad 0 \le m, f < M \qquad (2)

X_{filtered}(f) = X(f) \times F_{filter}(f) \qquad (3)

x_{filtered}(m) = FT^{-1}[X_{filtered}(f)] \qquad (4)

x_{corrected}(m) = x_{filtered}(m) / w_{window}(m) \qquad (5)

First the windowed data points (x_{windowed}(m)) are generated from the original data points using one of the


Figure 4 The original data is shown as the upper trace. When the data is padded with zeros before filtering, the resulting filtered signal has new distortions at its edges

and more complicated than what appears on the surface. We 'think' we are trying to transform the signal shown in Figure 2. However, when we use the DFT what we are actually trying to do is to transform an infinitely long signal of which we only know a small part. This subtle effect is known as 'windowing' and has a very pronounced effect on a signal's spectrum. If we had an 'infinite' amount of data taken from the compressor, there would be no discontinuities and no distortions introduced when calculating its spectrum.

Figure 5 (top) shows an 'infinitely-long' complex sinusoid and its spectrum, a single spike. Figure 5 (middle) shows a 'windowed' sinusoid and its (magnitude) spectrum. It can be seen that the single spike has spread into a wide centre lobe and there are a number of high side-lobes. Every frequency component in the data shown in Figure 2 undergoes a similar spreading. The discontinuities associated with the effective 'rectangular' sampling window

[Figure 5 plots: 'Effect of windows on sinusoid' (amplitude versus time) and 'Effect of windowing on spectrum' (amplitude versus normalized frequency), for the infinite, windowed and synchronously windowed cases.]

Figure 5 (Top) An 'infinitely-long' sinusoid and its spectrum; (middle) a 'windowed' sinusoid and its spectrum. Note how the windowing produces 'spectral leakage'; (bottom) a 'synchronously sampled' sinusoid. Note how this spectrum appears not to have any 'spectral leakage'


x_{simple}(m) = \sum_{v} x(v) \, f_{bandstop}(m - v) \qquad (5)

where f_{bandstop}(m) is the (inverse) discrete Fourier transform of the frequency domain notch filter.


Figure 7 The upper spectrum is from 2.5 cycles of data padded to 1024 points. Note the large background because of 'spectral leakage' from the main data components. The lower spectrum is obtained by windowing the data before filtering

F(f) = 1; \quad 0 \le f < P - B/2

F(f) = 1 - a_0 - a_1 \cos\!\left(\frac{2\pi}{B}(f - P + B/2)\right) - a_2 \cos\!\left(\frac{2\pi}{B}\,2(f - P + B/2)\right); \quad P - B/2 \le f < P + B/2

F(f) = 1; \quad P + B/2 \le f < M

Note that for most data the noise will have components at two locations (P and at M - P), because of the way the DFT generates the data spectrum, so that two bandstop filters must be applied. There is also some reasonable argument to recommend (smoothly) removing all the frequency components P - B/2 \le n < M - P + B/2 as these will mainly contain unwanted random noise. However, if there are some valid high-frequency components present in the data, then removing them will degrade any sharp edges actually present in the data.
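A sketch of how such a smooth band-stop function might be tabulated is given below. It is my reading of the formula above rather than code from the article: the three-term coefficients are those of equation (4), P is the centre bin, B the bandwidth in bins (assumed even, with the notch well inside 0..M/2), and the mirror notch at M - P is filled in as the text recommends.

#include <math.h>

#define PI 3.14159265358979323846

/* Fill F[0..M-1] with the smooth band-stop (notch) filter.           */
void make_bandstop(float F[], int M, int P, int B)
{
    double a0 = 0.44959, a1 = -0.49364, a2 = 0.05677;   /* equation (4) */
    int f;

    for (f = 0; f < M; f++)
        F[f] = 1.0f;                                     /* pass everywhere else */

    for (f = P - B / 2; f < P + B / 2; f++) {
        double u = (double)(f - (P - B / 2)) / B;        /* 0..1 across the notch */
        double w = a0 + a1 * cos(2.0 * PI * u) + a2 * cos(4.0 * PI * u);
        F[f] = (float)(1.0 - w);                         /* smooth notch, 0 at bin P */
        F[M - f] = F[f];                                 /* mirror notch at M - P    */
    }
}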

Figure 7 shows the un-windowed spectrum for 2.5 cycles of data padded with zero out to 1024 points. The spreading of the DC components is very evident. By comparison, in the windowed data spectrum, the channel resonance and pulsations become very clear as the side-lobes are removed. When the noise is removed, the modified spectrum inverse transformed and the window removed, the filtered signal shown in Figure 6 is obtained.

It might be assumed that multiplying by a window w_{window}(m) and later dividing by the same window cancels out the effect of the window. This is not the case because the frequency domain filtering changes the window so that the division and multiplication effects no longer cancel. Filtering in the frequency domain is equivalent to performing a convolution on the data. The simple 'non-windowed' filtering can be expressed as:

'" 150 0'CJ.gaen

~ 1000

500

2500

2000

where

300

F(f)=1; O<.f<P-B/2

200 400 600 800Crankshaft angle

Filtered signal of 2.5 cycles podded to 1024

Figure 6 By applying windowing techniques to a number of cycles of data (top) before using the DFT it is possible to generate the filtered signal (bottom). By throwing away the distorted ends, the analysis can be performed on the undistorted centre cycle


filter window shapes (w_{window}(m)) suggested by Harris. The windowed data is then transformed (FT[ ]) into the frequency domain (X(f)), where the unwanted noise components are removed using a band-stop filter (F_{filter}(f)).

The filtered frequency domain signal (X_{filtered}(f)) is inverse transformed (FT^{-1}[ ]) back to the data domain (x_{filtered}(m)) where the original window is removed to produce the required signal (x_{corrected}(m)).
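Putting the five steps together, a filtering pass might look like the sketch below. This is only an illustration of the sequence in equations (1)-(5), not the author's code: window[], filter[] and the fft()/ifft() routines are assumed to be supplied elsewhere (any package transform pair, a window generator and a band-stop generator of the kind discussed in this section would do).

/* Assumed to be provided by the package or by the code elsewhere in
   this tutorial; prototypes only.                                     */
void fft(float re[], float im[], int M);
void ifft(float re[], float im[], int M);

/* One pass of the windowed filtering of equations (1)-(5).
   xre[]/xim[] : M data points (imaginary part zero for real data)
   window[]    : w_window(m), e.g. the Blackman-Harris 3 term window
   filter[]    : F_filter(f), e.g. the smooth band-stop notch          */
void filter_record(float xre[], float xim[],
                   const float window[], const float filter[], int M)
{
    int k;

    for (k = 0; k < M; k++) {            /* (1) window the data        */
        xre[k] *= window[k];
        xim[k] *= window[k];
    }

    fft(xre, xim, M);                    /* (2) transform              */

    for (k = 0; k < M; k++) {            /* (3) multiply by the filter */
        xre[k] *= filter[k];
        xim[k] *= filter[k];
    }

    ifft(xre, xim, M);                   /* (4) inverse transform      */

    for (k = 0; k < M; k++) {            /* (5) remove the window      */
        xre[k] /= window[k];
        xim[k] /= window[k];
    }
    /* Samples near the record edges, where 1/window[k] is large, remain
       unreliable and are discarded, as discussed in the next section.  */
}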

There are a number of popular windows, chosen because they are simple to remember and because applying any window is often better than none. A simple window is given by:

w_{window}(m) = a_0 + a_1 \cos\!\left(\frac{2\pi}{M} m\right); \quad 0 \le m < M \qquad (1)

where

a_0 = 0.54; \quad a_1 = -0.46 \qquad (2)

I prefer to use a slightly more complex window known as the 'Blackman-Harris 3 term window':

w_{window}(m) = a_0 + a_1 \cos\!\left(\frac{2\pi}{M} m\right) + a_2 \cos\!\left(\frac{2\pi}{M} 2m\right); \quad 0 \le m < M \qquad (3)

a_0 = 0.44959; \quad a_1 = -0.49364; \quad a_2 = 0.05677 \qquad (4)

which was designed to have a reasonable main lobe width and minimum side-lobe height.
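The two windows can be tabulated directly from equations (1)-(4). The helper below is a sketch (the names are mine, not the article's); with three_term = 0 it produces the simple window of equations (1)-(2) and with three_term = 1 the Blackman-Harris 3 term window of equations (3)-(4). Both peak at the centre of the record and fall close to zero at its edges.

#include <math.h>

#define PI 3.14159265358979323846

/* Tabulate w_window(m) for m = 0..M-1 using the coefficients of
   equation (2) or equation (4).                                       */
void make_window(float w[], int M, int three_term)
{
    double a0 = three_term ? 0.44959  : 0.54;
    double a1 = three_term ? -0.49364 : -0.46;
    double a2 = three_term ? 0.05677  : 0.0;
    int m;

    for (m = 0; m < M; m++)
        w[m] = (float)(a0 + a1 * cos(2.0 * PI * m / M)
                          + a2 * cos(2.0 * PI * 2.0 * m / M));
}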

It should also be remembered that it is often important to filter out the noise frequency components rather than zeroing them out, again to avoid discontinuities. Suppose that the spectrum X(f) has been evaluated using M points and that the noise frequency components are centred around location P with a bandwidth of B; then a suitable bandstop filter with which to multiply the spectrum would be the smooth band-stop function F(f) given above.


Windowing, applying the notch filter, and inverse windowing is equivalent to:

x_{corrected}(m) = \sum_{v} x(v)\, w_{window}(v)\, f_{bandstop}(m - v) \,/\, w_{window}(m) \qquad (6)

which is not the same expression. Since 1/w_{window}(m) is very large near the edge of the window, the division operation will amplify any noise on the data, leading to possible distortion near the data edges. This means that it is necessary to use a number of cycles of the data and 'throw away' the parts (near the edges) that remain distorted.

Attempt 3 - DFT with deliberate synchronous sampling

When Beta Monitors discussed with me their filtering problems, they had already empirically attempted the approaches suggested in Attempts 1 and 2. However, they wanted more. They wanted to remove the windowing problems but, at the same time, reduce the number of points measured. Normally, this is not possible as a fundamental DFT limitation is the window effects of some form or another. You must simply decide which of the distortions (resolution loss or side-lobes) you can best live with in practice.

One way of reducing the effect of the window can be taken for data of the form obtained by Beta Monitors. By removing the large DC offset and shifting the start of the sampling, it can be seen that the signal is 'naturally' windowed with very few edge discontinuities (Figure 8). Filtering the spectrum produced by the naturally windowed signal is also shown in Figure 8. Note that there are edge effects still present, but they are less evident.

Most of the time, the data cannot be naturally windowed and you must live with the effects. However,

there is one very special and unusual situation where something can be done. Fortunately for Beta Monitors, their data could be manipulated into the required format. Figure 5 (lower) shows a 'windowed' sine wave and its spectrum obtained by using a DFT. This spectrum appears to contradict all that was said in the previous section. Where are the side-lobes from the window?

When you are applying the DFT you do not calculate the continuous spectrum as suggested in Figure 5 (top and middle). Instead you determine only certain parts of that continuous spectrum. The 'windowing' in Figure 5 (bottom) has been chosen so that a whole number of cycles of the sinusoid are included in the sampling period, called synchronous sampling. This has the effect that when you apply the DFT you only sample the 'spectral leakage' at the central maximum and at all places where the 'leakage' is zero (Figure 9). This means that if you can achieve synchronous sampling, then it is as if there were not any leakage when using the DFT.

True synchronous sampling of all components of your data is not something that can be readily achieved. It is normally only done by mistake when students choose a poor example for use with their spectral analysis programs in DSP courses. By accidentally synchronously sampling, their algorithms will appear to perform much better than they would in real life. However, in the case of Beta Monitor's data, both the data and the noise were repeatable every 360 points as the fixed speed compressor made one rotation. By performing a 720 point DFT rather than a standard 1024 point DFT, it was possible to achieve synchronous sampling and generate the spectrum shown in Figure 10. The spectral components just jump out at you. It is now very easy to identify and remove the noise frequency components without disturbing the main data components and achieve 'perfect filtering'. After inverse transforming, the signal shown in Figure 11 was obtained. A further improvement in the signal could be obtained by adding two cycles to average out the effect of random noise.
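The arithmetic behind the 720 point choice is easy to check. A component that repeats every P samples is synchronously sampled by an M point DFT exactly when M is a whole multiple of P; its energy then falls only on bins that are multiples of M/P. The fragment below is my illustration, not the article's code, using the compressor figures quoted in the text (both data and noise repeat every 360 samples).

#include <stdio.h>

/* A component with period P samples is synchronous in an M point DFT
   when M % P == 0; it then occupies only bins k = 0, M/P, 2*M/P, ...  */
int is_synchronous(int M, int P)
{
    return (M % P) == 0;
}

int main(void)
{
    int P = 360;                                  /* one compressor rotation */

    printf("1024-point DFT synchronous? %d\n", is_synchronous(1024, P)); /* 0 */
    printf(" 720-point DFT synchronous? %d\n", is_synchronous(720, P));  /* 1 */
    printf(" 720-point DFT: the 360-sample period lands on every %d-th bin\n",
           720 / P);                              /* harmonics on bins 2, 4, 6, ... */
    return 0;
}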


Figure 8 Less distortion arises from filtering a 'naturally windowed' signal obtained by removing the zero offset and adjusting the position of the signal in the window


Figure 9 When windowing a synchronously sampled signal, the wide main lobe and the side-lobes are in fact present. However the 'spectral leakage' signal is sampled only at the centre of the main lobe and at the zero-crossing points between the side-lobes




x(m) = \frac{1}{M} \sum_{f=0}^{M-1} X(f) \, W_M^{-fm}, \quad 0 \le m < M \qquad (9)

Since these equations are basically equivalent, we shall discuss only the DFT implementation.

A number of steps can be taken to speed the direct implementation of the DFT. First the coefficients W_M^{fm} can be pre-calculated and stored for reuse (in ROM or RAM). The calculation time for all the sine and cosine values can take almost as long as the DFT calculation itself. If the input values x(m) are real, a further time saving of 2 can be made using the DFT since half the spectrum can be derived from the other half rather than being calculated.

X(f) = a_f + j b_f, \quad 0 < f < M/2

X(M - f) = a_f - j b_f, \quad 0 < f < M/2

Despite all these 'special' fixes, basically the direct implementation of the DFT is a real number-cruncher. For each value of f, we require M complex multiplications (4M real multiplications) and M - 1 complex additions (4M - 2 real additions). This means that to compute all M values of the DFT requires M^2 complex multiplications and M^2 - M complex additions. Take M = 1024 as a realistic data size and you have over 4 million multiplications and 4 million additions, which is a lot of CPU time, even with the current high-speed processors.

There are a number of ways around this problem, based on a divide and conquer concept which leads to a group of FFT algorithms. One such algorithm is the M point decimation-in-frequency* radix 2 FFT.

Consider computing the DFT of the data sequence x(m) with M = 2^r points. These data can be split into odd and even series:

g_{even}(m) = x(2m)

g_{odd}(m) = x(2m + 1); \quad 0 \le m < M/2

These series can be used in calculating the DFT of x(m):

X(f) = \sum_{m=0}^{M-1} x(m) W_M^{fm}, \quad 0 \le f < M

     = \sum_{m\,\mathrm{even}} x(m) W_M^{fm} + \sum_{m\,\mathrm{odd}} x(m) W_M^{fm}

     = \sum_{n=0}^{M/2-1} x(2n) W_M^{2nf} + \sum_{n=0}^{M/2-1} x(2n+1) W_M^{(2n+1)f}

     = \sum_{n=0}^{M/2-1} g_{even}(n) W_{M/2}^{nf} + W_M^{f} \sum_{n=0}^{M/2-1} g_{odd}(n) W_{M/2}^{nf}

[Figure 10 plot: 'Spectrum of 2 cycles synchronously sampled' versus normalized frequency. Figure 11 plot: 'Filtered signal of 2 cycles synchronously sampled' versus crankshaft angle.]

Figure 11 This filtered data set was achieved by using a 720 point DFT on two data cycles

Figure 10 The spectrum from a 720 point 'synchronously sampled' data set. Note the sharp spectral spikes


THEORY BEHIND THE FFT ALGORITHM

Before moving on to detail a suitable processor upon which to implement the DFT efficiently, we need to determine what the processor must handle. The basic computational problem for the DFT is to compute the (complex number) spectrum X(f), 0 \le f < M, given the input sequence x(m), 0 \le m < M, according to the formula:

X(f) = \sum_{m=0}^{M-1} x(m) W_M^{fm}, \quad 0 \le f < M \qquad (7)

where

W_M^{fm} = e^{-j 2\pi f m / M} = \cos(2\pi f m / M) - j \sin(2\pi f m / M) \qquad (8)

In general the data sequence x(m) may also be complex valued. The inverse DFT is given by equation (9) above.

*DSP and gratuitous violence - the word 'decimate' comes from the ancient Roman method of punishing mutinous legions of soldiers by lining them up and killing every 10th soldier.


[Figure 12 butterfly detail: A := A + B; B := (A - B) exp(-j2\pi p/N)]

approach is more general. For example, the 720 point FFT needed for the Beta Monitor data discussed in the last section might be obtained by decimating the data into four groups of two, two groups of three and one group of five*. For further information on Radix 2, 4, 8, split radix or 720 point FFT algorithms see References 2 and 3.

As can be seen from Figure 12, a disadvantage to this approach to the FFT algorithm is that the results are stored at a location whose address is 'bit-reversed' to the correct address. Thus the data value X(%011) is stored at location %110. This requires a final pass through the data to sort them into the correct order. As will be shown in the final part of this tutorial, this plays an important role in the efficient implementation of the FFT on RISC chips.
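On a RISC chip this final reordering pass has to be coded explicitly. A straightforward sketch is shown below (mine, not the article's); it swaps each element with the element at its bit-reversed address, using a bit-reversed counter so that each pair is touched only once.

/* Sort the FFT output into natural order by swapping each element with
   the one at its bit-reversed address (n = 2^m points).                */
void bit_reverse_reorder(float xr[], float xi[], int n)
{
    int i, j, k;

    for (i = 0, j = 0; i < n; i++) {
        if (i < j) {
            float tr = xr[i], ti = xi[i];    /* swap each pair once      */
            xr[i] = xr[j];  xi[i] = xi[j];
            xr[j] = tr;     xi[j] = ti;
        }
        /* increment j as a bit-reversed counter */
        for (k = n >> 1; j & k; k >>= 1)
            j ^= k;
        j |= k;
    }
}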

[Figure 12 flow graph: inputs x[0]..x[7] pass through three stages of FFT butterflies and then bit-reverse addressing to give the outputs X[0]..X[7]]

Figure 12 The flow chart for an eight point Radix 2 FFT algorithm

* *"' .. *"'1'''''''''''''** I

int i, i, k, m, inc, Ie , La, nl, n2;float xrt, xit, c , S;

DOFPr(xr, xi, wr, wi, n, m)float xr I l . xi[], wr[], wifJ;int n , mtI

PROCESSOR ARCHITECTURE FOR THE FFTALGORITHM

1* common * I

1* offset */

1* upper */

xrt; :;:: xr[ij - xr(m];xit = xHiJ -xd Iml i

xr l 1) += xr Iml ;xi[i] += xi [m] i

for (i = j 1 i c: nr i += nl) (m = i + n2,

for (j .. 0i j < n2; j ++) {I * sine and cosine values ""I

C '" wr l LeI i s = wdl Lel ria := ia + Le r ;- next address offset *1

n2 = 0;for (k = 0; k < rnl k1-+) { /* outer-loop" I

n1 ~ n2; n2 = n2 I 2; ie = n I n1;ia '= 1;

/*************.****"xr -- array of real part of dataxi -- array af imag part of datawr - array of precalculated cosine vafuas for n pointswi - array of precalculated sine ve iuee for n pointsm = log' (n)

2

A recent article discusses in detail how DSP and RISC chips handle various DSP algorithms. This article points out that although the FFT is a specialized algorithm, it makes a fairly good test-bed for investigating the architecture required in DSP applications. Examining the Radix 2 'C-code' shown in Figure 13 provides a good indication of what the processor should handle in an efficient way:

• The algorithm is multiplication and addition intensive.
• The precision should be high to avoid round-off errors as values are reused.
• There are many accesses to memory. These accesses should not compete with instruction access to avoid bottlenecks.
• The algorithm uses complex arithmetic.

where we have used the property that W_M^2 = W_{M/2}. This analysis means that the M-point DFT X(f) can be calculated from the M/2-point DFTs G_{even}(f) and G_{odd}(f). This does not seem much until you realize that g_{even}(m) and g_{odd}(m) can also be broken up into their odd and even components, and these components into theirs. This breaking up and calculating a higher DFT from a set of other DFTs is demonstrated in Figure 12 for an eight-point DFT.

The advantage of this approach can be seen by the fact that calculating an M-point DFT from known M/2-point DFTs requires only M/2 additional complex computations:

X(f) = G_{even}(f) + W_M^f \, G_{odd}(f); \quad 0 \le f < M/2

X(f + M/2) = G_{even}(f) - W_M^f \, G_{odd}(f)

This operation is known as an FFT butterfly and, as can be seen from Figure 12, forms the basis of the FFT calculation. Calculating the M/2-point G_{even}(f) and G_{odd}(f) from their M/4-point components takes a similar number of multiplications. Thus by using this divide and conquer approach, it is possible to calculate an M-point DFT using (M/2) log2 M complex multiplications rather than the M^2 required for the direct method.

The time saving that this new approach provides is enormous as can be seen from Table 1, which compares the number of complex multiplications for the direct and Radix 2 DFT algorithms.

The speed improvements rapidly increase as the number of points increases. The C-code for implementing this Radix 2 algorithm is given in Figure 13 (modified from Reference 2). Breaking up the data into four components (Radix 4) also provides some speed improvement, which for 1024 points gives an additional 30% advantage.
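The Figure 13 routine expects the cosine and sine tables to have been filled in beforehand and leaves its output in bit-reversed order. A possible calling sequence is sketched below; the table convention wr[k] = cos(2*pi*k/n), wi[k] = sin(2*pi*k/n) is the one implied by the butterfly code, but the rest is my illustration rather than the article's.

#include <math.h>

#define PI 3.14159265358979323846

/* Pre-calculate the coefficient tables once and reuse them (ROM or RAM).
   Only n/2 entries are ever indexed by the Figure 13 code.              */
void make_twiddles(float wr[], float wi[], int n)
{
    int k;
    for (k = 0; k < n / 2; k++) {
        wr[k] = (float)cos(2.0 * PI * k / n);
        wi[k] = (float)sin(2.0 * PI * k / n);
    }
}

/* Usage for an n = 1024 point transform (m = log2(n) = 10):
     make_twiddles(wr, wi, 1024);
     DOFFT(xr, xi, wr, wi, 1024, 10);
   followed by the bit-reversal pass described earlier.                  */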

The divide-and-conquer approach is most often used for data numbers that are a power of 2. However, the

Table 1 Comparison of complex operations required for direct and radix 2 DFT algorithms

Points    Direct      Radix 2    Speed improvement
4         16          4          400%
32        1024        80         1280%
128       16384       448        2130%
1024      1048576     5120       20488%

*On the basis of 'if it ain't broke, don't fix it', if I had a very fast processor and only had to do a 720 point DFT a very few times, I would be tempted to simply code a straight DFT. That was the approach I took for this paper. I also forgot to turn on the maths coprocessor and was really reminded just how slow a 'slow DFT' algorithm is.


• The algorithm has a number of loops, which should not cause 'dead-time' in the calculations.
• There are many address calculations, which should not compete with the data calculations for the use of the arithmetic processor unit (APU) or the floating point unit (FPU).
• There are a number of values that are reused. Rather than storing these out to a slower external memory, it should be possible to store these values in fast on-board registers (or data cache).
• There are fixed coefficients associated with the algorithm.

• Speed, speed and more speed.

The DSP processor is designed with this sort of problem in mind - all the resources needed are provided on the chip. Typically, DSP processors run at one instruction every two clock cycles. In that time, they might perform an arithmetic operation, read and store values to memory and calculate various memory addresses. By comparison, RISC chips are more highly pipelined and can complete one instruction every clock cycle. When there is an equivalent instruction (for example the highly pipelined multiply and accumulate instruction or a basic add) this gives the RISC processor the edge. It loses the edge when many RISC instructions are needed to imitate a complex DSP instruction. Depending on the algorithm and the architecture of the particular chips, the DSP and RISC processors come out fairly even in DSP applications.

None of the processors on the market has true 'complex arithmetic' capability with data paths and ALUs duplicated to simultaneously handle the real and imaginary data values. Since complex arithmetic is not that common, adding the additional resources to a single chip is not worthwhile. This is the realm of expensive custom chip fabrication, microprogrammable DSP parts or multi-processor systems.

Many DSP applications have extensive looping. This can be handled by hardware zero overhead loop(s) (Texas Instruments TMS320 DSP family and Motorola DSP96002). On RISC processors, the faster instruction cycle, the delayed branch and unfolding loops (straight line coding) remove the majority of the delay problems with using branches. This is particularly true for algorithms, such as the FFT, where the loops are long.

Nor is there significant difference in the available precision on the DSP and the RISC processors. For example, the DSP56001 has a 24 bit wide on-chip memory but uses 56 bits for 'sum-of-products' operations to avoid loss of precision. The i860 and Am29050 processors have 32 bit wide data buses and can use 64 bits for single cycle sum-of-products operations. Many of the RISC and DSP chips now come with floating point capability at no time penalty. Although not strictly necessary, the availability of floating point makes DSP algorithms easier to develop and can provide improved accuracy in many applications.

There is one area in which the DSP chips appear to have a significant advantage, and that is in the area of address calculation. The FFT algorithm requires 'bit reversal' addressing to correct the positions of the data in memory, a standard DSP processor mode. This mode must be implemented in software on the RISC chips.


However, as was pointed out to me at a recent DSP seminar*, it is possible to avoid the overhead for bit-reverse addressing by modifying the FFT algorithm so that it does not do the calculation in place.

Auto-incrementing addressing modes are also standard as part of the longer DSP processor instruction cycle. On the Am29050 processor this must be done in software (at the faster instruction cycle) or by taking advantage of this processor's ability to bring in bursts of data from memory (see the section on 'Efficient handling of the RISC's pipelined FPU'). The super-scalar i860 RISC is almost a CISC chip in some of its available addressing modes.

When comparing the capabilities of RISC and DSP processors it is important to consider the possibility that the processor is about to run out of resources. For example the TMS32025 has sufficient data resources on board to perform a 256 point complex FFT on-chip in 1.8 ms with a 50 MHz clock. The available resources are insufficient for 1024 complex points, which takes 15 ms rather than the 9 ms if the 256 point timing is simply scaled. The various processors have different break points depending on the algorithm chosen. The evaluation is, however, difficult because of the different parallel operations that are possible on the various chips, some of the time.
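The 9 ms figure follows from the butterfly count: a radix 2 FFT needs (M/2) log2 M butterflies, so going from 256 to 1024 points multiplies the arithmetic by a factor of five, and scaling the measured 1.8 ms accordingly gives 9 ms; the measured 15 ms therefore reflects the cost of running out of on-chip resources rather than extra arithmetic. As a worked check (my arithmetic, not the article's):

\frac{(1024/2)\log_2 1024}{(256/2)\log_2 256} = \frac{512 \times 10}{128 \times 8} = 5, \qquad 5 \times 1.8\ \mathrm{ms} = 9\ \mathrm{ms}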

There are as many solutions to the problem of instruction/data fetch bus clashes as there are processors. The DSP chips have on-chip memory while the RISC chips have caches (i860), Harvard architecture and large register files (Am29050). However, it is often possible to find a (practical) number of points that will cause any particular processor to run out of resources and reintroduce bus clashes.

Many DSP chips conveniently have the FFT coefficients (up to size M = 256 or 1024) stored 'on-chip'. By contrast, the RISC chips must fetch these coefficients from main memory (ROM or RAM). Provided the fetching of these coefficients can be done efficiently, there is no speed penalty. Again there are problems of 'running out of resources' if the appropriate number of points being processed is sufficiently large on either type of processor.

Changes in the architecture of RISC and DSP chips can be expected in the next couple of years as the fabricators respond to the market and what they perceive as advantages present in other chips. For example the TMS32030 can perform a 1024 complex FFT in 3.23 ms with a 33 MHz clock compared to 1.58 ms with a 40 MHz clock for a DSP96002. The advantage of the DSP96002 is not just in clock speed. It has the capability of simultaneously performing a floating point add and a subtract on two registers. This is particularly appropriate for the FFT algorithm, which is made up of many such operations. The advantage that it gives the DSP96002 can be seen from the fact that it performs an FFT butterfly in four instructions compared to 7-30 instructions on the other RISC and DSP processors. With this sort of advantage, can it be long before the same feature is seen on other chips?

The FFT implementation on the DSP processors is well documented in the data books and will not be further discussed here. In the next section, we shall examine in

*Analog Devices mini-DSP seminar, Calgary, January 1993.


some detail the less familiar problems associated with efficiently implementing the FFT on RISC chips.

EFFICIENT FFT IMPLEMENTATION ON THE RISC PROCESSORS

The butterfly is given by:

C = A + WB

D = A - WB

which must be split into the real and imaginary components before being processed:

The values A and B are the complex input values to the butterfly and W the complex Fourier multiplier. The output values C and D must be calculated and then stored at the same memory address from which A and B were obtained.

For the sake of initial simplicity we shall assume that all the components of A, B and W are present in the RISC registers. The Am29050 processor has 192 general-purpose registers available and, more importantly, directly wired to the pipelined FPU. The problems of efficiently getting the information into those registers will be discussed later.

At first glance, calculation of the butterfly requires a total of eight multiplications and eight additions/subtractions. Since the Am29050 processor is capable of initiating and completing a floating point operation every cycle, it appears that the FFT butterfly would take 16 FPU operations and therefore 16 cycles to complete. However, by rearranging the butterfly terms the number of instructions can be reduced to 10 instructions:

Tmp_{re} = (W_{re} \times B_{re}) - (W_{im} \times B_{im})

Tmp_{im} = (W_{re} \times B_{im}) + (W_{im} \times B_{re})

C_{re} = A_{re} + Tmp_{re}

D_{re} = A_{re} - Tmp_{re}

C_{im} = A_{im} + Tmp_{im}

D_{im} = A_{im} - Tmp_{im}

which can be implemented in Am29050 RISC code as:

FMUL TR, WR, BR      ; Tmp_re = Wre * Bre
FMUL T1, WI, BI      ; T1     = Wim * Bim
FMUL TI, WR, BI      ; Tmp_im = Wre * Bim
FMUL T2, WI, BR      ; T2     = Wim * Bre
FSUB TR, TR, T1      ; Tmp_re -= T1
FADD TI, TI, T2      ; Tmp_im += T2
FADD CR, AR, TR      ; Cre = Are + Tmp_re
FSUB DR, AR, TR      ; Dre = Are - Tmp_re
FADD CI, AI, TI      ; Cim = Aim + Tmp_im
FSUB DI, AI, TI      ; Dim = Aim - Tmp_im
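In C the same 10-operation ordering might be written as below (a sketch for reference, not the article's code); each line corresponds to one of the FMUL/FADD/FSUB instructions above, which makes the register-level scheduling in Figure 15 easier to follow.

/* One decimation-in-time butterfly using the rearranged 10-operation form. */
void dit_butterfly(float Are, float Aim, float Bre, float Bim,
                   float Wre, float Wim,
                   float *Cre, float *Cim, float *Dre, float *Dim)
{
    float tr = Wre * Bre;        /* FMUL TR, WR, BR */
    float t1 = Wim * Bim;        /* FMUL T1, WI, BI */
    float ti = Wre * Bim;        /* FMUL TI, WR, BI */
    float t2 = Wim * Bre;        /* FMUL T2, WI, BR */

    tr = tr - t1;                /* FSUB: Tmp_re    */
    ti = ti + t2;                /* FADD: Tmp_im    */

    *Cre = Are + tr;             /* FADD */
    *Dre = Are - tr;             /* FSUB */
    *Cim = Aim + ti;             /* FADD */
    *Dim = Aim - ti;             /* FSUB */
}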

Figure 14 shows the floating point unit architecture of the Am29050 processor. The multiplier unit is made up of a three-stage pipeline (MT, PS and RU). The adder unit is also a three-stage pipeline (DN, AD and RU). Figure 15 shows the passage through the Am29050 FPU of the register values used in the butterfly (based on staging information provided in Reference 12). The fact that the two pipelines overlap and that the steps are interdependent means the butterfly is a mixture of very efficient pipeline usage intermingled with stalls. These stalls are completely

One of the reasons that DSP processors perform so well in DSP applications is that full use is made of their resources. To get maximum performance out of a RISC processor a similar programming technique must be taken. Although in due time, good 'DSP-intelligent' C compilers will become available for RISC processors, best DSP performance is currently obtained in Assembler by a programmer familiar with the chip's architecture.

In terms of available instructions there is little major difference between the scalar and super-scalar chip. Both the scalar Am29050 and super-scalar i860 RISCs have an integer and a floating point pipeline that can operate in parallel. The major advantage of super-scalar chips is that they can initiate (get started) both a floating point and an integer operation at the same time. However, other architectural features, such as the large floating point register window on the Am29050 processor, can sometimes be used to avoid the necessity of issuing dual instructions. The relative advantages depend on the DSP algorithm. For example, in a real-time FIR application, the scalar Am29050 processor outperformed the super-scalar i860, and both RISCs outperformed the DSP chips11.

The practical considerations of using RISC chips for the DFT algorithm could equally be explained using the i860 and the Am29050 processors. However, for a person unfamiliar with FPU pipelining and RISC assembler language code, the Am29050 processor has the more user-friendly assembly language and is the easier to understand. The information on the i860 performance is based on References 13 and 14. The Am29050 processor FFT results are based on my own research in modelling and spectral analysis in medical imaging6,15-17 and the use of the Am29050 processor in a fourth year computer engineering course on comparative architecture which discusses CISC, DSP and RISC architectures.

Efficient handling of the RISC's pipelined FPU

The basic reasons that DSP and RISC chips perform so well are associated with the pipelining of all the important resources. However, there is an old saying 'you don't get something for nothing'. This FPU speed is available if and only if the pipeline can be kept full. If you cannot tailor your algorithm to keep the pipelines full, then you do not get the speed. A 95 tap finite impulse response filter (FIR) DSP application11 is basically 95 multiply-and-accumulate instructions one after the other and is fairly easy to custom program for maximum performance. The FFT algorithm on the other hand is a miscellaneous mish-mash of floating point add (FADD), multiplication (FMUL), LOAD and STORE instructions, which is far more difficult to code efficiently.

The problems can be demonstrated by considering a butterfly from a decimation-in-time FFT algorithm. Similar problems will arise from the decimation-in-frequency FFT discussed earlier:

C_{re} = A_{re} + W_{re} B_{re} - W_{im} B_{im}

C_{im} = A_{im} + W_{re} B_{im} + W_{im} B_{re} \qquad (10)

D_{re} = A_{re} - W_{re} B_{re} + W_{im} B_{im}

D_{im} = A_{im} - W_{re} B_{im} - W_{im} B_{re} \qquad (11)


Efficient management of the RISC's registers

The previous section showed how to efficiently handle the Am29050 RISC FPU based on the assumption that the necessary values for the FFT butterflies were stored in the on-board registers. While the super-scalar i860 can bring in four 32-bit registers at a time from the (limited) data cache in parallel with floating point operations, this is not the situation for the Am29050 and MC88100 processors. The i960SA and Am29240 processor variants suggested in the introduction for low end embedded DSP applications are also scalar rather than super-scalar. The following discussion explains how to handle the situation on the Am29050 processor.

Consider again the butterfly equations (10) and (11). Scalar RISC chips have essentially very simple memory access instructions. The Am29050 processor has the capability of a LOAD or a STORE of a register value using the address stored in another register. For simplicity, we shall assume that these instructions can be written as:

LOAD  RegValue, RegAddress     ; RegValue = Memory[RegAddress]
STORE RegValue, RegAddress     ; Memory[RegAddress] = RegValue

The actual Am29050 syntax is a little more complex as the LOAD and STORE instructions are also used to

(32). However, for the FFT algorithm, the dual instruction capability of this processor allows integer operations (such as memory fetches) to be performed in parallel with FPU operations. Does this give the super-scalar processor an advantage over the scalar RISC processor? The general answer is a definite 'maybe' for many algorithms as it is not always possible to find suitable parallel operations. However, the FFT can be performed with great advantage since the super-scalar dual instruction capability allows floating point calculations to be moved into the stalls with the integer operations (memory moves) occurring for 'free'. In addition, although the Am29050 processor has some capability of simultaneously performing FADD and FMULT instructions, it does not have the depth of instructions available on the super-scalar i860. Using all the i860 resources to overlap integer/floating point and FADD/FMULT operations, a custom (overlapped) FFT butterfly effectively takes seven cycles13.


transparent to the programmer, but that does not make the algorithm execute any faster. The 10 FPU instructions take 15 cycles to complete. However, it is possible to fill the stall positions with address calculations, memory fetches or additional FPU instructions from another butterfly. Although this example was for the Am29050 processor, a similar analysis holds for the Motorola scalar MC88100 RISC processor†. The problem is slightly more complicated for the MC88100 as the pipelines are deeper (four and five steps) and there are only 32 registers.

By comparison with the 192 registers on the Am29050 processor, the i860 has only a few floating point registers

[Figure 14 diagram: the A and B buses feed a 32-by-32 multiplier pipeline and a denormalizer/adder pipeline, both completed by the round unit (RU), with results forwarded to the register file and the accumulators ACC0-ACC3. Note: all data paths are 64 bits wide unless otherwise noted.]

Figure 14 The floating point unit architecture of the Am29050 processor. © 1991 Advanced Micro Devices, Inc.*

"Advanced Micro Devices reserves the right to make changes in itsproducts without notice in order to improve design or performancecharacteristics.This publication neither statesnor implies any warranty ofany kind, including but not limited to implied warranties of merchant­ability or fitness for a particular application.t Me88l 00 is a registered trademark of Motorola Ltd.


[Figure 15 table: cycle-by-cycle flow of the ten butterfly instructions (FMUL, FSUB, FADD) and their transparent stalls (-s-), showing which register pairs occupy the A and B buses, the internal FPU pipeline stages (MT, PS, DN, AD, RU) and the destination registers (Cr, Dr, Ci, Di) on each cycle.]

Figure 15 Passage of the register values through the Am29050 processor FPU during the execution of a single decimation-in-time FFT butterfly. The transparent stalls (-s-) must be filled with other instructions to get maximum performance from the RISC



Listing 2 Reusing addresses stored in the large Am29050 register file cuts calculation time

STORE Cre, Aaddress            ; M[Aaddress] = Cre
STORE Cim, Aimaddress          ; M[Aimaddress] = Cim

LOAD Are, Aaddress             ; Are = M[Aaddress]
ADD  Aimaddress, Aaddress, 4   ; Aimaddress = Aaddress + 4
LOAD Aim, Aimaddress           ; Aim = M[Aimaddress]

Listing 3 One approach to bringing in real and imaginary datacomponents from memory to the Am29050 register window

The scalar Am29050 has a LOAOM (load multiple)instruction which will bring in adjacent locations ofmemory into adjacent registers automatically. This isnothing more than another name for anauto-incrementingaddressing mode. Thus the code to bring in the real andimaginary parts of A (Listing 3) can be replaced by theinstructions shown in Listing 4*.

the 10 instructions on the OSP take 20 clock cycles andthe 33 instructions on the RiSetake 33 clock cycles. Thetime required is now only off by a factor of 1.65.

The reasonwhy the OSP chips perform well is that theirFIT implementation takes full advantage of the availableresources. A similar thing must be done to get the bestfrom the Am29050 RiSe architecture. The major problemwith the address calculations is all the time manipulatingpointers usingsoftware. Thishasto be moved to hardwareto achieve any speed improvement.

The first problem to fix is the fact that it is necessarytocalculate the addressfor Aim and Cim despite the fact thatthey are the same address. Let us use additionaltemporary registers to store these calculated addresses(see Listing2). Since the same thing can be done for the 8and 0 addresses, this reduces the addressing calculationsby six out of 23 cycles: performance is now down only by1.35X.

Prepare to load the registersADD Atmp, Aaddress, 0 ; make a copy of the starting addressesADD Btmp, Baddress, 0ADD Wt, Waddress, 0

LOAD Are, AtmpADD Atrnp, Atmp, 4LOAD Aim, AtmpLOAD Bre, BtmpADD Btmp, Btmp, 4LOAD Bim, BtmpLOAD Wre, WtmpADD Wtmp, Wtmp, 4LOAD Wim, Wtmp

communicate with coprocessors and various memoryspacesl",

The real and imaginary components of A, 8 and Wwould be stored in adjacent memory locations (complexarray) since this ismore efficientthan the separaterealandimaginary components given earlier. It can be seen fromFigure 12 that the FFT memory accesses follow a verydefinite pattern. We assume that the addresses for A, 8and Ware stored in registers Aaddress etc., and theincrements needed to move onto the next butterflyaddresses are stored in registers Ainc etc.. It can be seenthat a basic requirement for efficient FFT implementationon a RiSe chip is either a multitude of registers (e.g.Am29050 processor) orthe ability to be able to reload theregisters on the fly (e.g. i860).

A simple 'bull-at-the-gate' approach to fetching andstoring the values for the butterfly of Equations (10) and(11) would generate code something like Listing1. Howdoes this match up againstthe OSP processor with all thenecessary resources to handle OSP algorithms (particularlythe addresscalculations)?We can get a rough comparisonby supposing that the OSP chip takes the same 10 FPUinstructions asdoes the RiSechip operation and assumeit requires no additional cycles to handle the addressing.On a RiSe chip the same 10 FPU instructions plusassociated memory handling require 33 instructions. Atfirst sight, things do not look promising for the RISC chipas it executes nearly 3.3 times more instructions.
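Before improving it, it is worth seeing in C just how much address arithmetic Listing 1 is really doing. The sketch below walks one pass of an in-place radix-2 transform over interleaved (re, im) data; the pointer steps play the roles of Aaddress, Baddress, Waddress and their increments. The function name, the half/wstride parameters and the loop structure are illustrative assumptions rather than code from the article.

/* One pass of an in-place radix-2 FFT over interleaved complex data.
   n       : number of complex points
   half    : butterfly span (half the sub-transform size) for this pass
   wtab    : twiddle factor table, interleaved (re, im)
   wstride : advance, in floats, between successive twiddles for this pass */
static void fft_pass(float *x, int n, int half, const float *wtab, int wstride)
{
    for (int group = 0; group < n; group += 2 * half) {
        const float *w = wtab;                     /* Waddress               */
        for (int j = 0; j < half; j++) {
            float *a = &x[2 * (group + j)];        /* Aaddress               */
            float *b = &x[2 * (group + j + half)]; /* Baddress               */
            float wr = w[0], wi = w[1];

            float tr = wr * b[0] - wi * b[1];      /* the ten FPU operations */
            float ti = wr * b[1] + wi * b[0];
            b[0] = a[0] - tr;  b[1] = a[1] - ti;   /* D overwrites B         */
            a[0] = a[0] + tr;  a[1] = a[1] + ti;   /* C overwrites A         */

            w += wstride;                          /* Waddress += Winc       */
        }
    }
}

Every butterfly costs index calculations, a twiddle-pointer update and four loads plus four stores on top of its ten FPU operations - the memory and addressing traffic that turns 10 FPU instructions into the 33 RISC instructions counted above.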

At second glance, things become more promising. DSP chips run at one instruction every two clock cycles, RISC chips at one instruction every clock cycle if the pipeline can be kept full. The addressing instructions can be placed as useful instructions in the RISC FPU pipeline stalls. Thus the 10 instructions on the DSP take 20 clock cycles and the 33 instructions on the RISC take 33 clock cycles: the time required is now only off by a factor of 1.65.

The reason why the DSP chips perform well is that their FFT implementation takes full advantage of the available resources. A similar thing must be done to get the best from the Am29050 RISC architecture. The major problem with the address calculations is all the time spent manipulating pointers in software; this has to be moved to hardware to achieve any speed improvement.

The first problem to fix is the fact that it is necessary to calculate the address for Aim and for Cim despite the fact that they are the same address. Let us use additional temporary registers to store these calculated addresses (see Listing 2). Since the same thing can be done for the B and D addresses, this reduces the addressing calculations by six of the 23 cycles: performance is now down by only 1.35X.

; Are = M[Atmp++]; Aim = M[Atmp]
; Bre = M[Btmp++]; Bim = M[Btmp]
; Wre = M[Wtmp++]; Wim = M[Wtmp]

Listing 2 Reusing addresses stored in the large Am29050 register file cuts calculation time

The scalar Am29050 has a LOADM (load multiple) instruction which will bring adjacent locations of memory into adjacent registers automatically. This is nothing more than another name for an auto-incrementing addressing mode. Thus the code to bring in the real and imaginary parts of A (Listing 3) can be replaced by the instructions shown in Listing 4*.

STORE Cre, Aaddress            ; M[Aaddress] = Cre; M[Aimaddress] = Cim
STORE Cim, Aimaddress

LOAD Are, Aaddress             ; Are = M[Aaddress]
ADD  Aimaddress, Aaddress, 4   ; Aimaddress = Aaddress + 4
LOAD Aim, Aimaddress           ; Aim = M[Aimaddress]

LOAD Are, Aaddress             ; Are = M[Aaddress]
ADD  Aimaddress, Aaddress, 4   ; Aimaddress = Aaddress + 4
LOAD Aim, Aimaddress           ; Aim = M[Aimaddress]

Listing 3 One approach to bringing real and imaginary data components from memory into the Am29050 register window

LoadMemoryCounter 2            ; prepare to fetch 2 memory values
LOADM Are, Aaddress            ; Are = M[Aaddress]; Aim = M[Aaddress + 4]

Listing 4 An alternative approach to bringing data components into the Am29050 registers

* 'LoadMemoryCounter value' is a macro for 'mtsrim cnt, (value - 1)'.

This looks like a further improvement, to two cycles from three, but it is not. With the LOAD instruction it is possible to bring in one register while another is being used. The LOADM instruction, however, makes more extensive use of all the Am29050 processor resources and therefore stalls the execution of other instructions for (COUNT - 1) cycles while the data is brought in (an Am29050 processor weakness in my opinion).

However, suppose that instead of bringing in enough data to perform one butterfly, we take further advantage of the 192 Am29050 registers and bring in and store enough data to perform four butterflies. This means that there will be only one LoadMemoryCounter and one 'adjust Aaddress' calculation for all the A value fetches during those butterflies: a considerable saving.



The code then becomes as shown in Listing 5†.

LoadMemoryCounter 8
LOADM Are, Aaddress            ; Bring in 4 complex A numbers
LoadMemoryCounter 8
LOADM Bre, Baddress            ; Bring in 4 complex B numbers

.set n = 0
.rep 4
ADD Wreaddress, Wreaddress, Winc
LOADoffset (Wre + n), Wreaddress
ADD Wimaddress, Wimaddress, Winc
LOADoffset (Wim + n), Wimaddress   ; Bring in the four complex W numbers
.set n = n + 1
.endrep

*** FPU usage ***              ; Do 4 intermingled FFT butterflies

LoadMemoryCounter 8
STOREM Cre, Aaddress           ; Output 4 complex C numbers
LoadMemoryCounter 8
STOREM Dre, Baddress           ; Output 4 complex D numbers

ADD Aaddress, Aaddress, Ainc   ; Get ready for the next butterfly set
ADD Baddress, Baddress, Binc

Listing 5 An efficient approach to loading the Am29050 registers from memory

† 'LOADoffset REG1, REG2' is a macro equivalent to 'LOAD %%(&REG1 + n), REG2'.
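In C terms, the blocking performed by Listing 5 amounts to gathering the operands of four butterflies, running the four butterflies back to back so that their FPU operations can be intermingled, and only then writing the results out. The sketch below illustrates the idea; the function name and the assumption that the four twiddle factors have already been gathered into a small array are mine, not the article's.

typedef struct { float re, im; } cplx;

/* Process four butterflies per call so that one block load, one block store
   and one address update are shared by all four (cf. Listing 5). */
static void butterfly_block4(const cplx *A, const cplx *B, const cplx *W,
                             cplx *C, cplx *D)
{
    cplx a[4], b[4], w[4], c[4], d[4];
    int i;

    for (i = 0; i < 4; i++) {            /* block loads: four complex A, B, W */
        a[i] = A[i]; b[i] = B[i]; w[i] = W[i];
    }
    for (i = 0; i < 4; i++) {            /* four intermingled butterflies     */
        float tr = w[i].re * b[i].re - w[i].im * b[i].im;
        float ti = w[i].re * b[i].im + w[i].im * b[i].re;
        c[i].re = a[i].re + tr;  c[i].im = a[i].im + ti;
        d[i].re = a[i].re - tr;  d[i].im = a[i].im - ti;
    }
    for (i = 0; i < 4; i++) {            /* block stores: four complex C, D   */
        C[i] = c[i]; D[i] = d[i];
    }
}

With some forty FPU operations and all the loads and stores visible at once, a compiler for a machine with a large register file has plenty of independent work with which to fill the FPU pipeline stalls.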

This reduces the total time for four butterflies from 132 clock cycles to 94, or approximately 24 per butterfly. With the faster instruction cycle of the Am29050, it is performing within 1.2X of a DSP chip for the FFT algorithm: not bad for a general purpose scalar RISC processor. It has been shown (Reference 5) that, for equivalent clock rates, the Am29050 handled DSP algorithms with between 50% and 200% of the efficiency of a specialized DSP chip, depending on the algorithm chosen.

Taking the same FFT programming approach with the i860 RISC processor would just make it curl up and die. Its registers are simply not designed to be used in the same way as those of the Am29050 chip. The Am29050 processor has 192 registers attached to the FPU; the i860 has only 32. A different approach that makes use of the dual instruction capability of the super-scalar i860 must be taken. The i860 has the capability of overlapping the fetching of four registers from a data cache over a 128 bit wide bus (an integer operation) with the use of a different bank of registers in conjunction with the FPU (a floating point operation). By combining two butterflies, making good use of its ability to fetch four registers at a time and exploiting its really convoluted (flexible) FPU architecture, the i860 achieves an extremely efficient seven cycles per butterfly. This, together with its faster instruction cycle, gives the i860 a 2X performance edge over most DSP chips for the FFT algorithm.

Full details on implementing the FFT algorithm on the Am29050 processor are beyond the scope of this article. Further information can be found in Reference 18.

How can you improve the DSP performance of RISC chips?

In the previous section, we indicated that the RISC chips can already give equivalent or better performance than DSP chips in providing an efficient platform for implementing the FFT algorithm.



Table 2 provides details of the FFT performance of a number of integer and floating point DSP and RISC processors.

The FFT algorithm is not just 'butterflies' in the FPU. There are loop and bit-reverse addressing overheads to be considered. The performance figures are gleaned from the various processor data books, with the timings scaled up to the latest announced clock rates. The Am29050 timings are based on my own research using a 25 MHz YARC card and an 8 MHz STEB board, scaled up to a 40 MHz clock. The standalone STEB board is an inexpensive evaluation board configured for single cycle memory but, to keep costs down, with overlapped instruction and data buses, so that Am29050 performance will be degraded by bus conflicts. The PC-based YARC processor card avoids the bus conflicts but uses multi-cycle data memory.

Comparing the timings for the FFT on the DSP and RISC processors is rather like comparing apples and oranges. Some of the code is more customized than the rest, and the full details of the limitations or set-up conditions are not always obvious, so take the timings with a heavy pinch of salt. If the timings differ by 10%, then there is probably nothing much to choose between the chips in terms of DSP performance. If the difference is more than 50%, then perhaps the processor has something that will very shortly be stolen and added to the other chips (or the conditions were non-obviously different).

In Reference 5 the (fictitious) DSP-oriented comprehensive reduced instruction set processor (the Smith's CRISP) was introduced. This is a scalar RISC because of the cost associated with using the current super-scalar RISCs in embedded DSP applications. It was essentially an Am29050 processor, with its large register file, combined with some elements from the i860. The major improvement recognized was the need to have sufficient resources to allow the memory and FPU pipelines to be operated in parallel for more of the Am29050 processor instructions. Improvements were also suggested in allowing more instructions where the FADD and FMULT operations were combined (à la i860). The FFT performance for the CRISP was simulated and is given in Table 2.

If you read the literature on RISC and DSP chips, you will notice that the inverse DFT takes longer than the forward DFT. This is because the inverse DFT requires that each data point be normalized by dividing by M, the number of data points. Division takes a long time with floating point numbers. What is needed is a fast floating point divide by 2.0 equivalent to the integer shift operations. This is available as an FSCALE instruction on the DSP96002, and as a one or two cycle software operation on the Am29050 and i860 processors (Reference 19). However, in many applications the scale factor is just ignored as being irrelevant to the analysis.
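The 'one or two cycle software operation' exploits the fact that multiplying an IEEE-754 value by a power of two only changes its exponent field. The C sketch below illustrates the trick; it is not the routine of Reference 19, and it ignores zeros, denormals, infinities and NaNs, which a production version must handle.

#include <stdint.h>
#include <string.h>

/* Return x * 2^k by adjusting the IEEE-754 single precision exponent field
   directly (bits 30..23).  Valid only while the result stays a normal number. */
static float scale_by_pow2(float x, int k)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);        /* reinterpret the bits, no conversion */
    bits += (uint32_t)k << 23;             /* add k to the biased exponent        */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* Example: normalizing an M = 1024 point inverse FFT then costs one integer
   add per sample - scale_by_pow2(x[i], -10) - instead of a floating divide. */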

Reference 5 suggested that there was one major flaw with both the Am29050 and the i860 chips as DSP processors, made evident by the implementation of the FFT algorithm. Neither architecture supports the bit-reversed memory addressing which is required in the last pass of the FFT to reorder the output values correctly. Done in software, this requires an additional 20% overhead. It was suggested that the overhead be reduced by adding an external address (bit-reverse) generator.
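For completeness, the software pass whose roughly 20% cost is being discussed looks like the sketch below; a bit-reversed addressing mode (or the suggested external address generator) removes this loop entirely. The helper names are mine.

typedef struct { float re, im; } cplx;

/* Reverse the low log2n bits of j. */
static unsigned bit_reverse(unsigned j, int log2n)
{
    unsigned r = 0;

    for (int b = 0; b < log2n; b++) {
        r = (r << 1) | (j & 1u);
        j >>= 1;
    }
    return r;
}

/* Final reordering pass of an in-place FFT: swap x[j] and x[bit_reverse(j)]. */
static void bit_reverse_pass(cplx *x, int log2n)
{
    unsigned n = 1u << log2n;

    for (unsigned j = 0; j < n; j++) {
        unsigned r = bit_reverse(j, log2n);
        if (r > j) {                        /* swap each pair only once */
            cplx t = x[j];
            x[j] = x[r];
            x[r] = t;
        }
    }
}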


Table 2 Comparison of the timings for Radix 2 and Radix 4 FFT algorithms on a number of current DSP and RISC processors. Based on Reference 5

Processors: DSP - TMS32025, TMS320C30, DSP56001, DSP96002, ADSP2100, ADSP21000; RISC - i860, Am29050, CRISP
(timings are listed in this column order; not all timings were reported for every processor)

Type:                 Integer, FP, Integer, FP, Integer, FP, FP, FP, FP
Clock speed (MHz):    50, 40, 33, 40, 16.7, 33.3, 40, 40, 50
Instr. cycle (ns):    80, 50, 60, 50, 60, 30, 25, 25, 20

Radix 2
  256 Complex (ms):   1.8, 0.68, 0.94, 0.32, 0.85, 0.135, 0.18, 0.69, 0.36
  256 Bit rev. (ms):  0.22, 0.79
  1024 Complex (ms):  15.6, 1.97, 1.04, 1.11, 3.74
  1024 Bit rev. (ms): 1.11, 3.74

Radix 4
  256 Complex (ms):   1.2, 0.53, 0.45, 0.121, 0.44, 0.26
  256 Bit rev. (ms):  0.54
  1024 Complex (ms):  2.53, 1.81, 2.23, 0.577, 2.13, 1.2
  1024 Bit rev. (ms): 2.52

If you are in need of a fix 'now', it would not be that difficult, for a fixed size FFT application, to add a little external hardware to 'flip' the effective address line positions at the appropriate time during the final FFT pass.

However, it was recently pointed out to me that the flaw was not in the processors but in the algorithm. There are FFT algorithms other than the decimation-in-time/frequency ones discussed here. These include algorithms that use extra memory (not-in-place FFTs) and avoid the necessity of performing bit-reverse addressing. Such algorithms seem to have dropped out of sight over the last 20 years. My next task will be to investigate both DSP and RISC processors using these algorithms to determine whether anything is to be gained by revisiting them. After all, the FFT algorithm was known to other authors long before Cooley and Tukey 'discovered' it!
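One member of that out-of-sight family is the Stockham 'autosort' FFT, which ping-pongs between two buffers and writes the results of every pass directly into natural order, so no bit-reverse correction is ever required. The recursive radix-2 formulation below is my own sketch of the idea (forward transform, negative exponent convention), not code from the article or its references; the price of losing the bit-reverse pass is the extra work array.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double re, im; } cpx;

static cpx cadd(cpx a, cpx b) { cpx r = { a.re + b.re, a.im + b.im }; return r; }
static cpx csub(cpx a, cpx b) { cpx r = { a.re - b.re, a.im - b.im }; return r; }
static cpx cmul(cpx a, cpx b) { cpx r = { a.re * b.re - a.im * b.im,
                                          a.re * b.im + a.im * b.re }; return r; }

/* One Stockham stage: n is the current transform length, s the stride and
   eo selects which buffer currently holds the data.  Each stage reads x and
   writes y in natural order, then the two buffers swap roles. */
static void stockham(int n, int s, int eo, cpx *x, cpx *y)
{
    if (n == 1) {
        if (eo)                            /* result is in the wrong buffer: copy it across */
            for (int q = 0; q < s; q++) y[q] = x[q];
        return;
    }

    const int m = n / 2;
    const double theta0 = 2.0 * M_PI / n;

    for (int p = 0; p < m; p++) {
        const cpx wp = { cos(p * theta0), -sin(p * theta0) };
        for (int q = 0; q < s; q++) {
            const cpx a = x[q + s * p];
            const cpx b = x[q + s * (p + m)];
            y[q + s * (2 * p)]     = cadd(a, b);           /* even outputs */
            y[q + s * (2 * p + 1)] = cmul(csub(a, b), wp); /* odd outputs  */
        }
    }
    stockham(n / 2, 2 * s, !eo, y, x);                     /* ping-pong    */
}

/* Forward DFT of x[0..n-1], n a power of two; y is an n-element work array.
   The result is returned in x, already in natural order. */
void fft_no_bit_reverse(int n, cpx *x, cpx *y)
{
    stockham(n, 1, 0, x, y);
}

Every read and write in a stage is a short contiguous burst, which suits load/store-multiple style hardware, and the output needs no reordering pass at all.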

CONCLUSION

In this applications oriented tutorial article, we have covered a very wide range of topics associated with implementing the discrete Fourier transform (DFT) using the efficient fast Fourier transform (FFT) algorithm. (Make sure you can explain the difference between the DFT and the FFT if you want to make a DSP expert happy.)

We first examined an industrial application of the DFT and explained the importance of correctly handling the data if the results were to mean anything. The theoretical aspects of generating an efficient algorithm to implement the DFT were then discussed and a class of FFT algorithms detailed. We examined the architectural features required in a processor to handle the FFT algorithm and discussed how the current crop of processors meets these requirements.

This final section was dedicated to the details of implementing the FFT on a scalar (Am29050) and a super-scalar (i860) RISC chip. It was shown that, by taking into account the architecture of the RISC chips, it was possible to get FFT performance out of these chips approaching or better than that of current DSP chips.

Methods for improving the DSP performance of current RISC processors were indicated. It has been stated in the literature (Reference 5) that the lack of a bit-reversed addressing mode penalized the RISC chips, as this was a standard DSP processor capability. However, this is not a problem with the RISC processors but rather with the choice of FFT algorithm. There are many FFT algorithms that do not require a bit-reverse address correction pass, although these have been ignored in the literature for many years. By making proper use of the RISC processor's deep pipelined FPU, large register bank or equivalent dual instruction capability, and specialized MMU functionality, it was quite obvious that FFT really did stand for fRISCy Fourier transforms!

ACKNOWLEDGEMENTS

The author would like to thank the University of Calgary and the Natural Sciences and Engineering Research Council (NSERC) of Canada for financial support. Bryan Long of Beta Monitors and Control Ltd, Calgary, Canada was kind enough to provide the data for the 'Industrial Application' section and to make constructive comments on this note. The Am29050 processor evaluation timings were generated using a YARC card and a STEB board supplied courtesy of D Mann and R McCarthy under the auspices of the Advanced Micro Devices University Support Program.

NOTE ADDED IN PROOF

AMD has recently announced (June 1993) an integer RISC microcontroller, the Am29240, which has an on-board single cycle hardware multiplier. This processor performs the FFT algorithm in cycle times close to those of the Am29050 floating point processor (preliminary results). Integer RISC processors will be the subject of a future article.


REFERENCES

Pre", W H, Flannery, B P, Teukolsky, S A and Vettrling, W T,Numerical Recipes in C - TheArt of Scientific Computing CambridgeUniversity Press, Cambridge (1988)

2 Burrus, C 5 and Parks, T W DFT/FFT and Convolution Algorithms ­Theory and Implementation John Wiley, Toronto (1985)

3 Proakis, JG and Manolakis, 0 G Digital SignalProcessing- Principles,Algorithms and Applications Maxwell Macmillan, Canada, Toronto(1992)

4 Harris, FJ 'On the use of windows for harmonic analysis with thediscrete Fourier Transform' Proc. IEEE, Vol 66 (1978) pp 51-83Smith, M R'How RISCy is DSP?' IEEE Micro Mag. (December 1992)pp 10-23

6 Smith, M R, Srnit, T J, Nichols,S W, Nichols, S T, Orbay, HandCampbell, K 'A hardware implementation of an autoregressivealgorithm' Meas. Sci. Techno!. Vol 1 (1991) pp 1000-1006

7 Papamichalis, P and So, J 'Implementation of fast fourier transformon the TMS32020' in Digital Signal Processing Applications withTMS320 Family Vol 1, Texas Instruments (1986) pp 69-92

8 Papamichalis, P 'An implementation of FFT, DCT and othertransforms on the TMS320C30' in Algorithms and ImplementationsVol 3, Texas Instruments (1990)

9 Sohie, G Implementation of Fast Fourier Transforms on Motorola'sDSP56000lDSPS6001 and DSP96002 Digital Signal ProcessorsMotorola Inc. (1989)

10 'Analog devices' in Digital Signal Processing Applications using theADSP-2100 family Vol 1, Prentice-Hall, Englewood Cliffs, NJ(1992)

11 Smith, M R'To DSPor not to OSP'Comput Appl. J. No 28 (August/September 1992) pp 14-25

12 Am29050 32-8it Streamlined Instruction Processor, User's Manual(1991)

13 Atkins, M FAST Fourier Transforms in the i860 Microprocessor IntelApplication Note Ap-435, Intel Corporation (1990)


14 Margulis, N i860 Microprocessor Architecture Osborne McGraw-Hill (1990)

15 Smith, M R, Nichols, S T, Henkelman, R M and Wood, M L 'Application of autoregressive moving average parametric modeling in magnetic resonance image reconstruction' IEEE Trans. Med. Imag. Vol MI-5 No 3 (1986) pp 132-139

16 Mitchel, D K, Nichols, S T, Smith, M R and Scott, K 'The use of band selectable digital filtering in magnetic resonance image enhancement' Magn. Reson. Med. Vol 19 (1989) pp 353-368

17 Smith, M R, Nichols, S T, Constable, R T and Henkelman, R M 'A quantitative comparison of the TERA modeling and DFT magnetic resonance image reconstruction techniques' Magn. Reson. Med. Vol 21 (1991) pp 1-19

18 Smith, M R The FFT on the Am29050 Processor AMD Application Note PID 17245 (to be published)

19 Smith, M R and Lau, C 'Fast floating point scaling operations on RISC and DSP processors' Comput. Appl. J. (to be published)

Dr Michael Smith emigrated from the UK to Canada in 1966 to undertake a PhD in physics, during which he was introduced to computer programming. After spending several years in digital signal processing applications in the pathology area, he left research to teach junior and high school science and mathematics. He joined the Electrical and Computer Engineering Department at the University of Calgary, Canada in 1981. His current interests are in the software and hardware implementation of ARMA modelling algorithms for use in resolution enhancement and artifact reduction of magnetic resonance (MRI) images.
