New Fast and Area-Efficient Pipeline 3-D DCT Architectures...3-D DCT, including visual tracking, video coding and watermarking [25-29]. Numerous 1-D and 2-D DCT architectures have

1

NewFastandArea-EfficientPipeline3-DDCTArchitectures

SaadAl-Azawia,OmarNiboucheb,SaidBoussaktac,GayeLightbodydaCollegeofEngineering,DiyalaUniversity,Diyala,Iraq;b,dFacultyofComputingandEngineering,UlsterUniversity,

UK;cSchoolofEngineering,NewcastleUniversity,[email protected],[email protected],[email protected],

[email protected]

AbstractTheefficient implementationof 3-D transforms is a challenging taskdue to the computation complexity,memory and arearequirementsofsuchtransforms.Oneimportant3-Dtransformisthe3-DDiscreteCosineTransform(3-DDCT)usedinmanyimageandvideoprocessingsystems.Inthispaper,twonewpipelinearchitecturesforthe3-DDCTcomputationusingthe3-DDCTVector-Radixalgorithm(3-DDCTVR)arepresented.Thesearchitecturesarescalableandparameterisablewithregardstodifferentwordlengthsandpipelininglevels.Theirarithmeticcomponentrequirementsarereducedtotheorderof!(#$%&')incontrastwith!(')for3-DDCTarchitecturesintheliterature,whileatthesametimetheycankeepsimilarorbetterarea-timecomplexity.Keywords:3-DDiscreteCosineTransform(DCT),Row-Column(RC),Row-Column-Frame(RCF),Vector-Radix,FPGA1.Introduction

Transforms such as the Fourier Transform (FT) [1-5],WaveletTransform(WT)[6-9],andCosineTransform(CT)[10-14]playacriticalpartinvariousDigitalSignalProcessing(DSP)applications,includingaudio,imageandvideosystems.Muchof the usefulness of these transforms arises from theirfrequencyandtime-frequencyrepresentationsandpropertiesincluding the decorrelation property, energy compactness,and theavailabilityof fastalgorithms for their computation.Nevertheless,eventhefastalgorithmsthatimplementthesetransformsarestillverycomputationallyintensive.Thus,thesetransformscanbecomeabottleneckintermsofthesystem’sspeed,andcontributegreatlytotheirareausageandpowerconsumption[1-3,6-8,10-19].Foritsroleinmanyimageandvideo applications, including the JPEG, MPEGx and H.26xcompressionstandards,theDCThasreceivedagreatdealofresearch interest [20-24]; the 1-D and2-DDCT arenow theestablished transforms formanyapplicationsand standards.Further,therearemanynewandemergingapplicationsforthe3-D DCT, including visual tracking, video coding andwatermarking[25-29].

Numerous 1-D and 2-D DCT architectures have beensuggestedintheliterature[30-39].Exploitingtheseparabilityprincipleofthetransform,2-DDCTcoresbasedonthe1-DDCTRow-Column(RC)approacharesuggestedin[33-36];yetveryfewarchitectures that implement the3-DDCTcanbe found[38-45].Traditionally,the3-DDCThasbeenimplementedbycascading stages of the 1-DDCT as in thewell-known Row-Column-Frame (RCF) approach. Noteworthy differencesbetween architectures in the literature are their level ofparallelisation in terms of the number of stages and thenumber of 1-DCT cores per stage, which leads to differenttrade-offs between circuit complexity and throughput. Onecommon architecture employs three stages of one 1-D DCTcore and N3+N2–word transpose memory [40-42].Parallelisationcanbeapplied to the first two1-DDCTcores

whichinfactimplementsa2-DDCTtransform,leadingtotheutilisation of 2N+1 1-D DCT processors and N3+N-wordmemory[41,42].Anotherclassofthe3-DDCTarchitecturesmultiplexes the 1-D DCT transforms involved in itscomputationontoasingle1-DDCTarchitecture.Suchaclassof architecture requires N3-word memory [41, 42]. Thereductionachievedinhardwareutilisationcomesatthecostofalowerthroughput.Usingthree1-DCTcorestoimplementthe 3-DDCT achieves a throughput three times higher thanwhenemployingasingle1-DDCTprocessor.ThethroughputisN-foldaugmentedviaparallelisationofthe1-DDCTprocessors[42].The1-DDCTcoresemployedinthe3-DDCTarchitecturecanusethetransform’sfastalgorithm,distributedarithmeticorROMbaseddesigns[38].Sucharchitecturesexhibitirregularstructures, lackofmodularity,andcomplexcontrol.Anotherclassof the3-DDCTarchitecturereliessolelyonthesystolicapproachwithitswellestablisheddesignmethodology[43].In[44, 46], high speed and low complexity pipeline n-D DCTarchitectures are proposed using the regular 1-D DCT andtensor product operations. The proposed architectures arebasedonthe1-Dand2-DDCTarchitecturesin[47].

In thispaper, twonewpipelineandscalablearchitecturesthat implement the 3-D Discrete Cosine Transform Vector-Radix (3-D DCT VR) are introduced. The presentedarchitecturesareparameterisableintermsofwordlengthandpipeline stages. Further, no block memory is used for datatransposition. These architectures have been implementedandtested;forinstance,anFPGA-basedimplementationofa512×512×8-worddatausingatransformlengthof8×8×8-wordcube size and a 14-bit wordlength has achieved a workingfrequencyof330MHzandaprocessingtimeof6.4ms.Thus,80000framescanbeprocessedineverysecond.

The remainder of this paper is organised as follows. Insection2,thebackgroundoftheDCTtransformand3-DDCTVRalgorithmareprovided.Sections3,4and5presentthetwonew architectures for 3-D DCT computation. The results

2

obtainedarediscussedinsection6andconclusionsaregiveninsection7.

2.Backgroundandthe3-DDCTVRAlgorithm

The 3-D DCT coefficients of a N×N×N data cube arecomputedasfollows:) *+, *&, *-

=801+01&01-

'- 3 4+, 4&, 4-

56+

789:

56+

7;9:

56+

7

3

Fortheremainingcombinationsofodd/evenindices*+,*&and*-, one can divide the computation of the transform asfollows:

The set of equations (6)-(13) represents a single butterflycomputationofaDecimationInFrequency(DIF)VRalgorithmasshowninFigure2.Itcomputeseightpoints;abutterflycanreceive aN×N×N data cube at its input and outputs 8 datacubes of N/2×N/2×N/2-word each; the process can be

repeated until58

Rdata cubes of 2× 2× 2-word each are

computed. Thus, the flow graph of the whole butterfly

computationconsistsof#$%&'stageswith58

Rbutterfliesper

stage[25].Theoutputfromthelastbutterflystageisfedtothepostadditionstagesthenitisbit-reversed.Thepostadditionoperations, shownby the termsoutsidebraces in the setofequations (6)-(13), are then carried out. Further to the

)(2*+, 2*&, 2*- + 1) = gh h h [23c::+(4+, 4&, 4-) cos ∅-]

[

789:

[

7;9:

[

7

4

reductioninarithmeticcomplexityandprocessingtime,the3-DDCTVRalgorithmdoesnotrequiretransposememoryandexhibitsaregularbutterflystructurewhichismoresuitableforhardware and software implementation than the RCFapproach.Furtherdetailsaboutthe3-DDCTVRalgorithmcanbefoundin[25].

3.New3-DDCTVectorRadixArchitectures

Two new architectures are presented; namely the SinglePath Data Flow 3-D DCT Architecture (SPDFA) and theDualPathDataFlow3-DDCTArchitecture(DPDFA).Thedifferencebetweenthemliesinthenumberofwordsfedtotheadjacentbutterfly and how the arithmetic operations are scheduledwithineachbutterfly stage; thishas led to thederivationoftwo structures with different hardware requirements. Botharchitecturesarebuiltaccordingtothegenericblockdiagramof Figure 1. The butterfly calculation consists of #$%&'parameterisedandscalablestagesasdescribedbythesetofequations (6)-(13) and illustrated in Figure 2. The datareordering is common to both SPDFA andDPDFA, however,theinternalarchitectureofthebutterfly,post-additionstagesand the 3-D Bit Reverse Ordering (3-D BRO) stage arearchitecture-dependent. Of the two presented structures,SPDFA exhibits a single line of data between neighbouringbutterfly stages. It is more efficient in memory usage asintermediate results are fed back to memory elements;however, using these feedback loops prevents any furtherpipelining. DPDFA is a dual-path data flow feed-forwardarchitecture. There are two data lines between adjacentbutterflies and further pipelining is a simple task; thearchitecturehoweverrequiresmorememorythanSPDFA.

The presented architectures both partition the inputsequenceintocubesofN×N×N-wordorN-blocksofN×N-word.The inputdata is reorderedaccording to (2).The reorderingprocessisperformedbyshufflingwordsalongtherow,columnand frame dimensions. It includes dividing data into odd-indexedandeven-indexedwordsandretrogradeindexing.As

anexample,forNindicesarrangedas“0,1,2,3,4,5,6,…N-1”,thereorderedsequencewillbe“0,2,4,6,…,N-2,N-1,N-3,N-5,…..,1“.ThisstageisimplementedusingadualportblockRAM which permits writing and reading operations to beperformedondifferentlocationsduringthesamecycle.Thus,for anN×N×N-word cube, thememory size required for the

reorderingoperationis5

&+ 1 '

&–wordwithalatencyof58

&

cycles as onlywriting operations are carried out during thisperiod.

4.SinglePathDataFlow3-DDCTArchitecture

SPDFA is composed of a 3-D reordering stage, #$%&(')butterflycomputationstages, threepostadditionsub-stagesanda3-DBitReverseOrder(3-DBRO)stage.

4.1ButterflyStages

Thereordereddatafromthe3-Dreorderingstageisfedtothebutterflystagesatarateofonewordperclockcycle.AsshowninFigure3,#$%&'butterflystages(m=1,2,3,..,#$%&')areused.Eachbutterflystagecanbefurtherdividedinto threesub-stagesandamultiplierasshowninFigure4.Asub-stageconsists of two add/subtract elements for carrying outadditionandsubtractionoperationsbetweenthetwohalvesofeach inputalong the threedimensionsofdata, a registerand a switch. The multiplier is used to multiply the output

Figure2.Singlebutterflyofthe3-DDCTDIFVRalgorithm

5

words by appropriate Twiddle Factors (TFs) which are pre-computedandstoredinalookuptable(LUT).

For the sake of explaining, thewords3 4+, 4&, 4- at theinput of the first butterfly stage can be indexed as 3 4+ +4&×' + 4-×'

& . The first sub-stage performs addition andsubtractionbetweenthetwohalvesoftheinputdatacube;the

first half contains words indexed from 0to58

&− 1 and the

secondpartthewordswithindicesfrom58

&H$'

-− 1.During

thefirst58

&clockcycles,thefirstpartofthedataisstoredin

Register1(oflength58

&-word)beforebeingfedtoaddersalong

with the inputdata fromthesecondhalfduring thenext58

&

cycles.Duringthisperiod,thesubtractionoperationresultsarestored inRegister1whiletheadditionresultsare fedtothenext sub stage.Once this is completed, it is the turn of thesubtractionresultsstored inRegister1 tobefedtothenextsub-stage.TheselectionofwhichpartofthedatatobestoredinRegister1,fedtotheaddersorfedtothenextsub-stageismanagedbythecontrolsignalofSwitch1,whichchangesits

valueevery58

&cycles.

What the first sub-stagecarriesoutoncubesofdata, thesecond sub-stage performs it on blocks ofN×N-word of thesamedatacube.Omittingchangesalong4-,eachblockisagaindividedintotwohalves;onehalfincludesindices4+ + 4&×'

from0H$5;

&− 1whilethesecondhalfincludeswordsofthe

sameblockwithindicesfrom5;

&H$'

&− 1.Thebehaviourof

the second sub-stage is similar to the first one; except that

Register2 isof length5;

&-wordandtheperiodofthecontrol

signalforSwitch2is'&cycleswithadutycycleof50%.The third sub-stage implements addition and subtraction

between the twohalvesofeachcolumn ineachblockusing

Register3(oflength5

&-word).Omittingchangesof4-and4&,

thedataineachcolumnisdividedintotwohalveswithindices

rangingfrom0H$5

&− 1andfrom

5

&H$' − 1.Thewordsof

thefirsthalfofthecolumnarestoredinRegister3,theyarethen fed to theaddersalongwith thecolumn’ssecondhalf.The results of the addition operation are multiplied by theappropriate TFs. After which, the results of the subtractionoperationwhichwerefirststoredinRegister3arefedtothemultiplier for the multiplication by the TFs. The multiplieroutputisinputtothenextbutterflystage.Aswithsub-stages1 and 2, Switch 3 multiplexes data and its control signal is

periodicandchangesitsvalueevery5

&cycles.

In thegeneralcaseof themthbutterflystage,data is split

into 2-u6- cubes ofv

&wx<

-

words; the first butterfly sub-

stage is used to perform the addition and subtractionoperationsbetweenthetwohalvesofeachinputdatacube;

the words involved are indexed from0to58

&y− 1 and from

58

&yto

58

&yx<.Duringthefirst

58

&ycycles,thedatacube’sfirsthalf

isstoredinRegister1;inthenext58

&ycycles,theadditionand

subtractionoperationstakeplace;theresultsoftheadditionoperationarefedtotheadjacentbutterflysub-stagewhiletheresultsofthesubtractionoperationsarestoredinRegister1

before being fed to the adjacent sub-stage in the next58

&y

cycles.Inasimilarwayandwithanappropriateswitching,thesecondandthirdbutterflysub-stagesimplementtheadditionand subtraction operations between the first and secondhalves of the data blocks and columns, respectively. The

lengthsofregisters,Register2andRegister3,is5;

&y-wordand

5

&y-word, respectively, which allows for storing half of each

block and columnofdata as appropriate. Themultiplicationoperation by a twiddle factor (TF) takes place once allarithmeticoperationsofsub-stage3havebeencarriedout.

ButterflyStagelog2N(m=log2N)

ButterflyStage2(m=2)

ButterflyStage1(m=1)

Reordereddata

Out

Figure3.Theblockdiagramofbutterflystages

+

-

N3/2m

Switch1

Register1

+

-

N2/2m

Switch2

Register2

+

-

N/2m

Switch3

Register3

×

Sub-stage1 Sub-stage2 Sub-stage3

in outTF

Figure4.SPDFAmthbutterflyinternalarchitecture

a.

-Register 1 -

Mux 1

PIn

PRegister 2

Register 3

-Mux 2

PRegister 4

-Mux 3

2POut

Post addition Sub-stage 1

P=N


P=1


P=N23-D BROIn Out

b.

Figure5.a.Aparameterisedpostadditionsub-stageforSPDFA.b.Apostadditionstageand3-DBRO

4.2PostAdditionand3-DBROStages

The third part of SPDFA is the post addition stage whichperforms the computation of the terms outside the curlybrackets in (6)-(13). Reflecting the three dimensions of theinputdata,thepostadditionstagecanbedividedintothreesub-stages; each sub-stage carries out addition operationsover a given dimension. In the first, second and third postaddition sub-stage, the addition operations are carried outwithin the same N×N-word block, the same column or the

6

same data cube; respectively. Hence, the length of theregisters,used inFigure5and labelledasparameterP,mayvary. Still the internal architecture of each sub-stage isidenticallythesame.

Theoutputofthethirdpostadditionstageisfedtothe3-DBRO stage which performs data reordering as the fastalgorithmused introducesabit reversalpermutationon thebinary indicesof the results.Bit reversal is performedalongeachrow,columnandframedirectionsineachN×N×N-worddata cube using a regular bit reversal algorithm [25]. Theoutputfromthisstagerepresentsthe3-DDCTcoefficientsof

the input data. It is implementedusing a3'

4− 1 '

2-word

dual-port block RAM. This stage is placed next to the postadditionstagetoactasabufferforthesubsequentsystemifrequired;forinstance,itcanbeintegratedwithaquantizerasinconventionaldatacompressionalgorithms.

5.DualPathDataFlow3-DDCTArchitecture

DPDFAisadualdatapatharchitectureforthe3-DDCTVRcomputation. It is devised toproduceahigh speed3-DDCTarchitecturewhichcanbeeasilyretimedandpipelined.DPDFAconsistsof3-Ddatareordering,butterflystages,postadditionand3-DBROstages.The3-Ddatareorderingstageisthesameasthatpresentedearlierinthepaper. 5.1ButterflyStages

The scheduling of arithmetic operations in DPFDA isdifferent from SPDFA. Rather than feeding the subtractionoperations intermediate results back to the same sub-stageregisterasinSPDFA,theresultsoftheadditionandsubtractionarefedforwardtothenextsub-stageortothenextstage.Thisreducesthetimeduringwhichregistersareutilisedforstoringpartial results, adds another lineof data for communicationbetweenadjacentstagesandsub-stages,andhenceincreasesthe number of required multipliers to cope with thecomputationoftwocoefficientsperclockcycle.However,thissimplifiespipeliningandretimingintheDPDFA.

DPDFA comprises #$%&' butterfly stages; each can bedivided into three sub-stages, registers, switches and twomultipliers.Agenericsub-stageconsistsof twoadd/subtract

elementsforcarryingoutadditionandsubtractionoperationsbetween the two halves of each input along the threedimensionsofdata.Italsocontainstworegistersandaswitchfordataorderingandmultiplexing; theexception is the firstsub-stageofthefirstbutterflywhichutilisesonlyoneregisterasshowninFigure6.Thefirstbutterfly internalarchitecturetakesintoaccountthefactthatdataisreceivedatitsinputattherateofonewordperclockcyclewhicharethenstoredandprocessedattherateoftwowordsperclockcycle.

The first sub-stage performs addition and subtractionbetween the two halves of the input data cube; Register 1stores the first half that contains words of indices from

0to58

&− 1thenfeedsittotheaddercomponentsduringthe

next58

&cycleswhentheseconddatacubepartthatcontains

wordsindexedfrom58

&to'

-− 1isalsoavailableattheinput

of the adder components. Data multiplexing is carried outusingSwitch1which isusedto twofoldparallelise theserialinput.Itscontrolsignalisperiodicwithaperiodof'-cyclesanda50%dutycycle.

Insub-stage2,theregistersRegister2andRegister3,andSwitch2re-orderdatawiththeaimtoimplementtheadditionand subtraction operations in each block of data; a block is

dividedintotwohalves;wordswithindicesfrom0to5;

&− 1

arestoredinRegister3whilethesecondhalfwhichincludes

words of the same block with indices from5;

&H$'

&− 1 is

storedinRegister2.Theflowofdatabetweensub-stages1and2andtheselectionofwhereandwhenresultsarestored inregistersRegister2andRegister3 iscarriedoutbySwitch2.Suchaswitchhasa50%dutycyclecontrolsignalwithaperiodof'&cycles.

When sub-stage 2 processes blocks of data of the samecube,inasimilarwaythethirdstagecarriesouttheadditionandsubtractionoperationsoncolumnsofdatabelonging tothesameblockofdata.For'/2cycles,theadditionresultsofsub-stage 2 are fed toRegister 5; the subtraction operationresultsstoredinRegister4arefedtotheaddercomponentsofsub-stage 3; during the next '/2cycles, Register 4 isconnected to Register 5 while the results of the addition

Figure6.a.ThefirstbutterflyofDPDFA,b.ThemthbutterflyofDPDFA

7

operationofsub-stage2arefedtotheaddercomponentsofsub-stage3.BothregistersRegister4andRegister5areofalengthof'/2-word.Switch3whichallowsfordataswitchingand controls the flow of partial results in sub-stage 3 has aperiodic control signal which changes its value every '/2cycles.Oncealladditionandsubtractionoperationshavebeencarriedoutbythefirstbutterflythreesub-stages,twofurthertaskshavetobecarriedout,namely;themultiplicationbytheappropriateTFsandre-arrangingdatainanordersuitableforthenextbutterflystageoperationstobeexecuted.

Re-arrangingdatainSPDFAbutterfliesissimplycarriedoutbyfeedbackregisters.However,tore-arrangedata inDPSFAone has to cancel out the data order engendered by theselection and switching behaviour of Switch 2, Switch 3,Register 2, Register 3, Register 4and Register 5. Thedesignapproach adopted in thiswork is to use the same set-upofregistersandswitchestore-arrangetheorderofdataandthentoretimeformemoryoptimization.Hence,thebehaviourofSwitch4andSwitch5issimilartothatofSwitch2andSwitch3, respectively.The impactofusingretiming isshown inthelengthofRegister7ofthefirstbutterflyofFigure6.aand inthe length ofRegister 1 in the first sub-stage of the secondbutterflystageillustratedinFigure6.b.Hencetheorderofdatawhen it leavesthefirstbutterfly issimilarto itsorderattheadderelementsofthefirstsub-stage.

Inthegeneralcase,thetwodatainputspresentedatthemthbutterfly stage are the two halves of the '--word cube;howevereachhalfdataisorderedas2{6&setsof2{6+×2{6+

interleaved data cubes of sizev

&wx<

-

words. The control

signalsofallswitchesinFigure6.bareperiodicwitha50%dutycycle.ThecontrolsignalperiodofSwitch1,Switch2andSwitch

3is58

&yx<cycles,

5;

&yx<cyclesand

5

&yx<cycles,respectively.By

carefully controlling the flow of partial results into registersRegister 1,Register 2,Register 3,Register 4,Register 5 andRegister 6 in Figure 6.b, all the addition and subtractionsoperationscanbecarriedoutalongthethreedimensionsofthe data. Switches Switch 4 and Switch 5, share the controlsignalsofSwitch3andSwitch2,respectively.TheirswitchingbehaviourandtheuseofregistersRegister7andRegister8,re-arrangedatatothesameorderitwasreceivedattheinputof the adder elements of sub-stage 1; the multiplicationoperationscanthentakeplace.

5.2PostadditionStageand3-DBROStages

The post addition stage can be divided into three sub-stages.Tocopewithprocessingtwowordsperclockcycle,thefirst two sub-stages are built using two of the sub-stagesshowninFigure5.a;thethirdsub-stageisdepictedinFigure7.aandiscomposedoffiveadd/subtractelements,registersand a multiplexer. The post addition sub-stages areparameterised.TheparameterPshowninFigure7,referstothelengthofregistersused.

AswithSPDFA,the3-DBROstageisrequiredtore-ordertheoutput;itadjustsforthebitreversalpermutationengenderedby the fast transform algorithm used. The 3-D BRO is

implemented using5

&− 1 '

&-word dual port block RAM.

Thismemoryelementcanbemergedwithsystemswherethepresented3-DDCTcoreisused.

Mux 2

1

Register 3

-

Mux 3

1Register 5

-

P

Register 7

-Mux 4

1

P

Register 6

Register 1

-Mux 1

1

P

Register 2 +

3P-1

1

Out

Register 4

Register 8

Register 9

In0

In1

a.


P=N


P=1


P=N23-D BRO Out

b.

In0

In1

Figure7:a.Thirdpostadditionsub-stageforDPDFA.b.A

postadditionstageand3-DBROforDPDFA.

6.PerformanceEvaluation

ThepresentedarchitectureshavebeendesignedusingXilinxSystem generator tool and they have been tested andimplementedonaXilinxVirtex55vlx50tff1136-3FPGAdevice.Variousvideosequencesandwordlengthshavebeenusedtotest and evaluate the presented architectures performanceandattributes.

6.1TestBenchandRateDistortionPerformanceDCTArepresentsthe3-DDCTofeachframecomputedusing

the presented architectures; as they implement the samealgorithm,bothstructuresexhibitvirtuallythesameaccuracy,asshowninTable1andTable2.Whenemploying|}~,theannotation(x,y)referstoafixed-pointwordlengthofx+y-bitswherexandyrepresentthenumberofbitsoftheintegerandfractionalparts,respectively.ResultsofDCTAimplementationusing(12,8),(12,4)and(12,2)-bitwordlengthsareshowninthis section. In comparison, DCTM represents coefficientscalculated usingMatlab code that implements the 3-DDCT,andÄ|}~[representsaMatlabimplementationoftheinverse3-DDCT.BothÄ|}~[and|}~[Matlabimplementationsarefloating-point based. For testing and validation purposes|}~[andDCTAhavebeenappliedonvariousMRIandvideosequences of 512×512×8-word [50]; the Ä|}~[ is thenappliedtotheoutputof|}~toyieldreconstructedframes.Thepeak signal-to-noise ratio (PSNR) and rootmean squareerror (RMSE) are used for evaluating the accuracy of thepresented architectures output. The RMSE between theoriginalandreconstructedframesisdefinedas:

8

Ç]ÉÑ *

=1

Ö×ÜÄ|}~[ |}~ L, X, * − Ä(L, X, *)

&

á

D9+

à

Y9+

(14)

WhereÄ(L, X, *) istheoriginalframeand* intherange 1 ≤* ≤ äistheframeindex.FisthenumberofframesinÄ,PandQ are the number of its rows and columns, respectively. Inaddition, the PSNR between the original and reconstructedframesiscomputedasfollows[51]:

ÖÉ'Ç * = 10#$%[ãå çé

è[êë 1

&

(15)

where]í3 Ä1 represents themaximum intensity value ofthekthframe.Further,theaverageofmaximumabsoluteerror(AvgMaxErr) of the coefficients for the presented 3-D DCTarchitectures |}~in comparison to the Matlabimplementation|}~[iscomputedas:

]í3ÑFF * = ]í3 íìM |}~[ L, X, * − |}~ L, X, *

(16)

AvgMaxErr =+

ö]í3ÑFF *

ö

19+ (17)

where]í3ÑFF * represents themaximum absolute errorforeachframe * .

Performance accuracy, for both presented architectures,wasstudiedoveraselectionofimplementationwordlengths.Table1andTable2showthatthePSNR increaseswhenthefractional part increases for both SPDFA and DPDFA,respectively.Assuchandasexpected,thehighestaccuracyisobtained using a 20-bit wordlength (namely, (12, 8)-bit),providing perfect accuracy. The presented architecturesproduce very good image quality using all the selectedwordlengths.TheaveragePSNRoftheeighttestsequencesforSPDFAare∞,57and45dBusing(12,8),(12,4)and(12,2)-bitwordlengths, respectively. DPDFA achieves very comparableresults.Further,inTable1andTable2,theAvgMaxErrofbotharchitecturesarealmostidentical.Fortheaimofusingvisualinspectionasasubjectivefidelitycriterion[52],theimagesoftheoriginalandreconstructedMRI2scansareshowninFigure8. For both architectures, the reconstructed images arecomputedusingwordlengths (12,2), (12,4)and (12,8)bits. Itcanbenoticedthatthe(12,2)-bitwordlengthproducedagoodquality imagewhere the visual error can hardly be noticed;longerwordlengthshoweverleadtoamuchbetterquality.

6.2AreaUsageandComputationTime

Thehardwareusage,speedandcomputationtimeofbotharchitecturesusingdifferentwordlengthsareshowninTable3.Itisimportantthatthepresentedarchitecturesareefficientin terms of area usage; in particular, in resource-limiteddevices such as FPGAs. As it can be seen from Table 3, theaveragedeviceresourcesusageofSPDFAandDPDFAisaslowas12%and18%,respectively.ThehardwareusageofDPDFA

ishigherthanthatoftheSPDFAduetoduplicatecircuitryformultiplication, addition and post addition stages. However,thisextrahardwareusageandthefactthatithasnofeedbackloops improve themaximumoperating frequency of DPDFAoverSPDFA.ItiseasiertoplaceandroutethecomponentsofDPDFA, including theFPGAdevicespecific resourcessuchasthe DSP elements for the implementation of arithmeticoperations.Assuch,thecomputationtimeof512×512×8-wordinDPDFAisshorterthanthatofSPDFA.Itisworthpointingoutthatthememoryrequirementsofbotharchitecturesarelowin comparison with other architectures due to the in-placecomputation and the lowmemory requirement for theBROand3-Dreorderingoperations.Thememoryelementsof5N2and3N

2-wordhavebeenusedfor3-BROforSPDFAandDPDFArespectively. Further, a memory of 5N2-word for 3-Dreordering operation has been used in both architectures.Thus, the total number of block memory used in eacharchitecture is less than 25% of the available memoryresourcesofthe5vlx50tff1136-3FPGAdevice.

Table1.AccuracyanddistortionperformanceofSPDFA

Reconstructedandoriginalframes 3-DDCTCoefficients

PSNR(dB) RMSE AvgMaxErr

Video (12,8)(12,4)(12,2) (12,8) (12,4)(12,2) (12,8) (12,4) (12,2)

MRI1 ∞ 59 48 ≈0 0.28 1.05 0.01 0.18 0.77

MRI2 ∞ 56 45 ≈0 0.40 1.49 0.01 0.23 0.88

Akiyo ∞ 58 47 ≈0 0.31 1.16 0.01 0.21 0.89

Stefan ∞ 56 45 ≈0 0.40 1.48 0.01 0.23 0.97

Suzie ∞ 56 45 ≈0 0.40 1.46 0.01 0.21 0.90

Bus ∞ 56 45 ≈0 0.40 1.49 0.02 0.23 0.92

Flower ∞ 56 45 ≈0 0.39 1.45 0.02 0.23 0.97

Mobile ∞ 56 45 ≈0 0.40 1.49 0.01 0.21 0.92

Average ∞ 57 45 ≈0 0.37 1.38 0.01 0.22 0.90

Table2.AccuracyanddistortionperformanceofDPDFA

Reconstructedandoriginalframes 3-DDCTCoefficients

PSNR(dB) RMSE AvgMaxErr

Video (12,8)(12,4)(12,2) (12,8) (12,4)(12,2) (12,8) (12,4) (12,2)

MRI1 ∞ 60 48 ≈0 0.25 1.05 0.01 0.18 0.77

MRI2 ∞ 57 45 ≈0 0.36 1.49 0.01 0.23 0.86

Akiyo ∞ 59 47 ≈0 0.28 1.16 0.01 0.21 0.89

Stefan ∞ 57 45 ≈0 0.35 1.48 0.02 0.23 0.97

Suzie ∞ 57 45 ≈0 0.35 1.46 0.01 0.21 0.91

Bus ∞ 57 45 ≈0 0.35 1.49 0.02 0.23 0.93

Flower ∞ 57 45 ≈0 0.35 1.45 0.02 0.23 0.97

Mobile ∞ 57 45 ≈0 0.35 1.40 0.01 0.21 0.92

Average ∞ 58 45 ≈0 0.33 1.37 0.01 0.22 0.90

9

Figure8.TheoriginalandreconstructedMRI2usingbothArchitecturesforvariouswordlengthsizes

6.3DynamicPowerConsumptionThepowerconsumptioninFPGAisclassifiedintostaticand

dynamicpower.Thestaticpowermainlycomesfromleakagecurrent,whereaschargingswitchcapacitorsandshortcircuitcurrentsarethemainsourcesofdynamicpower;henceitcanbe minimised by switching capacitance reduction [53]. Thedynamicpowerconsumptionofthepresentedarchitecturesisshown in Figure 9. The power consumption has beencomputed using Xilinx Xpower analyser for various clock

frequencies and different wordlengths. The dynamic powerconsumption ishigher inDPDFAbyaround25-100mWthanSPDFA for selected operating frequencies and wordlengths.The reason behind that is the additional multipliers in thebutterflystagesandtheduplicationofsomeresourcesinthefirst twopostadditionstages.Thus,SPDFA isoutperformingDPDFAintermsofpowerconsumptionwhichmakesitabetterchoiceforlowpowerconsumptionapplications.

Figure9:Dynamicpowerconsumptionofbotharchitectures.

6.4ComparisontoSimilarWork

The throughput of both architectures is 1 coefficient perclockcycle; thus,'--clockcyclesareneeded tocomputeallthe3-DDCTcoefficientsofa'--worddatacube.Acomparisonbetween the presented and similar architectures in theliterature isshowninTable4.OfSPDFAandDPDFA,Table4shows that the first architecture outperforms the second intermsofareausageasitrequiresfewermultipliers,addersandregisters. The extra hardware DPDFA utilises is needed toperform the dual line computation of the 3-D DCT.Nevertheless, DPDFA is easily pipelined and it has a lowerlatency andmemory requirement than SPDFA. Thememoryrequirements forSPDFAandDPDFA,as listed inTable4,areusedfordatareorderingandBROonly.

Thenumberofmultipliersandaddersemployed,memoryrequirements,controllercircuitscomplexity,andcomputationtimeofthepresentedarchitectures,arealsocomparedtotherequirementsandperformanceofthearchitecturesin[38-43].As shown in Table 4. It can be seen that the presentedarchitectures require the lowest number of multipliers andmemory usageof all architectures.Only #$%&' and2#$%&'multipliersarerequiredtoperformthe3-DDCTcomputationusing SPDFA and DPDFA, respectively; for instance, Nmultipliersarerequiredin[41,42].Inaddition,exceptforthearchitecturesin[43],thepresentedarchitecturescarryoutthe3-D DCT computation with the lowest latency. Table 4 alsoshows the performance of various architectures in terms ofcomputation time; although the presented architecturesexhibita longercomputationtimethanthework in[39,43],this is largely balanced by the presented architectures lowhardwareusage. This improvement over similarwork in the

20

70

120

170

220

270

320

370

420

200 100 66.67 50 40 33.33D

ynam

ic p

ower

(mW

)Clock Frequency (MHz)

SPDFA (12,8)

DPDFA (12,8)

SPDFA (12,2)

DPDFA (12,2)

Reconstructed Image (One Image)



Reconstructed; SPDFA; (x,y) = (12,2)



Original Image (One Image)

Original MRI2

Reconstructed; DPDFA; (x,y) = (12,2)






10

literatureismainlyduetothefactthatunlikethearchitecturesin[38-43],thefocusisonemployingandregularisingthedataflowofafastalgorithmwhiletraditionalDCTarchitecturesarebasedonthedirectalgorithm[25,38-43];thishoweverisnotthe only benefit of using a VR approach, in fact the controlcircuitsattachedtothepresentedarchitecturesaresimpleasthere is no data transpose. This makes the controllercomplexitycomparabletothatofparalleldirectapproachesinin[38,42]. 7.Conclusions

This paper has presented two new 3-DDCT architecturesbasedona3-DDCTVRalgorithm.Theuseofafastalgorithmhasyieldedarchitectureswithimprovedprocessingspeedanda reduced hardware usage as they both require the lowestnumberofarithmeticcomponentsandmemoryrequirementamongknownarchitecturesintheliterature;atthesametime,sucharchitectures avoid theneed formemory transpositionandhenceareeasytoimplementandemployasimplecontrolcircuitry.Thepresentedarchitecturesareparameterisable interms of word and transform lengths and exhibit variouspowerconsumption,hardwareusage,processingspeedsandlevels of pipelining, which provides the designer withmoreflexibility and a larger choice when selecting the rightarchitecturefortheapplicationunderconsideration.

References[1] M. Ayinala and K. K. Parhi, "Parallel pipelined FFT architectures withreducednumberofdelays,"inProceedingsoftheACMGreatLakesSymposiumonVLSI(GLSVLSI),2012,pp.63-66.[2]O.Nibouche,S.Boussakta,M.Darnell,andM.Benaissa,"Algorithmsandpipeline architectures for 2-D FFT and FFT-like transforms," Digital SignalProcessing:AReviewJournal,vol.20,pp.1072-1086,2010.[3]M.Ayinala,M.Brown,andK.K.Parhi,"PipelinedparallelFFTarchitecturesviafoldingtransformation,"IEEETransactionsonVeryLargeScaleIntegration(VLSI)Systems,vol.20,pp.1068-1081,2012.[4]S. Saponara and B. Neri, "Radar Sensor Signal Acquisition andMultidimensional FFT Processing for Surveillance Applications in TransportSystems,"IEEETransactionsonInstrumentationandMeasurement,vol.66,pp.604-615,2017.[5]S.SaponaraandB.Neri,"Designofcompactandlow-powerX-bandRadarfor mobility surveillance applications," Computers & Electrical Engineering,vol.56,pp.46-63,2016.[6]A.Das,A.Hazra,andS.Banerjee,"Anefficientarchitecturefor3-Ddiscretewavelet transform," IEEE Transactions on Circuits and Systems for VideoTechnology,vol.20,pp.286-296,2010.[7]B. K.Mohanty and P. K.Meher, "Memory-efficient architecture for 3-DDWT using overlapped grouping of frames," IEEE Transactions on SignalProcessing,vol.59,pp.5605-5616,2011.[8]B. K. Mohanty and P. K. Meher, "Memory efficient modular VLSIarchitectureforhighthroughputandlow-latencyimplementationofmultilevellifting 2-DDWT," IEEE Transactions on Signal Processing,vol. 59, pp. 2072-2084,2011.[9]S. Al-Azawi, "Low-Power, Low-Area Multi-level 2-D Discrete WaveletTransformArchitecture,"Circuits,Systems,andSignalProcessing,vol.37,pp.444-458,2018.[10] R.E.Atani,M.Baboli,S.Mirzakuchaki,S.E.Atani,andB.Zamanlooy,"Design and implementation of a 118 MHz 2D DCT processor," in IEEEInternationalSymposiumonIndustrialElectronics,2008,pp.1076-1081.[11] M.JridiandA.Alfalou,"Alow-power,high-speedDCTarchitectureforimage compression: Principle and implementation," in 18th IEEE/IFIP VLSISystemonChipConference(VLSI-SoC),2010,pp.304-309.

[12] M.ElAakif,S.Belkouch,N.Chabini,andM.M.Hassani,"Lowpowerandfast DCT architecture usingmultiplier-lessmethod," in2011 Faible TensionFaibleConsommation(FTFC),2011,pp.63-66.[13] B.Z.Guo,L.Niu,andZ.M.Liu,"Implementationof2-DDCTbasedonFPGA," in Proceedings of SPIE - The International Society for OpticalEngineering,2010.[14] G. K. a. S. V. Khurram Bukhari, "DCT and IDCT implementations ondifferentFPGAtechnologies,"ComputerEngineeringLab,DelftUniversityofTechnology,2009[15] O.Nibouche,S.Boussakta,andM.Darnell,"PipelineArchitecturesforRadix-2NewMersenneNumberTransform,"IEEETransactionsonCircuitsandSystemsI:RegularPapers,vol.56,pp.1668-1680,2009.[16] H.L.P.A.Madanayake,R. J.Cintra,D.Onen,V.S.Dimitrov,andL.T.Bruton, "Algebraic integerbased8×82-DDCT architecture for digital videoprocessing,"inIEEEInternationalSymposiumonCircuitsandSystems(ISCAS),2011,pp.1247-1250.[17] A.M.Shams,A.Chidanandan,W.Pan,andM.A.Bayoumi, "NEDA:Alow-powerhigh-performanceDCTarchitecture," IEEETransactionsonSignalProcessing,vol.54,pp.955-964,2006.[18] E.D.KusumaandT.S.Widodo,"FPGAimplementationofpipelined2D-DCT and quantization architecture for JPEG image compression," in 2010InternationalSymposiuminInformationTechnology(ITSim),2010,pp.1-6.[19] S.Al-Azawi,Y.A.Abbas,andR.Jidin,"LowcomplexitymultidimensionalCDF5/3DWTarchitecture," inCommunicationSystems,Networks&DigitalSignalProcessing(CSNDSP),20149th InternationalSymposiumon,2014,pp.804-808.[20] G. K. Wallace, "The JPEG still picture compression standard," IEEETransactionsonConsumerElectronics,vol.38,pp.xviii-xxxiv,1992.[21] D. J. Le Gall, "The MPEG video compression standard," in CompconSpring'91:DigestofPapers,1991,pp.334-335.[22] W. Li, "Overview of fine granularity scalability in MPEG-4 videostandard,"IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.11,pp.301-317,2001.[23] A.MadisettiandA.N.Willson,Jr.,"DCT/IDCTprocessordesignforHDTVapplications,"inInternationalSymposiumonSignals,Systems,andElectronics(ISSSE'95),1995,pp.63-66.[24] T.Wiegand,G.J.Sullivan,G.Bjøntegaard,andA.Luthra,"Overviewofthe H.264/AVC video coding standard," IEEE Transactions on Circuits andSystemsforVideoTechnology,vol.13,pp.560-576,2003.[25] S.BoussaktaandH.O.Alshibami,"Fastalgorithmforthe3-DDCT-II,"IEEETransactionsonSignalProcessing,vol.52,pp.992-1001,2004.[26] X.Li,A.Dick,C.Shen,A.vandenHengel,andH.Wang,"Incrementallearningof3D-DCTcompactrepresentationsforrobustvisualtracking,"IEEETransactionsonPatternAnalysisandMachine Intelligence,vol.35,pp.863-881,2013.[27] S.SawantandD.A.Adjeroh,"Balancedmultipledescriptioncodingfor3DDCTvideo,"IEEETransactionsonBroadcasting,vol.57,pp.765-776,2011.[28] H.Y.Huang,C.H.Yang,andW.H.Hsu,"Avideowatermarkingtechniquebased on pseudo-3-D DCT and quantization index modulation," IEEETransactionsonInformationForensicsandSecurity,vol.5,pp.625-637,2010.[29] R. Atta and M. Ghanbari, "Spatio-temporal scalability-based motion-compensated3-Dsubband/DCTvideocoding," IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.16,pp.43-55,2006.[30] S. C. Chan and K. L. Ho, "Direct methods for computing discretesinusoidaltransforms," IEEProceedingsonRadarandSignalProcessing,vol.137,pp.433-442,1990.[31] S. An and C. Wang, "Recursive algorithm, architectures and FPGAimplementationofthetwo-dimensionaldiscretecosinetransform,"IET,ImageProcessingvol.2,pp.286-294,2008.[32] G. Jiun-In and L. Chih-Chen, "A generalized architecture for the one-dimensionaldiscretecosineandsinetransforms,"IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.11,pp.874-881,2001.[33] Z.Wu, J. Sha, Z.Wang, L. Li, andM. Gao, "An improved scaled DCTarchitecture,"IEEETransactionsonConsumerElectronics,vol.55,pp.685-689,2009.[34] C. Yuan-Ho and C. Tsin-Yuan, "A high performance video transformenginebyusing space-time scheduling strategy," IEEETransactionsonVeryLargeScaleIntegration(VLSI)Systemsvol.20,pp.655-664,2012.

11

[35] S. Al-Azawi, S. Boussakta, and A. Yakovlev, "High precision and lowpower DCT architectures for image compression applications," in IETConferenceonImageProcessing(IPR),2012,pp.1-6.[36] A.AggounandI.Jalloh,"Two-dimensionalDCT/IDCTarchitecture,"IEEProceedingsonComputersandDigitalTechniques,vol.150,pp.2-10,2003.[37] J. I. Guo, "Efficient parallel adder based design for one-dimensionaldiscretecosinetransform,"IEEProceedingsonCircuits,DevicesandSystems,vol.147,pp.276-282,2000.[38] I. Jalloh, A. Aggoun, and M. McCormick, "3D DCT architecture forcompressionof integral3D images," in IEEEWorkshoponSignalProcessingSystems,SiPS:DesignandImplementation,2000,pp.238-244.[39] A. Aggoun and I. Jalloh, "A parallel 3D DCT architecture for thecompressionofintegral3Dimages,"inThe8thIEEEInternationalConferenceonElectronics,CircuitsandSystems(ICECS)2001,pp.229-232vol.1.[40] M. Bakr and A. E. Salama, "Implementation of 3D-DCT based videoEncoder/Decoder system," inMidwest Symposiumon Circuits and Systems,2002,pp.II13-II16.[41] S.Saponara,L.Fanucci,andP.Terreni,"Low-powerVLSIarchitecturesfor3Ddiscretecosinetransform(DCT),"inIEEE46thMidwestSymposiumonCircuitsandSystems,2003,pp.1567-1570Vol.3.[42] S.Saponara,"Real-timeandlow-powerprocessingof3Ddirect/inversediscrete cosine transform for low-complexity video codec," JournalofReal-TimeImageProcessing,pp.1-11,2012.[43] Y. Ikegaki,T.Miyazaki,andS.G.Sedukhin,"3D-DCTprocessorand itsFPGA implementation," IEICETransactionson InformationandSystems,vol.E94-D,pp.1409-1418,2011.[44] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Three dimensional DCTsimilar butterfly algorithm and its pipeline architectures," in 2016 IEEEInformation Technology, Networking, Electronic and Automation ControlConference,2016,pp.506-510.[45] S. Al-Azawi, "Efficient Architectures for Multidimensional DiscreteTransforms in Image and Video Processing Applications," PhD Thesis,NewcastleUniversity,UK,2013.[46] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Device-saving pipelinearchitecturesofmulti-dimensionalDCTsimilarbutterflyalgorithm," in2016International Conference on Integrated Circuits and Microsystems (ICICM),2016,pp.339-344.[47] J. A. Nikara, J. H. Takala, and J. T. Astola, "Discrete cosine and sinetransforms—regularalgorithmsandpipelinearchitectures,"SignalProcessing,vol.86,pp.230-249,2006.[48] O.AlshibamiandS.Boussakta,"Fastalgorithmforthe3DDCT,"inIEEEInternational Conference on Acoustics, Speech, and Signal ProcessingProceedings(ICASSP'01),2001,pp.1945-1948vol.3.[49] M.C. Lee,R. K.W.Chan, andD.A.Adjeroh, "Fast three-dimensionaldiscretecosinetransform,"SIAMJournalonScientificComputing,vol.30,pp.3087-3107,2008.[50] Xiph.org, "Video Test Media [derf's collection]," Xiph.org, [Online].Available:https://media.xiph.org/video/derf/Accessedon20/09/2016.[51] Q.Huynh-Thu andM.Ghanbari, "The accuracy of PSNR in predictingvideoqualityfordifferentvideoscenesandframerates,"TelecommunicationSystems,vol.49,pp.35-48,2012/01/012012.[52] H. R. Sheikh, A. C. Bovik, and G. d. Veciana, "An information fidelitycriterion for image quality assessment using natural scene statistics," IEEETransactionsonImageProcessing,vol.14,pp.2117-2128,2005.[53] S.McKeownandR.Woods,"Lowpowerfieldprogrammablegatearrayimplementationoffastdigitalsignalprocessingalgorithms:Characterisationandmanipulationofdatalocality,"IETComputersandDigitalTechniques,vol.5,pp.136-144,2011.

SaadAl-AzawireceivedtheB.Sc.DegreeinElectrical Engineering from University ofBaghdadandM.Sc.degreeinElectronicandCommunication Engineering from Al-Mustansiriya University, Baghdad, Iraq. Hereceived his Ph.D. degree in Electrical andElectronic Engineering from Newcastle

University, Newcastle Upon Tyne, England, 2013. AssistantProfessor Dr. Saad is currently working as a head of the

departmentofElectronicEngineering,CollegeofEngineering,UniversityofDiyala,Diyala,Iraq.HisresearchinterestsincludeHardware architectures for signal and image processingalgorithms and transforms, Digital Image Processing, DigitalSignalProcessing.

Omar Nibouche received his BEng degree inElectronic Engineering form the PolytechnicSchool of Algiers and Ph.D. degree inComputer Science from Queen's UniversityBelfast. He is a lecturer in computing in theSchoolofComputingandMathematics,UlsterUniversity at Jordanstown. His research

interests include machine learning, applications of artificialintelligence,computervisionandbiometrics.

Said Boussakta received the PhD degree inElectrical Engineering from NewcastleUniversity, U.K., in 1990. Since 1990, he hasbeen working in academia, fully involved inboth research and teaching. From2000-2006hewasattheUniversityofLeedsasaReaderin Digital Communications and Signal

Processing. Since 2006, he has been with the School ofEngineering, Newcastle University as a Professor ofCommunications and Signal Processing, lecturing inCommunicationNetworksandSignalProcessingsubjects.Hehas supervised over 50 students to Ph.D. completion andpublished over 200 conference proceedings and journalarticles. His research interests are in the areas of fast DSPalgorithms,DigitalCommunications,CommunicationNetworkSystems, Cryptography, and Digital Signal/Image Processing.He has also served as Chair in conferences and presentedseveralinvitedtalksincommunications,signalprocessingandsecurity. Prof Boussakta is a Fellowof the IEE, and a SeniorMember of the Communications and Signal ProcessingSocieties.

Gaye Lightbody hasbeena Lecturerwithin the School of Computing inUlster University since 2006. ShereceivedanM.Eng.(1995)andaPhD(2000) in Electrical and ElectronicEngineering fromQueen’sUniversityofBelfast.HerPhD,HighPerformanceVLSIArchitecturesforRecursiveLeast

Squares Adaptive Filtering, involved the research anddevelopment into a scalable efficient architecture for highlyintensive adaptive beamforming applications. Gaye thenworked in industry from 2000 to 2006 for AmphionSemiconductorLimiteddevelopingintellectualpropertycoresforASICandFPGAsolutionsintheareasofaudio,imageandvideoprocessing.Sheisaco-authorofthebookFPGA-basedImplementation of Signal Processing Systems published byWiley in 2008. A second edition has been released in 2017which provides an update to reflect the latest iterations ofFPGA theory, applications, and technology. This revisionincludescoverageofFPGAsolutionsforBigDataApplications.

12

TABLE3.Hardwareutilisationrate,maximumoperatingfrequenciesandcomputationtimesforbotharchitecturesusingvariouswordlengths

SliceLogicUtilization Available

SPDFA DPDFA

HardwareusageforwordlengthsizesHardwareusageforwordlengthsizes

(12,8) (12,6) (12,4) (12,2) (12,8) (12,6) (12,4) (12,2)

Hardwareusage

NoofSliceRegisters 28,800 1840 1726 1593 1459 3691 3414 3120 2826

NoofSliceLUTs 28,800 2351 2171 1991 1811 3309 3044 2781 2517

NoofoccupiedSlices 7,200 743 607 588 605 1229 1096 1053 1008

NoofbondedIOBs 480 30 28 26 24 30 28 26 24

Noof36kBlockRAMused 60

1 - - - 1 - - -

Noof18kBlockRAMused 14 15 15 15 15 16 16 16NoofDSP48Es 48 9 8 8 8 16 12 12 12

Averageutilizationrate 12% 12% 11% 11% 18% 16% 15% 15%

Maximumfrequencies(MHz) 241 230 244 226 266 338 258 333

Computationtimesfor512×512×8-pixeldata(ms) 8.7 9.1 8.6 9.3 7.9 6.2 8.1 6.3

13

Table4.ComparisontoSimilarArchitecturesintheLiterature

Architectures Adders/Sub. Multipliers Memory Registers InitialLatencyComputationTime(cycles)

ControllerComplexity DCTAlgorithm

[38] 3" 3" "# " + 1 N/R* "& + 3" "& Simple Regular;Row-Column-Frame,cascaded

[39]

5"# + "2

5"#2

"&(transposememory)

"registerbetweenthe2-DDCTandthe1-D-DCT-

framedirection

"& + 32" "# Complex

Regular,Parallel;Row-Column-FrameN×N1-DDCT+1-DDCTforframedirection

[40] 3" − 3 3" "# " + 1 N/R > "# 6"&** Medium Regular1-DDCT;Row-Column-Frame[42]

FullParallel(FP) " 2" + 1 " 2" + 1 ∗∗∗ " "# + 1 N/R N/R 2"# Medium

1-DDCTRadix2Row-Column-FrameCascaded(CS) 3" 3"

∗∗∗ "# " + 1 N/R N/R 2"& Simple

HardwareMultiplexed(HM) " "

∗∗∗ "& N/R N/R 6"& Complex

[43] InputMem.

Sequential "& "& 2"& "& 3" + 4 "& 3" Complex

Regular1DDCT;Row-Column-Frame

Pipelined1 ≈ 2"& 2"& 2"& "& 3" + 4 "& 3" Complex

Piplined2 ≈ 3"& 3"& 2"& "& 3" + 8 "& 3" Complex

Block N&

8 N&8 2N

& ≈ 16N& 3N + 4 N& 3N Complex

SPDFA 12 + 6234#" 234#""4 + " "

#

ForreorderingandBRO

"& + 5"# + 5" + 4 ≅ 2"& "& Simple Vector-Radix3-DDCT

DPDFA 20 + 6234#" 2234#""&

ForreorderingandBRO

3"&2 + 6"

# + 11" + 8 ≅ 32"& "& Simple Vector-Radix3-DDCT

*N/R:Notreportedintheirpaper.**Computedfor4×4×4datablock.***Multiplicationisperformedbyaserialdistributedarithmeticarchitecture.

New Fast and Area-Efficient Pipeline 3-D DCT Architectures...3-D DCT, including visual tracking, video coding and watermarking [25-29]. Numerous 1-D and 2-D DCT architectures have

Documents