Top Banner
1 New Fast and Area-Efficient Pipeline 3-D DCT Architectures Saad Al-Azawi a , Omar Nibouche b , Said Boussakta c , Gaye Lightbody d a College of Engineering, Diyala University, Diyala, Iraq; b,d Faculty of Computing and Engineering, Ulster University, UK; c School of Engineering, Newcastle University, UK a [email protected], b [email protected], c [email protected], d [email protected] Abstract The efficient implementation of 3-D transforms is a challenging task due to the computation complexity, memory and area requirements of such transforms. One important 3-D transform is the 3-D Discrete Cosine Transform (3-D DCT) used in many image and video processing systems. In this paper, two new pipeline architectures for the 3-D DCT computation using the 3-D DCT Vector-Radix algorithm (3-D DCT VR) are presented. These architectures are scalable and parameterisable with regards to different wordlengths and pipelining levels. Their arithmetic component requirements are reduced to the order of !(#$% & ') in contrast with !(') for 3-D DCT architectures in the literature, while at the same time they can keep similar or better area-time complexity. Key words: 3-D Discrete Cosine Transform (DCT), Row-Column (RC), Row-Column-Frame (RCF), Vector-Radix, FPGA 1. Introduction Transforms such as the Fourier Transform (FT) [1-5], Wavelet Transform (WT) [6-9], and Cosine Transform (CT) [10- 14] play a critical part in various Digital Signal Processing (DSP) applications, including audio, image and video systems. Much of the usefulness of these transforms arises from their frequency and time-frequency representations and properties including the decorrelation property, energy compactness, and the availability of fast algorithms for their computation. Nevertheless, even the fast algorithms that implement these transforms are still very computationally intensive. Thus, these transforms can become a bottleneck in terms of the system’s speed, and contribute greatly to their area usage and power consumption [1-3, 6-8, 10-19]. For its role in many image and video applications, including the JPEG, MPEGx and H.26x compression standards, the DCT has received a great deal of research interest [20-24]; the 1-D and 2-D DCT are now the established transforms for many applications and standards. Further, there are many new and emerging applications for the 3-D DCT, including visual tracking, video coding and watermarking [25-29]. Numerous 1-D and 2-D DCT architectures have been suggested in the literature [30-39]. Exploiting the separability principle of the transform, 2-D DCT cores based on the 1-D DCT Row-Column (RC) approach are suggested in [33-36]; yet very few architectures that implement the 3-D DCT can be found [38-45]. Traditionally, the 3-D DCT has been implemented by cascading stages of the 1-D DCT as in the well-known Row- Column-Frame (RCF) approach. Noteworthy differences between architectures in the literature are their level of parallelisation in terms of the number of stages and the number of 1-DCT cores per stage, which leads to different trade-offs between circuit complexity and throughput. One common architecture employs three stages of one 1-D DCT core and N 3 +N 2 –word transpose memory [40-42]. Parallelisation can be applied to the first two 1-D DCT cores which in fact implements a 2-D DCT transform, leading to the utilisation of 2N+1 1-D DCT processors and N 3 +N-word memory [41, 42]. Another class of the 3-D DCT architectures multiplexes the 1-D DCT transforms involved in its computation onto a single 1-D DCT architecture. Such a class of architecture requires N 3 -word memory [41, 42]. The reduction achieved in hardware utilisation comes at the cost of a lower throughput. Using three 1-DCT cores to implement the 3-D DCT achieves a throughput three times higher than when employing a single 1-D DCT processor. The throughput is N-fold augmented via parallelisation of the 1-D DCT processors [42]. The 1-D DCT cores employed in the 3-D DCT architecture can use the transform’s fast algorithm, distributed arithmetic or ROM based designs [38]. Such architectures exhibit irregular structures, lack of modularity, and complex control. Another class of the 3-D DCT architecture relies solely on the systolic approach with its well established design methodology [43]. In [44, 46], high speed and low complexity pipeline n-D DCT architectures are proposed using the regular 1-D DCT and tensor product operations. The proposed architectures are based on the 1-D and 2-D DCT architectures in [47]. In this paper, two new pipeline and scalable architectures that implement the 3-D Discrete Cosine Transform Vector- Radix (3-D DCT VR) are introduced. The presented architectures are parameterisable in terms of wordlength and pipeline stages. Further, no block memory is used for data transposition. These architectures have been implemented and tested; for instance, an FPGA-based implementation of a 512×512×8-word data using a transform length of 8×8×8-word cube size and a 14-bit wordlength has achieved a working frequency of 330 MHz and a processing time of 6.4 ms. Thus, 80000 frames can be processed in every second. The remainder of this paper is organised as follows. In section 2, the background of the DCT transform and 3-D DCT VR algorithm are provided. Sections 3, 4 and 5 present the two new architectures for 3-D DCT computation. The results
13

New Fast and Area-Efficient Pipeline 3-D DCT Architectures...3-D DCT, including visual tracking, video coding and watermarking [25-29]. Numerous 1-D and 2-D DCT architectures have

Feb 02, 2021

Download

Documents

dariahiddleston
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
  • 1

    NewFastandArea-EfficientPipeline3-DDCTArchitectures

    SaadAl-Azawia,OmarNiboucheb,SaidBoussaktac,GayeLightbodydaCollegeofEngineering,DiyalaUniversity,Diyala,Iraq;b,dFacultyofComputingandEngineering,UlsterUniversity,

    UK;cSchoolofEngineering,NewcastleUniversity,[email protected],[email protected],[email protected],

    [email protected]

    AbstractTheefficient implementationof 3-D transforms is a challenging taskdue to the computation complexity,memory and arearequirementsofsuchtransforms.Oneimportant3-Dtransformisthe3-DDiscreteCosineTransform(3-DDCT)usedinmanyimageandvideoprocessingsystems.Inthispaper,twonewpipelinearchitecturesforthe3-DDCTcomputationusingthe3-DDCTVector-Radixalgorithm(3-DDCTVR)arepresented.Thesearchitecturesarescalableandparameterisablewithregardstodifferentwordlengthsandpipelininglevels.Theirarithmeticcomponentrequirementsarereducedtotheorderof!(#$%&')incontrastwith!(')for3-DDCTarchitecturesintheliterature,whileatthesametimetheycankeepsimilarorbetterarea-timecomplexity.Keywords:3-DDiscreteCosineTransform(DCT),Row-Column(RC),Row-Column-Frame(RCF),Vector-Radix,FPGA1.Introduction

    Transforms such as the Fourier Transform (FT) [1-5],WaveletTransform(WT)[6-9],andCosineTransform(CT)[10-14]playacriticalpartinvariousDigitalSignalProcessing(DSP)applications,includingaudio,imageandvideosystems.Muchof the usefulness of these transforms arises from theirfrequencyandtime-frequencyrepresentationsandpropertiesincluding the decorrelation property, energy compactness,and theavailabilityof fastalgorithms for their computation.Nevertheless,eventhefastalgorithmsthatimplementthesetransformsarestillverycomputationallyintensive.Thus,thesetransformscanbecomeabottleneckintermsofthesystem’sspeed,andcontributegreatlytotheirareausageandpowerconsumption[1-3,6-8,10-19].Foritsroleinmanyimageandvideo applications, including the JPEG, MPEGx and H.26xcompressionstandards,theDCThasreceivedagreatdealofresearch interest [20-24]; the 1-D and2-DDCT arenow theestablished transforms formanyapplicationsand standards.Further,therearemanynewandemergingapplicationsforthe3-D DCT, including visual tracking, video coding andwatermarking[25-29].

    Numerous 1-D and 2-D DCT architectures have beensuggestedintheliterature[30-39].Exploitingtheseparabilityprincipleofthetransform,2-DDCTcoresbasedonthe1-DDCTRow-Column(RC)approacharesuggestedin[33-36];yetveryfewarchitectures that implement the3-DDCTcanbe found[38-45].Traditionally,the3-DDCThasbeenimplementedbycascading stages of the 1-DDCT as in thewell-known Row-Column-Frame (RCF) approach. Noteworthy differencesbetween architectures in the literature are their level ofparallelisation in terms of the number of stages and thenumber of 1-DCT cores per stage, which leads to differenttrade-offs between circuit complexity and throughput. Onecommon architecture employs three stages of one 1-D DCTcore and N3+N2–word transpose memory [40-42].Parallelisationcanbeapplied to the first two1-DDCTcores

    whichinfactimplementsa2-DDCTtransform,leadingtotheutilisation of 2N+1 1-D DCT processors and N3+N-wordmemory[41,42].Anotherclassofthe3-DDCTarchitecturesmultiplexes the 1-D DCT transforms involved in itscomputationontoasingle1-DDCTarchitecture.Suchaclassof architecture requires N3-word memory [41, 42]. Thereductionachievedinhardwareutilisationcomesatthecostofalowerthroughput.Usingthree1-DCTcorestoimplementthe 3-DDCT achieves a throughput three times higher thanwhenemployingasingle1-DDCTprocessor.ThethroughputisN-foldaugmentedviaparallelisationofthe1-DDCTprocessors[42].The1-DDCTcoresemployedinthe3-DDCTarchitecturecanusethetransform’sfastalgorithm,distributedarithmeticorROMbaseddesigns[38].Sucharchitecturesexhibitirregularstructures, lackofmodularity,andcomplexcontrol.Anotherclassof the3-DDCTarchitecturereliessolelyonthesystolicapproachwithitswellestablisheddesignmethodology[43].In[44, 46], high speed and low complexity pipeline n-D DCTarchitectures are proposed using the regular 1-D DCT andtensor product operations. The proposed architectures arebasedonthe1-Dand2-DDCTarchitecturesin[47].

    In thispaper, twonewpipelineandscalablearchitecturesthat implement the 3-D Discrete Cosine Transform Vector-Radix (3-D DCT VR) are introduced. The presentedarchitecturesareparameterisableintermsofwordlengthandpipeline stages. Further, no block memory is used for datatransposition. These architectures have been implementedandtested;forinstance,anFPGA-basedimplementationofa512×512×8-worddatausingatransformlengthof8×8×8-wordcube size and a 14-bit wordlength has achieved a workingfrequencyof330MHzandaprocessingtimeof6.4ms.Thus,80000framescanbeprocessedineverysecond.

    The remainder of this paper is organised as follows. Insection2,thebackgroundoftheDCTtransformand3-DDCTVRalgorithmareprovided.Sections3,4and5presentthetwonew architectures for 3-D DCT computation. The results

  • 2

    obtainedarediscussedinsection6andconclusionsaregiveninsection7.

    2.Backgroundandthe3-DDCTVRAlgorithm

    The 3-D DCT coefficients of a N×N×N data cube arecomputedasfollows:) *+, *&, *-

    =801+01&01-

    '- 3 4+, 4&, 4-

    56+

    789:

    56+

    7;9:

    56+

    7

  • 3

    Fortheremainingcombinationsofodd/evenindices*+,*&and*-, one can divide the computation of the transform asfollows:

    The set of equations (6)-(13) represents a single butterflycomputationofaDecimationInFrequency(DIF)VRalgorithmasshowninFigure2.Itcomputeseightpoints;abutterflycanreceive aN×N×N data cube at its input and outputs 8 datacubes of N/2×N/2×N/2-word each; the process can be

    repeated until58

    Rdata cubes of 2× 2× 2-word each are

    computed. Thus, the flow graph of the whole butterfly

    computationconsistsof#$%&'stageswith58

    Rbutterfliesper

    stage[25].Theoutputfromthelastbutterflystageisfedtothepostadditionstagesthenitisbit-reversed.Thepostadditionoperations, shownby the termsoutsidebraces in the setofequations (6)-(13), are then carried out. Further to the

    )(2*+, 2*&, 2*- + 1) = gh h h [23c::+(4+, 4&, 4-) cos ∅-]

    [

    789:

    [

    7;9:

    [

    7

  • 4

    reductioninarithmeticcomplexityandprocessingtime,the3-DDCTVRalgorithmdoesnotrequiretransposememoryandexhibitsaregularbutterflystructurewhichismoresuitableforhardware and software implementation than the RCFapproach.Furtherdetailsaboutthe3-DDCTVRalgorithmcanbefoundin[25].

    3.New3-DDCTVectorRadixArchitectures

    Two new architectures are presented; namely the SinglePath Data Flow 3-D DCT Architecture (SPDFA) and theDualPathDataFlow3-DDCTArchitecture(DPDFA).Thedifferencebetweenthemliesinthenumberofwordsfedtotheadjacentbutterfly and how the arithmetic operations are scheduledwithineachbutterfly stage; thishas led to thederivationoftwo structures with different hardware requirements. Botharchitecturesarebuiltaccordingtothegenericblockdiagramof Figure 1. The butterfly calculation consists of #$%&'parameterisedandscalablestagesasdescribedbythesetofequations (6)-(13) and illustrated in Figure 2. The datareordering is common to both SPDFA andDPDFA, however,theinternalarchitectureofthebutterfly,post-additionstagesand the 3-D Bit Reverse Ordering (3-D BRO) stage arearchitecture-dependent. Of the two presented structures,SPDFA exhibits a single line of data between neighbouringbutterfly stages. It is more efficient in memory usage asintermediate results are fed back to memory elements;however, using these feedback loops prevents any furtherpipelining. DPDFA is a dual-path data flow feed-forwardarchitecture. There are two data lines between adjacentbutterflies and further pipelining is a simple task; thearchitecturehoweverrequiresmorememorythanSPDFA.

    The presented architectures both partition the inputsequenceintocubesofN×N×N-wordorN-blocksofN×N-word.The inputdata is reorderedaccording to (2).The reorderingprocessisperformedbyshufflingwordsalongtherow,columnand frame dimensions. It includes dividing data into odd-indexedandeven-indexedwordsandretrogradeindexing.As

    anexample,forNindicesarrangedas“0,1,2,3,4,5,6,…N-1”,thereorderedsequencewillbe“0,2,4,6,…,N-2,N-1,N-3,N-5,…..,1“.ThisstageisimplementedusingadualportblockRAM which permits writing and reading operations to beperformedondifferentlocationsduringthesamecycle.Thus,for anN×N×N-word cube, thememory size required for the

    reorderingoperationis5

    &+ 1 '

    &–wordwithalatencyof58

    &

    cycles as onlywriting operations are carried out during thisperiod.

    4.SinglePathDataFlow3-DDCTArchitecture

    SPDFA is composed of a 3-D reordering stage, #$%&(')butterflycomputationstages, threepostadditionsub-stagesanda3-DBitReverseOrder(3-DBRO)stage.

    4.1ButterflyStages

    Thereordereddatafromthe3-Dreorderingstageisfedtothebutterflystagesatarateofonewordperclockcycle.AsshowninFigure3,#$%&'butterflystages(m=1,2,3,..,#$%&')areused.Eachbutterflystagecanbefurtherdividedinto threesub-stagesandamultiplierasshowninFigure4.Asub-stageconsists of two add/subtract elements for carrying outadditionandsubtractionoperationsbetweenthetwohalvesofeach inputalong the threedimensionsofdata, a registerand a switch. The multiplier is used to multiply the output

    Figure2.Singlebutterflyofthe3-DDCTDIFVRalgorithm

  • 5

    words by appropriate Twiddle Factors (TFs) which are pre-computedandstoredinalookuptable(LUT).

    For the sake of explaining, thewords3 4+, 4&, 4- at theinput of the first butterfly stage can be indexed as 3 4+ +4&×' + 4-×'

    & . The first sub-stage performs addition andsubtractionbetweenthetwohalvesoftheinputdatacube;the

    first half contains words indexed from 0to58

    &− 1 and the

    secondpartthewordswithindicesfrom58

    &H$'

    -− 1.During

    thefirst58

    &clockcycles,thefirstpartofthedataisstoredin

    Register1(oflength58

    &-word)beforebeingfedtoaddersalong

    with the inputdata fromthesecondhalfduring thenext58

    &

    cycles.Duringthisperiod,thesubtractionoperationresultsarestored inRegister1whiletheadditionresultsare fedtothenext sub stage.Once this is completed, it is the turn of thesubtractionresultsstored inRegister1 tobefedtothenextsub-stage.TheselectionofwhichpartofthedatatobestoredinRegister1,fedtotheaddersorfedtothenextsub-stageismanagedbythecontrolsignalofSwitch1,whichchangesits

    valueevery58

    &cycles.

    What the first sub-stagecarriesoutoncubesofdata, thesecond sub-stage performs it on blocks ofN×N-word of thesamedatacube.Omittingchangesalong4-,eachblockisagaindividedintotwohalves;onehalfincludesindices4+ + 4&×'

    from0H$5;

    &− 1whilethesecondhalfincludeswordsofthe

    sameblockwithindicesfrom5;

    &H$'

    &− 1.Thebehaviourof

    the second sub-stage is similar to the first one; except that

    Register2 isof length5;

    &-wordandtheperiodofthecontrol

    signalforSwitch2is'&cycleswithadutycycleof50%.The third sub-stage implements addition and subtraction

    between the twohalvesofeachcolumn ineachblockusing

    Register3(oflength5

    &-word).Omittingchangesof4-and4&,

    thedataineachcolumnisdividedintotwohalveswithindices

    rangingfrom0H$5

    &− 1andfrom

    5

    &H$' − 1.Thewordsof

    thefirsthalfofthecolumnarestoredinRegister3,theyarethen fed to theaddersalongwith thecolumn’ssecondhalf.The results of the addition operation are multiplied by theappropriate TFs. After which, the results of the subtractionoperationwhichwerefirststoredinRegister3arefedtothemultiplier for the multiplication by the TFs. The multiplieroutputisinputtothenextbutterflystage.Aswithsub-stages1 and 2, Switch 3 multiplexes data and its control signal is

    periodicandchangesitsvalueevery5

    &cycles.

    In thegeneralcaseof themthbutterflystage,data is split

    into 2-u6- cubes ofv

    &wx<

    -

    words; the first butterfly sub-

    stage is used to perform the addition and subtractionoperationsbetweenthetwohalvesofeachinputdatacube;

    the words involved are indexed from0to58

    &y− 1 and from

    58

    &yto

    58

    &yx<.Duringthefirst

    58

    &ycycles,thedatacube’sfirsthalf

    isstoredinRegister1;inthenext58

    &ycycles,theadditionand

    subtractionoperationstakeplace;theresultsoftheadditionoperationarefedtotheadjacentbutterflysub-stagewhiletheresultsofthesubtractionoperationsarestoredinRegister1

    before being fed to the adjacent sub-stage in the next58

    &y

    cycles.Inasimilarwayandwithanappropriateswitching,thesecondandthirdbutterflysub-stagesimplementtheadditionand subtraction operations between the first and secondhalves of the data blocks and columns, respectively. The

    lengthsofregisters,Register2andRegister3,is5;

    &y-wordand

    5

    &y-word, respectively, which allows for storing half of each

    block and columnofdata as appropriate. Themultiplicationoperation by a twiddle factor (TF) takes place once allarithmeticoperationsofsub-stage3havebeencarriedout.

    ButterflyStagelog2N(m=log2N)

    ButterflyStage2(m=2)

    ButterflyStage1(m=1)

    Reordereddata

    Out

    Figure3.Theblockdiagramofbutterflystages

    +

    -

    N3/2m

    Switch1

    Register1

    +

    -

    N2/2m

    Switch2

    Register2

    +

    -

    N/2m

    Switch3

    Register3

    ×

    Sub-stage1 Sub-stage2 Sub-stage3

    in outTF

    Figure4.SPDFAmthbutterflyinternalarchitecture

    a.

    -Register 1 -

    Mux 1

    PIn

    PRegister 2

    Register 3

    -Mux 2

    PRegister 4

    -Mux 3

    2POut

    Post addition Sub-stage 1

    P=N

    Post addition Sub-stage 2

    P=1

    Post addition Sub-stage 3

    P=N23-D BROIn Out

    b.

    Figure5.a.Aparameterisedpostadditionsub-stageforSPDFA.b.Apostadditionstageand3-DBRO

    4.2PostAdditionand3-DBROStages

    The third part of SPDFA is the post addition stage whichperforms the computation of the terms outside the curlybrackets in (6)-(13). Reflecting the three dimensions of theinputdata,thepostadditionstagecanbedividedintothreesub-stages; each sub-stage carries out addition operationsover a given dimension. In the first, second and third postaddition sub-stage, the addition operations are carried outwithin the same N×N-word block, the same column or the

  • 6

    same data cube; respectively. Hence, the length of theregisters,used inFigure5and labelledasparameterP,mayvary. Still the internal architecture of each sub-stage isidenticallythesame.

    Theoutputofthethirdpostadditionstageisfedtothe3-DBRO stage which performs data reordering as the fastalgorithmused introducesabit reversalpermutationon thebinary indicesof the results.Bit reversal is performedalongeachrow,columnandframedirectionsineachN×N×N-worddata cube using a regular bit reversal algorithm [25]. Theoutputfromthisstagerepresentsthe3-DDCTcoefficientsof

    the input data. It is implementedusing a3'

    4− 1 '

    2-word

    dual-port block RAM. This stage is placed next to the postadditionstagetoactasabufferforthesubsequentsystemifrequired;forinstance,itcanbeintegratedwithaquantizerasinconventionaldatacompressionalgorithms.

    5.DualPathDataFlow3-DDCTArchitecture

    DPDFAisadualdatapatharchitectureforthe3-DDCTVRcomputation. It is devised toproduceahigh speed3-DDCTarchitecturewhichcanbeeasilyretimedandpipelined.DPDFAconsistsof3-Ddatareordering,butterflystages,postadditionand3-DBROstages.The3-Ddatareorderingstageisthesameasthatpresentedearlierinthepaper. 5.1ButterflyStages

    The scheduling of arithmetic operations in DPFDA isdifferent from SPDFA. Rather than feeding the subtractionoperations intermediate results back to the same sub-stageregisterasinSPDFA,theresultsoftheadditionandsubtractionarefedforwardtothenextsub-stageortothenextstage.Thisreducesthetimeduringwhichregistersareutilisedforstoringpartial results, adds another lineof data for communicationbetweenadjacentstagesandsub-stages,andhenceincreasesthe number of required multipliers to cope with thecomputationoftwocoefficientsperclockcycle.However,thissimplifiespipeliningandretimingintheDPDFA.

    DPDFA comprises #$%&' butterfly stages; each can bedivided into three sub-stages, registers, switches and twomultipliers.Agenericsub-stageconsistsof twoadd/subtract

    elementsforcarryingoutadditionandsubtractionoperationsbetween the two halves of each input along the threedimensionsofdata.Italsocontainstworegistersandaswitchfordataorderingandmultiplexing; theexception is the firstsub-stageofthefirstbutterflywhichutilisesonlyoneregisterasshowninFigure6.Thefirstbutterfly internalarchitecturetakesintoaccountthefactthatdataisreceivedatitsinputattherateofonewordperclockcyclewhicharethenstoredandprocessedattherateoftwowordsperclockcycle.

    The first sub-stage performs addition and subtractionbetween the two halves of the input data cube; Register 1stores the first half that contains words of indices from

    0to58

    &− 1thenfeedsittotheaddercomponentsduringthe

    next58

    &cycleswhentheseconddatacubepartthatcontains

    wordsindexedfrom58

    &to'

    -− 1isalsoavailableattheinput

    of the adder components. Data multiplexing is carried outusingSwitch1which isusedto twofoldparallelise theserialinput.Itscontrolsignalisperiodicwithaperiodof'-cyclesanda50%dutycycle.

    Insub-stage2,theregistersRegister2andRegister3,andSwitch2re-orderdatawiththeaimtoimplementtheadditionand subtraction operations in each block of data; a block is

    dividedintotwohalves;wordswithindicesfrom0to5;

    &− 1

    arestoredinRegister3whilethesecondhalfwhichincludes

    words of the same block with indices from5;

    &H$'

    &− 1 is

    storedinRegister2.Theflowofdatabetweensub-stages1and2andtheselectionofwhereandwhenresultsarestored inregistersRegister2andRegister3 iscarriedoutbySwitch2.Suchaswitchhasa50%dutycyclecontrolsignalwithaperiodof'&cycles.

    When sub-stage 2 processes blocks of data of the samecube,inasimilarwaythethirdstagecarriesouttheadditionandsubtractionoperationsoncolumnsofdatabelonging tothesameblockofdata.For'/2cycles,theadditionresultsofsub-stage 2 are fed toRegister 5; the subtraction operationresultsstoredinRegister4arefedtotheaddercomponentsofsub-stage 3; during the next '/2cycles, Register 4 isconnected to Register 5 while the results of the addition

    Figure6.a.ThefirstbutterflyofDPDFA,b.ThemthbutterflyofDPDFA

  • 7

    operationofsub-stage2arefedtotheaddercomponentsofsub-stage3.BothregistersRegister4andRegister5areofalengthof'/2-word.Switch3whichallowsfordataswitchingand controls the flow of partial results in sub-stage 3 has aperiodic control signal which changes its value every '/2cycles.Oncealladditionandsubtractionoperationshavebeencarriedoutbythefirstbutterflythreesub-stages,twofurthertaskshavetobecarriedout,namely;themultiplicationbytheappropriateTFsandre-arrangingdatainanordersuitableforthenextbutterflystageoperationstobeexecuted.

    Re-arrangingdatainSPDFAbutterfliesissimplycarriedoutbyfeedbackregisters.However,tore-arrangedata inDPSFAone has to cancel out the data order engendered by theselection and switching behaviour of Switch 2, Switch 3,Register 2, Register 3, Register 4and Register 5. Thedesignapproach adopted in thiswork is to use the same set-upofregistersandswitchestore-arrangetheorderofdataandthentoretimeformemoryoptimization.Hence,thebehaviourofSwitch4andSwitch5issimilartothatofSwitch2andSwitch3, respectively.The impactofusingretiming isshown inthelengthofRegister7ofthefirstbutterflyofFigure6.aand inthe length ofRegister 1 in the first sub-stage of the secondbutterflystageillustratedinFigure6.b.Hencetheorderofdatawhen it leavesthefirstbutterfly issimilarto itsorderattheadderelementsofthefirstsub-stage.

    Inthegeneralcase,thetwodatainputspresentedatthemthbutterfly stage are the two halves of the '--word cube;howevereachhalfdataisorderedas2{6&setsof2{6+×2{6+

    interleaved data cubes of sizev

    &wx<

    -

    words. The control

    signalsofallswitchesinFigure6.bareperiodicwitha50%dutycycle.ThecontrolsignalperiodofSwitch1,Switch2andSwitch

    3is58

    &yx<cycles,

    5;

    &yx<cyclesand

    5

    &yx<cycles,respectively.By

    carefully controlling the flow of partial results into registersRegister 1,Register 2,Register 3,Register 4,Register 5 andRegister 6 in Figure 6.b, all the addition and subtractionsoperationscanbecarriedoutalongthethreedimensionsofthe data. Switches Switch 4 and Switch 5, share the controlsignalsofSwitch3andSwitch2,respectively.TheirswitchingbehaviourandtheuseofregistersRegister7andRegister8,re-arrangedatatothesameorderitwasreceivedattheinputof the adder elements of sub-stage 1; the multiplicationoperationscanthentakeplace.

    5.2PostadditionStageand3-DBROStages

    The post addition stage can be divided into three sub-stages.Tocopewithprocessingtwowordsperclockcycle,thefirst two sub-stages are built using two of the sub-stagesshowninFigure5.a;thethirdsub-stageisdepictedinFigure7.aandiscomposedoffiveadd/subtractelements,registersand a multiplexer. The post addition sub-stages areparameterised.TheparameterPshowninFigure7,referstothelengthofregistersused.

    AswithSPDFA,the3-DBROstageisrequiredtore-ordertheoutput;itadjustsforthebitreversalpermutationengenderedby the fast transform algorithm used. The 3-D BRO is

    implemented using5

    &− 1 '

    &-word dual port block RAM.

    Thismemoryelementcanbemergedwithsystemswherethepresented3-DDCTcoreisused.

    Mux 2

    1

    Register 3

    -

    Mux 3

    1Register 5

    -

    P

    Register 7

    -Mux 4

    1

    P

    Register 6

    Register 1

    -Mux 1

    1

    P

    Register 2 +

    3P-1

    1

    Out

    Register 4

    Register 8

    Register 9

    In0

    In1

    a.

    Post addition Sub-stage 1

    P=N

    Post addition Sub-stage 2

    P=1

    Post addition Sub-stage 3

    P=N23-D BRO Out

    b.

    In0

    In1

    Figure7:a.Thirdpostadditionsub-stageforDPDFA.b.A

    postadditionstageand3-DBROforDPDFA.

    6.PerformanceEvaluation

    ThepresentedarchitectureshavebeendesignedusingXilinxSystem generator tool and they have been tested andimplementedonaXilinxVirtex55vlx50tff1136-3FPGAdevice.Variousvideosequencesandwordlengthshavebeenusedtotest and evaluate the presented architectures performanceandattributes.

    6.1TestBenchandRateDistortionPerformanceDCTArepresentsthe3-DDCTofeachframecomputedusing

    the presented architectures; as they implement the samealgorithm,bothstructuresexhibitvirtuallythesameaccuracy,asshowninTable1andTable2.Whenemploying|}~,theannotation(x,y)referstoafixed-pointwordlengthofx+y-bitswherexandyrepresentthenumberofbitsoftheintegerandfractionalparts,respectively.ResultsofDCTAimplementationusing(12,8),(12,4)and(12,2)-bitwordlengthsareshowninthis section. In comparison, DCTM represents coefficientscalculated usingMatlab code that implements the 3-DDCT,andÄ|}~[representsaMatlabimplementationoftheinverse3-DDCT.BothÄ|}~[and|}~[Matlabimplementationsarefloating-point based. For testing and validation purposes|}~[andDCTAhavebeenappliedonvariousMRIandvideosequences of 512×512×8-word [50]; the Ä|}~[ is thenappliedtotheoutputof|}~toyieldreconstructedframes.Thepeak signal-to-noise ratio (PSNR) and rootmean squareerror (RMSE) are used for evaluating the accuracy of thepresented architectures output. The RMSE between theoriginalandreconstructedframesisdefinedas:

  • 8

    Ç]ÉÑ *

    =1

    Ö×ÜÄ|}~[ |}~ L, X, * − Ä(L, X, *)

    &

    á

    D9+

    à

    Y9+

    (14)

    WhereÄ(L, X, *) istheoriginalframeand* intherange 1 ≤* ≤ äistheframeindex.FisthenumberofframesinÄ,PandQ are the number of its rows and columns, respectively. Inaddition, the PSNR between the original and reconstructedframesiscomputedasfollows[51]:

    ÖÉ'Ç * = 10#$%[ãå çé

    è[êë 1

    &

    (15)

    where]í3 Ä1 represents themaximum intensity value ofthekthframe.Further,theaverageofmaximumabsoluteerror(AvgMaxErr) of the coefficients for the presented 3-D DCTarchitectures |}~in comparison to the Matlabimplementation|}~[iscomputedas:

    ]í3ÑFF * = ]í3 íìM |}~[ L, X, * − |}~ L, X, *

    (16)

    AvgMaxErr =+

    ö]í3ÑFF *

    ö

    19+ (17)

    where]í3ÑFF * represents themaximum absolute errorforeachframe * .

    Performance accuracy, for both presented architectures,wasstudiedoveraselectionofimplementationwordlengths.Table1andTable2showthatthePSNR increaseswhenthefractional part increases for both SPDFA and DPDFA,respectively.Assuchandasexpected,thehighestaccuracyisobtained using a 20-bit wordlength (namely, (12, 8)-bit),providing perfect accuracy. The presented architecturesproduce very good image quality using all the selectedwordlengths.TheaveragePSNRoftheeighttestsequencesforSPDFAare∞,57and45dBusing(12,8),(12,4)and(12,2)-bitwordlengths, respectively. DPDFA achieves very comparableresults.Further,inTable1andTable2,theAvgMaxErrofbotharchitecturesarealmostidentical.Fortheaimofusingvisualinspectionasasubjectivefidelitycriterion[52],theimagesoftheoriginalandreconstructedMRI2scansareshowninFigure8. For both architectures, the reconstructed images arecomputedusingwordlengths (12,2), (12,4)and (12,8)bits. Itcanbenoticedthatthe(12,2)-bitwordlengthproducedagoodquality imagewhere the visual error can hardly be noticed;longerwordlengthshoweverleadtoamuchbetterquality.

    6.2AreaUsageandComputationTime

    Thehardwareusage,speedandcomputationtimeofbotharchitecturesusingdifferentwordlengthsareshowninTable3.Itisimportantthatthepresentedarchitecturesareefficientin terms of area usage; in particular, in resource-limiteddevices such as FPGAs. As it can be seen from Table 3, theaveragedeviceresourcesusageofSPDFAandDPDFAisaslowas12%and18%,respectively.ThehardwareusageofDPDFA

    ishigherthanthatoftheSPDFAduetoduplicatecircuitryformultiplication, addition and post addition stages. However,thisextrahardwareusageandthefactthatithasnofeedbackloops improve themaximumoperating frequency of DPDFAoverSPDFA.ItiseasiertoplaceandroutethecomponentsofDPDFA, including theFPGAdevicespecific resourcessuchasthe DSP elements for the implementation of arithmeticoperations.Assuch,thecomputationtimeof512×512×8-wordinDPDFAisshorterthanthatofSPDFA.Itisworthpointingoutthatthememoryrequirementsofbotharchitecturesarelowin comparison with other architectures due to the in-placecomputation and the lowmemory requirement for theBROand3-Dreorderingoperations.Thememoryelementsof5N2and3N

    2-wordhavebeenusedfor3-BROforSPDFAandDPDFArespectively. Further, a memory of 5N2-word for 3-Dreordering operation has been used in both architectures.Thus, the total number of block memory used in eacharchitecture is less than 25% of the available memoryresourcesofthe5vlx50tff1136-3FPGAdevice.

    Table1.AccuracyanddistortionperformanceofSPDFA

    Reconstructedandoriginalframes 3-DDCTCoefficients

    PSNR(dB) RMSE AvgMaxErr

    Video (12,8)(12,4)(12,2) (12,8) (12,4)(12,2) (12,8) (12,4) (12,2)

    MRI1 ∞ 59 48 ≈0 0.28 1.05 0.01 0.18 0.77

    MRI2 ∞ 56 45 ≈0 0.40 1.49 0.01 0.23 0.88

    Akiyo ∞ 58 47 ≈0 0.31 1.16 0.01 0.21 0.89

    Stefan ∞ 56 45 ≈0 0.40 1.48 0.01 0.23 0.97

    Suzie ∞ 56 45 ≈0 0.40 1.46 0.01 0.21 0.90

    Bus ∞ 56 45 ≈0 0.40 1.49 0.02 0.23 0.92

    Flower ∞ 56 45 ≈0 0.39 1.45 0.02 0.23 0.97

    Mobile ∞ 56 45 ≈0 0.40 1.49 0.01 0.21 0.92

    Average ∞ 57 45 ≈0 0.37 1.38 0.01 0.22 0.90

    Table2.AccuracyanddistortionperformanceofDPDFA

    Reconstructedandoriginalframes 3-DDCTCoefficients

    PSNR(dB) RMSE AvgMaxErr

    Video (12,8)(12,4)(12,2) (12,8) (12,4)(12,2) (12,8) (12,4) (12,2)

    MRI1 ∞ 60 48 ≈0 0.25 1.05 0.01 0.18 0.77

    MRI2 ∞ 57 45 ≈0 0.36 1.49 0.01 0.23 0.86

    Akiyo ∞ 59 47 ≈0 0.28 1.16 0.01 0.21 0.89

    Stefan ∞ 57 45 ≈0 0.35 1.48 0.02 0.23 0.97

    Suzie ∞ 57 45 ≈0 0.35 1.46 0.01 0.21 0.91

    Bus ∞ 57 45 ≈0 0.35 1.49 0.02 0.23 0.93

    Flower ∞ 57 45 ≈0 0.35 1.45 0.02 0.23 0.97

    Mobile ∞ 57 45 ≈0 0.35 1.40 0.01 0.21 0.92

    Average ∞ 58 45 ≈0 0.33 1.37 0.01 0.22 0.90

  • 9

    Figure8.TheoriginalandreconstructedMRI2usingbothArchitecturesforvariouswordlengthsizes

    6.3DynamicPowerConsumptionThepowerconsumptioninFPGAisclassifiedintostaticand

    dynamicpower.Thestaticpowermainlycomesfromleakagecurrent,whereaschargingswitchcapacitorsandshortcircuitcurrentsarethemainsourcesofdynamicpower;henceitcanbe minimised by switching capacitance reduction [53]. Thedynamicpowerconsumptionofthepresentedarchitecturesisshown in Figure 9. The power consumption has beencomputed using Xilinx Xpower analyser for various clock

    frequencies and different wordlengths. The dynamic powerconsumption ishigher inDPDFAbyaround25-100mWthanSPDFA for selected operating frequencies and wordlengths.The reason behind that is the additional multipliers in thebutterflystagesandtheduplicationofsomeresourcesinthefirst twopostadditionstages.Thus,SPDFA isoutperformingDPDFAintermsofpowerconsumptionwhichmakesitabetterchoiceforlowpowerconsumptionapplications.

    Figure9:Dynamicpowerconsumptionofbotharchitectures.

    6.4ComparisontoSimilarWork

    The throughput of both architectures is 1 coefficient perclockcycle; thus,'--clockcyclesareneeded tocomputeallthe3-DDCTcoefficientsofa'--worddatacube.Acomparisonbetween the presented and similar architectures in theliterature isshowninTable4.OfSPDFAandDPDFA,Table4shows that the first architecture outperforms the second intermsofareausageasitrequiresfewermultipliers,addersandregisters. The extra hardware DPDFA utilises is needed toperform the dual line computation of the 3-D DCT.Nevertheless, DPDFA is easily pipelined and it has a lowerlatency andmemory requirement than SPDFA. Thememoryrequirements forSPDFAandDPDFA,as listed inTable4,areusedfordatareorderingandBROonly.

    Thenumberofmultipliersandaddersemployed,memoryrequirements,controllercircuitscomplexity,andcomputationtimeofthepresentedarchitectures,arealsocomparedtotherequirementsandperformanceofthearchitecturesin[38-43].As shown in Table 4. It can be seen that the presentedarchitectures require the lowest number of multipliers andmemory usageof all architectures.Only #$%&' and2#$%&'multipliersarerequiredtoperformthe3-DDCTcomputationusing SPDFA and DPDFA, respectively; for instance, Nmultipliersarerequiredin[41,42].Inaddition,exceptforthearchitecturesin[43],thepresentedarchitecturescarryoutthe3-D DCT computation with the lowest latency. Table 4 alsoshows the performance of various architectures in terms ofcomputation time; although the presented architecturesexhibita longercomputationtimethanthework in[39,43],this is largely balanced by the presented architectures lowhardwareusage. This improvement over similarwork in the

    20

    70

    120

    170

    220

    270

    320

    370

    420

    200 100 66.67 50 40 33.33D

    ynam

    ic p

    ower

    (mW

    )Clock Frequency (MHz)

    SPDFA (12,8)

    DPDFA (12,8)

    SPDFA (12,2)

    DPDFA (12,2)

    Reconstructed Image (One Image)

    Reconstructed Image (One Image)

    Reconstructed Image (One Image)

    Reconstructed; SPDFA; (x,y) = (12,2)

    Reconstructed; SPDFA; (x,y) = (12,4)

    Reconstructed; SPDFA; (x,y) = (12,8)

    Original Image (One Image)

    Original MRI2

    Reconstructed; DPDFA; (x,y) = (12,2)

    Reconstructed; DPDFA; (x,y) = (12,4)

    Reconstructed; DPDFA; (x,y) = (12,8)

    Reconstructed Image (One Image)

    Reconstructed Image (One Image)

    Reconstructed Image (One Image)

  • 10

    literatureismainlyduetothefactthatunlikethearchitecturesin[38-43],thefocusisonemployingandregularisingthedataflowofafastalgorithmwhiletraditionalDCTarchitecturesarebasedonthedirectalgorithm[25,38-43];thishoweverisnotthe only benefit of using a VR approach, in fact the controlcircuitsattachedtothepresentedarchitecturesaresimpleasthere is no data transpose. This makes the controllercomplexitycomparabletothatofparalleldirectapproachesinin[38,42]. 7.Conclusions

    This paper has presented two new 3-DDCT architecturesbasedona3-DDCTVRalgorithm.Theuseofafastalgorithmhasyieldedarchitectureswithimprovedprocessingspeedanda reduced hardware usage as they both require the lowestnumberofarithmeticcomponentsandmemoryrequirementamongknownarchitecturesintheliterature;atthesametime,sucharchitectures avoid theneed formemory transpositionandhenceareeasytoimplementandemployasimplecontrolcircuitry.Thepresentedarchitecturesareparameterisable interms of word and transform lengths and exhibit variouspowerconsumption,hardwareusage,processingspeedsandlevels of pipelining, which provides the designer withmoreflexibility and a larger choice when selecting the rightarchitecturefortheapplicationunderconsideration.

    References[1] M. Ayinala and K. K. Parhi, "Parallel pipelined FFT architectures withreducednumberofdelays,"inProceedingsoftheACMGreatLakesSymposiumonVLSI(GLSVLSI),2012,pp.63-66.[2]O.Nibouche,S.Boussakta,M.Darnell,andM.Benaissa,"Algorithmsandpipeline architectures for 2-D FFT and FFT-like transforms," Digital SignalProcessing:AReviewJournal,vol.20,pp.1072-1086,2010.[3]M.Ayinala,M.Brown,andK.K.Parhi,"PipelinedparallelFFTarchitecturesviafoldingtransformation,"IEEETransactionsonVeryLargeScaleIntegration(VLSI)Systems,vol.20,pp.1068-1081,2012.[4]S. Saponara and B. Neri, "Radar Sensor Signal Acquisition andMultidimensional FFT Processing for Surveillance Applications in TransportSystems,"IEEETransactionsonInstrumentationandMeasurement,vol.66,pp.604-615,2017.[5]S.SaponaraandB.Neri,"Designofcompactandlow-powerX-bandRadarfor mobility surveillance applications," Computers & Electrical Engineering,vol.56,pp.46-63,2016.[6]A.Das,A.Hazra,andS.Banerjee,"Anefficientarchitecturefor3-Ddiscretewavelet transform," IEEE Transactions on Circuits and Systems for VideoTechnology,vol.20,pp.286-296,2010.[7]B. K.Mohanty and P. K.Meher, "Memory-efficient architecture for 3-DDWT using overlapped grouping of frames," IEEE Transactions on SignalProcessing,vol.59,pp.5605-5616,2011.[8]B. K. Mohanty and P. K. Meher, "Memory efficient modular VLSIarchitectureforhighthroughputandlow-latencyimplementationofmultilevellifting 2-DDWT," IEEE Transactions on Signal Processing,vol. 59, pp. 2072-2084,2011.[9]S. Al-Azawi, "Low-Power, Low-Area Multi-level 2-D Discrete WaveletTransformArchitecture,"Circuits,Systems,andSignalProcessing,vol.37,pp.444-458,2018.[10] R.E.Atani,M.Baboli,S.Mirzakuchaki,S.E.Atani,andB.Zamanlooy,"Design and implementation of a 118 MHz 2D DCT processor," in IEEEInternationalSymposiumonIndustrialElectronics,2008,pp.1076-1081.[11] M.JridiandA.Alfalou,"Alow-power,high-speedDCTarchitectureforimage compression: Principle and implementation," in 18th IEEE/IFIP VLSISystemonChipConference(VLSI-SoC),2010,pp.304-309.

    [12] M.ElAakif,S.Belkouch,N.Chabini,andM.M.Hassani,"Lowpowerandfast DCT architecture usingmultiplier-lessmethod," in2011 Faible TensionFaibleConsommation(FTFC),2011,pp.63-66.[13] B.Z.Guo,L.Niu,andZ.M.Liu,"Implementationof2-DDCTbasedonFPGA," in Proceedings of SPIE - The International Society for OpticalEngineering,2010.[14] G. K. a. S. V. Khurram Bukhari, "DCT and IDCT implementations ondifferentFPGAtechnologies,"ComputerEngineeringLab,DelftUniversityofTechnology,2009[15] O.Nibouche,S.Boussakta,andM.Darnell,"PipelineArchitecturesforRadix-2NewMersenneNumberTransform,"IEEETransactionsonCircuitsandSystemsI:RegularPapers,vol.56,pp.1668-1680,2009.[16] H.L.P.A.Madanayake,R. J.Cintra,D.Onen,V.S.Dimitrov,andL.T.Bruton, "Algebraic integerbased8×82-DDCT architecture for digital videoprocessing,"inIEEEInternationalSymposiumonCircuitsandSystems(ISCAS),2011,pp.1247-1250.[17] A.M.Shams,A.Chidanandan,W.Pan,andM.A.Bayoumi, "NEDA:Alow-powerhigh-performanceDCTarchitecture," IEEETransactionsonSignalProcessing,vol.54,pp.955-964,2006.[18] E.D.KusumaandT.S.Widodo,"FPGAimplementationofpipelined2D-DCT and quantization architecture for JPEG image compression," in 2010InternationalSymposiuminInformationTechnology(ITSim),2010,pp.1-6.[19] S.Al-Azawi,Y.A.Abbas,andR.Jidin,"LowcomplexitymultidimensionalCDF5/3DWTarchitecture," inCommunicationSystems,Networks&DigitalSignalProcessing(CSNDSP),20149th InternationalSymposiumon,2014,pp.804-808.[20] G. K. Wallace, "The JPEG still picture compression standard," IEEETransactionsonConsumerElectronics,vol.38,pp.xviii-xxxiv,1992.[21] D. J. Le Gall, "The MPEG video compression standard," in CompconSpring'91:DigestofPapers,1991,pp.334-335.[22] W. Li, "Overview of fine granularity scalability in MPEG-4 videostandard,"IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.11,pp.301-317,2001.[23] A.MadisettiandA.N.Willson,Jr.,"DCT/IDCTprocessordesignforHDTVapplications,"inInternationalSymposiumonSignals,Systems,andElectronics(ISSSE'95),1995,pp.63-66.[24] T.Wiegand,G.J.Sullivan,G.Bjøntegaard,andA.Luthra,"Overviewofthe H.264/AVC video coding standard," IEEE Transactions on Circuits andSystemsforVideoTechnology,vol.13,pp.560-576,2003.[25] S.BoussaktaandH.O.Alshibami,"Fastalgorithmforthe3-DDCT-II,"IEEETransactionsonSignalProcessing,vol.52,pp.992-1001,2004.[26] X.Li,A.Dick,C.Shen,A.vandenHengel,andH.Wang,"Incrementallearningof3D-DCTcompactrepresentationsforrobustvisualtracking,"IEEETransactionsonPatternAnalysisandMachine Intelligence,vol.35,pp.863-881,2013.[27] S.SawantandD.A.Adjeroh,"Balancedmultipledescriptioncodingfor3DDCTvideo,"IEEETransactionsonBroadcasting,vol.57,pp.765-776,2011.[28] H.Y.Huang,C.H.Yang,andW.H.Hsu,"Avideowatermarkingtechniquebased on pseudo-3-D DCT and quantization index modulation," IEEETransactionsonInformationForensicsandSecurity,vol.5,pp.625-637,2010.[29] R. Atta and M. Ghanbari, "Spatio-temporal scalability-based motion-compensated3-Dsubband/DCTvideocoding," IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.16,pp.43-55,2006.[30] S. C. Chan and K. L. Ho, "Direct methods for computing discretesinusoidaltransforms," IEEProceedingsonRadarandSignalProcessing,vol.137,pp.433-442,1990.[31] S. An and C. Wang, "Recursive algorithm, architectures and FPGAimplementationofthetwo-dimensionaldiscretecosinetransform,"IET,ImageProcessingvol.2,pp.286-294,2008.[32] G. Jiun-In and L. Chih-Chen, "A generalized architecture for the one-dimensionaldiscretecosineandsinetransforms,"IEEETransactionsonCircuitsandSystemsforVideoTechnology,vol.11,pp.874-881,2001.[33] Z.Wu, J. Sha, Z.Wang, L. Li, andM. Gao, "An improved scaled DCTarchitecture,"IEEETransactionsonConsumerElectronics,vol.55,pp.685-689,2009.[34] C. Yuan-Ho and C. Tsin-Yuan, "A high performance video transformenginebyusing space-time scheduling strategy," IEEETransactionsonVeryLargeScaleIntegration(VLSI)Systemsvol.20,pp.655-664,2012.

  • 11

    [35] S. Al-Azawi, S. Boussakta, and A. Yakovlev, "High precision and lowpower DCT architectures for image compression applications," in IETConferenceonImageProcessing(IPR),2012,pp.1-6.[36] A.AggounandI.Jalloh,"Two-dimensionalDCT/IDCTarchitecture,"IEEProceedingsonComputersandDigitalTechniques,vol.150,pp.2-10,2003.[37] J. I. Guo, "Efficient parallel adder based design for one-dimensionaldiscretecosinetransform,"IEEProceedingsonCircuits,DevicesandSystems,vol.147,pp.276-282,2000.[38] I. Jalloh, A. Aggoun, and M. McCormick, "3D DCT architecture forcompressionof integral3D images," in IEEEWorkshoponSignalProcessingSystems,SiPS:DesignandImplementation,2000,pp.238-244.[39] A. Aggoun and I. Jalloh, "A parallel 3D DCT architecture for thecompressionofintegral3Dimages,"inThe8thIEEEInternationalConferenceonElectronics,CircuitsandSystems(ICECS)2001,pp.229-232vol.1.[40] M. Bakr and A. E. Salama, "Implementation of 3D-DCT based videoEncoder/Decoder system," inMidwest Symposiumon Circuits and Systems,2002,pp.II13-II16.[41] S.Saponara,L.Fanucci,andP.Terreni,"Low-powerVLSIarchitecturesfor3Ddiscretecosinetransform(DCT),"inIEEE46thMidwestSymposiumonCircuitsandSystems,2003,pp.1567-1570Vol.3.[42] S.Saponara,"Real-timeandlow-powerprocessingof3Ddirect/inversediscrete cosine transform for low-complexity video codec," JournalofReal-TimeImageProcessing,pp.1-11,2012.[43] Y. Ikegaki,T.Miyazaki,andS.G.Sedukhin,"3D-DCTprocessorand itsFPGA implementation," IEICETransactionson InformationandSystems,vol.E94-D,pp.1409-1418,2011.[44] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Three dimensional DCTsimilar butterfly algorithm and its pipeline architectures," in 2016 IEEEInformation Technology, Networking, Electronic and Automation ControlConference,2016,pp.506-510.[45] S. Al-Azawi, "Efficient Architectures for Multidimensional DiscreteTransforms in Image and Video Processing Applications," PhD Thesis,NewcastleUniversity,UK,2013.[46] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Device-saving pipelinearchitecturesofmulti-dimensionalDCTsimilarbutterflyalgorithm," in2016International Conference on Integrated Circuits and Microsystems (ICICM),2016,pp.339-344.[47] J. A. Nikara, J. H. Takala, and J. T. Astola, "Discrete cosine and sinetransforms—regularalgorithmsandpipelinearchitectures,"SignalProcessing,vol.86,pp.230-249,2006.[48] O.AlshibamiandS.Boussakta,"Fastalgorithmforthe3DDCT,"inIEEEInternational Conference on Acoustics, Speech, and Signal ProcessingProceedings(ICASSP'01),2001,pp.1945-1948vol.3.[49] M.C. Lee,R. K.W.Chan, andD.A.Adjeroh, "Fast three-dimensionaldiscretecosinetransform,"SIAMJournalonScientificComputing,vol.30,pp.3087-3107,2008.[50] Xiph.org, "Video Test Media [derf's collection]," Xiph.org, [Online].Available:https://media.xiph.org/video/derf/Accessedon20/09/2016.[51] Q.Huynh-Thu andM.Ghanbari, "The accuracy of PSNR in predictingvideoqualityfordifferentvideoscenesandframerates,"TelecommunicationSystems,vol.49,pp.35-48,2012/01/012012.[52] H. R. Sheikh, A. C. Bovik, and G. d. Veciana, "An information fidelitycriterion for image quality assessment using natural scene statistics," IEEETransactionsonImageProcessing,vol.14,pp.2117-2128,2005.[53] S.McKeownandR.Woods,"Lowpowerfieldprogrammablegatearrayimplementationoffastdigitalsignalprocessingalgorithms:Characterisationandmanipulationofdatalocality,"IETComputersandDigitalTechniques,vol.5,pp.136-144,2011.

    SaadAl-AzawireceivedtheB.Sc.DegreeinElectrical Engineering from University ofBaghdadandM.Sc.degreeinElectronicandCommunication Engineering from Al-Mustansiriya University, Baghdad, Iraq. Hereceived his Ph.D. degree in Electrical andElectronic Engineering from Newcastle

    University, Newcastle Upon Tyne, England, 2013. AssistantProfessor Dr. Saad is currently working as a head of the

    departmentofElectronicEngineering,CollegeofEngineering,UniversityofDiyala,Diyala,Iraq.HisresearchinterestsincludeHardware architectures for signal and image processingalgorithms and transforms, Digital Image Processing, DigitalSignalProcessing.

    Omar Nibouche received his BEng degree inElectronic Engineering form the PolytechnicSchool of Algiers and Ph.D. degree inComputer Science from Queen's UniversityBelfast. He is a lecturer in computing in theSchoolofComputingandMathematics,UlsterUniversity at Jordanstown. His research

    interests include machine learning, applications of artificialintelligence,computervisionandbiometrics.

    Said Boussakta received the PhD degree inElectrical Engineering from NewcastleUniversity, U.K., in 1990. Since 1990, he hasbeen working in academia, fully involved inboth research and teaching. From2000-2006hewasattheUniversityofLeedsasaReaderin Digital Communications and Signal

    Processing. Since 2006, he has been with the School ofEngineering, Newcastle University as a Professor ofCommunications and Signal Processing, lecturing inCommunicationNetworksandSignalProcessingsubjects.Hehas supervised over 50 students to Ph.D. completion andpublished over 200 conference proceedings and journalarticles. His research interests are in the areas of fast DSPalgorithms,DigitalCommunications,CommunicationNetworkSystems, Cryptography, and Digital Signal/Image Processing.He has also served as Chair in conferences and presentedseveralinvitedtalksincommunications,signalprocessingandsecurity. Prof Boussakta is a Fellowof the IEE, and a SeniorMember of the Communications and Signal ProcessingSocieties.

    Gaye Lightbody hasbeena Lecturerwithin the School of Computing inUlster University since 2006. ShereceivedanM.Eng.(1995)andaPhD(2000) in Electrical and ElectronicEngineering fromQueen’sUniversityofBelfast.HerPhD,HighPerformanceVLSIArchitecturesforRecursiveLeast

    Squares Adaptive Filtering, involved the research anddevelopment into a scalable efficient architecture for highlyintensive adaptive beamforming applications. Gaye thenworked in industry from 2000 to 2006 for AmphionSemiconductorLimiteddevelopingintellectualpropertycoresforASICandFPGAsolutionsintheareasofaudio,imageandvideoprocessing.Sheisaco-authorofthebookFPGA-basedImplementation of Signal Processing Systems published byWiley in 2008. A second edition has been released in 2017which provides an update to reflect the latest iterations ofFPGA theory, applications, and technology. This revisionincludescoverageofFPGAsolutionsforBigDataApplications.

  • 12

    TABLE3.Hardwareutilisationrate,maximumoperatingfrequenciesandcomputationtimesforbotharchitecturesusingvariouswordlengths

    SliceLogicUtilization Available

    SPDFA DPDFA

    HardwareusageforwordlengthsizesHardwareusageforwordlengthsizes

    (12,8) (12,6) (12,4) (12,2) (12,8) (12,6) (12,4) (12,2)

    Hardwareusage

    NoofSliceRegisters 28,800 1840 1726 1593 1459 3691 3414 3120 2826

    NoofSliceLUTs 28,800 2351 2171 1991 1811 3309 3044 2781 2517

    NoofoccupiedSlices 7,200 743 607 588 605 1229 1096 1053 1008

    NoofbondedIOBs 480 30 28 26 24 30 28 26 24

    Noof36kBlockRAMused 60

    1 - - - 1 - - -

    Noof18kBlockRAMused 14 15 15 15 15 16 16 16NoofDSP48Es 48 9 8 8 8 16 12 12 12

    Averageutilizationrate 12% 12% 11% 11% 18% 16% 15% 15%

    Maximumfrequencies(MHz) 241 230 244 226 266 338 258 333

    Computationtimesfor512×512×8-pixeldata(ms) 8.7 9.1 8.6 9.3 7.9 6.2 8.1 6.3

  • 13

    Table4.ComparisontoSimilarArchitecturesintheLiterature

    Architectures Adders/Sub. Multipliers Memory Registers InitialLatencyComputationTime(cycles)

    ControllerComplexity DCTAlgorithm

    [38] 3" 3" "# " + 1 N/R* "& + 3" "& Simple Regular;Row-Column-Frame,cascaded

    [39]

    5"# + "2

    5"#2

    "&(transposememory)

    "registerbetweenthe2-DDCTandthe1-D-DCT-

    framedirection

    "& + 32" "# Complex

    Regular,Parallel;Row-Column-FrameN×N1-DDCT+1-DDCTforframedirection

    [40] 3" − 3 3" "# " + 1 N/R > "# 6"&** Medium Regular1-DDCT;Row-Column-Frame[42]

    FullParallel(FP) " 2" + 1 " 2" + 1 ∗∗∗ " "# + 1 N/R N/R 2"# Medium

    1-DDCTRadix2Row-Column-FrameCascaded(CS) 3" 3"

    ∗∗∗ "# " + 1 N/R N/R 2"& Simple

    HardwareMultiplexed(HM) " "

    ∗∗∗ "& N/R N/R 6"& Complex

    [43] InputMem.

    Sequential "& "& 2"& "& 3" + 4 "& 3" Complex

    Regular1DDCT;Row-Column-Frame

    Pipelined1 ≈ 2"& 2"& 2"& "& 3" + 4 "& 3" Complex

    Piplined2 ≈ 3"& 3"& 2"& "& 3" + 8 "& 3" Complex

    Block N&

    8 N&8 2N

    & ≈ 16N& 3N + 4 N& 3N Complex

    SPDFA 12 + 6234#" 234#""4 + " "

    #

    ForreorderingandBRO

    "& + 5"# + 5" + 4 ≅ 2"& "& Simple Vector-Radix3-DDCT

    DPDFA 20 + 6234#" 2234#""&

    ForreorderingandBRO

    3"&2 + 6"

    # + 11" + 8 ≅ 32"& "& Simple Vector-Radix3-DDCT

    *N/R:Notreportedintheirpaper.**Computedfor4×4×4datablock.***Multiplicationisperformedbyaserialdistributedarithmeticarchitecture.