New Fast and Area-Efficient Pipeline 3-D DCT Architectures
Saad Al-Azawi (a), Omar Nibouche (b), Said Boussakta (c), Gaye Lightbody (d)
(a) College of Engineering, Diyala University, Diyala, Iraq; (b, d) Faculty of Computing and Engineering, Ulster University, UK; (c) School of Engineering, Newcastle University
[email protected], [email protected], [email protected], [email protected]
Abstract
The efficient implementation of 3-D transforms is a challenging task due to the computation complexity, memory and area requirements of such transforms. One important 3-D transform is the 3-D Discrete Cosine Transform (3-D DCT) used in many image and video processing systems. In this paper, two new pipeline architectures for the 3-D DCT computation using the 3-D DCT Vector-Radix algorithm (3-D DCT VR) are presented. These architectures are scalable and parameterisable with regard to different wordlengths and pipelining levels. Their arithmetic component requirements are reduced to the order of $O(\log_2 N)$, in contrast with $O(N)$ for 3-D DCT architectures in the literature, while at the same time they keep similar or better area-time complexity.
Keywords: 3-D Discrete Cosine Transform (DCT), Row-Column (RC), Row-Column-Frame (RCF), Vector-Radix, FPGA
1. Introduction
Transforms such as the Fourier Transform (FT) [1-5], Wavelet Transform (WT) [6-9], and Cosine Transform (CT) [10-14] play a critical part in various Digital Signal Processing (DSP) applications, including audio, image and video systems. Much of the usefulness of these transforms arises from their frequency and time-frequency representations and from properties including decorrelation, energy compactness, and the availability of fast algorithms for their computation. Nevertheless, even the fast algorithms that implement these transforms are still very computationally intensive. Thus, these transforms can become a bottleneck in terms of the system's speed, and contribute greatly to its area usage and power consumption [1-3, 6-8, 10-19]. For its role in many image and video applications, including the JPEG, MPEGx and H.26x compression standards, the DCT has received a great deal of research interest [20-24]; the 1-D and 2-D DCT are now the established transforms for many applications and standards. Further, there are many new and emerging applications for the 3-D DCT, including visual tracking, video coding and watermarking [25-29].
Numerous 1-D and 2-D DCT architectures have been suggested in the literature [30-39]. Exploiting the separability principle of the transform, 2-D DCT cores based on the 1-D DCT Row-Column (RC) approach are suggested in [33-36]; yet very few architectures that implement the 3-D DCT can be found [38-45]. Traditionally, the 3-D DCT has been implemented by cascading stages of the 1-D DCT as in the well-known Row-Column-Frame (RCF) approach. Noteworthy differences between architectures in the literature are their level of parallelisation in terms of the number of stages and the number of 1-D DCT cores per stage, which leads to different trade-offs between circuit complexity and throughput. One common architecture employs three stages of one 1-D DCT core and an $(N^3+N^2)$-word transpose memory [40-42]. Parallelisation can be applied to the first two 1-D DCT cores, which in fact implement a 2-D DCT transform, leading to the utilisation of $2N+1$ 1-D DCT processors and an $(N^3+N)$-word memory [41, 42]. Another class of 3-D DCT architectures multiplexes the 1-D DCT transforms involved in its computation onto a single 1-D DCT architecture. Such a class of architecture requires an $N^3$-word memory [41, 42]. The reduction achieved in hardware utilisation comes at the cost of a lower throughput. Using three 1-D DCT cores to implement the 3-D DCT achieves a throughput three times higher than when employing a single 1-D DCT processor. The throughput is increased N-fold via parallelisation of the 1-D DCT processors [42]. The 1-D DCT cores employed in the 3-D DCT architecture can use the transform's fast algorithm, distributed arithmetic or ROM-based designs [38]. Such architectures exhibit irregular structures, lack of modularity, and complex control. Another class of 3-D DCT architecture relies solely on the systolic approach with its well-established design methodology [43]. In [44, 46], high speed and low complexity pipeline n-D DCT architectures are proposed using the regular 1-D DCT and tensor product operations. The proposed architectures are based on the 1-D and 2-D DCT architectures in [47].
In this paper, two new pipeline and scalable architectures that implement the 3-D Discrete Cosine Transform Vector-Radix (3-D DCT VR) algorithm are introduced. The presented architectures are parameterisable in terms of wordlength and pipeline stages. Further, no block memory is used for data transposition. These architectures have been implemented and tested; for instance, an FPGA-based implementation processing 512×512×8-word data using a transform length of 8×8×8-word cube size and a 14-bit wordlength has achieved a working frequency of 330 MHz and a processing time of 6.4 ms. Thus, 80000 frames can be processed every second.
The remainder of this paper is organised as follows. In section 2, the background of the DCT and the 3-D DCT VR algorithm is provided. Sections 3, 4 and 5 present the two new architectures for 3-D DCT computation. The results obtained are discussed in section 6 and conclusions are given in section 7.
2. Background and the 3-D DCT VR Algorithm
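The 3-D DCT coefficients of an N×N×N data cube are computed as follows:
$$X(k_1,k_2,k_3) = \frac{8\,\epsilon_{k_1}\epsilon_{k_2}\epsilon_{k_3}}{N^3}\sum_{n_1=0}^{N-1}\sum_{n_2=0}^{N-1}\sum_{n_3=0}^{N-1} x(n_1,n_2,n_3)\cos\frac{(2n_1+1)\pi k_1}{2N}\cos\frac{(2n_2+1)\pi k_2}{2N}\cos\frac{(2n_3+1)\pi k_3}{2N}$$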
For the remaining combinations of odd/even indices $k_1$, $k_2$ and $k_3$, one can divide the computation of the transform as follows:
$$X(2k_1, 2k_2, 2k_3+1) = \sum_{n_1=0}^{M}\sum_{n_2=0}^{M}\sum_{n_3=0}^{M}\left[2\,\tilde{x}_{001}(n_1,n_2,n_3)\cos\phi_3\right]\cdots$$
The set of equations (6)-(13) represents a single butterfly computation of a Decimation In Frequency (DIF) VR algorithm as shown in Figure 2. It computes eight points; a butterfly receives an N×N×N data cube at its input and outputs 8 data cubes of N/2×N/2×N/2-word each; the process can be repeated until $N^3/8$ data cubes of 2×2×2-word each are computed. Thus, the flow graph of the whole butterfly computation consists of $\log_2 N$ stages with $N^3/8$ butterflies per stage [25]. The output from the last butterfly stage is fed to the post addition stages and is then bit-reversed. The post addition operations, shown by the terms outside braces in the set of equations (6)-(13), are then carried out. Further to the reduction in arithmetic complexity and processing time, the 3-D DCT VR algorithm does not require transpose memory and exhibits a regular butterfly structure which is more suitable for hardware and software implementation than the RCF approach. Further details about the 3-D DCT VR algorithm can be found in [25].
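
To make the contrast between the direct 3-D DCT and the RCF approach concrete, the following sketch evaluates both on a small cube and checks that they agree. It is only an illustration: it assumes the standard DCT-II cosine kernel and $\epsilon_0 = 1/\sqrt{2}$, $\epsilon_k = 1$ otherwise, and the function names `dct_coeff`, `dct3_direct` and `dct3_rcf` are hypothetical. It does not implement the VR algorithm itself.

```python
import math
from itertools import product

def dct_coeff(N):
    """c[k][n] = (2/N) * eps_k * cos((2n+1) k pi / (2N)), with eps_0 = 1/sqrt(2) (assumed)."""
    return [[(2.0 / N) * (1 / math.sqrt(2) if k == 0 else 1.0)
             * math.cos((2 * n + 1) * k * math.pi / (2 * N))
             for n in range(N)] for k in range(N)]

def dct3_direct(x):
    """Direct evaluation: a triple sum over the whole cube for every coefficient."""
    N = len(x)
    c = dct_coeff(N)
    X = {}
    for k1, k2, k3 in product(range(N), repeat=3):
        X[k1, k2, k3] = sum(x[n1][n2][n3] * c[k1][n1] * c[k2][n2] * c[k3][n3]
                            for n1, n2, n3 in product(range(N), repeat=3))
    return X

def dct3_rcf(x):
    """Row-column-frame evaluation: one 1-D DCT pass along each dimension."""
    N = len(x)
    c = dct_coeff(N)
    r = range(N)
    # Pass along n3, result indexed [n1][n2][k3]
    a = [[[sum(c[k3][n3] * x[n1][n2][n3] for n3 in r) for k3 in r] for n2 in r] for n1 in r]
    # Pass along n2, result indexed [n1][k2][k3]
    b = [[[sum(c[k2][n2] * a[n1][n2][k3] for n2 in r) for k3 in r] for k2 in r] for n1 in r]
    # Pass along n1, result keyed by (k1, k2, k3)
    return {(k1, k2, k3): sum(c[k1][n1] * b[n1][k2][k3] for n1 in r)
            for k1, k2, k3 in product(r, repeat=3)}
```

The product of the three 1-D scale factors $(2/N)\epsilon_k$ gives the $8\epsilon_{k_1}\epsilon_{k_2}\epsilon_{k_3}/N^3$ factor of the 3-D definition, which is why the two routines return the same coefficients.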
3. New 3-D DCT Vector Radix Architectures
Two new architectures are presented; namely the Single Path Data Flow 3-D DCT Architecture (SPDFA) and the Dual Path Data Flow 3-D DCT Architecture (DPDFA). The difference between them lies in the number of words fed to the adjacent butterfly and how the arithmetic operations are scheduled within each butterfly stage; this has led to the derivation of two structures with different hardware requirements.
Both architectures are built according to the generic block diagram of Figure 1. The butterfly calculation consists of $\log_2 N$ parameterised and scalable stages as described by the set of equations (6)-(13) and illustrated in Figure 2. The data reordering is common to both SPDFA and DPDFA; however, the internal architecture of the butterfly, the post addition stages and the 3-D Bit Reverse Ordering (3-D BRO) stage are architecture-dependent. Of the two presented structures, SPDFA exhibits a single line of data between neighbouring butterfly stages. It is more efficient in memory usage as intermediate results are fed back to memory elements; however, these feedback loops prevent any further pipelining. DPDFA is a dual-path data flow feed-forward architecture. There are two data lines between adjacent butterflies and further pipelining is a simple task; the architecture however requires more memory than SPDFA.
The presented architectures both partition the input sequence into cubes of N×N×N-word, or N blocks of N×N-word. The input data is reordered according to (2). The reordering process is performed by shuffling words along the row, column and frame dimensions. It includes dividing data into odd-indexed and even-indexed words and retrograde indexing. As an example, for N indices arranged as "0, 1, 2, 3, 4, 5, 6, …, N-1", the reordered sequence will be "0, 2, 4, 6, …, N-2, N-1, N-3, N-5, …, 1". This stage is implemented using a dual-port block RAM, which permits writing and reading operations to be performed on different locations during the same cycle. Thus, for an N×N×N-word cube, the memory size required for the reordering operation is $(N/2+1)N^2$-word with a latency of $N^3/2$ cycles, as only writing operations are carried out during this period.
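
The index shuffle in the example above can be written down in a few lines. The sketch below is a software illustration only, assuming an even N; the names `reorder_indices` and `reorder_cube` are hypothetical, and it says nothing about how the dual-port RAM addressing is actually generated in hardware.

```python
def reorder_indices(N):
    """Even indices in ascending order, followed by odd indices in descending order."""
    return list(range(0, N, 2)) + list(range(N - 1, 0, -2))

def reorder_cube(x):
    """Apply the same 1-D shuffle along the row, column and frame dimensions."""
    N = len(x)
    p = reorder_indices(N)
    return [[[x[p[a]][p[b]][p[c]] for c in range(N)]
             for b in range(N)]
            for a in range(N)]
```

For N = 8 the shuffle gives 0, 2, 4, 6, 7, 5, 3, 1, which matches the pattern "0, 2, 4, 6, …, N-2, N-1, N-3, N-5, …, 1" quoted in the text.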
4. Single Path Data Flow 3-D DCT Architecture
SPDFA is composed of a 3-D reordering stage, $\log_2 N$ butterfly computation stages, three post addition sub-stages and a 3-D Bit Reverse Order (3-D BRO) stage.
4.1 Butterfly Stages
The reordered data from the 3-D reordering stage is fed to the butterfly stages at a rate of one word per clock cycle. As shown in Figure 3, $\log_2 N$ butterfly stages ($m = 1, 2, 3, \ldots, \log_2 N$) are used. Each butterfly stage can be further divided into three sub-stages and a multiplier as shown in Figure 4. A sub-stage consists of two add/subtract elements for carrying out addition and subtraction operations between the two halves of each input along the three dimensions of data, a register and a switch. The multiplier is used to multiply the output words by appropriate Twiddle Factors (TFs) which are pre-computed and stored in a lookup table (LUT).
Figure 2. Single butterfly of the 3-D DCT DIF VR algorithm
For the sake of explanation, the words $x(n_1, n_2, n_3)$ at the input of the first butterfly stage can be indexed as $x(n_1 + n_2 \times N + n_3 \times N^2)$. The first sub-stage performs addition and subtraction between the two halves of the input data cube; the first half contains words indexed from 0 to $N^3/2 - 1$ and the second half the words with indices from $N^3/2$ to $N^3 - 1$. During the first $N^3/2$ clock cycles, the first half of the data is stored in Register 1 (of length $N^3/2$-word) before being fed to the adders along with the input data from the second half during the next $N^3/2$ cycles. During this period, the subtraction results are stored in Register 1 while the addition results are fed to the next sub-stage. Once this is completed, the subtraction results stored in Register 1 are fed in turn to the next sub-stage. The selection of which part of the data is stored in Register 1, fed to the adders or fed to the next sub-stage is managed by the control signal of Switch 1, which changes its value every $N^3/2$ cycles.
What the first sub-stage carries out on cubes of data, the second sub-stage performs on blocks of N×N-word of the same data cube. Omitting changes along $n_3$, each block is again divided into two halves; one half includes indices $n_1 + n_2 \times N$ from 0 to $N^2/2 - 1$ while the second half includes words of the same block with indices from $N^2/2$ to $N^2 - 1$. The behaviour of the second sub-stage is similar to the first one, except that Register 2 is of length $N^2/2$-word and the period of the control signal for Switch 2 is $N^2$ cycles with a duty cycle of 50%. The third sub-stage implements addition and subtraction between the two halves of each column in each block using Register 3 (of length $N/2$-word). Omitting changes of $n_3$ and $n_2$, the data in each column is divided into two halves with indices ranging from 0 to $N/2 - 1$ and from $N/2$ to $N - 1$. The words of the first half of the column are stored in Register 3; they are then fed to the adders along with the column's second half. The results of the addition operation are multiplied by the appropriate TFs. After this, the results of the subtraction operation, which were first stored in Register 3, are fed to the multiplier for multiplication by the TFs. The multiplier output is input to the next butterfly stage. As with sub-stages 1 and 2, Switch 3 multiplexes data and its control signal is periodic and changes its value every $N/2$ cycles.
In the general case of the mth butterfly stage, data is split into $2^{3m-3}$ cubes of $(N/2^{m-1})^3$ words; the first butterfly sub-stage is used to perform the addition and subtraction operations between the two halves of each input data cube; the words involved are indexed from 0 to $N^3/2^m - 1$ and from $N^3/2^m$ to $N^3/2^{m-1} - 1$. During the first $N^3/2^m$ cycles, the data cube's first half is stored in Register 1; in the next $N^3/2^m$ cycles, the addition and subtraction operations take place; the results of the addition operation are fed to the adjacent butterfly sub-stage while the results of the subtraction operations are stored in Register 1 before being fed to the adjacent sub-stage in the next $N^3/2^m$ cycles. In a similar way and with appropriate switching, the second and third butterfly sub-stages implement the addition and subtraction operations between the first and second halves of the data blocks and columns, respectively. The lengths of Register 2 and Register 3 are $N^2/2^m$-word and $N/2^m$-word, respectively, which allows for storing half of each block and column of data as appropriate. The multiplication by a twiddle factor (TF) takes place once all arithmetic operations of sub-stage 3 have been carried out.
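
The store, add/subtract and flush schedule described above for one SPDFA sub-stage can be mimicked in software. The sketch below is a behavioural model only, not the hardware description: it assumes the input length is a multiple of twice the register length D, the function name `sdf_substage` is hypothetical, and the twiddle-factor multiplier is left out.

```python
from collections import deque

def sdf_substage(stream, D):
    """Single-path sub-stage with a D-word feedback register (behavioural model)."""
    reg = deque([None] * D)   # feedback register, modelled as a FIFO of length D
    out = []
    for t, word in enumerate(stream):
        head = reg.popleft()
        if (t // D) % 2 == 0:
            # Switch low: store the incoming first half and, at the same time,
            # flush the differences left in the register by the previous group.
            reg.append(word)
            if head is not None:
                out.append(head)
        else:
            # Switch high: the sum leaves the sub-stage, the difference is fed back.
            out.append(head + word)
            reg.append(head - word)
    # Drain the differences of the last group.
    out.extend(v for v in reg if v is not None)
    return out

def first_butterfly_addsub(stream, N):
    """Cascade of the three sub-stages of the first butterfly (m = 1), without TFs."""
    for D in (N ** 3 // 2, N ** 2 // 2, N // 2):
        stream = sdf_substage(stream, D)
    return stream
```

For a 2×2×2 cube fed in as the words 1 to 8, the cascade uses registers of 4, 2 and 1 words, and the first word to leave is the sum of all eight inputs.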
Figure 3. The block diagram of butterfly stages (reordered data in; Butterfly Stage 1 (m = 1), Butterfly Stage 2 (m = 2), …, Butterfly Stage $\log_2 N$ (m = $\log_2 N$); out)
Figure 4. SPDFA mth butterfly internal architecture (Sub-stage 1: Switch 1 and Register 1 of length $N^3/2^m$; Sub-stage 2: Switch 2 and Register 2 of length $N^2/2^m$; Sub-stage 3: Switch 3 and Register 3 of length $N/2^m$, followed by the TF multiplier)
Figure 5. a. A parameterised post addition sub-stage for SPDFA. b. A post addition stage and 3-D BRO (post addition sub-stage 1: P = N; sub-stage 2: P = 1; sub-stage 3: P = $N^2$)
4.2 Post Addition and 3-D BRO Stages
The third part of SPDFA is the post addition stage, which performs the computation of the terms outside the curly brackets in (6)-(13). Reflecting the three dimensions of the input data, the post addition stage can be divided into three sub-stages; each sub-stage carries out addition operations over a given dimension. In the first, second and third post addition sub-stages, the addition operations are carried out within the same N×N-word block, the same column or the same data cube, respectively. Hence, the length of the registers used in Figure 5, labelled as parameter P, may vary. Still, the internal architecture of each sub-stage is identical.
The output of the third post addition sub-stage is fed to the 3-D BRO stage, which performs data reordering as the fast algorithm used introduces a bit reversal permutation on the binary indices of the results. Bit reversal is performed along the row, column and frame directions in each N×N×N-word data cube using a regular bit reversal algorithm [25]. The output from this stage represents the 3-D DCT coefficients of the input data. It is implemented using a $(3N/4 - 1)N^2$-word dual-port block RAM. This stage is placed next to the post addition stage to act as a buffer for the subsequent system if required; for instance, it can be integrated with a quantizer as in conventional data compression algorithms.
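
The bit reversal applied along each dimension can be illustrated as follows. This is a plain software sketch of the permutation, assuming N is a power of two; the names `bit_reverse` and `bro_3d` are hypothetical and the block RAM addressing used in the architecture is not modelled.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = r * 2 + (i % 2)
        i //= 2
    return r

def bro_3d(x):
    """Bit-reverse the indices along the row, column and frame dimensions of a cube."""
    N = len(x)
    bits = N.bit_length() - 1            # log2(N) when N is a power of two
    p = [bit_reverse(i, bits) for i in range(N)]
    return [[[x[p[a]][p[b]][p[c]] for c in range(N)]
             for b in range(N)]
            for a in range(N)]
```

Bit reversal is its own inverse, so applying `bro_3d` twice returns the original cube; for N = 8 the index order along each dimension is 0, 4, 2, 6, 1, 5, 3, 7.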
5. Dual Path Data Flow 3-D DCT Architecture
DPDFA is a dual data path architecture for the 3-D DCT VR computation. It is devised to produce a high speed 3-D DCT architecture which can be easily retimed and pipelined. DPDFA consists of 3-D data reordering, butterfly stages, post addition and 3-D BRO stages. The 3-D data reordering stage is the same as that presented earlier in the paper.
5.1 Butterfly Stages
The scheduling of arithmetic operations in DPDFA is different from SPDFA. Rather than feeding the intermediate results of the subtraction operations back to the same sub-stage register as in SPDFA, the results of the addition and subtraction are fed forward to the next sub-stage or to the next stage. This reduces the time during which registers are utilised for storing partial results, adds another line of data for communication between adjacent stages and sub-stages, and hence increases the number of required multipliers to cope with the computation of two coefficients per clock cycle. However, this simplifies pipelining and retiming in the DPDFA.
DPDFA comprises $\log_2 N$ butterfly stages; each can be divided into three sub-stages, registers, switches and two multipliers. A generic sub-stage consists of two add/subtract elements for carrying out addition and subtraction operations between the two halves of each input along the three dimensions of data. It also contains two registers and a switch for data ordering and multiplexing; the exception is the first sub-stage of the first butterfly, which utilises only one register as shown in Figure 6. The internal architecture of the first butterfly takes into account the fact that data is received at its input at the rate of one word per clock cycle and is then stored and processed at the rate of two words per clock cycle.
The first sub-stage performs addition and subtraction between the two halves of the input data cube; Register 1 stores the first half, which contains words with indices from 0 to $N^3/2 - 1$, then feeds it to the adder components during the next $N^3/2$ cycles, when the second part of the data cube, which contains words indexed from $N^3/2$ to $N^3 - 1$, is also available at the input of the adder components. Data multiplexing is carried out using Switch 1, which is used to twofold parallelise the serial input. Its control signal is periodic with a period of $N^3$ cycles and a 50% duty cycle.
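
The feed-forward behaviour of this first sub-stage can be sketched as a serial-in, two-lines-out model. It is an illustrative software model under simplifying assumptions (one cube per call, no timing of the switch control signal), and the function name `dpdfa_first_substage` is hypothetical.

```python
def dpdfa_first_substage(cube_stream):
    """Model of sub-stage 1 of the first DPDFA butterfly: store half, then add and subtract."""
    half = len(cube_stream) // 2
    register1 = []
    sum_line, diff_line = [], []
    for t, word in enumerate(cube_stream):
        if t >= half:
            # Second half: the stored word and the incoming word reach the
            # add/subtract elements together, so two results leave per cycle.
            stored = register1[t - half]
            sum_line.append(stored + word)
            diff_line.append(stored - word)
        else:
            # First half: words are only written into Register 1.
            register1.append(word)
    return sum_line, diff_line
```

Unlike the single-path model, the differences are not written back into the register; both result streams move forward on separate lines, which is what makes later retiming and pipelining straightforward.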
In sub-stage 2, the registers Register 2 and Register 3, and Switch 2, re-order data with the aim of implementing the addition and subtraction operations in each block of data; a block is divided into two halves; words with indices from 0 to $N^2/2 - 1$ are stored in Register 3 while the second half, which includes words of the same block with indices from $N^2/2$ to $N^2 - 1$, is stored in Register 2. The flow of data between sub-stages 1 and 2 and the selection of where and when results are stored in Register 2 and Register 3 is carried out by Switch 2. This switch has a 50% duty cycle control signal with a period of $N^2$ cycles.
Just as sub-stage 2 processes blocks of data of the same cube, the third sub-stage carries out the addition and subtraction operations on columns of data belonging to the same block of data. For $N/2$ cycles, the addition results of sub-stage 2 are fed to Register 5; the subtraction operation results stored in Register 4 are fed to the adder components of sub-stage 3; during the next $N/2$ cycles, Register 4 is connected to Register 5 while the results of the addition operation of sub-stage 2 are fed to the adder components of sub-stage 3. Both Register 4 and Register 5 are of a length of $N/2$-word. Switch 3, which allows for data switching and controls the flow of partial results in sub-stage 3, has a periodic control signal which changes its value every $N/2$ cycles. Once all addition and subtraction operations have been carried out by the three sub-stages of the first butterfly, two further tasks have to be carried out, namely the multiplication by the appropriate TFs and re-arranging data in an order suitable for the next butterfly stage operations to be executed.
Figure 6. a. The first butterfly of DPDFA, b. The mth butterfly of DPDFA
Re-arranging data in SPDFA butterflies is simply carried out by feedback registers. However, to re-arrange data in DPDFA one has to cancel out the data order engendered by the selection and switching behaviour of Switch 2, Switch 3, Register 2, Register 3, Register 4 and Register 5. The design approach adopted in this work is to use the same set-up of registers and switches to re-arrange the order of data and then to retime for memory optimization. Hence, the behaviour of Switch 4 and Switch 5 is similar to that of Switch 2 and Switch 3, respectively. The impact of using retiming is shown in the length of Register 7 of the first butterfly of Figure 6.a and in the length of Register 1 in the first sub-stage of the second butterfly stage illustrated in Figure 6.b. Hence the order of data when it leaves the first butterfly is similar to its order at the adder elements of the first sub-stage.
In the general case, the two data inputs presented at the mth butterfly stage are the two halves of the $N^3$-word cube; however, each half is ordered as $2^{m-2}$ sets of $2^{m-1} \times 2^{m-1}$ interleaved data cubes of size $(N/2^{m-1})^3$ words. The control signals of all switches in Figure 6.b are periodic with a 50% duty cycle. The control signal periods of Switch 1, Switch 2 and Switch 3 are $N^3/2^{m-1}$ cycles, $N^2/2^{m-1}$ cycles and $N/2^{m-1}$ cycles, respectively. By carefully controlling the flow of partial results into Register 1, Register 2, Register 3, Register 4, Register 5 and Register 6 in Figure 6.b, all the addition and subtraction operations can be carried out along the three dimensions of the data. Switch 4 and Switch 5 share the control signals of Switch 3 and Switch 2, respectively. Their switching behaviour and the use of Register 7 and Register 8 re-arrange data to the same order in which it was received at the input of the adder elements of sub-stage 1; the multiplication operations can then take place.
5.2 Post Addition Stage and 3-D BRO Stages
The post addition stage can be divided into three sub-stages. To cope with processing two words per clock cycle, the first two sub-stages are built using two of the sub-stages shown in Figure 5.a; the third sub-stage is depicted in Figure 7.a and is composed of five add/subtract elements, registers and a multiplexer. The post addition sub-stages are parameterised. The parameter P shown in Figure 7 refers to the length of the registers used.
As with SPDFA, the 3-D BRO stage is required to re-order the output; it adjusts for the bit reversal permutation engendered by the fast transform algorithm used. The 3-D BRO is implemented using an $(N/2 - 1)N^2$-word dual-port block RAM. This memory element can be merged with systems where the presented 3-D DCT core is used.
Figure 7. a. Third post addition sub-stage for DPDFA. b. A post addition stage and 3-D BRO for DPDFA (two input lines In0 and In1; post addition sub-stage 1: P = N; sub-stage 2: P = 1; sub-stage 3: P = $N^2$)
6. Performance Evaluation
The presented architectures have been designed using the Xilinx System Generator tool and they have been tested and implemented on a Xilinx Virtex 5 5vlx50tff1136-3 FPGA device. Various video sequences and wordlengths have been used to test and evaluate the performance and attributes of the presented architectures.
6.1 Test Bench and Rate Distortion Performance
$DCT_A$ represents the 3-D DCT of each frame computed using the presented architectures; as they implement the same algorithm, both structures exhibit virtually the same accuracy, as shown in Table 1 and Table 2. When employing $DCT_A$, the annotation (x, y) refers to a fixed-point wordlength of x+y bits, where x and y represent the number of bits of the integer and fractional parts, respectively. Results of the $DCT_A$ implementation using (12, 8), (12, 4) and (12, 2)-bit wordlengths are shown in this section. In comparison, $DCT_M$ represents coefficients calculated using Matlab code that implements the 3-D DCT, and $IDCT_M$ represents a Matlab implementation of the inverse 3-D DCT. Both the $IDCT_M$ and $DCT_M$ Matlab implementations are floating-point based. For testing and validation purposes, $DCT_M$ and $DCT_A$ have been applied to various MRI and video sequences of 512×512×8-word [50]; the $IDCT_M$ is then applied to the output of $DCT_A$ to yield reconstructed frames. The peak signal-to-noise ratio (PSNR) and root mean square error (RMSE) are used for evaluating the accuracy of the output of the presented architectures. The RMSE between the original and reconstructed frames is defined as:
$$\mathrm{RMSE}(k) = \sqrt{\frac{1}{P \times Q}\sum_{i=1}^{P}\sum_{j=1}^{Q}\left[IDCT_M\big(DCT_A(i,j,k)\big) - I(i,j,k)\right]^2} \qquad (14)$$
where $I(i,j,k)$ is the original frame and $k$, in the range $1 \le k \le F$, is the frame index. $F$ is the number of frames in $I$; $P$ and $Q$ are the number of its rows and columns, respectively. In addition, the PSNR between the original and reconstructed frames is computed as follows [51]:
$$\mathrm{PSNR}(k) = 10\log_{10}\left[\left(\frac{\max(I_k)}{\mathrm{RMSE}(k)}\right)^{2}\right] \qquad (15)$$
where $\max(I_k)$ represents the maximum intensity value of the kth frame. Further, the average of the maximum absolute error (AvgMaxErr) of the coefficients for the presented 3-D DCT architectures, $DCT_A$, in comparison to the Matlab implementation $DCT_M$ is computed as:
$$\mathrm{MaxErr}(k) = \max\Big(\mathrm{abs}\big(DCT_M(i,j,k) - DCT_A(i,j,k)\big)\Big) \qquad (16)$$
$$\mathrm{AvgMaxErr} = \frac{1}{F}\sum_{k=1}^{F}\mathrm{MaxErr}(k) \qquad (17)$$
where $\mathrm{MaxErr}(k)$ represents the maximum absolute error for each frame $k$.
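
The three quality measures defined in (14)-(17) translate directly into code. The sketch below is a plain Python illustration on frames stored as nested lists; the function names are hypothetical, and the usage example assumes an 8-bit peak intensity of 255, which the text does not state.

```python
import math

def rmse(orig, recon):
    """Root mean square error between one original and one reconstructed frame, as in (14)."""
    P, Q = len(orig), len(orig[0])
    squared = sum((recon[i][j] - orig[i][j]) ** 2 for i in range(P) for j in range(Q))
    return math.sqrt(squared / (P * Q))

def psnr(orig, recon):
    """PSNR in dB of one frame, as in (15); infinite when the frames are identical."""
    e = rmse(orig, recon)
    if e == 0:
        return math.inf
    peak = max(max(row) for row in orig)
    return 10 * math.log10((peak / e) ** 2)

def avg_max_err(coeff_ref, coeff_test):
    """Average over frames of the maximum absolute coefficient error, as in (16)-(17)."""
    per_frame = [max(abs(a - b) for ra, rb in zip(fr, ft) for a, b in zip(ra, rb))
                 for fr, ft in zip(coeff_ref, coeff_test)]
    return sum(per_frame) / len(per_frame)
```

As a rough plausibility check under the 255-peak assumption, an RMSE of 0.28 corresponds to a PSNR of about 59 dB, which is the order of magnitude reported for the (12, 4)-bit wordlength.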
Performance accuracy, for both presented architectures, was studied over a selection of implementation wordlengths. Table 1 and Table 2 show that the PSNR increases when the fractional part increases for SPDFA and DPDFA, respectively. As expected, the highest accuracy is obtained using a 20-bit wordlength (namely, (12, 8)-bit), for which the reconstructed frames match the originals (a PSNR of ∞ and an RMSE of ≈0). The presented architectures produce very good image quality using all the selected wordlengths. The average PSNR of the eight test sequences for SPDFA is ∞, 57 and 45 dB using (12, 8), (12, 4) and (12, 2)-bit wordlengths, respectively. DPDFA achieves very comparable results. Further, in Table 1 and Table 2, the AvgMaxErr values of both architectures are almost identical. For the purpose of using visual inspection as a subjective fidelity criterion [52], the images of the original and reconstructed MRI2 scans are shown in Figure 8. For both architectures, the reconstructed images are computed using wordlengths of (12, 2), (12, 4) and (12, 8) bits. It can be noticed that the (12, 2)-bit wordlength produced a good quality image where the visual error can hardly be noticed; longer wordlengths however lead to a much better quality.
6.2 Area Usage and Computation Time
The hardware usage, speed and computation time of both architectures using different wordlengths are shown in Table 3. It is important that the presented architectures are efficient in terms of area usage, in particular in resource-limited devices such as FPGAs. As can be seen from Table 3, the average device resource usage of SPDFA and DPDFA is as low as 12% and 18%, respectively. The hardware usage of DPDFA is higher than that of SPDFA due to duplicate circuitry for the multiplication, addition and post addition stages. However, this extra hardware usage and the fact that it has no feedback loops improve the maximum operating frequency of DPDFA over SPDFA. It is easier to place and route the components of DPDFA, including the FPGA device-specific resources such as the DSP elements for the implementation of arithmetic operations. As such, the computation time of 512×512×8-word data in DPDFA is shorter than that of SPDFA. It is worth pointing out that the memory requirements of both architectures are low in comparison with other architectures due to the in-place computation and the low memory requirement for the BRO and 3-D reordering operations. Memory elements of $5N^2$-word and $3N^2$-word have been used for the 3-D BRO in SPDFA and DPDFA, respectively. Further, a memory of $5N^2$-word for the 3-D reordering operation has been used in both architectures. Thus, the total block memory used in each architecture is less than 25% of the available memory resources of the 5vlx50tff1136-3 FPGA device.
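
The $5N^2$ and $3N^2$ figures quoted above follow from the general memory expressions given earlier for the reordering and BRO stages when N = 8. The short sketch below simply evaluates those expressions; the function names are hypothetical and N is assumed to be a multiple of 4 so that the integer divisions are exact.

```python
def reorder_words(N):
    """Reordering memory, (N/2 + 1) * N^2 words, common to both architectures."""
    return (N // 2 + 1) * N * N

def bro_words_spdfa(N):
    """3-D BRO memory of SPDFA, (3N/4 - 1) * N^2 words."""
    return (3 * N // 4 - 1) * N * N

def bro_words_dpdfa(N):
    """3-D BRO memory of DPDFA, (N/2 - 1) * N^2 words."""
    return (N // 2 - 1) * N * N

def total_block_memory_words(N, dual_path=False):
    """Reordering plus BRO memory for the chosen architecture."""
    bro = bro_words_dpdfa(N) if dual_path else bro_words_spdfa(N)
    return reorder_words(N) + bro
```

With N = 8 this gives 320 words for reordering, 320 words for the SPDFA BRO and 192 words for the DPDFA BRO, that is 640 and 512 words of block memory in total.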
Table 1. Accuracy and distortion performance of SPDFA (PSNR and RMSE: reconstructed and original frames; AvgMaxErr: 3-D DCT coefficients)
Video    | PSNR (dB)              | RMSE                   | AvgMaxErr
         | (12,8)  (12,4)  (12,2) | (12,8)  (12,4)  (12,2) | (12,8)  (12,4)  (12,2)
MRI1     | ∞       59      48     | ≈0      0.28    1.05   | 0.01    0.18    0.77
MRI2     | ∞       56      45     | ≈0      0.40    1.49   | 0.01    0.23    0.88
Akiyo    | ∞       58      47     | ≈0      0.31    1.16   | 0.01    0.21    0.89
Stefan   | ∞       56      45     | ≈0      0.40    1.48   | 0.01    0.23    0.97
Suzie    | ∞       56      45     | ≈0      0.40    1.46   | 0.01    0.21    0.90
Bus      | ∞       56      45     | ≈0      0.40    1.49   | 0.02    0.23    0.92
Flower   | ∞       56      45     | ≈0      0.39    1.45   | 0.02    0.23    0.97
Mobile   | ∞       56      45     | ≈0      0.40    1.49   | 0.01    0.21    0.92
Average  | ∞       57      45     | ≈0      0.37    1.38   | 0.01    0.22    0.90
Table 2. Accuracy and distortion performance of DPDFA (PSNR and RMSE: reconstructed and original frames; AvgMaxErr: 3-D DCT coefficients)
Video    | PSNR (dB)              | RMSE                   | AvgMaxErr
         | (12,8)  (12,4)  (12,2) | (12,8)  (12,4)  (12,2) | (12,8)  (12,4)  (12,2)
MRI1     | ∞       60      48     | ≈0      0.25    1.05   | 0.01    0.18    0.77
MRI2     | ∞       57      45     | ≈0      0.36    1.49   | 0.01    0.23    0.86
Akiyo    | ∞       59      47     | ≈0      0.28    1.16   | 0.01    0.21    0.89
Stefan   | ∞       57      45     | ≈0      0.35    1.48   | 0.02    0.23    0.97
Suzie    | ∞       57      45     | ≈0      0.35    1.46   | 0.01    0.21    0.91
Bus      | ∞       57      45     | ≈0      0.35    1.49   | 0.02    0.23    0.93
Flower   | ∞       57      45     | ≈0      0.35    1.45   | 0.02    0.23    0.97
Mobile   | ∞       57      45     | ≈0      0.35    1.40   | 0.01    0.21    0.92
Average  | ∞       58      45     | ≈0      0.33    1.37   | 0.01    0.22    0.90
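Figure 8. The original and reconstructed MRI2 using both architectures for various wordlength sizes (panels: original MRI2; reconstructions by SPDFA and by DPDFA with (x, y) = (12, 2), (12, 4) and (12, 8))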
6.3 Dynamic Power Consumption
The power consumption in an FPGA is classified into static and dynamic power. The static power mainly comes from leakage current, whereas the charging of switched capacitances and short circuit currents are the main sources of dynamic power; hence it can be minimised by switching capacitance reduction [53]. The dynamic power consumption of the presented architectures is shown in Figure 9. The power consumption has been computed using the Xilinx XPower Analyzer for various clock frequencies and different wordlengths. The dynamic power consumption of DPDFA is higher than that of SPDFA by around 25-100 mW for the selected operating frequencies and wordlengths. The reason is the additional multipliers in the butterfly stages and the duplication of some resources in the first two post addition stages. Thus, SPDFA outperforms DPDFA in terms of power consumption, which makes it a better choice for low power applications.
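Figure 9. Dynamic power consumption of both architectures (dynamic power in mW against clock frequency in MHz, for SPDFA and DPDFA with (12, 8) and (12, 2)-bit wordlengths).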
6.4 Comparison to Similar Work
The throughput of both architectures is 1 coefficient per clock cycle; thus, $N^3$ clock cycles are needed to compute all the 3-D DCT coefficients of an $N^3$-word data cube. A comparison between the presented and similar architectures in the literature is shown in Table 4. Of SPDFA and DPDFA, Table 4 shows that the first architecture outperforms the second in terms of area usage as it requires fewer multipliers, adders and registers. The extra hardware DPDFA utilises is needed to perform the dual line computation of the 3-D DCT. Nevertheless, DPDFA is easily pipelined and it has a lower latency and memory requirement than SPDFA. The memory requirements for SPDFA and DPDFA, as listed in Table 4, are used for data reordering and BRO only.
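
At one coefficient per clock cycle, the computation time follows directly from the number of words and the clock frequency. The sketch below is a back-of-the-envelope calculation only; the function names are hypothetical, pipeline latency is ignored, and the 330 MHz figure is the working frequency quoted in the introduction.

```python
def cycles_per_cube(N):
    """Clock cycles needed for one N x N x N cube at one coefficient per cycle."""
    return N ** 3

def computation_time_s(words, clock_hz, words_per_cycle=1):
    """Streaming time in seconds for `words` input words, ignoring pipeline latency."""
    return words / (words_per_cycle * clock_hz)
```

For the 512×512×8-word sequence, `computation_time_s(512 * 512 * 8, 330e6)` evaluates to roughly 6.36 ms, in line with the 6.4 ms processing time reported for the 14-bit implementation.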
The number of multipliers and adders employed, memory requirements, controller circuit complexity, and computation time of the presented architectures are also compared to the requirements and performance of the architectures in [38-43], as shown in Table 4. It can be seen that the presented architectures require the lowest number of multipliers and the lowest memory usage of all architectures. Only $\log_2 N$ and $2\log_2 N$ multipliers are required to perform the 3-D DCT computation using SPDFA and DPDFA, respectively; by comparison, N multipliers are required in [41, 42]. In addition, except for the architectures in [43], the presented architectures carry out the 3-D DCT computation with the lowest latency. Table 4 also shows the performance of various architectures in terms of computation time; although the presented architectures exhibit a longer computation time than the work in [39, 43], this is largely balanced by their low hardware usage. This improvement over similar work in the literature is mainly due to the fact that, unlike the architectures in [38-43], the focus is on employing and regularising the data flow of a fast algorithm, while traditional DCT architectures are based on the direct algorithm [25, 38-43]. This however is not the only benefit of using a VR approach; the control circuits attached to the presented architectures are simple as there is no data transpose. This makes the controller complexity comparable to that of parallel direct approaches in [38, 42].
7. Conclusions
This paper has presented two new 3-D DCT architectures based on a 3-D DCT VR algorithm. The use of a fast algorithm has yielded architectures with improved processing speed and reduced hardware usage, as they both require the lowest number of arithmetic components and the lowest memory among known architectures in the literature; at the same time, such architectures avoid the need for memory transposition and hence are easy to implement and employ simple control circuitry. The presented architectures are parameterisable in terms of word and transform lengths and exhibit various power consumption, hardware usage, processing speeds and levels of pipelining, which provides the designer with more flexibility and a larger choice when selecting the right architecture for the application under consideration.
References
[1] M. Ayinala and K. K. Parhi, "Parallel pipelined FFT architectures with reduced number of delays," in Proceedings of the ACM Great Lakes Symposium on VLSI (GLSVLSI), 2012, pp. 63-66.
[2] O. Nibouche, S. Boussakta, M. Darnell, and M. Benaissa, "Algorithms and pipeline architectures for 2-D FFT and FFT-like transforms," Digital Signal Processing: A Review Journal, vol. 20, pp. 1072-1086, 2010.
[3] M. Ayinala, M. Brown, and K. K. Parhi, "Pipelined parallel FFT architectures via folding transformation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, pp. 1068-1081, 2012.
[4] S. Saponara and B. Neri, "Radar Sensor Signal Acquisition and Multidimensional FFT Processing for Surveillance Applications in Transport Systems," IEEE Transactions on Instrumentation and Measurement, vol. 66, pp. 604-615, 2017.
[5] S. Saponara and B. Neri, "Design of compact and low-power X-band Radar for mobility surveillance applications," Computers & Electrical Engineering, vol. 56, pp. 46-63, 2016.
[6] A. Das, A. Hazra, and S. Banerjee, "An efficient architecture for 3-D discrete wavelet transform," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, pp. 286-296, 2010.
[7] B. K. Mohanty and P. K. Meher, "Memory-efficient architecture for 3-D DWT using overlapped grouping of frames," IEEE Transactions on Signal Processing, vol. 59, pp. 5605-5616, 2011.
[8] B. K. Mohanty and P. K. Meher, "Memory efficient modular VLSI architecture for high throughput and low-latency implementation of multilevel lifting 2-D DWT," IEEE Transactions on Signal Processing, vol. 59, pp. 2072-2084, 2011.
[9] S. Al-Azawi, "Low-Power, Low-Area Multi-level 2-D Discrete Wavelet Transform Architecture," Circuits, Systems, and Signal Processing, vol. 37, pp. 444-458, 2018.
[10] R. E. Atani, M. Baboli, S. Mirzakuchaki, S. E. Atani, and B. Zamanlooy, "Design and implementation of a 118 MHz 2D DCT processor," in IEEE International Symposium on Industrial Electronics, 2008, pp. 1076-1081.
[11] M. Jridi and A. Alfalou, "A low-power, high-speed DCT architecture for image compression: Principle and implementation," in 18th IEEE/IFIP VLSI System on Chip Conference (VLSI-SoC), 2010, pp. 304-309.
[12] M. El Aakif, S. Belkouch, N. Chabini, and M. M. Hassani, "Low power and fast DCT architecture using multiplier-less method," in 2011 Faible Tension Faible Consommation (FTFC), 2011, pp. 63-66.
[13] B. Z. Guo, L. Niu, and Z. M. Liu, "Implementation of 2-D DCT based on FPGA," in Proceedings of SPIE - The International Society for Optical Engineering, 2010.
[14] G. K. a. S. V. Khurram Bukhari, "DCT and IDCT implementations on different FPGA technologies," Computer Engineering Lab, Delft University of Technology, 2009.
[15] O. Nibouche, S. Boussakta, and M. Darnell, "Pipeline Architectures for Radix-2 New Mersenne Number Transform," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, pp. 1668-1680, 2009.
[16] H. L. P. A. Madanayake, R. J. Cintra, D. Onen, V. S. Dimitrov, and L. T. Bruton, "Algebraic integer based 8×8 2-D DCT architecture for digital video processing," in IEEE International Symposium on Circuits and Systems (ISCAS), 2011, pp. 1247-1250.
[17] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Transactions on Signal Processing, vol. 54, pp. 955-964, 2006.
[18] E. D. Kusuma and T. S. Widodo, "FPGA implementation of pipelined 2D-DCT and quantization architecture for JPEG image compression," in 2010 International Symposium in Information Technology (ITSim), 2010, pp. 1-6.
[19] S. Al-Azawi, Y. A. Abbas, and R. Jidin, "Low complexity multidimensional CDF 5/3 DWT architecture," in Communication Systems, Networks & Digital Signal Processing (CSNDSP), 2014 9th International Symposium on, 2014, pp. 804-808.
[20] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, pp. xviii-xxxiv, 1992.
[21] D. J. Le Gall, "The MPEG video compression standard," in Compcon Spring '91: Digest of Papers, 1991, pp. 334-335.
[22] W. Li, "Overview of fine granularity scalability in MPEG-4 video standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 301-317, 2001.
[23] A. Madisetti and A. N. Willson, Jr., "DCT/IDCT processor design for HDTV applications," in International Symposium on Signals, Systems, and Electronics (ISSSE '95), 1995, pp. 63-66.
[24] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, 2003.
[25] S. Boussakta and H. O. Alshibami, "Fast algorithm for the 3-D DCT-II," IEEE Transactions on Signal Processing, vol. 52, pp. 992-1001, 2004.
[26] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, "Incremental learning of 3D-DCT compact representations for robust visual tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 863-881, 2013.
[27] S. Sawant and D. A. Adjeroh, "Balanced multiple description coding for 3D DCT video," IEEE Transactions on Broadcasting, vol. 57, pp. 765-776, 2011.
[28] H. Y. Huang, C. H. Yang, and W. H. Hsu, "A video watermarking technique based on pseudo-3-D DCT and quantization index modulation," IEEE Transactions on Information Forensics and Security, vol. 5, pp. 625-637, 2010.
[29] R. Atta and M. Ghanbari, "Spatio-temporal scalability-based motion-compensated 3-D subband/DCT video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, pp. 43-55, 2006.
[30] S. C. Chan and K. L. Ho, "Direct methods for computing discrete sinusoidal transforms," IEE Proceedings on Radar and Signal Processing, vol. 137, pp. 433-442, 1990.
[31] S. An and C. Wang, "Recursive algorithm, architectures and FPGA implementation of the two-dimensional discrete cosine transform," IET Image Processing, vol. 2, pp. 286-294, 2008.
[32] G. Jiun-In and L. Chih-Chen, "A generalized architecture for the one-dimensional discrete cosine and sine transforms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 874-881, 2001.
[33] Z. Wu, J. Sha, Z. Wang, L. Li, and M. Gao, "An improved scaled DCT architecture," IEEE Transactions on Consumer Electronics, vol. 55, pp. 685-689, 2009.
[34] C. Yuan-Ho and C. Tsin-Yuan, "A high performance video transform engine by using space-time scheduling strategy," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, pp. 655-664, 2012.
[35] S. Al-Azawi, S. Boussakta, and A. Yakovlev, "High precision and low power DCT architectures for image compression applications," in IET Conference on Image Processing (IPR), 2012, pp. 1-6.
[36] A. Aggoun and I. Jalloh, "Two-dimensional DCT/IDCT architecture," IEE Proceedings on Computers and Digital Techniques, vol. 150, pp. 2-10, 2003.
[37] J. I. Guo, "Efficient parallel adder based design for one-dimensional discrete cosine transform," IEE Proceedings on Circuits, Devices and Systems, vol. 147, pp. 276-282, 2000.
[38] I. Jalloh, A. Aggoun, and M. McCormick, "3D DCT architecture for compression of integral 3D images," in IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation, 2000, pp. 238-244.
[39] A. Aggoun and I. Jalloh, "A parallel 3D DCT architecture for the compression of integral 3D images," in The 8th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2001, pp. 229-232, vol. 1.
[40] M. Bakr and A. E. Salama, "Implementation of 3D-DCT based video Encoder/Decoder system," in Midwest Symposium on Circuits and Systems, 2002, pp. II13-II16.
[41] S. Saponara, L. Fanucci, and P. Terreni, "Low-power VLSI architectures for 3D discrete cosine transform (DCT)," in IEEE 46th Midwest Symposium on Circuits and Systems, 2003, pp. 1567-1570, vol. 3.
[42] S. Saponara, "Real-time and low-power processing of 3D direct/inverse discrete cosine transform for low-complexity video codec," Journal of Real-Time Image Processing, pp. 1-11, 2012.
[43] Y. Ikegaki, T. Miyazaki, and S. G. Sedukhin, "3D-DCT processor and its FPGA implementation," IEICE Transactions on Information and Systems, vol. E94-D, pp. 1409-1418, 2011.
[44] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Three dimensional DCT similar butterfly algorithm and its pipeline architectures," in 2016 IEEE Information Technology, Networking, Electronic and Automation Control Conference, 2016, pp. 506-510.
[45] S. Al-Azawi, "Efficient Architectures for Multidimensional Discrete Transforms in Image and Video Processing Applications," PhD Thesis, Newcastle University, UK, 2013.
[46] L. Yuanyuan, C. Hexin, Z. Yan, and Y. Chuxi, "Device-saving pipeline architectures of multi-dimensional DCT similar butterfly algorithm," in 2016 International Conference on Integrated Circuits and Microsystems (ICICM), 2016, pp. 339-344.
[47] J. A. Nikara, J. H. Takala, and J. T. Astola, "Discrete cosine and sine transforms - regular algorithms and pipeline architectures," Signal Processing, vol. 86, pp. 230-249, 2006.
[48] O. Alshibami and S. Boussakta, "Fast algorithm for the 3D DCT," in IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings (ICASSP '01), 2001, pp. 1945-1948, vol. 3.
[49] M. C. Lee, R. K. W. Chan, and D. A. Adjeroh, "Fast three-dimensional discrete cosine transform," SIAM Journal on Scientific Computing, vol. 30, pp. 3087-3107, 2008.
[50] Xiph.org, "Video Test Media [derf's collection]," Xiph.org, [Online]. Available: https://media.xiph.org/video/derf/ Accessed on 20/09/2016.
[51] Q. Huynh-Thu and M. Ghanbari, "The accuracy of PSNR in predicting video quality for different video scenes and frame rates," Telecommunication Systems, vol. 49, pp. 35-48, 2012.
[52] H. R. Sheikh, A. C. Bovik, and G. d. Veciana, "An information fidelity criterion for image quality assessment using natural scene statistics," IEEE Transactions on Image Processing, vol. 14, pp. 2117-2128, 2005.
[53] S. McKeown and R. Woods, "Low power field programmable gate array implementation of fast digital signal processing algorithms: Characterisation and manipulation of data locality," IET Computers and Digital Techniques, vol. 5, pp. 136-144, 2011.
Saad Al-Azawi received the B.Sc. degree in Electrical Engineering from the University of Baghdad and the M.Sc. degree in Electronic and Communication Engineering from Al-Mustansiriya University, Baghdad, Iraq. He received his Ph.D. degree in Electrical and Electronic Engineering from Newcastle University, Newcastle upon Tyne, England, in 2013. Assistant Professor Dr. Saad is currently working as head of the Department of Electronic Engineering, College of Engineering, University of Diyala, Diyala, Iraq. His research interests include hardware architectures for signal and image processing algorithms and transforms, digital image processing, and digital signal processing.
Omar Nibouche received his BEng degree in Electronic Engineering from the Polytechnic School of Algiers and his Ph.D. degree in Computer Science from Queen's University Belfast. He is a lecturer in computing in the School of Computing and Mathematics, Ulster University at Jordanstown. His research interests include machine learning, applications of artificial intelligence, computer vision and biometrics.
Said Boussakta received the PhD degree in Electrical Engineering from Newcastle University, U.K., in 1990. Since 1990, he has been working in academia, fully involved in both research and teaching. From 2000 to 2006 he was at the University of Leeds as a Reader in Digital Communications and Signal Processing. Since 2006, he has been with the School of Engineering, Newcastle University, as a Professor of Communications and Signal Processing, lecturing in Communication Networks and Signal Processing subjects. He has supervised over 50 students to Ph.D. completion and published over 200 conference proceedings and journal articles. His research interests are in the areas of fast DSP algorithms, Digital Communications, Communication Network Systems, Cryptography, and Digital Signal/Image Processing. He has also served as Chair in conferences and presented several invited talks in communications, signal processing and security. Prof Boussakta is a Fellow of the IEE, and a Senior Member of the Communications and Signal Processing Societies.
Gaye Lightbody has been a Lecturer within the School of Computing in Ulster University since 2006. She received an M.Eng. (1995) and a PhD (2000) in Electrical and Electronic Engineering from Queen's University of Belfast. Her PhD, High Performance VLSI Architectures for Recursive Least Squares Adaptive Filtering, involved the research and development of a scalable, efficient architecture for highly intensive adaptive beamforming applications. Gaye then worked in industry from 2000 to 2006 for Amphion Semiconductor Limited, developing intellectual property cores for ASIC and FPGA solutions in the areas of audio, image and video processing. She is a co-author of the book FPGA-based Implementation of Signal Processing Systems, published by Wiley in 2008. A second edition was released in 2017, which provides an update to reflect the latest iterations of FPGA theory, applications, and technology. This revision includes coverage of FPGA solutions for Big Data applications.
TABLE 3. Hardware utilisation rate, maximum operating frequencies and computation times for both architectures using various word lengths

Hardware usage is listed per word length size for the SPDFA and DPDFA architectures.

| Slice Logic Utilization | Available | SPDFA (12,8) | SPDFA (12,6) | SPDFA (12,4) | SPDFA (12,2) | DPDFA (12,8) | DPDFA (12,6) | DPDFA (12,4) | DPDFA (12,2) |
|---|---|---|---|---|---|---|---|---|---|
| No of Slice Registers | 28,800 | 1840 | 1726 | 1593 | 1459 | 3691 | 3414 | 3120 | 2826 |
| No of Slice LUTs | 28,800 | 2351 | 2171 | 1991 | 1811 | 3309 | 3044 | 2781 | 2517 |
| No of occupied Slices | 7,200 | 743 | 607 | 588 | 605 | 1229 | 1096 | 1053 | 1008 |
| No of bonded IOBs | 480 | 30 | 28 | 26 | 24 | 30 | 28 | 26 | 24 |
| No of 36k Block RAM used | 60 | 1 | - | - | - | 1 | - | - | - |
| No of 18k Block RAM used | | 14 | 15 | 15 | 15 | 15 | 16 | 16 | 16 |
| No of DSP48Es | 48 | 9 | 8 | 8 | 8 | 16 | 12 | 12 | 12 |
| Average utilization rate | | 12% | 12% | 11% | 11% | 18% | 16% | 15% | 15% |
| Maximum frequencies (MHz) | | 241 | 230 | 244 | 226 | 266 | 338 | 258 | 333 |
| Computation times for 512×512×8-pixel data (ms) | | 8.7 | 9.1 | 8.6 | 9.3 | 7.9 | 6.2 | 8.1 | 6.3 |
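The computation times in Table 3 can be cross-checked from the maximum frequencies. The sketch below assumes that each architecture consumes one sample per clock cycle, so a 512×512×8 data set takes 512·512·8 cycles; that assumption is not stated in the table itself, and the names `SAMPLES`, `FMAX_MHZ` and `computation_time_ms` are introduced here for illustration only.

```python
# Number of samples in the 512x512x8 test data set.
SAMPLES = 512 * 512 * 8

# Maximum frequencies (MHz) copied from Table 3, keyed by word length.
FMAX_MHZ = {
    "SPDFA": {"(12,8)": 241, "(12,6)": 230, "(12,4)": 244, "(12,2)": 226},
    "DPDFA": {"(12,8)": 266, "(12,6)": 338, "(12,4)": 258, "(12,2)": 333},
}


def computation_time_ms(f_mhz, samples=SAMPLES):
    """Time in ms, assuming one sample is processed per clock cycle."""
    cycles = samples  # assumption: throughput of one sample per cycle
    return cycles / (f_mhz * 1e6) * 1e3


for arch, by_word_length in FMAX_MHZ.items():
    for word_length, f_mhz in by_word_length.items():
        t = computation_time_ms(f_mhz)
        print(f"{arch} {word_length}: {t:.1f} ms at {f_mhz} MHz")
```

Under this assumption the computed values, rounded to one decimal place, agree with the computation-time row of the table for all eight configurations.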
Table 4. Comparison to Similar Architectures in the Literature
Architectures  Adders/Sub.  Multipliers  Memory  Registers  Initial Latency  Computation Time (cycles)  Controller Complexity  DCT Algorithm
[38]  $3N$  $3N$  $N^2(N+1)$  N/R*  $N^3 + 3N$  $N^3$  Simple  Regular; Row-Column-Frame, cascaded
[39]  $(5N^2 + N)/2$  $5N^2/2$  $N^3$ (transpose memory)  $N$ register between the 2-D DCT and the 1-D DCT frame direction  $N^3 + \frac{3}{2}N$  $N^2$  Complex  Regular, Parallel; Row-Column-Frame, N×N 1-D DCT + 1-D DCT for frame direction
[40]  $3N - 3$  $3N$  $N^2(N+1)$  N/R  $> N^2$  $6N^3$**  Medium  Regular 1-D DCT; Row-Column-Frame
[42]
  Full Parallel (FP)  $N(2N+1)$  $N(2N+1)$***  $N(N^2+1)$  N/R  N/R  $2N^2$  Medium  1-D DCT Radix 2; Row-Column-Frame
  Cascaded (CS)  $3N$  $3N$***  $N^2(N+1)$  N/R  N/R  $2N^3$  Simple  1-D DCT Radix 2; Row-Column-Frame
  Hardware Multiplexed (HM)  $N$  $N$***  $N^3$  N/R  N/R  $6N^3$  Complex  1-D DCT Radix 2; Row-Column-Frame
[43]  Input Mem.
  Sequential  $N^3$  $N^3$  $2N^3$  $N^3$  $3N + 4$  $N^3$  $3N$  Complex  Regular 1-D DCT; Row-Column-Frame
  Pipelined 1  $\approx 2N^3$  $2N^3$  $2N^3$  $N^3$  $3N + 4$  $N^3$  $3N$  Complex  Regular 1-D DCT; Row-Column-Frame
  Pipelined 2  $\approx 3N^3$  $3N^3$  $2N^3$  $N^3$  $3N + 8$  $N^3$  $3N$  Complex  Regular 1-D DCT; Row-Column-Frame
  Block  $N^3/8$  $N^3/8$  $2N^3$  $\approx 16N^3$  $3N + 4$  $N^3$  $3N$  Complex  Regular 1-D DCT; Row-Column-Frame
SPDFA  $12 + 6\log_2 N$  $\log_2 N$  $N/4 + N$  $N^2$  For reordering and BRO  $N^3 + 5N^2 + 5N + 4 \cong 2N^3$  $N^3$  Simple  Vector-Radix 3-D DCT
DPDFA  $20 + 6\log_2 N$  $2\log_2 N$  $N^3$  For reordering and BRO  $\frac{3N^3}{2} + 6N^2 + 11N + 8 \cong \frac{3}{2}N^3$  $N^3$  Simple  Vector-Radix 3-D DCT
* N/R: Not reported in their paper.
** Computed for 4×4×4 data block.
*** Multiplication is performed by a serial distributed arithmetic architecture.
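To illustrate how the arithmetic component counts in Table 4 scale with the transform length, the sketch below evaluates three of the rows for several values of N. It assumes the entries read as 3N adders and 3N multipliers for the cascaded architecture of [38], 12 + 6 log2(N) adders and log2(N) multipliers for SPDFA, and 20 + 6 log2(N) adders and 2 log2(N) multipliers for DPDFA. The helper names are hypothetical and N is assumed to be a power of two.

```python
def log2_exact(n):
    """Integer log2 of n, where n must be a power of two >= 2."""
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two >= 2")
    return n.bit_length() - 1


# (adders/subtractors, multipliers) as functions of N, as read from Table 4.
COUNTS = {
    "[38] cascaded RCF": lambda n: (3 * n, 3 * n),
    "SPDFA": lambda n: (12 + 6 * log2_exact(n), log2_exact(n)),
    "DPDFA": lambda n: (20 + 6 * log2_exact(n), 2 * log2_exact(n)),
}


def component_counts(n):
    """Return {architecture: (adders, multipliers)} for transform length n."""
    return {name: formula(n) for name, formula in COUNTS.items()}


for n in (8, 16, 32, 64):
    print(n, component_counts(n))
```

Under these assumed formulas, the logarithmic multiplier counts are below 3N for every size listed, while the adder counts fall below 3N only for the larger transform lengths.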