Reuseable Silicon IP Cores for Discrete Wavelet Transform Applications

Masud, S., & McCanny, J. (2004). Reuseable Silicon IP Cores for Discrete Wavelet Transform Applications. IEEE Transactions on Circuits and Systems I: Regular Papers, 51(6), 1114-1124. https://doi.org/10.1109/TCSI.2004.829236

Published in: IEEE Transactions on Circuits and Systems I: Regular Papers

Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal

https://pure.qub.ac.uk/en/publications/27714957-e52d-4b51-ae74-efb9f66695e6

1114 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 51, NO. 6, JUNE 2004

Reusable Silicon IP Cores for Discrete Wavelet Transform Applications

Shahid Masud, Member, IEEE, and John V. McCanny, Fellow, IEEE

Abstract—Architectures and methods for the rapid design of silicon cores for implementing discrete wavelet transforms over a wide range of specifications are described. These architectures are efficient, modular, scalable, and cover orthonormal and biorthogonal wavelet transform families. They offer efficient hardware utilization by exploiting a number of core wavelet filter properties and allow the creation of silicon designs that are highly parameterized, including in terms of wavelet type and wordlengths. Control circuitry is embedded within these systems, allowing them to be cascaded for any desired level of decomposition without any interface glue logic. The time to produce chip designs for a specific wavelet application is typically less than a day and these are comparable in area and performance to handcrafted designs. They are also portable across a wide range of silicon foundries and suitable for field programmable gate array and programmable logic device implementation. The approach described has also been extended to wavelet packet transforms.

Index Terms—Biorthogonal, digital signal processing (DSP), discrete wavelet transforms (DWTs), field programmable gate array (FPGA), folding, image processing, IP cores, rapid design, silicon IP cores, system-on-chip architectures, very large-scale integration (VLSI) architecture, VHDL, video compression.

I. INTRODUCTION

THE USE OF wavelet transforms has become increasingly popular in a wide range of speech and image processing applications. However, the computational requirements of many wavelet systems, particularly image and video based systems, are often best suited to a dedicated hardware implementation. To date, quite a number of investigations have been undertaken into both architectures for and specific implementations of silicon wavelet systems. Important contributions include the works of Parhi and Nishitani [1], Vishwanath and Owens [2], Chakrabarti et al. [3], Grzeszczak et al. [4] and Yu et al. [5]. These papers present a variety of schemes ranging from digit-serial architectures, filter-bank folding, lattice structures, systolic arrays, and other parallel processing schemes.

An important feature of wavelet-based transforms is that there is an extremely large choice of possible wavelet basis functions that can be used for signal transformation. These are broadly categorized into two main families depending on whether the mother wavelet used satisfies conditions of “orthonormality” or “biorthogonality” [6], [7]. The level of wavelet decomposition performed varies widely and is algorithm dependent. In addition, the wavelet basis functions used at different levels of signal decomposition can themselves vary, as can related parameters such as lengths of filter banks, signal and filter coefficient wordlengths [8]. To date, most of the research which has been undertaken on wavelet architectures or designs has been focused on specific orthonormal wavelet systems such as Haar and Daubechies wavelets [4], [9], [10], with little work having been published on biorthogonal systems.

Manuscript received August 1, 2003. This work was performed at Queen’s University, Belfast, U.K. This paper was recommended by Associate Editor Y. Wang.

S. Masud is with the Department of Computer Science, Lahore University of Management Sciences, Lahore 54792, Pakistan (e-mail: [email protected]).

J. V. McCanny is with the DSiP Laboratories, School of Electrical Engineering, Queen’s University, Belfast BT9 5AH, U.K. (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSI.2004.829236

The purpose of this paper is to present the results of research which has addressed this problem in a much more general sense. In particular, the paper describes generic and modular architectures that allow the rapid silicon design of a very broad range of wavelet systems directly from a high-level specification. This work has been motivated by two main considerations. The first is described above: the need to be able to create wavelet systems with a very wide range of specifications and tuned to different computational requirements. The second has been motivated by the challenges arising from the new era of system-on-a-chip (SoC), in particular, the desire to rapidly create complex and efficient reusable intellectual property cores. This extends our previous research in this area, which has shown how digital signal processing (DSP) systems-on-silicon can be created in a fraction of the time previously thought possible using hierarchical libraries of generic DSP architecture templates captured in the form of a hardware description language [11].

The paper is structured as follows. Sections II and III introduce new generic architectures for the creation of orthonormal and biorthogonal wavelet filters, respectively. Details are then presented (Section IV) as to how these have been used in the creation of parameterized generators for the design of silicon wavelet cores. Section V then presents the results of numerous case studies covering both application-specific integrated circuit (ASIC) and field programmable gate array (FPGA) implementation. As will be discussed, this approach allows wavelet designs to be created very rapidly from a high level specification. The designs produced are highly portable across different silicon foundries and can also easily be implemented in FPGA and programmable logic device (PLD) technology. This section also considers the implementation of wavelet packet transforms, systems that comprise both orthonormal and biorthogonal wavelet types and arbitrary filter bank connections. A generic method for folding wavelet architectures is also described. This allows hardware requirements to be tailored to computational bandwidth. The general conclusions from the work are then presented in Section VI.

1057-7122/04$20.00 © 2004 IEEE

Fig. 1. Three-level DWT.

Fig. 2. Two-level wavelet packet decomposition system.

II. GENERIC ARCHITECTURE FOR ORTHONORMAL WAVELET TRANSFORMS

The hardware implementation of discrete wavelet transforms (DWTs) and wavelet packet transforms is based on the filter bank representation shown in Figs. 1 and 2, respectively. The implementation becomes complex because of a varying choice of wavelet, sample rate and the cumulative wordlengths at every stage. All these issues represent important challenges in the creation of efficient, generic and reusable architectures for the rapid design of wavelet systems.
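As a software illustration of the filter bank representation of Fig. 1, the following Python sketch cascades three analysis stages, feeding the low-pass output of each stage into the next. It is a behavioural model only, not the hardware described in this paper. The function names, the use of the Daubechies four-tap wavelet, the alternating-sign construction of the high-pass filter, and the periodic extension at the signal boundary are all assumptions made for the example.

```python
import math


def daubechies4():
    """Return (low, high) analysis filters for the Daubechies four-tap wavelet."""
    s3 = math.sqrt(3.0)
    norm = 4.0 * math.sqrt(2.0)
    low = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
    # High-pass filter derived from the low-pass one (alternating-sign flip).
    high = [((-1) ** k) * low[len(low) - 1 - k] for k in range(len(low))]
    return low, high


def analysis_stage(x, low, high):
    """One filter bank: filter with both filters, keep every second output.

    Periodic extension is assumed at the signal boundary.
    """
    n = len(x)
    approx, detail = [], []
    for i in range(0, n, 2):
        approx.append(sum(low[k] * x[(i + k) % n] for k in range(len(low))))
        detail.append(sum(high[k] * x[(i + k) % n] for k in range(len(high))))
    return approx, detail


def dwt(x, levels):
    """Multi-level DWT: only the low-pass branch is decomposed further."""
    low, high = daubechies4()
    approx = list(x)
    details = []
    for _ in range(levels):
        approx, d = analysis_stage(approx, low, high)
        details.append(d)
    return approx, details
```

For a 16-sample input, `dwt(x, 3)` returns detail bands of length 8, 4, and 2 plus a two-sample approximation band, which mirrors the halving of the sample rate at every stage that the hardware must accommodate.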

The system which has been created for implementing orthonormal wavelet filters exploits the fact that the polyphase (bi-phase in the case of wavelets) decomposition of wavelet filters can be obtained by writing the transfer function $H(z)$ of both the low- and high-pass filters in the form

$H(z) = H_{e}(z^{2}) + z^{-1} H_{o}(z^{2})$ (1)

where the symbols in this equation have their usual meaning.

Equation (1) is, in effect, two filters that individually operate at half-rate on alternate data samples. This is followed by a decimation stage in wavelet decomposition, meaning that the function of each filter pair can be time multiplexed onto a single half-length filter, thus reducing filter hardware requirements by 50%. This is illustrated in the top and bottom halves of Fig. 3, which shows how each multiplier block can be shared and used to compute alternate multiplications with even and odd order coefficients. In this approach data is input at the same rate as a full-length direct-form filter, but with alternate even and odd coefficients used in each processing cycle. The result is only output when both the even and odd index samples have been processed. The odd index sample computations are therefore temporarily stored in a buffer and are added to even sample computations to generate a complete filtered and decimated output. As will be noted, this system also offers the attraction that the data delay line is shared between the two filter structures and this is also beneficial in reducing hardware requirements. A transposed-direct-form filter in this situation would require two separate delay lines with increased wordlengths. Since the filters used in wavelet transforms have only a few taps, the use of a direct-form filter structure poses no problems.

Fig. 3. Time interleaved implementation of an N-tap wavelet analysis filter bank.

The operation of the time-interleaved circuit can be explained through an eight-tap example as follows. The first sample $x(0)$ arrives at time $t_0$ to find the first coefficient set $h_0$, $h_2$, $h_4$, and $h_6$, respectively, available at each of the multipliers. The output value $h_0 x(0)$ is calculated and passed to the output. The second (odd) sample $x(1)$ arrives at time $t_1$ to encounter the coefficient values $h_1$, $h_3$, $h_5$, and $h_7$. The previous input $x(0)$ moves forward through one delay and hence it is not operated on by any multiplier. The output value $h_1 x(1)$ is calculated and this is stored in the buffer memory. When sample $x(2)$ arrives, $x(0)$ has moved to the next multiplier. The output value $h_0 x(2) + h_2 x(0)$ is then produced and added to the value $h_1 x(1)$ that was calculated in the previous odd cycle. This represents the second decimated output value. The process continues and gives a decimated and filtered output.

This approach is different from what has previously been described in that it is the function of each half-rate multiplier which is multiplexed onto the same piece of hardware and not the functions of the high and low pass filter as described in [11]. This is highly advantageous in terms of modularity (and thus, chip-design synthesis) and exploits the fact that both half-rate filters require the same filter lengths and data/coefficient wordlengths. As will be discussed, this approach is consistent with what we have adopted for biorthogonal wavelets (see Section III), where the high-pass/low-pass filter approach is not attractive due to filter order and wordlength variations. Another important difference is that the architecture in Fig. 3 does not require the computation of intermediate multiplications and additions that are then discarded because of decimation. This is highly advantageous in terms of power consumption.

For the purpose of this paper, attention is focused mainly on high-throughput applications, and consequently, a bit-parallel, word-serial filter implementation has been assumed. However, the basic architecture can be simply extended to create silicon generators in which other word formats, such as bit-serial or digit-serial data streams, are used. This allows flexibility across applications in trading silicon area with performance and power consumption.
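The time-interleaved scheme of this section can be checked with a short behavioural model. The Python sketch below assumes a delay line whose multipliers are attached to every second tap, with the coefficient at each multiplier switched between even and odd indices on alternate input cycles. The function names and the list-based delay line are hypothetical and are used only for illustration; the model ignores wordlengths and pipelining.

```python
def direct_decimated(x, h):
    """Reference: full-rate direct-form FIR, keeping every second output."""
    y = []
    for n in range(0, len(x), 2):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def time_interleaved(x, h):
    """Same result with len(h) // 2 multipliers shared between even and odd cycles."""
    n_taps = len(h)
    assert n_taps % 2 == 0, "sketch assumes an even number of taps"
    n_mult = n_taps // 2
    delay = [0] * n_taps   # shared delay line, delay[0] holds the newest sample
    buffer = 0             # partial sum computed in the previous odd cycle
    y = []
    for n, sample in enumerate(x):
        delay = [sample] + delay[:-1]
        if n % 2 == 0:
            # Even cycle: multiplier m sees delay tap 2m and coefficient h[2m].
            partial = sum(h[2 * m] * delay[2 * m] for m in range(n_mult))
            y.append(partial + buffer)
        else:
            # Odd cycle: same taps, but coefficient h[2m + 1]; result is buffered.
            buffer = sum(h[2 * m + 1] * delay[2 * m] for m in range(n_mult))
    return y
```

With integer inputs the two functions return identical lists, while `time_interleaved` performs only half as many multiplications per input cycle and never computes an output that decimation would discard.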

III. GENERIC ARCHITECTURE FOR BIORTHOGONAL WAVELET FILTERS

Biorthogonal wavelet filters evolved from the idea of having an exact reconstruction scheme in which the synthesis filters were different from the analysis filters. Hence the “orthonormality” condition is relaxed to “biorthogonality.” Such filters are attractive in that integer coefficients are possible and that they offer linear phase response [7]. However, because of this property the low-pass and high-pass filters in a biorthogonal filter bank can have different lengths, with their filter coefficients possessing a combination of symmetry or anti-symmetry. The filter orders can also be even or odd.

The linear phase property of the biorthogonal filters results in the coefficients being symmetric, in the case of odd order systems, or antisymmetric in the case of even order systems. The coefficients can thus be written as

$h(n) = \pm h(N - 1 - n)$ (2)

where $N$ is the filter order. Architectures for such linear phase filters can exploit this property to reduce the number of multipliers from $N$ to $N/2$ (even order) and to $(N+1)/2$ (odd order). As before, downsampling by a factor of two allows alternate samples produced by the two analysis filters to be ignored. As in the case of orthonormal wavelet filters, polyphase decomposition can be applied so that only one output needs to be computed for every two input samples.
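The saving obtained from coefficient symmetry alone can be illustrated as follows. The Python sketch below adds (or, for antisymmetric filters, subtracts) the two samples that share a coefficient before multiplying, so that roughly half as many multiplications are needed. The function names, the `sign` argument, and the zero-padding of samples outside the input are assumptions made for this example.

```python
def fir_direct(x, h, n):
    """Output sample n of a direct-form FIR: one multiplication per tap."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))


def fir_folded(x, h_half, n_taps, n, sign=1):
    """Output sample n using the symmetry h[k] = sign * h[n_taps - 1 - k].

    h_half holds only the first ceil(n_taps / 2) coefficients.
    Returns (output value, number of multiplications used).
    """
    def sample(i):
        return x[i] if 0 <= i < len(x) else 0

    acc = 0
    mults = 0
    for k in range(n_taps // 2):
        # Pre-add the two samples that share coefficient h_half[k].
        acc += h_half[k] * (sample(n - k) + sign * sample(n - (n_taps - 1 - k)))
        mults += 1
    if n_taps % 2 == 1:
        # Odd length: the centre tap has no partner.
        mid = n_taps // 2
        acc += h_half[mid] * sample(n - mid)
        mults += 1
    return acc, mults
```

For a nine-tap symmetric filter this uses five multiplications per output instead of nine, and for a four-tap antisymmetric filter (`sign=-1`) two instead of four.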


TABLE I. SEQUENCE OF OUTPUT FROM A NINE-TAP BIORTHOGONAL ANALYSIS FILTER

An efficient and generic architecture for biorthogonal wavelets has been derived by combining such properties and is described by first considering the example of an odd order (symmetrical filter coefficients) filter structure. Here a nine-tap filter is presented for illustration. The operation of this system can be described by considering the outputs produced by such a system as documented in Table I. From this, the following will be noted.

• Coefficient symmetry simplifies the actual computations required, thereby reducing the number of multiplications and additions needed.

• Polyphase decomposition can be used to interleave even and odd order coefficient values to the same multiplier (as described in the orthonormal case).

• A multiplexing scheme can be devised to appropriately connect the delay line taps to the multiplier/accumulators at the correct time instance in order to produce correct output values.

The architecture developed from the above considerations is shown in Fig. 4. This is suitable for nine-tap wavelet filter pairs such as biorthogonal 9, 3 tap or biorthogonal 9, 7 tap systems. The entries in Table I under “Required output” provide the basis for this simplification. Since intermediate filter outputs at time are not required, we can spread the computations required at time etc., over two computation clock cycles. For example, the computations at time can be distributed such that take place in the even input cycle and take place in the odd input data cycle. A similar explanation holds for all the subsequent outputs. The results from the even cycle are stored in the output accumulator and added with results from the odd cycle to produce the final decimated output.
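One way of spreading the computations for a nine-tap symmetric filter over an even and an odd input cycle is sketched below in Python. The particular assignment of delay-line taps and coefficients to each cycle is an assumption chosen for illustration; it is not taken from Table I or Fig. 4. It does, however, respect the constraints described above: at most three multiplications per cycle, a shared nine-element delay line, and an even-cycle result that is held in an accumulator and completed in the following odd cycle.

```python
def nine_tap_direct(x, c):
    """Reference: direct-form nine-tap symmetric FIR, decimated by two."""
    h = c + c[3::-1]          # [c0, c1, c2, c3, c4, c3, c2, c1, c0]
    y = []
    for n in range(0, len(x), 2):
        y.append(sum(h[k] * x[n - k] for k in range(9) if n - k >= 0))
    return y


def nine_tap_shared(x, c):
    """Same outputs using at most three multiplications per input cycle.

    c holds the five distinct coefficients c0..c4.
    Output k corresponds to filter output 2k and appears after odd cycle 2k + 1.
    """
    delay = [0] * 9           # delay[0] is the newest sample
    acc = 0                   # output accumulator (holds the even-cycle result)
    out = []
    for n, sample in enumerate(x):
        delay = [sample] + delay[:-1]
        if n % 2 == 0:
            # Even cycle: coefficients c0, c2, c4 (three multipliers busy).
            acc = (c[0] * (delay[0] + delay[8])
                   + c[2] * (delay[2] + delay[6])
                   + c[4] * delay[4])
        else:
            # Odd cycle: coefficients c1, c3; taps re-selected by multiplexers.
            acc += (c[1] * (delay[2] + delay[8])
                    + c[3] * (delay[4] + delay[6]))
            out.append(acc)
    return out
```

In this sketch each multiplier sees a different coefficient and a different pair of delay-line taps in the two cycles, which is the role played by the multiplexers in the hardware.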

A corresponding architecture for even-order biorthogonal wavelet filters possessing antisymmetry (i.e., $h(n) = -h(N - 1 - n)$) can similarly be developed by replacing adders in Fig. 4 with subtractors. An example of a system (for a 4, 4 tap filter pair) is also shown in Fig. 5. Even- and odd-order biorthogonal wavelet filters require different interconnection strategies for delay line taps and multiplier coefficients. Provision for this has been made using the parameterization scheme described later so that the architecture is fully scalable for any type of biorthogonal wavelet filter.

The number of multipliers required for each of the architectures described (and thus an estimate of the silicon area requirement) can be obtained from

$\text{Number of multipliers} = \left\lceil \dfrac{\text{Filter Taps}}{4} \right\rceil$ (3)

where the $\lceil \cdot \rceil$ operation requires rounding up of the value within the brackets to the nearest integer.

Here, one adder is associated with the input of every multiplier. Similarly, adders are required in the accumulation path of the multiplier outputs. A further adder, with associated delay elements, is also required at the filter output for the accumulator (Figs. 4 and 5). Equivalent direct-form finite impulse response (FIR) filters require multipliers and adders for an -tap filter. Exploitation of symmetry reduces this number to multipliers and M adders. The polyphase decomposition technique, as presented in orthonormal wavelet filters, requires multipliers and adders for a similar implementation. The new architectures presented here, based on simultaneous exploitation of symmetry and polyphase decomposition, result in a saving of up to 4X over direct-form FIR filters and up to 2X over polyphase decomposition techniques, while retaining the modularity and scalability in the design.

Fig. 4. Nine-tap low-pass analysis filter for biorthogonal wavelet transform.

Fig. 5. Efficient implementation of a four-tap biorthogonal wavelet filter.
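The multiplier counts compared in this section can be tabulated with a few lines of Python. Only the last entry, one multiplier for every four taps rounded up, follows the rule of Eq. (3). The direct-form count and the rounded-up half counts for the symmetry-only and polyphase-only cases are the standard figures for such filters and are stated here as assumptions; the function name and dictionary keys are hypothetical.

```python
import math


def multiplier_counts(taps):
    """Multipliers needed for one analysis filter with the given number of taps."""
    return {
        "direct_form": taps,
        "symmetry_only": math.ceil(taps / 2),
        "polyphase_only": math.ceil(taps / 2),
        "symmetry_and_polyphase": math.ceil(taps / 4),   # rule of Eq. (3)
    }
```

For a nine-tap filter this gives three multipliers for the combined scheme against nine for the direct form, and for an eight-tap filter two against eight, the latter being the full fourfold saving.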

IV. RAPID DESIGN OF WAVELET TRANSFORM CORES

Parameterized silicon cores to implement a wide range of orthonormal and biorthogonal wavelet functions have been created using the generic architectures described. The core elements in these have themselves been created from a lower level library of parameterized architectural templates [Intellectual Property (IP) cores] for implementing functions such as multipliers, multiplexers and delay elements [11], with these captured using the VHSIC hardware description language (VHDL). For the purpose of this research, attention was focused mainly on the use of bit parallel, Booth encoded, Wallace-tree multipliers, although, as indicated earlier, the concepts are easily extendible to many other types of multipliers including ones with bit serial, digit serial as well as bit parallel data organizations. When coupled with parameterized wordlengths, levels of pipelining and filter length, this provides high flexibility in design space exploration and allows the rapid tailoring of circuits to meet throughput, power and area requirements. Generic cores based on the blocks described, therefore, allow a multistage wavelet transform core for any orthonormal or biorthogonal wavelet type and at any wordlength (within practical bounds) to be quickly created and implemented as silicon designs.

Fig. 6. Scalable architecture for biorthogonal wavelet filters.

A. Orthonormal Wavelet Cores

The orthonormal wavelet filter architecture in Fig. 3 was implemented using a number of modular building blocks. Here, the input component (i) comprises two multipliers in the filter bank and a delay element. The latter is incorporated to synchronise the input reading of data and thus facilitates the seamless interfacing of multiple wavelet blocks. The block labeled (ii) in Fig. 3 comprises two delay elements and two multiplier/accumulators (MACs) and implements repeating taps of the analysis filter bank. The output circuit for the analysis filter then comprises an accumulator and decimator as outlined in block (iii). In this case, outputs are only produced when two data samples have been processed. The delay element at the output removes the need for glue logic when cascading these cores.

B. Biorthogonal Wavelet Cores

The architectures for biorthogonal filters shown in Figs. 4 and 5 have been generalized into a scalable architecture, illustrated schematically in Fig. 6. This architecture was obtained by considering the fact that the same filter coefficient will multiply with two different data samples due to symmetry and that the coefficient supplied at the input to each multiplier will change in even and odd cycles due to bi-phase decomposition. There is a small additional overhead incurred in this design in terms of adders and multiplexers, but this makes the architecture much more generic and modular. Another important advantage of this scheme is that it offers a simple interconnection and scheduling mechanism that lends itself to architecture parameterization.

A parameterized SoC design system for implementing any general biorthogonal wavelet filter can be produced in the following four steps. The input parameters supplied to the core include number of taps, wordlengths and wavelet type.

• Generation of delay line: An $N$-tap delay line is generated using latches.

• Generation of processing elements: The processing elements, each comprising a multiplier, an adder and two multiplexers, are created. Every four taps of the filter requires one processing element (see Fig. 6). To facilitate the generation of architectures comprising biorthogonal wavelet filters with inverse-symmetry [i.e., $h(n) = -h(N - 1 - n)$], the adder in the processing element of Fig. 6 is changed to a subtractor; examples include biorthogonal 6, 2 tap wavelet filters. This selection is also possible through generic parameters.

• Generation of adder chain: An adder chain is required to accumulate the results produced by the multiplications. The number of adders and their interconnections are also automatically generated. Here $P - 1$ adders are required for this operation, where $P$ is the number of processing elements.

• Output decimator and accumulator: The final constituent block in the scalable architecture is the output accumulator based on a latch plus carry lookahead adder, as shown in Fig. 6. This produces a decimated output by storing the intermediate results in the latch. The output is controlled by a synchronous available signal. This removes the need for interface glue logic when creating multilevel wavelet systems.

The generic circuits described for both orthonormal and biorthogonal wavelet filters operate using a simple control circuit, with the only external signals required being the Clock and Reset signals. Specific coefficient values are internally assigned during the logic synthesis process and derived from generic specifications, as described below.


C. Coefficient Allocation Procedure

The architectures described have been captured in VHDL. This method allows replication of smaller processing blocks through generic specification, but there is no direct mechanism to acquire the wavelet coefficients from a high level description. Coefficients cannot be directly supplied at the time of instantiation because only “integers” are permitted as generic VHDL parameters. The following method was therefore developed to allow access to a wide range of wavelet families through a generic description.

MATLAB code was first used to generate a text file containing the coefficients for all required wavelets. The real-number coefficients in this text file are converted to two’s-complement format at the maximum desired resolution of MAX bits. The variable MAX was chosen as 16 bits in this system; however, this can be easily varied. The actual coefficient resolution (between 1 bit and MAX bits) is specified in the generic description of the cores during instantiation. These values are then appropriately embedded in a VHDL package that is included in the core description. Each wavelet filter is identified by a separate index value that is specified in generics and used to access the appropriate set of filter coefficients. The coefficient wordlengths are selected by VHDL code that returns only the required number of bits of precision specified in the generics. The wordlength growth after every stage can easily be adjusted by observing the software simulation.
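The coefficient allocation procedure can be mimicked in software as follows. This Python sketch quantises a real coefficient to a two's-complement word of `MAX_BITS` bits and then keeps only the most significant bits requested, as a stand-in for the MATLAB and VHDL code described above, which is not reproduced in this paper. The fractional scaling (coefficients assumed to lie in the range -1 to 1), the rounding and saturation rules, and all function names are assumptions.

```python
MAX_BITS = 16  # maximum stored resolution, as chosen in the text


def to_twos_complement(value, bits=MAX_BITS):
    """Quantise a coefficient in [-1, 1) to a two's-complement bit string."""
    scale = 1 << (bits - 1)
    code = int(round(value * scale))
    code = max(-scale, min(scale - 1, code))        # saturate at the extremes
    return format(code & ((1 << bits) - 1), "0{}b".format(bits))


def select_bits(word, bits):
    """Return only the `bits` most significant bits of a stored word."""
    return word[:bits]


def to_real(word):
    """Interpret a two's-complement bit string as a fraction in [-1, 1)."""
    bits = len(word)
    code = int(word, 2)
    if word[0] == "1":
        code -= 1 << bits
    return code / (1 << (bits - 1))
```

A coefficient stored once at 16 bits can then be delivered at any shorter wordlength, for example `select_bits(to_twos_complement(c), 8)` for an 8-bit core, with an error bounded by one least significant bit of the shorter word plus the initial rounding error.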

V. DESIGN CASE STUDIES

Numerous design examples have been undertaken to illustrate the ease and speed with which wavelet transforms can be implemented. The methods presented here allow a nonspecialist DSP engineer to develop a silicon implementation of a full wavelet system concurrently with algorithm development. The following generic parameters are all that need to be specified at the time of instantiation of any wavelet block: 1) wavelet type, e.g., Daubechies, biorthogonal 9, 7 tap, etc.; 2) levels of decomposition; 3) length of low-pass and high-pass filters in biorthogonal wavelets; 4) data wordlength; 5) coefficient wordlength; 6) wordlength extension to prevent overflow (and to cater for cascading stages). All the cores described below have been functionally tested and verified using VHDL test benches. It is important to note that the performance measures reported in this paper for each core are entirely dependent on chosen wordlengths and wavelet type.

An important advantage of this approach is that it readily allows the choice of a particular core and corresponding wordlengths to be determined on the basis of given design constraints. The cores can be cascaded for a multiple level wavelet decomposition mirroring the signal flow graph. The control circuit of the cores designed here allows a direct cascading of multiple blocks for this purpose. For low-bandwidth systems, the data can be recycled for higher levels using appropriate memory, multiplexers, and control signals.

A. Orthonormal Wavelet Transforms

The following are some examples of silicon wavelet implementations developed using the methods described.

TABLE II. CORE SPECIFICATIONS FOR DIFFERENT WORDLENGTHS OF BIORTHOGONAL 9, 7 TAP WAVELET FUNCTION

1) Daubechies wavelets: In this design, 9-bit data and coefficient values in two’s-complement format were used. The Synopsys environment was used to synthesize the design and generate a netlist file, which was then converted to a silicon layout. Such a core comprises about 12 K gates and requires an area of 0.39 mm². The performance measures are for single-stage decomposition implemented in a triple level metal, 0.18-µm CMOS technology. This single stage filter bank can be reused to implement higher octaves by employing folding and scheduling techniques. A three-level cascaded wavelet decomposition based on a Daubechies four-tap wavelet (as in Fig. 1) was instantiated, with coefficient and data resolution being 7 and 9 bits, respectively. A 6-bit truncation was applied after every stage so that the internal wordlengths are 18, 21, and 24 bits for the first, second, and third stages, respectively. In this case, the silicon area required is 0.625 mm², requiring around 20 K gates. A clock frequency of over 160 MHz was achieved in this design. This corresponds to a data sample rate of 80 MSamples/s.

2) Symmlet 12-tap wavelet, two stages: In this case, a two-level wavelet decomposition based on the use of two cascaded Symmlet 12-tap wavelet functions was synthesized. Here the (two’s-complement) coefficient and data wordlengths were 10 and 9 bits, respectively, with internal wordlength accuracies of 25 and 32 bits. Here, 9-bit truncation is employed between stages. This design requires 35 K gates and has a silicon area of 0.913 mm² (0.768 mm × 1.187 mm).
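The internal wordlengths quoted for these two designs are consistent with a simple bookkeeping rule, sketched below in Python: the internal wordlength of a stage is taken as the sum of its input data wordlength, the coefficient wordlength, and a wordlength extension, and the next stage receives this value less the truncation. This rule, the function name, and the extension values of 2 and 6 bits used in the usage note are inferred here from the reported figures; they are not stated in the design descriptions above.

```python
def internal_wordlengths(data_bits, coeff_bits, extension_bits,
                         truncation_bits, stages):
    """Internal wordlength of each cascaded stage under the assumed rule."""
    widths = []
    in_bits = data_bits
    for _ in range(stages):
        internal = in_bits + coeff_bits + extension_bits
        widths.append(internal)
        # The next stage sees the truncated output of this one.
        in_bits = internal - truncation_bits
    return widths
```

Under these assumptions, `internal_wordlengths(9, 7, 2, 6, 3)` reproduces the 18, 21, and 24 bits of the three-level Daubechies design, and `internal_wordlengths(9, 10, 6, 9, 2)` reproduces the 25 and 32 bits of the two-level Symmlet design.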

B. Biorthogonal Wavelet Transforms

A number of designs were also created based on the biorthogonal wavelet filter architecture shown in Fig. 6. The details of each are presented below:

1) Biorthogonal 9, 7 tap wavelet transform: A typical silicon layout for a single stage design with two’s-complement 9-bit coefficients and data occupies an area of 0.321 mm² in the ASIC technology used earlier and comprises around 10 K gates. It can process data at sample rates in excess of 150 MHz. Silicon designs for a wide range of wordlength specifications have also been created. Three single-stage cores along with storage memory can form a two-dimensional (2-D) wavelet decomposition as required in the JPEG 2000 image coding standard. Details of some of these cores are presented in Table II. The effect of wordlength specification on the size of the core is obvious from this table. The methods presented in this paper allow design changes and choices to be made on the fly.

TABLE III. FPGA IMPLEMENTATION OF BIORTHOGONAL WAVELET TRANSFORMS

C. FPGA Implementation

The biorthogonal wavelet filter designs were also ported to Xilinx 4052XL series speed grade FPGA devices. The performance figures obtained are tabulated in Table III, with details depending on the targeted device. This particular device was selected for reasons of availability in the target library, and the number of configurable logic blocks (CLBs) (1936) appeared sufficient for the demonstration of our architectures and designs. A clock frequency of over 65 MHz was achieved in all the implementations, which is sufficient for some real-time video processing applications. A similar design for a Daubechies eight-tap wavelet analysis using 8-bit data and 7-bit coefficients requires 353 CLBs and 50 input/output blocks (IOBs).

D. Wavelet Packet Transform

The cores developed previously for orthonormal and biorthogonal wavelet transforms have been employed to create silicon designs for a number of wavelet packet decomposition systems. Two related and important issues have to be addressed in the implementation of wavelet packet transforms. The first is that such systems allow different wavelet functions to be used to implement each filter bank. The second is the interfacing of different filter banks at successive stages. The architectures described provide the means for doing this in a very straightforward manner.

In the case of wavelet packet transforms, the internal architecture of the filter banks remains unchanged from the dyadic wavelet decomposition but the external arrangement for higher levels is variable and flexible. A basic core for wavelet decomposition can therefore be reused to produce any arbitrary wavelet packet decomposition. Such cores are easily cascadable because of the pipelined input and output and an embedded control circuit. The choice of a wavelet function and the interconnection of filter banks are easily specified during instantiation. Some examples of implementations based on this approach are given below.

1) Two-level wavelet packet transform: This design illustrates the use of wavelet transform cores in developing a two-level wavelet packet decomposition, as shown in Fig. 2. The first level has been instantiated with a Daubechies eight-tap wavelet core, whereas two Daubechies four-tap wavelet cores have been used in the second stage of analysis. As before, data and coefficient values are in a two’s-complement format and comprise 9 and 8 bits, respectively. The output from the first stage (Daubechies eight-tap) is truncated to 13 bits and used as input to the two subsequent filter banks. The output from the final (Daubechies four-tap) filter banks is then truncated to 15 bits. The methodology, however, allows a flexible mechanism for allocation and truncation of wordlengths.

The resulting design requires 28 K gates and an area of 1.097 mm² (0.806 mm × 1.362 mm). The maximum input data throughput is in excess of 150 MHz. A similar wavelet packet transform with the wavelet functions reversed comprises 31 K gates but improves the power consumption by over 20%. This is due to the fact that the wavelet function with the higher number of taps is now at the second level of decomposition, with this operating at half the frequency of the first.

2) Biorthogonal wavelet packets (Irregular Decomposition): A three-level wavelet packet transform has also been produced. A range of wavelet functions was utilized to demonstrate the flexibility in this scheme. The first stage uses biorthogonal 9, 7 tap wavelet filters, the second stage (i.e., two sets of filter banks) comprises biorthogonal 9, 3 tap wavelet functions and the third stage utilizes biorthogonal 6, 2 tap wavelet functions (two sets of filter banks). The last stage filter banks are connected to the low-pass output of stage 2. The high-pass output of stage 2 is directly available at the output. The input data as well as the coefficients are represented in an 8-bit two’s-complement format. An 8-bit truncation was incorporated at the outputs of the first and the second stage filter banks. The design comprises 44 K gates. The characteristics of individual blocks are summarized in Table IV. As mentioned earlier, these cores easily operated at over 160 MHz, corresponding to an input data rate of 80 MSamples/s.

3) Combination of orthonormal and biorthogonal wave-lets: In this case, a silicon layout for a two-level balancedwavelet packet tree, as shown in Fig. 2, was produced. Here,a biorthogonal 9, 3 tap wavelet function is employed in thefirst level, whereas the succeeding level comprises Daubechiesfour-tap and Daubechies eight-tap wavelet functions. The latterorthonormal filter banks are connected to the low-pass and thehigh-pass outputs of the first stage biorthogonal filter bank,


TABLE IV. SILICON AREA OF BLOCKS INSTANTIATED IN BIORTHOGONAL WAVELET PACKET DECOMPOSITION

Fig. 7. Folded implementation of an eight-tap wavelet filter.

respectively. The input data and coefficients comprised 9 bits with a similar truncation between the filter banks. The total silicon area in this case is 0.92 mm² (0.735 mm × 1.252 mm) and the number of gates required is 28 K.

E. Architecture Folding¹

The designs described contain separate hardware implementations for each stage in the filter bank. In many practical applications the 150-MHz sampling rate achievable from these cores is well beyond what is required. The architectures developed have the attraction that they can be easily and systematically folded and retimed through the multiplexing of operations (multiplication, addition, accumulation, etc.) onto a reduced number of components. The amount of folding required depends on the wavelet choice and the downsampling (decimation) ratio. A parameterized generator for wavelet transforms that incorporates such folding has therefore also been developed, with the folding factor being incorporated as an additional parameter in generic specifications.

The schematic of a Daubechies eight-tap wavelet filter instantiated with a folding parameter of four is shown in Fig. 7. The principle of the folded wavelet cores is to spread the computations of wavelet coefficients over multiple computation cycles. The amount of time available depends upon the desired throughput, which is linked to the folding parameter. As

¹The material provided in Section V-E and related information is the subject of a U.S. patent application.

shown in the figure, the circuit computes the partial products for both the even and odd cycles at time […], and […]. A similar core instantiated with a folding parameter of two would consist of two multipliers, each computing the partial products at times […] and […], respectively. The accumulator in this circuit has been slightly modified to facilitate operation over a variable range of folding requirements. In this case, the output block comprises two latches, in the forward and feedback paths, respectively, as well as a pipelined adder. Hardware sharing in this wavelet filter architecture leads to a tradeoff between speed, area, and power whilst retaining the generic architectural attributes and on-the-fly coefficient allocation described earlier.

A provision has been made in the parameterization scheme to allow folding of the complete filter architecture onto a single multiplier. This means that an eight-tap Daubechies wavelet filter, instantiated with a folding parameter of “4,” comprises a single MAC unit onto which all the filtering operations are multiplexed. Further increases in hardware efficiency through multiplexing can be achieved by using digit-serial or bit-serial multipliers. However, the flexibility displayed in this scheme in terms of wordlength specification cannot be attained in other architectures.
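The idea of spreading one filter's multiply-accumulate operations over several cycles can be sketched in Python as follows. This is a behavioral illustration only, not the circuit of Fig. 7: the function names and the `n_mult` parameter (number of physical multipliers) are hypothetical, integer samples stand in for two's-complement data, and the decimate-by-two convention y[n] = sum_k h[k] x[2n - k] is an assumption.

```python
def direct_decimating_fir(x, h):
    """Reference model: all taps of one output are evaluated at once."""
    out = []
    for n in range(len(x) // 2):
        acc = 0
        for k, hk in enumerate(h):
            idx = 2 * n - k
            if idx >= 0:  # samples before x[0] are taken as zero
                acc += hk * x[idx]
        out.append(acc)
    return out


def folded_decimating_fir(x, h, n_mult):
    """Folded model: only n_mult multiplications are issued per cycle.

    The remaining taps are time-multiplexed onto the same multipliers in
    later cycles and summed in a single accumulator.
    Returns (outputs, cycles_per_output).
    """
    taps = len(h)
    assert taps % n_mult == 0, "taps must divide evenly over the multipliers"
    cycles_per_output = taps // n_mult
    out = []
    for n in range(len(x) // 2):
        acc = 0  # accumulator is cleared at the start of each output
        for cycle in range(cycles_per_output):
            for m in range(n_mult):
                k = cycle * n_mult + m  # tap handled by multiplier m this cycle
                idx = 2 * n - k
                if idx >= 0:
                    acc += h[k] * x[idx]
        out.append(acc)
    return out, cycles_per_output


samples = list(range(1, 17))
coeffs = [1, -2, 3, -4, 5, -6, 7, -8]  # placeholder integers, not Daubechies values
reference = direct_decimating_fir(samples, coeffs)
single_mac, cycles = folded_decimating_fir(samples, coeffs, n_mult=1)
```

With `n_mult=1` every output takes eight cycles on one MAC unit, yet the results match the reference exactly, which illustrates the speed-for-area trade described in the text.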

A silicon design capable of performing a three-level DWT and based on the folded architecture described has been generated, with details presented in Table V. The first stage operates at 160 MHz and has a folding parameter of “1.” The


TABLE V. SPECIFICATIONS OF THREE-LEVEL FOLDED WAVELET CORES

Fig. 8. Initial floor plan of a Daubechies eight-tap, three-level wavelet transform processor.

second stage operates at 80 MHz and has a folding parameter of “2.” Similarly, the third stage operates at 40 MHz with a folding parameter of “4.” The total number of gates required in this case, for a three-stage decomposition, is around 36 K. The initial layout of this chip is shown in Fig. 8.
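The relationship between decomposition level, sample rate, and folding parameter reported above follows a simple pattern, sketched below in Python. The function name `stage_plan` is hypothetical, and the rule that the folding parameter doubles each time the sample rate halves is inferred from the three reported stages rather than stated as a general law in the text.

```python
def stage_plan(input_rate_mhz, levels):
    """List the sample rate and folding parameter for each DWT level.

    Assumption: every level decimates by two, so level i runs at
    input_rate / 2**(i - 1) and is folded by the same factor 2**(i - 1).
    """
    plan = []
    for level in range(1, levels + 1):
        decimation = 2 ** (level - 1)
        plan.append(
            {
                "level": level,
                "rate_mhz": input_rate_mhz / decimation,
                "folding": decimation,
            }
        )
    return plan


three_level = stage_plan(160, 3)
```

For a 160-MHz input and three levels this reproduces the reported 160, 80, and 40 MHz rates with folding parameters 1, 2, and 4, so the product of rate and folding parameter is the same at every level.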

VI. DISCUSSION AND CONCLUSION

A methodology is presented that allows a nonspecialist to very rapidly design highly efficient silicon wavelet transform cores from a high level specification. This is based on generic scalable architectures utilizing time-interleaved coefficients for the wavelet filters. These architectures are parameterized in terms of wavelet family, wavelet type, data wordlength, and coefficient wordlength. The approach is flexible both in the scalability of architecture and the choice of wavelet basis functions. A new wavelet type can easily be added whenever required. The control circuitry required is self-contained and has been designed so that the cores can be cascaded without any interface glue logic, for any desired level of wavelet decomposition or reconstruction. Efficient architectures for both orthonormal and biorthogonal wavelet filters were developed and used as the basis for the parameterized generator presented. This contrasts with existing research, which has tended to focus mainly on

specific examples of orthonormal wavelets. Moreover, the new architectures for biorthogonal wavelet transforms reported are the first to concurrently exploit characteristic properties, such as symmetry and the multirate nature of such filters.

Case studies for stand-alone and cascaded silicon cores for various wavelet algorithms are reported. The typical design time to produce silicon layout of a wavelet-based system has been reduced to less than a day. The time from specification to implementation is the time required to run the simulation, synthesis, and layout tools. The designs have been captured in VHDL, are portable across a range of foundries and target technologies, and are applicable to FPGA and PLD implementations. The hierarchical approach used in the creation of the various silicon generators described means that tightly designed smaller blocks are used to create larger library blocks (such as multipliers), which are in turn used to create the circuits described. This “bottom-up,” architecture-based approach results in highly efficient silicon designs, comparable with handcrafted ones, and contrasts strongly with the common (and often highly inefficient) approach of creating RTL based cores from a high level VHDL description.

As discussed in Section I, a number of specific DWT chip designs have been reported in the literature [4], [5], [11]–[16]. A key aspect of the generalized methodology described is that benchmark performance figures compare very favorably with previous, fixed specification, full-custom designs. For example, the Daubechies four-tap, 3-level 0.8-μm CMOS design reported by Yu et al. [5] requires a silicon area of 8.5 mm². This roughly corresponds to 0.679 mm² in 0.18-μm CMOS technology and compares with a figure of 0.604 mm² in our approach. Our design also produces much better area and performance figures than those reported by Sheu [12]. The FPGA results for the biorthogonal 9, 7 DWT also compare very favorably with those reported by Altera [11], clearly demonstrating that it is possible to create generic silicon designs which are highly competitive with specific manual designs but can be created in a fraction of the time previously required.

Whereas some FPGA implementations of biorthogonal wavelets have been reported [11]–[16], the full details of architecture and wordlengths are not available. Previously described implementations are based on direct-form FIR filter design and do not utilize the hardware efficiencies described. In addition, designs such as those reported by Schoner [13] and Truchetet [14] are specifically aimed at short-length wavelet filters with


coefficients being “integers” or “powers of two.” The Altera FPGA megafunction [11] allows parameterization of input and output wordlengths but the wavelet choice is restricted. The designs presented in this paper are much more general: any of the family of biorthogonal wavelet functions can be instantiated from the generic core described.

A multilevel wavelet system can be produced using either filter bank folding or by cascading a number of cores instantiated with appropriate wordlengths and wavelet choice. The former approach is hardware efficient but suffers from high latency, whereas the latter approach provides the highest throughput. Since a word-parallel data format was used in the design of the cores reported, the designs in the latter case may not be suitable for very high levels of decomposition or low throughput applications. In these circumstances, the folded cores presented in Section V and the use of digit-serial or bit-serial library components can improve silicon area efficiency while retaining the flexibility in wavelet choice and architecture scaling. This can provide efficient design solutions for a very broad range of target applications. Parameterized architectures for wavelet packet decomposition containing orthonormal and biorthogonal wavelet functions have also been created. The implementation examples described include arbitrary packet tree decomposition with a variety of wavelet combinations. An example containing a mix of orthonormal and biorthogonal wavelet functions has also been presented. The possibility of selectively applying quantization in any desired output path, as demonstrated, is advantageous in signal coding. In these applications, each subband contributes differently to the overall system characteristics and an independent assignment of wordlengths can make the hardware design much more flexible and efficient. This flexibility is not possible in previously presented architectures [16].

REFERENCES

[1] K. K. Parhi and T. Nishitani, “VLSI architectures for discrete wavelet transforms,” IEEE Trans. VLSI Syst., pp. 191–202, June 1993.

[2] M. Vishwanath, R. M. Owens, and M. J. Irwin, “VLSI architectures for the discrete wavelet transform,” IEEE Trans. Circuits Syst. II, vol. 42, pp. 305–316, May 1995.

[3] C. Chakrabarti, M. Vishwanath, and R. M. Owens, “Architectures for wavelet transforms: A survey,” J. VLSI Signal Processing, pp. 171–192, 1996.

[4] A. Grzeszczak, M. K. Mandal, S. Panchanathan, and T. Yeap, “VLSI implementation of discrete wavelet transform,” IEEE Trans. VLSI Syst., pp. 421–433, Dec. 1996.

[5] C. Yu, C. A. Hsieh, and S. J. Chen, “Design and implementation of a highly efficient VLSI architecture for discrete wavelet transform,” in Proc. IEEE Custom Integrated Circuits Conf., 1997, pp. 237–240.

[6] I. Daubechies, “Orthonormal basis of compactly supported wavelets,” Commun. Pure Appl. Math., vol. XLI, pp. 909–996, 1988.

[7] A. Cohen, I. Daubechies, and J. C. Feauveau, “Biorthogonal bases of compactly supported wavelets,” Commun. Pure Appl. Math., pp. 485–560, 1992.

[8] S. Masud and J. V. McCanny, “Finding a suitable wavelet for image-compression applications,” in Proc. IEEE Int. Conf. Acoustic Speech Signal Processing, vol. V, May 1998, pp. 2581–2584.

[9] A. S. Lewis and G. Knowles, “VLSI architecture for 2-D Daubechies wavelet transform without multipliers,” Electron. Lett., pp. 171–173, 17, 1991.

[10] T. C. Denk and K. K. Parhi, “VLSI architectures for lattice structure based orthonormal discrete wavelet transforms,” IEEE Trans. Circuits Syst. II, vol. 44, pp. 129–132, Feb. 1997.

[11] Biorthogonal wavelet filter megafunction (1997, Feb.). [Online]. Available: http://www.altera.com

[12] M. H. Sheu, M. D. Shieh, and S. F. Cheng, “A unified VLSI architecture for decomposition and synthesis of discrete wavelet transform,” in Proc. IEEE Midwest Symp. Circuits Systems, vol. 1, Aug. 1996, pp. 113–116.

[13] B. Schoner, C. Jons, and J. Villasenor, “Issues in wireless video coding using run-time-reconfigurable FPGAs,” in Proc. IEEE Symp. FPGAs for Custom Computing Machines, Apr. 1995, pp. 85–89.

[14] F. Truchetet and A. Forys, “Implementation of still-image compression-decompression scheme on FPGA circuits,” in Proc. IEEE Pacific Rim Conf. Communications, Computers Signal Processing, vol. 1, 1997, pp. 481–484.

[15] ADV601 Chip, Analog Devices. [Online]. Available: http://www.analog.com

[16] X. Wu, Y. Li, and H. Chen, “Programmable wavelet packet transform processor,” Electron. Lett., pp. 449–450, 18, 1999.

Shahid Masud (S’92–M’92) received the B.Sc.Engg. (Honors) degree from the University of Engineering and Technology, Lahore, Pakistan, the M.Eng.Sc. degree from the University of New South Wales, Sydney, Australia, and the Ph.D. degree from Queen’s University Belfast, Belfast, U.K., in 1990, 1992, and 1999, respectively.

He was with Amphion Semiconductor Ltd., Belfast, U.K. He has published 20 papers in important journals and conferences and has three patents pending. He is currently an Assistant Professor at Lahore University of Management Sciences, Lahore, Pakistan.

Dr. Masud is a member of the Institute of Electrical Engineers, U.K., and a Chartered Engineer (C.Eng.).

John V. McCanny (M’89–SM’95–F’99) received the B.S. degree (honors) in physics from the University of Manchester, Manchester, U.K., the Ph.D. degree in solid-state physics from the new University, Ulster, N. Ireland, U.K., and the higher doctorate (D.Sc.) degree (in recognition of his research contributions) from the Queen’s University Belfast, Belfast, U.K., in 1973, 1978, and 1998, respectively.

He joined the Royal Signals and Radar Establishment (now Qinetiq), Malvern, U.K., in 1979, and was Principal Scientific Officer when he left in 1984. In 1984, he was appointed a Lecturer in the Department of Electrical and Electronics Engineering, Queen’s University Belfast, where he became a Reader in 1988 and a Full Professor in 1989. He is currently the Director of the Institute for Electronics Communications and Information Technology (ECIT), a $75 M facility currently being built on the Northern Ireland Science Park. He has published over 250 scientific papers and five research books in the field of very large-scale integrated architectures and system-on-chip design for signal and image processing, and holds ten patents. He has also successfully co-founded two high technology companies: Audio Processing Technology Ltd., which markets hi-fi audio compression products worldwide, and Amphion Semiconductor, a leading supplier of semiconductor intellectual property for digital TV and video multimedia applications.

Prof. McCanny was awarded a Royal Academy of Engineering Silver Medal for “outstanding contributions to U.K. Engineering leading to commercial development,” in 1996. He was also awarded an IEEE Third Millennium medal in 2000. From 1999 to 2000 he chaired the IEEE Signal Processing Society’s Technical Committee on the Design and Implementation of Signal Processing Systems and was also a member of the Society’s Technical Directions committee. He was elected Fellow of the Royal Society (of London) in 2002 and was also awarded a Commander of the Order of the British Empire (CBE) by Queen Elizabeth II in the same year. He has recently been awarded the 2003 Royal Dublin Society/Irish Times Boyle Medal, which recognizes scientific excellence in Ireland. He is a Fellow of the U.K. Royal Academy of Engineering, the Institute of Electrical Engineers, U.K., and the Institute of Physics, and a member of the Royal Irish Academy.
