
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 4, FEBRUARY 15, 2013 921

A High-Performance Energy-Efficient Architecture for FIR Adaptive Filter Based on New Distributed Arithmetic Formulation of Block LMS Algorithm

Basant K. Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract—In this paper, we present an efficient distributed-arithmetic (DA) formulation for the implementation of block least mean square (BLMS) algorithm. The proposed DA-based design uses a novel look-up table (LUT)-sharing technique for the computation of filter outputs and weight-increment terms of BLMS algorithm. Besides, it offers significant saving of adders, which constitute a major component of DA-based structures. Also, we have suggested a novel LUT-based weight updating scheme for BLMS algorithm, where only one set of LUTs out of M sets needs to be modified in every iteration, where N = ML, and N and L are, respectively, the filter length and input block-size. Based on the proposed DA formulation, we have derived a parallel architecture for the implementation of BLMS adaptive digital filter (ADF). Compared with the best of the existing DA-based LMS structures, proposed one involves nearly times adders and times LUT words, and offers nearly times throughput of the other. It requires nearly 25% more flip-flops and does not involve variable shifters like those of existing structures. It involves less LUT access per output (LAPO) than the existing structure for block-size higher than 4. For block-size 8 and filter length 64, the proposed structure involves 2.47 times more adders, 15% more flip-flops, 43% less LAPO than the best of existing structures, and offers 5.22 times higher throughput. The number of adders of the proposed structure does not increase proportionately with block-size, and the number of flip-flops is independent of block-size. This is a major advantage of the proposed structure for reducing its area delay product (ADP), particularly when a large order ADF is implemented for higher block-sizes. ASIC synthesis result shows that the proposed structure for filter length 64 has almost 14% and 30% less ADP and 25% and 37% less EPO than the best of the existing structures for block-size 4 and 8, respectively.

Index Terms—Adaptive filters, block LMS, distributed arithmetic, VLSI.

I. INTRODUCTION

ADAPTIVE DIGITAL FILTERS (ADFs) are widely used in various signal-processing applications, such as echo cancellation, system identification, noise cancellation and channel equalization [1]. Amongst the existing ADFs, the least mean square (LMS)-based finite impulse response (FIR) adaptive filter is the most popular one due to its inherent simplicity and satisfactory convergence performance. However, the delay in availability of the feedback-error for updating the weights according to the LMS algorithm does not favor its pipeline implementation when the sampling rate is high. Haimi et al. [2] have proposed the delayed LMS (DLMS) algorithm for pipeline implementation of LMS-based ADF. The delayed LMS is similar to the LMS algorithm except that the correction terms for updating the filter weights of the current iteration are calculated from the error corresponding to a past iteration.

Several schemes have been proposed to implement the DLMS-based ADFs efficiently in a systolic VLSI with minimum adaptation delay [2]–[4], [7], [8]. To avoid adaptation delay in pipelined LMS ADF, Poltmann [5] has proposed a modified DLMS algorithm, which is used by Douglas et al. [6] to derive a systolic architecture. But the structure of [6] involves a large amount of hardware resources compared to the earlier one [2].

The block LMS (BLMS) ADF [9] is one of the useful derivatives of the LMS ADF for fast and computationally-efficient implementation of ADFs. Unlike the conventional LMS ADF, the BLMS ADF accepts a block of input for computing a block of output and updates the weights using a block of errors in every training cycle. The BLMS ADF has convergence performance similar to the LMS ADF, but the BLMS ADF of block-length L offers L-fold higher throughput compared with the other. Keeping this in view, many variants of the BLMS algorithm, like time and frequency-domain block filtered-X LMS (BFXLMS), have been proposed for specific applications [20]. Das et al. [21] have proposed efficient BFXLMS using FFT and fast Hartley transform (FHT), which is computationally more efficient. We have proposed a delayed block LMS (DBLMS) algorithm [15], and a concurrent multiplier-based architecture for high-throughput pipeline implementation of BLMS ADFs. The structure of [15] provides L-fold higher throughput rate and demands L times more resources compared to those of DLMS ADF. Baghel et al. [17], [18] have suggested a distributed-arithmetic (DA)-based structure for FPGA implementation of BLMS ADFs. A low-complexity design has been proposed in [19] for BLMS ADFs. This structure supports a very low sampling rate since it uses a single multiply-accumulate (MAC) cell for the computation of filter output and weight-increment term.

To take advantage of DA-based hardware designs [12], Allred et al. [10] have suggested a scheme to derive a DA-based design for LMS-ADF. The structure of [10] requires separate look-up-tables (LUTs) for the calculation of filter output and weight-increment terms. The LUTs used for the computation of filter output and weight-increment term of DA LMS-ADF are named DA-F-LUT and DA-A-LUT, respectively. In every iteration, the entire content of DA-F-LUT is updated to compute the weight-increment term, while half the content of DA-A-LUT is updated to accommodate the new input sample arriving at the current iteration. Updating the LUTs is the most time-consuming operation in DA-based LMS-ADF, since the updating is performed sequentially at different LUT locations. The LUT update time, therefore, depends on the size of the LUT to be updated. For most practical adaptive filters, we need to use a decomposition scheme, where small size LUTs can be used in DA-based LMS-ADF, which helps in reducing not only the LUT size but also the LUT-update time. Recently, Guo et al. [16] have suggested a scheme to avoid the DA-A-LUT in DA-based LMS-ADF, where both filtering and weight-updating are performed using DA-F-LUT. On the other hand, the throughput rate of existing DA-LMS ADFs could be slow for real-time applications due to the bit-serial nature of DA computation. Although there is some interesting work on DA-based LMS ADF [10], [16], we find that the potential application of DA for the implementation of BLMS ADF is yet to be explored.

In order to reduce the power consumption of DA-based designs, we aim at reducing the number of words in the LUT and the number of LUT accesses. A DA-based BLMS ADF structure can be derived by extending the scheme of [10], but this structure would demand L times more hardware (memory and combinational logic) for L times more throughput rate. The scheme of [16] offers sharing of LUT for the computation of both filter output and weight-increment term, but this scheme cannot be applied to derive a DA-based structure for BLMS ADFs, because separate inner-product computation (IPC) is performed for calculation of filter output and weight-increment term of BLMS ADF, whereas in case of LMS ADF, IPC is performed to calculate the filter output only. In this paper, we have formulated the DA-BLMS algorithm for sharing of LUTs for the computation of filter output and weight-increment terms.

The key contributions of this paper are:

• DA-based formulation of BLMS algorithm where both the convolution operation to compute filter output and the correlation operation to compute weight-increment term could be performed by using the same LUT.

• A novel approach for minimization of the number of LUT words to be updated per output. This helps to save external logic and power consumption.

We have derived a DA-based structure for BLMS-ADF using the proposed DA-formulation and a novel LUT updating scheme. The most remarkable aspect of the proposed scheme is that the number of adders required by the structure does not increase proportionately with filter order, and the number of flip-flops required by the structure is independent of the block-size. Apart from that, the proposed structure has significantly less LUT access than the existing DA-LMS structure for higher block-sizes.

The rest of this paper is organized as follows: Mathematical formulation is presented in Section II. The new LUT-update scheme is discussed in Section III, and the proposed structure for DA-based BLMS ADF is presented in Section IV. Hardware- and time-complexities of the proposed structure are discussed in Section V. Conclusion is presented in Section VI.

Manuscript received June 18, 2012; accepted October 07, 2012. Date of publication October 25, 2012; date of current version January 25, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Zhiyuan Yan.
B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226.
P. K. Meher is with the Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (url: http://www1.i2r.a-star.edu.sg/~pkmeher/).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSP.2012.2226453
1053-587X/$31.00 © 2012 IEEE

II. MATHEMATICAL FORMULATION

The BLMS algorithm for updating the filter weights in the k-th iteration is given by

(1)

where the weight-increment vector is defined as

(2)

In (1) and (2), the weight-vector and the error-vector of the k-th iteration are defined as:

Here, μ is the step-size, and the input matrix is derived from the current input block of length L and (N − 1) past samples, given by

The error-vector is computed as

(3)

where the desired response vector is defined as

The k-th block of filter output is computed by the matrix-vector product:

(4)
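As a rough illustration of the steps that (1)-(4) describe, the sketch below runs one BLMS iteration in plain Python. The indexing is an assumption made for this sketch (entry (i, n) of the input matrix is taken as x(kL − i − n), and the block of iteration k covers times kL down to kL − L + 1), and the names `input_matrix` and `blms_iteration` are hypothetical. It is a floating-point reference model, not the DA-based structure derived later.

```python
import random

def input_matrix(x, k, L, N):
    """L x N input matrix of iteration k (assumed layout: entry (i, n) = x[k*L - i - n])."""
    def sample(t):
        return x[t] if 0 <= t < len(x) else 0.0
    return [[sample(k * L - i - n) for n in range(N)] for i in range(L)]

def blms_iteration(w, x, d, k, L, mu):
    """One block update: filter a block, form the error block, then update all weights."""
    N = len(w)
    X = input_matrix(x, k, L, N)
    y = [sum(X[i][n] * w[n] for n in range(N)) for i in range(L)]    # filter output block
    e = [d[k * L - i] - y[i] for i in range(L)]                        # error block
    dw = [sum(X[i][n] * e[i] for i in range(L)) for n in range(N)]    # weight-increment terms
    w_next = [w[n] + mu * dw[n] for n in range(N)]                     # step-size scaled update
    return w_next, y, e

# System-identification demo with an assumed 4-tap unknown system h.
rng = random.Random(0)
h = [0.5, -0.25, 0.125, 0.0625]
x = [rng.uniform(-1.0, 1.0) for _ in range(2001)]
d = [sum(h[n] * x[t - n] for n in range(len(h)) if t - n >= 0) for t in range(len(x))]
w = [0.0] * len(h)
for k in range(1, 1001):
    w, y, e = blms_iteration(w, x, d, k, L=2, mu=0.1)
print([round(v, 4) for v in w])
```

With a noise-free desired response, the adapted weights should approach the taps of `h`; the step-size 0.1 is chosen only so that this toy example is stable.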

A. Computation of Filter Output

The input matrix of size (L × N) can be decomposed into M square matrices of size (L × L) each, where N = ML. Similarly, the weight vector can be decomposed into M short weight-vectors of size L each.

The computation of (4) can then be expressed as the sum of M matrix-vector products:

(5)

where the square sub-matrices and the short weight-vectors are defined as

Each filter output now can be written as the sum of M inner-products as

(6)

where each term is an L-point inner-product of an input-vector and a short weight-vector, given by

(7)

and the input-vector is a row of the corresponding square sub-matrix. Note that we have dropped the iteration subscript in (7) only for convenience of further discussion, without loss of generality.
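The decomposition of the filter output into M short matrix-vector products can be checked numerically. The following sketch assumes the same indexing as a conventional block FIR filter (entry (i, m) of the j-th square sub-matrix taken as x((k − j)L − i − m)); the helper names `sub_matrix`, `block_output` and `direct_output` are hypothetical and integer data is used so that the comparison is exact.

```python
def sample(x, t):
    """x[t], with samples outside the recorded range treated as zero."""
    return x[t] if 0 <= t < len(x) else 0

def sub_matrix(x, k, j, L):
    """j-th L x L sub-matrix of the input matrix of iteration k (assumed indexing)."""
    return [[sample(x, (k - j) * L - i - m) for m in range(L)] for i in range(L)]

def block_output(x, w, k, L):
    """Filter output block as a sum of M short matrix-vector products."""
    M = len(w) // L
    y = [0] * L
    for j in range(M):
        S = sub_matrix(x, k, j, L)
        c = w[j * L:(j + 1) * L]          # j-th short weight-vector
        for i in range(L):
            # L-point inner-product of an input-vector (row of S) and c
            y[i] += sum(S[i][m] * c[m] for m in range(L))
    return y

def direct_output(x, w, k, L):
    """Reference: N-point convolution sum for each output of the block."""
    return [sum(w[n] * sample(x, k * L - i - n) for n in range(len(w))) for i in range(L)]

x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
w = [1, -2, 3, 0, 2, -1]                   # N = 6, so M = 3 for L = 2
y_block = block_output(x, w, k=5, L=2)
y_direct = direct_output(x, w, k=5, L=2)
print(y_block == y_direct)
```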

B. Computation of Weight Increment Term

The weight-increment vector can be decomposed into M short vectors of size L each.

Computation of (2) can be performed through M independent matrix-vector multiplications using the relation

(8)

where the terms of (8) are defined as

(9)

Using (8), the individual weight-increment terms could be evaluated by the following equation

(10)

where each term is the inner-product between the input-vector and the error-vector, given by

(11)

Here also we have dropped the iteration subscript for convenience of further discussion. As shown in (7) and (11), the input-vector is the same for the pair of inner-products of (7) and (11). This is a major advantage for optimizing the LUTs when the inner-products of (7) and (11) are performed using the DA principle.
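One way to see why the same input-vectors can serve both computations is sketched below. Under the indexing assumed here (entry (i, m) of a square sub-matrix depends only on i + m), each sub-matrix is symmetric, so its transpose applied to the error block uses exactly the rows that the filtering step uses. The function names are hypothetical, and this is an illustration under that assumed indexing rather than a restatement of (8)-(11).

```python
def sample(x, t):
    """x[t], with samples outside the recorded range treated as zero."""
    return x[t] if 0 <= t < len(x) else 0

def sub_matrix(x, k, j, L):
    """j-th L x L sub-matrix of iteration k; entry (i, m) = x[(k - j)*L - i - m] (assumed)."""
    return [[sample(x, (k - j) * L - i - m) for m in range(L)] for i in range(L)]

def weight_increment(x, e, k, L, M):
    """Weight-increment terms from M short products, reusing the rows of each sub-matrix."""
    dw = []
    for j in range(M):
        S = sub_matrix(x, k, j, L)
        # Entries depend only on i + m, so S equals its transpose.
        assert all(S[i][m] == S[m][i] for i in range(L) for m in range(L))
        # Row m of S (an input-vector) dotted with the error block.
        dw.extend(sum(S[m][i] * e[i] for i in range(L)) for m in range(L))
    return dw

def direct_increment(x, e, k, L, N):
    """Reference: transpose of the full input matrix times the error block."""
    return [sum(sample(x, k * L - i - n) * e[i] for i in range(L)) for n in range(N)]

x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
e = [2, -3]                                  # an arbitrary error block for L = 2
dw_block = weight_increment(x, e, k=5, L=2, M=3)
dw_direct = direct_increment(x, e, k=5, L=2, N=6)
print(dw_block == dw_direct)
```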

C. DA-Formulation

Let the components of the L-point weight-vector and error-vector be fixed-point numbers in 2's complement representation:

(12a)

(12b)

where the binary coefficients in (12a) and (12b) are the bits of the corresponding weight and error components, respectively. Substituting (12a) in (7), we have

(13)

Rearranging the order of summation, (13) may otherwise be expressed as:

(14)

where the weight of each bit position is positive for all but the sign bit, and negative for the sign bit. Each term in the inner sum in (14) represents the inner-product of the input-vector with a bit-vector (or bit-slice) of the weight-vector. Corresponding to 2^L possible values of a bit-vector of length L, there could be 2^L possible values of such inner-products of the input-vector with any possible bit-vector of length L. All those possible inner-products could be pre-computed and stored in an LUT, such that when a bit-vector (or bit-slice) of the weight-vector is fed to the LUT as address, its inner-product with the input-vector is read from the LUT. The computation of the inner sum of (14), therefore, could be expressed in the form of a memory-read operation as:

(15)

where the bit-vector of the weight-vector is used as the LUT-address of the memory-read operation. The inner-product of (11) may, similarly, be expressed in the form of a memory-read operation as

(16)

where the bit-vector of the error-vector is used as address of an LUT to read its inner-products with the input-vector. LUT contents for the computation of the inner-products of (7) and (11) are exactly the same, since the LUT content depends on the input-vector, and is generated for all possible bit-slices of L-bit length, irrespective of whether that is of the weight-vector or the error-vector. When the bit-vector of the weight-vector is used as address, the partial results of the inner-product of (7) are read from the LUT, and when the bit-vector of the error-vector is used as address, the partial results of the inner-product of (11) are read from the same LUT. Therefore, by using the proposed scheme, a common set of LUTs could be used for the computation of filter outputs and weight-increment terms. Since the block of input samples changes after every iteration, the LUTs are required to be updated in every iteration to accommodate the new input-block. In the next Section, we present a novel LUT-updating scheme for the DA-based BLMS ADFs.
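The LUT-sharing idea can be illustrated with a small integer model. The sketch below builds one LUT from an input-vector and then reads it with the bit-slices of two different vectors (standing in for a short weight-vector and an error block). It assumes integer-valued B-bit 2's complement words rather than any particular fixed-point scaling, and the names `build_lut`, `bit_slice` and `da_inner_product` are hypothetical.

```python
def build_lut(s):
    """LUT of 2**L words: word a is the sum of s[m] over the set bits m of address a."""
    L = len(s)
    return [sum(s[m] for m in range(L) if (a >> m) & 1) for a in range(1 << L)]

def bit_slice(v, l, B):
    """LUT address formed from bit l of every component of v (B-bit 2's complement)."""
    mask = (1 << B) - 1
    addr = 0
    for m, comp in enumerate(v):
        addr |= (((comp & mask) >> l) & 1) << m
    return addr

def da_inner_product(lut, v, B):
    """Inner-product of the LUT's input-vector with v by B LUT reads (LSB to MSB)."""
    acc = 0
    for l in range(B):
        term = lut[bit_slice(v, l, B)] * (1 << l)
        # The last slice holds the sign bits, so its contribution is subtracted.
        acc = acc - term if l == B - 1 else acc + term
    return acc

s = [3, -1, 4, 2]            # input-vector (L = 4), so the LUT has 16 words
c = [5, -7, 0, -8]           # stands in for a short weight-vector, 4-bit words
e = [-2, 1, 7, -4]           # stands in for an error block, 4-bit words
lut = build_lut(s)
out_filter = da_inner_product(lut, c, B=4)
out_increment = da_inner_product(lut, e, B=4)
print(out_filter, out_increment)
```

The same `lut` answers both queries, which is the property the formulation above relies on. A hardware accumulator would normally shift the running sum instead of scaling each read value, which is arithmetically equivalent up to a fixed scale.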


Fig. 1. (a) Inner-products of FIR filter of length N = 6, and block-size L = 2. The input-vector corresponding to each inner-product is shown inside the box. (b) LUT arrangement for DA-based computation of the FIR filter of N = 6 and L = 2. Each LUT here stores the 4 possible values of the partial inner-product of an input-vector and a bit-vector of length 2.

III. LUT-UPDATING SCHEME

Before we discuss the proposed LUT-updating scheme, we summarize here the proposed decomposition of input-matrix and weight-vector into small vectors, and their participation in the inner-product computation for filtering operation. The input-matrix of size (L × N) is decomposed into M square matrices of size (L × L), and the weight-vector is decomposed into M short-vectors of size L, where N = ML. Each of the L rows of a square matrix represents an input-vector, so that L such input-vectors are derived from each of the M square matrices. All these input-vectors are arranged in L rows and M columns such that the input-vectors of the same square matrix belong to the same column. According to (5), M weight-vectors are multiplied independently with M matrices which, in total, involves LM inner-products. According to (6), results of the M inner-products corresponding to each row of input-vectors are added together for obtaining a filter output. From L such rows of inner-products, L filter outputs are obtained.

We have illustrated here the aforementioned scheme for the implementation of an FIR filter of length N = 6 and block-size L = 2. Suppose, during the k-th iteration the filter receives an input-block of two samples and computes a block of two outputs. As discussed above, the input-matrix of size 2 × 6 is decomposed into 3 square-matrices of size 2 × 2, each of which consists of a pair of input-vectors. The 6-point weight-vector is decomposed into three 2-point weight-vectors. Fig. 1(a) shows the arrangement of input-vectors and weight-vectors; and the corresponding inner-products are shown on the top of the rectangular boxes for clarity. Results of odd-numbered inner-products (on upper row) and even-numbered inner-products (on lower row) are added separately (not shown in the figure) to obtain the two filter outputs of the block, respectively.

Fig. 2. DA-based computation of the block FIR filter for N = 6 and L = 2. (a) for (k+1)-th iteration, (b) for (k+2)-th iteration.

As shown in Fig. 1(a), the same weight-vector is used for the computation of the inner-products of a particular column of input-vectors. For DA realization, the LUT corresponding to each input-vector stores the 4 partial inner-products generated by the inner-product of the corresponding input-vector with all possible values of a bit-vector of length 2. DA-based parallel computation of filter outputs of Fig. 1(a) for the k-th iteration is shown in Fig. 1(b). As shown in Fig. 2(a), the DA-based structure receives a new input-block during the (k+1)-th iteration, so that two new samples enter into the set of 7 samples, and the two oldest samples are discarded. Consequently, samples of all the 6 input-vectors are changed. But this occurs in a particular order. We can find from Fig. 1(b) and Fig. 2(a) that the contents of only the first column of LUTs of Fig. 2(a) are changed by the new samples, while in the other columns the LUT values remain the same. But the positions of those unchanged LUTs are shifted right by one column. For instance, values stored in the LUTs of the second column of Fig. 2(a) are the same as values stored in the LUTs of the first column of Fig. 1(b), and similarly values stored in the LUTs of the third column of Fig. 2(a) are the same as those of the LUTs of the second column of Fig. 1(b). This feature can be observed in the LUT contents of Fig. 2(b) for the (k+2)-th iteration also. In other words, contents of a particular column of LUTs during a particular iteration are simply transferred to the adjacent column of LUTs on its right during the next iteration. In this way, the oldest input samples of a particular set are shifted out through the M-th column (M = 3 in the example) of LUTs, and new values are entered at the first column of LUTs.

Shifting of values physically from one LUT to the next across the array of LUTs is highly time-consuming and power-consuming. Therefore, we have proposed a novel LUT updating scheme, where the LUT content need not be shifted. Since each column of LUTs uses the same weight-vector as LUT-address, the column-wise right-shift of LUT values can be achieved by a left-shift of the weight-vectors. This technique could save a lot of time and power, since the shifting of weight-vectors is significantly less expensive than the shifting of LUT contents. In the proposed LUT update scheme, contents of only one column of LUTs out of 3 such columns (for M = 3) need to be updated in every iteration. We can find from Fig. 1(b) and Fig. 2(a) that the values of the third-column LUTs of the k-th iteration are not used during the (k+1)-th iteration, since they correspond to the oldest block of samples. The LUTs of the third column are updated as shown in Fig. 3(a) in grey color. To feed weight-vectors to LUTs of Fig. 3(a) in the same order as that of Fig. 2(a), weight-vectors of Fig. 1(b) are simply left-shifted by one location. As shown in Fig. 3(a), the second column of LUTs contains the values corresponding to the oldest block of samples in the (k+1)-th iteration; this input-block is discarded and the corresponding LUTs are updated by the partial inner-products of the new input-block. Weight-vectors of Fig. 3(a) are left-shifted by one column, and fed to LUTs of Fig. 3(b) as addresses.

Fig. 3. (a) Equivalent DA-based structure of Fig. 2(a), which is derived from the structure of Fig. 1(b) by changing the content of the 5th and 6th LUT (shown in grey color) and left-shifting the weight-vectors by one position. (b) Equivalent DA-based structure of Fig. 2(b) derived from the structure of Fig. 3(a) by changing the content of the 3rd and 4th LUT (shown in grey color) and left-shifting the weight-vectors by one position.

In the following, we summarize the proposed scheme for updating LUTs of BLMS-based adaptive filter:

• LUTs are updated column-by-column in every iteration in cyclic order.

• The LUTs which store the values of partial inner-products corresponding to samples of the oldest input block are overwritten by those of the new input block.

• The M weight-vectors are circularly left-shifted after every iteration to change the columns of LUT to be read circularly.

• The values required for updating a column of LUTs for any particular iteration are calculated from the L samples of the current input-block and the (L − 1) most recent past samples of the previous block.

Based on the above scheme, the LUT-matrix is updated column-by-column from right to left after every iteration. The updating process starts from the M-th column of LUTs and goes to the first column in a cyclic manner, and then again from the first column it goes to the M-th column and then to the (M − 1)-th column and so on. Hence, LUTs of one particular column are updated once in a period of M iterations.

Fig. 4. Proposed DA-based structure for implementation of BLMS adaptive FIR filters (for N = 16 and L = 4).
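The cyclic update can be mimicked in software to check that overwriting one column per iteration, combined with a circular shift of the short weight-vectors, reproduces an ordinary FIR filter. In the sketch below each "column" simply stores the L input-vectors from which its LUT contents would be generated, the weights are held fixed to isolate the update scheme, and time indices start at 1 with earlier samples taken as zero. These choices and all function names are assumptions of the sketch.

```python
import random

def sample(x, t):
    """x(t) with 1-based time; samples before the first one are zero."""
    return x[t - 1] if 1 <= t <= len(x) else 0

def new_column(x, k, L):
    """Input-vectors of the newest column: L current samples and L - 1 past samples."""
    return [[sample(x, k * L - i - m) for m in range(L)] for i in range(L)]

def run_cyclic_update(x, w, L, iterations):
    M = len(w) // L
    segments = [w[j * L:(j + 1) * L] for j in range(M)]        # short weight-vectors
    columns = [[[0] * L for _ in range(L)] for _ in range(M)]  # stand-in for LUT columns
    outputs = []
    for k in range(1, iterations + 1):
        p = (M - k) % M                 # column overwritten now: M-th, (M-1)-th, ..., cyclic
        columns[p] = new_column(x, k, L)
        # Circularly shifted weights: column p gets segment 0, column p+1 gets segment 1, ...
        fed = [segments[(q - p) % M] for q in range(M)]
        y = [sum(columns[q][i][m] * fed[q][m] for q in range(M) for m in range(L))
             for i in range(L)]
        outputs.append(y)
    return outputs

def direct_fir(x, w, L, iterations):
    """Reference block outputs computed by the plain convolution sum."""
    return [[sum(w[n] * sample(x, k * L - i - n) for n in range(len(w))) for i in range(L)]
            for k in range(1, iterations + 1)]

rng = random.Random(1)
x = [rng.randint(-8, 7) for _ in range(16)]
w = [3, -2, 5, 1, -4, 2]                # N = 6, L = 2, so M = 3 columns
cyclic = run_cyclic_update(x, w, L=2, iterations=8)
direct = direct_fir(x, w, L=2, iterations=8)
print(cyclic == direct)
```

Only one column is rewritten per iteration, and no stored column is ever moved; the re-indexing of `fed` plays the role of the circular left-shift of the weight-vectors.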

IV. PROPOSED ARCHITECTURE

Proposed DA-BLMS structure is comprised of oneDA-module, one error bit-slice generator (EBSG) and oneweight-update cum bit-slice generator (WBSG). WBSG up-dates the filter weights and generates the required bit-vectorsin accordance with the DA-formulation. EBSG computes theerror block according to (3) and generates its bit-vectors. TheDA-module updates the LUTs and makes use of the bit-vectorsgenerated by WBSG and EBSG to compute the filter outputand weight-increment terms according to (15) and (16).

A. Structure for Block-Size L = 4

The proposed structure for DA-based BLMS adaptive filter for N = 16 and L = 4 is shown in Fig. 4. The DA-module receives a block of input samples in every iteration, and computes a block of filter output. It also receives a block of errors in every iteration, and computes the weight-increment term for all the components of the weight-vector.

The structure of the proposed DA-module is shown in Fig. 5. It consists of 4 identical processing elements (PEs) for M = 4, one LUT-update block and one MUX-array. The structure of the PE is shown in Fig. 6. It consists of 4 identical subcells (SCs) for L = 4. The internal structure and function of the SC is shown in Fig. 7. As required by (15), the LUT of each SC stores the 16 possible values corresponding to its four input samples.

Fig. 5. Structure of DA-module of the proposed DA-BLMS ADF (for N = 16 and L = 4).

Fig. 6. Internal structure of the processing element (PE) of the DA-module for block-size L = 4.

Fig. 7. Internal structure and function of the subcell (SC) of a PE. Convergence factor is assumed to be a power of 2.

The LUT-update block of the DA-module generates the required values to update the LUTs of a particular PE. The structure of the LUT-update block is shown in Fig. 8. It consists of one adder-block and an input delay unit (IDU), which stores (L − 1) samples of the previous block. During each iteration, the adder block receives (2L − 1) samples (L samples from the current input block and (L − 1) past samples from the IDU), and feeds these samples to L adder-cells (ACs) (see Fig. 8) such that each AC receives L samples, and input blocks of adjacent ACs are overlapped by (L − 1) samples, so that the sample windows of successive ACs are offset by one sample. For block-size L = 4, each AC receives a block of four samples in every iteration (shown in Fig. 9). As shown in the figure, each of the four inputs of the AC is ANDed with a bit of the four-bit address by four AND cells. Each AND cell consists of as many AND gates as the word-length of the input samples. All the AND gates of a cell are fed with the same bit of the address, while the other input of each AND gate is fed with a bit of the input sample. The outputs of the AND cells are fed to an adder-tree (AT). The AC receives the 16 possible values of the address in 16 clock cycles, and calculates the 16 values to be stored in the LUT, where each address value is used as the address of the LUT location at which the corresponding result is stored. All the ACs of the adder block (see Fig. 8) work in parallel, and generate all the required values to update the LUTs of the SCs of a PE. According to the proposed LUT-update scheme, LUTs of one PE out of M PEs are updated in every iteration. LUTs of all the PEs are updated once in M cycles in a cyclic order. Each PE uses a separate control signal to enable the specific column of LUTs to be updated. The LUT-update operation of the proposed structure is completed during the first 16 clock cycles of every iteration.

Fig. 8. Internal structure of LUT-update block for block-size L = 4.

Fig. 9. Internal structure of the AC of the LUT-update block for block-size L = 4.

Fig. 10. Internal structure of MUX-array for N = 16 and L = 4.

Each PE receives the LUT-update addresses, the weight bit-vectors, and the error bit-vectors through the MUX array (shown in Fig. 10) for updating the LUTs, computation of filter outputs, or computation of weight-increment terms, respectively. After completion of the LUT-update, filtering computation follows immediately by a series of LUT-read operations using the bit-slices of the corresponding weight-vector in LSB to MSB order as successive addresses according to (15). During each cycle of filtering, the WBSG generates M parallel bit-vectors of width L bits each for the M PEs to perform the filtering operation. Each SC receives a sequence of bit-vectors from the WBSG, one for each bit of the wordlength of the filter-coefficients, in as many clock cycles. The LUT-read values are shift-accumulated in an accumulator (ACC) to obtain a partial filter output. During the last of these cycles, the LUT output is subtracted from the accumulated result, since the bit-vector during this cycle contains the sign-bits of the weight-vector. Each SC uses control signal CTR1 to control the add/subtract operation in the ACC. At the end of the last cycle, ACC contents are sent to the DMUX as input, and the ACC register (not shown in Fig. 7) is cleared to be used for the computation of the weight-increment term from the next cycle (CTR1 is used for clearing the register). Finally, the DMUX sends the computed partial results of inner-products to the output line using the select signal CTR6. From the L SCs of each PE, L partial results are obtained in parallel; the corresponding outputs of each SC from the M PEs are added by an AT (Fig. 5) to obtain one component of the current block of filter output. A block of L parallel filter outputs is obtained from the L ATs of the DA-module in each cycle.

EBSG receives one block of filter output from the DA-module, and calculates a block of error in every iteration using one block of desired response according to (3). As shown in Fig. 11, error values are loaded in parallel-in serial-out (PISO) shift-registers (SRs) of the bit-slice generator (BSG) to generate bit-vectors of the error-vector. CTR4 enables the clock for the BSG and CTR2 controls the load-shift operation of each SR.

Fig. 11. Structure of error computation cum bit-slice generator (EBSG) for block-size L = 4.

The bit-vectors of the error-vector are fed serially in LSB to MSB order to the DA-module in successive clock cycles to compute the weight-increment terms. According to (16), the LUT values used to compute a block of filter output are also used to compute the corresponding weight-increment terms. In general, the LUT values of each SC of a PE are used to compute one weight-increment term, so that each PE computes the L weight-increment terms of one short weight-vector. The computation of weight-increment terms is similar to that of the partial filter outputs. But in this case the same bit-vector is used by all the PEs of the DA-module to compute the weight-increment terms. In each SC (see Fig. 7), the ACC contents corresponding to the weight-increment term are sent to the output line of the DMUX. The weight-increment terms are scaled by the convergence factor μ. Here we have assumed μ is a power of 2, so that the scaling is realized by a right-shift operation using a fixed-shifter (see Fig. 7).

According to (1), the WBSG of the proposed DA-BLMS structure requires only the weight-increment terms of the current iteration to update the weight-vector for the next iteration. It does not require the LUT values of the current iteration. Therefore, once the weight-increment terms of the current iteration are computed, the LUT-updating operation for the next iteration can be started immediately in the next clock cycle. As we discussed earlier, the filter computation follows the LUT-update operation, and the first 16 clock cycles of every iteration are used to complete the LUT-update operation. During this period, the weight-update operation of WBSG also can be performed concurrently. A bit-parallel (word-serial) structure of WBSG requires one clock-cycle to complete the weight-update operation, while a bit-serial structure of WBSG requires as many clock-cycles as the wordlength of the filter-coefficients to complete the same task. If the wordlength of the filter-coefficients is less than or equal to the LUT-size, then bit-serial realization of WBSG does not increase the iteration period of the DA-BLMS structure, but it certainly helps to reduce the hardware complexity of the DA-BLMS structure. In that case, we can have a bit-serial structure for the WBSG. The bit-serial structure of WBSG receives the weight-increment terms from the DA-module in bit-serial LSB to MSB order, and updates the weight-vector accordingly. For bit-serial realization of WBSG, weight-increment terms computed by each PE of the DA-module are finally loaded into a separate BSG (see Fig. 5) to generate the weight-increment terms in bit-serial order. All the BSGs of the DA-module use common control signals CTR6 and CTR5 to perform the loading and shifting operations, respectively.

WBSG is an important block of the proposed structure. It performs three operations: (i) updates filter weights using the weight-increment values calculated by the DA-module, (ii) generates bit-vectors for the DA-module to compute the next block of filter output, (iii) gives one left-shift (circularly) to the weight-vectors as necessitated by the proposed LUT-update scheme. We have shown LUT updating of the DA-BLMS ADF for N = 16 and L = 4 in Table I for the first 5 iterations using the proposed LUT-updating scheme. As shown in Table I, for N = 16 and L = 4, the LUT-matrix has 4 columns (for M = 4). LUTs of all these 4 columns are updated once in a period of 4 iterations. At any given iteration, the LUT-matrix contains the values corresponding to the recent past (N + L − 1) = 19 input samples to compute a block of 4 filter outputs. As shown in Table I, during the 5-th iteration, the LUT-matrix contains the values corresponding to a set of 19 input samples, which are exactly those required to compute the 4 filter outputs of that iteration. Similarly, the LUT-matrix contains the values corresponding to the next set of 19 samples during the 6-th iteration, and these samples are exactly those required to compute the filter outputs of the 6-th iteration.

TABLE I. LUT UPDATING SCHEME FOR N = 16 AND L = 4 (L: BLOCK SIZE, N: FILTER ORDER)

The bit-serial structure of WBSG is shown in Fig. 12. It consists of N serial-in serial-out (SISO) SRs and N carry-save full-adders (CSFAs) corresponding to the N filter weights. SRs are arranged in (L × M) matrix form; and filter-weights are stored in the SR matrix column-wise, such that each short weight-vector is stored in one column of SRs. As shown in Table I, the bit-slices of the weight-vector received by the PE whose LUTs are updated during a given iteration are generated from the first column of filter weights. The weight-vectors, therefore, need to be aligned with the corresponding PEs. If, during a given iteration, the LUT of PE-1 is to be updated, then the first column of the SR-matrix is required to contain the components of the first short weight-vector, and the remaining columns of SRs should contain the components of the remaining short weight-vectors in order. As shown in Fig. 12, weight-increment values of each column of filter coefficients (available in the corresponding column of the SR-matrix) are obtained from the corresponding PE, and these values are added with the corresponding filter-weights bit-serially using a carry-save full-adder (CSFA). The results of the CSFAs of a column constitute a bit-vector of the corresponding updated weight-vector. SR contents are shifted left over successive clock cycles to generate the shifted weight-vectors in accordance with the proposed LUT-update scheme. The shifting operation of the SRs starts at a fixed clock cycle of every iteration and continues for a fixed number of clock cycles. The control signal CTR5 is used in WBSG to enable the shifting operation. The D flip-flop of each CSFA is cleared during the first clock cycle of every iteration to flush out the final carry of the previous iteration of weight-update operation.
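The bit-serial weight update described above can be pictured with a small behavioural model of one full-adder and its carry flip-flop. The sketch below adds a weight and its (already scaled) increment one bit per clock cycle, LSB first, with the carry cleared at the start of the iteration. The word length of 8 bits and the function names are assumptions chosen for illustration, and the model ignores the shift-register matrix and control signals.

```python
def to_bits(value, B):
    """B-bit 2's complement representation of value, LSB first."""
    mask = (1 << B) - 1
    return [((value & mask) >> l) & 1 for l in range(B)]

def from_bits(bits):
    """Signed integer encoded by an LSB-first 2's complement bit list."""
    B = len(bits)
    unsigned = sum(b << l for l, b in enumerate(bits))
    return unsigned - (1 << B) if bits[-1] else unsigned

def bit_serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams with a single full-adder and a carry flip-flop."""
    carry = 0                      # flip-flop cleared at the start of the iteration
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                  # sum bit of the full-adder
        carry = (a & b) | (carry & (a ^ b))        # carry stored for the next cycle
    return out                     # the final carry stays in the flip-flop and is dropped

B = 8
weight, increment = 23, -9
updated = from_bits(bit_serial_add(to_bits(weight, B), to_bits(increment, B)))
wrapped = from_bits(bit_serial_add(to_bits(100, B), to_bits(100, B)))
print(updated, wrapped)
```

Because the final carry is dropped, results wrap modulo 2^B, which is why the second example does not give 200 in 8 bits.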

B. Structure for Higher Block-Size

To derive a DA-based BLMS structure for higher block sizes using LUTs of 16 words, we can take the block-size to be a multiple of 4, i.e. , where is an integer. The structures of the EBSG and WBSG of the DA-BLMS filter for (for ) are the same as those of block-size shown in Fig. 11 and Fig. 12, respectively. However, the AC of the LUT-update block and the SC of each PE of the DA-module need to be modified according to the value of . Each SC, in this case, is comprised of LUTs of size 16 words each. The bit-vectors of weight-vectors and error-vectors of bits each are split into segments of 4-bit size, and fed to LUTs of each SC to read the LUTs in parallel. The values read from the LUTs are added using an AT and subsequently shift-accumulated in the ACC to obtain a partial output. To generate the weight update-values for LUTs, each AC of the LUT-update block in this case is comprised of AND-TA blocks of size 4 (as shown in Fig. 9). For block-size , each SC involves RAM words and adders along with one ACC and 2 DMUX. Similarly, the LUT-update block involves AND-gates and adders.

Fig. 12. Bit-serial structure of weight-update cum bit-slice generator (WBSG) for , and .
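The segmented LUT read described above can be sketched in Python as a generic distributed-arithmetic inner product. Each group of 4 input samples gets a 16-word LUT holding all partial sums of those samples; in every bit cycle, the 4-bit slices of the weights address the LUTs in parallel, the reads are summed (the role of the AT) and the sum is shift-accumulated (the role of the ACC). This is a minimal sketch under stated assumptions, not the SC of the proposed structure: the function names, the LSB-first bit order and the two's-complement weight format of `width` bits are chosen for the example.

```python
def build_lut(samples):
    """16-word LUT: word `addr` is the sum of the samples whose address bit is 1."""
    assert len(samples) == 4
    return [
        sum(s for k, s in enumerate(samples) if (addr >> k) & 1)
        for addr in range(16)
    ]


def da_inner_product(weights, samples, width):
    """Inner product of `weights` and `samples` using 16-word LUTs.

    Assumption: len(weights) is a multiple of 4 and every weight fits in
    `width` bits of two's complement.
    """
    assert len(weights) == len(samples) and len(weights) % 4 == 0
    luts = [build_lut(samples[i:i + 4]) for i in range(0, len(samples), 4)]
    acc = 0
    for b in range(width):
        partial = 0  # sum of all segment reads for this bit position
        for seg, lut in enumerate(luts):
            addr = 0
            for k in range(4):
                addr |= ((weights[4 * seg + k] >> b) & 1) << k
            partial += lut[addr]
        if b == width - 1:
            acc -= partial << b  # sign bit carries negative weight
        else:
            acc += partial << b
    return acc
```

With 8 weights, two LUTs are read in each of the `width` bit cycles, so the number of LUT reads grows with the number of 4-bit segments while the LUT depth stays at 16 words.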

V. HARDWARE-TIME COMPLEXITY AND PERFORMANCE COMPARISON

A. Hardware Complexity

The proposed structure is comprised of one DA-module, one WBSG, one EBSG and a control unit. The DA-module consists of one LUT-update block, PEs, adder-trees of words each, one MUX-array and BSG, where and the block-size . The LUT-update block consists of one IDU and ACs, where the IDU is comprised of registers of size , and each AC is comprised of AND-gates and adders. The LUT-update block, therefore, involves registers, adders and AND-gates. Each PE consists of SCs, where each SC is comprised of LUTs of 16 words each, adders, one ACC, one 1-to-2 line DMUX and number of 2-input XOR-gates (used by the ACC (not shown in Fig. 7) to compute the 1’s complement of the LUT-outputs when the bit-vector contains sign-bits), where the ACC involves one adder, one register and one 2-to-1 line DMUX ( ). Each PE, therefore, involves memory words, adders, registers, DMUXes (2-to-1 line) and XOR-gates. Each BSG is comprised of SRs (bit-level) of size . The MUX-array involves 2-to-1 line MUXes. The DA-module, therefore, involves memory words, adders, D-type flip-flops (FFs), 2-to-1-line MUXes/DMUXes (word-level), AND-gates and XOR-gates. The WBSG involves D-type FFs and FAs. The EBSG involves D-type FFs and adders. The proposed structure, therefore, requires memory words, adders, FAs, D-type FFs, MUXes/DMUXes (word-level), AND-gates and XOR-gates.

B. Time Complexity

The proposed structure performs four operations sequentially in every iteration: (i) LUT update, (ii) filter output computation, (iii) error calculation and (iv) computation of the weight-increment term. It involves 16 clock cycles to complete the LUT-update operation. It takes clock cycles to calculate partial results of a block of filter output. It calculates a block of filter output from the partial results and then a block of error in one clock cycle. Finally, it takes clock cycles to compute the weight-increment term for the weight vector. In every iteration, the proposed structure processes one block of samples, where one iteration involves clock cycles and the duration of one clock cycle is , where is the delay of one -bit adder. For comparison purposes, we have also estimated the number of clock cycles required by the structures of [10] and [16] for one iteration. We assumed that the read and write operations are performed in two separate clock cycles in a LUT to maintain uniformity in the comparison. The structure of [10] requires 16 clock cycles to update the DA-A-LUT of size 16 words, clock cycles to compute one filter output and 32 clock cycles to update the DA-F-LUT of size 16 words. It involves 48 clock cycles for one iteration and computes one output per iteration, where the duration of one clock cycle is and . Since the structure of [16] does not involve a DA-F-LUT, it requires 16 clock cycles for updating the DA-A-LUT and clock cycles to compute one filter output. The structure of [16], therefore, involves clock cycles for one iteration, where the duration of the clock period is the same as that of [10].

TABLE II
GENERAL COMPARISON OF HARDWARE COMPLEXITY OF THE PROPOSED STRUCTURE (FOR BLOCK-SIZE ) AND THE STRUCTURES OF [10] AND [16] (WITH DECOMPOSITION FACTOR 4) AND THE DA-BLMS STRUCTURE OF [18]

LEGEND: ADD: adder, MULT: multiplier, FF: flip-flop, VSH: variable shifter, TR: throughput rate, LAPO: LUT access per output. In addition to the above list of components, the proposed structure involves FAs, 2-input AND-gates and 2-input XOR-gates, where : word length of the sequence , and , : word-length of input sequence, . For the proposed structure, , and in case of [10] and [16], , and in case of [18], , , and block-size , where and are relatively prime to each other.
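Setting aside the LUT update, which is specific to the DA implementation, the arithmetic carried out in one iteration (filter output computation, error calculation and weight-increment computation) can be sketched with the textbook block LMS recursion of [9]. The Python code below is a plain multiply-accumulate reference model, not the DA-based datapath; the function name `blms_iteration`, the argument layout and the step size `mu` are assumptions made for the example.

```python
def blms_iteration(w, x_hist, d_block, mu):
    """One block LMS iteration for filter length N and block-size L.

    w        : current weight vector, length N
    x_hist   : the N + L - 1 most recent input samples, newest last
    d_block  : block of L desired-response samples
    mu       : step size (assumed scalar convergence factor)
    Returns (y, e, w_next).
    """
    n_taps, block = len(w), len(d_block)
    assert len(x_hist) == n_taps + block - 1
    # Row j holds the N samples that produce the j-th output of the block,
    # newest sample first.
    rows = [
        [x_hist[n_taps - 1 + j - i] for i in range(n_taps)]
        for j in range(block)
    ]
    # Filter output computation (convolution).
    y = [sum(wi * xi for wi, xi in zip(w, row)) for row in rows]
    # Error calculation.
    e = [dj - yj for dj, yj in zip(d_block, y)]
    # Weight-increment term (correlation of the error block with the input).
    incr = [
        mu * sum(e[j] * rows[j][i] for j in range(block))
        for i in range(n_taps)
    ]
    w_next = [wi + di for wi, di in zip(w, incr)]
    return y, e, w_next
```

For N = 16 and L = 4, `x_hist` holds 19 samples, which matches the 19 input samples needed per block in the example of Table I.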

C. Number of LUT Access

During every iteration, the proposed structure computes filter outputs, and performs write operations for updating the LUTs, LUT read operations for filter output computation and LUT read operations for the computation of weight-increment terms. The number of LUT accesses per output (LAPO) of the structure is, therefore, . Similarly, the LAPO of [10] and [16] are found to be and , respectively, where is the bit-width of the input samples and is the bit-width of all the intermediate and output samples. Note that the LUTs of a DA-based ADF are required to be implemented by RAM, and the total energy consumption of the structure, therefore, increases significantly with LAPO.

D. Performance Comparison

Hardware and time complexities of the proposed structure, the DA-LMS structures of [10] and [16], and the DA-BLMS structure of [18] are listed in Table II for comparison. The structure of [16] is the most efficient one amongst the existing DA-LMS structures. Compared with [16], the proposed structure requires times more LUT words, nearly times more adders, 4/3 times more FFs and offers nearly times higher throughput rate. It involves 16 more LAPO for block-size 4 and less LAPO for block-size 8 than those of [16] for 16-bit internal bit precision. Interestingly, the number of adders of the proposed structure does not increase proportionately with block-size, and the number of flip-flops is independent of block-size. Besides, it does not require variable shifters, unlike those of [10] and [16].

We have estimated the hardware and time complexity of the proposed structure for and 8, and that of [10] and [16] for filter size ( , 32 and 64) using the complexity counts of Table II. The estimated values are listed in Table III for comparison. Compared with the structure of [16], the proposed structure for involves 8 times more LUT words and 3.27 times more adders on average for different filter orders, and offers 5.22 times higher throughput. However, it involves 37.5%, 24.4% and 17.8% more flip-flops and 25%, 37.5% and 47.6% less LAPO than those of [16] for filter orders 16, 32 and 64, respectively.

E. Simulation Result

To validate the proposed design, we have coded it in VHDL for filter orders 16, 32 and 64 with block-sizes 4 and 8. We have also coded the designs of [10] and [16] for the same filter orders. We have considered and , and synthesized the designs by Synopsys Design Compiler using the TSMC 90 nm CMOS library. Synthesis reports obtained from the Design Compiler are listed in Table IV.

Synthesis results are in accordance with the theoretical estimation given in Table III. The minimum clock periods of the proposed structure and the structure of [16] are slightly higher than that of [10] due to the extra MUX/DMUX in the critical path. As shown in Table IV, the structure of [16] is the most efficient amongst the existing structures. Compared with [16], the proposed structure for block size and 8 involves, respectively, 2.13 and 3.69 times more area on average for different filter orders and offers nearly 2.61 and 5.22 times higher throughput rate, respectively.


TABLE III
HARDWARE AND TIME-COMPLEXITY OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16] FOR DIFFERENT SIZE FILTERS

TABLE IV
COMPARISON OF AREA, DELAY, AND POWER COMPLEXITIES OBTAINED FROM SYNTHESIS RESULT OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16]

We have estimated the area-delay product (ADP), PPO and energy per output (EPO) at 20 MHz clock. As shown in Table IV, for block-size 4, the proposed structure has 17.47%, 18.49% and 13.66% less ADP than that of [16] for filter orders 16, 32 and 64, respectively. For block-size 8, it has 31.6% less ADP than [16] on average for different filter orders. For block-size 4, it consumes 27.5%, 28.8% and 24.6% less EPO than that of [16] for filter orders 16, 32 and 64, respectively. Similarly, for block-size 8, it consumes 40%, 39.8% and 37.4% less EPO than [16] for filter orders 16, 32 and 64, respectively. One can extrapolate these results to obtain approximate values of ADP, PPO and EPO, and hence an approximate estimate of the advantages of the proposed structure, for filter orders greater than 64.
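The percentage savings quoted above follow from simple ratios of per-output figures. The Python sketch below shows one way such figures might be derived from synthesis numbers. The definitions are assumptions made for the example, since the exact expressions used for Table IV are not reproduced here: the time per output is taken as the number of clock cycles per iteration divided by the clock frequency and the block-size, ADP as area times that time, and EPO as power times that time. All names are hypothetical.

```python
def per_output_metrics(area, power, clock_hz, cycles_per_iteration,
                       outputs_per_iteration):
    """Return ADP and EPO under the assumed definitions.

    time per output = cycles_per_iteration / (clock_hz * outputs_per_iteration)
    ADP             = area  * time per output   (assumed definition)
    EPO             = power * time per output   (energy = power x time)
    """
    time_per_output = cycles_per_iteration / (clock_hz * outputs_per_iteration)
    return {"ADP": area * time_per_output, "EPO": power * time_per_output}


def percent_saving(reference, proposed):
    """Saving of `proposed` relative to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference
```

Because a block structure produces several outputs per iteration, its time per output can fall even when area and power rise, which is how a larger design can still show lower ADP and EPO.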

VI. CONCLUSION

We have derived a DA formulation of the BLMS algorithm where both convolution and correlation are performed using a common LUT for the computation of filter outputs and weight-increment terms, respectively. This results in a significant saving of LUT words and adders, which constitute the major hardware components in DA-based computing structures. Also, we have suggested a novel LUT updating scheme to update the LUT contents for DA-based BLMS ADF, where only one set of LUTs out of sets needs to be modified in every iteration such that LUT contents are modified once in every iterations, where , is the filter length and is the input block-size. Using the proposed scheme, we have derived a parallel architecture for the implementation of DA-based BLMS ADF. Unlike the existing DA-based LMS structure, the number of adders required by the proposed structure does not increase linearly with . Compared with the best of the existing DA-based LMS designs, the proposed one involves nearly times more adders and times more LUT words, and offers nearly times throughput. It requires nearly 25% more flip-flops irrespective of the block-size, but does not involve variable shifters like others. It involves fewer LUT accesses per output than the existing structure for block-size higher than 4. This is a major advantage of the proposed structure for reducing its ADP and EPO when implemented for large-order ADF and for higher block-sizes. For block-size 8 and filter length 64, the proposed structure involves 2.47 times more adders, 15% more flip-flops and 43% less LAPO than the best of the existing structures, and offers 5.22 times higher throughput. ASIC synthesis results show that the proposed structure for filter order 64 has almost 14% and 30% less ADP and 25% and 37% less EPO than the best of the existing structures for block sizes 4 and 8, respectively.

REFERENCES

[1] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.

[2] R. Haimi-Cohen, H. Herzberg, and Y. Beery, “Delayed adaptive LMS filtering: Current results,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Albuquerque, NM, Apr. 1990, pp. 1273–1276.

[3] M. D. Meyer and D. P. Agrawal, “A modular pipelined implementation of a delayed LMS transversal adaptive filter,” in Proc. IEEE Int. Symp. Circuits Syst., New Orleans, LA, May 1990, pp. 1943–1946.

[4] V. Visvanathan and S. Ramanathan, “A modular systolic architecture for delayed least mean square adaptive filtering,” in Proc. IEEE Int. Conf. VLSI Des., Bangalore, 1995, pp. 332–337.

[5] R. D. Poltmann, “Conversion of the delayed LMS algorithm into the LMS algorithm,” IEEE Signal Process. Lett., vol. 2, p. 223, Dec. 1995.

[6] S. C. Douglas, Q. Zhu, and K. F. Smith, “A pipelined LMS adaptive FIR filter architecture without adaptive delay,” IEEE Trans. Signal Process., vol. 46, pp. 775–779, Mar. 1998.

[7] L. D. Van and W. S. Feng, “Efficient systolic architectures for 1-D and 2-D DLMS adaptive digital filters,” in Proc. IEEE Asia Pacific Conf. Circuits Syst., Tianjin, China, Dec. 2000, pp. 399–402.

[8] L. D. Van and W. S. Feng, “An efficient architecture for the DLMS adaptive filters and its applications,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.

[9] G. A. Clark, S. K. Mitra, and S. R. Parker, “Block implementation of adaptive digital filters,” IEEE Trans. Circuit Syst., vol. 28, pp. 584–592, Jun. 1981.

[10] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “LMS adaptive filters using distributed arithmetic for high throughput,” IEEE Trans. Circuits Syst., vol. 52, no. 7, pp. 1327–1337, Jul. 2005.

[11] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “A novel high performance distributed arithmetic adaptive filter implementation on an FPGA,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2004, vol. 5, p. V-161-4.

[12] S. A. White, “Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Mag., vol. 6, pp. 4–19, Jul. 1989.

[13] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “An FPGA implementation for a high throughput adaptive filter using distributed arithmetic,” in Proc. 12th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2004, pp. 324–325.

[14] W. Huang and D. V. Anderson, “Adaptive filters using modified sliding-block distributed arithmetic with offset binary coding,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2009, pp. 545–548.

[15] B. K. Mohanty and P. K. Meher, “Delayed block LMS algorithm and concurrent architecture for high-speed implementation of adaptive FIR filters,” presented at the IEEE Region 10 TENCON 2008 Conf., Hyderabad, India, Nov. 2008.

[16] R. Guo and L. S. DeBrunner, “Two high-performance adaptive filter implementation schemes using distributed arithmetic,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 9, pp. 600–604, Sep. 2011.

[17] S. Baghel and R. Shaik, “FPGA implementation of fast block LMS adaptive filter using distributed arithmetic for high-throughput,” in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10–12, 2011, pp. 443–447.

[18] S. Baghel and R. Shaik, “Low power and less complex implementation of fast block LMS adaptive filter using distributed arithmetic,” in Proc. IEEE Students Technol. Symp., Jan. 14–16, 2011, pp. 214–219.

[19] R. Jayashri, H. Chitra, H. Kusuma, A. V. Pavitra, and V. Chandrakanth, “Memory based architecture to implement simplified block LMS algorithm on FPGA,” in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10–12, 2011, pp. 179–183.

[20] Q. Shen and A. S. Spanias, “Time and frequency domain X block LMS algorithm for single channel active noise control,” Control Eng. J., vol. 44, no. 6, pp. 281–293, 1996.

[21] D. P. Das, G. Panda, and S. M. Kuo, “New block filtered-X LMS algorithms for active noise control systems,” IET Signal Process., vol. 1, no. 2, pp. 73–81, Jun. 2007.

[22] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.

[23] C. S. Burrus, “Index mappings for multidimensional formulation of the DFT and convolution,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25, pp. 239–242, Jun. 1977.

Basant K. Mohanty (M’06–SM’11) received the M.Sc. degree in physics from Sambalpur University, India, in 1989 and the Ph.D. degree in the field of VLSI for digital signal processing from Berhampur University, Orissa, in 2000.

In 2001, he joined as Lecturer in Electrical and Electronic Engineering Department, BITS Pilani, Rajasthan. Then, he joined as an Assistant Professor in the Department of Electronics and Communication Engineering, Mody Institute of Education Research (Deemed University), Rajasthan. In 2003, he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007. His research interest includes design and implementation of low-power and high performance systems for multimedia applications, multi-core processor design and algorithm for concurrent processing. He has published nearly 40 technical papers.

Dr. Mohanty is a life time member of The Institution of Electronics and Telecommunication Engineering, New Delhi, India. He was the recipient of the Rashtriya Gaurav Award conferred by India International friendship Society, New Delhi, India for 2012.

Pramod Kumar Meher (SM’03) received the M.Sc. degree in physics and the Ph.D. degree in science from Sambalpur University, India, in 1978 and 1996, respectively.

Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore, and Adjunct Professor with the School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interest includes design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed nearly 200 technical papers to various reputed journals and conference proceedings.

Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of IEEE Circuits Systems Society and Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS. Currently, he is serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.
