778 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 61, NO. 3, MARCH 2014

Critical-Path Analysis and Low-Complexity Implementation of the LMS Adaptive Algorithm

    Pramod Kumar Meher, Senior Member, IEEE, and Sang Yoon Park, Member, IEEE

Abstract—This paper presents a precise analysis of the critical path of the least-mean-square (LMS) adaptive filter for deriving its architectures for high-speed and low-complexity implementation. It is shown that the direct-form LMS adaptive filter has nearly the same critical path as its transpose-form counterpart, but provides much faster convergence and lower register complexity. From the critical-path evaluation, it is further shown that no pipelining is required for implementing a direct-form LMS adaptive filter for most practical cases, and that it can be realized with a very small adaptation delay in cases where a very high sampling rate is required. Based on these findings, this paper proposes three structures of the LMS adaptive filter: (i) Design 1 having no adaptation delays, (ii) Design 2 with only one adaptation delay, and (iii) Design 3 with two adaptation delays. Design 1 involves the minimum area and the minimum energy per sample (EPS). The best of the existing direct-form structures requires 80.4% more area and 41.9% more EPS compared to Design 1. Designs 2 and 3 involve slightly more EPS than Design 1 but offer nearly twice and thrice the MUF at a cost of 55.0% and 60.6% more area, respectively.

Index Terms—Adaptive filters, critical-path optimization, least mean square algorithms, LMS adaptive filter.

    I. INTRODUCTION

Adaptive digital filters find wide application in several digital signal processing (DSP) areas, e.g., noise and echo cancellation, system identification, channel estimation, channel equalization, etc. The tapped-delay-line finite-impulse-response (FIR) filter whose weights are updated by the famous Widrow-Hoff least-mean-square (LMS) algorithm [1] may be considered the simplest known adaptive filter. The LMS adaptive filter is popular not only due to its low complexity, but also due to its stability and satisfactory convergence performance [2]. Due to its several important applications of current relevance and increasing constraints on area, time, and power complexity, efficient implementation of the LMS adaptive filter is still quite important.

To implement the LMS algorithm, one has to update the filter weights during each sampling period using the estimated error, which equals the difference between the current filter output and the desired response. The weights of the LMS adaptive filter

Manuscript received February 21, 2013; revised June 25, 2013; accepted July 18, 2013. Date of publication October 21, 2013; date of current version February 21, 2014. This paper was recommended by Associate Editor A. Ashrafi.

P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]; URL: http://www3.ntu.edu.sg/home/aspkmeher).

S. Y. Park is with the Institute for Infocomm Research, Singapore 138632 (e-mail: [email protected]). Corresponding author: S. Y. Park.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2013.2284173

    Fig. 1. Structure of conventional LMS adaptive filter.

during the $n$th iteration are updated according to the following equations:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu \cdot e(n) \cdot \mathbf{x}(n) \quad \text{(1a)}$$

where

$$e(n) = d(n) - y(n) \quad \text{(1b)}$$

$$y(n) = \mathbf{w}^{T}(n) \cdot \mathbf{x}(n) \quad \text{(1c)}$$

with the input vector $\mathbf{x}(n)$ and weight vector $\mathbf{w}(n)$ at the $n$th iteration given, respectively, by

$$\mathbf{x}(n) = [x(n),\, x(n-1),\, \ldots,\, x(n-N+1)]^{T}$$

$$\mathbf{w}(n) = [w_0(n),\, w_1(n),\, \ldots,\, w_{N-1}(n)]^{T}$$

where $d(n)$ is the desired response, $y(n)$ is the filter output of the $n$th iteration, $e(n)$ denotes the error computed during the $n$th iteration, which is used to update the weights, $\mu$ is the convergence factor or step size, which is usually assumed to be a positive number, and $N$ is the number of weights used in the LMS adaptive filter. The structure of a conventional LMS adaptive filter is shown in Fig. 1.

Since all weights are updated concurrently in every cycle to

compute the output according to (1), direct-form realization of the FIR filter is a natural candidate for implementation. However, the direct-form LMS adaptive filter is often believed to have a long critical path due to the inner-product computation required to obtain the filter output. This belief rests mainly on the assumption that an arithmetic operation starts only after the complete input operand words are available/generated. For example, in the existing literature on implementation of LMS adaptive filters, it is assumed that the addition in a multiply-add operation (shown in Fig. 2) can proceed only after completion of the multiplication, so that the critical path of the multiply-add operation becomes $T_M + T_A$, where $T_M$ and $T_A$ are the times required for a multiplication and an addition, respectively. Under such an assumption, the critical path of the direct-form LMS adaptive filter (without pipelining) can be

1549-8328 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
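As a behavioral illustration of the update recursion in (1), the short Python sketch below implements the conventional LMS filter. It is a simulation model, not the paper's hardware structure; the function name, the 4-tap "unknown" system `h_true`, and the numeric settings are our illustrative choices.

```python
import numpy as np

def lms_filter(x, d, N=4, mu=0.05):
    """Conventional LMS adaptive filter per (1).
    x: input signal, d: desired response, N: number of weights, mu: step size.
    Returns the final weight vector and the error sequence."""
    w = np.zeros(N)                  # weight vector w(n)
    xbuf = np.zeros(N)               # tapped delay line [x(n), ..., x(n-N+1)]
    e = np.zeros(len(x))
    for n in range(len(x)):
        xbuf = np.roll(xbuf, 1)      # shift the delay line by one sample
        xbuf[0] = x[n]
        y = w @ xbuf                 # (1c): filter output y(n)
        e[n] = d[n] - y              # (1b): error e(n)
        w = w + mu * e[n] * xbuf     # (1a): concurrent weight update
    return w, e

# System-identification check: noiseless data from a known 4-tap FIR system.
rng = np.random.default_rng(0)
h_true = np.array([0.4, -0.2, 0.1, 0.05])
x = rng.standard_normal(4000)
d = np.convolve(x, h_true)[:len(x)]
w_final, e = lms_filter(x, d)
```

With noiseless data, the weight vector is driven to the system response, which is the fixed point of (1), and the error decays toward zero.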


Fig. 2. Example of a multiply-add operation for the study of delay in composite operations.

estimated as $2T_M + (\lceil\log_2 N\rceil + 2)\,T_A$. Since this critical-path estimate is quite high, it could exceed the sample period required in many practical situations, and calls for a reduction of the critical-path delay by pipelined implementation. But the conventional LMS algorithm does not support pipelined implementation. Therefore, it is modified to a form called the delayed LMS (DLMS) algorithm [3], [4], which allows pipelined implementation of different sections of the adaptive filter. Note that the transpose-form FIR LMS adaptive filter is inherently of a delayed LMS kind, where the adaptation delay varies across the sequence of filter weights. Several works have been reported in the literature over the last twenty years [5]-[11] for efficient implementation of the DLMS algorithm.

Van and Feng [5] have proposed an interesting systolic architecture, where they have used relatively large processing elements (PEs) for achieving lower adaptation delay compared to other DLMS systolic structures, with a critical path of one MAC operation. Yi et al. [10] have proposed a fine-grained pipelined design of an adaptive filter based on direct-form FIR filtering, using a fully pipelined binary adder-tree implementation of all the multiplications in the error-computation path and weight-update path to limit the critical path to a maximum of one addition time. This architecture supports high sampling frequency, but involves large pipeline depth, which has two adverse effects. First, the register complexity, and hence the power dissipation, increases. Secondly, the adaptation delay increases and the convergence performance degrades. However, in the following discussion, we establish that such aggressive pipelining is often uncalled for, since the assumption that arithmetic operations start only after generation of their complete input operand words is not valid for the implementation of composite functions in dedicated hardware. Such an assumption could be valid when multipliers and adders are used as discrete components, which is not the case in ASIC and FPGA implementation these days. On the other hand, we can assume that an arithmetic operation can start as soon as the LSBs of its operands are available. Accordingly, the propagation delay $T_{MA}$ of the multiply-add operation in Fig. 2 could be taken to be nearly $T_M + T_{FAC} + T_{FAS}$, where $T_{FAC}$ and $T_{FAS}$ are the delays of carry and sum generation in a 1-bit full-adder circuit. Therefore, $T_{MA}$ is much less than $T_M + T_A$. In Table I, we have shown the propagation delays of a multiplier, an adder, carry-and-sum generation in a 1-bit full-adder circuit, and a multiply-add circuit in TSMC 90-nm [12] and 0.13-μm [13] processes to validate our assertion in this context. From this table, we can also find that $T_{MA}$ is much less than $T_M + T_A$.
In Section III, we further show that the critical path of the direct-form LMS adaptive filter is much less than this conventional estimate, and amounts to nearly $T_M + T_{3XOR} + \lceil\log_2 N\rceil\,T_{FA}$, where $T_{3XOR}$ and $T_{FA}$ are, respectively, the delays of a 3-input XOR gate and a 1-bit full adder. Besides, we have shown that no pipelining is required for implementing the LMS algorithm for most practical cases, and it could

TABLE I: PROPAGATION DELAY (NS) BASED ON SYNTHESIS OF TSMC 0.13-μM AND 90-NM CMOS TECHNOLOGY LIBRARIES

be realized with a very small adaptation delay of one or two samples in cases like radar applications where a very high sampling rate is required [10]. The highest sampling rate supported by the fastest wireless communication standard, (long-term evolution) LTE-Advanced, could be as high as 30.72 Msps [14]. Moreover, computation of the filter output and the weight update could be multiplexed to share hardware resources in the adaptive filter structure to reduce the area consumption.

Further effort has been made by Meher and Maheswari [15] to reduce the number of adaptation delays as well as the critical path by an optimized implementation of the inner product using a unified pipelined carry-save chain in the forward path. Meher and Park [8], [9] have proposed a 2-bit multiplication cell, and used it with an efficient adder tree for the implementation of pipelined inner-product computation to minimize the critical path and silicon area without increasing the number of adaptation delays. But in these works, the critical-path analysis and the necessary design considerations are not taken into account. Due to that, the designs of [8], [9], [15] still consume higher area, which could be substantially reduced. Keeping the above observations in mind, we present a systematic critical-path analysis of the LMS adaptive filter, and based on that, we derive an architecture for the LMS adaptive filter with minimal use of pipeline stages, which results in lower area complexity and less power consumption without compromising the desired processing throughput.

The rest of the paper is organized as follows. In the next section, we review the direct-form and transpose-form implementations of the DLMS algorithm, along with their convergence behavior. The critical-path analysis of both these implementations is discussed in Section III. The proposed low-complexity designs of the LMS adaptive filter are described in Section IV. The performance of the proposed designs in terms of hardware requirement, timing, and power consumption is discussed in Section V. Conclusions are presented in Section VI.

II. REVIEW OF DELAYED LMS ALGORITHM AND ITS IMPLEMENTATION

In this section, we discuss the implementation and convergence performance of direct-form and transpose-form DLMS adaptive filters.


    Fig. 3. Generalized block diagram of direct-form DLMS adaptive filter.

    Fig. 4. Error-computation block of Fig. 3.

    Fig. 5. Weight-update block of Fig. 3.

    A. Implementation of Direct-Form Delayed LMS Algorithm

Assuming that the error-computation path is implemented in $m$ pipelined stages, the latency of error computation is $m$ cycles, so that the error computed by the structure at the $n$th cycle is $e(n-m)$, which is used with the input samples delayed by $m$ cycles to generate the weight-increment term. The weight-update equation of the DLMS algorithm is given by

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\, e(n-m)\, \mathbf{x}(n-m) \quad \text{(2a)}$$

where

$$e(n-m) = d(n-m) - y(n-m) \quad \text{(2b)}$$

and

$$y(n) = \mathbf{w}^{T}(n)\, \mathbf{x}(n) \quad \text{(2c)}$$

A generalized block diagram of the direct-form DLMS adaptive filter is shown in Fig. 3. It consists of an error-computation block (shown in Fig. 4) and a weight-update block (shown in Fig. 5). The number of delays $m$ shown in Fig. 3 corresponds to the pipeline delays introduced due to pipelining of the error-computation block.
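The delayed update of (2) differs from (1) only in that the error and the corresponding input samples used for the update at cycle n were produced m cycles earlier. A behavioral Python sketch follows (our naming and parameter values, chosen for illustration only):

```python
import numpy as np

def dlms_filter(x, d, N=4, mu=0.02, m=5):
    """Direct-form DLMS per (2): the weight update at cycle n uses the
    error e(n-m) and the input vector x(n-m) from m cycles earlier."""
    L = len(x)
    w = np.zeros(N)
    e = np.zeros(L)
    X = np.zeros((L, N))              # X[n] holds [x(n), ..., x(n-N+1)]
    for n in range(L):
        if n > 0:
            X[n, 1:] = X[n - 1, :-1]  # shift previous delay-line contents
        X[n, 0] = x[n]
        e[n] = d[n] - w @ X[n]        # error with the weights current at cycle n
        if n >= m:
            w = w + mu * e[n - m] * X[n - m]   # (2a): delayed update
    return w, e

# System identification of a known 4-tap FIR system, noiseless.
rng = np.random.default_rng(1)
h_true = np.array([0.3, -0.15, 0.2, 0.1])
x = rng.standard_normal(6000)
d = np.convolve(x, h_true)[:len(x)]
w_final, e = dlms_filter(x, d)
```

For a sufficiently small step size, the delayed update still converges to the same solution as (1); the delay only slows the transient, as discussed next.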

    Fig. 6. Convergence of direct-form delayed LMS adaptive filter.

Direct-form adaptive filters with different values of adaptation delay are simulated for a system-identification problem, where the system is defined by a bandpass filter with impulse response given by

$$h(n) = \frac{\sin\big(\omega_H (n - 0.5P)\big) - \sin\big(\omega_L (n - 0.5P)\big)}{\pi\,(n - 0.5P)} \quad \text{(3)}$$

for $0 \le n \le P$, and $h(n) = 0$ otherwise. Parameters $\omega_H$ and $\omega_L$ represent the high and low cutoff frequencies of the passband. Fig. 6 shows the learning curves for identification of a 32-tap filter with Gaussian random input of zero mean and unit variance, obtained by averaging 50 runs for three values of the adaptation delay (the largest being 10). The step size $\mu$ is set to 1/40, 1/50, and 1/60 for the three cases, respectively, so that each provides its fastest convergence. In all cases, the output of the known system is of unity power, and is contaminated with white Gaussian noise of fixed strength. It can be seen that as the number of delays increases, the convergence slows down, although the steady-state mean-square error (MSE) remains almost the same in all cases.

    B. Implementation of Transpose-Form Delayed LMSAlgorithm

The transpose-form FIR structure cannot be used to implement the LMS algorithm given by (1), since the filter output at any instant of time has contributions from filter weights updated at different iterations, where the adaptation delay of the weights could vary from 1 to $N-1$. It could, however, be implemented by a different set of equations as follows:

$$y(n) = \sum_{k=0}^{N-1} w_k(n-k)\, x(n-k) \quad \text{(4a)}$$

$$w_k(n+1) = w_k(n) + \mu\, e(n)\, x(n-k) \quad \text{(4b)}$$

where $e(n) = d(n) - y(n)$, and the symbols have the same meaning as those described in (1). In (4), it is assumed that no additional delays are incorporated to reduce the critical path during computation of the filter output and weight update. If $m$ additional delays are introduced in the error computation at any instant, then the


Fig. 7. Structure of transpose-form DLMS adaptive filter. The additional adaptation delay could be at most 2 if no more delays are incorporated within the multiplication unit or between the multipliers and adders: one delay could be placed after the computation of $e(n)$ and another after the computation of $\mu e(n)$.

weights are required to be updated according to the following equation:

$$w_k(n+1) = w_k(n) + \mu\, e(n-m)\, x(n-k-m) \quad \text{(5)}$$

but the equation to compute the filter output remains the same as that of (4a). The structure of the transpose-form DLMS adaptive filter is shown in Fig. 7.

It is noted that in (4a), the weight values used to compute the

filter output at the $n$th cycle are updated at different cycles, such that the $k$th weight value is updated $(N-k)$ cycles back. The transpose-form LMS is, therefore, inherently a delayed LMS, and consequently provides slower convergence. To compare the convergence performance of LMS adaptive filters of different configurations, we have simulated the direct-form LMS, direct-form DLMS, and transpose-form LMS for the same system-identification problem, where the system is defined by (3), using the same simulation configuration. The learning curves thus obtained for three different filter lengths are shown in Fig. 8. We find that the direct-form LMS adaptive filter provides much faster convergence than the transpose-form LMS adaptive filter in all cases. The direct-form DLMS adaptive filter with delay 5 also provides faster convergence compared to the transpose-form LMS adaptive filter without any delay. However, the residual mean-square error is found to be nearly the same in all cases.

From Fig. 7, it can be further observed that the transpose-form LMS involves significantly higher register complexity than the direct-form implementation, since it requires an additional signal-path delay line for weight updating, and the registers on the adder line used to compute the filter output are at least twice the size of those in the delay line of the direct-form LMS adaptive filter.
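A behavioral sketch of the transpose-form filter of Fig. 7 makes the per-tap delay explicit (this is our construction from the description above, not code from the paper): partial sums ride through the adder-line registers, so the output at cycle n mixes weights that were current at different cycles, and weight updating needs its own input delay line.

```python
import numpy as np

def transpose_lms(x, d, N=4, mu=0.02):
    """Transpose-form LMS: y(n) = sum_k w_k(n-k) x(n-k), so each tap sees
    a different effective adaptation delay (inherently a delayed LMS)."""
    w = np.zeros(N)
    r = np.zeros(N)          # partial-sum registers on the adder line (r[0] unused)
    xbuf = np.zeros(N)       # extra delay line needed only for weight updating
    e = np.zeros(len(x))
    for n in range(len(x)):
        xbuf = np.roll(xbuf, 1)
        xbuf[0] = x[n]
        # transposed FIR: y(n) = w0*x(n) + r1(n-1); r_k(n) = w_k*x(n) + r_{k+1}(n-1)
        y = w[0] * x[n] + r[1]
        for k in range(1, N - 1):
            r[k] = w[k] * x[n] + r[k + 1]   # uses previous-cycle r[k+1]
        r[N - 1] = w[N - 1] * x[n]
        e[n] = d[n] - y
        w = w + mu * e[n] * xbuf            # update each tap with its delayed input
    return w, e

rng = np.random.default_rng(3)
h_true = np.array([0.3, -0.15, 0.2, 0.1])
x = rng.standard_normal(8000)
d = np.convolve(x, h_true)[:len(x)]
w_final, e = transpose_lms(x, d)
```

The model still converges to the system response for a small step size; it simply takes longer than the direct form, in line with Fig. 8.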

III. CRITICAL-PATH ANALYSIS OF LMS ADAPTIVE FILTER AND IMPLEMENTATION STRATEGY

The critical path of the LMS adaptive filter of Fig. 1 for direct implementation is given by

$$T_{CP} = T_{ECB} + T_{WUB} \quad \text{(6)}$$

Fig. 8. Convergence comparison of direct-form and transpose-form adaptive filters for three filter lengths, shown in panels (a)-(c). Adaptation delay is set to 5 for the direct-form DLMS adaptive filter.

where $T_{ECB}$ and $T_{WUB}$ are, respectively, the time involved in error computation and weight updating. When the error computation and weight updating are performed in two separate pipeline stages, the critical path becomes

$$T_{CP} = \max\{T_{ECB},\, T_{WUB}\} \quad \text{(7)}$$

    Using (6) and (7), we discuss in the following the critical pathsof direct-form and transpose-form LMS adaptive filters.
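The two cases of (6) and (7) can be captured in a one-line model; the delay values below are placeholders in nanoseconds, not the synthesized numbers of Table I:

```python
def critical_path(t_ecb, t_wub, pipelined=False):
    """(6): non-pipelined, T_CP = T_ECB + T_WUB.
       (7): error computation and weight update in two pipeline stages,
            T_CP = max(T_ECB, T_WUB)."""
    return max(t_ecb, t_wub) if pipelined else t_ecb + t_wub

# Placeholder delays: error computation dominates weight updating.
t_np = critical_path(6.0, 2.5)                   # non-pipelined
t_p = critical_path(6.0, 2.5, pipelined=True)    # two-stage pipeline
```

Since the error-computation time dominates, the two-stage pipeline reduces the critical path from the sum of the two delays to the larger of the two.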

    A. Critical Path of Direct Form

To find the critical path of the direct-form LMS adaptive filter, let us consider the implementation of an inner product of length 4. The implementation of this inner product is shown in Fig. 9, where all multiplications proceed concurrently, and additions of product words start as soon as the LSBs of the products are available. Computations of the first-level adders (ADD-1 and ADD-2) are completed in time $T_M + T_{3XOR}$, where


Fig. 9. Critical path of an inner product computation. (a) Detailed block diagram to show the critical path of an inner-product computation of length 4. (b) Block diagram of an inner-product computation of length $N$. (c) HA, FA, and 3-input XOR gate.

$T_{3XOR}$ is the delay due to the 3-input XOR operation for the addition of the last bits (without computing the carry bits), and is determined by the propagation delays of AND and XOR operations. For convenience of representation, we take

$$T_{FA} = T_{FAC} + T_{FAS} \quad \text{(8)}$$

Similarly, the addition of the second-level adder (ADD-3), and hence the inner-product computation of length 4, is completed in time $T_M + T_{3XOR} + T_{FA}$. In general, an inner product of length $N$ (shown in Fig. 4) involves a delay of

$$T_{IP}(N) = T_M + T_{3XOR} + \left(\lceil\log_2 N\rceil - 1\right) T_{FA} \quad \text{(9)}$$

In order to validate (9), we show in Table II the time required for the computation of inner products of different lengths, for word-lengths 8 and 16, using TSMC 0.13-μm and 90-nm process libraries. Using the multiplication time and the time required for carry-and-sum generation in a 1-bit full-adder, obtained from Table I, we find that the results shown in Table II are in conformity with

TABLE II: SYNTHESIS RESULTS OF INNER-PRODUCT COMPUTATION TIME (NS) USING TSMC 0.13-μM AND 90-NM CMOS TECHNOLOGY LIBRARIES

those given by (9). The critical path of the error-computation block therefore amounts to

$$T_{ECB} = T_M + T_{3XOR} + \lceil\log_2 N\rceil\, T_{FA} \quad \text{(10)}$$

For computation of the weight-update unit shown in Fig. 5, if we assume the step size $\mu$ to be a power-of-2 fraction, i.e., of the form $2^{-s}$, then the multiplication with $\mu$ can be implemented by rewiring, without involving any hardware or time delay. The


critical path then consists of a multiply-add operation, which can be shown to be

$$T_{WUB} = T_M + T_{FA} \quad \text{(11)}$$

Using (6), (10), and (11), we can find the critical path of the non-pipelined direct-form LMS adaptive filter to be

$$T_{CP} = 2\,T_M + T_{3XOR} + \left(\lceil\log_2 N\rceil + 1\right) T_{FA} \quad \text{(12)}$$

If the error computation and weight updating are performed in two pipelined stages, then from (7), we can find the critical path to be

$$T_{CP} = T_{ECB} = T_M + T_{3XOR} + \lceil\log_2 N\rceil\, T_{FA} \quad \text{(13)}$$

    This could be further reduced if we introduce delays in the error-computation block to have a pipelined implementation.
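The adder-tree inner product of Fig. 9 can be modeled behaviorally. The function below (our sketch) returns both the product sum and the number of adder levels, i.e., the ⌈log2 N⌉ factor that appears in the delay expressions of this section:

```python
def tree_inner_product(a, b):
    """Inner product via a balanced binary adder tree: all N products are
    formed concurrently, then summed pairwise level by level.  Returns
    the result and the number of adder levels, ceil(log2 N)."""
    terms = [float(p) * float(q) for p, q in zip(a, b)]
    levels = 0
    while len(terms) > 1:
        nxt = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            nxt.append(terms[-1])   # odd term passes through to the next level
        terms = nxt
        levels += 1
    return terms[0], levels
```

For a length-4 inner product the model reports two adder levels, matching the ADD-1/ADD-2 and ADD-3 levels of Fig. 9; for lengths that are not powers of 2, the level count is still ⌈log2 N⌉.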

    B. Critical Path of Transpose Form

In the error-computation block of the transpose-form LMS adaptive filter (Fig. 7), we can see that all multiplications are performed simultaneously, which involves time $T_M$. After the multiplications, the results are transferred through the preceding registers to be added with another product word in the next cycle. Since the addition operation starts as soon as the first bit of the product word is available (as in the direct-form LMS), the critical path of the error-computation block is

$$T_{ECB} = T_M + 2\,T_{FA} \quad \text{(14)}$$

If one delay is inserted after the computation of $y(n)$, then the critical path given by (14) will change to $T_M + T_{FA}$. We have assumed here that the critical path comprises the last multiply-add operation to compute the filter output. Note that as the sum of product words traverses the adder line, more and more product words are accumulated, and the width of the accumulated sum finally becomes $2W + \lceil\log_2 N\rceil$ bits, where $W$ is the width of the input as well as of the weight values. The critical path of the weight-updating block is similarly found to be

$$T_{WUB} = T_M + T_{FA} \quad \text{(15)}$$

However, for $m = 1$, i.e., when the delay is inserted after $y(n)$ only, the critical path will include the additional delay introduced by the subtraction for the computation of the error term, and $T_{WUB} = T_M + 2\,T_{FA}$. Without any adaptation delay, the critical path would be

$$T_{CP} = T_{ECB} + T_{WUB} = 2\,T_M + 3\,T_{FA} \quad \text{(16)}$$

Interestingly, the critical paths of the direct-form and transpose-form structures without additional adaptation delay are nearly the same. If the weight updating and error computation in the transpose-form structure happen in two different pipeline stages, the critical path of the complete transpose-form adaptive filter structure with adaptation delay amounts to

$$T_{CP} = \max\{T_{ECB},\, T_{WUB}\} = T_M + 2\,T_{FA} \quad \text{(17)}$$

From (13) and (17), we can find that the critical path of the transpose-form DLMS adaptive filter is nearly the same as that of the direct-form implementation where weight updating and error computation are performed in two separate pipeline stages.

    C. Proposed Design Strategy

We find that the direct-form FIR structure not only is the natural candidate for implementation of the LMS algorithm in its original form, but also provides better convergence speed for the same residual MSE. It also involves less register complexity and nearly the same critical path as the transpose-form structure. Therefore, we have preferred to design a low-complexity direct-form structure for implementation of the LMS adaptive filter.

From Tables I and II, we can find that the critical path of the direct-implementation LMS algorithm is around 7.3 ns for a 16-bit implementation using the 0.13-μm technology library, which can support a sampling rate as high as 100 Msps. The critical path increases by only one full-adder delay (nearly 0.2 ns) when the filter order is doubled, so even for substantially higher filter orders the critical path still remains within 8 ns. On the other hand, the highest sampling frequency of LTE-Advanced amounts to 30.72 Msps [14]. For still higher data rates, such as those of some acoustic echo cancellers, we can have structures with one and two adaptation delays, which can respectively support about twice and thrice the sampling rate of the zero-adaptation-delay structure.

    IV. PROPOSED STRUCTURE

In this section, we discuss area- and power-efficient approaches for the implementation of direct-form LMS adaptive filters with zero, one, and two adaptation delays.

    A. Zero Adaptation Delay

As shown in Fig. 3, there are two main computing blocks in the direct-form LMS adaptive filter, namely, (i) the error-computation block (shown in Fig. 4) and (ii) the weight-update block (shown in Fig. 5). It can be observed in Figs. 4 and 5 that most of the area-intensive components are common to the error-computation and weight-update blocks: the multipliers, weight registers, and tapped delay line. The adder tree and subtractor in Fig. 4 and the adders for weight updating in Fig. 5, which constitute only a small part of the circuit, are different in these two computing blocks. For the zero-adaptation-delay implementation, the computation of both these blocks is required to be performed in the same cycle. Moreover, since the structure is of the non-pipelined type, weight updating and error computation cannot occur concurrently. Therefore, the multiplications of both these phases could be multiplexed onto the same set of multipliers, while the same registers could be used for both these


    Fig. 10. Proposed structure for zero-adaptation-delay time-multiplexed direct-form LMS adaptive filter.

phases if the error computation is performed in the first half-cycle, while the weight update is performed in the second half-cycle.

The proposed time-multiplexed zero-adaptation-delay structure for a direct-form $N$-tap LMS adaptive filter is shown in Fig. 10, which consists of $N$ multipliers. The input samples are fed to the multipliers from a common tapped delay line. The weight values (stored in $N$ registers) and the estimated error value (after right-shifting by a fixed number of locations to realize multiplication by the step size $\mu$) are fed to the multipliers as the other input through a 2:1 multiplexer. Apart from this, the proposed structure requires $N$ adders for the modification of weights, and an adder tree to add the outputs of the multipliers for the computation of the filter output. Also, it requires a subtractor to compute the error value and $N$ 2:1 de-multiplexors to move the product values either towards the adder tree or towards the weight-update circuit. All the multiplexors and de-multiplexors are controlled by a clock signal.

The registers in the delay line are clocked at the rising edge

of the clock pulse and remain unchanged for a complete clock period, since the structure is required to take in one new sample every clock cycle. During the first half of each clock period, the weight values stored in the registers are fed to the multipliers through the multiplexors to compute the filter output. The product words are then fed to the adder tree through the de-multiplexors. The filter output is computed by the adder tree, and the error value is computed by a subtractor. The computed error value is then right-shifted to obtain $\mu e(n)$ and is broadcast to all the multipliers of the weight-update circuit. Note that the LMS adaptive filter requires at least one delay at a suitable location to break the recursive loop. A delay could be inserted either after the adder tree, after the $e(n)$ computation, or after the $\mu e(n)$ computation. If the delay is placed just after the adder tree, then the critical path shifts to the weight-updating circuit and gets increased by the delay of the subtractor. Therefore, we should place the delay after the computation of $e(n)$ or $\mu e(n)$, but preferably after the $\mu e(n)$ computation to reduce the register width.

    The first half-cycle of each clock period ends with thecomputation of , and during the second half cycle, the

    value is fed to the multipliers though the multiplexors tocalculate and de-multiplexed out to be added to thestored weight values to produce the new weights according to(2a). The computation during the second half of a clock periodis completed once a new set of weight values is computed. Theupdated weight values are used in the first half-cycle of thenext clock cycle for computation of the filter output and forsubsequent error estimation. When the next cycle begins, theweight registers are also updated by the new weight values.Therefore, the weight registers are also clocked at the risingedge of each clock pulse.The time required for error computation is more than that

    of weight updating. The system clock period could be shorter if we simply performed these operations one after the other in every cycle, since all the register contents change only once, at the beginning of a clock cycle; however, we cannot exactly determine when the error computation is over and when the weight updating is completed. Therefore, we perform the error computation during the first half-cycle and the weight updating during the second half-cycle. Accordingly, the clock period of the proposed structure is twice the critical-path delay of the error-computation block, which we can find using (14) as

    (18)

    where T_MUX is the time required for multiplexing and de-multiplexing.
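As a behavioural reference (not a model of the circuit timing), the zero-adaptation-delay operation described above can be sketched in Python. The function name, the filter length, and the step size mu = 2**-4 are illustrative choices of ours, not values taken from the paper; the two commented half-cycles correspond to the output/error computation and the weight update.

```python
import numpy as np

def lms_identify(x, d, n_taps=8, mu=2**-4):
    """Zero-adaptation-delay LMS (Design 1 behaviour): within one
    'clock cycle' the filter output, the error e(n), and the weight
    update all use the current input sample."""
    w = np.zeros(n_taps)            # filter weights
    buf = np.zeros(n_taps)          # tapped delay line
    e = np.zeros(len(x))
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]               # shift in the new sample
        y = w @ buf                 # multipliers + adder tree (1st half-cycle)
        e[n] = d[n] - y             # subtractor
        w = w + mu * e[n] * buf     # weight update (2nd half-cycle)
    return w, e
```

For a noiseless system-identification run with a persistently exciting input, the weights converge to the unknown impulse response.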

    B. One Adaptation Delay

    The proposed structure for a one-adaptation-delay LMS adaptive filter consists of one error-computation unit, as shown in Fig. 4, and one weight-update unit, as shown in Fig. 5. A pipeline latch is introduced after the computation of μe(n). The


    Fig. 11. Proposed structure for two-adaptation-delay direct-form LMS adaptive filter.

    multiplication with μ requires only a hardwired shift, since μ is assumed to be a power-of-2 fraction, so there is no register overhead in pipelining. Also, the registers in the tapped delay line and the filter weights can be shared by the error-computation unit and the weight-update unit. The critical path of this structure is the same as that derived in (10), given by

    (19)
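The effect of the pipeline latch after μe(n) can be sketched behaviourally as follows; again the names and parameter values are illustrative assumptions of ours. The key point is that the weights applied at sample n were updated with the error and input vector of sample n-1.

```python
import numpy as np

def dlms_one_delay(x, d, n_taps=8, mu=2**-4):
    """One-adaptation-delay LMS (Design 2 behaviour): a pipeline
    latch after mu*e(n) means each weight update uses the error and
    tap vector of the previous sample."""
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)
    e = np.zeros(len(x))
    latched = (0.0, np.zeros(n_taps))    # pipeline latch: (mu*e, tap vector)
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        e[n] = d[n] - w @ buf            # error with current (delayed) weights
        w = w + latched[0] * latched[1]  # update with latched quantities
        latched = (mu * e[n], buf.copy())
    return w, e
```

For a small step size, the one-sample delay in the adaptation does not prevent convergence; it only slows it slightly.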

    C. Two Adaptation Delays

    The proposed structure for a two-adaptation-delay LMS adaptive filter is shown in Fig. 11. It consists of three pipeline stages, where the first stage ends after the first level of the adder tree in the error-computation unit, and the rest of the error-computation block comprises the second pipeline stage. The weight-update block comprises the third pipeline stage. The two-adaptation-delay structure involves additional registers over the one-adaptation-delay structure. The critical path of this structure is either that of the weight-update unit [derived in (11)] or that of the second pipeline stage, given by

    (20)

    where the second term refers to the delay of the remaining adder-tree stages together with the time required for the subtraction in the error computation.
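The three designs differ only in how many samples the weight adaptation lags the error computation. A generic m-delay sketch (our own formulation, with illustrative parameters) makes this explicit and lets the three cases m = 0, 1, 2 be compared directly:

```python
import numpy as np
from collections import deque

def dlms(x, d, n_taps=8, mu=2**-4, m=0):
    """LMS with m adaptation delays (m = 0, 1, 2 corresponding to
    Designs 1-3): the update at sample n uses the error and tap
    vector from m samples earlier, modelling the pipeline latches."""
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)
    pipe = deque([(0.0, np.zeros(n_taps))] * m)   # m-stage pipeline
    errs = np.zeros(len(x))
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        errs[n] = d[n] - w @ buf
        pipe.append((mu * errs[n], buf.copy()))
        g, v = pipe.popleft()                     # quantities from m samples ago
        w = w + g * v
    return w, errs
```

With m = 0 the pipeline is empty and the function reduces to the ordinary LMS recursion; for small step sizes all three settings converge to the same weights.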

    D. Structure for High Sampling Rate and Large-Order Filters

    We find that in many popular applications, such as channel

    equalization and channel estimation in wireless communication, noise cancellation in speech processing, and power-line interference cancellation and removal of muscle and electrode-motion artifacts in ECG [16]–[22], the filter order could vary from 5 to 100. However, in some applications, such as acoustic echo cancellation and seismic signal acquisition, the required filter order could be more than 1000 [23]–[25]. Therefore, we discuss here the impact of an increase in filter order on the critical path, along with design considerations for the implementation of large-order filters for high-speed applications. For large-order filters, i.e., for large N, the critical-path delay

    for the 1-stage pipeline implementation in (19) increases by one adder delay when the filter order is doubled. For the 2-stage pipeline implementation,

    the second-stage delay in (20) could become larger than that of the weight-update unit and thus determine the critical-path delay of the structure; it likewise increases by one adder delay when the filter order is doubled. When 90-nm CMOS technology is used, the critical-path delay could be nearly 5.97 ns and 3.66 ns for the 1- and 2-stage pipeline implementations, respectively, for the considered word length and filter order. Therefore, in order to support input sampling rates higher than 273 Msps, additional delays could be incorporated at the tail end of the adder tree using only a small number of registers. Note that if a pipeline stage is introduced just before the last level of addition in the adder tree, then only one pipeline register is required. If we introduce the pipeline stage further up from the last adder in the adder tree, then correspondingly more registers are needed. The delay of the adder block, however, does not increase fast with the filter


    TABLE III
    COMPARISON OF HARDWARE AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES

    order, since the adder tree grows by only one level when the filter length is doubled, introducing only one extra adder delay in the critical path. The critical path could be reduced only incrementally if we

    pipeline the adaptive filter after every addition, which would involve enormous register complexity. For a further increase in clock rate, one can use the block-LMS adaptive filter [26]. A block-LMS adaptive filter with block length L would support an L-times higher sampling rate without increasing the energy per

    sample (EPS). Therefore, pipelining of the multiplication block or the adder tree after every addition is not a preferable option for implementing adaptive filters for high sampling rates or large filter orders.
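The growth argument above can be checked with one line of arithmetic. The helper below is ours, added only to make the doubling behaviour concrete: a binary adder tree summing N products has ceil(log2 N) levels, so doubling N adds exactly one level, i.e. one extra adder delay on the critical path.

```python
from math import ceil, log2

def adder_tree_levels(n_taps: int) -> int:
    """Depth of a binary adder tree that sums n_taps products."""
    return ceil(log2(n_taps))

# Doubling the filter order adds exactly one level, i.e. one extra
# adder delay in the critical path, as argued above:
for n in (8, 16, 512, 1024):
    assert adder_tree_levels(2 * n) == adder_tree_levels(n) + 1
```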

    V. COMPLEXITY CONSIDERATIONS

    The hardware and time complexities of the proposed and existing designs are listed in Table III. The transpose-form fine-grained retimed DLMS (TF-RDLMS) and the tree direct-form fine-grained retimed DLMS (TDF-RDLMS) of [10], the best of the systolic structures [5], and our most recent direct-form structure [9] are compared with the proposed structures. The proposed designs with 0, 1, and 2 adaptation delays (presented in Section IV) are referred to as Design 1, Design 2, and Design 3 in Table III. The direct-form and transpose-form LMS algorithms based on the structures of Figs. 4, 5, and 7 without any adaptation delays, and the DLMS structure proposed in [3], are also listed in this table for reference. It is found that proposed Design 1 has the longest critical path, but it involves only half the number of multipliers of the other designs except [9], and it does not require any adaptation delay. Proposed Designs 2 and 3 have fewer adaptation delays than the existing designs, with the same number of adders and multipliers, and involve fewer delay registers. We have coded all the proposed designs in VHDL and

    synthesized them using the Synopsys Design Compiler with the TSMC 90-nm CMOS library [12] for different filter orders. The structures of [10], [5], and [9] were similarly coded and synthesized using the same tool. The word lengths of the input samples and weights are chosen to be 12 bits, and the internal data are not truncated before the computation of the filter output, to minimize quantization noise. Then, e(n) is truncated to 12

    bits, while the step size μ is chosen to be a power-of-2 fraction to realize its multiplication without any additional circuitry. The data-arrival time (DAT), maximum usable frequency (MUF), adaptation delay, area, area-delay product (ADP), power consumption at the maximum usable frequency (PCMUF), normalized power consumption at 50 MHz, and energy per sample (EPS) are listed in Table IV. Note that power consumption increases linearly with frequency, and the PCMUF gives the power consumption when the circuit is operated at its highest possible frequency. All the proposed designs have significantly lower PCMUF than the existing designs. However, the circuits need not always be operated at the highest frequency; therefore, PCMUF alone is not a suitable measure of power performance. The normalized power consumption at a given frequency provides a relatively better figure of merit for comparing the power efficiency of different designs. The EPS similarly does not change much with operating frequency for a given technology and operating voltage, and is thus a useful measure. The transpose-form structure of [10], TF-RDLMS, provides

    a relatively high MUF, which is 8.1% higher than that of proposed Design 3, but it involves 19.4% more area, 10.4% more ADP, and 59.3% more EPS. Besides, the transpose-form structure of [10] provides slower convergence than the proposed direct-form structures. The direct-form structure of [10], TDF-RDLMS, has nearly the same complexity as its transpose-form counterpart. It involves 13.8% more area, 8.0% more ADP, and 35.6% more EPS, and a 5.4% higher MUF, compared with Design 3. Besides, it requires 4, 5, and 6 more adaptation delays than proposed Design 3 for filter lengths 8, 16, and 32, respectively. The structure of [5] provides nearly the same MUF as proposed Design 3, but requires 19.0% more area, 17.6% more ADP, and 20.4% more EPS. The structure of [9] provides the highest MUF, since its critical-path delay is the smallest; however, it requires more adaptation delays than the proposed designs. Also, the structure of [9] involves 4.7% less ADP, but 12.2% more area and 26.2% more EPS, than proposed Design 3. Proposed Design 1 has the minimum MUF among all the structures, but that is adequate to support the highest data rates in current communication systems. It involves the minimum area and the minimum EPS of all the designs. The direct-form structure of [10] requires 82.8% more area and 52.4% more EPS


    TABLE IV
    PERFORMANCE COMPARISON OF DLMS ADAPTIVE FILTER CHARACTERISTICS BASED ON SYNTHESIS USING TSMC 90-NM LIBRARY

    compared to proposed Design 1. Similarly, the structure of [5] involves 91.3% more area and 35.4% more EPS compared with proposed Design 1. Proposed Designs 2 and 3 involve nearly the same (slightly more) EPS as proposed Design 1, but offer nearly twice and thrice the MUF, at a cost of 55.0% and 60.6% more area, respectively.
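The figures of merit compared above are related by simple formulas, sketched below. The helper names and the numeric values in the comments are placeholders of ours for illustration, not the synthesis results of Table IV.

```python
def adp(area_um2: float, dat_ns: float) -> float:
    """Area-delay product: cell area times data-arrival time."""
    return area_um2 * dat_ns

def eps_nj(power_mw: float, rate_msps: float) -> float:
    """Energy per sample in nJ. Since dynamic power scales roughly
    linearly with frequency, power / sample-rate is nearly
    frequency-independent for a fixed technology and voltage,
    which is why EPS is a useful cross-design metric."""
    return power_mw / rate_msps      # mW / Msps == nJ per sample

# Placeholder numbers (not from the paper): a design burning 20 mW
# at 100 Msps spends about 0.2 nJ on each processed sample.
assert abs(eps_nj(20.0, 100.0) - 0.2) < 1e-12
```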

    VI. CONCLUSION

    Based on a precise critical-path analysis, we have derived

    low-complexity architectures for the LMS adaptive filter. We have shown that the direct-form and transpose-form LMS adaptive filters have nearly the same critical-path delay. The direct-form LMS adaptive filter, however, involves less register complexity and provides much faster convergence than its transpose-form counterpart, since the latter inherently performs delayed weight adaptation. We have proposed three structures of the direct-form LMS adaptive filter with i) zero adaptation delay, ii) one adaptation delay, and iii) two adaptation delays. Proposed Design 1 does not involve any adaptation delay. It has the minimum MUF among all the structures, but that is adequate to support the highest data rates in current communication systems. It involves the minimum area and the minimum EPS of all the designs. The direct-form structure of [10] requires 82.8% more area and 52.4% more EPS compared to proposed Design 1, and the transpose-form structure of [10] involves still higher complexity. The structure of [5] involves 91.3% more area and 35.4% more EPS compared with proposed Design 1. Similarly, the structure of [9] involves 80.4% more area and 41.9% more EPS than proposed Design 1. Proposed Design 3 involves relatively fewer adaptation delays and provides a MUF similar to the structures of [10] and [5]. It involves slightly less ADP but provides around 16% to 26% of

    savings in EPS over the others. Proposed Designs 2 and 3 involve nearly the same (slightly more) EPS as proposed Design 1, but offer nearly twice and thrice the MUF at the cost of 55.0% and 60.6% more area, respectively. However, proposed Design 1 could be the preferred choice over proposed Designs 2 and 3 in most communication applications, since it provides adequate speed performance and involves significantly less area and EPS.

    REFERENCES

    [1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1985.
    [2] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ, USA: Wiley-Interscience, 2003.
    [3] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 9, pp. 1397–1405, Sep. 1989.
    [4] M. D. Meyer and D. P. Agrawal, "A modular pipelined implementation of a delayed LMS transversal adaptive filter," in Proc. IEEE Int. Symp. Circuits Syst., May 1990, pp. 1943–1946.
    [5] L. D. Van and W. S. Feng, "An efficient systolic architecture for the DLMS adaptive filter and its applications," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.
    [6] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 1, pp. 86–99, Jan. 2005.
    [7] E. Mahfuz, C. Wang, and M. O. Ahmad, "A high-throughput DLMS adaptive algorithm," in Proc. IEEE Int. Symp. Circuits Syst., May 2005, pp. 3753–3756.
    [8] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter Part-II: An optimized architecture," in Proc. IEEE Int. Midwest Symp. Circuits Syst., Aug. 2011.
    [9] P. K. Meher and S. Y. Park, "Area-delay-power efficient fixed-point LMS adaptive filter with low adaptation-delay," Trans. Very Large Scale Integr. (VLSI) Signal Process. [Online]. Available: http://ieeexplore.ieee.org


    [10] Y. Yi, R. Woods, L.-K. Ting, and C. F. N. Cowan, "High speed FPGA-based implementations of delayed-LMS filters," Trans. Very Large Scale Integr. (VLSI) Signal Process., vol. 39, no. 1–2, pp. 113–131, Jan. 2005.
    [11] S. Y. Park and P. K. Meher, "Low-power, high-throughput, and low-area adaptive FIR filter based on distributed arithmetic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 6, pp. 346–350, Jun. 2013.
    [12] TSMC 90 nm general-purpose CMOS standard cell libraries - tcbn90ghp [Online]. Available: www.tsmc.com/
    [13] TSMC 0.13 μm general-purpose CMOS standard cell libraries - tcb013ghp [Online]. Available: www.tsmc.com/
    [14] 3GPP TS 36.211, Physical Channels and Modulation, ver. 10.0.0, Release 10, Jan. 2011.
    [15] P. K. Meher and M. Maheshwari, "A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm," in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 121–124.
    [16] J. Vanus and V. Styskala, "Application of optimal settings of the LMS adaptive filter for speech signal processing," in Proc. IEEE Int. Multiconf. Comput. Sci. Inf. Technol., Oct. 2010, pp. 767–774.
    [17] M. Z. U. Rahman, R. A. Shaik, and D. V. R. K. Reddy, "Noise cancellation in ECG signals using computationally simplified adaptive filtering techniques: Application to biotelemetry," Signal Process. Int. J. (SPIJ), vol. 3, no. 5, pp. 1–12, Nov. 2009.
    [18] M. Z. U. Rahman, R. A. Shaik, and D. V. R. K. Reddy, "Adaptive noise removal in the ECG using the block LMS algorithm," in Proc. IEEE Int. Conf. Adaptive Sci. Technol., Jan. 2009, pp. 380–383.
    [19] B. Widrow, J. R. Glover, Jr., J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, E. Dong, Jr., and R. C. Goodlin, "Adaptive noise cancelling: Principles and applications," Proc. IEEE, vol. 63, no. 12, pp. 1692–1716, Dec. 1975.
    [20] W. A. Harrison, J. S. Lim, and E. Singer, "A new application of adaptive noise cancellation," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 1, pp. 21–27, Feb. 1986.
    [21] S. Coleri, M. Ergen, A. Puri, and A. Bahai, "A study of channel estimation in OFDM systems," in Proc. IEEE Veh. Technol. Conf., 2002, pp. 894–898.
    [22] J. C. Patra, R. N. Pal, R. Baliarsingh, and G. Panda, "Nonlinear channel equalization for QAM signal constellation using artificial neural networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 2, pp. 262–271, Apr. 1999.
    [23] D. Xu and J. Chiu, "Design of a high-order FIR digital filtering and variable gain ranging seismic data acquisition system," in Proc. IEEE Southeastcon, Apr. 1993.
    [24] M. Mboup, M. Bonnet, and N. Bershad, "LMS coupled adaptive prediction and system identification: A statistical model and transient mean analysis," IEEE Trans. Signal Process., vol. 42, no. 10, pp. 2607–2615, Oct. 1994.
    [25] C. Breining, P. Dreiseitel, E. Hansler, A. Mader, B. Nitsch, H. Puder, T. Schertler, G. Schmidt, and J. Tilp, "Acoustic echo control," IEEE Signal Process. Mag., vol. 16, no. 4, pp. 42–69, Jul. 1999.
    [26] G. A. Clark, S. K. Mitra, and S. R. Parker, "Block implementation of adaptive digital filters," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-29, no. 3, pp. 744–752, Jun. 1981.

    Pramod Kumar Meher (SM'03) received the B.Sc. (Honours) and M.Sc. degrees in physics, and the Ph.D. degree in science, from Sambalpur University, India, in 1976, 1978, and 1996, respectively.

    Currently, he is a Senior Research Scientist with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Senior Scientist with the Institute for Infocomm Research, Singapore, and a Senior Fellow with the School of Computer Engineering, Nanyang Technological University, Singapore. He was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image, and video processing, communication, bio-informatics, and intelligent computing. He has contributed nearly 200 technical papers to various reputed journals and conference proceedings.

    Dr. Meher served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011 and 2012, and as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS from 2008 to 2011. Currently, he is serving as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Circuits, Systems, and Signal Processing. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.

    Sang Yoon Park (S'03–M'11) received the B.S. degree in electrical engineering and the M.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2000, 2002, and 2006, respectively.

    He joined the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow in 2007. Since 2008, he has been with the Institute for Infocomm Research, Singapore, where he is currently a Research Scientist. His research interests include the design of dedicated and reconfigurable architectures for low-power and high-performance digital signal processing systems.