
EURASIP Journal on Applied Signal Processing 2001:4, 192-205
© 2001 Hindawi Publishing Corporation

Adaptive Filters with Error Nonlinearities: Mean-Square Analysis and Optimum Design

Tareq Y. Al-Naffouri

Electrical Engineering Department, Stanford University, CA 94305, USA
Email: [email protected]

Ali H. Sayed

Electrical Engineering Department, University of California, Los Angeles, CA 90095, USA
Email: [email protected]

Received 7 August 2001 and in revised form 10 October 2001

This paper develops a unified approach to the analysis and design of adaptive filters with error nonlinearities. In particular, the paper performs stability and steady-state analysis of this class of filters under weaker conditions than what is usually encountered in the literature, and without imposing any restriction on the color or statistics of the input. The analysis results are subsequently used to derive an expression for the optimum nonlinearity, which turns out to be a function of the probability density function of the estimation error. Some common nonlinearities are shown to be approximations to the optimum nonlinearity. The framework pursued here is based on energy conservation arguments.

Keywords and phrases: adaptive filter, mean-square error, energy conservation, transient analysis, steady-state analysis, stability, error nonlinearity.

1. INTRODUCTION

The least-mean-squares (LMS) algorithm is a popular adaptive algorithm because of its simplicity and robustness [1, 2]. Many LMS-like algorithms have been suggested and analyzed in the literature with the aim of retaining the desirable properties of LMS and simultaneously offsetting some of its limitations. Of particular importance is the class of least-mean-squares algorithms with error nonlinearities. Table 1 lists examples from this class of algorithms for real-valued data.

Table 1: Examples of f[e(i)].

Algorithm                  Error nonlinearity f[e(i)]
LMS                        e(i)
LMF                        e³(i)
LMF family                 e^{2k+1}(i),  k ≥ 0
LMMN                       a e(i) + b e³(i)
Sign error                 sign[e(i)]
Saturation nonlinearity    ∫_0^{e(i)} exp(−z²/2σ²_z) dz

Despite the favorable behavior of many of these LMS variants, their choice and design are mostly justified by intuition rather than by rigorous theoretical arguments. Even the LMS algorithm, which has long been considered as an approximate solution to a least-mean-squares problem, has only recently been justified by a rigorous theory [3].

In this paper, we provide a unifying framework for the mean-square performance of adaptive filters that involve error nonlinearities in their update equations. We will use this analysis to design adaptive algorithms with optimized performance. Before discussing the features of the approach proposed herein and its contributions, we provide, as motivation, a summary of selected studies dealing with adaptive filters with error nonlinearities. These studies can be classified into two categories.

I. Analysis using simplifying assumptions

Since adaptive algorithms with error nonlinearities are among the most difficult to analyze, it is not uncommon to resort to different methods and assumptions with the intent of performing tractable analysis. This includes:

Linearization: here the error nonlinearity is linearized around an operating point and higher-order terms are discarded, as in [4, 5, 6, 7, 8]. Analyses that are based on this technique fail to accurately describe the adaptive filter performance for large values of the error, for example, at early stages of adaptation. Linearization can be avoided by focusing on a specific nonlinearity (e.g., as in the sign algorithm [9]) or a subclass of nonlinearities (e.g., as in the case of the error saturation algorithm [10, 11]).

Restricting the class of input signals: such as assuming the input to be white and/or Gaussian (e.g., [5, 6, 9, 10, 11, 12, 13, 14, 15, 16]).

Independence assumption: it is very common to assume that successive regression vectors are independent.

Assumptions on the statistics of the error signal: while statistical assumptions are usually imposed on the regression and noise sequences, it is also common to impose statistical conditions on error quantities. For example, in studying the sign-LMS algorithm, it was assumed in [17] that the elements of the weight-error vector are jointly Gaussian. More accurate is the assumption that the residual error is Gaussian, which was adopted in [6, 9, 10, 11]. By central limit arguments, this assumption is justified for long adaptive filters. More importantly, the assumption is as valid in the early stages as in the final stages of adaptation.

Assuming Gaussian noise: noise is sometimes restricted to be i.i.d. Gaussian as in [6, 9, 17, 18], although Gaussianity is not as common as the previous assumptions.

Most studies of adaptive algorithms with error nonlinearities rely on a selection from the above array of assumptions/techniques.

II. Optimal designs [8, 19, 20, 21]

Here one attempts to construct adaptive algorithms with optimum nonlinearities. A natural prerequisite is to evaluate some measure of performance (e.g., [6, 7, 16, 22]) and then minimize it to arrive at optimum choices for the nonlinearity [8, 19, 20, 21]. The difficulty, of course, is that the analysis is often plagued by the aforementioned assumptions and techniques. The result is that the optimum nonlinearities obtained are not any more valid than the restrictions imposed by the analysis.

1.1. The approach of this paper

In this paper, we address some of the above concerns. In particular, we present a unified approach to the mean-square analysis of adaptive algorithms with general error nonlinearities. The approach relies on energy conservation arguments and applies under weaker assumptions than what is available in the literature. Our performance results are subsequently optimized to obtain an expression for the optimum nonlinearity. In what follows, we list the contributions of the paper. This also serves as a layout for its organization.

(1) After introducing our notation, we set the stage in the next section by defining the adaptive filtering problem. We also derive an energy relation that will be the starting point for much of the subsequent analysis.

(2) The energy relation is used in Section 3 to study mean-square stability. In particular, without relying on any independence-like assumptions, we derive bounds on the step-size for stability.

(3) Section 4 is devoted to studying the steady-state behavior, where we show that the mean-square error can be obtained as the fixed point of a nonlinear equation. The stability and steady-state analyses apply under weaker conditions than usual, and these conditions become reasonably accurate for long adaptive filters.

(4) The steady-state results are used in Section 5 to obtain an expression for the optimum nonlinearity, which is valid for all stages of adaptation. The nonlinearity turns out to be a function of the noise probability density function (pdf). We show how the nonlinearity manifests itself for different noise distributions and how it relates to more common nonlinearities.

1.2. Notation

We focus on real-valued data, although the extension to complex-valued data is immediate. Small boldface letters are used to denote vectors, for example, w. Also, the symbol T denotes transposition. The notation ‖w‖² stands for the squared Euclidean norm of a vector, ‖w‖² = w^T w. All vectors are column vectors except for a single vector, namely the input data vector u_i, which is taken to be a row vector. The time instant is placed as a subscript for vectors (e.g., w_i) and between parentheses for scalars (e.g., e(i)).

2. ADAPTIVE ALGORITHMS WITH ERROR NONLINEARITIES

An adaptive filter attempts to identify a weight vector w^o by using a sequence of input (row) regressors u_i and output samples d(i) that are related via

d(i) = u_i w^o + v(i).  (1)

Here v(i) accounts for measurement noise and modeling errors. Many adaptive schemes have been proposed in the literature for this purpose (cf. [1, 2]). In this paper, we focus on the class of algorithms

w_{i+1} = w_i + µ f[e(i)] u_i^T,  i ≥ 0,  (2)

where w_i is the estimate of w^o at time i, µ is the step size,

e(i) ≜ d(i) − u_i w_i = u_i w^o − u_i w_i + v(i)  (3)

is the estimation error, and f[e(i)] is a scalar function of the error e(i). Table 1 lists some common adaptive algorithms and their corresponding error nonlinearities.
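The class (2) can be sketched in a few lines of code. The snippet below is a minimal illustration, not the paper's implementation: the nonlinearities follow Table 1, while the filter length, step size, and signal statistics are arbitrary illustrative choices.

```python
import numpy as np

# Error nonlinearities f[e(i)] from Table 1 (illustrative subset).
def f_lms(e):  return e
def f_lmf(e):  return e ** 3
def f_sign(e): return np.sign(e)
def f_lmmn(e, a=0.5, b=0.5): return a * e + b * e ** 3

def adapt(f, u, d, M, mu):
    """Run the update w_{i+1} = w_i + mu * f[e(i)] * u_i^T of (2)."""
    w = np.zeros(M)
    for i in range(len(d)):
        ui = u[i]              # row regressor u_i
        e = d[i] - ui @ w      # estimation error, eq. (3)
        w = w + mu * f(e) * ui # update (2)
    return w

# Hypothetical identification setup following the data model (1).
rng = np.random.default_rng(0)
M, N, mu = 8, 2000, 0.01
wo = rng.standard_normal(M)            # unknown weight vector w^o
u = rng.standard_normal((N, M))        # white Gaussian regressors
v = 0.01 * rng.standard_normal(N)      # measurement noise v(i)
d = u @ wo + v                         # d(i) = u_i w^o + v(i)

w_lms = adapt(f_lms, u, d, M, mu)      # LMS: f[e] = e
```

With these (illustrative) settings the LMS estimate lands close to w^o; swapping in `f_lmf` or `f_sign` exercises the other rows of Table 1, subject to the step-size conditions derived later in the paper.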

Mean-square analysis of adaptive filters is best carried out in terms of the weight-error vector

w̃_i = w^o − w_i  (4)

and the a priori and a posteriori errors defined by

e_a(i) ≜ u_i w̃_i,  e_p(i) ≜ u_i w̃_{i+1}.  (5)

We can use these quantities to rewrite the adaptation equation

(2) as

w̃_{i+1} = w̃_i − µ f[e(i)] u_i^T.  (6)

Moreover, by combining the defining expressions (3) and (5), we obtain

e(i) = e_a(i) + v(i).  (7)

A relation between the estimation errors e_a(i), e_p(i), and e(i) can be obtained by pre-multiplying both sides of the adaptation equation (6) by u_i,

u_i w̃_{i+1} = u_i w̃_i − µ f[e(i)] ‖u_i‖²,  (8)

and incorporating the defining expressions (5), which yields

e_p(i) = e_a(i) − µ ‖u_i‖² f[e(i)].  (9)

2.1. An energy conservation relation

To motivate the subsequent analysis, it is worth listing first the questions that are usually of interest in an adaptive filtering setting. We are often interested in questions related to

Steady-state behavior: which relates to determining the steady-state values of E[‖w̃_i‖²], E[e_a²(i)], and/or E[e_p²(i)].

Stability: which relates to determining the range of values of the step-size for which the variances E[e_a²(i)] and E[‖w̃_i‖²] remain bounded.

Learning curves: which relates to determining the time evolution of the curves E[e_a²(i)] and E[‖w̃_i‖²].

Observe that the above questions are all conveniently phrased in terms of the error quantities e_a(i), e_p(i), w̃_i or, more accurately, in terms of their energies. This fact motivates us to pursue an energy-based approach.

More specifically, in order to address questions of this kind, we will rely on an energy equality that relates the squared norms of the error quantities e_a(i), e_p(i), w̃_i, w̃_{i+1}. To derive the energy relation, we combine (6) and (9) so as to eliminate the nonlinearity f[e(i)]:

w̃_{i+1} = w̃_i − (e_a(i) − e_p(i)) u_i^T/‖u_i‖².  (10)

We then square both sides to get

‖w̃_{i+1}‖² = (w̃_i − (e_a(i) − e_p(i)) u_i^T/‖u_i‖²)^T (w̃_i − (e_a(i) − e_p(i)) u_i^T/‖u_i‖²).  (11)

This yields, after some straightforward manipulations, the energy relation

‖w̃_{i+1}‖² + e_a²(i)/‖u_i‖² = ‖w̃_i‖² + e_p²(i)/‖u_i‖².  (12)

This result is exact for any adaptive algorithm described by (2); no approximations whatsoever were used. It has proven very useful in the study of the performance of adaptive filters, in both deterministic and stochastic analysis. The relation was originally derived in [23, 24, 25] and used in the context of robustness analysis of adaptive filters; it was later used in [26, 27, 28, 29, 30, 31] in the context of steady-state and transient analysis of adaptive filters.
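Because (12) is an exact identity rather than an approximation, it can be verified numerically to machine precision at every iteration. The sketch below does so for the LMF choice f[e] = e³; the filter length, step size, and noise level are illustrative.

```python
import numpy as np

# Verify the energy relation (12) per iteration; it holds identically for
# any error nonlinearity (here f[e] = e^3).  All parameters are illustrative.
rng = np.random.default_rng(1)
M, mu = 5, 1e-4
wo = rng.standard_normal(M)
w = np.zeros(M)
max_gap = 0.0
for i in range(200):
    ui = rng.standard_normal(M)
    d = ui @ wo + 0.1 * rng.standard_normal()  # data model (1)
    e = d - ui @ w                             # e(i), eq. (3)
    ea = ui @ (wo - w)                         # a priori error, eq. (5)
    w_next = w + mu * (e ** 3) * ui            # update (2) with f[e] = e^3
    ep = ui @ (wo - w_next)                    # a posteriori error, eq. (5)
    uu = ui @ ui                               # ||u_i||^2
    lhs = (wo - w_next) @ (wo - w_next) + ea ** 2 / uu
    rhs = (wo - w) @ (wo - w) + ep ** 2 / uu
    max_gap = max(max_gap, abs(lhs - rhs))     # should be ~ machine precision
    w = w_next
```

The largest per-iteration gap stays at floating-point round-off, illustrating that no statistical assumptions enter (12) itself.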

Since we are interested in the mean-square behavior of adaptive filters, we take expectations of both sides of (12) and write

E[‖w̃_{i+1}‖²] + E[e_a²(i)/‖u_i‖²] = E[‖w̃_i‖²] + E[e_p²(i)/‖u_i‖²]  (13)

or, upon replacing the a posteriori error e_p(i) by the equivalent expression (9),

E[‖w̃_{i+1}‖²] = E[‖w̃_i‖²] − 2µ E[e_a(i)f[e(i)]] + µ² E[‖u_i‖² f²[e(i)]].  (14)

This averaged form of the energy relation is the starting point of our analysis. In the course of answering the adaptive filtering questions, we will not attempt to develop (14) into a self-contained recursion as is usually done in the literature. Rather, our efforts will be centered around manipulating the two expectations that appear on the right-hand side of (14) by imposing as few assumptions as necessary to answer the adaptive filtering question under consideration. In particular, the following two assumptions will be used throughout our analysis:

(AN) The noise sequence v(i) is independent, identically distributed, and independent of the input sequence u_i.

(AG) The filter is long enough such that e_a(i) is Gaussian.

The independence assumption on the noise is valid in many practical situations. Notice, however, that we make no assumption on the noise statistics, which is contrary to the Gaussian restriction that is sometimes imposed in the literature (e.g., [6, 9, 17, 18]).

Assumption (AG) is justified for long filters by the central limit theorem. As such, the validity of the assumption is dependent on the filter order M. Nevertheless, the assumption remains as valid in the initial stages of adaptation as it is in the final stages. This comes contrary to the linearization arguments that are usually employed when dealing with error nonlinearities and which are only valid in the final stages of adaptation (see [6, 7, 8]). By expressing the two expectations in (14) in terms of the second-order moment E[e_a²(i)], we basically bypass the need for linearization.

3. MEAN-SQUARE STABILITY

Stability is usually studied in the literature by first developing (14) into a self-contained recursion and subsequently determining conditions on the step-size in order to guarantee the stability of the recursion. As we will now see, we can study stability directly from (14), thus doing away with the self-contained recursion and with any auxiliary assumptions that are invoked to develop this recursion. In particular, starting from (14), we pursue a Lyapunov approach to stability where we provide a nontrivial upper bound on µ for which E[‖w̃_i‖²] remains uniformly bounded for all i. More specifically, we will show how to calculate a bound µ_0 for which

µ ≤ µ_0 ⟹ E[‖w̃_i‖²] ≤ C < ∞  (15)

for some constant C.

3.1. A monotone sequence of weight energies

Starting from (14), it is easy to see that

E[‖w̃_{i+1}‖²] ≤ E[‖w̃_i‖²]  ⟺  −2µ E[e_a(i)f[e(i)]] + µ² E[‖u_i‖² f²[e(i)]] ≤ 0.  (16)

Thus, if we choose µ such that for all i

µ ≤ 2 E[e_a(i)f[e(i)]] / E[‖u_i‖² f²[e(i)]],  (17)

then the sequence E[‖w̃_i‖²] will be decreasing and (being bounded from below) also convergent. Alternatively, a sufficient condition for stability would be

µ ≤ 2 inf_{i≥0} E[e_a(i)f[e(i)]] / (E[‖u_i‖⁴]^{1/2} E[f⁴[e(i)]]^{1/2}),  (18)

where we appealed to the Cauchy-Schwarz inequality to bound the denominator by

E[‖u_i‖² f²[e(i)]] ≤ E[‖u_i‖⁴]^{1/2} E[f⁴[e(i)]]^{1/2}.  (19)

To proceed further, the Gaussian assumption on e_a(i) can be put to use to evaluate the expectations in (18). In particular, the expectations can be written as functions of the second moment¹ E[e_a²(i)]. This prompts us to define²

hG[E[e_a²(i)]] ≜ E[e_a(i)f[e(i)]] / E[e_a²(i)],  (20)

hC[E[e_a²(i)]] ≜ E[f⁴[e(i)]].  (21)

For future reference, hG is tabulated in Table 2 for the error nonlinearities of Table 1.

¹ Since e_a(i) is assumed Gaussian and independent of v(i), we can, for example, write

E[f⁴[e_a(i)+v(i)]] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f⁴[e_a+v] (1/√(2π E[e_a²(i)])) e^{−e_a²/2E[e_a²(i)]} p_v(v) de_a dv
= ∫_{−∞}^{∞} p_v(v) ( ∫_{−∞}^{∞} f⁴[e_a+v] (1/√(2π E[e_a²(i)])) e^{−e_a²/2E[e_a²(i)]} de_a ) dv.

The inner integral depends on e_a(i) through E[e_a²(i)] only and, hence, so does E[f⁴[e_a(i)+v(i)]].

Upon substituting (20) and (21) into (18), we see that a sufficient condition for convergence is that

µ ≤ (2 / E[‖u_i‖⁴]^{1/2}) (inf_{i≥0} E[e_a²(i)] · hG[E[e_a²(i)]] / √(hC[E[e_a²(i)]])),  (22)

where we moved the expectation E[‖u_i‖⁴] outside the minimization since the input data is assumed stationary. Observe now that all terms in the above minimization are functions of E[e_a²(i)]. We can therefore rewrite (22) as

µ ≤ (2 / E[‖u_i‖⁴]^{1/2}) (inf_{E[e_a²(i)]} E[e_a²(i)] · hG[E[e_a²(i)]] / √(hC[E[e_a²(i)]])).  (23)

We emphasize that the minimization takes place over the possible values of E[e_a²(i)] only; these values are not arbitrary but correspond to those assumed by the learning curve of the adaptive filter. As it stands, the bound (23) is still not useful, and we need to replace it by a time-independent bound.

3.1.1 Removing the time dependence

We can replace the range of feasible values of E[e_a²(i)] by the larger set

Ω = {E[e_a²] : 0 ≤ E[e_a²] < ∞}.  (24)

Minimization over Ω is easier to carry out and we additionally have

inf_{E[e_a²]∈Ω} (E[e_a²] · hG[E[e_a²]] / √(hC[E[e_a²]])) ≤ inf_{E[e_a²(i)]} (E[e_a²(i)] · hG[E[e_a²(i)]] / √(hC[E[e_a²(i)]])).  (25)

Almost always, however, minimization over Ω yields a null value and hence is useless. A more intelligent choice of the feasible set is thus called for.

3.1.2 A lower bound on E[e_a²(i)]

A nonzero lower bound on E[e_a²(i)] can be obtained by noting that E[e_a²(i)] cannot be lower than the Cramer-Rao bound λ associated with the underlying estimation process, that is, the problem of estimating the random quantity u_i w^o

² The subscript in hG points to the fact that the Gaussian assumption (AG) is the major assumption in evaluating E[e_a(i)f[e(i)]]. The subscript in hC is a reminder that the Cauchy-Schwarz inequality is the key step in approximating E[‖u_i‖² f²[e_a(i)+v(i)]].


Table 2: hG[·] for the error nonlinearities of Table 1 (σ²_ea ≜ E[e_a²(i)]).

Algorithm     | hG[σ²_ea] (with v(i) Gaussian)                | hG[σ²_ea] (general noise)
LMS           | 1                                             | 1
LMF           | 3(σ²_ea + σ²_v)                               | 3(σ²_ea + σ²_v)
LMF family    | (2k+2)!/(2^{k+1}(k+1)!) · (σ²_ea + σ²_v)^k    | Σ_{j=0}^{k} (2k+1 choose 2j+1)(2j+1)!! σ_ea^{2j} E[v^{2(k−j)}(i)]
LMMN          | a + 3bσ²_v + 3bσ²_ea                          | a + 3bσ²_v + 3bσ²_ea
Sign error    | √(2/π) / √(σ²_ea + σ²_v)                      | √(2/π) (1/σ_ea) E[e^{−v²(i)/2σ²_ea}]
Sat. nonlin.  | σ_z / √(σ²_ea + σ²_v + σ²_z)                  | (σ_z/√(σ²_ea + σ²_z)) E[e^{−v²(i)/2(σ²_ea + σ²_z)}]

by using u_i w_i (see [32, page 72]). Thus, we can write

E[e_a²(i)] ≥ λ  (26)

and subsequently replace Ω with the smaller set

Ω′ = {E[e_a²] : λ ≤ E[e_a²]}.  (27)

This results in the tighter bound

µ ≤ (2 / E[‖u_i‖⁴]^{1/2}) (inf_{E[e_a²]∈Ω′} E[e_a²] · hG[E[e_a²]] / √(hC[E[e_a²]])).  (28)
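The closed forms tabulated in Table 2 can be spot-checked by Monte Carlo simulation. The sketch below checks the sign-error entry for Gaussian noise; the sample size and variances are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

# Monte-Carlo check of the sign-error row of Table 2 with Gaussian noise:
# hG[s] = E[ea * sign(ea + v)] / E[ea^2] should equal sqrt(2/pi)/sqrt(s + sv2).
rng = np.random.default_rng(2)
n = 2_000_000
s, sv2 = 0.5, 0.25                        # illustrative sigma_ea^2, sigma_v^2
ea = np.sqrt(s) * rng.standard_normal(n)  # Gaussian a priori error, per (AG)
v = np.sqrt(sv2) * rng.standard_normal(n) # Gaussian noise, independent of ea
hG_mc = np.mean(ea * np.sign(ea + v)) / s
hG_theory = np.sqrt(2 / np.pi) / np.sqrt(s + sv2)
```

The empirical ratio agrees with the tabulated closed form to within the Monte-Carlo sampling error.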

3.1.3 An upper bound on E[e_a²(i)]

It turns out that for adaptive filters that employ linear or sublinear error nonlinearities (e.g., the LMS and sign algorithms), the feasible set Ω′ defined by (27) is good enough (as we will show in the examples further ahead). In other words, the set is small enough to produce a positive upper bound on the step size for stability. In other cases, for example, the LMF algorithm and its family, we need a more compact set, and an upper bound on E[e_a²(i)] is also necessary. Not surprisingly perhaps, the upper bound depends on the initial condition of the adaptive filter.

To this end, observe that since e_a(i) is Gaussian, by assumption, it satisfies

E[e_a²(i)] = (π/2)(E[|e_a(i)|])² = (π/2)(E[|u_i w̃_i|])² ≤ (π/2)(E[‖u_i‖ ‖w̃_i‖])²,  (29)

where the last step follows from the Cauchy-Schwarz inequality. To proceed further, we apply the same inequality to the expectation operator this time, splitting it as

E[e_a²(i)] ≤ (π/2)(E[‖u_i‖²]^{1/2} E[‖w̃_i‖²]^{1/2})² = (π/2) E[‖u_i‖²] E[‖w̃_i‖²] = (π/2) Tr(R) E[‖w̃_i‖²].  (30)

Here R denotes the covariance matrix of the regression vector u_i. Now since the bounds (18) or (23) on µ ensure that E[‖w̃_i‖²] is decreasing, we have

E[‖w̃_i‖²] ≤ E[‖w̃_0‖²].  (31)

This gives us the desired upper bound

E[e_a²(i)] ≤ (π/2) Tr(R) E[‖w̃_0‖²].  (32)

The two bounds (26) and (32) produce the alternative feasibility set

Ω″ = {E[e_a²] : λ ≤ E[e_a²] ≤ (π/2) Tr(R) E[‖w̃_0‖²]}  (33)

which leads to the following conclusion.

Theorem 1 (stability). Consider an adaptive filter of the form

w_{i+1} = w_i + µ u_i^T f[e(i)],  i ≥ 0,  (34)

where e(i) = d(i) − u_i w_i and d(i) = u_i w^o + v(i). Assume the noise sequence v(i) is i.i.d. and independent of u_i, and that the filter is long enough so that e_a(i) = u_i(w^o − w_i) is Gaussian. Then sufficient conditions for stability are

µ ≤ (2 / E[‖u_i‖⁴]^{1/2}) (inf_{E[e_a²]∈Ω′} E[e_a²] · hG[E[e_a²]] / √(hC[E[e_a²]]))  (35)

or

µ ≤ (2 / E[‖u_i‖⁴]^{1/2}) (inf_{E[e_a²]∈Ω″} E[e_a²] · hG[E[e_a²]] / √(hC[E[e_a²]])),  (36)

where

Ω′ = {E[e_a²] : λ ≤ E[e_a²]},
Ω″ = {E[e_a²] : λ ≤ E[e_a²] ≤ (π/2) Tr(R) E[‖w̃_0‖²]}  (37)

and λ is the Cramer-Rao bound associated with the problem of estimating the random quantity u_i w^o by using u_i w_i.

As indicated earlier, the bound (35) will be zero for superlinear functions f[·] (e.g., LMF and LMMN), in which case the tighter bound (36) will be more useful. Notice also that the above bounds are derived without relying on any independence-like assumptions.

3.2. Examples: explicit bounds on µ for stability

3.2.1 The LMS algorithm

Instead of applying Theorem 1, we can in the LMS case be more specific. Thus, starting from (17), we obtain

µ ≤ 2 inf_{E[e_a²(i)]} E[e_a(i)e(i)] / E[‖u_i‖² e²(i)]
  = 2 inf_{E[e_a²(i)]} E[e_a²(i)] / (E[‖u_i‖² e_a²(i)] + σ²_v E[‖u_i‖²]).

Since E[‖u_i‖² e_a²(i)] ≤ E[‖u_i‖⁴]^{1/2} E[e_a⁴(i)]^{1/2} by the Cauchy-Schwarz inequality, it suffices to require

µ ≤ 2 inf_{E[e_a²(i)]} E[e_a²(i)] / (E[‖u_i‖⁴]^{1/2} E[e_a⁴(i)]^{1/2} + σ²_v E[‖u_i‖²])
  = 2 inf_{E[e_a²(i)]} E[e_a²(i)] / (√3 E[‖u_i‖⁴]^{1/2} E[e_a²(i)] + σ²_v E[‖u_i‖²]),  (38)

where the last line follows from the Gaussian assumption (AG). By performing the minimization over Ω′, we obtain the tighter bound

µ ≤ 2 inf_{E[e_a²(i)]≥λ} E[e_a²(i)] / (√3 E[‖u_i‖⁴]^{1/2} E[e_a²(i)] + σ²_v E[‖u_i‖²])
  = 2λ / (√3 E[‖u_i‖⁴]^{1/2} λ + σ²_v E[‖u_i‖²]).  (39)

For binary inputs, stability of the LMS algorithm can be established even without the Gaussian assumption (AG). For then, ‖u_i‖² = M and stability is guaranteed if

µ ≤ 2 inf_{E[e_a²(i)]} E[e_a(i)e(i)] / E[‖u_i‖² e²(i)] = (2/M) inf_{E[e_a²(i)]} E[e_a²(i)] / (E[e_a²(i)] + σ²_v),

which in turn holds whenever

µ ≤ (2/M) inf_{E[e_a²]≥λ} E[e_a²] / (E[e_a²] + σ²_v) = (2/M) λ/(λ + σ²_v).  (40)

3.2.2 The sign algorithm

For the sign algorithm, hC[·] = 1 and the bound (35) reads

µ ≤ (2 / E[‖u_i‖⁴]^{1/2}) inf_{E[e_a²]≥λ} √(2/π) (E[e_a²])^{1/2} E[e^{−v²(i)/2E[e_a²]}]
  = √(8/π) (λ^{1/2} / E[‖u_i‖⁴]^{1/2}) E[e^{−v²(i)/2λ}].  (41)

3.2.3 The LMF algorithm

For the LMF algorithm, we employ the tighter bound (36):

µ ≤ (2 / E[‖u_i‖⁴]^{1/2}) inf_{E[e_a²(i)]∈Ω″} E[e_a²(i)] · hG[E[e_a²(i)]] / √(hC[E[e_a²(i)]])  (42)
  = (2 / E[‖u_i‖⁴]^{1/2}) inf_{E[e_a²(i)]∈Ω″} 3E[e_a²(i)](E[e_a²(i)] + σ²_v) / E[e^{12}(i)]^{1/2}.  (43)

Using the Gaussian assumption, it is easy to evaluate the expectation E[e^{12}(i)] and subsequently carry out the minimization in (43). However, this is not necessary, for all we need is to make sure that the bound in (43) is positive. Since E[e_a²]hG/√hC is strictly positive over Ω″ and since Ω″ is compact, we conclude that

inf_{E[e_a²]∈Ω″} E[e_a²] · hG[E[e_a²]] / √(hC[E[e_a²]]) > 0.  (44)

For design purposes, we could determine the infimum of E[e_a²]hG/√hC over Ω″. Here we are interested in the simpler task of establishing stability.

Remark. Linearization was employed in [4] to prove the stability of the LMF algorithm. While this might be reasonable for steady-state analysis, it need not be valid when stability is concerned. Notice that no linearization arguments were used here. Observe also that for the LMF algorithm, the infimum of E[e_a²]hG/√hC over Ω′ is zero. That explains why we had to perform the minimization over the smaller set Ω″, which is initial-condition dependent. This suggests that the performance of the LMF is initial-condition dependent too, as is often confirmed by simulation. In general, this is expected to be the case for algorithms employing super-nonlinearities.
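Under (AG) with Gaussian noise, e(i) is Gaussian with variance E[e_a²(i)] + σ²_v, so E[e^{12}(i)] = 10395 (E[e_a²(i)] + σ²_v)⁶ (the standard 12th-order Gaussian moment). The sketch below evaluates the LMF bound (43) by gridding the compact set Ω″; the values chosen for λ, the Ω″ ceiling, the noise power, and the input moment are purely illustrative, not taken from the paper.

```python
import numpy as np

# Sketch: evaluate the LMF step-size bound (43) over a gridded Omega'' = [lam, U].
# Under (AG) with Gaussian noise, E[e^12] = 10395 * (s + sv2)^6.
def lmf_mu_bound(lam, U, sv2, Eu4):
    s = np.linspace(lam, U, 10_000)                      # grid over Omega''
    g = 3 * s * (s + sv2) / np.sqrt(10395.0 * (s + sv2) ** 6)
    return (2.0 / np.sqrt(Eu4)) * g.min()                # bound (43)

mu_max = lmf_mu_bound(lam=1e-3, U=1.0, sv2=0.01, Eu4=3.0)  # illustrative values
```

Because Ω″ is compact and the integrand is strictly positive on it, the computed bound is strictly positive, in line with (44); over the unbounded set Ω′ the same infimum would tend to zero.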

4. STEADY-STATE BEHAVIOR

To investigate the steady-state behavior, we again start from the averaged energy relation (14). Assuming that the filter is stable, it should attain a steady state where it holds that

lim_{i→∞} E[‖w̃_{i+1}‖²] = lim_{i→∞} E[‖w̃_i‖²].  (45)

Therefore, (14) becomes in steady state

lim_{i→∞} E[e_a(i)f[e(i)]] = (µ/2) lim_{i→∞} E[‖u_i‖² f²[e(i)]].  (46)

Both of the expectations in (46) were dealt with as part of the stability analysis. Using the Gaussian assumption, we have already argued that E[e_a(i)f[e(i)]] can be written as a function of E[e_a²(i)]. In particular, from (20) we have

E[e_a(i)f[e(i)]] = E[e_a²(i)] hG[E[e_a²(i)]],  (47)

where the function hG was tabulated in Table 2 for the nonlinearities in Table 1.

In a similar fashion, we now proceed to evaluate E[‖u_i‖² f²[e(i)]] in terms of E[e_a²(i)] (rather than bound it as in the stability analysis). This prompts us to introduce the following "asymptotic" assumption:

(AU) The random variables ‖u_i‖² and f²[e(i)] are asymptotically uncorrelated, that is,

lim_{i→∞} E[‖u_i‖² f²[e(i)]] = E[‖u_i‖²] lim_{i→∞} E[f²[e(i)]].  (48)

Assumption (AU) has the same spirit as the independence assumption³ but is weaker. For one thing, relation (48) is exact for constant-modulus inputs while the independence assumption is not. Moreover, the separation property (48) need only be satisfied asymptotically. Fortunately, assumption (AU) acts in harmony with the Gaussianity assumption on e_a(i) in that it also becomes more realistic as the filter gets longer. For then, by an ergodic argument, ‖u_i‖² behaves like the second moment of the input (scaled by the filter length M).

To proceed, we use the Gaussian assumption (AG) on e_a(i) to express the expectation E[f²[e(i)]] as a function of the second-order moment E[e_a²(i)], which motivates the definition

hU[E[e_a²(i)]] ≜ E[f²[e(i)]].  (49)

The function hU is tabulated in Table 3 for the nonlinearities of Table 1.

This definition, together with (48), yields

lim_{i→∞} E[‖u_i‖² f²[e(i)]] = E[‖u_i‖²] lim_{i→∞} hU[E[e_a²(i)]] = Tr(R) lim_{i→∞} hU[E[e_a²(i)]].  (50)

Upon substituting (47) and (50) into (46), we obtain

lim_{i→∞} E[e_a²(i)] = (µ/2) Tr(R) (lim_{i→∞} hU[E[e_a²(i)]]) / (lim_{i→∞} hG[E[e_a²(i)]]).  (51)

Now denote the steady-state mean-square error by

S ≜ lim_{i→∞} E[e_a²(i)].  (52)

Then, since both hU and hG are analytic in their arguments, we have

lim_{i→∞} hU[E[e_a²(i)]] = hU[S],  lim_{i→∞} hG[E[e_a²(i)]] = hG[S],  (53)

so that the MSE is the positive root of the nonlinear equation

S = (µ/2) Tr(R) hU[S]/hG[S].  (54)

In other words, the MSE is the fixed point of the function (µ/2)Tr(R)(hU[S]/hG[S]). For a given error nonlinearity f, we can evaluate hU and hG (as done in Tables 2 and 3) and subsequently solve for the MSE. Our findings are summarized in the following theorem.

³ The independence assumption requires that the input regressors u_i form an independent and identically distributed sequence. This assumption is heavily used in the adaptive filtering literature.

Theorem 2 (steady-state behavior). Consider the setting of Theorem 1. Assume further that ‖u_i‖² and f²[e(i)] are asymptotically uncorrelated, and that the filter is mean-square stable with MSE denoted by S. Then the following equality holds:

S = (µ/2) Tr(R) hU[S]/hG[S].  (55)

To demonstrate the use of this theorem, we provide in what follows expressions for the mean-square error of some of the nonlinearities in Table 1.

4.1. Examples: MSE expressions

4.1.1 The LMS algorithm

In the LMS case, (54) reads

\[ S = \frac{\mu}{2}\,\mathrm{Tr}(R)\big(S + \sigma_v^2\big) \tag{56} \]

or, equivalently,

\[ S = \frac{\mu\,\mathrm{Tr}(R)\,\sigma_v^2}{2 - \mu\,\mathrm{Tr}(R)}. \tag{57} \]

This is a well-known result that was derived in [13] by relying on the independence assumption. Using the energy-based approach of this paper, we only need the weaker assumption (AU), as indicated above and also in [27, 29].⁴
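As a quick numerical check, the LMS relation (56) can be iterated to its fixed point and compared against the closed form (57). A minimal sketch, with hypothetical values for the step size, the trace of R, and the noise variance:

```python
# Hypothetical design values; mu*Tr(R) must be below 2 for mean-square stability.
mu, trR, sigv2 = 0.01, 10.0, 0.1

S = 0.0
for _ in range(1000):
    S = 0.5 * mu * trR * (S + sigv2)   # fixed-point iteration of (54) for LMS

# closed form (57)
S_closed = mu * trR * sigv2 / (2.0 - mu * trR)
assert abs(S - S_closed) < 1e-12
```

The iteration is a contraction whenever µTr(R)/2 < 1, so it converges to the same value that (57) gives in closed form.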

4.1.2 The sign algorithm

In the case of the sign algorithm, we can show that the MSE satisfies

\[ S = \mu\sqrt{\frac{\pi}{8}}\,\mathrm{Tr}(R)\, \frac{\sqrt{S}}{E\big[e^{-v^2(i)/2S}\big]}. \tag{58} \]

This relation applies irrespective of assumption (AU). We have only appealed to the Gaussian assumption (AG) in arriving at (58) (see also [28]). The expectation that appears

⁴This result remains valid irrespective of the Gaussian assumption on e_a(i). The reason is that in the LMS case, the two expectations that appear in (46) are already quadratic in e_a(i) (see [27, 29]).


Table 3: h_U[·] for the error nonlinearities of Table 1 (σ²_{e_a} ≜ E[e_a²(i)]).

| Algorithm | h_U[σ²_{e_a}] (v(i) Gaussian) | h_U[σ²_{e_a}] (general noise) |
|---|---|---|
| LMS | σ²_{e_a} + σ²_v | σ²_{e_a} + σ²_v |
| LMF | 15(σ²_{e_a} + σ²_v)³ | 15σ⁶_{e_a} + 45σ⁴_{e_a}σ²_v + 15σ²_{e_a}E[v⁴(i)] + E[v⁶(i)] |
| LMF family | \(\frac{(4k+2)!}{2^{2k+1}(2k+1)!}(\sigma_{e_a}^2 + \sigma_v^2)^{2k+1}\) | \(\sum_{j=0}^{2k+1} \binom{4k+2}{2j}\frac{(2j)!}{2^j j!}\,\sigma_{e_a}^{2j}\,E[v^{2(2k-j+1)}(i)]\) |
| LMMN | a²(σ²_{e_a} + σ²_v) + 6ab(σ²_{e_a} + σ²_v)² + 15b²(σ²_{e_a} + σ²_v)³ | 15b²σ⁶_{e_a} + (45b²σ²_v + 6ab)σ⁴_{e_a} + (15b²E[v⁴(i)] + 12abσ²_v + a²)σ²_{e_a} + E[(bv²(i) + a)²v²(i)] |
| Sign error | 1 | 1 |
| Saturation nonlinearity | \(\sigma_z^2 \sin^{-1}\!\Big(\frac{\sigma_{e_a}^2 + \sigma_v^2}{\sigma_{e_a}^2 + \sigma_v^2 + \sigma_z^2}\Big)\) | \(\frac{\pi}{2}\sigma_z^2 - 2\sigma_z^3 \int_0^{1/\sqrt{2}} \frac{1}{\sqrt{\sigma_{e_a}^2 + \sigma_z^2(1-x^2)}}\, E\big[e^{-v^2(i)/2(\sigma_{e_a}^2 + \sigma_z^2(1-x^2))}\big]\, dx\) |

in (58) is carried over the noise pdf. By specifying the noise statistics, we obtain the following special cases:

\[ S = \begin{cases} \dfrac{\alpha\big(\alpha + \sqrt{\alpha^2 + 4\sigma_v^2}\big)}{2}, & \text{Gaussian noise}, \\[2mm] \alpha\sqrt{\dfrac{6}{\pi}}\, \dfrac{\sigma_v}{\operatorname{erf}\big(\sqrt{3\sigma_v^2/2S}\big)}, & \text{uniform noise}, \\[2mm] \alpha\sqrt{S}\, e^{\sigma_v^2/2S}, & \text{binary noise}, \end{cases} \tag{59} \]

where α = µ√(π/8) Tr(R). Each of these equations can be uniquely solved for the MSE. They were derived in [12] under the independence assumption and under an i.i.d. restriction on the input, but are rederived here without relying on these restrictions.
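The implicit cases of (59) are easy to solve in practice. For binary noise, for instance, the residual S − α√S e^{σ²_v/2S} is negative near S = 0 and increases monotonically, so a simple bisection recovers the unique MSE. A sketch with hypothetical values of α and σ²_v:

```python
import math

# Hypothetical design values: alpha = mu*sqrt(pi/8)*Tr(R); binary noise variance sigv2.
alpha, sigv2 = 0.05, 1.0

def h(S):
    # residual of the binary-noise relation in (59): S = alpha*sqrt(S)*exp(sigv2/(2S))
    return S - alpha * math.sqrt(S) * math.exp(sigv2 / (2.0 * S))

lo, hi = 0.1, 0.5           # h(lo) < 0 < h(hi): the root is bracketed
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if h(mid) < 0:
        lo = mid
    else:
        hi = mid
S = 0.5 * (lo + hi)
assert abs(h(S)) < 1e-9     # S now satisfies the binary-noise case of (59)
```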

4.1.3 The LMF algorithm

For the LMF algorithm, and with the aid of Tables 1 and 2, equation (54) takes the form

\[ S = \frac{\mu}{6}\, \frac{15S^3 + 45\sigma_v^2 S^2 + 15 m_{v,4} S + m_{v,6}}{S + \sigma_v^2}\, \mathrm{Tr}(R), \tag{60} \]

where m_{v,4} and m_{v,6} denote the fourth and sixth moments of the noise v(i). Finding the MSE is thus equivalent to finding the roots of a third-order equation, which can be done numerically. We can avoid this in the Gaussian case and obtain a closed formula for the MSE.

Gaussian noise

In the Gaussian noise case, (60) simplifies to

\[ S = \frac{5\mu}{2}\, \frac{(S + \sigma_v^2)^3}{S + \sigma_v^2}\, \mathrm{Tr}(R) = \frac{\alpha}{2}\big(S + \sigma_v^2\big)^2, \tag{61} \]

where α = 5µ Tr(R). This is a quadratic equation in S with two positive roots

\[ S = \frac{(1 - \alpha\sigma_v^2) \pm \sqrt{1 - 2\alpha\sigma_v^2}}{\alpha}. \tag{62} \]

Only the smaller root is meaningful.⁵
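A small numerical check, under hypothetical values of µ, Tr(R), and σ²_v, confirms that the root of (62) of order µ satisfies the fixed-point relation (61) and behaves like (α/2)σ_v⁴ for small step sizes, while the other root grows like 1/µ and does not correspond to a small steady-state error:

```python
import math

# Hypothetical values; alpha = 5*mu*Tr(R) as in (61).
mu, trR, sigv2 = 1e-3, 10.0, 0.1
alpha = 5.0 * mu * trR

disc = math.sqrt(1.0 - 2.0 * alpha * sigv2)
S_small = ((1.0 - alpha * sigv2) - disc) / alpha   # O(mu) root of (62)
S_large = ((1.0 - alpha * sigv2) + disc) / alpha   # O(1/mu) root: discarded

# The O(mu) root satisfies the fixed-point relation (61) ...
assert abs(S_small - 0.5 * alpha * (S_small + sigv2) ** 2) < 1e-12
# ... and is approximately (alpha/2)*sigma_v^4 for small step sizes.
assert abs(S_small - 0.5 * alpha * sigv2 ** 2) / S_small < 0.01
```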

Remarks. Although several of the MSE expressions above appeared previously in the literature, the advantage of the energy-based approach of this article is threefold:

(1) All expressions are obtained as a fallout of the same approach (i.e., by using the same expression (54)), an approach that avoids the need for a self-contained recursion.

(2) The expressions are either new (e.g., (58), (60), and (62)) or are otherwise rederived under a more relaxed set of assumptions (e.g., (57) and (59)).

(3) Here we avoid the need for linearization, as in the LMF example. It seems that calculating the steady-state error for super nonlinearities (e.g., the LMF and its family) has always involved some form of linearization argument (e.g., [4, 6, 8, 16]).

5. OPTIMUM CHOICE OF THE NONLINEARITY

In this section, we build upon the second-order analysis performed above to optimize the choice of the error nonlinearity f. To this end, consider expression (54) for the mean-square error written in a more explicit form

\[ S = \frac{\mu}{2}\,\mathrm{Tr}(R)\, \frac{E\big[f^2[e(i)]\big]}{E\big[e_a(i) f[e(i)]\big] \big/ E\big[e_a^2(i)\big]}. \tag{63} \]

⁵We can show that the larger root is O(1/µ). It is well known that the MSE is linearly proportional to the step size, and hence the larger root can be ignored.


We would like to choose a nonlinearity f that minimizes the mean-square error. If we confine our attention to the class of smooth nonlinearities, we can write, using the Gaussian assumption (AG) and Price's theorem [11, 33],

\[ E\big[e_a(i) f[e(i)]\big] = E\big[e_a(i) f[e_a(i) + v(i)]\big] = E\big[e_a(i) e(i)\big]\, E\big[f'[e(i)]\big] = E\big[e_a^2(i)\big]\, E\big[f'[e(i)]\big]. \tag{64} \]

Thus, for a smooth error nonlinearity f, the MSE takes the alternative form

\[ S = \frac{\mu}{2}\,\mathrm{Tr}(R)\, \frac{E\big[f^2[e(i)]\big]}{E\big[f'[e(i)]\big]}. \tag{65} \]

The mean-square error cannot be reduced below the limit λ, which corresponds to the Cramer-Rao bound of the underlying estimation process. We can thus write

\[ \frac{E\big[f^2[e(i)]\big]}{E\big[f'[e(i)]\big]} \ge \frac{2}{\mu\,\mathrm{Tr}(R)}\,\lambda \triangleq \alpha. \tag{66} \]

Now let p_e denote the pdf of e(i). We claim that the nonlinearity

\[ f[e(i)] = -\alpha\, \frac{p_e'[e(i)]}{p_e[e(i)]} \tag{67} \]

attains the lower bound on the MSE and hence is optimum. To see this, we evaluate the numerator and denominator of (66) for this choice of f. Using integration by parts, we have

\[ E\big[f'[e(i)]\big] = \int_{-\infty}^{\infty} f'[e(i)]\, p_e[e(i)]\, de(i) = f[e(i)]\, p_e[e(i)]\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} f[e(i)]\, p_e'[e(i)]\, de(i). \tag{68} \]

For the choice (67) of f , this yields

E[f ′[e(i)]

] = −αp′e[e(i)]∣∣∣∞−∞

+α∫∞−∞

(p′e[e(i)]

)2

pe[e(i)]de(i)

(69)

or, assuming that p′e decays to zero as e(i) approaches ±∞,

E[f ′[e(i)]

] = α∫∞−∞

(p′e[e(i)]

)2

pe[e(i)]de(i). (70)

Now for the same choice of f, we have

\[ E\big[f^2[e(i)]\big] = \alpha^2 \int_{-\infty}^{\infty} \left(\frac{p_e'[e(i)]}{p_e[e(i)]}\right)^{\!2} p_e[e(i)]\, de(i) \tag{71} \]
\[ = \alpha^2 \int_{-\infty}^{\infty} \frac{\big(p_e'[e(i)]\big)^2}{p_e[e(i)]}\, de(i). \tag{72} \]

We thus conclude that

\[ \frac{E\big[f^2[e(i)]\big]}{E\big[f'[e(i)]\big]} = \alpha. \tag{73} \]

In other words, the nonlinearity (67) ensures that the MSE approaches the minimum limit determined by the Cramer-Rao bound.
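A Monte Carlo sketch of (73): for a zero-mean Gaussian p_e with variance σ², the choice (67) reduces to f(e) = αe/σ², so f' is the constant α/σ² and the ratio E[f²]/E[f'] approaches α. The values of α and σ² below are hypothetical:

```python
import math
import random

# Hypothetical values for a quick Monte Carlo check of (73) in the Gaussian case.
random.seed(0)
alpha, sig2 = 0.2, 1.5

# For Gaussian pe, -alpha*pe'/pe reduces to f(e) = alpha*e/sig2, so f'(e) = alpha/sig2.
samples = [random.gauss(0.0, math.sqrt(sig2)) for _ in range(200000)]
Ef2 = sum((alpha * e / sig2) ** 2 for e in samples) / len(samples)
Efp = alpha / sig2                      # f' is constant here

assert abs(Ef2 / Efp - alpha) < 0.01   # the ratio approaches alpha, as (73) predicts
```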

5.1. Removing the constant α

The optimum nonlinearity is specified up to the constant α = 2λ/(µ Tr(R)). It turns out that we can use this nonlinearity without calculating α. To see this, we examine the adaptation equation (6) with the optimum choice (67) of f:

\[ w_{i+1} = w_i - \mu \left( -\alpha\, \frac{p_e'[e(i)]}{p_e[e(i)]} \right) u_i. \tag{74} \]

Since the step size µ is a design parameter that is usually varied, we can absorb the constant α into µ so that

\[ w_{i+1} = w_i - \mu \left( -\frac{p_e'[e(i)]}{p_e[e(i)]} \right) u_i, \tag{75} \]

in which case the optimum nonlinearity effectively reads

\[ f_{\mathrm{opt}}[e(i)] = -\frac{p_e'[e(i)]}{p_e[e(i)]}. \tag{76} \]

Theorem 3 (optimum nonlinearity). Consider the setting of Theorem 2. Let p_e denote the pdf of the estimation error e(i). The optimum nonlinearity that minimizes the steady-state mean-square error is given by

\[ f_{\mathrm{opt}}[e(i)] = -\frac{p_e'[e(i)]}{p_e[e(i)]}. \tag{77} \]

5.2. Incorporating the Gaussian assumption on e_a(i)

Implementing the optimum nonlinearity (67) or (76) requires that we evaluate the pdf of e(i) at each time instant. It turns out, however, that we can replace this task with the simpler task of evaluating the variance of e_a(i) in addition to specifying the (time-invariant) noise pdf. To this end, recall that our derivation of the MSE expression relied on a Gaussian assumption on the error e_a(i), and, hence, so does our subsequent derivation of the optimum error nonlinearity. Fortunately, this assumption helps us obtain a more explicit expression for the optimum nonlinearity (76). To see this, notice that the estimation error is the sum of two independent random variables,

\[ e(i) = e_a(i) + v(i). \tag{78} \]

Therefore, its pdf, p_e[e(i)], is the convolution of the pdfs of e_a(i) and v(i), that is,

\[ p_e[e(i)] = p_{e_a}[e(i)] \star p_v[e(i)] \tag{79} \]
\[ = \frac{1}{\sqrt{2\pi\sigma_{e_a}^2}}\, e^{-e^2(i)/2\sigma_{e_a}^2} \star p_v[e(i)]. \tag{80} \]


Here σ²_{e_a} denotes the variance of e_a(i).⁶ The above calculation reduces the determination of the optimum nonlinearity to the task of modeling the noise pdf.

5.3. Estimating the variance σ²_{e_a}

Perhaps the most challenging task of the algorithm is estimating the variance of the a priori error e_a(i), a nonstationary quantity. The easiest way out is to set σ²_{e_a} to some constant value. Alternatively, as done in the simulations, we first estimate the variance of e(i) using a window of samples of e(i), and subsequently estimate σ²_{e_a} from

\[ \sigma_{e_a}^2 = \sigma_e^2 - \sigma_v^2. \tag{81} \]

Furthermore, to avoid malfunctioning of the algorithm, we confine the estimate σ²_{e_a} to a bounded interval [a, b] that is determined by the designer.

Remarks. (1) From the above, we see that the derivation of the optimum nonlinearity blends smoothly with the stability and steady-state analysis in that it relies on the same set of assumptions, and is also obtained as a fallout of the same energy conservation approach.

(2) Our derivation of the optimum nonlinearity shares another feature with the (stability) analysis in that it makes use of the fundamental limit set by the Cramer-Rao bound of the underlying estimation process.

(3) Also note that no heavy machinery is appealed to in developing the optimum nonlinearity, maintaining the general themes of clarity and simplicity. In particular, we avoid the variational approaches that are usually employed in the literature in designing optimum adaptation schemes (see [8, 20]).

(4) The nonlinearity (76) is derived under simpler assumptions compared to what is available in the literature. For instance, we employ a weaker version of the independence assumption (compare with [8, 10, 19, 20, 21]) and make no restriction on the color or statistics of the input (compare with [8, 10, 20]). The nonlinearity (76) also applies irrespective of the noise statistics or whether its pdf is symmetric or not (contrary to what is assumed in [8, 19]). We only require the noise to have zero mean.

(5) More importantly, perhaps, we avoid the need for any linearization arguments, making the nonlinearity (76) accurate over all stages of adaptation. In contrast, the optimum nonlinearity

\[ f[e(i)] = -\frac{p_v'[e(i)]}{p_v[e(i)]} \tag{82} \]

that was derived in [8] using linearization arguments is only accurate in the final stages of adaptation. In fact, the more accurate expression (76) for the optimum nonlinearity collapses to (82) as the filter reaches its steady state.

(6) Notice further that expression (76) for the optimum nonlinearity applies irrespective of whether the noise pdf is smooth enough (differentiable) or not. Thanks to the smoothing convolution operator (see (80)), we can, for example, directly calculate the optimum nonlinearity for binary and uniform noise (see the examples below). This comes contrary to the nonlinearity (82), where an artificial smoothing kernel needs to be employed for such singular cases [8].

⁶The time dependence of σ²_{e_a} is suppressed for notational convenience.

5.4. Examples

In what follows, we show how the error nonlinearity manifests itself for different noise statistics.

Gaussian noise

When v is Gaussian, so is the estimation error e(i) (since e(i) is then the sum of two independent Gaussian random variables, v(i) and e_a(i)). In this case, the optimum nonlinearity (76) becomes

\[ f_{\mathrm{opt}}[e(i)] = -\frac{p_e'[e(i)]}{p_e[e(i)]} = \frac{1}{\sigma_e^2}\, e(i), \tag{83} \]

which, up to the scaling factor 1/σ²_e, is the error function of the LMS. Therefore, the LMS is the optimum adaptive algorithm in the presence of Gaussian noise.

Laplacian noise

When v follows a Laplacian distribution, its pdf takes the form

\[ p_v[v] = \tfrac{1}{2}\, e^{-|v|}. \tag{84} \]

Upon substituting this expression into (80), we can show that the pdf of e(i) takes the form

\[ p_e[e(i)] = \frac{1}{4}\, e^{\sigma_{e_a}^2/2} \left[ e^{e(i)} \left( 1 - \operatorname{erf}\!\left[ \frac{e(i) + \sigma_{e_a}^2}{\sqrt{2\sigma_{e_a}^2}} \right] \right) + e^{-e(i)} \left( 1 + \operatorname{erf}\!\left[ \frac{e(i) - \sigma_{e_a}^2}{\sqrt{2\sigma_{e_a}^2}} \right] \right) \right]. \tag{85} \]

(85)

After some straight forward manipulations, we can show thatthis leads to the following form for the nonlinearity −p′e/pe

fopt = −(ee(i)

(1− erf

[(e(i)+ σ2

ea)/√

2σ2ea

]

−√

2/πσ2eae

−(e(i)+σ 2ea )

2/2σ 2ea

)

− e−e(i)(

1+ erf[(e(i)− σ2

ea)/√

2σ2ea

]

−√

2/πσ2eae

−(e(i)−σ 2ea )

2/2σ 2ea

))

×(ee(i)

(1− erf

[(e(i)+ σ2

ea)/√

2σ2ea

])

+ e−e(i)(

1+ erf[(e(i)− σ2

ea)/√

2σ2ea

]))−1

,

(86)


where erf is the error function defined by

\[ \operatorname{erf}[x] \triangleq \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt. \tag{87} \]

Uniform noise

When v is uniformly distributed over [−1, 1], we have

\[ p_v[v] = \begin{cases} \tfrac{1}{2}, & -1 \le v \le 1, \\ 0, & \text{otherwise}. \end{cases} \tag{88} \]

Upon substituting (88) into (80), we obtain

\[ p_e[e(i)] = \frac{1}{4} \left( \operatorname{erf}\!\left[ \frac{e(i)+1}{\sqrt{2\sigma_{e_a}^2}} \right] - \operatorname{erf}\!\left[ \frac{e(i)-1}{\sqrt{2\sigma_{e_a}^2}} \right] \right). \tag{89} \]

We can use this expression to show that

\[ f_{\mathrm{opt}}[e(i)] = \sqrt{\frac{8}{\pi}}\, \frac{1}{\sigma_{e_a}}\, e^{-(e^2(i)+1)/2\sigma_{e_a}^2}\; \frac{\sinh\big[e(i)/\sigma_{e_a}^2\big]}{\operatorname{erf}\big[(e(i)+1)/\sqrt{2\sigma_{e_a}^2}\big] - \operatorname{erf}\big[(e(i)-1)/\sqrt{2\sigma_{e_a}^2}\big]}. \tag{90} \]

Binary noise

In the binary noise case, we have

\[ v(i) = \begin{cases} +1 & \text{with probability } 0.5, \\ -1 & \text{with probability } 0.5. \end{cases} \tag{91} \]

In this case, the optimum nonlinearity reads

\[ f_{\mathrm{opt}}[e(i)] = \frac{1}{\sigma_{e_a}^2} \left( e(i) - \tanh\!\left[ \frac{e(i)}{\sigma_{e_a}^2} \right] \right). \tag{92} \]

5.4.1 Simulations

Here we use simulations to illustrate the favorable behavior of the optimum algorithm in comparison to the LMS. The system to be identified is an FIR channel with 15 taps, normalized so that the SNR relative to the input and output is the same (10 dB in our case). The input is taken to be Gaussian, while the additive output noise is assumed to be binary or Laplacian. The variance σ²_e is estimated using the most recent four samples of e(i), and the estimate is in turn used in (81) to estimate the variance of e_a(i). Whenever the estimate σ²_{e_a} falls outside the range [σ²_v/2, 5σ²_v] ([σ²_v/10, 5σ²_v]) in the binary (Laplacian) noise case, we enforce the assignment σ²_{e_a} = σ²_v instead. The experiment is averaged over one thousand runs.

The LMS and the optimum adaptive algorithms are compared (Figures 1 and 2) in terms of their learning curves, that is, the evolution of E[‖w̃_i‖²] with time (also known as the mean-square deviation or MSD). We also plot the nonlinearities employed by both algorithms. Since the optimum nonlinearity is time varying (through its dependence on σ²_{e_a}), it has a stochastic nature. The plots thus show the optimum nonlinearities in their averaged forms.

5.5. Relation between optimum and other nonlinearities

The optimum nonlinearities (76) and (82) are expressed in terms of some pdf (p_e[e(i)] or p_v[e(i)]) and its derivative. This makes the nonlinearities difficult to implement since the pdf is usually unknown and/or time-varying. Even if the pdf is known, the corresponding nonlinearity would be expressed in terms of transcendental functions (e.g., as in (90) and (92)), which do not lend themselves to real-time implementations. This is compounded by the fact that different distributions (i.e., pdfs) call for different nonlinearities. Thus, the optimum nonlinearities defy an important feature of least-mean-square algorithms, namely computational simplicity. A more alarming issue, though, is that these nonlinearities do not seem to relate to the ubiquitous LMS algorithm or its common variants. In what follows, we address both of these issues for the nonlinearity (76) by representing the pdf p_e[e(i)] in an Edgeworth expansion, which we now digress to introduce. The nonlinearity (82) can be dealt with similarly.

5.5.1 Edgeworth expansion of p_e

Let γ_j denote the jth cumulant of e(i) and let σ²_e denote its variance. Assuming p_e to be even, its Edgeworth expansion is given by [34]

\[ p_e[e(i)] = \frac{1}{\sigma_e}\, \phi\!\left[ \frac{e(i)}{\sigma_e} \right] \sum_{j=0}^{\infty} a_{2j}\, He_{2j}\!\left[ \frac{e(i)}{\sigma_e} \right], \tag{93} \]

where φ is the standard (zero-mean, unit-variance) Gaussian pdf and He_{2j} is the Hermite polynomial of degree 2j [35]. The coefficients a_j are defined recursively by

\[ a_0 = 1, \quad a_1 = 0, \quad a_2 = 0, \]
\[ a_j = \frac{1}{j} \left[ \frac{\gamma_1}{\sigma_e}\, a_{j-1} + \left( \frac{\gamma_2}{\sigma_e^2} - 1 \right) a_{j-2} + \sum_{m=3}^{j} \frac{\gamma_m}{(m-1)!\, \sigma_e^m}\, a_{j-m} \right] \quad (j \ge 3). \tag{94} \]

Since p_e is even, we can show that a_{2j+1} = 0 for all j, and only the even-indexed coefficients appear in (93). Thus, p_e[e(i)]/φ[e(i)/σ_e] is an infinite linear combination of Hermite polynomials of even degree. Alternatively, since He_{2j}[e(i)/σ_e] is a linear combination of even powers of its argument, so is the expansion (93). This series expansion is therefore similar to the familiar Taylor expansion except for the fact that it is expressed in statistically relevant terms, the a_j's, which are defined in terms of the cumulants rather than the derivatives of p_e. The expansion (93) is also different from a Taylor series in that we are not interested in its convergence as much as in representing p_e in as few terms


[Figure: left panel, the error (non)linearities e(i) and f_opt[e(i)] of the LMS and optimum algorithms; right panel, the MSD learning curves of the two algorithms over 1000 iterations.]

Figure 1: Error updates and learning curves for the LMS and optimum algorithm (binary noise case).

[Figure: left panel, the error (non)linearities e(i) and f_opt[e(i)] of the LMS and optimum algorithms; right panel, the MSD learning curves of the two algorithms over 1000 iterations.]

Figure 2: Error updates and learning curves for the LMS and optimum algorithm (Laplacian noise case).

of (93) as possible. For most practical purposes, the first few terms of (93) are sufficient for a good approximation [34].
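The recursion (94) is straightforward to implement. The sketch below takes a_0 = 1 and a_1 = a_2 = 0 (so that the leading term of (93) is the Gaussian) and checks the classical value a_4 = γ_4/(4! σ_e⁴) for an even pdf; the cumulant and variance values are hypothetical:

```python
from math import factorial

# Sketch of the recursion (94) with a0 = 1, a1 = a2 = 0 and zero-mean,
# variance-matched cumulants (gamma1 = 0, gamma2 = sigma_e^2), so the
# gamma1 and (gamma2/sigma_e^2 - 1) terms of (94) vanish.
def edgeworth_coeffs(cumulants, sige, jmax):
    # cumulants[m] = gamma_m for m >= 3; absent entries are treated as zero
    a = [1.0, 0.0, 0.0]
    for j in range(3, jmax + 1):
        s = sum(cumulants.get(m, 0.0) / (factorial(m - 1) * sige ** m) * a[j - m]
                for m in range(3, j + 1))
        a.append(s / j)
    return a

# For an even pdf (gamma3 = 0): a3 = 0 and a4 = gamma4/(4! sigma_e^4),
# the classical Gram-Charlier excess-kurtosis coefficient.
g4, sige = 0.9, 1.3
a = edgeworth_coeffs({4: g4}, sige, 4)
assert a[3] == 0.0
assert abs(a[4] - g4 / (factorial(4) * sige ** 4)) < 1e-12
```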

5.5.2 Relating the nonlinearities

With the Edgeworth expansion (93) at hand, we now show how the optimum nonlinearity (76) relates to the (non)linearities of the LMS algorithm and its variants. Using (93), we can represent the optimum nonlinearity (76), after some straightforward manipulations, as

\[ f_{\mathrm{opt}}[e(i)] = -\frac{p_e'[e(i)]}{p_e[e(i)]} = \frac{e(i)}{\sigma_e^2} - \frac{1}{\sigma_e}\, \frac{\sum_{j=2}^{\infty} (2j)\, a_{2j}\, He_{2j-1}\big[e(i)/\sigma_e\big]}{\sum_{j=0}^{\infty} a_{2j}\, He_{2j}\big[e(i)/\sigma_e\big]}. \tag{95} \]


We can finally put f_opt in a more familiar form by approximating the rational function in the last equation by its Taylor series. As the rational function is odd, the Taylor series will contain only odd powers of e(i), and we can, therefore, write

\[ f_{\mathrm{opt}}[e(i)] = \sum_{j=0}^{\infty} c_{2j+1}\, e^{2j+1}(i). \tag{96} \]

The c_{2j+1}'s can be written in terms of the a_{2j}'s, but the explicit dependence is not essential for the subsequent discussion. The following remarks are in order.

Remarks. (1) The lowest-order term in (96) is that employed by the LMS algorithm. Thus, LMS is a first-order approximation of the optimum nonlinearity, which explains the robustness of the LMS in noisy environments (see also [3] for a deterministic account of the robustness of the LMS). Moreover, the cubic term of (96) corresponds to the error nonlinearity of the LMF algorithm, while the higher-order terms are those of the LMF family.

(2) The approximation (96) also suggests that a mixture of the LMS algorithm and the LMF family of algorithms will outperform any of the individual algorithms, as such mixtures represent a better approximation of the optimum nonlinearity. The LMS-LMF mixture (also known as the LMMN algorithm) was actually simulated in [36] and shown to outperform both of the constituent algorithms. The approximation presented here not only justifies such mixtures but can actually be used to design optimal mixtures by calculating the coefficients c_{2j+1} (which, in turn, can be explicitly expressed in terms of the estimation error cumulants, γ_j).

(3) From the above, it follows that the optimum nonlinearity is nothing but an optimal mixture of familiar nonlinearities.

(4) The approximation (96) alleviates the difficulties associated with the implementation of the optimum nonlinearity. In particular, it applies irrespective of the distribution of the estimation error pdf p_e. The approximation also provides for a tradeoff between numerical simplicity and more accurate approximation of the nonlinearity. Notice, however, that (96) still calls for the estimation of the coefficients c_{2j+1}, or, equivalently, of the cumulants γ_j of the estimation error e(i).

(5) By expanding the noise pdf, p_v, in an Edgeworth series similar to (93), we can extend the above remarks to the nonlinearity (82) (see [37]).
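The mixture claim in remark (2) can be illustrated numerically: fitting c₁e + c₃e³ by least squares to the binary-noise nonlinearity on a grid approximates it strictly better than the best purely linear (LMS-like) fit. The variance and grid below are hypothetical choices:

```python
import math

# Fit c1*e + c3*e^3 to the binary-noise nonlinearity on a grid and compare
# with the best purely linear fit (nested least-squares problems).
sig2 = 1.0
fopt = lambda e: (e - math.tanh(e / sig2)) / sig2
grid = [i / 100.0 for i in range(-200, 201)]

# normal equations for least squares over span{e, e^3} and span{e}
s2 = sum(e ** 2 for e in grid)
s4 = sum(e ** 4 for e in grid)
s6 = sum(e ** 6 for e in grid)
b1 = sum(e * fopt(e) for e in grid)
b3 = sum(e ** 3 * fopt(e) for e in grid)
det = s2 * s6 - s4 * s4
c1, c3 = (s6 * b1 - s4 * b3) / det, (s2 * b3 - s4 * b1) / det

err_mix = sum((fopt(e) - c1 * e - c3 * e ** 3) ** 2 for e in grid)
err_lms = sum((fopt(e) - (b1 / s2) * e) ** 2 for e in grid)
assert err_mix < err_lms   # the LMS-LMF mixture is the better approximation
```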

6. CONCLUSION

In this paper, we pursued a unified approach to the mean-square analysis of adaptive filters with arbitrary error nonlinearities. In particular, starting from an energy conservation relation, we were able to arrive at sufficient conditions for stability without relying on any independence assumptions. Using the same relation, we also showed that the MSE is the fixed point of a nonlinear function. This nonlinear expression for the MSE was subsequently used to derive an expression for the optimum nonlinearity.

We would like to emphasize that all our results apply for any error nonlinearity and for arbitrary input color and statistics. They are obtained as a fallout of the same energy conservation relation and rely on weak assumptions that are quite accurate for long enough filters.

ACKNOWLEDGEMENTS

The work of T. Y. Al-Naffouri was partially supported by a fellowship from King Fahd University of Petroleum and Minerals, Saudi Arabia. The work of A. H. Sayed was partially supported by the National Science Foundation under grant ECS-9820765.

REFERENCES

[1] O. Macchi, Adaptive Processing: The Least Mean Squares Approach with Applications in Transmissions, John Wiley & Sons, New York, 1995.

[2] S. Haykin, Adaptive Filter Theory, Prentice Hall, Englewood Cliffs, NJ, 3rd edition, 1996.

[3] B. Hassibi, A. Sayed, and T. Kailath, "H∞ optimality of the LMS algorithm," IEEE Trans. Signal Processing, vol. 44, no. 2, pp. 267–280, 1996.

[4] E. Walach and B. Widrow, "The least mean fourth (LMF) adaptive algorithm and its family," IEEE Transactions on Information Theory, vol. 30, no. 2, pp. 275–283, 1984.

[5] J. D. Gibson and S. D. Gray, "MVSE adaptive filtering subject to a constraint on MSE," IEEE Trans. Circuits and Systems, vol. 35, no. 5, pp. 603–608, 1988.

[6] D. L. Duttweiler, "Adaptive filter performance with nonlinearities in the correlation multiplier," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 578–586, 1982.

[7] W. A. Sethares, "Adaptive algorithms with nonlinear data and error functions," IEEE Trans. Signal Processing, vol. 40, no. 9, pp. 2199–2206, 1992.

[8] S. C. Douglas and T. H.-Y. Meng, "Stochastic gradient adaptation under general error criteria," IEEE Trans. Signal Processing, vol. 42, no. 6, pp. 1335–1351, 1994.

[9] V. J. Mathews and S. H. Cho, "Improved convergence analysis of stochastic gradient adaptive filters using the sign algorithm," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-35, no. 4, pp. 450–454, 1987.

[10] N. J. Bershad, "On error-saturation nonlinearities in LMS adaptation," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 36, no. 4, pp. 440–452, 1988.

[11] N. Bershad and M. Bonnet, "Saturation effects in LMS adaptive echo cancellation for binary data," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 10, pp. 1687–1696, 1990.

[12] T. A. C. M. Claasen and W. F. G. Mecklenbrauker, "Comparison of the convergence of two algorithms for adaptive FIR digital filters," IEEE Trans. Circuits and Systems, vol. 28, no. 6, pp. 510–518, 1981.

[13] W. A. Gardner, "Learning characteristics of stochastic-gradient-descent algorithms: a general study, analysis and critique," Signal Processing, vol. 6, no. 2, pp. 113–133, 1984.

[14] A. Feuer and E. Weinstein, "Convergence analysis of LMS filters with uncorrelated Gaussian data," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 33, no. 1, pp. 222–229, 1985.

[15] M. Rupp, "The behavior of LMS and NLMS algorithms in the presence of spherically invariant processes," IEEE Trans. Signal Processing, vol. 41, no. 3, pp. 1149–1160, 1993.

[16] T. Y. Al-Naffouri, A. Zerguine, and M. Bettayeb, "Convergence analysis of the LMS algorithm with a general error nonlinearity


and an iid input," in Proceedings of the 32nd Asilomar Conference on Signals, Systems, and Computers, vol. I, pp. 556–559, Asilomar, CA, November 1998.

[17] S. Koike, "Convergence analysis of a data echo canceller with a stochastic gradient adaptive FIR filter using the sign algorithm," IEEE Trans. Signal Processing, vol. 43, no. 12, pp. 2852–2861, 1995.

[18] J. C. M. Bermudez and N. Bershad, "A nonlinear analytical model for the quantized LMS algorithm: the arbitrary step-size case," IEEE Trans. Signal Processing, vol. 44, no. 5, pp. 1175–1183, 1996.

[19] B. Polyak and Y. Tsypkin, "Adaptive estimation algorithms (convergence, optimality, stability)," Avtomatika i Telemekhanika, no. 3, pp. 71–84, 1979.

[20] N. J. Bershad, "On the optimum data nonlinearity in LMS adaptation," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 34, no. 2, pp. 69–76, 1986.

[21] T. Y. Al-Naffouri, A. H. Sayed, and T. Kailath, "On the selection of optimal nonlinearities for stochastic gradient adaptive algorithms," in Proc. ICASSP, vol. 1, pp. 464–467, Istanbul, Turkey, June 2000.

[22] R. Sharma, W. A. Sethares, and J. A. Bucklew, "Asymptotic analysis of stochastic gradient based adaptive filtering algorithms with general cost functions," IEEE Trans. Signal Processing, vol. 44, no. 9, pp. 2186–2194, 1996.

[23] A. H. Sayed and M. Rupp, "A time-domain feedback analysis of adaptive gradient algorithms via the small gain theorem," in Proc. SPIE Conference on Advanced Signal Processing: Algorithms, Architectures, and Implementations, F. T. Luk, Ed., vol. 2563, pp. 458–469, San Diego, CA, July 1995.

[24] M. Rupp and A. H. Sayed, "A time-domain feedback analysis of filtered-error adaptive gradient algorithms," IEEE Trans. Signal Processing, vol. 44, no. 6, pp. 1428–1439, 1996.

[25] A. H. Sayed and M. Rupp, "Robustness issues in adaptive filtering," in Digital Signal Processing Handbook, chapter 20, CRC Press, January 1998.

[26] J. Mai and A. H. Sayed, "A feedback approach to the steady-state performance of fractionally-spaced blind adaptive equalizers," IEEE Trans. Signal Processing, vol. 48, no. 1, pp. 80–91, 2000.

[27] N. R. Yousef and A. H. Sayed, "A unified approach to the steady-state and tracking analysis of adaptive filters," IEEE Trans. Signal Processing, vol. 49, no. 2, pp. 314–324, 2001.

[28] N. R. Yousef and A. H. Sayed, "Steady-state and tracking analyses of the sign algorithm without the explicit use of the independence assumption," IEEE Signal Processing Letters, vol. 7, no. 11, pp. 307–309, 2000.

[29] T. Y. Al-Naffouri and A. H. Sayed, "Transient analysis of adaptive filters," in Proc. ICASSP, Salt Lake City, Utah, May 2001.

[30] T. Y. Al-Naffouri and A. H. Sayed, "Transient analysis of adaptive filters, Part I: the data nonlinearity case," in IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Baltimore, Maryland, June 2001.

[31] T. Y. Al-Naffouri and A. H. Sayed, "Transient analysis of adaptive filters, Part II: the error nonlinearity case," in IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Baltimore, Maryland, June 2001.

[32] H. L. Van Trees, Detection, Estimation, and Modulation Theory: Part I, John Wiley and Sons, New York, 1968.

[33] R. Price, "A useful theorem for nonlinear devices having Gaussian inputs," IEEE Transactions on Information Theory, vol. 4, no. 6, pp. 69–72, 1958.

[34] A. H. Nuttall, "Evaluation of densities and distributions via Hermite and generalized Laguerre series employing higher-order expansion coefficients determined recursively via moments or cumulants," Tech. Rep. TR 7377, Naval Underwater Systems Center, February 1985.

[35] M. Abramowitz and I. Stegun, Handbook of Mathematical Functions, Dover Publications, New York, 1970.

[36] O. Tanrikulu and J. A. Chambers, "Convergence and steady-state properties of the least-mean mixed-norm (LMMN) adaptive algorithm," in IEE Proc.-Vision, Image and Signal Processing, vol. 143, pp. 137–142, June 1996.

[37] T. Y. Al-Naffouri, A. Zerguine, and M. Bettayeb, "A unifying view of error nonlinearities in LMS adaptation," in Proc. ICASSP, vol. III, pp. 1697–1700, Seattle, May 1998.

Tareq Y. Al-Naffouri received a BS degree in Mathematics (with honors) and an MS degree in Electrical Engineering from King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, in 1994 and 1997, respectively. He received an MS degree in Electrical Engineering from Georgia Institute of Technology in 1998, and subsequently joined Stanford University, where he is a PhD candidate in the Electrical Engineering Department. Mr. Al-Naffouri's research interests include system identification, adaptive filtering analysis and design, echo cancellation, and design under uncertainty. He has recently been interested in channel identification and equalization in OFDM transmission. He is the recipient of a 2001 best student paper award at an international meeting for his work on adaptive filtering analysis.

Ali H. Sayed received the Ph.D. degree in electrical engineering in 1992 from Stanford University, Stanford, CA. He is Professor of Electrical Engineering at the University of California, Los Angeles. He has over 160 journal and conference publications, is coauthor of the research monograph Indefinite Quadratic Estimation and Control (SIAM, PA, 1999) and of the graduate-level textbook Linear Estimation (Prentice Hall, NJ, 2000). He is also coeditor of the volume Fast Reliable Algorithms for Matrices with Structure (SIAM, PA, 1999). He is a member of the editorial boards of the SIAM Journal on Matrix Analysis and Its Applications and of the International Journal of Adaptive Control and Signal Processing, and has served as coeditor of special issues of the journal Linear Algebra and Its Applications. He has contributed several articles to engineering and mathematical encyclopedias and handbooks, and has served on the program committees of several international meetings. He has also consulted with industry in the areas of adaptive filtering, adaptive equalization, and echo cancellation. His research interests span several areas including adaptive and statistical signal processing, filtering and estimation theories, equalization techniques for communications, interplays between signal processing and control methodologies, and fast algorithms for large-scale problems. To learn more about his work, visit the website of the UCLA Adaptive Systems Laboratory at http://www.ee.ucla.edu/asl. Dr. Sayed is a recipient of the 1996 IEEE Donald G. Fink Award. He is Associate Editor of the IEEE Transactions on Signal Processing.