ELEG-636: Statistical Signal Processing
Kenneth E. Barner
Department of Electrical and Computer Engineering, University of Delaware
Spring 2009

Course Objectives & Structure

Objective: Given a discrete-time sequence {x(n)}, develop
Statistical and spectral signal representations
Filtering, prediction, and system identification algorithms
Optimization methods that are statistical and adaptive

Course Structure:
Weekly lectures [notes: www.ece.udel.edu/~barner]
Periodic homework (theory & Matlab implementations) [10%]
Midterm & Final examinations [80%]
Final project [10%]
Probability

Signal Characterization

Assumption: Many methods take {x(n)} to be deterministic.
Reality: Real-world signals are usually statistical in nature.
Thus,
$$\ldots, x(-1), x(0), x(1), \ldots$$
can be interpreted as a sequence of random variables. We begin by analyzing each observation x(n) as a R.V. Then, to capture dependencies, we consider random vectors.
These assignments yield different distribution functions, e.g., for the coin-toss RVs x and y,
$$F_x(2) = \Pr\{HH, HT\} = 1/2, \qquad F_y(2) = \Pr\{HH, HT, TH\} = 3/4$$
How do we attain an intuitive interpretation of the distribution function?
Random Variables

Distribution Plots

[Figure: the staircase distribution functions $F_x(x)$ and $F_y(y)$ for the coin-toss example.]

Note the distribution properties hold:
1. $F(+\infty) = 1$, $F(-\infty) = 0$
2. $F(x)$ is continuous from the right: $F(x^+) = F(x)$
3. $\Pr\{x_1 < x \le x_2\} = F(x_2) - F(x_1)$
Definition
The probability density function is defined as
$$f(x) = \frac{dF(x)}{dx} \quad\text{or}\quad F(x) = \int_{-\infty}^{x} f(x)\,dx$$
Thus $F(\infty) = 1 \Rightarrow \int_{-\infty}^{\infty} f(x)\,dx = 1$.

Types of distributions:
Continuous: $\Pr\{x = x_0\} = 0$ for all $x_0$
Discrete: $F(x_i) - F(x_i^-) = \Pr\{x = x_i\} = P_i$, in which case $f(x) = \sum_i P_i\,\delta(x - x_i)$
Mixed: discontinuous but not discrete
Distribution Examples

Uniform: $x \sim U(a, b)$, $a < b$
$$f(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b] \\ 0 & \text{else} \end{cases}$$

[Figure: the uniform pdf $f(x)$, of height $(b-a)^{-1}$ on $[a, b]$, and the cdf $F(x)$ rising linearly from 0 to 1 over $[a, b]$.]
Gaussian: $x \sim N(\mu, \sigma)$
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

[Figure: the Gaussian pdf $f(x)$, symmetric about $\mu$, and cdf $F(x)$, which passes through 1/2 at $x = \mu$ and approaches 1.]

Historical Note: First introduced by Abraham de Moivre in 1733; rigorously justified by Gauss in 1809; extended by Laplace in 1812.
Why not the de Moivre distribution? Stigler's law.
Gaussian Distribution Example

Example
Consider the Normal (Gaussian) distribution PDF and CDF for $\mu = 0$, $\sigma^2 = 0.2, 1.0, 5.0$ and $\mu = -2$, $\sigma^2 = 0.5$.

[Figure: the cdfs $F_{\mu,\sigma^2}(x)$ and pdfs $f_{\mu,\sigma^2}(x)$ for the four $(\mu, \sigma^2)$ pairs above, plotted over $-5 \le x \le 5$; smaller variances give taller, narrower pdfs.]
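These curves are easy to reproduce numerically. Below is a minimal Python sketch (the course homework uses Matlab; this is an equivalent illustration, and the function names are ours):

```python
from math import erf, exp, sqrt, pi

def gauss_pdf(x, mu=0.0, var=1.0):
    # f(x) = exp(-(x - mu)^2 / (2 var)) / sqrt(2 pi var)
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def gauss_cdf(x, mu=0.0, var=1.0):
    # F(x) expressed through the error function
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2 * var)))

# Peak height grows as the variance shrinks; F(mu) = 1/2 in every case.
for mu, var in [(0, 0.2), (0, 1.0), (0, 5.0), (-2, 0.5)]:
    print(f"mu={mu}, var={var}: f(mu)={gauss_pdf(mu, mu, var):.3f}, "
          f"F(mu)={gauss_cdf(mu, mu, var):.2f}")
```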
Binomial: $x \sim B(p, q)$, $p + q = 1$

Example
Toss a coin n times. What is the probability of getting k heads?
For $p + q = 1$, where $q$ is the probability of a tail and $p$ is the probability of a head,
$$\Pr\{x = k\} = \binom{n}{k} p^k q^{n-k} \qquad \left[\text{Note: } \binom{n}{k} = \frac{n!}{(n-k)!\,k!}\right]$$
$$\Rightarrow f(x) = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}\, \delta(x - k)$$
$$\Rightarrow F(x) = \sum_{k=0}^{m} \binom{n}{k} p^k q^{n-k}, \qquad m \le x < m + 1$$
Binomial Distribution Example I

Example
Toss a coin n times. What is the probability of getting k heads? For $n = 9$, $p = q = \frac{1}{2}$ (fair coin).

[Figure: the resulting pmf $f(x)$, peaking near $k = 4, 5$, and the staircase cdf $F(x)$.]
Binomial Distribution Example II

Example
Toss a coin n times. What is the probability of getting k heads? For $n = 20$, $p = 0.5, 0.7$ and $n = 40$, $p = 0.5$.

[Figure: pmfs and cdfs for (p = 0.5, n = 20), (p = 0.7, n = 20), and (p = 0.5, n = 40); larger n spreads the mass, and p > 0.5 shifts it right.]
Conditional Distributions

Definition
The conditional distribution of x given that event M has occurred is
$$F_x(x_0|M) = \Pr\{x \le x_0 | M\} = \frac{\Pr\{x \le x_0, M\}}{\Pr\{M\}}$$

Example
Suppose $M = \{x \le a\}$. Then
$$F_x(x_0|M) = \frac{\Pr\{x \le x_0, M\}}{\Pr\{x \le a\}}$$
If $x_0 \ge a$, what happens?
Special Cases

Special Case: $x_0 \ge a$
$$\Pr\{x \le x_0, x \le a\} = \Pr\{x \le a\} \;\Rightarrow\; F_x(x_0|M) = \frac{\Pr\{x \le x_0, M\}}{\Pr\{x \le a\}} = \frac{\Pr\{x \le a\}}{\Pr\{x \le a\}} = 1$$

Special Case: $x_0 \le a$
$$F_x(x_0|M) = \frac{\Pr\{x \le x_0, M\}}{\Pr\{x \le a\}} = \frac{\Pr\{x \le x_0\}}{\Pr\{x \le a\}} = \frac{F_x(x_0)}{F_x(a)}$$
Conditional Distribution Example

Example
What does $F_x(x|M)$ look like for $M = \{x \le a\}$?
$$F_x(x_0|M) = \begin{cases} \frac{F_x(x_0)}{F_x(a)} & x_0 \le a \\ 1 & a \le x_0 \end{cases}$$

[Figure: $F_x(x | x \le a)$ coincides with $F_x(x)/F_x(a)$ below $a$ and saturates at 1 thereafter.]

Distribution properties hold for conditional cases:
Limiting cases: $F(\infty|M) = 1$ and $F(-\infty|M) = 0$
Probability range: $\Pr\{x_0 \le x \le x_1 | M\} = F(x_1|M) - F(x_0|M)$
Density–distribution relations:
$$f(x|M) = \frac{\partial F(x|M)}{\partial x}, \qquad F(x_0|M) = \int_{-\infty}^{x_0} f(x|M)\,dx$$
Example (Fair Coin Toss)
Toss a fair coin 4 times. Let x be the number of heads. Determine $\Pr\{x = k\}$.

Recall
$$\Pr\{x = k\} = \binom{n}{k} p^k q^{n-k}$$
In this case
$$\Pr\{x = k\} = \binom{4}{k} \left(\frac{1}{2}\right)^4$$
$$\Pr\{x = 0\} = \Pr\{x = 4\} = \frac{1}{16}, \quad \Pr\{x = 1\} = \Pr\{x = 3\} = \frac{1}{4}, \quad \Pr\{x = 2\} = \frac{3}{8}$$
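These values are easy to verify numerically; a small Python check (function name is illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    # Pr{x = k} = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, n = 4: expect 1/16, 1/4, 3/8, 1/4, 1/16
print([binom_pmf(k, 4, 0.5) for k in range(5)])
# [0.0625, 0.25, 0.375, 0.25, 0.0625]
```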
Density and Distribution Plots for Fair Coin (n = 4) Example

[Figure: the pmf $f(x)$ with masses 1/16, 1/4, 3/8, 1/4, 1/16 at $x = 0, \ldots, 4$, and the staircase cdf $F(x)$ passing through 1/16, 5/16, 11/16, 15/16, 1.]

What type of distribution is this? Discrete. Thus,
$$F(x_i) - F(x_i^-) = \Pr\{x = x_i\} = P_i$$
$$F(x) = \int_{-\infty}^{x} f(x)\,dx = \int_{-\infty}^{x} \sum_i P_i\,\delta(x - x_i)\,dx$$
Conditional Case

Example (Conditional Fair Coin Toss)
Toss a fair coin 4 times. Let x be the number of heads. Suppose M = [at least one flip produces a head]. Determine $\Pr\{x = k|M\}$.

Recall,
$$\Pr\{x = k|M\} = \frac{\Pr\{x = k, M\}}{\Pr\{M\}}$$
Thus first determine $\Pr\{M\}$:
$$\Pr\{M\} = 1 - \Pr\{\text{No heads}\} = 1 - \frac{1}{16} = \frac{15}{16}$$
Probability Conditional Distributions
Next determine Pr{x = k |M} for the individual cases, k = 0, 1, 2, 3, 4
Pr{x = 0|M} =Pr{x = 0, M}
Pr{M} = 0
Pr{x = 1|M} =Pr{x = 1, M}
Pr{M}=
Pr{x = 1}Pr{M} =
1/415/16
=4
15
Pr{x = 2|M} =Pr{x = 2}
Pr{M} =3/8
15/16=
615
Pr{x = 3|M} =415
Pr{x = 4|M} =115
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 23 / 406
Conditional and Unconditional Density Functions

[Figure: the unconditional pmf $f(x)$ (masses 1/16, 1/4, 3/8, 1/4, 1/16) and the conditional pmf $f(x|M)$ (masses 4/15, 6/15, 4/15, 1/15 at $x = 1, \ldots, 4$).]

Are they proper density functions?
Total Probability and Bayes' Theorem

Let $M_1, M_2, \ldots, M_n$ form a partition of S, i.e.,
$$\bigcup_i M_i = S \quad\text{and}\quad M_i \cap M_j = \phi \;\;(i \ne j)$$
Then
$$F(x) = \sum_i F_x(x|M_i)\Pr(M_i), \qquad f(x) = \sum_i f_x(x|M_i)\Pr(M_i)$$

Aside
$$\Pr\{A|B\} = \frac{\Pr\{A, B\}}{\Pr\{B\}} = \frac{\Pr\{B, A\}}{\Pr\{A\}}\cdot\frac{\Pr\{A\}}{\Pr\{B\}} = \frac{\Pr\{B|A\}\Pr\{A\}}{\Pr\{B\}}$$
From this we get
$$\Pr\{M|x \le x_0\} = \frac{\Pr\{x \le x_0|M\}\Pr\{M\}}{\Pr\{x \le x_0\}} = \frac{F(x_0|M)\Pr\{M\}}{F(x_0)}$$
and
$$\Pr\{M|x = x_0\} = \frac{f(x_0|M)\Pr\{M\}}{f(x_0)}$$
By integration,
$$\int_{-\infty}^{\infty} \Pr\{M|x = x_0\} f(x_0)\,dx_0 = \int_{-\infty}^{\infty} f(x_0|M)\Pr\{M\}\,dx_0 = \Pr\{M\}\int_{-\infty}^{\infty} f(x_0|M)\,dx_0 = \Pr\{M\}$$
$$\Rightarrow \Pr\{M\} = \int_{-\infty}^{\infty} \Pr\{M|x = x_0\} f(x_0)\,dx_0$$
Putting it all Together: Bayes' Theorem

Bayes' Theorem:
$$f(x_0|M) = \frac{\Pr\{M|x = x_0\} f(x_0)}{\Pr\{M\}} = \frac{\Pr\{M|x = x_0\} f(x_0)}{\int_{-\infty}^{\infty} \Pr\{M|x = x_0\} f(x_0)\,dx_0}$$
Functions of a R.V.

Problem Statement
Let x and g(x) be RVs such that $y = g(x)$.
Question: How do we determine the distribution of y?

Note
$$F_y(y_0) = \Pr\{y \le y_0\} = \Pr\{g(x) \le y_0\} = \Pr\{x \in R_{y_0}\}$$
where
$$R_{y_0} = \{x : g(x) \le y_0\}$$
Question: If $y = g(x) = x^2$, what is $R_{y_0}$?
Example
Let $y = g(x) = x^2$. Determine $F_y(y_0)$.

[Figure: the parabola $y = x^2$; the event $\{y \le y_0\}$ corresponds to $-\sqrt{y_0} \le x \le \sqrt{y_0}$.]

Note that
$$F_y(y_0) = \Pr(y \le y_0) = \Pr(-\sqrt{y_0} \le x \le \sqrt{y_0}) = F_x(\sqrt{y_0}) - F_x(-\sqrt{y_0})$$
Example
Let $x \sim N(\mu, \sigma)$ and
$$y = U(x) = \begin{cases} 1 & \text{if } x > \mu \\ 0 & \text{if } x \le \mu \end{cases}$$
Determine $f_y(y_0)$ and $F_y(y_0)$.

[Figure: $f_y(y)$ consists of two impulses of weight 1/2 at $y = 0$ and $y = 1$; $F_y(y)$ steps from 0 to 1/2 at $y = 0$ and from 1/2 to 1 at $y = 1$.]
General Function of a Random Variable Case

To determine the density of $y = g(x)$ in terms of $f_x(x_0)$, look at $g(x)$. Suppose the horizontal slice $[y_0, y_0 + dy_0]$ is crossed at points $x_1, x_2, x_3$ (with $g$ decreasing at $x_2$):
$$f_y(y_0)\,dy_0 = \Pr(y_0 \le y \le y_0 + dy_0) = \Pr(x_1 \le x \le x_1 + dx_1) + \Pr(x_2 + dx_2 \le x \le x_2) + \Pr(x_3 \le x \le x_3 + dx_3)$$
$$= f_x(x_1)\,dx_1 + f_x(x_2)|dx_2| + f_x(x_3)\,dx_3 \qquad (*)$$
Note that
$$dx_1 = \frac{dx_1}{dy_0}\,dy_0 = \frac{dy_0}{dy_0/dx_1} = \frac{dy_0}{g'(x_1)}$$
Similarly,
$$dx_2 = \frac{dy_0}{g'(x_2)} \quad\text{and}\quad dx_3 = \frac{dy_0}{g'(x_3)}$$
Thus $(*)$ becomes
$$f_y(y_0)\,dy_0 = \frac{f_x(x_1)}{g'(x_1)}\,dy_0 + \frac{f_x(x_2)}{|g'(x_2)|}\,dy_0 + \frac{f_x(x_3)}{g'(x_3)}\,dy_0$$
or
$$f_y(y_0) = \frac{f_x(x_1)}{g'(x_1)} + \frac{f_x(x_2)}{|g'(x_2)|} + \frac{f_x(x_3)}{g'(x_3)}$$
Function of a R.V. Distribution General Result

Set $y = g(x)$ and let $x_1, x_2, \ldots$ be the roots, i.e.,
$$y = g(x_1) = g(x_2) = \cdots$$
Then
$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|} + \frac{f_x(x_2)}{|g'(x_2)|} + \cdots$$

Example
Suppose $x \sim U(-1, 2)$ and $y = x^2$. Determine $f_y(y)$.

[Figure: $f_x(x) = 1/3$ on $[-1, 2]$, and the parabola $y = x^2$, which maps $[-1, 1]$ two-to-one and $(1, 2]$ one-to-one onto y values.]
Note that
$$g(x) = x^2 \;\Rightarrow\; g'(x) = 2x$$
Consider the special cases separately.
Case 1: $0 \le y \le 1$
$$y = x^2 \;\Rightarrow\; x = \pm\sqrt{y}$$
$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|} + \frac{f_x(x_2)}{|g'(x_2)|} = \frac{1/3}{|2\sqrt{y}|} + \frac{1/3}{|-2\sqrt{y}|} = \frac{1/3}{\sqrt{y}}$$
Case 2: $1 \le y \le 4$
$$y = x^2 \;\Rightarrow\; x = \sqrt{y}$$
$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|} = \frac{1/3}{2\sqrt{y}} = \frac{1/6}{\sqrt{y}}$$
Result: For $x \sim U(-1, 2)$ and $y = x^2$,
$$f_y(y) = \begin{cases} \frac{1/3}{\sqrt{y}} & 0 \le y \le 1 \\ \frac{1/6}{\sqrt{y}} & 1 < y \le 4 \end{cases}$$

[Figure: $f_y(y)$ decays like $1/\sqrt{y}$ and jumps down by a factor of 2 at $y = 1$.]
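A Monte Carlo check of this density, as a Python sketch (sample size and binning are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 2.0, 1_000_000)   # x ~ U(-1, 2)
y = x**2                                 # y = g(x) = x^2

def f_y(y):
    # Derived density: (1/3)/sqrt(y) on [0, 1], (1/6)/sqrt(y) on (1, 4]
    return np.where(y <= 1.0, (1/3) / np.sqrt(y), (1/6) / np.sqrt(y))

hist, edges = np.histogram(y, bins=40, range=(0.01, 4.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# Rough agreement; tightest away from the y -> 0 singularity
print(np.max(np.abs(hist - f_y(centers))))
```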
Example
Let $x \sim N(\mu, \sigma)$ and $y = e^x$. Determine $f_y(y)$.

Note $g(x) \ge 0$ and $g'(x) = e^x$. Also, there is a single root (inverse solution):
$$x = \ln(y)$$
Therefore,
$$f_y(y) = \frac{f_x(x)}{|g'(x)|} = \frac{f_x(x)}{e^x}$$
Expressing this in terms of y through substitution yields
$$f_y(y) = \frac{f_x(\ln(y))}{e^{\ln(y)}} = \frac{f_x(\ln(y))}{y}$$
Note that x is Gaussian:
$$f_x(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \;\Rightarrow\; f_y(y) = \frac{1}{\sqrt{2\pi}\,y\,\sigma}\, e^{-\frac{(\ln(y)-\mu)^2}{2\sigma^2}}, \quad y > 0$$

[Figure: the resulting log-normal density $f_y(y)$.]
Distribution of $F_x(x)$

For any RV with continuous distribution $F_x(x)$, the RV $y = F_x(x)$ is uniform on [0, 1].

Proof: Note $0 < y < 1$. Since
$$g(x) = F_x(x), \qquad g'(x) = f_x(x)$$
Thus
$$f_y(y) = \frac{f_x(x)}{g'(x)} = \frac{f_x(x)}{f_x(x)} = 1$$

[Figure: $f_y(y) = 1$ on [0, 1] and $F_y(y) = y$ on [0, 1].]
Thus the function $g(x) = F_x(x)$ performs the mapping
$$x \;(\text{pdf } f_x(x)) \;\xrightarrow{\;F_x(\cdot)\;}\; U \;(\text{uniform on } [0, 1])$$
The converse also holds:
$$U \;(\text{uniform}) \;\xrightarrow{\;F_x^{-1}(\cdot)\;}\; x \;(\text{pdf } f_x(x))$$
Combining operations yields Synthesis: map x to a uniform RV via $F_x(\cdot)$, then to any desired distribution via $F_y^{-1}(\cdot)$.
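A minimal Python sketch of the synthesis idea, assuming a target law whose inverse cdf is available in closed form (here the exponential, which is an assumption of this illustration, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(0.0, 1.0, 100_000)   # U ~ uniform on [0, 1]

# For F(x) = 1 - exp(-lam x), the inverse is F^{-1}(u) = -ln(1 - u)/lam
lam = 2.0
x = -np.log(1.0 - u) / lam           # x now has pdf lam exp(-lam x)

print(x.mean(), 1 / lam)             # sample mean approaches E{x} = 1/lambda
```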
Mean and Variance

Definitions
$$\text{Mean:}\quad E\{x\} = \int_{-\infty}^{\infty} x f(x)\,dx$$
$$\text{Conditional Mean:}\quad E\{x|M\} = \int_{-\infty}^{\infty} x f(x|M)\,dx$$

Example
Suppose $M = \{x \ge a\}$. Then
$$E\{x|M\} = \int_{-\infty}^{\infty} x f(x|M)\,dx = \frac{\int_a^{\infty} x f(x)\,dx}{\int_a^{\infty} f(x)\,dx}$$
For a function of a RV, $y = g(x)$,
$$E\{y\} = \int_{-\infty}^{\infty} y f_y(y)\,dy = \int_{-\infty}^{\infty} g(x) f_x(x)\,dx$$

Example
Suppose g(x) is a step function equal to 1 for $x \le x_0$ and 0 otherwise. Determine $E\{g(x)\}$.
$$E\{g(x)\} = \int_{-\infty}^{\infty} g(x) f_x(x)\,dx = \int_{-\infty}^{x_0} f_x(x)\,dx = F_x(x_0)$$
Definition (Variance)
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \eta)^2 f(x)\,dx$$
where $\eta = E\{x\}$. Thus,
$$\sigma^2 = E\{(x - \eta)^2\} = E\{x^2\} - E^2\{x\}$$

Example
For $x \sim N(\eta, \sigma^2)$, determine the variance.
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\eta)^2}{2\sigma^2}}$$
Note: f(x) is symmetric about $x = \eta \Rightarrow E\{x\} = \eta$.
Also,
$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \;\Rightarrow\; \int_{-\infty}^{\infty} e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sqrt{2\pi}\,\sigma$$
$$\int_{-\infty}^{\infty} e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sqrt{2\pi}\,\sigma$$
Differentiating w.r.t. $\sigma$:
$$\int_{-\infty}^{\infty} \frac{(x-\eta)^2}{\sigma^3}\, e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sqrt{2\pi}$$
Rearranging yields
$$\int_{-\infty}^{\infty} (x - \eta)^2 \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sigma^2$$
or
$$E\{(x - \eta)^2\} = \sigma^2$$
Moments

Definition (Moments)
Moments:
$$m_n = E\{x^n\} = \int_{-\infty}^{\infty} x^n f(x)\,dx$$
Central Moments:
$$\mu_n = E\{(x - \eta)^n\} = \int_{-\infty}^{\infty} (x - \eta)^n f(x)\,dx$$
From the binomial theorem,
$$\mu_n = E\{(x - \eta)^n\} = E\left\{\sum_{k=0}^{n} \binom{n}{k} x^k (-\eta)^{n-k}\right\} = \sum_{k=0}^{n} \binom{n}{k} m_k (-\eta)^{n-k}$$
$$\Rightarrow \mu_0 = 1, \quad \mu_1 = 0, \quad \mu_2 = \sigma^2, \quad \mu_3 = m_3 - 3\eta m_2 + 2\eta^3$$
Example
Let $x \sim N(0, \sigma^2)$. Prove
$$E\{x^n\} = \begin{cases} 0 & n = 2k+1 \\ 1 \cdot 3 \cdots (n-1)\,\sigma^n & n = 2k \end{cases}$$
For n odd,
$$E\{x^n\} = \int_{-\infty}^{\infty} x^n f(x)\,dx = 0$$
since $x^n$ is an odd function and f(x) is an even function.
To prove the second part, use the fact that
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}$$
Differentiate
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}$$
with respect to $\alpha$, k times:
$$\Rightarrow \int_{-\infty}^{\infty} x^{2k} e^{-\alpha x^2}\,dx = \frac{1 \cdot 3 \cdots (2k-1)}{2^k}\sqrt{\frac{\pi}{\alpha^{2k+1}}}$$
Let $\alpha = \frac{1}{2\sigma^2}$; then
$$\int_{-\infty}^{\infty} x^{2k} e^{-\frac{x^2}{2\sigma^2}}\,dx = 1 \cdot 3 \cdots (2k-1)\,\sigma^{2k+1}\sqrt{2\pi}$$
Setting $n = 2k$ and rearranging,
$$\int_{-\infty}^{\infty} x^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}\,dx = 1 \cdot 3 \cdots (n-1)\,\sigma^n \qquad \text{[QED]}$$
Note: Variance is a measure of a RV's concentration around its mean.
Tchebycheff Inequality

For any $\varepsilon > 0$,
$$\Pr(|x - \eta| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$$
To prove this, note
$$\Pr(|x - \eta| \ge \varepsilon) = \int_{-\infty}^{\eta-\varepsilon} f(x)\,dx + \int_{\eta+\varepsilon}^{\infty} f(x)\,dx = \int_{|x-\eta| \ge \varepsilon} f(x)\,dx$$
Also note that
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \eta)^2 f(x)\,dx \ge \int_{|x-\eta| \ge \varepsilon} (x - \eta)^2 f(x)\,dx$$

[Figure: a density f(x) with the two tails $x \le \eta - \varepsilon$ and $x \ge \eta + \varepsilon$ shaded.]
$$\sigma^2 \ge \int_{|x-\eta| \ge \varepsilon} (x - \eta)^2 f(x)\,dx$$
Using the fact that $|x - \eta| \ge \varepsilon$ in the above gives
$$\sigma^2 \ge \varepsilon^2 \int_{|x-\eta| \ge \varepsilon} f(x)\,dx = \varepsilon^2 \Pr\{|x - \eta| \ge \varepsilon\}$$
Rearranging gives the desired result:
$$\Pr\{|x - \eta| \ge \varepsilon\} \le \left(\frac{\sigma}{\varepsilon}\right)^2 \qquad \text{QED}$$
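A quick numerical illustration of the bound for a unit-variance Gaussian (the distribution choice is illustrative; the bound is loose here, as expected):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 1_000_000)    # eta = 0, sigma = 1

for eps in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x) >= eps)
    bound = 1.0 / eps**2                # sigma^2 / eps^2
    print(f"eps={eps}: Pr={empirical:.4f} <= bound={bound:.4f}")
```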
Characteristic & Moment Generating Functions

Definition (Characteristic Function)
The characteristic function of a random variable x with pdf $f_x(x)$ is defined by
$$\phi_x(\omega) = E\left(e^{j\omega x}\right) = \int_{-\infty}^{\infty} e^{j\omega x} f_x(x)\,dx$$
If $f_x(x)$ is symmetric about 0 ($f_x(x) = f_x(-x)$), then $\phi_x(\omega)$ is real.
The magnitude of the characteristic function is bounded by
$$|\phi_x(\omega)| \le \phi_x(0) = 1$$

Theorem (Characteristic Function for the Sum of Independent RVs)
Let $x_1, x_2, \ldots, x_N$ be independent (but not necessarily identically distributed) RVs and set $s_N = \sum_{i=1}^{N} a_i x_i$, where the $a_i$ are constants. Then
$$\phi_{s_N}(\omega) = \prod_{i=1}^{N} \phi_{x_i}(a_i\omega)$$
The theorem can be proved by a simple extension of the following: let x and y be independent. Then
$$\phi_{x+y}(\omega) = E\left(e^{j\omega(x+y)}\right) = E\left(e^{j\omega x} e^{j\omega y}\right) = E\left(e^{j\omega x}\right)E\left(e^{j\omega y}\right) = \phi_x(\omega)\phi_y(\omega)$$

Example
Determine the characteristic function of the sample mean operating on i.i.d. samples.
Note $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \Rightarrow a_i = \frac{1}{N}$
$$\Rightarrow \phi_{\bar{x}}(\omega) = \prod_{i=1}^{N} \phi_{x_i}(a_i\omega) = \left(\phi_{x_i}\left(\frac{\omega}{N}\right)\right)^N$$
The Moment Generating Function is realized by making the substitution $j\omega \to s$ in the above.

Definition (Moment Generating Function)
The moment generating function of a random variable x with pdf $f_x(x)$ is defined by
$$\Phi_x(s) = E(e^{sx}) = \int_{-\infty}^{\infty} e^{sx} f_x(x)\,dx$$
Note $\Phi_x(j\omega) = \phi_x(\omega)$.

Theorem (Moment Generation)
Provided that $\Phi_x(s)$ exists in an open interval around $s = 0$, the following holds:
$$m_n = E(x^n) = \Phi_x^{(n)}(0) = \frac{d^n\Phi_x}{ds^n}(0)$$
Simply noting that $\Phi_x^{(n)}(s) = E(x^n e^{sx})$ proves the result.
Example
Let x be exponentially distributed,
$$f(x) = \lambda e^{-\lambda x} U(x)$$
Determine $\eta = m_1$, $m_2$, and $\sigma^2$.

Note
$$\Phi_x(s) = \lambda\int_0^{\infty} e^{sx} e^{-\lambda x}\,dx = \lambda\int_0^{\infty} e^{-x(\lambda - s)}\,dx = \frac{\lambda}{\lambda - s}$$
Thus
$$\Phi_x^{(1)}(0) = \frac{1}{\lambda} \quad\text{and}\quad \Phi_x^{(2)}(0) = \frac{2}{\lambda^2}$$
and
$$E\{x\} = \frac{1}{\lambda}, \quad E\{x^2\} = \frac{2}{\lambda^2} \;\Rightarrow\; \sigma^2 = \frac{1}{\lambda^2}$$
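These moments are easy to check by simulation; a short Python sketch with an arbitrary $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
x = rng.exponential(1.0 / lam, 1_000_000)   # f(x) = lam exp(-lam x), x >= 0

print(x.mean(), 1 / lam)             # m1 = 1/lambda
print((x**2).mean(), 2 / lam**2)     # m2 = 2/lambda^2
print(x.var(), 1 / lam**2)           # sigma^2 = 1/lambda^2
```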
Bivariate Statistics

Given two RVs, x and y, the bivariate (joint) distribution is given by
$$F(x_0, y_0) = \Pr\{x \le x_0, y \le y_0\}$$

[Figure: the quadrant $\{x \le x_0, y \le y_0\}$ in the xy plane.]

Properties:
$$F(-\infty, y) = F(x, -\infty) = 0$$
$$F(\infty, \infty) = 1$$
$$F_x(x) = F(x, \infty), \qquad F_y(y) = F(\infty, y)$$
Special Cases

Case 1: $M = \{x_1 \le x \le x_2,\; y \le y_0\}$
$$\Rightarrow \Pr\{M\} = F(x_2, y_0) - F(x_1, y_0)$$

Case 2: $M = \{x \le x_0,\; y_1 \le y \le y_2\}$
$$\Rightarrow \Pr\{M\} = F(x_0, y_2) - F(x_0, y_1)$$
Case 3: $M = \{x_1 \le x \le x_2,\; y_1 \le y \le y_2\}$. Then
$$\Pr\{M\} = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1)$$
where the last term is added back because that region was subtracted twice.
Joint Statistics

Definition (Joint Statistics)
$$f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y} \quad\text{and}\quad F(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(\alpha, \beta)\,d\alpha\,d\beta$$
In general, for some region M, the joint statistics are
$$\Pr\{(x, y) \in M\} = \iint_M f(x, y)\,dx\,dy$$
Marginal Statistics: $F_x(x) = F(x, \infty)$ and $F_y(y) = F(\infty, y)$
$$\Rightarrow f_x(x) = \int_{-\infty}^{\infty} f(x, y)\,dy, \qquad f_y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$$
Independence

Definition (Independence)
Two RVs x and y are statistically independent if, for arbitrary events (regions) $x \in A$ and $y \in B$,
$$\Pr\{x \in A, y \in B\} = \Pr\{x \in A\}\Pr\{y \in B\}$$
Letting $A = \{x \le x_0\}$ and $B = \{y \le y_0\}$, we see x and y are independent iff
$$F_{x,y}(x, y) = F_x(x)F_y(y)$$
and, by differentiation,
$$f_{x,y}(x, y) = f_x(x)f_y(y)$$
If x and y are independent RVs, then
$$z = q(x) \quad\text{and}\quad w = h(y)$$
are also independent.

Function of Two RVs
Given two RVs, let $z = g(x, y)$. Define $D_z$ to be the xy plane region
$$\{z \le z_0\} = \{g(x, y) \le z_0\} = \{(x, y) \in D_z\}$$
Then
$$F_z(z_0) = \Pr\{z \le z_0\} = \Pr\{(x, y) \in D_z\} = \iint_{D_z} f(x, y)\,dx\,dy$$
Example
Let $z = x + y$. Then $z \le z_0$ gives the region $x + y \le z_0$, which is delineated by the line $x + y = z_0$.

[Figure: the half-plane below the line $x + y = z_0$.]

Thus
$$F_z(z_0) = \iint_{D_z} f(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{z_0-y} f(x, y)\,dx\,dy$$
We can obtain $f_z(z)$ by differentiation:
$$\frac{\partial F_z(z)}{\partial z} = \int_{-\infty}^{\infty}\frac{\partial}{\partial z}\int_{-\infty}^{z-y} f(x, y)\,dx\,dy$$
$$f_z(z) = \int_{-\infty}^{\infty} f(z - y, y)\,dy \qquad (*)$$
Note that if x and y are independent,
$$f(x, y) = f_x(x)f_y(y) \qquad (**)$$
Thus, utilizing $(**)$ in $(*)$,
$$f_z(z) = \int_{-\infty}^{\infty} f_x(z - y)f_y(y)\,dy = f_x(z) * f_y(z) \qquad \text{[convolution]}$$
Example
Let $z = x + y$ where x and y are independent with
$$f_x(x) = \alpha e^{-\alpha x}U(x), \qquad f_y(y) = \alpha e^{-\alpha y}U(y)$$
Then
$$f_z(z) = \int_{-\infty}^{\infty} f_x(z - y)f_y(y)\,dy = \alpha^2\int_0^{z} e^{-\alpha(z-y)} e^{-\alpha y}\,dy = \alpha^2 e^{-\alpha z}\int_0^{z} dy = \alpha^2 z\, e^{-\alpha z} U(z)$$
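A simulation check that the sum has the derived density (a minimal sketch; sample size and binning are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, n = 1.5, 1_000_000
z = rng.exponential(1/alpha, n) + rng.exponential(1/alpha, n)  # z = x + y

# Compare a histogram of z with alpha^2 z exp(-alpha z)
hist, edges = np.histogram(z, bins=50, range=(0, 6), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
target = alpha**2 * centers * np.exp(-alpha * centers)
print(np.max(np.abs(hist - target)))   # small for large n
```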
Example
Let $z = \max(x, y)$. Determine $F_z(z_0)$ and $f_z(z_0)$.

Note
$$F_z(z_0) = \Pr\{z \le z_0\} = \Pr\{\max(x, y) \le z_0\} = F_{xy}(z_0, z_0)$$

[Figure: the quadrant $\{x \le z_0, y \le z_0\}$.]
If x and y are independent,
$$F_z(z_0) = F_x(z_0)F_y(z_0)$$
and
$$f_z(z_0) = \frac{\partial F_z(z_0)}{\partial z_0} = \frac{\partial F_x(z_0)}{\partial z_0}F_y(z_0) + \frac{\partial F_y(z_0)}{\partial z_0}F_x(z_0) = f_x(z_0)F_y(z_0) + f_y(z_0)F_x(z_0)$$
Result: R is positive definite if the samples in x are not linearly dependent. In this case $R^{-1}$ exists.

Historical Note: Diagonal-constant matrices are named after the mathematician Otto Toeplitz (1881–1940).
Stationary Process and Models: Stochastic Models

A model is used to describe the hidden laws governing the generation of the physical data observed.
We assume that $x(n), x(n-1), \cdots$ have statistical dependencies that can be modeled as the output of a discrete-time linear filter driven by v(n), where v(n) is a purely random process.
Linear model types:
1. Auto-Regressive – no past model input samples used
2. Moving Average – no past model output samples used
3. Auto-Regressive Moving Average – both past input and output used
General Stochastic Model:
$$\begin{pmatrix}\text{Model}\\\text{output}\end{pmatrix} + \underbrace{\begin{pmatrix}\text{Linear combination}\\\text{of past outputs}\end{pmatrix}}_{\text{AR part}} = \underbrace{\begin{pmatrix}\text{Linear combination of}\\\text{present \& past inputs}\end{pmatrix}}_{\text{MA part}}$$
Three model possibilities:
1. AR – auto-regressive
2. MA – moving average
3. ARMA – mixed AR and MA

Model Input: assumed to be an i.i.d. zero mean Gaussian process:
$$E\{v(n)\} = 0 \;\text{ for all } n, \qquad E\{v(n)v^*(k)\} = \begin{cases}\sigma_v^2 & k = n \\ 0 & \text{otherwise}\end{cases}$$
Auto-Regressive Models

Definition (Auto-Regressive)
The time series {x(n)} is said to be generated by an AR model if
$$x(n) + a_1^* x(n-1) + \cdots + a_M^* x(n-M) = v(n)$$
or
$$x(n) = w_1^* x(n-1) + \cdots + w_M^* x(n-M) + v(n)$$
where $w_k = -a_k$.

This is an order M model, and v(n) is referred to as the noise term. Note that we can set $a_0 = 1$ and write
$$\sum_{k=0}^{M} a_k^* x(n-k) = v(n)$$
which is a convolution sum.
Thus, taking Z-transforms,
$$Z\{a_n^*\} = A(z) = \sum_{n=0}^{M} a_n^* z^{-n}$$
$$Z\{x(n)\} = X(z) = \sum_{n=0}^{\infty} x(n)z^{-n}, \qquad Z\{v(n)\} = V(z) = \sum_{n=0}^{\infty} v(n)z^{-n}$$
and
$$\sum_{k=0}^{M} a_k^* x(n-k) = v(n) \;\Rightarrow\; A(z)X(z) = V(z)$$
If we regard v(n) as the output, then x(n) is mapped to v(n) by
$$H_A(z) = \frac{V(z)}{X(z)} = A(z)$$
This is called the process analyzer:
The analyzer is an all-zero system
Impulse response is finite (FIR)
System is BIBO stable
If we view v(n) as the input, then we have the process generator, which maps v(n) to x(n) via
$$H_G(z) = \frac{X(z)}{V(z)} = \frac{1}{A(z)}$$
The process generator is an all-pole system
Impulse response is infinite (IIR)
System stability is an issue
Note
$$H_G(z) = \frac{1}{A(z)} = \frac{1}{\sum_{n=0}^{M} a_n^* z^{-n}}$$
Factor the denominator and represent $H_G(z)$ in terms of its poles:
$$H_G(z) = \frac{1}{(1 - p_1 z^{-1})(1 - p_2 z^{-1})\cdots(1 - p_M z^{-1})}$$
$p_1, p_2, \ldots, p_M$ are the poles of $H_G(z)$, defined as the roots of the characteristic equation
$$1 + a_1^* z^{-1} + a_2^* z^{-2} + \cdots + a_M^* z^{-M} = 0$$
$H_G(z)$ is all-pole (IIR) and BIBO stable only if all poles are inside the unit circle, i.e.,
$$|p_n| < 1, \quad n = 1, 2, \cdots, M$$
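A small Python sketch tying the model to its poles: pick example real AR(2) coefficients (ours, for illustration), confirm the pole magnitudes, and run the generator recursion:

```python
import numpy as np

rng = np.random.default_rng(5)
a = np.array([1.0, -0.75, 0.5])   # A(z) = 1 - 0.75 z^-1 + 0.5 z^-2
print(np.abs(np.roots(a)))        # pole magnitudes ~0.707 < 1 -> stable

# Process generator: x(n) = -a1 x(n-1) - a2 x(n-2) + v(n)
n = 10_000
x, v = np.zeros(n), rng.normal(0.0, 1.0, n)
for i in range(2, n):
    x[i] = -a[1] * x[i-1] - a[2] * x[i-2] + v[i]
print(x.var())                    # finite, since the generator is stable
```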
Moving Average Model

Definition (Moving Average)
The time series {x(n)} is said to be generated by a Moving Average (MA) model if
$$x(n) = v(n) + b_1^* v(n-1) + \cdots + b_K^* v(n-K)$$
where $b_1, b_2, \cdots, b_K$ are the parameters of the order K MA model.

v(n) is zero mean white Gaussian noise
The process generation model is all-zero (FIR)
Auto-Regressive Moving Average Model

Definition (Auto-Regressive Moving Average)
In this case, {x(n)} is a mixed process where the output is a function of past outputs and current/past inputs:
$$x(n) + a_1^* x(n-1) + \cdots + a_M^* x(n-M) = v(n) + b_1^* v(n-1) + \cdots + b_K^* v(n-K)$$
The order is (M, K).

v(n) is zero mean white Gaussian noise
The process model has zeros and poles (IIR)
Analysis of Stochastic Models

Wold Decomposition (after Herman Wold (1908–92))
Any WSS discrete-time stochastic process y(n) can be expressed as
$$y(n) = x(n) + s(n)$$
where:
x(n) and s(n) are uncorrelated
x(n) can be expressed by the MA model
$$x(n) = \sum_{k=0}^{\infty} b_k^* v(n-k)$$
with $b_0 = 1$ and $\sum_{k=0}^{\infty}|b_k| < \infty$
v(n) is white noise uncorrelated with s(n)
s(n) is perfectly predictable
Note: If B(z) is minimum phase, then it can be represented by an all-pole (AR) system.
AR models are widely used because they are tractable.
Asymptotic Statistics of AR Processes

Recall that {x(n)} is generated by
$$x(n) + a_1^* x(n-1) + a_2^* x(n-2) + \cdots + a_M^* x(n-M) = v(n)$$
or
$$x(n) = w_1^* x(n-1) + w_2^* x(n-2) + \cdots + w_M^* x(n-M) + v(n)$$
This is a linear constant-coefficient difference equation of order M driven by v(n).
Z-transform representation:
$$X(z) = \frac{V(z)}{1 + \sum_{k=1}^{M} a_k^* z^{-k}}$$
Inverse transforming $X(z) = \frac{V(z)}{1 + \sum_{k=1}^{M} a_k^* z^{-k}}$ yields
$$x(n) = \underbrace{x_c(n)}_{\text{Homogeneous Solution}} + \underbrace{x_p(n)}_{\text{Particular Solution}}$$
The particular solution is the result of driving $H_G(z)$ with v(n),
$$x_p(n) = H_G(z)v(n),$$
where $z^{-1}$ is taken as the delay operator.
The particular solution has stationary statistics.
The homogeneous solution is of the form
$$x_c(n) = B_1 p_1^n + B_2 p_2^n + \cdots + B_M p_M^n$$
where $p_1, p_2, \cdots, p_M$ are the roots of
$$1 + a_1^* z^{-1} + a_2^* z^{-2} + \cdots + a_M^* z^{-M} = 0$$
The B values depend on the initial conditions
The homogeneous solution is not stationary
The process is asymptotically stationary if $|p_n| < 1$ for all n
Correlation of a Stationary AR Process

Recall that an AR process can be written as
$$\sum_{k=0}^{M} a_k^* x(n-k) = v(n)$$
where $a_0 = 1$. Multiply both sides by $x^*(n-l)$ and take $E\{\cdot\}$:
$$E\left\{\sum_{k=0}^{M} a_k^* x(n-k)x^*(n-l)\right\} = E\{v(n)x^*(n-l)\}$$
Note that
$$E\{x(n-k)x^*(n-l)\} = r(l-k), \qquad E\{v(n)x^*(n-l)\} = 0 \;\text{ for } l > 0$$
Thus
$$E\left\{\sum_{k=0}^{M} a_k^* x(n-k)x^*(n-l)\right\} = E\{v(n)x^*(n-l)\} \;\Rightarrow\; \sum_{k=0}^{M} a_k^* r(l-k) = 0 \;\text{ for } l > 0$$
Accordingly, the auto-correlation of the AR process satisfies
$$r(l) = w_1^* r(l-1) + w_2^* r(l-2) + \cdots + w_M^* r(l-M)$$
where $w_k = -a_k$. Note that this also has the solution
$$r(m) = \sum_{k=1}^{M} c_k p_k^m$$
where $p_k$ is the k-th root of
$$1 - w_1^* z^{-1} - w_2^* z^{-2} - \cdots - w_M^* z^{-M} = 0$$
Why? It is a difference equation with no driving function, so only the homogeneous solution remains.
Recall that the AR characteristic equation is
$$1 + a_1^* z^{-1} + a_2^* z^{-2} + \cdots + a_M^* z^{-M} = 0$$
This is identical to the auto-correlation characteristic equation
$$1 - w_1^* z^{-1} - w_2^* z^{-2} - \cdots - w_M^* z^{-M} = 0$$
⇒ the roots are equal.
Result: A stable AR process ⇒ $|p_k| < 1$ and
$$\lim_{m\to\infty} r(m) = \lim_{m\to\infty}\sum_{k=1}^{M} c_k p_k^m = 0$$
(asymptotically uncorrelated)
Yule-Walker Equations

An AR model of order M is completely specified by
AR coefficients: $a_1, a_2, \ldots, a_M$
Variance of v(n): $\sigma_v^2$
Proposition: These parameters can be determined from the auto-correlation values $r(0), r(1), \ldots, r(M)$.

Recall
$$r(l) = w_1^* r(l-1) + w_2^* r(l-2) + \cdots + w_M^* r(l-M)$$
Case 1: Let $l = 1$:
$$r(1) = w_1^* r(0) + w_2^* r(-1) + \cdots + w_M^* r(1-M)$$
Using the fact that $r(-k) = r^*(k)$,
$$r(1) = w_1^* r(0) + w_2^* r^*(1) + \cdots + w_M^* r^*(M-1)$$
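Stacking this recursion for $l = 1, \ldots, M$ gives a Toeplitz linear system in the weights. A minimal Python sketch for real-valued data; the noise-variance relation $\sigma_v^2 = r(0) - \sum_k w_k r(k)$ is the standard companion identity, used here as an assumption:

```python
import numpy as np

def yule_walker(r):
    """Solve the Yule-Walker system for the AR weights w_k (a_k = -w_k).

    r : auto-correlation values r(0), ..., r(M), real-valued data assumed.
    """
    M = len(r) - 1
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    w = np.linalg.solve(R, r[1:])
    sigma_v2 = r[0] - w @ r[1:]     # driving-noise variance
    return w, sigma_v2

# AR(1) with w1 = 0.8, unit-variance noise: r(k) = 0.8^k / (1 - 0.64)
r = np.array([1/0.36, 0.8/0.36, 0.64/0.36])
print(yule_walker(r))               # w ~ [0.8, 0.0], sigma_v^2 ~ 1.0
```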
Linear Systems, Spectral Representations, and Eigen Analysis: Eigen Properties

Property (Determinant–Eigenvalue Relation)
The determinant of the correlation matrix is related to the eigenvalues as follows:
$$\det(R) = \prod_{i=1}^{M}\lambda_i$$
Proof: Using $R = Q\Omega Q^H$ and the above,
$$\det(R) = \det(Q\Omega Q^H) = \det(Q)\det(Q^H)\det(\Omega) = \det(\Omega) = \prod_{i=1}^{M}\lambda_i$$
Property (Trace–Eigenvalue Relation)
The trace of the correlation matrix is related to the eigenvalues as follows:
$$\mathrm{trace}(R) = \sum_{i=1}^{M}\lambda_i$$
Proof: Note
$$\mathrm{trace}(R) = \mathrm{trace}(Q\Omega Q^H) = \mathrm{trace}(Q^H Q\Omega) = \mathrm{trace}(\Omega) = \sum_{i=1}^{M}\lambda_i \qquad \text{QED}$$
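Both identities are easy to confirm numerically; a Python sketch with a random sample correlation matrix (the data model is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 4))
R = X.T @ X / 1000                       # sample correlation matrix (symmetric)

lam = np.linalg.eigvalsh(R)
print(np.linalg.det(R), np.prod(lam))    # det(R) = product of eigenvalues
print(np.trace(R), np.sum(lam))          # trace(R) = sum of eigenvalues
```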
Definition (Normal Matrix)
A complex square matrix A is a normal matrix if
$$A^H A = A A^H$$
That is, a matrix is normal if it commutes with its conjugate transpose.

Note
All Hermitian symmetric matrices are normal
Every matrix that can be diagonalized by a unitary transform is normal

Definition (Condition Number)
The condition number reflects how numerically well-conditioned a problem is, i.e., a low condition number ⇒ well-conditioned; a high condition number ⇒ ill-conditioned.
Definition (Condition Number for Linear Systems)
For a linear system
$$A\mathbf{x} = \mathbf{b}$$
defined by a normal matrix A, the condition number is
$$\chi(A) = \frac{\lambda_{\max}}{\lambda_{\min}}$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum/minimum eigenvalues of A.

Observations:
Large eigenvalue spread ⇒ ill-conditioned
Small eigenvalue spread ⇒ well-conditioned
Sensitivity Analysis: Suppose a system (filter) is related as
$$R\mathbf{w} = \mathbf{d}$$
where w defines the filter parameters, and R and d are signal statistic matrices.
⇒ Introduce small signal statistic perturbations: let R and d be perturbed such that $\|\delta R\|/\|R\|$ and $\|\delta \mathbf{d}\|/\|\mathbf{d}\|$ are on the order of $\epsilon \ll 1$.
Result: A bound on the resulting parameter perturbations is given by
$$\frac{\|\delta\mathbf{w}\|}{\|\mathbf{w}\|} \le \epsilon\,\chi(R) = \epsilon\,\frac{\lambda_{\max}}{\lambda_{\min}}$$
Consequence: If R is ill-conditioned, small changes in R or d can lead to big changes in w.
⇒ System sensitivity is related to eigenvalue spread.
The Discrete Karhunen-Loève Transform (KLT)

Definition (The Discrete Karhunen-Loève Transform (KLT))
An M sample vector $\mathbf{x}(n)$ from the process {x(n)} can be expressed as
$$\mathbf{x}(n) = \sum_{i=1}^{M} c_i(n)\mathbf{q}_i$$
where $\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_M$ are the orthonormal eigenvectors of the process correlation matrix, R, and $c_1(n), c_2(n), \cdots, c_M(n)$ are a set of KLT coefficients.

The signal is represented as a weighted sum of eigenvectors
Need to determine the coefficients
Determining Coefficients: Write the expression in matrix form
$$\mathbf{x}(n) = \sum_{i=1}^{M} c_i(n)\mathbf{q}_i = Q\mathbf{c}(n) \qquad (*)$$
where
$$Q = [\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_M] \quad\text{and}\quad \mathbf{c}(n) = [c_1(n), c_2(n), \cdots, c_M(n)]^T.$$
Solving $(*)$ for $\mathbf{c}(n)$:
$$\mathbf{c}(n) = Q^{-1}\mathbf{x}(n) = Q^H\mathbf{x}(n) \quad\text{or}\quad c_i(n) = \mathbf{q}_i^H\mathbf{x}(n)$$
Note: $c_i(n)$ is the projection of $\mathbf{x}(n)$ onto $\mathbf{q}_i$.
Question: How are the coefficients related to each other?
Answer: Consider the correlation between $c_i(n)$ terms:
$$E\{c_i(n)c_j^*(n)\} = \mathbf{q}_i^H E\{\mathbf{x}(n)\mathbf{x}^H(n)\}\mathbf{q}_j = \mathbf{q}_i^H R\mathbf{q}_j = \lambda_j\,\mathbf{q}_i^H\mathbf{q}_j = \begin{cases}\lambda_i & i = j \\ 0 & i \ne j\end{cases}$$
⇒ KLT transform coefficients are uncorrelated. A desirable property – Why?
Question: Can we represent $\mathbf{x}(n)$ with fewer terms? If so, how do we minimize the representation error?
Approach: Use fewer terms in the KLT transform:
$$\mathbf{x}(n) = \sum_{i=1}^{M} c_i(n)\mathbf{q}_i \;\Rightarrow\; \hat{\mathbf{x}}(n) = \sum_{i=1}^{N} c_i(n)\mathbf{q}_i, \quad N < M$$
Thus
$$\mathbf{x}(n) = \hat{\mathbf{x}}(n) + \boldsymbol{\epsilon}(n) = \sum_{i=1}^{N} c_i(n)\mathbf{q}_i + \sum_{i=N+1}^{M} c_i(n)\mathbf{q}_i$$
Question: How do we minimize the representation error?
Approach: Analyze and minimize the error power.
The error power is given by
$$\epsilon = E\{\boldsymbol{\epsilon}^H(n)\boldsymbol{\epsilon}(n)\} = E\left\{\sum_{i=N+1}^{M} c_i^*(n)\mathbf{q}_i^H \sum_{j=N+1}^{M} c_j(n)\mathbf{q}_j\right\}$$
$$= \sum_{i=N+1}^{M} E\{c_i^*(n)c_i(n)\} \quad\text{[result of orthogonality]}$$
$$= \sum_{i=N+1}^{M}\lambda_i \quad\text{[from prior result]}$$
Result: To minimize the error, select the N eigenvectors $\mathbf{q}_i$ associated with the N largest eigenvalues.
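A compact Python sketch of the truncated KLT (PCA-style), confirming that the error power equals the sum of the discarded eigenvalues; the data model is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
x = rng.normal(size=(5000, 4)) @ A.T       # correlated 4-D observations
R = x.T @ x / len(x)                       # correlation matrix estimate

lam, Q = np.linalg.eigh(R)                 # ascending eigenvalues, orthonormal Q
c = x @ Q                                  # coefficients c_i(n) = q_i^T x(n)

keep = [3, 2]                              # N = 2 largest eigenvalues
x_hat = c[:, keep] @ Q[:, keep].T
err_power = np.mean(np.sum((x - x_hat)**2, axis=1))
print(err_power, lam[0] + lam[1])          # sum of discarded eigenvalues
```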
The Matched Filter

Objective: Find the optimal filter coefficients for two cases: (1) deterministic signals and (2) stochastic signals.
Signal model:
$$\mathbf{x}(n) = \underbrace{\mathbf{u}(n)}_{\text{signal}} + \underbrace{\mathbf{v}(n)}_{\text{noise}}$$
Maximum Likelihood and Bayes Estimation: Cramér-Rao Bound

The Cauchy-Schwarz inequality states (for square-integrable complex-valued functions)
$$\left|\int f(x)g(x)\,dx\right|^2 \le \int|f(x)|^2\,dx\cdot\int|g(x)|^2\,dx$$
with equality only if $f(x) = k\cdot g(x)$, where k is a constant.
Thus
$$\left(\int_{-\infty}^{\infty}\left(\frac{\partial\ln[f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)]}{\partial\theta}\sqrt{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)}\right)\left(\sqrt{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)}\,(\hat{\theta}-\theta)\right)d\mathbf{x}\right)^2 = 1$$
$$\Rightarrow\left(\int_{-\infty}^{\infty}\left(\frac{\partial\ln[f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)]}{\partial\theta}\right)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x}\right)\left(\int_{-\infty}^{\infty}(\hat{\theta}-\theta)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x}\right) \ge 1$$
Note
$$\int_{-\infty}^{\infty}(\hat{\theta}-\theta)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x} = \mathrm{var}(\hat{\theta}) \qquad (*)$$
and
$$\int_{-\infty}^{\infty}\left(\frac{\partial\ln[f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)]}{\partial\theta}\right)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x} = E\left\{\left(\frac{\partial\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta))}{\partial\theta}\right)^2\right\} \qquad (**)$$
Thus, using $(*)$ and $(**)$ in the inequality above,
$$\mathrm{var}(\hat{\theta}) \ge \left[E\left\{\left(\frac{\partial\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta))}{\partial\theta}\right)^2\right\}\right]^{-1}$$
with equality iff
$$\frac{\partial}{\partial\theta}\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)) = k(\hat{\theta}-\theta)$$
QED
Thus the bound is met iff
$$\frac{\partial}{\partial\theta}\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)) = k(\hat{\theta}-\theta)$$
Let $\theta = \hat{\theta}_{ML}$ in the above:
$$\underbrace{\left.\frac{\partial}{\partial\theta}\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta))\right|_{\theta=\hat{\theta}_{ML}}}_{=\,0\text{ by ML criteria}} = \left.k(\hat{\theta}-\theta)\right|_{\theta=\hat{\theta}_{ML}}$$
Therefore, the RHS must equal zero, or
$$\hat{\theta} = \hat{\theta}_{ML}$$
Result: If an efficient estimate (one that satisfies the bound with equality) exists, then it is the ML estimate.
Note: If an efficient estimator doesn't exist, then we don't know how good $\hat{\theta}_{ML}$ is.
Bayes Estimation

Definition (Bayes Estimation)
Objective: Estimate a random parameter (RV) y from observation samples $x_1, x_2, \cdots, x_n$ that are statistically related to y by $f_{y|\mathbf{x}}(\cdot)$.
Bayes Procedure: Define a nonnegative cost function $C(y, \hat{y})$ and set $\hat{y}$ to minimize the expected cost, or risk,
$$\underbrace{R}_{\text{risk}} = E\{C(y, \hat{y})\}$$
Since y and $\hat{y}$ are RVs,
$$R = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} C(y, \hat{y})f_{y,\mathbf{x}}(y, \mathbf{x})\,dy\,d\mathbf{x} = \int_{-\infty}^{\infty}\underbrace{\left[\int_{-\infty}^{\infty} C(y, \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy\right]}_{I(\hat{y})} f_{\mathbf{x}}(\mathbf{x})\,d\mathbf{x}$$
Note: Minimizing $I(\hat{y})$ is equivalent to minimizing R, since $f_{\mathbf{x}}(\mathbf{x}) \ge 0$.
Consider several cost functions.

Case 1: Mean squared cost function
$$C(y, \hat{y}) = |y - \hat{y}|^2$$
In this case,
$$I(\hat{y}) = \int_{-\infty}^{\infty}(y - \hat{y})^2 f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$\Rightarrow \frac{\partial I(\hat{y})}{\partial\hat{y}} = -2\int_{-\infty}^{\infty}(y - \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = 0$$
or, rearranging,
$$\int_{-\infty}^{\infty}\hat{y}f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{-\infty}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy \quad[\hat{y}\text{ is a constant}]$$
$$\Rightarrow \hat{y}_{MS} = \int_{-\infty}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = E\{y|\mathbf{x}\}$$

Example
Let $x_i = a + \mu_i$ for $i = 1, 2, \cdots, N$, where the $\mu_i \sim N(0, \sigma^2)$ are i.i.d. and $a \sim N(0, \sigma_a^2)$. Determine $\hat{a}_{MS}(\mathbf{x})$.

Note
$$f_{\mathbf{x}|a}(\mathbf{x}|a) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x_i-a)^2}{2\sigma^2}} = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{N}{2}} e^{-\frac{1}{2}\left(\sum_{i=1}^{N}\frac{(x_i-a)^2}{\sigma^2}\right)} \qquad (*)$$
$$f_a(a) = \frac{1}{\sqrt{2\pi}\,\sigma_a}e^{-\frac{a^2}{2\sigma_a^2}} \qquad (**)$$
To find $\hat{a}_{MS}(\mathbf{x})$ we need
$$\hat{a}_{MS}(\mathbf{x}) = E\{a|\mathbf{x}\}$$
By Bayes' theorem we can write
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = \frac{f_{\mathbf{x}|a}(\mathbf{x}|a)f_a(a)}{f_{\mathbf{x}}(\mathbf{x})}$$
Substituting in $(*)$ and $(**)$, and rearranging,
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = \frac{\left(\frac{1}{2\pi\sigma^2}\right)^{\frac{N}{2}}\left(\frac{1}{\sqrt{2\pi}\,\sigma_a}\right)e^{-\frac{1}{2}\left(\sum_{i=1}^{N}\frac{(x_i-a)^2}{\sigma^2} + \frac{a^2}{\sigma_a^2}\right)}}{f_{\mathbf{x}}(\mathbf{x})}$$
This can be compactly written as
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = C(\mathbf{x})\exp\left\{-\frac{1}{2\sigma_p^2}\left[a - \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/N}\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)\right]^2\right\}$$
Observations on
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = C(\mathbf{x})\exp\left\{-\frac{1}{2\sigma_p^2}\left[a - \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/N}\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)\right]^2\right\} = C(\mathbf{x})\exp\left\{-\frac{(a-\eta)^2}{2\sigma_p^2}\right\}$$
$C(\mathbf{x})$ is a (normalizing) function of $\mathbf{x}$ only. The variance term is given by
$$\sigma_p^2 = \left(\frac{1}{\sigma_a^2} + \frac{N}{\sigma^2}\right)^{-1} = \frac{\sigma_a^2\sigma^2}{N\sigma_a^2 + \sigma^2}$$
Critical Observation: $f_{a|\mathbf{x}}(a|\mathbf{x})$ is a Gaussian distribution!
Result:
$$\hat{a}_{MS} = E\{a|\mathbf{x}\} = \eta = \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/N}\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)$$
Case 2: Uniform cost function
$$C(y, \hat{y}) = \begin{cases}0 & |y - \hat{y}| < \varepsilon \\ 1 & \text{else}\end{cases}$$
Question: For what types of problems is this cost function effective?

In this case,
$$I(\hat{y}) = \int_{-\infty}^{\infty} C(y, \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{|y-\hat{y}| \ge \varepsilon} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = 1 - \int_{|y-\hat{y}| < \varepsilon} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
How do we minimize $I(\hat{y})$?
Result: $I(\hat{y})$ is minimized by maximizing
$$\int_{|y-\hat{y}| < \varepsilon} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Note: $\varepsilon$ is arbitrarily small
⇒ $I(\hat{y})$ is minimized when $f_{y|\mathbf{x}}(\hat{y}|\mathbf{x})$ takes its largest value:
$$\hat{y}_{MAP}(\mathbf{x}) = \arg\max_y f_{y|\mathbf{x}}(y|\mathbf{x})$$
$\hat{y}_{MAP}$ is referred to as the maximum a posteriori (MAP) estimate because it maximizes the posterior density $f_{y|\mathbf{x}}(y|\mathbf{x})$.
Example
Let
$$f_{x,y}(x, y) = \begin{cases}10y & 0 \le y \le x^2,\; 0 \le x \le 1 \\ 0 & \text{otherwise}\end{cases}$$
Find the MS and MAP estimates of y, i.e., $\hat{y}_{MS}(x)$ and $\hat{y}_{MAP}(x)$.

First step: determine the posterior density $f_{y|x}(y|x)$. Since $f_{y|x}(y|x) = \frac{f_{x,y}(x,y)}{f_x(x)}$, we need
$$f_x(x) = \int_{-\infty}^{\infty} f_{x,y}(x, y)\,dy = \int_0^{x^2}10y\,dy = 5y^2\Big|_0^{x^2} = 5x^4, \quad 0 \le x \le 1$$
Thus,
$$f_{y|x}(y|x) = \frac{f_{x,y}(x, y)}{f_x(x)} = \frac{10y}{5x^4} = \frac{2y}{x^4}, \quad 0 \le y \le x^2$$

[Figure: for fixed x, $f_{y|x}(y|x)$ increases linearly in y, reaching height $2/x^2$ at $y = x^2$.]
MAP estimate:
$$\hat{y}_{MAP}(x) = \arg\max_y f_{y|x}(y|x) = \arg\max_{0 \le y \le x^2}\frac{2y}{x^4} = x^2$$
MS estimate:
$$\hat{y}_{MS}(x) = E\{y|x\} = \int_0^{x^2} y f_{y|x}(y|x)\,dy = \int_0^{x^2}\frac{2y^2}{x^4}\,dy = \frac{2}{3}\frac{y^3}{x^4}\Big|_0^{x^2} = \frac{2}{3}x^2$$
Note that the minimum MSE is
$$E\{(y - \hat{y}_{MS})^2\} = \int_0^1\int_0^{x^2}(y - \hat{y}_{MS})^2 f_{x,y}(x, y)\,dy\,dx = \int_0^1\int_0^{x^2}\left(y - \frac{2}{3}x^2\right)^2 10y\,dy\,dx = \frac{5}{162} = 0.0309$$
The MSE of the MAP estimate is
$$E\{(y - \hat{y}_{MAP})^2\} = \int_0^1\int_0^{x^2}(y - \hat{y}_{MAP})^2 f_{x,y}(x, y)\,dy\,dx = \int_0^1\int_0^{x^2}(y - x^2)^2 10y\,dy\,dx = \frac{5}{54} = 0.0926$$
Observation: This result is expected. Why?
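Both MSE values can be confirmed by rejection sampling from $f_{x,y}$; a Python sketch (the sampler and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
# Sample f(x, y) = 10y on 0 <= y <= x^2, 0 <= x <= 1 by rejection:
# propose uniformly on the unit square, accept with probability f/10
x = rng.uniform(0, 1, 4_000_000)
y = rng.uniform(0, 1, 4_000_000)
w = rng.uniform(0, 10, 4_000_000)
keep = (y <= x**2) & (w <= 10 * y)
x, y = x[keep], y[keep]

print(np.mean((y - (2/3) * x**2)**2))   # ~ 5/162 = 0.0309 (MS estimate)
print(np.mean((y - x**2)**2))           # ~ 5/54  = 0.0926 (MAP estimate)
```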
Observation: MAP estimation can be used as an extension of ML estimation if some variability is assumed: instead of an unknown constant $\theta$, we have an unknown random parameter with distribution $f_\theta(\theta)$.
To see this, note
$$f_{\theta|\mathbf{x}}(\theta|\mathbf{x}) = \frac{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)f_\theta(\theta)}{f_{\mathbf{x}}(\mathbf{x})}$$
The MAP estimate maximizes the numerator, since $f_{\mathbf{x}}(\mathbf{x})$ is not a function of $\theta$:
$$\hat{\theta}_{MAP} = \arg\max_\theta f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)f_\theta(\theta)$$
Question: For what distribution $f_\theta(\theta)$ does $\hat{\theta}_{MAP} = \hat{\theta}_{ML}$? That is,
$$\hat{\theta}_{MAP} = \arg\max_\theta f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)f_\theta(\theta) \stackrel{?}{=} \arg\max_\theta f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) = \hat{\theta}_{ML}$$
Example
Let $x(n) = A + \mu(n)$ for $n = 1, 2, \cdots, N$, where the $\mu(n) \sim N(0, \sigma_\mu^2)$ are i.i.d. and $A \sim N(A_0, \sigma_A^2)$. Determine the MAP estimate of A.

Need to maximize $f_{\mathbf{x}|A}(\mathbf{x}|A)f_A(A)$, or
$$\hat{A}_{MAP} = \arg\max_A\left[\ln(f_{\mathbf{x}|A}(\mathbf{x}|A)) + \ln(f_A(A))\right]$$
Note
$$\ln(f_{\mathbf{x}|A}(\mathbf{x}|A)) = \frac{N}{2}\ln\left(\frac{1}{2\pi\sigma_\mu^2}\right) - \sum_{n=1}^{N}\frac{(x(n)-A)^2}{2\sigma_\mu^2}$$
and
$$\ln(f_A(A)) = \frac{1}{2}\ln\left(\frac{1}{2\pi\sigma_A^2}\right) - \frac{(A-A_0)^2}{2\sigma_A^2}$$
Thus
$$\hat{A}_{MAP} = \arg\min_A\left(\frac{1}{2\sigma_\mu^2}\sum_{n=1}^{N}(x(n)-A)^2 + \frac{(A-A_0)^2}{2\sigma_A^2}\right)$$
Differentiating, we get
$$\left.-\frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}(x(n)-A) + \frac{(A-A_0)}{\sigma_A^2}\right|_{A=\hat{A}_{MAP}} = 0$$
$$\Rightarrow \sum_{n=1}^{N}\frac{x(n)}{\sigma_\mu^2} - \frac{N\hat{A}_{MAP}}{\sigma_\mu^2} = \frac{\hat{A}_{MAP}}{\sigma_A^2} - \frac{A_0}{\sigma_A^2}$$
$$\Rightarrow \hat{A}_{MAP}\left(\frac{1}{\sigma_A^2} + \frac{N}{\sigma_\mu^2}\right) = \frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}x(n) + \frac{A_0}{\sigma_A^2}$$
$$\Rightarrow \hat{A}_{MAP} = \frac{1}{\frac{1}{\sigma_A^2} + \frac{N}{\sigma_\mu^2}}\left(\frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}x(n) + \frac{A_0}{\sigma_A^2}\right)$$
$$\hat{A}_{MAP} = \frac{1}{\frac{1}{\sigma_A^2} + \frac{N}{\sigma_\mu^2}}\left(\frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}x(n) + \frac{A_0}{\sigma_A^2}\right)$$
Note that if $\sigma_A^2 \to \infty$, then there is no a priori information and
$$\lim_{\sigma_A^2\to\infty}\hat{A}_{MAP} = \frac{1}{N}\sum_{n=1}^{N}x(n) = \hat{A}_{ML}$$

[Figure: two Gaussian priors with $\sigma_{A,2}^2 > \sigma_{A,1}^2$; the flatter prior constrains the estimate less.]

Observation: As $f_\theta(\theta)$ flattens out, $\hat{\theta}_{MAP} \to \hat{\theta}_{ML}$.
Case 3: The absolute cost function
$$C(y, \hat{y}) = |y - \hat{y}|$$
Question: For what types of problems is this cost function effective?

In this case,
$$I(\hat{y}) = \int_{-\infty}^{\infty} C(y, \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{y<\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy + \int_{y\ge\hat{y}}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$= \int_{-\infty}^{\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy + \int_{\hat{y}}^{\infty}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$I(\hat{y}) = \int_{-\infty}^{\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy + \int_{\hat{y}}^{\infty}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Note that
$$\int_{-\infty}^{\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \hat{y}F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - \int_{-\infty}^{\hat{y}} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
and similarly
$$\int_{\hat{y}}^{\infty}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{\hat{y}}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy - \hat{y}(1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}))$$
Thus,
$$I(\hat{y}) = \hat{y}F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - \int_{-\infty}^{\hat{y}} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy - \hat{y}(1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x})) + \int_{\hat{y}}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Taking the derivative of $I(\hat{y})$,
$$\frac{\partial I(\hat{y})}{\partial\hat{y}} = F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) + \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - (1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x})) - \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) + \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x})$$
$$= F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - (1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}))$$
Result: Setting this equal to 0, we see that $\hat{y}_{MAE}$ is given by
$$F_{y|\mathbf{x}}(\hat{y}_{MAE}|\mathbf{x}) = 1 - F_{y|\mathbf{x}}(\hat{y}_{MAE}|\mathbf{x})$$
or
$$\int_{-\infty}^{\hat{y}_{MAE}} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{\hat{y}_{MAE}}^{\infty} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$\int_{-\infty}^{\hat{y}_{MAE}} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{\hat{y}_{MAE}}^{\infty} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Interpreting this graphically: $\hat{y}_{MAE}$ splits the posterior into two regions of equal area (Area A = Area B).
Observation: $\hat{y}_{MAE}$ = median of $f_{y|\mathbf{x}}(y|\mathbf{x})$
Estimator Relations

If $f_{y|\mathbf{x}}(y|\mathbf{x})$ is symmetric, then
$$\hat{y}_{MAE} = \hat{y}_{MS}$$
Why? For a symmetric distribution the conditional mean is equal to the (median) symmetry point.
If $f_{y|\mathbf{x}}(y|\mathbf{x})$ is symmetric and unimodal, then
$$\hat{y}_{MAE} = \hat{y}_{MS} = \hat{y}_{MAP}$$
Why? The unimodal constraint implies that the single mode must be at the distribution symmetry point ⇒ the MAP estimate is located at the central point.
Example
Determine $\hat{y}_{MAE}$ for the previously considered case
$$f_{x,y}(x, y) = \begin{cases}10y & 0 \le y \le x^2,\; 0 \le x \le 1 \\ 0 & \text{otherwise}\end{cases}$$
We showed previously that
$$f_{y|x}(y|x) = \frac{2y}{x^4}, \quad 0 \le y \le x^2 \;\Rightarrow\; F_{y|x}(y|x) = \frac{y^2}{x^4}, \quad 0 \le y \le x^2$$
Thus, determining $\hat{y}_{MAE}$:
$$F_{y|x}(\hat{y}_{MAE}|x) = 1 - F_{y|x}(\hat{y}_{MAE}|x) \;\Rightarrow\; \frac{\hat{y}_{MAE}^2}{x^4} = 1 - \frac{\hat{y}_{MAE}^2}{x^4} \;\Rightarrow\; \hat{y}_{MAE} = \frac{x^2}{\sqrt{2}}$$
MAP estimate (previous result):
$$\hat{y}_{MAP}(x) = \arg\max_y f_{y|x}(y|x) = x^2$$
MS estimate (previous result):
$$\hat{y}_{MS}(x) = E\{y|x\} = \frac{2}{3}x^2$$
MAE estimate:
$$\hat{y}_{MAE}(x) = \text{median of } f_{y|x}(y|x) = \frac{x^2}{\sqrt{2}}$$

[Figure: the ramp posterior $f_{y|x}(y|x)$ on $[0, x^2]$ with the three estimates marked; note $\frac{2}{3}x^2 < \frac{x^2}{\sqrt{2}} < x^2$.]
Final ML and MAP Comments

ML estimation was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922.
Under fairly weak regularity conditions the ML estimate is asymptotically optimal:
The ML estimate is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity
The ML estimate is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound as the number of samples tends to infinity; consequently, no unbiased estimator has lower asymptotic mean squared error than the ML estimator
The ML estimate is asymptotically normal, i.e., as the number of samples increases, the distribution of the ML estimate tends to the Gaussian distribution
MAP estimation is a generalization of ML estimation that incorporates the prior distribution of the quantity being estimated.
Wiener (MSE) Filtering Theory

Problem Statement
Produce an estimate of a desired process statistically related to a set of observations.

[Figure: the input x(n) drives a filter whose output y(n) is subtracted from the desired response d(n) to form the estimate error e(n).]

Historical Notes: The linear filtering problem was solved by
Andrey Kolmogorov for discrete time – his 1938 paper "established the basic theorems for smoothing and predicting stationary stochastic processes"
Norbert Wiener in 1941 for continuous time – not published until the 1949 paper Extrapolation, Interpolation, and Smoothing of Stationary Time Series
System restrictions and considerations:
Filter is linear
Filter is discrete time
Filter is finite impulse response (FIR)
The process is WSS
Statistical optimization is employed
For the discrete-time case:

[Figure: a tapped-delay-line filter; x(n) passes through delays $z^{-1}$, the taps are weighted by $w_0^*, w_1^*, \ldots, w_{M-1}^*$ and summed to form $\hat{d}(n)$, which is subtracted from d(n) to give e(n).]

The filter impulse response is finite and given by
$$h_k = \begin{cases}w_k^* & k = 0, 1, \cdots, M-1 \\ 0 & \text{otherwise}\end{cases}$$
The output $\hat{d}(n)$ is an estimate of the desired signal d(n).
x(n) and d(n) are statistically related ⇒ $\hat{d}(n)$ and d(n) are statistically related.
With
$$\mathbf{p} = E\{\mathbf{x}(n)d^*(n)\} \quad\text{[cross-correlation between } \mathbf{x}(n) \text{ and } d(n)\text{]}$$
the MSE can be compactly expressed as
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
where we have assumed x(n) and d(n) are zero mean and WSS.
The MSE criterion as a function of the filter weight vector w:
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
Observation: The error is a quadratic function of w.
Consequence: The error is an M-dimensional bowl-shaped function of w with a unique minimum.
Result: The optimal weight vector, $\mathbf{w}_0$, is determined by differentiating J(w) and setting the result to zero:
$$\nabla_{\mathbf{w}}J(\mathbf{w})\big|_{\mathbf{w}=\mathbf{w}_0} = 0$$
A closed form solution exists.
Example
Consider a two-dimensional case, i.e., an M = 2 tap filter.

[Figure: the bowl-shaped error surface $J(w_0, w_1)$ and its elliptical error contours.]
Aside (Matrix Differentiation): For complex data,
$$w_k = a_k + jb_k, \quad k = 0, 1, \cdots, M-1$$
the gradient, with respect to $w_k$, is
$$\nabla_k(J) = \frac{\partial J}{\partial a_k} + j\frac{\partial J}{\partial b_k}, \quad k = 0, 1, \cdots, M-1$$
The complete gradient is thus given by
$$\nabla_{\mathbf{w}}(J) = \begin{bmatrix}\nabla_0(J)\\\nabla_1(J)\\\vdots\\\nabla_{M-1}(J)\end{bmatrix} = \begin{bmatrix}\frac{\partial J}{\partial a_0} + j\frac{\partial J}{\partial b_0}\\\frac{\partial J}{\partial a_1} + j\frac{\partial J}{\partial b_1}\\\vdots\\\frac{\partial J}{\partial a_{M-1}} + j\frac{\partial J}{\partial b_{M-1}}\end{bmatrix}$$
Example
Let c and w be $M \times 1$ complex vectors. For $g = \mathbf{c}^H\mathbf{w}$, find $\nabla_{\mathbf{w}}(g)$.

Note
$$g = \mathbf{c}^H\mathbf{w} = \sum_{k=0}^{M-1}c_k^* w_k = \sum_{k=0}^{M-1}c_k^*(a_k + jb_k)$$
Thus
$$\nabla_k(g) = \frac{\partial g}{\partial a_k} + j\frac{\partial g}{\partial b_k} = c_k^* + j(jc_k^*) = 0, \quad k = 0, 1, \cdots, M-1$$
Result: For $g = \mathbf{c}^H\mathbf{w}$,
$$\nabla_{\mathbf{w}}(g) = [\nabla_0(g), \nabla_1(g), \cdots, \nabla_{M-1}(g)]^T = \mathbf{0}$$
Example
Now suppose $g = \mathbf{w}^H\mathbf{c}$. Find $\nabla_{\mathbf{w}}(g)$.

In this case,
$$g = \mathbf{w}^H\mathbf{c} = \sum_{k=0}^{M-1}w_k^* c_k = \sum_{k=0}^{M-1}c_k(a_k - jb_k)$$
and
$$\nabla_k(g) = \frac{\partial g}{\partial a_k} + j\frac{\partial g}{\partial b_k} = c_k + j(-jc_k) = 2c_k, \quad k = 0, 1, \cdots, M-1$$
Result: For $g = \mathbf{w}^H\mathbf{c}$,
$$\nabla_{\mathbf{w}}(g) = [\nabla_0(g), \nabla_1(g), \cdots, \nabla_{M-1}(g)]^T = 2\mathbf{c}$$
Example
Lastly, suppose $g = \mathbf{w}^H Q\mathbf{w}$. Find $\nabla_{\mathbf{w}}(g)$.

In this case,
$$g = \sum_{i=0}^{M-1}\sum_{j=0}^{M-1}w_i^* w_j q_{i,j} = \sum_{i=0}^{M-1}\sum_{j=0}^{M-1}(a_i - jb_i)(a_j + jb_j)q_{i,j}$$
$$\Rightarrow \nabla_k(g) = \frac{\partial g}{\partial a_k} + j\frac{\partial g}{\partial b_k} = 2\sum_{j=0}^{M-1}(a_j + jb_j)q_{k,j} = 2\sum_{j=0}^{M-1}w_j q_{k,j}$$
Result: For $g = \mathbf{w}^H Q\mathbf{w}$,
$$\nabla_{\mathbf{w}}(g) = \begin{bmatrix}\nabla_0(g)\\\nabla_1(g)\\\vdots\\\nabla_{M-1}(g)\end{bmatrix} = 2\begin{bmatrix}\sum_{i=0}^{M-1}q_{0,i}w_i\\\sum_{i=0}^{M-1}q_{1,i}w_i\\\vdots\\\sum_{i=0}^{M-1}q_{M-1,i}w_i\end{bmatrix} = 2Q\mathbf{w}$$
Observation: The differentiation result depends on the matrix ordering.
Returning to the MSE performance criterion,
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
Approach: Minimize the error by differentiating with respect to w and setting the result to 0:
$$\nabla_{\mathbf{w}}(J) = 0 - 0 - 2\mathbf{p} + 2R\mathbf{w} = 0$$
$$\Rightarrow R\mathbf{w}_0 = \mathbf{p} \qquad \text{[normal equation]}$$
Result: The Wiener filter coefficients are defined by
$$\mathbf{w}_0 = R^{-1}\mathbf{p}$$
Question: Does $R^{-1}$ always exist? Recall R is positive semi-definite, and usually positive definite.
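A numerical illustration of the normal equation for a small system-identification setup (the 3-tap channel and noise level are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
N, w_true = 50_000, np.array([0.5, -0.3, 0.1])
x = rng.normal(size=N)
d = np.convolve(x, w_true)[:N] + 0.1 * rng.normal(size=N)  # desired signal

M = 3
X = np.stack([np.roll(x, k) for k in range(M)], axis=1)[M:]  # x(n), ..., x(n-2)
R = X.T @ X / len(X)            # correlation matrix estimate
p = X.T @ d[M:] / len(X)        # cross-correlation estimate

w0 = np.linalg.solve(R, p)      # w0 = R^{-1} p (normal equation)
print(w0)                        # approaches w_true
```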
Orthogonality Principle

Consider again the normal equation that defines the optimal solution:
$$R\mathbf{w}_0 = \mathbf{p} \;\Rightarrow\; E\{\mathbf{x}(n)\mathbf{x}^H(n)\}\mathbf{w}_0 = E\{\mathbf{x}(n)d^*(n)\}$$
Rearranging,
$$E\{\mathbf{x}(n)d^*(n)\} - E\{\mathbf{x}(n)\mathbf{x}^H(n)\}\mathbf{w}_0 = \mathbf{0}$$
$$E\{\mathbf{x}(n)[d^*(n) - \mathbf{x}^H(n)\mathbf{w}_0]\} = \mathbf{0}$$
$$E\{\mathbf{x}(n)e_0^*(n)\} = \mathbf{0}$$
Note: $e_0^*(n)$ is the error when the optimal weights are used, i.e.,
$$e_0^*(n) = d^*(n) - \mathbf{x}^H(n)\mathbf{w}_0$$
Thus
$$E\{\mathbf{x}(n)e_0^*(n)\} = E\begin{bmatrix}x(n)e_0^*(n)\\x(n-1)e_0^*(n)\\\vdots\\x(n-M+1)e_0^*(n)\end{bmatrix} = \begin{bmatrix}0\\0\\\vdots\\0\end{bmatrix}$$

Orthogonality Principle
A necessary and sufficient condition for a filter to be optimal is that the estimate error, $e_0(n)$, be orthogonal to each input sample in $\mathbf{x}(n)$.
Interpretation: The observation samples and the error are orthogonal and contain no mutual "information".
Minimum MSE

Objective: Determine the minimum MSE.
Approach: Use the optimal weights $\mathbf{w}_0 = R^{-1}\mathbf{p}$ in the MSE expression:
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
$$\Rightarrow J_{\min} = \sigma_d^2 - \mathbf{p}^H\mathbf{w}_0 - \mathbf{w}_0^H\mathbf{p} + \mathbf{w}_0^H R(R^{-1}\mathbf{p}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w}_0 - \mathbf{w}_0^H\mathbf{p} + \mathbf{w}_0^H\mathbf{p} = \sigma_d^2 - \mathbf{p}^H\mathbf{w}_0$$
Result:
$$J_{\min} = \sigma_d^2 - \mathbf{p}^H R^{-1}\mathbf{p}$$
where the substitution $\mathbf{w}_0 = R^{-1}\mathbf{p}$ has been employed.
Excess MSE

Objective: Consider the excess MSE introduced by using a weight vector that is not optimal.

Adaptive optimization and filtering methods are appropriate, advantageous, or necessary when:
Signal statistics are not known a priori and must be "learned" from observed or representative samples
Signal statistics evolve over time
Time or computational restrictions dictate that simple, if repetitive, operations be employed rather than solving more complex, closed form expressions
To be considered are the following algorithms:
Steepest Descent (SD) – deterministic
Least Mean Squares (LMS) – stochastic
Recursive Least Squares (RLS) – deterministic
Steepest descent, also known as gradient descent, is an iterative technique for finding a local minimum of a function.
Approach: Given an arbitrary starting point, the current location (value) is moved in steps proportional to the negative of the gradient at the current point.
SD is an old, deterministic method that is the basis for stochastic gradient based methods
SD is a feedback approach to finding a local minimum of an error performance surface
The error surface must be known a priori
In the MSE case, SD converges to the optimal solution, $\mathbf{w}_0 = R^{-1}\mathbf{p}$, without inverting a matrix
Question: Why, in the MSE case, does this converge to the global minimum rather than a local minimum?
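A minimal Python sketch of the SD recursion $\mathbf{w}(n+1) = \mathbf{w}(n) + \mu(\mathbf{p} - R\mathbf{w}(n))$ on a 2-tap example (the statistics and step size are illustrative; this recursion form, with unit gradient scaling, is one common convention):

```python
import numpy as np

def steepest_descent(R, p, mu, iters=200):
    # Converges to R^{-1} p when 0 < mu < 2/lambda_max
    w = np.zeros_like(p)
    for _ in range(iters):
        w = w + mu * (p - R @ w)    # step along the negative gradient
    return w

R = np.array([[1.0, 0.5], [0.5, 1.0]])
p = np.array([0.7, 0.1])
print(steepest_descent(R, p, mu=0.5))
print(np.linalg.solve(R, p))        # same answer; SD needed no inverse
```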
Independence Theorem
The following conditions hold:
1. The vectors $\mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(n)$ are statistically independent
2. $\mathbf{x}(n)$ is independent of $d(1), d(2), \cdots, d(n-1)$
3. $d(n)$ is statistically dependent on $\mathbf{x}(n)$, but is independent of $d(1), d(2), \cdots, d(n-1)$
4. $\mathbf{x}(n)$ and $d(n)$ are mutually Gaussian

The independence theorem is invoked in the LMS algorithm analysis
The independence theorem is justified in some cases, e.g., beamforming, where we receive independent vector observations
In other cases it is not well justified, but allows the analysis to proceed (i.e., when all else fails, invoke simplifying assumptions)
Figure: Learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, step-size parameter μ = 0.075, and varying eigenvalue spread χ(R).

Figure: Learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, fixed eigenvalue spread, and varying step-size parameter μ.
The above follows from the fact that $\Phi(n)$ and $e_0^*(i)$ are independent. Why? $e_0(i)$ is independent of all observations, and the $\mathbf{x}(i)$ terms are given, uniquely defining $\Phi(n)$ ⇒ independence of $\Phi(n)$ and $e_0^*(i)$.
Result: The RLS algorithm is unbiased and convergent in the mean for $n \ge M$.
Question: How does this compare to the LMS algorithm?

The RLS algorithm converges faster than the LMS algorithm
The RLS algorithm has lower steady state error than the LMS algorithm
Application: Blind Deconvolution

Motivation:
Adaptive equalizers typically require a training period during which they operate on known signals/statistics.
This known-signal training is not always appropriate, e.g., in mobile communications:
Cost is too high (time/bandwidth)
Multipathing or other time-varying interference
In such cases, we must use blind equalization.
System Model

System components and assumptions:
Channel introduces distortion (dominant)
System has additive noise (not dominant)
Assume a baseband model of communications
Dominating interference is due to intersymbol interference (ISI) from channel distortion ⇒ the noise is ignored
Also assume that:
$h_n \ne 0$ for $n < 0$ (noncausal)
$\sum_k h_k^2 = 1$ (to keep the variance of the output constant)

Example (4-ary PAM modulation)
A 4-ary PAM modulation scheme uses 4 signals.

[Figure: four amplitude levels placed symmetrically about 0 on the x axis, ordered $S_4, S_2, S_1, S_3$.]
To solve the equalization problem, we need a statistical model of the data. Assume:
1. The data is white:
$$E\{x(n)\} = 0, \qquad E\{x(n)x(k)\} = \begin{cases}1 & k = n \\ 0 & \text{otherwise}\end{cases}$$
2. The pdf of x(n) is symmetric and uniform.

[Figure: the uniform density $f_x(x)$ of height $\frac{1}{2\sqrt{3}}$ on $[-\sqrt{3}, \sqrt{3}]$, consistent with zero mean and unit variance.]
Deconvolution Objective: If $\{w_i\}$ are the coefficients of the ideal inverse filter, then
$$\sum_i w_i h_{l-i} = \delta_l = \begin{cases}1 & l = 0 \\ 0 & \text{else}\end{cases}$$
If this is the case, the output of the equalizer is
$$y(n) = \sum_i w_i u(n-i) = \sum_i\sum_k w_i h_k x(n-i-k) \quad[\text{let } k = l - i]$$
$$= \sum_l x(n-l)\sum_i w_i h_{l-i} = \sum_l x(n-l)\delta_l = x(n)$$
Problem: $h_n$ is not known ⇒ the exact inverse cannot be used.
Solution: Use an iterative procedure to find the filter.
Let the output at iteration n be given by
$$y(n) = \sum_{i=-L}^{L} w_i(n)u(n-i)$$
where a $2L + 1$ tap filter is used.
Setting $w_i(n) = 0$ for $|i| > L$, we can write
$$y(n) = \sum_i w_i(n)u(n-i) = \sum_i w_i u(n-i) + \sum_i[w_i(n) - w_i]u(n-i) = x(n) + v(n)$$
$$\left[\text{since } x(n) = \sum_i w_i u(n-i)\right]$$
Note: $v(n) = \sum_i[w_i(n) - w_i]u(n-i)$ is the residual ISI.
Interpretation: $v(n) = \sum_i[w_i(n) - w_i]u(n-i)$ is the convolution noise (residual ISI) resulting from the fact that the ideal filter was not used.
Estimation Approach: Apply the output $y(n) = x(n) + v(n)$ to a zero-memory nonlinear estimator:
$$\hat{x}(n) = g[y(n)]$$
Complete System: The nonlinear estimate $g[y(n)]$ can be used to update the equalizer to produce a better estimate at time n + 1.
Optimization

Define the equalizer error to be
$$e(n) = \hat{x}(n) - y(n)$$
Use e(n) in the LMS algorithm to update the equalizer weights.
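A sketch of one such blind LMS update, using the nonlinear estimate as the reference; the slicer $g = \mathrm{sign}$ (a decision-directed flavor), the channel, and all parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(10)

def blind_lms_step(w, u_vec, mu, g):
    # y(n) = w^T u(n); e(n) = g(y) - y = x_hat(n) - y(n); LMS weight update
    y = float(w @ u_vec)
    e = g(y) - y
    return w + mu * e * u_vec

# +/-1 symbols through a mild channel; slicer as the estimator g
x = rng.choice([-1.0, 1.0], 20_000)
u = x + 0.3 * np.roll(x, 1)                # channel output u(n)
w = np.zeros(11); w[5] = 1.0               # center-spike initialization
for n in range(10, len(u)):
    w = blind_lms_step(w, u[n-10:n+1][::-1], mu=0.001, g=np.sign)
```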