ELEG-636: Statistical Signal Processing
Kenneth E. Barner
Department of Electrical and Computer Engineering, University of Delaware
Spring 2009

Course Objectives & Structure

Objective: Given a discrete-time sequence {x(n)}, develop
Statistical and spectral signal representations
Filtering, prediction, and system identification algorithms
Optimization methods that are statistical and adaptive

Course Structure:
Weekly lectures [notes: www.ece.udel.edu/~barner]
Periodic homework (theory & Matlab implementations) [10%]
Midterm & Final examinations [80%]
Final project [10%]
Probability

Signal Characterization

Assumption: Many methods take {x(n)} to be deterministic.
Reality: Real-world signals are usually statistical in nature.
Thus,
$$\ldots, x(-1), x(0), x(1), \ldots$$
can be interpreted as a sequence of random variables. We begin by analyzing each observation x(n) as a R.V. Then, to capture dependencies, we consider random vectors.
These assignments yield different distribution functions, e.g., for the coin-toss RVs x and y,
$$F_x(2) = \Pr\{HH, HT\} = 1/2, \qquad F_y(2) = \Pr\{HH, HT, TH\} = 3/4$$
How do we attain an intuitive interpretation of the distribution function?
Random Variables

Distribution Plots

[Figure: the staircase distribution functions $F_x(x)$ and $F_y(y)$ for the coin-toss example.]

Note the distribution properties hold:
1. $F(+\infty) = 1$, $F(-\infty) = 0$
2. $F(x)$ is continuous from the right: $F(x^+) = F(x)$
3. $\Pr\{x_1 < x \le x_2\} = F(x_2) - F(x_1)$
Definition
The probability density function is defined as
$$f(x) = \frac{dF(x)}{dx} \quad\text{or}\quad F(x) = \int_{-\infty}^{x} f(x)\,dx$$
Thus $F(\infty) = 1 \Rightarrow \int_{-\infty}^{\infty} f(x)\,dx = 1$.

Types of distributions:
Continuous: $\Pr\{x = x_0\} = 0$ for all $x_0$
Discrete: $F(x_i) - F(x_i^-) = \Pr\{x = x_i\} = P_i$, in which case $f(x) = \sum_i P_i\,\delta(x - x_i)$
Mixed: discontinuous but not discrete
Distribution Examples

Uniform: $x \sim U(a, b)$, $a < b$
$$f(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b] \\ 0 & \text{else} \end{cases}$$

[Figure: the uniform pdf $f(x)$, of height $(b-a)^{-1}$ on $[a, b]$, and the cdf $F(x)$ rising linearly from 0 to 1 over $[a, b]$.]
Gaussian: $x \sim N(\mu, \sigma)$
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

[Figure: the Gaussian pdf $f(x)$, symmetric about $\mu$, and cdf $F(x)$, which passes through 1/2 at $x = \mu$ and approaches 1.]

Historical Note: First introduced by Abraham de Moivre in 1733; rigorously justified by Gauss in 1809; extended by Laplace in 1812.
Why not the de Moivre distribution? Stigler's law.
Gaussian Distribution Example

Example
Consider the Normal (Gaussian) distribution PDF and CDF for $\mu = 0$, $\sigma^2 = 0.2, 1.0, 5.0$ and $\mu = -2$, $\sigma^2 = 0.5$.

[Figure: the cdfs $F_{\mu,\sigma^2}(x)$ and pdfs $f_{\mu,\sigma^2}(x)$ for the four $(\mu, \sigma^2)$ pairs above, plotted over $-5 \le x \le 5$; smaller variances give taller, narrower pdfs.]
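These curves are easy to reproduce numerically. Below is a minimal Python sketch (the course homework uses Matlab; this is an equivalent illustration, and the function names are ours):

```python
from math import erf, exp, sqrt, pi

def gauss_pdf(x, mu=0.0, var=1.0):
    # f(x) = exp(-(x - mu)^2 / (2 var)) / sqrt(2 pi var)
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def gauss_cdf(x, mu=0.0, var=1.0):
    # F(x) expressed through the error function
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2 * var)))

# Peak height grows as the variance shrinks; F(mu) = 1/2 in every case.
for mu, var in [(0, 0.2), (0, 1.0), (0, 5.0), (-2, 0.5)]:
    print(f"mu={mu}, var={var}: f(mu)={gauss_pdf(mu, mu, var):.3f}, "
          f"F(mu)={gauss_cdf(mu, mu, var):.2f}")
```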
Binomial: $x \sim B(p, q)$, $p + q = 1$

Example
Toss a coin n times. What is the probability of getting k heads?
For $p + q = 1$, where $q$ is the probability of a tail and $p$ is the probability of a head,
$$\Pr\{x = k\} = \binom{n}{k} p^k q^{n-k} \qquad \left[\text{Note: } \binom{n}{k} = \frac{n!}{(n-k)!\,k!}\right]$$
$$\Rightarrow f(x) = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}\, \delta(x - k)$$
$$\Rightarrow F(x) = \sum_{k=0}^{m} \binom{n}{k} p^k q^{n-k}, \qquad m \le x < m + 1$$
Binomial Distribution Example I

Example
Toss a coin n times. What is the probability of getting k heads? For $n = 9$, $p = q = \frac{1}{2}$ (fair coin).

[Figure: the resulting pmf $f(x)$, peaking near $k = 4, 5$, and the staircase cdf $F(x)$.]
Binomial Distribution Example II

Example
Toss a coin n times. What is the probability of getting k heads? For $n = 20$, $p = 0.5, 0.7$ and $n = 40$, $p = 0.5$.

[Figure: pmfs and cdfs for (p = 0.5, n = 20), (p = 0.7, n = 20), and (p = 0.5, n = 40); larger n spreads the mass, and p > 0.5 shifts it right.]
Conditional Distributions

Definition
The conditional distribution of x given that event M has occurred is
$$F_x(x_0|M) = \Pr\{x \le x_0 | M\} = \frac{\Pr\{x \le x_0, M\}}{\Pr\{M\}}$$

Example
Suppose $M = \{x \le a\}$. Then
$$F_x(x_0|M) = \frac{\Pr\{x \le x_0, M\}}{\Pr\{x \le a\}}$$
If $x_0 \ge a$, what happens?
Special Cases

Special Case: $x_0 \ge a$
$$\Pr\{x \le x_0, x \le a\} = \Pr\{x \le a\} \;\Rightarrow\; F_x(x_0|M) = \frac{\Pr\{x \le x_0, M\}}{\Pr\{x \le a\}} = \frac{\Pr\{x \le a\}}{\Pr\{x \le a\}} = 1$$

Special Case: $x_0 \le a$
$$F_x(x_0|M) = \frac{\Pr\{x \le x_0, M\}}{\Pr\{x \le a\}} = \frac{\Pr\{x \le x_0\}}{\Pr\{x \le a\}} = \frac{F_x(x_0)}{F_x(a)}$$
Conditional Distribution Example

Example
What does $F_x(x|M)$ look like for $M = \{x \le a\}$?
$$F_x(x_0|M) = \begin{cases} \frac{F_x(x_0)}{F_x(a)} & x_0 \le a \\ 1 & a \le x_0 \end{cases}$$

[Figure: $F_x(x | x \le a)$ coincides with $F_x(x)/F_x(a)$ below $a$ and saturates at 1 thereafter.]

Distribution properties hold for conditional cases:
Limiting cases: $F(\infty|M) = 1$ and $F(-\infty|M) = 0$
Probability range: $\Pr\{x_0 \le x \le x_1 | M\} = F(x_1|M) - F(x_0|M)$
Density–distribution relations:
$$f(x|M) = \frac{\partial F(x|M)}{\partial x}, \qquad F(x_0|M) = \int_{-\infty}^{x_0} f(x|M)\,dx$$
Example (Fair Coin Toss)
Toss a fair coin 4 times. Let x be the number of heads. Determine $\Pr\{x = k\}$.

Recall
$$\Pr\{x = k\} = \binom{n}{k} p^k q^{n-k}$$
In this case
$$\Pr\{x = k\} = \binom{4}{k} \left(\frac{1}{2}\right)^4$$
$$\Pr\{x = 0\} = \Pr\{x = 4\} = \frac{1}{16}, \quad \Pr\{x = 1\} = \Pr\{x = 3\} = \frac{1}{4}, \quad \Pr\{x = 2\} = \frac{3}{8}$$
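These values are easy to verify numerically; a small Python check (function name is illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    # Pr{x = k} = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, n = 4: expect 1/16, 1/4, 3/8, 1/4, 1/16
print([binom_pmf(k, 4, 0.5) for k in range(5)])
# [0.0625, 0.25, 0.375, 0.25, 0.0625]
```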
Density and Distribution Plots for Fair Coin (n = 4) Example

[Figure: the pmf $f(x)$ with masses 1/16, 1/4, 3/8, 1/4, 1/16 at $x = 0, \ldots, 4$, and the staircase cdf $F(x)$ passing through 1/16, 5/16, 11/16, 15/16, 1.]

What type of distribution is this? Discrete. Thus,
$$F(x_i) - F(x_i^-) = \Pr\{x = x_i\} = P_i$$
$$F(x) = \int_{-\infty}^{x} f(x)\,dx = \int_{-\infty}^{x} \sum_i P_i\,\delta(x - x_i)\,dx$$
Conditional Case

Example (Conditional Fair Coin Toss)
Toss a fair coin 4 times. Let x be the number of heads. Suppose M = [at least one flip produces a head]. Determine $\Pr\{x = k|M\}$.

Recall,
$$\Pr\{x = k|M\} = \frac{\Pr\{x = k, M\}}{\Pr\{M\}}$$
Thus first determine $\Pr\{M\}$:
$$\Pr\{M\} = 1 - \Pr\{\text{No heads}\} = 1 - \frac{1}{16} = \frac{15}{16}$$
Probability Conditional Distributions
Next determine Pr{x = k |M} for the individual cases, k = 0, 1, 2, 3, 4
Pr{x = 0|M} =Pr{x = 0, M}
Pr{M} = 0
Pr{x = 1|M} =Pr{x = 1, M}
Pr{M}=
Pr{x = 1}Pr{M} =
1/415/16
=4
15
Pr{x = 2|M} =Pr{x = 2}
Pr{M} =3/8
15/16=
615
Pr{x = 3|M} =415
Pr{x = 4|M} =115
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 23 / 406
Conditional and Unconditional Density Functions

[Figure: the unconditional pmf $f(x)$ (masses 1/16, 1/4, 3/8, 1/4, 1/16) and the conditional pmf $f(x|M)$ (masses 4/15, 6/15, 4/15, 1/15 at $x = 1, \ldots, 4$).]

Are they proper density functions?
Total Probability and Bayes' Theorem

Let $M_1, M_2, \ldots, M_n$ form a partition of S, i.e.,
$$\bigcup_i M_i = S \quad\text{and}\quad M_i \cap M_j = \phi \;\;(i \ne j)$$
Then
$$F(x) = \sum_i F_x(x|M_i)\Pr(M_i), \qquad f(x) = \sum_i f_x(x|M_i)\Pr(M_i)$$

Aside
$$\Pr\{A|B\} = \frac{\Pr\{A, B\}}{\Pr\{B\}} = \frac{\Pr\{B, A\}}{\Pr\{A\}}\cdot\frac{\Pr\{A\}}{\Pr\{B\}} = \frac{\Pr\{B|A\}\Pr\{A\}}{\Pr\{B\}}$$
From this we get
$$\Pr\{M|x \le x_0\} = \frac{\Pr\{x \le x_0|M\}\Pr\{M\}}{\Pr\{x \le x_0\}} = \frac{F(x_0|M)\Pr\{M\}}{F(x_0)}$$
and
$$\Pr\{M|x = x_0\} = \frac{f(x_0|M)\Pr\{M\}}{f(x_0)}$$
By integration,
$$\int_{-\infty}^{\infty} \Pr\{M|x = x_0\} f(x_0)\,dx_0 = \int_{-\infty}^{\infty} f(x_0|M)\Pr\{M\}\,dx_0 = \Pr\{M\}\int_{-\infty}^{\infty} f(x_0|M)\,dx_0 = \Pr\{M\}$$
$$\Rightarrow \Pr\{M\} = \int_{-\infty}^{\infty} \Pr\{M|x = x_0\} f(x_0)\,dx_0$$
Putting it all Together: Bayes' Theorem

Bayes' Theorem:
$$f(x_0|M) = \frac{\Pr\{M|x = x_0\} f(x_0)}{\Pr\{M\}} = \frac{\Pr\{M|x = x_0\} f(x_0)}{\int_{-\infty}^{\infty} \Pr\{M|x = x_0\} f(x_0)\,dx_0}$$
Functions of a R.V.

Problem Statement
Let x and g(x) be RVs such that $y = g(x)$.
Question: How do we determine the distribution of y?

Note
$$F_y(y_0) = \Pr\{y \le y_0\} = \Pr\{g(x) \le y_0\} = \Pr\{x \in R_{y_0}\}$$
where
$$R_{y_0} = \{x : g(x) \le y_0\}$$
Question: If $y = g(x) = x^2$, what is $R_{y_0}$?
Example
Let $y = g(x) = x^2$. Determine $F_y(y_0)$.

[Figure: the parabola $y = x^2$; the event $\{y \le y_0\}$ corresponds to $-\sqrt{y_0} \le x \le \sqrt{y_0}$.]

Note that
$$F_y(y_0) = \Pr(y \le y_0) = \Pr(-\sqrt{y_0} \le x \le \sqrt{y_0}) = F_x(\sqrt{y_0}) - F_x(-\sqrt{y_0})$$
Example
Let $x \sim N(\mu, \sigma)$ and
$$y = U(x) = \begin{cases} 1 & \text{if } x > \mu \\ 0 & \text{if } x \le \mu \end{cases}$$
Determine $f_y(y_0)$ and $F_y(y_0)$.

[Figure: $f_y(y)$ consists of two impulses of weight 1/2 at $y = 0$ and $y = 1$; $F_y(y)$ steps from 0 to 1/2 at $y = 0$ and from 1/2 to 1 at $y = 1$.]
General Function of a Random Variable Case

To determine the density of $y = g(x)$ in terms of $f_x(x_0)$, look at $g(x)$. Suppose the horizontal slice $[y_0, y_0 + dy_0]$ is crossed at points $x_1, x_2, x_3$ (with $g$ decreasing at $x_2$):
$$f_y(y_0)\,dy_0 = \Pr(y_0 \le y \le y_0 + dy_0) = \Pr(x_1 \le x \le x_1 + dx_1) + \Pr(x_2 + dx_2 \le x \le x_2) + \Pr(x_3 \le x \le x_3 + dx_3)$$
$$= f_x(x_1)\,dx_1 + f_x(x_2)|dx_2| + f_x(x_3)\,dx_3 \qquad (*)$$
Note that
$$dx_1 = \frac{dx_1}{dy_0}\,dy_0 = \frac{dy_0}{dy_0/dx_1} = \frac{dy_0}{g'(x_1)}$$
Similarly,
$$dx_2 = \frac{dy_0}{g'(x_2)} \quad\text{and}\quad dx_3 = \frac{dy_0}{g'(x_3)}$$
Thus $(*)$ becomes
$$f_y(y_0)\,dy_0 = \frac{f_x(x_1)}{g'(x_1)}\,dy_0 + \frac{f_x(x_2)}{|g'(x_2)|}\,dy_0 + \frac{f_x(x_3)}{g'(x_3)}\,dy_0$$
or
$$f_y(y_0) = \frac{f_x(x_1)}{g'(x_1)} + \frac{f_x(x_2)}{|g'(x_2)|} + \frac{f_x(x_3)}{g'(x_3)}$$
Function of a R.V. Distribution General Result

Set $y = g(x)$ and let $x_1, x_2, \ldots$ be the roots, i.e.,
$$y = g(x_1) = g(x_2) = \cdots$$
Then
$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|} + \frac{f_x(x_2)}{|g'(x_2)|} + \cdots$$

Example
Suppose $x \sim U(-1, 2)$ and $y = x^2$. Determine $f_y(y)$.

[Figure: $f_x(x) = 1/3$ on $[-1, 2]$, and the parabola $y = x^2$, which maps $[-1, 1]$ two-to-one and $(1, 2]$ one-to-one onto y values.]
Note that
$$g(x) = x^2 \;\Rightarrow\; g'(x) = 2x$$
Consider the special cases separately.
Case 1: $0 \le y \le 1$
$$y = x^2 \;\Rightarrow\; x = \pm\sqrt{y}$$
$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|} + \frac{f_x(x_2)}{|g'(x_2)|} = \frac{1/3}{|2\sqrt{y}|} + \frac{1/3}{|-2\sqrt{y}|} = \frac{1/3}{\sqrt{y}}$$
Case 2: $1 \le y \le 4$
$$y = x^2 \;\Rightarrow\; x = \sqrt{y}$$
$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|} = \frac{1/3}{2\sqrt{y}} = \frac{1/6}{\sqrt{y}}$$
Result: For $x \sim U(-1, 2)$ and $y = x^2$,
$$f_y(y) = \begin{cases} \frac{1/3}{\sqrt{y}} & 0 \le y \le 1 \\ \frac{1/6}{\sqrt{y}} & 1 < y \le 4 \end{cases}$$

[Figure: $f_y(y)$ decays like $1/\sqrt{y}$ and jumps down by a factor of 2 at $y = 1$.]
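A Monte Carlo check of this density, as a Python sketch (sample size and binning are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 2.0, 1_000_000)   # x ~ U(-1, 2)
y = x**2                                 # y = g(x) = x^2

def f_y(y):
    # Derived density: (1/3)/sqrt(y) on [0, 1], (1/6)/sqrt(y) on (1, 4]
    return np.where(y <= 1.0, (1/3) / np.sqrt(y), (1/6) / np.sqrt(y))

hist, edges = np.histogram(y, bins=40, range=(0.01, 4.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# Rough agreement; tightest away from the y -> 0 singularity
print(np.max(np.abs(hist - f_y(centers))))
```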
Example
Let $x \sim N(\mu, \sigma)$ and $y = e^x$. Determine $f_y(y)$.

Note $g(x) \ge 0$ and $g'(x) = e^x$. Also, there is a single root (inverse solution):
$$x = \ln(y)$$
Therefore,
$$f_y(y) = \frac{f_x(x)}{|g'(x)|} = \frac{f_x(x)}{e^x}$$
Expressing this in terms of y through substitution yields
$$f_y(y) = \frac{f_x(\ln(y))}{e^{\ln(y)}} = \frac{f_x(\ln(y))}{y}$$
Note that x is Gaussian:
$$f_x(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \;\Rightarrow\; f_y(y) = \frac{1}{\sqrt{2\pi}\,y\,\sigma}\, e^{-\frac{(\ln(y)-\mu)^2}{2\sigma^2}}, \quad y > 0$$

[Figure: the resulting log-normal density $f_y(y)$.]
Distribution of $F_x(x)$

For any RV with continuous distribution $F_x(x)$, the RV $y = F_x(x)$ is uniform on [0, 1].

Proof: Note $0 < y < 1$. Since
$$g(x) = F_x(x), \qquad g'(x) = f_x(x)$$
Thus
$$f_y(y) = \frac{f_x(x)}{g'(x)} = \frac{f_x(x)}{f_x(x)} = 1$$

[Figure: $f_y(y) = 1$ on [0, 1] and $F_y(y) = y$ on [0, 1].]
Thus the function $g(x) = F_x(x)$ performs the mapping
$$x \;(\text{pdf } f_x(x)) \;\xrightarrow{\;F_x(\cdot)\;}\; U \;(\text{uniform on } [0, 1])$$
The converse also holds:
$$U \;(\text{uniform}) \;\xrightarrow{\;F_x^{-1}(\cdot)\;}\; x \;(\text{pdf } f_x(x))$$
Combining operations yields Synthesis: map x to a uniform RV via $F_x(\cdot)$, then to any desired distribution via $F_y^{-1}(\cdot)$.
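A minimal Python sketch of the synthesis idea, assuming a target law whose inverse cdf is available in closed form (here the exponential, which is an assumption of this illustration, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(0.0, 1.0, 100_000)   # U ~ uniform on [0, 1]

# For F(x) = 1 - exp(-lam x), the inverse is F^{-1}(u) = -ln(1 - u)/lam
lam = 2.0
x = -np.log(1.0 - u) / lam           # x now has pdf lam exp(-lam x)

print(x.mean(), 1 / lam)             # sample mean approaches E{x} = 1/lambda
```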
Mean and Variance

Definitions
$$\text{Mean:}\quad E\{x\} = \int_{-\infty}^{\infty} x f(x)\,dx$$
$$\text{Conditional Mean:}\quad E\{x|M\} = \int_{-\infty}^{\infty} x f(x|M)\,dx$$

Example
Suppose $M = \{x \ge a\}$. Then
$$E\{x|M\} = \int_{-\infty}^{\infty} x f(x|M)\,dx = \frac{\int_a^{\infty} x f(x)\,dx}{\int_a^{\infty} f(x)\,dx}$$
For a function of a RV, $y = g(x)$,
$$E\{y\} = \int_{-\infty}^{\infty} y f_y(y)\,dy = \int_{-\infty}^{\infty} g(x) f_x(x)\,dx$$

Example
Suppose g(x) is a step function equal to 1 for $x \le x_0$ and 0 otherwise. Determine $E\{g(x)\}$.
$$E\{g(x)\} = \int_{-\infty}^{\infty} g(x) f_x(x)\,dx = \int_{-\infty}^{x_0} f_x(x)\,dx = F_x(x_0)$$
Definition (Variance)
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \eta)^2 f(x)\,dx$$
where $\eta = E\{x\}$. Thus,
$$\sigma^2 = E\{(x - \eta)^2\} = E\{x^2\} - E^2\{x\}$$

Example
For $x \sim N(\eta, \sigma^2)$, determine the variance.
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\eta)^2}{2\sigma^2}}$$
Note: f(x) is symmetric about $x = \eta \Rightarrow E\{x\} = \eta$.
Also,
$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \;\Rightarrow\; \int_{-\infty}^{\infty} e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sqrt{2\pi}\,\sigma$$
$$\int_{-\infty}^{\infty} e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sqrt{2\pi}\,\sigma$$
Differentiating w.r.t. $\sigma$:
$$\int_{-\infty}^{\infty} \frac{(x-\eta)^2}{\sigma^3}\, e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sqrt{2\pi}$$
Rearranging yields
$$\int_{-\infty}^{\infty} (x - \eta)^2 \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\eta)^2}{2\sigma^2}}\,dx = \sigma^2$$
or
$$E\{(x - \eta)^2\} = \sigma^2$$
Moments

Definition (Moments)
Moments:
$$m_n = E\{x^n\} = \int_{-\infty}^{\infty} x^n f(x)\,dx$$
Central Moments:
$$\mu_n = E\{(x - \eta)^n\} = \int_{-\infty}^{\infty} (x - \eta)^n f(x)\,dx$$
From the binomial theorem,
$$\mu_n = E\{(x - \eta)^n\} = E\left\{\sum_{k=0}^{n} \binom{n}{k} x^k (-\eta)^{n-k}\right\} = \sum_{k=0}^{n} \binom{n}{k} m_k (-\eta)^{n-k}$$
$$\Rightarrow \mu_0 = 1, \quad \mu_1 = 0, \quad \mu_2 = \sigma^2, \quad \mu_3 = m_3 - 3\eta m_2 + 2\eta^3$$
Example
Let $x \sim N(0, \sigma^2)$. Prove
$$E\{x^n\} = \begin{cases} 0 & n = 2k+1 \\ 1 \cdot 3 \cdots (n-1)\,\sigma^n & n = 2k \end{cases}$$
For n odd,
$$E\{x^n\} = \int_{-\infty}^{\infty} x^n f(x)\,dx = 0$$
since $x^n$ is an odd function and f(x) is an even function.
To prove the second part, use the fact that
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}$$
Differentiate
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}$$
with respect to $\alpha$, k times:
$$\Rightarrow \int_{-\infty}^{\infty} x^{2k} e^{-\alpha x^2}\,dx = \frac{1 \cdot 3 \cdots (2k-1)}{2^k}\sqrt{\frac{\pi}{\alpha^{2k+1}}}$$
Let $\alpha = \frac{1}{2\sigma^2}$; then
$$\int_{-\infty}^{\infty} x^{2k} e^{-\frac{x^2}{2\sigma^2}}\,dx = 1 \cdot 3 \cdots (2k-1)\,\sigma^{2k+1}\sqrt{2\pi}$$
Setting $n = 2k$ and rearranging,
$$\int_{-\infty}^{\infty} x^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}\,dx = 1 \cdot 3 \cdots (n-1)\,\sigma^n \qquad \text{[QED]}$$
Note: Variance is a measure of a RV's concentration around its mean.
Tchebycheff Inequality

For any $\varepsilon > 0$,
$$\Pr(|x - \eta| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$$
To prove this, note
$$\Pr(|x - \eta| \ge \varepsilon) = \int_{-\infty}^{\eta-\varepsilon} f(x)\,dx + \int_{\eta+\varepsilon}^{\infty} f(x)\,dx = \int_{|x-\eta| \ge \varepsilon} f(x)\,dx$$
Also note that
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \eta)^2 f(x)\,dx \ge \int_{|x-\eta| \ge \varepsilon} (x - \eta)^2 f(x)\,dx$$

[Figure: a density f(x) with the two tails $x \le \eta - \varepsilon$ and $x \ge \eta + \varepsilon$ shaded.]
$$\sigma^2 \ge \int_{|x-\eta| \ge \varepsilon} (x - \eta)^2 f(x)\,dx$$
Using the fact that $|x - \eta| \ge \varepsilon$ in the above gives
$$\sigma^2 \ge \varepsilon^2 \int_{|x-\eta| \ge \varepsilon} f(x)\,dx = \varepsilon^2 \Pr\{|x - \eta| \ge \varepsilon\}$$
Rearranging gives the desired result:
$$\Pr\{|x - \eta| \ge \varepsilon\} \le \left(\frac{\sigma}{\varepsilon}\right)^2 \qquad \text{QED}$$
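A quick numerical illustration of the bound for a unit-variance Gaussian (the distribution choice is illustrative; the bound is loose here, as expected):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 1_000_000)    # eta = 0, sigma = 1

for eps in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x) >= eps)
    bound = 1.0 / eps**2                # sigma^2 / eps^2
    print(f"eps={eps}: Pr={empirical:.4f} <= bound={bound:.4f}")
```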
Characteristic & Moment Generating Functions

Definition (Characteristic Function)
The characteristic function of a random variable x with pdf $f_x(x)$ is defined by
$$\phi_x(\omega) = E\left(e^{j\omega x}\right) = \int_{-\infty}^{\infty} e^{j\omega x} f_x(x)\,dx$$
If $f_x(x)$ is symmetric about 0 ($f_x(x) = f_x(-x)$), then $\phi_x(\omega)$ is real.
The magnitude of the characteristic function is bounded by
$$|\phi_x(\omega)| \le \phi_x(0) = 1$$

Theorem (Characteristic Function for the Sum of Independent RVs)
Let $x_1, x_2, \ldots, x_N$ be independent (but not necessarily identically distributed) RVs and set $s_N = \sum_{i=1}^{N} a_i x_i$, where the $a_i$ are constants. Then
$$\phi_{s_N}(\omega) = \prod_{i=1}^{N} \phi_{x_i}(a_i\omega)$$
The theorem can be proved by a simple extension of the following: let x and y be independent. Then
$$\phi_{x+y}(\omega) = E\left(e^{j\omega(x+y)}\right) = E\left(e^{j\omega x} e^{j\omega y}\right) = E\left(e^{j\omega x}\right)E\left(e^{j\omega y}\right) = \phi_x(\omega)\phi_y(\omega)$$

Example
Determine the characteristic function of the sample mean operating on i.i.d. samples.
Note $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \Rightarrow a_i = \frac{1}{N}$
$$\Rightarrow \phi_{\bar{x}}(\omega) = \prod_{i=1}^{N} \phi_{x_i}(a_i\omega) = \left(\phi_{x_i}\left(\frac{\omega}{N}\right)\right)^N$$
The Moment Generating Function is realized by making the substitution $j\omega \to s$ in the above.

Definition (Moment Generating Function)
The moment generating function of a random variable x with pdf $f_x(x)$ is defined by
$$\Phi_x(s) = E(e^{sx}) = \int_{-\infty}^{\infty} e^{sx} f_x(x)\,dx$$
Note $\Phi_x(j\omega) = \phi_x(\omega)$.

Theorem (Moment Generation)
Provided that $\Phi_x(s)$ exists in an open interval around $s = 0$, the following holds:
$$m_n = E(x^n) = \Phi_x^{(n)}(0) = \frac{d^n\Phi_x}{ds^n}(0)$$
Simply noting that $\Phi_x^{(n)}(s) = E(x^n e^{sx})$ proves the result.
Example
Let x be exponentially distributed,
$$f(x) = \lambda e^{-\lambda x} U(x)$$
Determine $\eta = m_1$, $m_2$, and $\sigma^2$.

Note
$$\Phi_x(s) = \lambda\int_0^{\infty} e^{sx} e^{-\lambda x}\,dx = \lambda\int_0^{\infty} e^{-x(\lambda - s)}\,dx = \frac{\lambda}{\lambda - s}$$
Thus
$$\Phi_x^{(1)}(0) = \frac{1}{\lambda} \quad\text{and}\quad \Phi_x^{(2)}(0) = \frac{2}{\lambda^2}$$
and
$$E\{x\} = \frac{1}{\lambda}, \quad E\{x^2\} = \frac{2}{\lambda^2} \;\Rightarrow\; \sigma^2 = \frac{1}{\lambda^2}$$
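These moments are easy to check by simulation; a short Python sketch with an arbitrary $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
x = rng.exponential(1.0 / lam, 1_000_000)   # f(x) = lam exp(-lam x), x >= 0

print(x.mean(), 1 / lam)             # m1 = 1/lambda
print((x**2).mean(), 2 / lam**2)     # m2 = 2/lambda^2
print(x.var(), 1 / lam**2)           # sigma^2 = 1/lambda^2
```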
Bivariate Statistics

Given two RVs, x and y, the bivariate (joint) distribution is given by
$$F(x_0, y_0) = \Pr\{x \le x_0, y \le y_0\}$$

[Figure: the quadrant $\{x \le x_0, y \le y_0\}$ in the xy plane.]

Properties:
$$F(-\infty, y) = F(x, -\infty) = 0$$
$$F(\infty, \infty) = 1$$
$$F_x(x) = F(x, \infty), \qquad F_y(y) = F(\infty, y)$$
Special Cases

Case 1: $M = \{x_1 \le x \le x_2,\; y \le y_0\}$
$$\Rightarrow \Pr\{M\} = F(x_2, y_0) - F(x_1, y_0)$$

Case 2: $M = \{x \le x_0,\; y_1 \le y \le y_2\}$
$$\Rightarrow \Pr\{M\} = F(x_0, y_2) - F(x_0, y_1)$$
Case 3: $M = \{x_1 \le x \le x_2,\; y_1 \le y \le y_2\}$. Then
$$\Pr\{M\} = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1)$$
where the last term is added back because that region was subtracted twice.
Joint Statistics

Definition (Joint Statistics)
$$f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y} \quad\text{and}\quad F(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(\alpha, \beta)\,d\alpha\,d\beta$$
In general, for some region M, the joint statistics are
$$\Pr\{(x, y) \in M\} = \iint_M f(x, y)\,dx\,dy$$
Marginal Statistics: $F_x(x) = F(x, \infty)$ and $F_y(y) = F(\infty, y)$
$$\Rightarrow f_x(x) = \int_{-\infty}^{\infty} f(x, y)\,dy, \qquad f_y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$$
Independence

Definition (Independence)
Two RVs x and y are statistically independent if, for arbitrary events (regions) $x \in A$ and $y \in B$,
$$\Pr\{x \in A, y \in B\} = \Pr\{x \in A\}\Pr\{y \in B\}$$
Letting $A = \{x \le x_0\}$ and $B = \{y \le y_0\}$, we see x and y are independent iff
$$F_{x,y}(x, y) = F_x(x)F_y(y)$$
and, by differentiation,
$$f_{x,y}(x, y) = f_x(x)f_y(y)$$
If x and y are independent RVs, then
$$z = q(x) \quad\text{and}\quad w = h(y)$$
are also independent.

Function of Two RVs
Given two RVs, let $z = g(x, y)$. Define $D_z$ to be the xy plane region
$$\{z \le z_0\} = \{g(x, y) \le z_0\} = \{(x, y) \in D_z\}$$
Then
$$F_z(z_0) = \Pr\{z \le z_0\} = \Pr\{(x, y) \in D_z\} = \iint_{D_z} f(x, y)\,dx\,dy$$
Example
Let $z = x + y$. Then $z \le z_0$ gives the region $x + y \le z_0$, which is delineated by the line $x + y = z_0$.

[Figure: the half-plane below the line $x + y = z_0$.]

Thus
$$F_z(z_0) = \iint_{D_z} f(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{z_0-y} f(x, y)\,dx\,dy$$
We can obtain $f_z(z)$ by differentiation:
$$\frac{\partial F_z(z)}{\partial z} = \int_{-\infty}^{\infty}\frac{\partial}{\partial z}\int_{-\infty}^{z-y} f(x, y)\,dx\,dy$$
$$f_z(z) = \int_{-\infty}^{\infty} f(z - y, y)\,dy \qquad (*)$$
Note that if x and y are independent,
$$f(x, y) = f_x(x)f_y(y) \qquad (**)$$
Thus, utilizing $(**)$ in $(*)$,
$$f_z(z) = \int_{-\infty}^{\infty} f_x(z - y)f_y(y)\,dy = f_x(z) * f_y(z) \qquad \text{[convolution]}$$
Example
Let $z = x + y$ where x and y are independent with
$$f_x(x) = \alpha e^{-\alpha x}U(x), \qquad f_y(y) = \alpha e^{-\alpha y}U(y)$$
Then
$$f_z(z) = \int_{-\infty}^{\infty} f_x(z - y)f_y(y)\,dy = \alpha^2\int_0^{z} e^{-\alpha(z-y)} e^{-\alpha y}\,dy = \alpha^2 e^{-\alpha z}\int_0^{z} dy = \alpha^2 z\, e^{-\alpha z} U(z)$$
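A simulation check that the sum has the derived density (a minimal sketch; sample size and binning are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, n = 1.5, 1_000_000
z = rng.exponential(1/alpha, n) + rng.exponential(1/alpha, n)  # z = x + y

# Compare a histogram of z with alpha^2 z exp(-alpha z)
hist, edges = np.histogram(z, bins=50, range=(0, 6), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
target = alpha**2 * centers * np.exp(-alpha * centers)
print(np.max(np.abs(hist - target)))   # small for large n
```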
Example
Let $z = \max(x, y)$. Determine $F_z(z_0)$ and $f_z(z_0)$.

Note
$$F_z(z_0) = \Pr\{z \le z_0\} = \Pr\{\max(x, y) \le z_0\} = F_{xy}(z_0, z_0)$$

[Figure: the quadrant $\{x \le z_0, y \le z_0\}$.]
If x and y are independent,
$$F_z(z_0) = F_x(z_0)F_y(z_0)$$
and
$$f_z(z_0) = \frac{\partial F_z(z_0)}{\partial z_0} = \frac{\partial F_x(z_0)}{\partial z_0}F_y(z_0) + \frac{\partial F_y(z_0)}{\partial z_0}F_x(z_0) = f_x(z_0)F_y(z_0) + f_y(z_0)F_x(z_0)$$
Result: R is positive definite if the samples in x are not linearly dependent. In this case $R^{-1}$ exists.

Historical Note: Diagonal-constant matrices are named after the mathematician Otto Toeplitz (1881–1940).
Stationary Process and Models: Stochastic Models

A model is used to describe the hidden laws governing the generation of the physical data observed.
We assume that $x(n), x(n-1), \cdots$ have statistical dependencies that can be modeled as the output of a discrete-time linear filter driven by v(n), where v(n) is a purely random process.
Linear model types:
1. Auto-Regressive – no past model input samples used
2. Moving Average – no past model output samples used
3. Auto-Regressive Moving Average – both past input and output used
General Stochastic Model:
$$\begin{pmatrix}\text{Model}\\\text{output}\end{pmatrix} + \underbrace{\begin{pmatrix}\text{Linear combination}\\\text{of past outputs}\end{pmatrix}}_{\text{AR part}} = \underbrace{\begin{pmatrix}\text{Linear combination of}\\\text{present \& past inputs}\end{pmatrix}}_{\text{MA part}}$$
Three model possibilities:
1. AR – auto-regressive
2. MA – moving average
3. ARMA – mixed AR and MA

Model Input: assumed to be an i.i.d. zero mean Gaussian process:
$$E\{v(n)\} = 0 \;\text{ for all } n, \qquad E\{v(n)v^*(k)\} = \begin{cases}\sigma_v^2 & k = n \\ 0 & \text{otherwise}\end{cases}$$
Auto-Regressive Models

Definition (Auto-Regressive)
The time series {x(n)} is said to be generated by an AR model if
$$x(n) + a_1^* x(n-1) + \cdots + a_M^* x(n-M) = v(n)$$
or
$$x(n) = w_1^* x(n-1) + \cdots + w_M^* x(n-M) + v(n)$$
where $w_k = -a_k$.

This is an order M model, and v(n) is referred to as the noise term. Note that we can set $a_0 = 1$ and write
$$\sum_{k=0}^{M} a_k^* x(n-k) = v(n)$$
which is a convolution sum.
Thus, taking Z-transforms,
$$Z\{a_n^*\} = A(z) = \sum_{n=0}^{M} a_n^* z^{-n}$$
$$Z\{x(n)\} = X(z) = \sum_{n=0}^{\infty} x(n)z^{-n}, \qquad Z\{v(n)\} = V(z) = \sum_{n=0}^{\infty} v(n)z^{-n}$$
and
$$\sum_{k=0}^{M} a_k^* x(n-k) = v(n) \;\Rightarrow\; A(z)X(z) = V(z)$$
If we regard v(n) as the output, then x(n) is mapped to v(n) by
$$H_A(z) = \frac{V(z)}{X(z)} = A(z)$$
This is called the process analyzer:
The analyzer is an all-zero system
Impulse response is finite (FIR)
System is BIBO stable
If we view v(n) as the input, then we have the process generator, which maps v(n) to x(n) via
$$H_G(z) = \frac{X(z)}{V(z)} = \frac{1}{A(z)}$$
The process generator is an all-pole system
Impulse response is infinite (IIR)
System stability is an issue
Note
$$H_G(z) = \frac{1}{A(z)} = \frac{1}{\sum_{n=0}^{M} a_n^* z^{-n}}$$
Factor the denominator and represent $H_G(z)$ in terms of its poles:
$$H_G(z) = \frac{1}{(1 - p_1 z^{-1})(1 - p_2 z^{-1})\cdots(1 - p_M z^{-1})}$$
$p_1, p_2, \ldots, p_M$ are the poles of $H_G(z)$, defined as the roots of the characteristic equation
$$1 + a_1^* z^{-1} + a_2^* z^{-2} + \cdots + a_M^* z^{-M} = 0$$
$H_G(z)$ is all-pole (IIR) and BIBO stable only if all poles are inside the unit circle, i.e.,
$$|p_n| < 1, \quad n = 1, 2, \cdots, M$$
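A small Python sketch tying the model to its poles: pick example real AR(2) coefficients (ours, for illustration), confirm the pole magnitudes, and run the generator recursion:

```python
import numpy as np

rng = np.random.default_rng(5)
a = np.array([1.0, -0.75, 0.5])   # A(z) = 1 - 0.75 z^-1 + 0.5 z^-2
print(np.abs(np.roots(a)))        # pole magnitudes ~0.707 < 1 -> stable

# Process generator: x(n) = -a1 x(n-1) - a2 x(n-2) + v(n)
n = 10_000
x, v = np.zeros(n), rng.normal(0.0, 1.0, n)
for i in range(2, n):
    x[i] = -a[1] * x[i-1] - a[2] * x[i-2] + v[i]
print(x.var())                    # finite, since the generator is stable
```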
Moving Average Model

Definition (Moving Average)
The time series {x(n)} is said to be generated by a Moving Average (MA) model if
$$x(n) = v(n) + b_1^* v(n-1) + \cdots + b_K^* v(n-K)$$
where $b_1, b_2, \cdots, b_K$ are the parameters of the order K MA model.

v(n) is zero mean white Gaussian noise
The process generation model is all-zero (FIR)
Auto-Regressive Moving Average Model

Definition (Auto-Regressive Moving Average)
In this case, {x(n)} is a mixed process where the output is a function of past outputs and current/past inputs:
$$x(n) + a_1^* x(n-1) + \cdots + a_M^* x(n-M) = v(n) + b_1^* v(n-1) + \cdots + b_K^* v(n-K)$$
The order is (M, K).

v(n) is zero mean white Gaussian noise
The process model has zeros and poles (IIR)
Analysis of Stochastic Models

Wold Decomposition (after Herman Wold (1908–92))
Any WSS discrete-time stochastic process y(n) can be expressed as
$$y(n) = x(n) + s(n)$$
where:
x(n) and s(n) are uncorrelated
x(n) can be expressed by the MA model
$$x(n) = \sum_{k=0}^{\infty} b_k^* v(n-k)$$
with $b_0 = 1$ and $\sum_{k=0}^{\infty}|b_k| < \infty$
v(n) is white noise uncorrelated with s(n)
s(n) is perfectly predictable
Note: If B(z) is minimum phase, then it can be represented by an all-pole (AR) system.
AR models are widely used because they are tractable.
Asymptotic Statistics of AR Processes

Recall that {x(n)} is generated by
$$x(n) + a_1^* x(n-1) + a_2^* x(n-2) + \cdots + a_M^* x(n-M) = v(n)$$
or
$$x(n) = w_1^* x(n-1) + w_2^* x(n-2) + \cdots + w_M^* x(n-M) + v(n)$$
This is a linear constant-coefficient difference equation of order M driven by v(n).
Z-transform representation:
$$X(z) = \frac{V(z)}{1 + \sum_{k=1}^{M} a_k^* z^{-k}}$$
Inverse transforming $X(z) = \frac{V(z)}{1 + \sum_{k=1}^{M} a_k^* z^{-k}}$ yields
$$x(n) = \underbrace{x_c(n)}_{\text{Homogeneous Solution}} + \underbrace{x_p(n)}_{\text{Particular Solution}}$$
The particular solution is the result of driving $H_G(z)$ with v(n),
$$x_p(n) = H_G(z)v(n),$$
where $z^{-1}$ is taken as the delay operator.
The particular solution has stationary statistics.
The homogeneous solution is of the form
$$x_c(n) = B_1 p_1^n + B_2 p_2^n + \cdots + B_M p_M^n$$
where $p_1, p_2, \cdots, p_M$ are the roots of
$$1 + a_1^* z^{-1} + a_2^* z^{-2} + \cdots + a_M^* z^{-M} = 0$$
The B values depend on the initial conditions
The homogeneous solution is not stationary
The process is asymptotically stationary if $|p_n| < 1$ for all n
Correlation of a Stationary AR Process

Recall that an AR process can be written as
$$\sum_{k=0}^{M} a_k^* x(n-k) = v(n)$$
where $a_0 = 1$. Multiply both sides by $x^*(n-l)$ and take $E\{\cdot\}$:
$$E\left\{\sum_{k=0}^{M} a_k^* x(n-k)x^*(n-l)\right\} = E\{v(n)x^*(n-l)\}$$
Note that
$$E\{x(n-k)x^*(n-l)\} = r(l-k), \qquad E\{v(n)x^*(n-l)\} = 0 \;\text{ for } l > 0$$
Thus
$$E\left\{\sum_{k=0}^{M} a_k^* x(n-k)x^*(n-l)\right\} = E\{v(n)x^*(n-l)\} \;\Rightarrow\; \sum_{k=0}^{M} a_k^* r(l-k) = 0 \;\text{ for } l > 0$$
Accordingly, the auto-correlation of the AR process satisfies
$$r(l) = w_1^* r(l-1) + w_2^* r(l-2) + \cdots + w_M^* r(l-M)$$
where $w_k = -a_k$. Note that this also has the solution
$$r(m) = \sum_{k=1}^{M} c_k p_k^m$$
where $p_k$ is the k-th root of
$$1 - w_1^* z^{-1} - w_2^* z^{-2} - \cdots - w_M^* z^{-M} = 0$$
Why? It is a difference equation with no driving function, so only the homogeneous solution remains.
Recall that the AR characteristic equation is
$$1 + a_1^* z^{-1} + a_2^* z^{-2} + \cdots + a_M^* z^{-M} = 0$$
This is identical to the auto-correlation characteristic equation
$$1 - w_1^* z^{-1} - w_2^* z^{-2} - \cdots - w_M^* z^{-M} = 0$$
⇒ the roots are equal.
Result: A stable AR process ⇒ $|p_k| < 1$ and
$$\lim_{m\to\infty} r(m) = \lim_{m\to\infty}\sum_{k=1}^{M} c_k p_k^m = 0$$
(asymptotically uncorrelated)
Yule-Walker Equations

An AR model of order M is completely specified by
AR coefficients: $a_1, a_2, \ldots, a_M$
Variance of v(n): $\sigma_v^2$
Proposition: These parameters can be determined from the auto-correlation values $r(0), r(1), \ldots, r(M)$.

Recall
$$r(l) = w_1^* r(l-1) + w_2^* r(l-2) + \cdots + w_M^* r(l-M)$$
Case 1: Let $l = 1$:
$$r(1) = w_1^* r(0) + w_2^* r(-1) + \cdots + w_M^* r(1-M)$$
Using the fact that $r(-k) = r^*(k)$,
$$r(1) = w_1^* r(0) + w_2^* r^*(1) + \cdots + w_M^* r^*(M-1)$$
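Stacking this recursion for $l = 1, \ldots, M$ gives a Toeplitz linear system in the weights. A minimal Python sketch for real-valued data; the noise-variance relation $\sigma_v^2 = r(0) - \sum_k w_k r(k)$ is the standard companion identity, used here as an assumption:

```python
import numpy as np

def yule_walker(r):
    """Solve the Yule-Walker system for the AR weights w_k (a_k = -w_k).

    r : auto-correlation values r(0), ..., r(M), real-valued data assumed.
    """
    M = len(r) - 1
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    w = np.linalg.solve(R, r[1:])
    sigma_v2 = r[0] - w @ r[1:]     # driving-noise variance
    return w, sigma_v2

# AR(1) with w1 = 0.8, unit-variance noise: r(k) = 0.8^k / (1 - 0.64)
r = np.array([1/0.36, 0.8/0.36, 0.64/0.36])
print(yule_walker(r))               # w ~ [0.8, 0.0], sigma_v^2 ~ 1.0
```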
Linear Systems, Spectral Representations, and Eigen Analysis: Eigen Properties

Property (Determinant–Eigenvalue Relation)
The determinant of the correlation matrix is related to the eigenvalues as follows:
$$\det(R) = \prod_{i=1}^{M}\lambda_i$$
Proof: Using $R = Q\Omega Q^H$ and the above,
$$\det(R) = \det(Q\Omega Q^H) = \det(Q)\det(Q^H)\det(\Omega) = \det(\Omega) = \prod_{i=1}^{M}\lambda_i$$
Property (Trace–Eigenvalue Relation)
The trace of the correlation matrix is related to the eigenvalues as follows:
$$\mathrm{trace}(R) = \sum_{i=1}^{M}\lambda_i$$
Proof: Note
$$\mathrm{trace}(R) = \mathrm{trace}(Q\Omega Q^H) = \mathrm{trace}(Q^H Q\Omega) = \mathrm{trace}(\Omega) = \sum_{i=1}^{M}\lambda_i \qquad \text{QED}$$
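Both identities are easy to confirm numerically; a Python sketch with a random sample correlation matrix (the data model is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 4))
R = X.T @ X / 1000                       # sample correlation matrix (symmetric)

lam = np.linalg.eigvalsh(R)
print(np.linalg.det(R), np.prod(lam))    # det(R) = product of eigenvalues
print(np.trace(R), np.sum(lam))          # trace(R) = sum of eigenvalues
```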
Definition (Normal Matrix)
A complex square matrix A is a normal matrix if
$$A^H A = A A^H$$
That is, a matrix is normal if it commutes with its conjugate transpose.

Note
All Hermitian symmetric matrices are normal
Every matrix that can be diagonalized by a unitary transform is normal

Definition (Condition Number)
The condition number reflects how numerically well-conditioned a problem is, i.e., a low condition number ⇒ well-conditioned; a high condition number ⇒ ill-conditioned.
Definition (Condition Number for Linear Systems)
For a linear system
$$A\mathbf{x} = \mathbf{b}$$
defined by a normal matrix A, the condition number is
$$\chi(A) = \frac{\lambda_{\max}}{\lambda_{\min}}$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum/minimum eigenvalues of A.

Observations:
Large eigenvalue spread ⇒ ill-conditioned
Small eigenvalue spread ⇒ well-conditioned
Sensitivity Analysis: Suppose a system (filter) is related as
$$R\mathbf{w} = \mathbf{d}$$
where w defines the filter parameters, and R and d are signal statistic matrices.
⇒ Introduce small signal statistic perturbations: let R and d be perturbed such that $\|\delta R\|/\|R\|$ and $\|\delta \mathbf{d}\|/\|\mathbf{d}\|$ are on the order of $\epsilon \ll 1$.
Result: A bound on the resulting parameter perturbations is given by
$$\frac{\|\delta\mathbf{w}\|}{\|\mathbf{w}\|} \le \epsilon\,\chi(R) = \epsilon\,\frac{\lambda_{\max}}{\lambda_{\min}}$$
Consequence: If R is ill-conditioned, small changes in R or d can lead to big changes in w.
⇒ System sensitivity is related to eigenvalue spread.
The Discrete Karhunen-Loève Transform (KLT)

Definition (The Discrete Karhunen-Loève Transform (KLT))
An M sample vector $\mathbf{x}(n)$ from the process {x(n)} can be expressed as
$$\mathbf{x}(n) = \sum_{i=1}^{M} c_i(n)\mathbf{q}_i$$
where $\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_M$ are the orthonormal eigenvectors of the process correlation matrix, R, and $c_1(n), c_2(n), \cdots, c_M(n)$ are a set of KLT coefficients.

The signal is represented as a weighted sum of eigenvectors
Need to determine the coefficients
Determining Coefficients: Write the expression in matrix form
$$\mathbf{x}(n) = \sum_{i=1}^{M} c_i(n)\mathbf{q}_i = Q\mathbf{c}(n) \qquad (*)$$
where
$$Q = [\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_M] \quad\text{and}\quad \mathbf{c}(n) = [c_1(n), c_2(n), \cdots, c_M(n)]^T.$$
Solving $(*)$ for $\mathbf{c}(n)$:
$$\mathbf{c}(n) = Q^{-1}\mathbf{x}(n) = Q^H\mathbf{x}(n) \quad\text{or}\quad c_i(n) = \mathbf{q}_i^H\mathbf{x}(n)$$
Note: $c_i(n)$ is the projection of $\mathbf{x}(n)$ onto $\mathbf{q}_i$.
Question: How are the coefficients related to each other?
Answer: Consider the correlation between $c_i(n)$ terms:
$$E\{c_i(n)c_j^*(n)\} = \mathbf{q}_i^H E\{\mathbf{x}(n)\mathbf{x}^H(n)\}\mathbf{q}_j = \mathbf{q}_i^H R\mathbf{q}_j = \lambda_j\,\mathbf{q}_i^H\mathbf{q}_j = \begin{cases}\lambda_i & i = j \\ 0 & i \ne j\end{cases}$$
⇒ KLT transform coefficients are uncorrelated. A desirable property – Why?
Question: Can we represent $\mathbf{x}(n)$ with fewer terms? If so, how do we minimize the representation error?
Approach: Use fewer terms in the KLT transform:
$$\mathbf{x}(n) = \sum_{i=1}^{M} c_i(n)\mathbf{q}_i \;\Rightarrow\; \hat{\mathbf{x}}(n) = \sum_{i=1}^{N} c_i(n)\mathbf{q}_i, \quad N < M$$
Thus
$$\mathbf{x}(n) = \hat{\mathbf{x}}(n) + \boldsymbol{\epsilon}(n) = \sum_{i=1}^{N} c_i(n)\mathbf{q}_i + \sum_{i=N+1}^{M} c_i(n)\mathbf{q}_i$$
Question: How do we minimize the representation error?
Approach: Analyze and minimize the error power.
The error power is given by
$$\epsilon = E\{\boldsymbol{\epsilon}^H(n)\boldsymbol{\epsilon}(n)\} = E\left\{\sum_{i=N+1}^{M} c_i^*(n)\mathbf{q}_i^H \sum_{j=N+1}^{M} c_j(n)\mathbf{q}_j\right\}$$
$$= \sum_{i=N+1}^{M} E\{c_i^*(n)c_i(n)\} \quad\text{[result of orthogonality]}$$
$$= \sum_{i=N+1}^{M}\lambda_i \quad\text{[from prior result]}$$
Result: To minimize the error, select the N eigenvectors $\mathbf{q}_i$ associated with the N largest eigenvalues.
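A compact Python sketch of the truncated KLT (PCA-style), confirming that the error power equals the sum of the discarded eigenvalues; the data model is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
x = rng.normal(size=(5000, 4)) @ A.T       # correlated 4-D observations
R = x.T @ x / len(x)                       # correlation matrix estimate

lam, Q = np.linalg.eigh(R)                 # ascending eigenvalues, orthonormal Q
c = x @ Q                                  # coefficients c_i(n) = q_i^T x(n)

keep = [3, 2]                              # N = 2 largest eigenvalues
x_hat = c[:, keep] @ Q[:, keep].T
err_power = np.mean(np.sum((x - x_hat)**2, axis=1))
print(err_power, lam[0] + lam[1])          # sum of discarded eigenvalues
```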
The Matched Filter

Objective: Find the optimal filter coefficients for two cases: (1) deterministic signals and (2) stochastic signals.
Signal model:
$$\mathbf{x}(n) = \underbrace{\mathbf{u}(n)}_{\text{signal}} + \underbrace{\mathbf{v}(n)}_{\text{noise}}$$
Maximum Likelihood and Bayes Estimation: Cramér-Rao Bound

The Cauchy-Schwarz inequality states (for square-integrable complex-valued functions)
$$\left|\int f(x)g(x)\,dx\right|^2 \le \int|f(x)|^2\,dx\cdot\int|g(x)|^2\,dx$$
with equality only if $f(x) = k\cdot g(x)$, where k is a constant.
Thus
$$\left(\int_{-\infty}^{\infty}\left(\frac{\partial\ln[f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)]}{\partial\theta}\sqrt{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)}\right)\left(\sqrt{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)}\,(\hat{\theta}-\theta)\right)d\mathbf{x}\right)^2 = 1$$
$$\Rightarrow\left(\int_{-\infty}^{\infty}\left(\frac{\partial\ln[f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)]}{\partial\theta}\right)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x}\right)\left(\int_{-\infty}^{\infty}(\hat{\theta}-\theta)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x}\right) \ge 1$$
Note
$$\int_{-\infty}^{\infty}(\hat{\theta}-\theta)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x} = \mathrm{var}(\hat{\theta}) \qquad (*)$$
and
$$\int_{-\infty}^{\infty}\left(\frac{\partial\ln[f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)]}{\partial\theta}\right)^2 f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,d\mathbf{x} = E\left\{\left(\frac{\partial\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta))}{\partial\theta}\right)^2\right\} \qquad (**)$$
Thus, using $(*)$ and $(**)$ in the inequality above,
$$\mathrm{var}(\hat{\theta}) \ge \left[E\left\{\left(\frac{\partial\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta))}{\partial\theta}\right)^2\right\}\right]^{-1}$$
with equality iff
$$\frac{\partial}{\partial\theta}\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)) = k(\hat{\theta}-\theta)$$
QED
Thus the bound is met iff
$$\frac{\partial}{\partial\theta}\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)) = k(\hat{\theta}-\theta)$$
Let $\theta = \hat{\theta}_{ML}$ in the above:
$$\underbrace{\left.\frac{\partial}{\partial\theta}\ln(f_{\mathbf{x}|\theta}(\mathbf{x}|\theta))\right|_{\theta=\hat{\theta}_{ML}}}_{=\,0\text{ by ML criteria}} = \left.k(\hat{\theta}-\theta)\right|_{\theta=\hat{\theta}_{ML}}$$
Therefore, the RHS must equal zero, or
$$\hat{\theta} = \hat{\theta}_{ML}$$
Result: If an efficient estimate (one that satisfies the bound with equality) exists, then it is the ML estimate.
Note: If an efficient estimator doesn't exist, then we don't know how good $\hat{\theta}_{ML}$ is.
Bayes Estimation

Definition (Bayes Estimation)
Objective: Estimate a random parameter (RV) y from observation samples $x_1, x_2, \cdots, x_n$ that are statistically related to y by $f_{y|\mathbf{x}}(\cdot)$.
Bayes Procedure: Define a nonnegative cost function $C(y, \hat{y})$ and set $\hat{y}$ to minimize the expected cost, or risk,
$$\underbrace{R}_{\text{risk}} = E\{C(y, \hat{y})\}$$
Since y and $\hat{y}$ are RVs,
$$R = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} C(y, \hat{y})f_{y,\mathbf{x}}(y, \mathbf{x})\,dy\,d\mathbf{x} = \int_{-\infty}^{\infty}\underbrace{\left[\int_{-\infty}^{\infty} C(y, \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy\right]}_{I(\hat{y})} f_{\mathbf{x}}(\mathbf{x})\,d\mathbf{x}$$
Note: Minimizing $I(\hat{y})$ is equivalent to minimizing R, since $f_{\mathbf{x}}(\mathbf{x}) \ge 0$.
Consider several cost functions.

Case 1: Mean squared cost function
$$C(y, \hat{y}) = |y - \hat{y}|^2$$
In this case,
$$I(\hat{y}) = \int_{-\infty}^{\infty}(y - \hat{y})^2 f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$\Rightarrow \frac{\partial I(\hat{y})}{\partial\hat{y}} = -2\int_{-\infty}^{\infty}(y - \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = 0$$
or, rearranging,
$$\int_{-\infty}^{\infty}\hat{y}f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{-\infty}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy \quad[\hat{y}\text{ is a constant}]$$
$$\Rightarrow \hat{y}_{MS} = \int_{-\infty}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = E\{y|\mathbf{x}\}$$

Example
Let $x_i = a + \mu_i$ for $i = 1, 2, \cdots, N$, where the $\mu_i \sim N(0, \sigma^2)$ are i.i.d. and $a \sim N(0, \sigma_a^2)$. Determine $\hat{a}_{MS}(\mathbf{x})$.

Note
$$f_{\mathbf{x}|a}(\mathbf{x}|a) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x_i-a)^2}{2\sigma^2}} = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{N}{2}} e^{-\frac{1}{2}\left(\sum_{i=1}^{N}\frac{(x_i-a)^2}{\sigma^2}\right)} \qquad (*)$$
$$f_a(a) = \frac{1}{\sqrt{2\pi}\,\sigma_a}e^{-\frac{a^2}{2\sigma_a^2}} \qquad (**)$$
To find $\hat{a}_{MS}(\mathbf{x})$ we need
$$\hat{a}_{MS}(\mathbf{x}) = E\{a|\mathbf{x}\}$$
By Bayes' theorem we can write
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = \frac{f_{\mathbf{x}|a}(\mathbf{x}|a)f_a(a)}{f_{\mathbf{x}}(\mathbf{x})}$$
Substituting in $(*)$ and $(**)$, and rearranging,
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = \frac{\left(\frac{1}{2\pi\sigma^2}\right)^{\frac{N}{2}}\left(\frac{1}{\sqrt{2\pi}\,\sigma_a}\right)e^{-\frac{1}{2}\left(\sum_{i=1}^{N}\frac{(x_i-a)^2}{\sigma^2} + \frac{a^2}{\sigma_a^2}\right)}}{f_{\mathbf{x}}(\mathbf{x})}$$
This can be compactly written as
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = C(\mathbf{x})\exp\left\{-\frac{1}{2\sigma_p^2}\left[a - \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/N}\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)\right]^2\right\}$$
Observations on
$$f_{a|\mathbf{x}}(a|\mathbf{x}) = C(\mathbf{x})\exp\left\{-\frac{1}{2\sigma_p^2}\left[a - \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/N}\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)\right]^2\right\} = C(\mathbf{x})\exp\left\{-\frac{(a-\eta)^2}{2\sigma_p^2}\right\}$$
$C(\mathbf{x})$ is a (normalizing) function of $\mathbf{x}$ only. The variance term is given by
$$\sigma_p^2 = \left(\frac{1}{\sigma_a^2} + \frac{N}{\sigma^2}\right)^{-1} = \frac{\sigma_a^2\sigma^2}{N\sigma_a^2 + \sigma^2}$$
Critical Observation: $f_{a|\mathbf{x}}(a|\mathbf{x})$ is a Gaussian distribution!
Result:
$$\hat{a}_{MS} = E\{a|\mathbf{x}\} = \eta = \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/N}\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)$$
Case 2: Uniform cost function
$$C(y, \hat{y}) = \begin{cases}0 & |y - \hat{y}| < \varepsilon \\ 1 & \text{else}\end{cases}$$
Question: For what types of problems is this cost function effective?

In this case,
$$I(\hat{y}) = \int_{-\infty}^{\infty} C(y, \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{|y-\hat{y}| \ge \varepsilon} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = 1 - \int_{|y-\hat{y}| < \varepsilon} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
How do we minimize $I(\hat{y})$?
Result: $I(\hat{y})$ is minimized by maximizing
$$\int_{|y-\hat{y}| < \varepsilon} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Note: $\varepsilon$ is arbitrarily small
⇒ $I(\hat{y})$ is minimized when $f_{y|\mathbf{x}}(\hat{y}|\mathbf{x})$ takes its largest value:
$$\hat{y}_{MAP}(\mathbf{x}) = \arg\max_y f_{y|\mathbf{x}}(y|\mathbf{x})$$
$\hat{y}_{MAP}$ is referred to as the maximum a posteriori (MAP) estimate because it maximizes the posterior density $f_{y|\mathbf{x}}(y|\mathbf{x})$.
Example
Let
$$f_{x,y}(x, y) = \begin{cases}10y & 0 \le y \le x^2,\; 0 \le x \le 1 \\ 0 & \text{otherwise}\end{cases}$$
Find the MS and MAP estimates of y, i.e., $\hat{y}_{MS}(x)$ and $\hat{y}_{MAP}(x)$.

First step: determine the posterior density $f_{y|x}(y|x)$. Since $f_{y|x}(y|x) = \frac{f_{x,y}(x,y)}{f_x(x)}$, we need
$$f_x(x) = \int_{-\infty}^{\infty} f_{x,y}(x, y)\,dy = \int_0^{x^2}10y\,dy = 5y^2\Big|_0^{x^2} = 5x^4, \quad 0 \le x \le 1$$
Thus,
$$f_{y|x}(y|x) = \frac{f_{x,y}(x, y)}{f_x(x)} = \frac{10y}{5x^4} = \frac{2y}{x^4}, \quad 0 \le y \le x^2$$

[Figure: for fixed x, $f_{y|x}(y|x)$ increases linearly in y, reaching height $2/x^2$ at $y = x^2$.]
MAP estimate:
$$\hat{y}_{MAP}(x) = \arg\max_y f_{y|x}(y|x) = \arg\max_{0 \le y \le x^2}\frac{2y}{x^4} = x^2$$
MS estimate:
$$\hat{y}_{MS}(x) = E\{y|x\} = \int_0^{x^2} y f_{y|x}(y|x)\,dy = \int_0^{x^2}\frac{2y^2}{x^4}\,dy = \frac{2}{3}\frac{y^3}{x^4}\Big|_0^{x^2} = \frac{2}{3}x^2$$
Note that the minimum MSE is
$$E\{(y - \hat{y}_{MS})^2\} = \int_0^1\int_0^{x^2}(y - \hat{y}_{MS})^2 f_{x,y}(x, y)\,dy\,dx = \int_0^1\int_0^{x^2}\left(y - \frac{2}{3}x^2\right)^2 10y\,dy\,dx = \frac{5}{162} = 0.0309$$
The MSE of the MAP estimate is
$$E\{(y - \hat{y}_{MAP})^2\} = \int_0^1\int_0^{x^2}(y - \hat{y}_{MAP})^2 f_{x,y}(x, y)\,dy\,dx = \int_0^1\int_0^{x^2}(y - x^2)^2 10y\,dy\,dx = \frac{5}{54} = 0.0926$$
Observation: This result is expected. Why?
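Both MSE values can be confirmed by rejection sampling from $f_{x,y}$; a Python sketch (the sampler and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
# Sample f(x, y) = 10y on 0 <= y <= x^2, 0 <= x <= 1 by rejection:
# propose uniformly on the unit square, accept with probability f/10
x = rng.uniform(0, 1, 4_000_000)
y = rng.uniform(0, 1, 4_000_000)
w = rng.uniform(0, 10, 4_000_000)
keep = (y <= x**2) & (w <= 10 * y)
x, y = x[keep], y[keep]

print(np.mean((y - (2/3) * x**2)**2))   # ~ 5/162 = 0.0309 (MS estimate)
print(np.mean((y - x**2)**2))           # ~ 5/54  = 0.0926 (MAP estimate)
```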
Observation: MAP estimation can be used as an extension of ML estimation if some variability is assumed: instead of an unknown constant $\theta$, we have an unknown random parameter with distribution $f_\theta(\theta)$.
To see this, note
$$f_{\theta|\mathbf{x}}(\theta|\mathbf{x}) = \frac{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)f_\theta(\theta)}{f_{\mathbf{x}}(\mathbf{x})}$$
The MAP estimate maximizes the numerator, since $f_{\mathbf{x}}(\mathbf{x})$ is not a function of $\theta$:
$$\hat{\theta}_{MAP} = \arg\max_\theta f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)f_\theta(\theta)$$
Question: For what distribution $f_\theta(\theta)$ does $\hat{\theta}_{MAP} = \hat{\theta}_{ML}$? That is,
$$\hat{\theta}_{MAP} = \arg\max_\theta f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)f_\theta(\theta) \stackrel{?}{=} \arg\max_\theta f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) = \hat{\theta}_{ML}$$
Example
Let $x(n) = A + \mu(n)$ for $n = 1, 2, \cdots, N$, where the $\mu(n) \sim N(0, \sigma_\mu^2)$ are i.i.d. and $A \sim N(A_0, \sigma_A^2)$. Determine the MAP estimate of A.

Need to maximize $f_{\mathbf{x}|A}(\mathbf{x}|A)f_A(A)$, or
$$\hat{A}_{MAP} = \arg\max_A\left[\ln(f_{\mathbf{x}|A}(\mathbf{x}|A)) + \ln(f_A(A))\right]$$
Note
$$\ln(f_{\mathbf{x}|A}(\mathbf{x}|A)) = \frac{N}{2}\ln\left(\frac{1}{2\pi\sigma_\mu^2}\right) - \sum_{n=1}^{N}\frac{(x(n)-A)^2}{2\sigma_\mu^2}$$
and
$$\ln(f_A(A)) = \frac{1}{2}\ln\left(\frac{1}{2\pi\sigma_A^2}\right) - \frac{(A-A_0)^2}{2\sigma_A^2}$$
Thus
$$\hat{A}_{MAP} = \arg\min_A\left(\frac{1}{2\sigma_\mu^2}\sum_{n=1}^{N}(x(n)-A)^2 + \frac{(A-A_0)^2}{2\sigma_A^2}\right)$$
Differentiating, we get
$$\left.-\frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}(x(n)-A) + \frac{(A-A_0)}{\sigma_A^2}\right|_{A=\hat{A}_{MAP}} = 0$$
$$\Rightarrow \sum_{n=1}^{N}\frac{x(n)}{\sigma_\mu^2} - \frac{N\hat{A}_{MAP}}{\sigma_\mu^2} = \frac{\hat{A}_{MAP}}{\sigma_A^2} - \frac{A_0}{\sigma_A^2}$$
$$\Rightarrow \hat{A}_{MAP}\left(\frac{1}{\sigma_A^2} + \frac{N}{\sigma_\mu^2}\right) = \frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}x(n) + \frac{A_0}{\sigma_A^2}$$
$$\Rightarrow \hat{A}_{MAP} = \frac{1}{\frac{1}{\sigma_A^2} + \frac{N}{\sigma_\mu^2}}\left(\frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}x(n) + \frac{A_0}{\sigma_A^2}\right)$$
$$\hat{A}_{MAP} = \frac{1}{\frac{1}{\sigma_A^2} + \frac{N}{\sigma_\mu^2}}\left(\frac{1}{\sigma_\mu^2}\sum_{n=1}^{N}x(n) + \frac{A_0}{\sigma_A^2}\right)$$
Note that if $\sigma_A^2 \to \infty$, then there is no a priori information and
$$\lim_{\sigma_A^2\to\infty}\hat{A}_{MAP} = \frac{1}{N}\sum_{n=1}^{N}x(n) = \hat{A}_{ML}$$

[Figure: two Gaussian priors with $\sigma_{A,2}^2 > \sigma_{A,1}^2$; the flatter prior constrains the estimate less.]

Observation: As $f_\theta(\theta)$ flattens out, $\hat{\theta}_{MAP} \to \hat{\theta}_{ML}$.
Case 3: The absolute cost function
$$C(y, \hat{y}) = |y - \hat{y}|$$
Question: For what types of problems is this cost function effective?

In this case,
$$I(\hat{y}) = \int_{-\infty}^{\infty} C(y, \hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{y<\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy + \int_{y\ge\hat{y}}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$= \int_{-\infty}^{\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy + \int_{\hat{y}}^{\infty}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$I(\hat{y}) = \int_{-\infty}^{\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy + \int_{\hat{y}}^{\infty}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Note that
$$\int_{-\infty}^{\hat{y}}(\hat{y}-y)f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \hat{y}F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - \int_{-\infty}^{\hat{y}} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
and similarly
$$\int_{\hat{y}}^{\infty}(y-\hat{y})f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{\hat{y}}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy - \hat{y}(1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}))$$
Thus,
$$I(\hat{y}) = \hat{y}F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - \int_{-\infty}^{\hat{y}} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy - \hat{y}(1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x})) + \int_{\hat{y}}^{\infty} y f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Taking the derivative of $I(\hat{y})$,
$$\frac{\partial I(\hat{y})}{\partial\hat{y}} = F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) + \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - (1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x})) - \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) + \hat{y}f_{y|\mathbf{x}}(\hat{y}|\mathbf{x})$$
$$= F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}) - (1 - F_{y|\mathbf{x}}(\hat{y}|\mathbf{x}))$$
Result: Setting this equal to 0, we see that $\hat{y}_{MAE}$ is given by
$$F_{y|\mathbf{x}}(\hat{y}_{MAE}|\mathbf{x}) = 1 - F_{y|\mathbf{x}}(\hat{y}_{MAE}|\mathbf{x})$$
or
$$\int_{-\infty}^{\hat{y}_{MAE}} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{\hat{y}_{MAE}}^{\infty} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
$$\int_{-\infty}^{\hat{y}_{MAE}} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy = \int_{\hat{y}_{MAE}}^{\infty} f_{y|\mathbf{x}}(y|\mathbf{x})\,dy$$
Interpreting this graphically: $\hat{y}_{MAE}$ splits the posterior into two regions of equal area (Area A = Area B).
Observation: $\hat{y}_{MAE}$ = median of $f_{y|\mathbf{x}}(y|\mathbf{x})$
Estimator Relations

If $f_{y|\mathbf{x}}(y|\mathbf{x})$ is symmetric, then
$$\hat{y}_{MAE} = \hat{y}_{MS}$$
Why? For a symmetric distribution the conditional mean is equal to the (median) symmetry point.
If $f_{y|\mathbf{x}}(y|\mathbf{x})$ is symmetric and unimodal, then
$$\hat{y}_{MAE} = \hat{y}_{MS} = \hat{y}_{MAP}$$
Why? The unimodal constraint implies that the single mode must be at the distribution symmetry point ⇒ the MAP estimate is located at the central point.
Example
Determine $\hat{y}_{MAE}$ for the previously considered case
$$f_{x,y}(x, y) = \begin{cases}10y & 0 \le y \le x^2,\; 0 \le x \le 1 \\ 0 & \text{otherwise}\end{cases}$$
We showed previously that
$$f_{y|x}(y|x) = \frac{2y}{x^4}, \quad 0 \le y \le x^2 \;\Rightarrow\; F_{y|x}(y|x) = \frac{y^2}{x^4}, \quad 0 \le y \le x^2$$
Thus, determining $\hat{y}_{MAE}$:
$$F_{y|x}(\hat{y}_{MAE}|x) = 1 - F_{y|x}(\hat{y}_{MAE}|x) \;\Rightarrow\; \frac{\hat{y}_{MAE}^2}{x^4} = 1 - \frac{\hat{y}_{MAE}^2}{x^4} \;\Rightarrow\; \hat{y}_{MAE} = \frac{x^2}{\sqrt{2}}$$
MAP estimate (previous result):
$$\hat{y}_{MAP}(x) = \arg\max_y f_{y|x}(y|x) = x^2$$
MS estimate (previous result):
$$\hat{y}_{MS}(x) = E\{y|x\} = \frac{2}{3}x^2$$
MAE estimate:
$$\hat{y}_{MAE}(x) = \text{median of } f_{y|x}(y|x) = \frac{x^2}{\sqrt{2}}$$

[Figure: the ramp posterior $f_{y|x}(y|x)$ on $[0, x^2]$ with the three estimates marked; note $\frac{2}{3}x^2 < \frac{x^2}{\sqrt{2}} < x^2$.]
Final ML and MAP Comments

ML estimation was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922.
Under fairly weak regularity conditions the ML estimate is asymptotically optimal:
The ML estimate is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity
The ML estimate is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound as the number of samples tends to infinity; consequently, no unbiased estimator has lower asymptotic mean squared error than the ML estimator
The ML estimate is asymptotically normal, i.e., as the number of samples increases, the distribution of the ML estimate tends to the Gaussian distribution
MAP estimation is a generalization of ML estimation that incorporates the prior distribution of the quantity being estimated.
Wiener (MSE) Filtering Theory

Problem Statement
Produce an estimate of a desired process statistically related to a set of observations.

[Figure: the input x(n) drives a filter whose output y(n) is subtracted from the desired response d(n) to form the estimate error e(n).]

Historical Notes: The linear filtering problem was solved by
Andrey Kolmogorov for discrete time – his 1938 paper "established the basic theorems for smoothing and predicting stationary stochastic processes"
Norbert Wiener in 1941 for continuous time – not published until the 1949 paper Extrapolation, Interpolation, and Smoothing of Stationary Time Series
System restrictions and considerations:
Filter is linear
Filter is discrete time
Filter is finite impulse response (FIR)
The process is WSS
Statistical optimization is employed
For the discrete-time case:

[Figure: a tapped-delay-line filter; x(n) passes through delays $z^{-1}$, the taps are weighted by $w_0^*, w_1^*, \ldots, w_{M-1}^*$ and summed to form $\hat{d}(n)$, which is subtracted from d(n) to give e(n).]

The filter impulse response is finite and given by
$$h_k = \begin{cases}w_k^* & k = 0, 1, \cdots, M-1 \\ 0 & \text{otherwise}\end{cases}$$
The output $\hat{d}(n)$ is an estimate of the desired signal d(n).
x(n) and d(n) are statistically related ⇒ $\hat{d}(n)$ and d(n) are statistically related.
With
$$\mathbf{p} = E\{\mathbf{x}(n)d^*(n)\} \quad\text{[cross-correlation between } \mathbf{x}(n) \text{ and } d(n)\text{]}$$
the MSE can be compactly expressed as
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
where we have assumed x(n) and d(n) are zero mean and WSS.
The MSE criterion as a function of the filter weight vector w:
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
Observation: The error is a quadratic function of w.
Consequence: The error is an M-dimensional bowl-shaped function of w with a unique minimum.
Result: The optimal weight vector, $\mathbf{w}_0$, is determined by differentiating J(w) and setting the result to zero:
$$\nabla_{\mathbf{w}}J(\mathbf{w})\big|_{\mathbf{w}=\mathbf{w}_0} = 0$$
A closed form solution exists.
Example
Consider a two-dimensional case, i.e., an M = 2 tap filter.

[Figure: the bowl-shaped error surface $J(w_0, w_1)$ and its elliptical error contours.]
Aside (Matrix Differentiation): For complex data,
$$w_k = a_k + jb_k, \quad k = 0, 1, \cdots, M-1$$
the gradient, with respect to $w_k$, is
$$\nabla_k(J) = \frac{\partial J}{\partial a_k} + j\frac{\partial J}{\partial b_k}, \quad k = 0, 1, \cdots, M-1$$
The complete gradient is thus given by
$$\nabla_{\mathbf{w}}(J) = \begin{bmatrix}\nabla_0(J)\\\nabla_1(J)\\\vdots\\\nabla_{M-1}(J)\end{bmatrix} = \begin{bmatrix}\frac{\partial J}{\partial a_0} + j\frac{\partial J}{\partial b_0}\\\frac{\partial J}{\partial a_1} + j\frac{\partial J}{\partial b_1}\\\vdots\\\frac{\partial J}{\partial a_{M-1}} + j\frac{\partial J}{\partial b_{M-1}}\end{bmatrix}$$
Example
Let c and w be $M \times 1$ complex vectors. For $g = \mathbf{c}^H\mathbf{w}$, find $\nabla_{\mathbf{w}}(g)$.

Note
$$g = \mathbf{c}^H\mathbf{w} = \sum_{k=0}^{M-1}c_k^* w_k = \sum_{k=0}^{M-1}c_k^*(a_k + jb_k)$$
Thus
$$\nabla_k(g) = \frac{\partial g}{\partial a_k} + j\frac{\partial g}{\partial b_k} = c_k^* + j(jc_k^*) = 0, \quad k = 0, 1, \cdots, M-1$$
Result: For $g = \mathbf{c}^H\mathbf{w}$,
$$\nabla_{\mathbf{w}}(g) = [\nabla_0(g), \nabla_1(g), \cdots, \nabla_{M-1}(g)]^T = \mathbf{0}$$
Example
Now suppose $g = \mathbf{w}^H\mathbf{c}$. Find $\nabla_{\mathbf{w}}(g)$.

In this case,
$$g = \mathbf{w}^H\mathbf{c} = \sum_{k=0}^{M-1}w_k^* c_k = \sum_{k=0}^{M-1}c_k(a_k - jb_k)$$
and
$$\nabla_k(g) = \frac{\partial g}{\partial a_k} + j\frac{\partial g}{\partial b_k} = c_k + j(-jc_k) = 2c_k, \quad k = 0, 1, \cdots, M-1$$
Result: For $g = \mathbf{w}^H\mathbf{c}$,
$$\nabla_{\mathbf{w}}(g) = [\nabla_0(g), \nabla_1(g), \cdots, \nabla_{M-1}(g)]^T = 2\mathbf{c}$$
Example
Lastly, suppose $g = \mathbf{w}^H Q\mathbf{w}$. Find $\nabla_{\mathbf{w}}(g)$.

In this case,
$$g = \sum_{i=0}^{M-1}\sum_{j=0}^{M-1}w_i^* w_j q_{i,j} = \sum_{i=0}^{M-1}\sum_{j=0}^{M-1}(a_i - jb_i)(a_j + jb_j)q_{i,j}$$
$$\Rightarrow \nabla_k(g) = \frac{\partial g}{\partial a_k} + j\frac{\partial g}{\partial b_k} = 2\sum_{j=0}^{M-1}(a_j + jb_j)q_{k,j} = 2\sum_{j=0}^{M-1}w_j q_{k,j}$$
Result: For $g = \mathbf{w}^H Q\mathbf{w}$,
$$\nabla_{\mathbf{w}}(g) = \begin{bmatrix}\nabla_0(g)\\\nabla_1(g)\\\vdots\\\nabla_{M-1}(g)\end{bmatrix} = 2\begin{bmatrix}\sum_{i=0}^{M-1}q_{0,i}w_i\\\sum_{i=0}^{M-1}q_{1,i}w_i\\\vdots\\\sum_{i=0}^{M-1}q_{M-1,i}w_i\end{bmatrix} = 2Q\mathbf{w}$$
Observation: The differentiation result depends on the matrix ordering.
Returning to the MSE performance criterion,
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
Approach: Minimize the error by differentiating with respect to w and setting the result to 0:
$$\nabla_{\mathbf{w}}(J) = 0 - 0 - 2\mathbf{p} + 2R\mathbf{w} = 0$$
$$\Rightarrow R\mathbf{w}_0 = \mathbf{p} \qquad \text{[normal equation]}$$
Result: The Wiener filter coefficients are defined by
$$\mathbf{w}_0 = R^{-1}\mathbf{p}$$
Question: Does $R^{-1}$ always exist? Recall R is positive semi-definite, and usually positive definite.
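A numerical illustration of the normal equation for a small system-identification setup (the 3-tap channel and noise level are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
N, w_true = 50_000, np.array([0.5, -0.3, 0.1])
x = rng.normal(size=N)
d = np.convolve(x, w_true)[:N] + 0.1 * rng.normal(size=N)  # desired signal

M = 3
X = np.stack([np.roll(x, k) for k in range(M)], axis=1)[M:]  # x(n), ..., x(n-2)
R = X.T @ X / len(X)            # correlation matrix estimate
p = X.T @ d[M:] / len(X)        # cross-correlation estimate

w0 = np.linalg.solve(R, p)      # w0 = R^{-1} p (normal equation)
print(w0)                        # approaches w_true
```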
Orthogonality Principle

Consider again the normal equation that defines the optimal solution:
$$R\mathbf{w}_0 = \mathbf{p} \;\Rightarrow\; E\{\mathbf{x}(n)\mathbf{x}^H(n)\}\mathbf{w}_0 = E\{\mathbf{x}(n)d^*(n)\}$$
Rearranging,
$$E\{\mathbf{x}(n)d^*(n)\} - E\{\mathbf{x}(n)\mathbf{x}^H(n)\}\mathbf{w}_0 = \mathbf{0}$$
$$E\{\mathbf{x}(n)[d^*(n) - \mathbf{x}^H(n)\mathbf{w}_0]\} = \mathbf{0}$$
$$E\{\mathbf{x}(n)e_0^*(n)\} = \mathbf{0}$$
Note: $e_0^*(n)$ is the error when the optimal weights are used, i.e.,
$$e_0^*(n) = d^*(n) - \mathbf{x}^H(n)\mathbf{w}_0$$
Thus
$$E\{\mathbf{x}(n)e_0^*(n)\} = E\begin{bmatrix}x(n)e_0^*(n)\\x(n-1)e_0^*(n)\\\vdots\\x(n-M+1)e_0^*(n)\end{bmatrix} = \begin{bmatrix}0\\0\\\vdots\\0\end{bmatrix}$$

Orthogonality Principle
A necessary and sufficient condition for a filter to be optimal is that the estimate error, $e_0(n)$, be orthogonal to each input sample in $\mathbf{x}(n)$.
Interpretation: The observation samples and the error are orthogonal and contain no mutual "information".
Minimum MSE

Objective: Determine the minimum MSE.
Approach: Use the optimal weights $\mathbf{w}_0 = R^{-1}\mathbf{p}$ in the MSE expression:
$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w} - \mathbf{w}^H\mathbf{p} + \mathbf{w}^H R\mathbf{w}$$
$$\Rightarrow J_{\min} = \sigma_d^2 - \mathbf{p}^H\mathbf{w}_0 - \mathbf{w}_0^H\mathbf{p} + \mathbf{w}_0^H R(R^{-1}\mathbf{p}) = \sigma_d^2 - \mathbf{p}^H\mathbf{w}_0 - \mathbf{w}_0^H\mathbf{p} + \mathbf{w}_0^H\mathbf{p} = \sigma_d^2 - \mathbf{p}^H\mathbf{w}_0$$
Result:
$$J_{\min} = \sigma_d^2 - \mathbf{p}^H R^{-1}\mathbf{p}$$
where the substitution $\mathbf{w}_0 = R^{-1}\mathbf{p}$ has been employed.
Excess MSE

Objective: Consider the excess MSE introduced by using a weight vector that is not optimal.

Adaptive optimization and filtering methods are appropriate, advantageous, or necessary when:
Signal statistics are not known a priori and must be "learned" from observed or representative samples
Signal statistics evolve over time
Time or computational restrictions dictate that simple, if repetitive, operations be employed rather than solving more complex, closed form expressions
To be considered are the following algorithms:
Steepest Descent (SD) – deterministic
Least Mean Squares (LMS) – stochastic
Recursive Least Squares (RLS) – deterministic
Steepest descent, also known as gradient descent, is an iterative technique for finding a local minimum of a function.
Approach: Given an arbitrary starting point, the current location (value) is moved in steps proportional to the negative of the gradient at the current point.
SD is an old, deterministic method that is the basis for stochastic gradient based methods
SD is a feedback approach to finding a local minimum of an error performance surface
The error surface must be known a priori
In the MSE case, SD converges to the optimal solution, $\mathbf{w}_0 = R^{-1}\mathbf{p}$, without inverting a matrix
Question: Why, in the MSE case, does this converge to the global minimum rather than a local minimum?
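A minimal Python sketch of the SD recursion $\mathbf{w}(n+1) = \mathbf{w}(n) + \mu(\mathbf{p} - R\mathbf{w}(n))$ on a 2-tap example (the statistics and step size are illustrative; this recursion form, with unit gradient scaling, is one common convention):

```python
import numpy as np

def steepest_descent(R, p, mu, iters=200):
    # Converges to R^{-1} p when 0 < mu < 2/lambda_max
    w = np.zeros_like(p)
    for _ in range(iters):
        w = w + mu * (p - R @ w)    # step along the negative gradient
    return w

R = np.array([[1.0, 0.5], [0.5, 1.0]])
p = np.array([0.7, 0.1])
print(steepest_descent(R, p, mu=0.5))
print(np.linalg.solve(R, p))        # same answer; SD needed no inverse
```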
Independence Theorem
The following conditions hold:
1. The vectors $\mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(n)$ are statistically independent
2. $\mathbf{x}(n)$ is independent of $d(1), d(2), \cdots, d(n-1)$
3. $d(n)$ is statistically dependent on $\mathbf{x}(n)$, but is independent of $d(1), d(2), \cdots, d(n-1)$
4. $\mathbf{x}(n)$ and $d(n)$ are mutually Gaussian

The independence theorem is invoked in the LMS algorithm analysis
The independence theorem is justified in some cases, e.g., beamforming, where we receive independent vector observations
In other cases it is not well justified, but allows the analysis to proceed (i.e., when all else fails, invoke simplifying assumptions)
Figure: Learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, step-size parameter μ = 0.075, and varying eigenvalue spread χ(R).

Figure: Learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, fixed eigenvalue spread, and varying step-size parameter μ.
The above follows from the fact that $\Phi(n)$ and $e_0^*(i)$ are independent. Why? $e_0(i)$ is independent of all observations, and the $\mathbf{x}(i)$ terms are given, uniquely defining $\Phi(n)$ ⇒ independence of $\Phi(n)$ and $e_0^*(i)$.
Result: The RLS algorithm is unbiased and convergent in the mean for $n \ge M$.
Question: How does this compare to the LMS algorithm?

The RLS algorithm converges faster than the LMS algorithm
The RLS algorithm has lower steady state error than the LMS algorithm
Application: Blind Deconvolution

Motivation:
Adaptive equalizers typically require a training period during which they operate on known signals/statistics.
This known-signal training is not always appropriate, e.g., in mobile communications:
Cost is too high (time/bandwidth)
Multipathing or other time-varying interference
In such cases, we must use blind equalization.
System Model

System components and assumptions:
Channel introduces distortion (dominant)
System has additive noise (not dominant)
Assume a baseband model of communications
Dominating interference is due to intersymbol interference (ISI) from channel distortion ⇒ the noise is ignored
Also assume that:
$h_n \ne 0$ for $n < 0$ (noncausal)
$\sum_k h_k^2 = 1$ (to keep the variance of the output constant)

Example (4-ary PAM modulation)
A 4-ary PAM modulation scheme uses 4 signals.

[Figure: four amplitude levels placed symmetrically about 0 on the x axis, ordered $S_4, S_2, S_1, S_3$.]
To solve the equalization problem, we need a statistical model of the data. Assume:
1. The data is white:
$$E\{x(n)\} = 0, \qquad E\{x(n)x(k)\} = \begin{cases}1 & k = n \\ 0 & \text{otherwise}\end{cases}$$
2. The pdf of x(n) is symmetric and uniform.

[Figure: the uniform density $f_x(x)$ of height $\frac{1}{2\sqrt{3}}$ on $[-\sqrt{3}, \sqrt{3}]$, consistent with zero mean and unit variance.]
Deconvolution Objective: If $\{w_i\}$ are the coefficients of the ideal inverse filter, then
$$\sum_i w_i h_{l-i} = \delta_l = \begin{cases}1 & l = 0 \\ 0 & \text{else}\end{cases}$$
If this is the case, the output of the equalizer is
$$y(n) = \sum_i w_i u(n-i) = \sum_i\sum_k w_i h_k x(n-i-k) \quad[\text{let } k = l - i]$$
$$= \sum_l x(n-l)\sum_i w_i h_{l-i} = \sum_l x(n-l)\delta_l = x(n)$$
Problem: $h_n$ is not known ⇒ the exact inverse cannot be used.
Solution: Use an iterative procedure to find the filter.
Let the output at iteration n be given by
$$y(n) = \sum_{i=-L}^{L} w_i(n)u(n-i)$$
where a $2L + 1$ tap filter is used.
Setting $w_i(n) = 0$ for $|i| > L$, we can write
$$y(n) = \sum_i w_i(n)u(n-i) = \sum_i w_i u(n-i) + \sum_i[w_i(n) - w_i]u(n-i) = x(n) + v(n)$$
$$\left[\text{since } x(n) = \sum_i w_i u(n-i)\right]$$
Note: $v(n) = \sum_i[w_i(n) - w_i]u(n-i)$ is the residual ISI.
Interpretation: $v(n) = \sum_i[w_i(n) - w_i]u(n-i)$ is the convolution noise (residual ISI) resulting from the fact that the ideal filter was not used.
Estimation Approach: Apply the output $y(n) = x(n) + v(n)$ to a zero-memory nonlinear estimator:
$$\hat{x}(n) = g[y(n)]$$
Complete System: The nonlinear estimate $g[y(n)]$ can be used to update the equalizer to produce a better estimate at time n + 1.
Optimization

Define the equalizer error to be
$$e(n) = \hat{x}(n) - y(n)$$
Use e(n) in the LMS algorithm to update the equalizer weights.
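A sketch of one such blind LMS update, using the nonlinear estimate as the reference; the slicer $g = \mathrm{sign}$ (a decision-directed flavor), the channel, and all parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(10)

def blind_lms_step(w, u_vec, mu, g):
    # y(n) = w^T u(n); e(n) = g(y) - y = x_hat(n) - y(n); LMS weight update
    y = float(w @ u_vec)
    e = g(y) - y
    return w + mu * e * u_vec

# +/-1 symbols through a mild channel; slicer as the estimator g
x = rng.choice([-1.0, 1.0], 20_000)
u = x + 0.3 * np.roll(x, 1)                # channel output u(n)
w = np.zeros(11); w[5] = 1.0               # center-spike initialization
for n in range(10, len(u)):
    w = blind_lms_step(w, u[n-10:n+1][::-1], mu=0.001, g=np.sign)
```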