EE 278 Lecture Notes #3 Winter 2010–2011
Random variables, vectors, and processes
EE278: Introduction to Statistical Signal Processing, winter 2010–2011 © R.M. Gray 2011
Random Variables
Probability space (Ω,F , P)
A (real-valued) random variable is a real-valued function defined on Ω, with a technical condition (to be stated)
Common to use upper-case letters, e.g., a random variable X is a function X : Ω → R. Also Y, Z, U, V, Θ, . . . A random variable may take on values only in some subset ΩX ⊂ R (sometimes called the alphabet of X; AX and X are also common notations)
Intuition: Randomness is in the experiment, which produces an outcome ω according to probability P ⇒ the random variable outcome is X(ω) ∈ ΩX ⊂ R.
Examples
Consider (Ω, F, P) with Ω = R, P determined by the uniform pdf on [0, 1)
Coin flip from earlier: X : R → {0, 1} by

X(r) = 0 if r ≤ 0.5, 1 otherwise.
Observe X; do not observe the outcome of the fair spin.
Lots of possible random variables, e.g., W(r) = r², Z(r) = e^r, V(r) = r, L(r) = −r ln r (requires r ≥ 0), Y(r) = cos(2πr), etc.
Can think of rvs as observations or measurements made on an underlying experiment.
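This view — one underlying experiment, many random variables as functions of the same outcome — can be sketched in code (a toy illustration; the function names X, W, Y follow the slide's examples):

```python
import math
import random

# One underlying experiment: a fair spinner producing r uniform on [0, 1).
# Each random variable is a deterministic function applied to the same
# outcome, matching the slide's examples.

def X(r):  # binary quantizer
    return 0 if r <= 0.5 else 1

def W(r):  # square
    return r * r

def Y(r):  # sinusoid
    return math.cos(2 * math.pi * r)

r = random.random()     # outcome of the underlying experiment
print(X(r), W(r), Y(r)) # three rv values derived from the same outcome
```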
Functions of random variables
Suppose that X is a rv defined on (Ω, F, P) and that g : ΩX → R is another real-valued function.
Then the function g(X) : Ω → R defined by g(X)(ω) = g(X(ω)) is also a real-valued mapping of Ω, i.e., a real-valued function of a random variable is a random variable
Can express the previous examples as W = V², Z = e^V, L = −V ln V, Y = cos(2πV)
Similarly, 1/W, sinh(Y), L³ are all random variables
Random vectors and random processes
A finite collection of random variables (defined on a common probability space (Ω, F, P)) is a random vector
E.g., (X,Y), (X0, X1, · · · , Xk−1)
An infinite collection of random variables (defined on a common probability space) is a random process
E.g., {Xn; n = 0, 1, 2, · · · }, {X(t); t ∈ (−∞, ∞)}
So the theory of random vectors and random processes mostly boils down to the theory of random variables.
Derived distributions
In general: “input” probability space (Ω, F, P) + random variable X ⇒ “output” probability space, say (ΩX, B(ΩX), PX), where ΩX ⊂ R and PX is the distribution of X: PX(F) = Pr(X ∈ F)

Typically PX is described by a pmf pX or a pdf fX
For the binary quantizer special case we derived PX.
The idea generalizes and forces a technical condition on the definition of a random variable (and hence also on random vectors and random processes)
Inverse image formula
Given (Ω, F, P) and a random variable X, find PX
Basic method: PX(F) = the probability, computed using P, of all the original sample points that are mapped by X into the subset F:

PX(F) = P({ω : X(ω) ∈ F})
Shorthand way to write the formula in terms of the inverse image of an event F ∈ B(ΩX) under the mapping X : Ω → ΩX, X−1(F) = {ω : X(ω) ∈ F}:

PX(F) = P(X−1(F))
Written informally as PX(F) = Pr(X ∈ F) = P(X ∈ F) = “probability that random variable X assumes a value in F”
[Figure: X maps the inverse image X−1(F) ⊂ Ω into the event F ⊂ ΩX]

Inverse image method: Pr(X ∈ F) = P({ω : X(ω) ∈ F}) = P(X−1(F))
The inverse image formula is fundamental to probability, random processes, and signal processing.
It shows how to compute probabilities of output events in terms of the input probability space. But does the definition make sense?
I.e., is PX(F) = P(X−1(F)) well-defined for all output events F?
Yes, if we include a requirement in the definition of a random variable —
Careful definition of a random variable
Given a probability space (Ω, F, P), a (real-valued) random variable X is a function X : Ω → ΩX ⊂ R with the property that

if F ∈ B(ΩX), then X−1(F) ∈ F
Notes:
• In English: X : Ω → ΩX ⊂ R is a random variable iff the inverse image of every output event is an input event, and therefore PX(F) = P(X−1(F)) is well-defined for all events F.
• Another name for a function with this property: measurable function
• Almost every function we encounter is measurable, but the calculus of probability rests on this property, and advanced courses prove measurability of important functions.
In the simple binary quantizer example, X is measurable (easy to show since F = B([0, 1)) contains the intervals). Recall

PX({0}) = P({r : X(r) = 0}) = P(X−1({0})) = P({r : 0 ≤ r ≤ 0.5}) = P([0, 0.5]) = 0.5

PX({1}) = P(X−1({1})) = P((0.5, 1.0]) = 0.5

PX(ΩX) = PX({0, 1}) = P(X−1({0, 1})) = P([0, 1)) = 1

PX(∅) = P(X−1(∅)) = P(∅) = 0
In general, find PX by computing a pmf or pdf, as appropriate. There are many shortcuts, but the basic approach is the inverse image formula.
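The quantizer result PX({0}) = P([0, 0.5]) = 0.5 can be checked by simulating the underlying uniform experiment (a Monte Carlo sketch; the sample size and seed are arbitrary choices):

```python
import random

# Monte Carlo check of the inverse image formula for the binary quantizer:
# P_X(0) = P({r : X(r) = 0}) = P([0, 0.5]) should come out near 0.5.
random.seed(0)
n = 200_000
count0 = sum(1 for _ in range(n) if random.random() <= 0.5)
p0 = count0 / n
print(p0)  # close to 0.5
```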
Random vectors
All the theory, calculus, and applications of individual random variables are useful for studying random vectors and random processes, since random vectors and processes are simply collections of random variables.
One k-dimensional random vector = k 1-dimensional random variables defined on a common probability space.
Earlier example: two coin flips, k coin flips (first k binary coefficients of the fair spinner)
Several notations used, e.g., X^k = (X0, X1, . . . , Xk−1) is shorthand for X^k(ω) = (X0(ω), X1(ω), . . . , Xk−1(ω))

or X, or {Xn; n = 0, 1, . . . , k − 1}, or {Xn; n ∈ Zk}
Can be discrete (described by a multidimensional pmf) or continuous (e.g., described by a multidimensional pdf) or mixed
Recall that a real-valued function of a random variable is a random variable.
Similarly, a real-valued function of a random vector (several random variables) is a random variable. E.g., if X0, X1, . . . , Xn−1 are random variables, then
Sn = (1/n) ∑_{k=0}^{n−1} Xk

is a random variable defined by

Sn(ω) = (1/n) ∑_{k=0}^{n−1} Xk(ω)
Inverse image formula for random vectors
PX(F) = P(X−1(F)) = P({ω : X(ω) ∈ F}) = P({ω : (X0(ω), X1(ω), . . . , Xk−1(ω)) ∈ F})

where the various forms are equivalent and all stand for Pr(X ∈ F)
Technically, the formula holds for suitable events F ∈ B(R^k), the Borel field of R^k (or some suitable subset). See the book for discussion.
One multidimensional event of particular interest is a Cartesian product of 1D events (called a rectangle): F = ×_{i=0}^{k−1} Fi = {x^k : xi ∈ Fi; i = 0, . . . , k − 1}

PX(F) = P({ω : X0(ω) ∈ F0, X1(ω) ∈ F1, . . . , Xk−1(ω) ∈ Fk−1})
Random processes
A random vector is a finite collection of rvs defined on a common probability space
A random process is an infinite family of rvs defined on a common probability space. Many types:
{Xn; n = 0, 1, 2, . . .} (discrete-time, one-sided)
{Xn; n ∈ Z} (discrete-time, two-sided)
{Xt; t ∈ [0, ∞)} (continuous-time, one-sided)
{Xt; t ∈ R} (continuous-time, two-sided)
Also called stochastic process
In general: {Xt; t ∈ T} or {X(t); t ∈ T}
Other notations: X(t), X[n] (for discrete-time)
Sloppy but common: X(t), where context tells whether a random process or a single rv is meant
Discrete-time random processes are also called time series
Always: a random process is an indexed family of random variables; T is the index set
For each t, Xt is a random variable. All Xt are defined on a common probability space
The index is usually time; in some applications it is space, e.g., a random field {X(t, s); t, s ∈ [0, 1)} models a random image, and {V(x, y, t); x, y ∈ [0, 1); t ∈ [0, ∞)} models analog video.
Keep in mind the suppressed argument ω: each Xt is Xt(ω), a function defined on the sample space
X(t) is X(t, ω); it can be viewed as a function of two arguments
Have seen one example: fair coin flips, a Bernoulli random process
Another, simpler, example:
Random sinusoids. Suppose that A and Θ are two random variables with a joint pdf fA,Θ(a, θ) = fA(a) fΘ(θ). For example, Θ ∼ U([0, 2π)) and A ∼ N(0, σ²). Define a continuous-time random process X(t) for all t ∈ R by

X(t) = A cos(2πt + Θ)

Or, making the dependence on ω explicit,

X(t, ω) = A(ω) cos(2πt + Θ(ω))
Derived distributions for random variables
General problem: Given a probability space (Ω, F, P) and a random variable X with range space (alphabet) ΩX, find the distribution PX.
If ΩX is discrete, then PX is described by a pmf

pX(x) = P(X−1({x})) = P({ω : X(ω) = x})

PX(F) = ∑_{x∈F} pX(x) = P(X−1(F))
If ΩX is continuous, then need a pdf.
But a pdf is not a probability, so the inverse image formula does not apply immediately ⇒ alter the approach
Cumulative distribution functions
Define cumulative distribution function (cdf) by
FX(x) ≡ ∫_{−∞}^{x} fX(r) dr = Pr(X ≤ x)

This is a probability and the inverse image formula works:

FX(x) = P(X−1((−∞, x]))

and from calculus

fX(x) = (d/dx) FX(x)
So first find cdf FX(x), then differentiate to find fX(x)
Notes:
• If a ≥ b, then since (−∞, a] = (−∞, b] ∪ (b, a] is a union of disjoint intervals, FX(a) = FX(b) + PX((b, a]) and hence

PX((b, a]) = ∫_{b}^{a} fX(x) dx = FX(a) − FX(b) ≥ 0

⇒ FX(x) is monotonically nondecreasing
• cdf is well defined for discrete rvs:

FX(r) = Pr(X ≤ r) = ∑_{x: x≤r} pX(x),

but not as useful. Not needed for derived distributions
If the original space (Ω, F, P) is a discrete probability space, then a rv X defined on (Ω, F, P) is also discrete
Inverse image formula ⇒

pX(x) = PX({x}) = P(X−1({x})) = ∑_{ω: X(ω)=x} p(ω)
Example: discrete derived distribution
Ω = Z+ = {1, 2, 3, . . .}, P determined by the geometric pmf p(ω) = (1 − p)^{ω−1} p

Define a random variable Y by

Y(ω) = 1 if ω even, 0 if ω odd
Using the inverse image formula, the pmf for Y(ω) = 1:

pY(1) = ∑_{ω even} (1 − p)^{ω−1} p = ∑_{k=2,4,...} (1 − p)^{k−1} p

= (p/(1 − p)) ∑_{k=1}^{∞} ((1 − p)²)^k = p(1 − p) ∑_{k=0}^{∞} ((1 − p)²)^k

= p(1 − p) / (1 − (1 − p)²) = (1 − p)/(2 − p)

pY(0) = 1 − pY(1) = 1/(2 − p)
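The closed form pY(1) = (1 − p)/(2 − p) can be checked against a direct truncated sum of the geometric pmf over even outcomes (a numerical sketch; the truncation point K is an arbitrary choice):

```python
# Sum the geometric pmf (1-p)^(k-1) p over even k and compare with the
# derived closed form (1-p)/(2-p); truncation at K terms is an assumption.
def p_even(p, K=10_000):
    return sum((1 - p) ** (k - 1) * p for k in range(2, K, 2))

for p in (0.2, 0.5, 0.9):
    exact = (1 - p) / (2 - p)
    assert abs(p_even(p) - exact) < 1e-9
print("matches (1-p)/(2-p)")
```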
Suppose the original space is (Ω, F, P) = (R, B(R), P) where P is described by a pdf g:

P(F) = ∫_{r∈F} g(r) dr; F ∈ B(R).

X a rv. Inverse image formula ⇒

PX(F) = P(X−1(F)) = ∫_{r: X(r)∈F} g(r) dr.

If X is discrete, find the pmf

pX(x) = ∫_{r: X(r)=x} g(r) dr
The quantizer example did this.
If X is continuous, we want the pdf: first find the cdf, then differentiate.
Example: continuous derived distribution
Square of a random variable
(R, B(R), P) with P induced by a Gaussian pdf g.
Define W : R → R by W(r) = r²; r ∈ R.
Find the pdf fW. First find the cdf FW, then differentiate. If w < 0, FW(w) = 0. If w ≥ 0,

FW(w) = Pr(W ≤ w) = P({r : W(r) = r² ≤ w}) = P([−w^{1/2}, w^{1/2}]) = ∫_{−w^{1/2}}^{w^{1/2}} g(r) dr

This can be complicated, but we don't need to plug in g yet
Use the integral differentiation (Leibniz) formula to get the pdf directly:

(d/dw) ∫_{a(w)}^{b(w)} g(r) dr = g(b(w)) (db(w)/dw) − g(a(w)) (da(w)/dw)

In our example

fW(w) = g(w^{1/2}) (w^{−1/2}/2) − g(−w^{1/2}) (−w^{−1/2}/2)
E.g., if g = N(0, σ²), then

fW(w) = (w^{−1/2} / √(2πσ²)) e^{−w/2σ²}; w ∈ [0, ∞)

— a chi-squared pdf with one degree of freedom
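This cdf/pdf pair can be sanity-checked numerically: for W = X² with X ∼ N(0, σ²), FW(w) = erf(√w/(σ√2)), and a finite difference of this cdf should match the derived fW (a sketch; the value of σ and the test grid are arbitrary choices):

```python
import math

# Check that f_W(w) = w^(-1/2) exp(-w/(2 s^2)) / sqrt(2 pi s^2) is the
# derivative of F_W(w) = Pr(W <= w) = erf(sqrt(w) / (s sqrt(2))).
s = 1.3  # an arbitrary sigma for the test

def F_W(w):
    return math.erf(math.sqrt(w) / (s * math.sqrt(2.0)))

def f_W(w):
    return math.exp(-w / (2 * s * s)) / (math.sqrt(w) * math.sqrt(2 * math.pi) * s)

for w in (0.1, 0.5, 1.0, 3.0):
    h = 1e-6
    deriv = (F_W(w + h) - F_W(w - h)) / (2 * h)  # central difference
    assert abs(deriv - f_W(w)) < 1e-5
print("cdf/pdf pair consistent")
```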
Example: continuous derived distribution
The max and min functions
Let X ∼ fX(x) and Y ∼ fY(y) be independent, so that fX,Y(x, y) = fX(x) fY(y).
Define

U = max(X, Y), V = min(X, Y)

where

max(x, y) = x if x ≥ y, y otherwise
min(x, y) = y if x ≥ y, x otherwise
Find the pdfs of U and V.
To find the pdf of U, we first find its cdf. U ≤ u iff both X and Y are ≤ u, so using independence
FU(u) = Pr(U ≤ u) = Pr(X ≤ u,Y ≤ u) = FX(u)FY(u)
Using the product rule for derivatives,
fU(u) = fX(u)FY(u) + fY(u)FX(u)
To find the pdf of V, first find the cdf. V ≤ v iff either X or Y is ≤ v, so using independence
FV(v) = Pr(X ≤ v or Y ≤ v)
= 1 − Pr(X > v,Y > v)
= 1 − (1 − FX(v))(1 − FY(v))
Thus

fV(v) = fX(v) + fY(v) − fX(v)FY(v) − fY(v)FX(v)
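The formulas for fU and fV can be exercised on a concrete pair of distributions, e.g., X, Y iid exponential with unit rate, where min(X, Y) is known to be exponential with rate 2 (the exponential choice is an illustration, not from the notes):

```python
import math

# Sanity check of f_U and f_V for independent X, Y ~ Exp(1).
f = lambda x: math.exp(-x)          # pdf of Exp(1)
F = lambda x: 1.0 - math.exp(-x)    # cdf of Exp(1)

def f_U(u):  # max: product rule applied to F_X(u) F_Y(u)
    return f(u) * F(u) + f(u) * F(u)

def f_V(v):  # min: the slide's formula
    return f(v) + f(v) - f(v) * F(v) - f(v) * F(v)

for t in (0.1, 1.0, 2.5):
    # known closed forms for this Exp(1) example
    assert abs(f_V(t) - 2 * math.exp(-2 * t)) < 1e-12
    assert abs(f_U(t) - 2 * math.exp(-t) * (1 - math.exp(-t))) < 1e-12
print("max/min formulas check out")
```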
Directly-given random variables
All the named examples of pmfs (uniform, Bernoulli, binomial, geometric, Poisson) and pdfs (uniform, exponential, Gaussian, Laplacian, chi-squared, etc.) and the probability spaces they imply can be considered as describing random variables:
Suppose (Ω,F , P) is a probability space with Ω ⊂ R.
Define a random variable V : Ω → Ω by

V(ω) = ω

— the identity mapping; the random variable just reports the original sample value ω
This implies an output probability space in a trivial way:

PV(F) = P(V−1(F)) = P(F)

If the original space is discrete (continuous), so is the random variable, and the random variable is described by a pmf (pdf)
A random variable is said to be Bernoulli, binomial, etc. if its distribution is determined by a Bernoulli, binomial, etc. pmf (or pdf)
Two random variables V and X (possibly defined on different experiments) are said to be equivalent or identically distributed if PV = PX, i.e., PV(F) = PX(F) for all events F
E.g., both continuous with same pdf, or both discrete with same pmf
Example: binary random variable defined as the quantization of the fair spinner vs. directly given as above.
Note: Two ways to describe random variables:
1. Describe a probability space (Ω, F, P) and define a function X on it. Together these imply the distribution PX for the rv (by a pmf or pdf)
2. (Directly given) Describe the distribution PX directly (by a pmf or pdf). Implicitly (Ω, F, P) = (ΩX, B(ΩX), PX) and X(ω) = ω.
Both representations are useful.
Derived distributions: random vectors
As in the scalar case, the distribution can be described by probability functions — cdfs and either pmfs or pdfs (or both)
If the random vector has a discrete range space, then the distribution can be described by a multidimensional pmf pX(x) = PX({x}) = Pr(X = x) as

PX(F) = ∑_{x∈F} pX(x) = ∑_{(x0,x1,...,xk−1)∈F} pX0,X1,...,Xk−1(x0, x1, . . . , xk−1)

If the random vector X has a continuous range space, then the distribution can be described by a multidimensional pdf fX:

PX(F) = ∫_F fX(x) dx

Use the multidimensional cdf to find the pdf
Given a k-dimensional random vector X, define the cumulative distribution function (cdf) FX by

FX(x) = FX0,X1,...,Xk−1(x0, x1, . . . , xk−1)
= PX({α : αi ≤ xi; i = 0, 1, . . . , k − 1}) = Pr(Xi ≤ xi; i = 0, 1, . . . , k − 1)
= ∫_{−∞}^{x0} ∫_{−∞}^{x1} · · · ∫_{−∞}^{xk−1} fX0,X1,...,Xk−1(α0, α1, . . . , αk−1) dα0 dα1 · · · dαk−1
Other ways to express the multidimensional cdf:

FX(x) = PX(×_{i=0}^{k−1} (−∞, xi])
= P({ω : Xi(ω) ≤ xi; i = 0, 1, . . . , k − 1})
= P(∩_{i=0}^{k−1} Xi−1((−∞, xi])).
Integration and differentiation are inverses of each other ⇒

fX0,X1,...,Xk−1(x0, x1, . . . , xk−1) = ∂^k/(∂x0 ∂x1 . . . ∂xk−1) FX0,X1,...,Xk−1(x0, x1, . . . , xk−1).
Joint and marginal distributions
Random vector X = (X0, X1, . . . , Xk−1) is a collection of random variables defined on a common probability space (Ω, F, P)
Alternatively, X is a random vector that takes on values randomly as described by a probability distribution PX, without explicit reference to the underlying probability space.
Either the original probability measure P or the induced distribution PX can be used to compute probabilities of events involving the random vector.
E.g., finding the distributions of individual components of the random vector.
For example, if X = (X0, X1, . . . , Xk−1) is discrete, described by a pmf pX, then the distribution PX0 is described by the pmf pX0(x0), which can be computed as

pX0(x0) = P({ω : X0(ω) = x0})
= P({ω : X0(ω) = x0, Xi(ω) ∈ ΩX; i = 1, 2, . . . , k − 1})
= ∑_{x1,x2,...,xk−1} pX(x0, x1, x2, . . . , xk−1)

In English, all of these are Pr(X0 = x0)
In general we have for cdfs that

FX0(x0) = P({ω : X0(ω) ≤ x0})
= P({ω : X0(ω) ≤ x0, Xi(ω) ∈ ΩX; i = 1, 2, . . . , k − 1})
= FX(x0, ∞, ∞, . . . , ∞)
⇒ if the pdfs exist,

fX0(x0) = ∫ fX(x0, x1, x2, . . . , xk−1) dx1 dx2 . . . dxk−1
Can find distributions for any of the components in this way:

pXi(α) = ∑_{x0,x1,...,xi−1,xi+1,...,xk−1} pX0,X1,...,Xk−1(x0, x1, . . . , xi−1, α, xi+1, . . . , xk−1)

or

fXi(α) = ∫ dx0 . . . dxi−1 dxi+1 . . . dxk−1 fX0,...,Xk−1(x0, . . . , xi−1, α, xi+1, . . . , xk−1)
Sum or integrate over all of the dummy variables corresponding to the unwanted random variables in the vector to obtain the pmf or pdf for the random variable Xi

FXi(α) = FX(∞, ∞, . . . , ∞, α, ∞, . . . , ∞), or Pr(Xi ≤ α) = Pr(Xi ≤ α and Xj ≤ ∞, all j ≠ i)
Similarly, can find cdfs/pmfs/pdfs for any pairs or triples of random variables in the random vector, or any other subvector (at least in theory)
These relations are called consistency relationships — a random vector distribution implies many other distributions, and these must be consistent with each other.
2D random vectors
Ideas are clearest with only 2 rvs: (X, Y) a random vector.
The marginal distribution of X is obtained from the joint distribution of X and Y by leaving Y unconstrained:

PX(F) = PX,Y({(x, y) : x ∈ F, y ∈ R}); F ∈ B(R).

The marginal cdf of X is FX(α) = FX,Y(α, ∞)
If the range space of the vector (X, Y) is discrete,

pX(x) = ∑_y pX,Y(x, y).
If the range space of the vector (X, Y) is continuous and the cdf is differentiable so that fX,Y(x, y) exists,

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy,
with similar expressions for the distribution for rv Y.
Joint distributions imply marginal distributions.
The opposite is not true without additional assumptions, e.g.,independence.
Examples of joint and marginal distributions
Example
Suppose rvs X and Y are such that the random vector (X, Y) has a pmf of the form

pX,Y(x, y) = r(x)q(y),

where r and q are both valid pmfs. (pX,Y is a product pmf.)
Then

pX(x) = ∑_y pX,Y(x, y) = ∑_y r(x)q(y) = r(x) ∑_y q(y) = r(x).
Thus in the special case of a product distribution, knowing the marginal pmfs is enough to know the joint distribution. Thus marginal distributions + independence ⇒ the joint distribution.
A pair of fair coins provides an example:

pXY(x, y) = pX(x)pY(y) = 1/4; x, y = 0, 1

pX(x) = pY(y) = 1/2; x = 0, 1
Example where marginals are not enough
Flip two fair coins connected by a piece of flexible rubber:
pXY(x, y):   y = 0   y = 1
  x = 0       0.4     0.1
  x = 1       0.1     0.4

⇒ pX(x) = pY(y) = 1/2, x = 0, 1
Not a product distribution, but the same marginals as in the product distribution case
Quite different joints can yield the same marginals. Marginals alone do not tell the whole story.
Another example
A loaded pair of six-sided dice has the property that the sum of the two dice = 7 on every roll.
All 6 possible combinations ((1,6), (2,5), (3,4), (4,3), (5,2), (6,1)) have equal probability.
Suppose the outcome of one die is X, the other is Y
(X, Y) is a random vector taking values in {1, 2, . . . , 6}²

pX,Y(x, y) = 1/6, x + y = 7, (x, y) ∈ {1, 2, . . . , 6}².

Find the marginal pmfs
pX(x) = ∑_y pXY(x, y) = pXY(x, 7 − x) = 1/6, x = 1, 2, . . . , 6

Same as if it were a product distribution. Marginals alone do not imply the joint.
Continuous example
(X, Y) a random vector with a pdf that is constant on the unit disk in the XY plane:

fX,Y(x, y) = C if x² + y² ≤ 1, 0 otherwise

Find the marginal pdfs. Is it a product pdf?

Need C: ∫∫_{x²+y²≤1} C dx dy = 1.

Integral = area of a circle multiplied by C ⇒ C = 1/π.
fX(x) = ∫_{−√(1−x²)}^{+√(1−x²)} C dy = 2C√(1 − x²), x² ≤ 1.

Could now also find C by a second integration:

∫_{−1}^{+1} 2C√(1 − x²) dx = πC = 1,

or C = 1/π.

Thus

fX(x) = (2/π)√(1 − x²), x² ≤ 1.
By symmetry Y has the same pdf. fX,Y is not a product pdf.
Note the marginal pdf is not constant, even though the joint pdf is.
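The marginal fX(x) = (2/π)√(1 − x²) can be checked by rejection-sampling the uniform distribution on the disk and estimating the density of X near a point (a Monte Carlo sketch; the point x0, the window width, and the seed are arbitrary choices):

```python
import math
import random

# Sample (X, Y) uniform on the unit disk by rejection from the square,
# then compare the conditional frequency of X near x0 with
# f_X(x0) = (2/pi) sqrt(1 - x0^2).
random.seed(1)
n, hits, x0, width = 400_000, 0, 0.3, 0.02
inside = 0
for _ in range(n):
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:       # accepted: uniform on the disk
        inside += 1
        if abs(x - x0) < width / 2:
            hits += 1
approx = hits / inside / width   # density estimate at x0
exact = (2 / math.pi) * math.sqrt(1 - x0 * x0)
print(approx, exact)  # should agree to roughly two decimal places
```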
Joints and marginals: Gaussian pair
2D Gaussian pdf with k = 2, m = (0, 0)^t, and Λ = {λ(i, j)} with λ(1, 1) = λ(2, 2) = 1, λ(1, 2) = λ(2, 1) = ρ. The inverse matrix is

[1 ρ; ρ 1]^{−1} = (1/(1 − ρ²)) [1 −ρ; −ρ 1],

and the joint pdf for the random vector (X, Y) is

fX,Y(x, y) = exp(−(x² + y² − 2ρxy)/(2(1 − ρ²))) / (2π√(1 − ρ²)), (x, y) ∈ R².

ρ is called the “correlation coefficient”
Need ρ² < 1 for Λ to be positive definite
To find the pdf of X, integrate the joint over y
Do this using a standard trick: complete the square:
x² + y² − 2ρxy = (y − ρx)² − ρ²x² + x² = (y − ρx)² + (1 − ρ²)x²

fX,Y(x, y) = exp(−(y − ρx)²/(2(1 − ρ²)) − x²/2) / (2π√(1 − ρ²))
= [exp(−(y − ρx)²/(2(1 − ρ²))) / √(2π(1 − ρ²))] × [exp(−x²/2) / √(2π)].

Part of the joint is N(ρx, 1 − ρ²), which integrates to 1. Thus

fX(x) = (2π)^{−1/2} e^{−x²/2}.
Note the marginals are the same regardless of ρ!
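That the marginal is N(0, 1) for every ρ can be verified by numerically integrating the joint pdf over y (a sketch; the integration grid and the test values of x and ρ are arbitrary choices):

```python
import math

# Numerically integrate the bivariate normal joint pdf over y and compare
# with the standard normal marginal, for several correlations rho.
def f_XY(x, y, rho):
    q = (x * x + y * y - 2 * rho * x * y) / (2 * (1 - rho * rho))
    return math.exp(-q) / (2 * math.pi * math.sqrt(1 - rho * rho))

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # N(0,1) pdf

x = 0.7
for rho in (-0.9, 0.0, 0.5):
    # midpoint rule over y in [-8, 8]; the Gaussian tail beyond is negligible
    n, a, b = 4000, -8.0, 8.0
    h = (b - a) / n
    integral = sum(f_XY(x, a + (i + 0.5) * h, rho) for i in range(n)) * h
    assert abs(integral - phi(x)) < 1e-6
print("marginal is N(0,1) for every rho")
```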
Consistency & directly given processes
Have seen two ways to describe (specify) a random variable: as a probability space + a function (random variable), or as a directly given rv (a distribution — a pdf or pmf)
The same idea works for random vectors.
What about random processes? E.g., a direct definition of the fair coin flipping process.
For simplicity, consider a discrete-time, discrete-alphabet random process, say {Xn}. Given the random process, can use the inverse image formula to compute the pmf for any finite collection of samples (Xk1, Xk2, . . . , XkK), e.g.,

pXk1,Xk2,...,XkK(x1, x2, . . . , xK) = Pr(Xki = xi; i = 1, . . . , K) = P({ω : Xki(ω) = xi; i = 1, . . . , K})

For example, in the fair coin flipping process

pXk1,Xk2,...,XkK(x1, x2, . . . , xK) = 2^{−K}, all (x1, x2, . . . , xK) ∈ {0, 1}^K
The axioms of probability ⇒ these pmfs, for any choice of K and k1, . . . , kK, must be consistent in the sense that if any of the pmfs is used to compute the probability of an event, the answer must be the same. E.g.,

pX1(x1) = ∑_{x2} pX1,X2(x1, x2)
= ∑_{x0,x2} pX0,X1,X2(x0, x1, x2)
= ∑_{x3,x5} pX1,X3,X5(x1, x3, x5)

since all of these computations yield the same probability in the original probability space: Pr(X1 = x1) = P({ω : X1(ω) = x1})
Bottom line: if given a discrete-time, discrete-alphabet random process {Xn; n ∈ Z}, then for any finite K and collection of K sample times k1, . . . , kK we can find the joint pmf pXk1,Xk2,...,XkK(x1, x2, . . . , xK), and this collection of pmfs must be consistent.
Kolmogorov proved a converse to this idea, now called the Kolmogorov extension theorem, which provides the most common method for describing a random process:

Theorem. (Kolmogorov extension theorem for discrete-time processes) Given a consistent family of finite-dimensional pmfs pXk1,Xk2,...,XkK(x1, x2, . . . , xK) for all dimensions K and sample times k1, . . . , kK, there is a random process {Xn; n ∈ Z} described by these pmfs.
To completely describe a random process, you need only provide a formula for a consistent family of pmfs for finite collections of samples.
The same result holds for continuous-time random processes and for continuous-alphabet processes (family of pdfs)
Difficult to prove, but the most common way to specify a model: the Kolmogorov or directly-given representation of a random process describes a consistent family of vector distributions. For completeness:
Theorem. (Kolmogorov extension theorem) Suppose that one is given a consistent family of finite-dimensional distributions PXt0,Xt1,...,Xtk−1 for all positive integers k and all possible sample times ti ∈ T; i = 0, 1, . . . , k − 1. Then there exists a random process {Xt; t ∈ T} that is consistent with this family. In other words, to describe a random process completely, it is sufficient to describe a consistent family of finite-dimensional distributions of its samples.

Example: Given a pmf p, define a family of vector pmfs by

pXk1,Xk2,...,XkK(x1, x2, . . . , xK) = ∏_{i=1}^{K} p(xi),

then there is a random process {Xn} having these vector pmfs for finite collections of samples. A process of this form is called an iid process.
The continuous-alphabet analog is defined in terms of a pdf f — define the vector pdfs by

fXk1,Xk2,...,XkK(x1, x2, . . . , xK) = ∏_{i=1}^{K} f(xi)

A discrete-time, continuous-alphabet process is iid if its joint pdfs factor in this way.
Independent random variables
Return to the definition of independent rvs, with more explanation.
The definition of independent random variables is an application of the definition of independent events.
Defined events F and G to be independent if P(F ∩ G) = P(F)P(G)
Two random variables X and Y defined on a probability space are independent if the events X−1(F) and Y−1(G) are independent for all F and G in B(R), i.e., if

P(X−1(F) ∩ Y−1(G)) = P(X−1(F))P(Y−1(G))

Equivalently, Pr(X ∈ F, Y ∈ G) = Pr(X ∈ F) Pr(Y ∈ G) or PXY(F × G) = PX(F)PY(G)
If X, Y are discrete, choosing F = {x}, G = {y} ⇒

pXY(x, y) = pX(x)pY(y) for all x, y
Conversely, if the joint pmf = product of marginals, then evaluate Pr(X ∈ F, Y ∈ G) as

P(X−1(F) ∩ Y−1(G)) = ∑_{x∈F, y∈G} pXY(x, y) = ∑_{x∈F, y∈G} pX(x)pY(y)
= (∑_{x∈F} pX(x)) (∑_{y∈G} pY(y)) = P(X−1(F))P(Y−1(G))

⇒ independent by the general definition
For general random variables, consider F = (−∞, x], G = (−∞, y]. Then if X, Y are independent, FXY(x, y) = FX(x)FY(y) for all x, y. If pdfs exist, this implies that

fXY(x, y) = fX(x) fY(y)

Conversely, if this relation holds for all x, y, then P(X−1(F) ∩ Y−1(G)) = P(X−1(F))P(Y−1(G)) and hence X and Y are independent.
A collection of rvs Xi, i = 0, 1, . . . , k − 1 is independent or mutually independent if all collections of events of the form Xi−1(Fi); i = 0, 1, . . . , k − 1 are mutually independent for any Fi ∈ B(R); i = 0, 1, . . . , k − 1.

A collection of discrete random variables Xi; i = 0, 1, . . . , k − 1 is mutually independent iff

pX0,...,Xk−1(x0, . . . , xk−1) = ∏_{i=0}^{k−1} pXi(xi); ∀xi.

A collection of continuous random variables is independent iff the joint pdf factors as

fX0,...,Xk−1(x0, . . . , xk−1) = ∏_{i=0}^{k−1} fXi(xi).
A collection of general random variables is independent iff the joint cdf factors as

FX0,...,Xk−1(x0, . . . , xk−1) = ∏_{i=0}^{k−1} FXi(xi); (x0, x1, . . . , xk−1) ∈ R^k.
The random vector is independent, identically distributed (iid) if thecomponents are independent and the marginal distributions are allthe same.
Conditional distributions
Apply conditional probability to distributions.
Can express joint probabilities as products even if the rvs are not independent
E.g., the distribution of the input given the observed output (for inference)
There are many types: conditional pmfs, conditional pdfs, conditional cdfs
Elementary and nonelementary conditional probability
Discrete conditional distributions
Simplest: a direct application of elementary conditional probability to pmfs
Consider a 2D discrete random vector (X, Y) with
alphabet AX × AY
joint pmf pX,Y(x, y)
marginal pmfs pX and pY
Define for each x ∈ AX for which pX(x) > 0 the conditional pmf

pY|X(y|x) = P(Y = y | X = x)
= P(Y = y, X = x) / P(X = x)
= P({ω : Y(ω) = y} ∩ {ω : X(ω) = x}) / P({ω : X(ω) = x})
= pX,Y(x, y) / pX(x),

the elementary conditional probability that Y = y given X = x
Properties of conditional pmfs
For fixed x, pY|X(·|x) is a pmf:

∑_{y∈AY} pY|X(y|x) = ∑_{y∈AY} pX,Y(x, y)/pX(x) = (1/pX(x)) ∑_{y∈AY} pX,Y(x, y) = (1/pX(x)) pX(x) = 1.

The joint pmf can be expressed as a product as

pX,Y(x, y) = pY|X(y|x)pX(x).
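Both properties (each pY|X(·|x) sums to 1, and pX,Y = pY|X pX) can be checked on a small hand-built joint pmf (the numbers are illustrative):

```python
# Check that p_{Y|X}(.|x) is a pmf and that p_{X,Y} = p_{Y|X} p_X
# for a small example joint pmf on {0,1} x {0,1}.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
pX = {x: sum(pr for (a, _), pr in joint.items() if a == x) for x in (0, 1)}
cond = {(y, x): joint[(x, y)] / pX[x] for x in (0, 1) for y in (0, 1)}

for x in (0, 1):
    # conditional pmf sums to 1 for each fixed x
    assert abs(sum(cond[(y, x)] for y in (0, 1)) - 1.0) < 1e-12
    for y in (0, 1):
        # joint recovered as conditional times marginal
        assert abs(cond[(y, x)] * pX[x] - joint[(x, y)]) < 1e-12
print("conditional pmf properties verified")
```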
Can compute conditional probabilities by summing conditional pmfs,

P(Y ∈ F | X = x) = ∑_{y∈F} pY|X(y|x)

Can write probabilities of events of the form {X ∈ G, Y ∈ F} (rectangles) as

P(X ∈ G, Y ∈ F) = ∑_{x,y: x∈G, y∈F} pX,Y(x, y)
= ∑_{x∈G} pX(x) ∑_{y∈F} pY|X(y|x)
= ∑_{x∈G} pX(x) P(Y ∈ F | X = x)

Later: define nonelementary conditional probability to mimic this formula
If X and Y are independent, then pY|X(y|x) = pY(y)

Given pY|X and pX,

pX|Y(x|y) = pX,Y(x, y)/pY(y) = pY|X(y|x)pX(x) / ∑_u pY|X(y|u)pX(u),

a result often referred to as Bayes' rule.
Example of Bayes rule: Binary Symmetric Channel
Consider the following binary communication channel:

X ∈ {0, 1} → ⊕ → Y ∈ {0, 1}, with noise Z ∈ {0, 1} added mod 2

The bit sent is X ∼ Bern(p), 0 ≤ p ≤ 1, the noise is Z ∼ Bern(ε), 0 ≤ ε ≤ 0.5, the bit received is Y = (X + Z) mod 2 = X ⊕ Z, and X and Z are independent

Find 1) pX|Y(x|y), 2) pY(y), and 3) Pr{X ≠ Y}, the probability of error
1. To find pX|Y(x|y) use Bayes rule

pX|Y(x|y) = pY|X(y|x)pX(x) / ∑_{x′∈AX} pY|X(y|x′)pX(x′)

Know pX(x), but we need to find pY|X(y|x):

pY|X(y|x) = Pr{Y = y | X = x} = Pr{X ⊕ Z = y | X = x}
= Pr{x ⊕ Z = y | X = x} = Pr{Z = y ⊕ x | X = x}
= Pr{Z = y ⊕ x} since Z and X are independent
= pZ(y ⊕ x)
Therefore
p_{Y|X}(0 | 0) = p_Z(0 ⊕ 0) = p_Z(0) = 1 − ε
p_{Y|X}(0 | 1) = p_Z(0 ⊕ 1) = p_Z(1) = ε
p_{Y|X}(1 | 0) = p_Z(1 ⊕ 0) = p_Z(1) = ε
p_{Y|X}(1 | 1) = p_Z(1 ⊕ 1) = p_Z(0) = 1 − ε
Plugging into Bayes rule:
p_{X|Y}(0|0) = \frac{p_{Y|X}(0|0)\, p_X(0)}{p_{Y|X}(0|0) p_X(0) + p_{Y|X}(0|1) p_X(1)} = \frac{(1 − ε)(1 − p)}{(1 − ε)(1 − p) + εp}
p_{X|Y}(1|0) = 1 − p_{X|Y}(0|0) = \frac{εp}{(1 − ε)(1 − p) + εp}
p_{X|Y}(0|1) = \frac{p_{Y|X}(1|0)\, p_X(0)}{p_{Y|X}(1|0) p_X(0) + p_{Y|X}(1|1) p_X(1)} = \frac{ε(1 − p)}{ε(1 − p) + (1 − ε)p}
p_{X|Y}(1|1) = 1 − p_{X|Y}(0|1) = \frac{(1 − ε)p}{(1 − ε)p + ε(1 − p)}
2. We already found p_Y(y) as
p_Y(y) = p_{Y|X}(y|0)\, p_X(0) + p_{Y|X}(y|1)\, p_X(1) = \begin{cases} (1 − ε)(1 − p) + εp & y = 0 \\ ε(1 − p) + (1 − ε)p & y = 1 \end{cases}
3. Now to find the probability of error \Pr\{X \ne Y\}, consider
\Pr\{X \ne Y\} = p_{X,Y}(0, 1) + p_{X,Y}(1, 0) = p_{Y|X}(1|0)\, p_X(0) + p_{Y|X}(0|1)\, p_X(1) = ε(1 − p) + εp = ε
An interesting special case is ε = 1/2. Here \Pr\{X \ne Y\} = 1/2, which is the worst possible (no information is sent), and
p_Y(0) = \frac{1}{2} p + \frac{1}{2}(1 − p) = \frac{1}{2} = p_Y(1)
Therefore Y ∼ Bern(1/2), independent of the value of p!
In this case, the bit sent X and the bit received Y are independent (check this)
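The posterior and error-probability formulas above are easy to check numerically. A minimal sketch in Python (the helper name `bsc_posterior` is mine, not from the notes), using exact rational arithmetic:

```python
from fractions import Fraction

def bsc_posterior(p, eps):
    # Input X ~ Bern(p), noise Z ~ Bern(eps), output Y = X xor Z.
    pX = {0: 1 - p, 1: p}
    def pYgX(y, x):                 # p_{Y|X}(y|x) = p_Z(y xor x)
        return 1 - eps if y == x else eps
    # total probability: p_Y(y) = sum_x p_{Y|X}(y|x) p_X(x)
    pY = {y: sum(pYgX(y, x) * pX[x] for x in (0, 1)) for y in (0, 1)}
    # Bayes rule: p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_Y(y)
    post = {(x, y): pYgX(y, x) * pX[x] / pY[y] for x in (0, 1) for y in (0, 1)}
    # Pr{X != Y} = p_{Y|X}(1|0) p_X(0) + p_{Y|X}(0|1) p_X(1)
    perr = pYgX(1, 0) * pX[0] + pYgX(0, 1) * pX[1]
    return post, pY, perr

post, pY, perr = bsc_posterior(Fraction(1, 4), Fraction(1, 10))
print(post[(0, 0)], pY[0], perr)    # 27/28 7/10 1/10
```

The error probability comes out to ε for any p, matching the slide's conclusion.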
Conditional pmfs for vectors
Random vector (X_0, X_1, . . . , X_{k−1}) with pmf p_{X_0, X_1, ..., X_{k−1}}
Define conditional pmfs (assuming the denominators are > 0):
p_{X_l|X_0,...,X_{l−1}}(x_l|x_0, . . . , x_{l−1}) = \frac{p_{X_0,...,X_l}(x_0, . . . , x_l)}{p_{X_0,...,X_{l−1}}(x_0, . . . , x_{l−1})}.
⇒ chain rule
p_{X_0, X_1, ..., X_{n−1}}(x_0, x_1, . . . , x_{n−1})
= \frac{p_{X_0,...,X_{n−1}}(x_0, . . . , x_{n−1})}{p_{X_0,...,X_{n−2}}(x_0, . . . , x_{n−2})}\, p_{X_0,...,X_{n−2}}(x_0, . . . , x_{n−2})
\vdots
= p_{X_0}(x_0) \prod_{i=1}^{n−1} \frac{p_{X_0,...,X_i}(x_0, . . . , x_i)}{p_{X_0,...,X_{i−1}}(x_0, . . . , x_{i−1})}
= p_{X_0}(x_0) \prod_{l=1}^{n−1} p_{X_l|X_0,...,X_{l−1}}(x_l|x_0, . . . , x_{l−1})
Formula plays an important role in characterizing memory in processes. Can be used to construct joint pmfs, and to specify a random process.
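The chain rule can also be read constructively: given p_{X_0} and a family of conditional pmfs, multiplying them out produces a valid joint pmf. A small sketch (the initial pmf and the transition parameter q are illustrative values, not from the notes):

```python
from fractions import Fraction
from itertools import product

p0 = {0: Fraction(1, 2), 1: Fraction(1, 2)}    # p_{X_0}
q = Fraction(1, 10)                            # illustrative flip probability
def cond(xl, prev):                            # p_{X_l | past}: depends only on x_{l-1} here
    return 1 - q if xl == prev else q

n = 3
joint = {}
for seq in product((0, 1), repeat=n):
    pr = p0[seq[0]]                            # chain rule: p_{X_0}(x_0) * prod of conditionals
    for l in range(1, n):
        pr *= cond(seq[l], seq[l - 1])
    joint[seq] = pr

print(sum(joint.values()))    # 1, so the construction yields a pmf
```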
Continuous conditional distributions
Continuous distributions are more complicated
Given X, Y with joint pdf f_{X,Y} and marginal pdfs f_X, f_Y, define the conditional pdf
f_{Y|X}(y|x) \equiv \frac{f_{X,Y}(x,y)}{f_X(x)},
analogous to the conditional pmf, but unlike the conditional pmf it is not a conditional probability!
A density of conditional probability
Problem: the conditioning event has probability 0, so elementary conditional probability does not work.
Conditional pdf is a pdf:
\int f_{Y|X}(y|x)\, dy = \int \frac{f_{X,Y}(x,y)}{f_X(x)}\, dy = \frac{1}{f_X(x)} \int f_{X,Y}(x,y)\, dy = \frac{1}{f_X(x)}\, f_X(x) = 1,
provided f_X(x) > 0 over the region of integration.
Given a conditional pdf f_{Y|X}, define the (nonelementary) conditional probability that Y ∈ F given X = x by
P(Y \in F \mid X = x) \equiv \int_F f_{Y|X}(y|x)\, dy.
Resembles the discrete form.
Nonelementary conditional probability
Does P(Y \in F \mid X = x) = \int_F f_{Y|X}(y|x)\, dy make sense as an appropriate definition of conditional probability given an event of zero probability?
Observe that, analogous to the result for pmfs, assuming the pdfs all make sense,
P(X \in G, Y \in F) = \int_{x \in G} \int_{y \in F} f_{X,Y}(x,y)\, dx\, dy = \int_{x \in G} f_X(x) \left( \int_{y \in F} f_{Y|X}(y|x)\, dy \right) dx = \int_{x \in G} f_X(x)\, P(Y \in F \mid X = x)\, dx
Our definition is ad hoc. But the careful mathematical definition of conditional probability P(F | X = x) for an event of 0 probability is made not by a formula such as we have used to define conditional pmfs and pdfs and elementary conditional probability, but by its behavior inside an integral (like the Dirac delta). In particular, P(F | X = x) is defined as any measurable function satisfying the above equation for all events F and G, which our definition does.
Bayes rule for pdfs
Bayes rule:
f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{f_{Y|X}(y|x)\, f_X(x)}{\int f_{Y|X}(y|u)\, f_X(u)\, du}.
Example of conditional pdfs: 2D Gaussian
U = (X, Y), Gaussian pdf with mean (m_X, m_Y)^t and covariance matrix
\Lambda = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix},
Algebra ⇒
\det(\Lambda) = \sigma_X^2 \sigma_Y^2 (1 − \rho^2)
\Lambda^{−1} = \frac{1}{1 − \rho^2} \begin{pmatrix} 1/\sigma_X^2 & −\rho/(\sigma_X\sigma_Y) \\ −\rho/(\sigma_X\sigma_Y) & 1/\sigma_Y^2 \end{pmatrix}
so
f_{XY}(x,y) = \frac{1}{2\pi\sqrt{\det\Lambda}}\, e^{−\frac{1}{2}(x−m_X,\, y−m_Y)\Lambda^{−1}(x−m_X,\, y−m_Y)^t}
= \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1−\rho^2}} \exp\left\{ −\frac{1}{2(1−\rho^2)} \left[ \left(\frac{x−m_X}{\sigma_X}\right)^2 − 2\rho\, \frac{(x−m_X)(y−m_Y)}{\sigma_X\sigma_Y} + \left(\frac{y−m_Y}{\sigma_Y}\right)^2 \right] \right\}
Rearrange:
f_{XY}(x,y) = \frac{\exp\left\{−\frac{1}{2}\left(\frac{x−m_X}{\sigma_X}\right)^2\right\}}{\sqrt{2\pi\sigma_X^2}} \cdot \frac{\exp\left\{−\frac{1}{2}\left(\frac{y − m_Y − (\rho\sigma_Y/\sigma_X)(x−m_X)}{\sqrt{1−\rho^2}\,\sigma_Y}\right)^2\right\}}{\sqrt{2\pi\sigma_Y^2(1−\rho^2)}}
⇒
f_{Y|X}(y|x) = \frac{\exp\left\{−\frac{1}{2}\left(\frac{y − m_Y − (\rho\sigma_Y/\sigma_X)(x−m_X)}{\sqrt{1−\rho^2}\,\sigma_Y}\right)^2\right\}}{\sqrt{2\pi\sigma_Y^2(1−\rho^2)}},
Gaussian with variance \sigma_{Y|X}^2 \equiv \sigma_Y^2(1 − \rho^2) and mean m_{Y|X} \equiv m_Y + \rho(\sigma_Y/\sigma_X)(x − m_X)
Integrate the joint over y (as before) ⇒
f_X(x) = \frac{e^{−(x−m_X)^2/2\sigma_X^2}}{\sqrt{2\pi\sigma_X^2}}.
Similarly, f_Y(y) and f_{X|Y}(x|y) are also Gaussian
Note: X and Y jointly Gaussian ⇒ also both individually and conditionally Gaussian!
Chain rule for pdfs
Assume f_{X_0, X_1, ..., X_i}(x_0, x_1, . . . , x_i) > 0. Then
f_{X_0, X_1, ..., X_{n−1}}(x_0, x_1, . . . , x_{n−1})
= \frac{f_{X_0,...,X_{n−1}}(x_0, . . . , x_{n−1})}{f_{X_0,...,X_{n−2}}(x_0, . . . , x_{n−2})}\, f_{X_0,...,X_{n−2}}(x_0, . . . , x_{n−2})
\vdots
= f_{X_0}(x_0) \prod_{i=1}^{n−1} \frac{f_{X_0,...,X_i}(x_0, . . . , x_i)}{f_{X_0,...,X_{i−1}}(x_0, . . . , x_{i−1})}
= f_{X_0}(x_0) \prod_{i=1}^{n−1} f_{X_i|X_0,...,X_{i−1}}(x_i|x_0, . . . , x_{i−1}).
Statistical detection and classification
Simple application of conditional probability mass functions describing discrete random vectors
Transmitted: discrete rv X, pmf p_X, p_X(1) = p
(e.g., one sample of a binary random process)
Received: rv Y
Conditional pmf (noisy channel) p_{Y|X}(y|x)
More specific example as special case: X Bernoulli, parameter p,
p_{Y|X}(y|x) = \begin{cases} ε & x \ne y \\ 1 − ε & x = y \end{cases}.
the binary symmetric channel (BSC)
Given observation Y, what is the best guess \hat{X}(Y) of the transmitted value?
A decision rule or detection rule
Measure quality by the probability the guess is correct:
P_c(\hat{X}) = \Pr(X = \hat{X}(Y)) = 1 − P_e, where P_e(\hat{X}) = \Pr(\hat{X}(Y) \ne X).
A decision rule is optimal if it yields the smallest possible P_e or maximum possible P_c
\Pr(\hat{X} = X) = 1 − P_e(\hat{X}) = \sum_{(x,y):\, \hat{X}(y)=x} p_{X,Y}(x,y)
= \sum_{(x,y):\, \hat{X}(y)=x} p_{X|Y}(x|y)\, p_Y(y)
= \sum_y p_Y(y) \sum_{x:\, \hat{X}(y)=x} p_{X|Y}(x|y)
= \sum_y p_Y(y)\, p_{X|Y}(\hat{X}(y)|y).
To maximize the sum, maximize p_{X|Y}(\hat{X}(y)|y) for each y.
Accomplished by \hat{X}(y) \equiv \arg\max_u p_{X|Y}(u|y), which yields
p_{X|Y}(\hat{X}(y)|y) = \max_u p_{X|Y}(u|y)
This is the maximum a posteriori (MAP) detection rule
In the binary example: choose \hat{X}(y) = y if ε < 1/2 and \hat{X}(y) = 1 − y if ε > 1/2.
⇒ the minimum (optimal) error probability over all possible rules is \min(ε, 1 − ε)
In the general nonbinary case, statistical detection is statistical classification: the unseen X might be presence or absence of a disease, the observation Y the results of various tests.
General Bayesian classification allows weighting of the cost of different kinds of errors (Bayes risk), so minimize a weighted average (expected cost) instead of only the probability of error
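The MAP rule is a one-liner once p_X and p_{Y|X} are tabulated, since the denominator p_Y(y) does not affect the argmax. A sketch with illustrative BSC numbers (the helper name `map_rule` is mine):

```python
from fractions import Fraction

def map_rule(pX, pYgX, ys):
    # X_hat(y) = argmax_x p_{X|Y}(x|y) = argmax_x p_{Y|X}(y|x) p_X(x)
    return {y: max(pX, key=lambda x: pYgX[(y, x)] * pX[x]) for y in ys}

eps = Fraction(1, 10)
pX = {0: Fraction(1, 2), 1: Fraction(1, 2)}
pYgX = {(y, x): 1 - eps if y == x else eps for x in (0, 1) for y in (0, 1)}
xhat = map_rule(pX, pYgX, (0, 1))
print(xhat)    # {0: 0, 1: 1}: for eps < 1/2 the MAP guess is X_hat(y) = y
```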
Additive noise: Discrete random variables
Common setup in communications, signal processing, statistics:
Original signal X has random noise W (independent of X) added to it; observe Y = X + W
Typically use the observation Y to make an inference about X
Begin by deriving conditional distributions.
Discrete case: have independent rvs X and W with pmfs p_X and p_W. Form Y = X + W. Find p_Y
Use the inverse image formula:
p_{X,Y}(x,y) = \Pr(X = x, Y = y) = \Pr(X = x, X + W = y)
= \sum_{\alpha,\beta:\, \alpha=x,\, \alpha+\beta=y} p_{X,W}(\alpha, \beta) = p_{X,W}(x, y − x)
= p_X(x)\, p_W(y − x).
Note: the formula only makes sense if y − x is in the range space of W
Thus
p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = p_W(y − x),
Intuitive!
Marginal for Y:
p_Y(y) = \sum_x p_{X,Y}(x,y) = \sum_x p_X(x)\, p_W(y − x),
a discrete convolution
The above uses ordinary real arithmetic. Similar results hold for other definitions of addition, e.g., modulo 2 arithmetic for binary
As with linear systems, convolutions can usually be evaluated easily in the transform domain. Will do shortly.
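The discrete convolution p_Y(y) = \sum_x p_X(x) p_W(y − x) is easy to evaluate directly for small alphabets; a sketch with assumed pmfs (the helper name `conv_pmf` is mine):

```python
from fractions import Fraction

def conv_pmf(pX, pW):
    # p_Y(y) = sum over x of pX(x) pW(y - x): a discrete convolution
    pY = {}
    for x, px in pX.items():
        for w, pw in pW.items():
            pY[x + w] = pY.get(x + w, 0) + px * pw
    return pY

pX = {0: Fraction(1, 2), 1: Fraction(1, 2)}    # fair input bit (ordinary addition)
pW = {0: Fraction(9, 10), 1: Fraction(1, 10)}  # Bernoulli(1/10) noise
pY = conv_pmf(pX, pW)
print(pY)    # support {0, 1, 2} with masses 9/20, 1/2, 1/20
```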
Additive noise: continuous random variables
X, W with f_{X,W}(x,w) = f_X(x) f_W(w) (independent), Y = X + W
Find f_{Y|X} and f_Y
Since continuous, find the joint pdf by first finding the joint cdf:
F_{X,Y}(x,y) = \Pr(X \le x, Y \le y) = \Pr(X \le x, X + W \le y)
= \iint_{\alpha \le x,\, \alpha+\beta \le y} f_{X,W}(\alpha, \beta)\, d\alpha\, d\beta
= \int_{−\infty}^{x} d\alpha \int_{−\infty}^{y−\alpha} d\beta\, f_X(\alpha) f_W(\beta)
= \int_{−\infty}^{x} d\alpha\, f_X(\alpha)\, F_W(y − \alpha).
Taking derivatives:
f_{X,Y}(x,y) = f_X(x)\, f_W(y − x)
⇒ f_{Y|X}(y|x) = f_W(y − x)
⇒ f_Y(y) = \int f_{X,Y}(x,y)\, dx = \int f_X(x)\, f_W(y − x)\, dx,
a convolution integral of the pdfs f_X and f_W
The pdf f_{X|Y} follows from Bayes' rule:
f_{X|Y}(x|y) = \frac{f_X(x)\, f_W(y − x)}{\int f_X(\alpha)\, f_W(y − \alpha)\, d\alpha}.
Gaussian example:
Additive Gaussian noise
Assume f_X = N(0, \sigma_X^2), f_W = N(0, \sigma_W^2), f_{X,W}(x,w) = f_X(x) f_W(w), Y = X + W.
f_{Y|X}(y|x) = f_W(y − x) = \frac{e^{−(y−x)^2/2\sigma_W^2}}{\sqrt{2\pi\sigma_W^2}},
which is N(x, \sigma_W^2).
To find f_{X|Y} using Bayes' rule, need f_Y:
f_Y(y) = \int_{−\infty}^{\infty} f_{Y|X}(y|\alpha)\, f_X(\alpha)\, d\alpha
= \int_{−\infty}^{\infty} \frac{\exp\left\{−\frac{1}{2\sigma_W^2}(y − \alpha)^2\right\}}{\sqrt{2\pi\sigma_W^2}} \cdot \frac{\exp\left\{−\frac{1}{2\sigma_X^2}\alpha^2\right\}}{\sqrt{2\pi\sigma_X^2}}\, d\alpha
= \frac{1}{2\pi\sigma_X\sigma_W} \int_{−\infty}^{\infty} \exp\left\{ −\frac{1}{2}\left[ \frac{y^2 − 2\alpha y + \alpha^2}{\sigma_W^2} + \frac{\alpha^2}{\sigma_X^2} \right] \right\} d\alpha
= \frac{\exp\left\{−\frac{y^2}{2\sigma_W^2}\right\}}{2\pi\sigma_X\sigma_W} \int_{−\infty}^{\infty} \exp\left\{ −\frac{1}{2}\left[ \alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) − \frac{2\alpha y}{\sigma_W^2} \right] \right\} d\alpha
Can integrate by completing the square (later see an easier way using transforms, but this trick is not difficult)
Integrand resembles
\exp\left\{−\frac{1}{2}\left(\frac{\alpha − m}{\sigma}\right)^2\right\},
which has integral
\int_{−\infty}^{\infty} \exp\left\{−\frac{1}{2}\left(\frac{\alpha − m}{\sigma}\right)^2\right\} d\alpha = \sqrt{2\pi\sigma^2}
(Gaussian pdf integrates to 1)
Compare
−\frac{1}{2}\left[ \alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) − \frac{2\alpha y}{\sigma_W^2} \right] \quad \text{vs.} \quad −\frac{1}{2}\left(\frac{\alpha − m}{\sigma}\right)^2 = −\frac{1}{2}\left[ \frac{\alpha^2}{\sigma^2} − \frac{2\alpha m}{\sigma^2} + \frac{m^2}{\sigma^2} \right].
The bracketed terms will be the same if we choose
\frac{1}{\sigma^2} = \frac{1}{\sigma_W^2} + \frac{1}{\sigma_X^2} \;\Rightarrow\; \sigma^2 = \frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2},
and
\frac{y}{\sigma_W^2} = \frac{m}{\sigma^2} \;\Rightarrow\; m = \frac{\sigma^2}{\sigma_W^2}\, y.
⇒
\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) − \frac{2\alpha y}{\sigma_W^2} = \left(\frac{\alpha − m}{\sigma}\right)^2 − \frac{m^2}{\sigma^2},
"completing the square."
⇒
\int_{−\infty}^{\infty} \exp\left\{−\frac{1}{2}\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) − \frac{2\alpha y}{\sigma_W^2}\right]\right\} d\alpha = \int_{−\infty}^{\infty} \exp\left\{−\frac{1}{2}\left[\left(\frac{\alpha − m}{\sigma}\right)^2 − \frac{m^2}{\sigma^2}\right]\right\} d\alpha = \sqrt{2\pi\sigma^2}\, \exp\left\{\frac{m^2}{2\sigma^2}\right\}
⇒
f_Y(y) = \frac{\exp\left\{−\frac{y^2}{2\sigma_W^2}\right\}}{2\pi\sigma_X\sigma_W}\, \sqrt{2\pi\sigma^2}\, \exp\left\{\frac{m^2}{2\sigma^2}\right\} = \frac{\exp\left\{−\frac{1}{2}\frac{y^2}{\sigma_X^2 + \sigma_W^2}\right\}}{\sqrt{2\pi(\sigma_X^2 + \sigma_W^2)}}
So f_Y = N(0, \sigma_X^2 + \sigma_W^2)
The sum of two independent 0 mean Gaussian rvs is another 0 mean Gaussian rv; the variance of the sum = the sum of the variances
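The conclusion f_Y = N(0, σ_X² + σ_W²) can be sanity-checked by simulation; a sketch (the sample size, seed, and σ values are arbitrary choices):

```python
import random

random.seed(0)
sX, sW = 1.0, 2.0                 # sigma_X and sigma_W
n = 200_000
# sample Y = X + W with X ~ N(0, sX^2), W ~ N(0, sW^2) independent
ys = [random.gauss(0, sX) + random.gauss(0, sW) for _ in range(n)]
mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / n
print(var)                        # close to sX**2 + sW**2 = 5
```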
For the a posteriori probability f_{X|Y}, use Bayes' rule + algebra:
f_{X|Y}(x|y) = f_{Y|X}(y|x)\, f_X(x) / f_Y(y)
= \frac{\exp\left\{−\frac{1}{2\sigma_W^2}(y − x)^2\right\}}{\sqrt{2\pi\sigma_W^2}} \cdot \frac{\exp\left\{−\frac{1}{2\sigma_X^2}x^2\right\}}{\sqrt{2\pi\sigma_X^2}} \Bigg/ \frac{\exp\left\{−\frac{1}{2}\frac{y^2}{\sigma_X^2 + \sigma_W^2}\right\}}{\sqrt{2\pi(\sigma_X^2 + \sigma_W^2)}}
= \frac{\exp\left\{−\frac{1}{2}\left[\frac{y^2 − 2yx + x^2}{\sigma_W^2} + \frac{x^2}{\sigma_X^2} − \frac{y^2}{\sigma_X^2 + \sigma_W^2}\right]\right\}}{\sqrt{2\pi\sigma_X^2\sigma_W^2/(\sigma_X^2 + \sigma_W^2)}}
= \frac{\exp\left\{−\frac{1}{2\sigma_X^2\sigma_W^2/(\sigma_X^2 + \sigma_W^2)}\left(x − \frac{y\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\right)^2\right\}}{\sqrt{2\pi\sigma_X^2\sigma_W^2/(\sigma_X^2 + \sigma_W^2)}}.
f_{X|Y}(x|y) = N\left( \frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y,\; \frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2} \right).
The mean of a conditional distribution is called a conditional mean; the variance of a conditional distribution is called a conditional variance
Continuous additive noise with discrete input
Most important case of mixed distributions in communications applications
Typical: binary random variable X, Gaussian random variable W, X and W independent, Y = X + W
Previous examples do not work: one rv is discrete, the other continuous
Similar signal processing issue: observe Y, guess X
As before, may be one sample of a random process; in practice have Xn, Wn, Yn. At time n, observe Yn, guess Xn
The conditional cdf F_{Y|X}(y|x) for Y given X = x is an elementary conditional probability. Analogous to the purely discrete and purely continuous cases:
F_{Y|X}(y|x) = \Pr(Y \le y \mid X = x) = \Pr(X + W \le y \mid X = x)
= \Pr(x + W \le y \mid X = x) = \Pr(W \le y − x \mid X = x)
= \Pr(W \le y − x) = F_W(y − x)
Differentiating,
f_{Y|X}(y|x) = \frac{d}{dy} F_{Y|X}(y|x) = \frac{d}{dy} F_W(y − x) = f_W(y − x)
The joint distribution is described by a combination of pmf and pdf:
\Pr(X \in F, Y \in G) = \sum_{x \in F} p_X(x) \int_G f_{Y|X}(y|x)\, dy = \sum_{x \in F} p_X(x) \int_G f_W(y − x)\, dy.
Choosing F = R yields
\Pr(Y \in G) = \sum_x p_X(x) \int_G f_{Y|X}(y|x)\, dy = \sum_x p_X(x) \int_G f_W(y − x)\, dy.
Choosing G = (−∞, y] yields the cdf F_Y(y) ⇒
f_Y(y) = \sum_x p_X(x)\, f_{Y|X}(y|x) = \sum_x p_X(x)\, f_W(y − x),
a convolution, analogous to the pure discrete and pure continuous cases
Continuing the analogy, Bayes' rule suggests the conditional pmf
p_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, p_X(x)}{f_Y(y)} = \frac{f_{Y|X}(y|x)\, p_X(x)}{\sum_\alpha p_X(\alpha)\, f_{Y|X}(y|\alpha)},
but this is not an elementary conditional probability; the conditioning event has probability 0!
Can be justified in a similar way to conditional pdfs:
\Pr(X \in F, Y \in G) = \int_G dy\, f_Y(y)\, \Pr(X \in F \mid Y = y) = \int_G dy\, f_Y(y) \sum_{x \in F} p_{X|Y}(x|y)
so that p_{X|Y}(x|y) satisfies
\Pr(X \in F \mid Y = y) = \sum_{x \in F} p_{X|Y}(x|y)
Apply to binary input and Gaussian noise: the conditional pmf of the binary input given the noisy observation is
p_{X|Y}(x|y) = \frac{f_W(y − x)\, p_X(x)}{f_Y(y)} = \frac{f_W(y − x)\, p_X(x)}{\sum_\alpha p_X(\alpha)\, f_W(y − \alpha)}; \quad y \in R,\; x \in \{0, 1\}.
Can now solve classical binary detection in Gaussian noise.
Binary detection in Gaussian noise
The derivation of the MAP detector or classifier extends immediately to a binary input random variable and independent Gaussian noise
As in the purely discrete case, the MAP detector \hat{X}(y) of X given Y = y is given by
\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x \frac{f_W(y − x)\, p_X(x)}{\sum_\alpha p_X(\alpha)\, f_W(y − \alpha)}.
The denominator of the conditional pmf does not depend on x, so it has no effect on the maximization:
\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x f_W(y − x)\, p_X(x).
Assume for simplicity that X is equally likely to be 0 or 1:
\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x \frac{1}{\sqrt{2\pi\sigma_W^2}} \exp\left\{−\frac{1}{2}\frac{(x − y)^2}{\sigma_W^2}\right\} = \arg\min_x |x − y|
Minimum distance or nearest neighbor decision: choose the x closest to y,
\hat{X}(y) = \begin{cases} 0 & y < 0.5 \\ 1 & y > 0.5 \end{cases}.
A threshold detector
Error probability of the optimal detector:
P_e = \Pr(\hat{X}(Y) \ne X)
= \Pr(\hat{X}(Y) \ne 0 \mid X = 0)\, p_X(0) + \Pr(\hat{X}(Y) \ne 1 \mid X = 1)\, p_X(1)
= \Pr(Y > 0.5 \mid X = 0)\, p_X(0) + \Pr(Y < 0.5 \mid X = 1)\, p_X(1)
= \Pr(W + X > 0.5 \mid X = 0)\, p_X(0) + \Pr(W + X < 0.5 \mid X = 1)\, p_X(1)
= \Pr(W > 0.5 \mid X = 0)\, p_X(0) + \Pr(W + 1 < 0.5 \mid X = 1)\, p_X(1)
= \Pr(W > 0.5)\, p_X(0) + \Pr(W < −0.5)\, p_X(1),
using the independence of W and X. In terms of the Φ function:
P_e = \frac{1}{2}\left[ 1 − \Phi\left(\frac{0.5}{\sigma_W}\right) + \Phi\left(\frac{−0.5}{\sigma_W}\right) \right] = \Phi\left(−\frac{1}{2\sigma_W}\right).
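The closed form P_e = Φ(−1/(2σ_W)) can be checked against a simulated threshold detector; a sketch (σ_W, the seed, and the sample count are arbitrary choices):

```python
import math
import random

def phi(z):                        # standard normal cdf via erf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sW = 0.75
pe_formula = phi(-1 / (2 * sW))    # Phi(-1/(2 sigma_W))

random.seed(1)
n = 100_000
errors = 0
for _ in range(n):
    x = random.randint(0, 1)                 # equiprobable bit
    y = x + random.gauss(0, sW)              # additive Gaussian noise
    xhat = 1 if y > 0.5 else 0               # threshold (minimum-distance) detector
    errors += (xhat != x)
print(pe_formula, errors / n)                # the two agree to Monte Carlo accuracy
```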
Statistical estimation
In detection/classification problems, the goal is to guess which of a discrete set of possibilities is true. The MAP rule is an intuitive solution.
Different if (X, Y) is continuous: observe Y, guess X.
Examples: X, W independent Gaussian, Y = X + W. What is the best guess of X given Y?
Xn is a continuous alphabet random process (perhaps Gaussian). Observe Xn−1. What is the best guess for Xn? What if we observe X0, X1, X2, . . . , Xn−1?
The quality criterion for the discrete case no longer works: \Pr(\hat{X}(Y) = X) = 0 in general.
Will later introduce another quality measure (MSE) and optimize.
For now, mention other approaches.
Examples of estimation or regression instead of detection
MAP Estimation
Mimic MAP detection: maximize the conditional probability function
\hat{X}_{MAP}(y) = \arg\max_x f_{X|Y}(x|y)
Easy to describe; an application of conditional pdfs + Bayes.
But cannot argue "optimal" in the sense of maximizing quality
Example: Gaussian signal plus noise
Found f_{X|Y}(x|y) = Gaussian with mean y\sigma_X^2/(\sigma_X^2 + \sigma_W^2)
A Gaussian pdf is maximized at its mean ⇒ the MAP estimate of X given Y = y is the conditional mean y\sigma_X^2/(\sigma_X^2 + \sigma_W^2).
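A quick numerical check of this MAP estimate: maximize the unnormalized posterior f_W(y − x) f_X(x) over a grid and compare with yσ_X²/(σ_X² + σ_W²). The grid and parameter values below are illustrative:

```python
import math

def gauss_pdf(x, s):               # N(0, s^2) density
    return math.exp(-x * x / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)

sX, sW, y = 2.0, 1.0, 1.5
# unnormalized posterior: f_{Y|X}(y|x) f_X(x) = f_W(y - x) f_X(x)
grid = [i / 1000 for i in range(-4000, 4001)]
xmap = max(grid, key=lambda x: gauss_pdf(y - x, sW) * gauss_pdf(x, sX))
cond_mean = y * sX**2 / (sX**2 + sW**2)
print(xmap, cond_mean)             # both 1.2
```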
Maximum Likelihood Estimation
The maximum likelihood (ML) estimate of X given Y = y is the value of x that maximizes the conditional pdf f_{Y|X}(y|x) (instead of the a posteriori pdf f_{X|Y}(x|y)):
\hat{X}_{ML}(y) = \arg\max_x f_{Y|X}(y|x).
Advantage: do not need to know the prior f_X and use Bayes to find f_{X|Y}(x|y). Simple
In the Gaussian case, \hat{X}_{ML}(y) = y.
Will return to estimation when we consider expectations in more detail.
Characteristic functions
When summing independent random variables, find the derived distribution by convolution of pmfs or pdfs
Can be complicated; avoidable using transforms, as in linear systems
Summing independent random variables arises frequently in signal analysis problems. E.g., an iid random process X_k is put into a linear filter to produce an output Y_n = \sum_{k=1}^{n} h_{n−k} X_k.
What is the distribution of Y_n?
An n-fold convolution is a mess. Describe a shortcut.
Transforms of probability functions are called characteristic functions. A variation on Fourier/Laplace transforms. Notation varies.
For a discrete rv with pmf p_X, define the characteristic function M_X by
M_X(ju) = \sum_x p_X(x)\, e^{jux}
where u is usually assumed to be real.
A discrete exponential transform. Sometimes φ or Φ is used, and the j is sometimes not included. (∼ notational differences in Fourier transforms)
Alternative useful form: recall the definition of the expectation of a random variable g defined on a discrete probability space described by a pmf p: E(g) = \sum_\omega p(\omega)\, g(\omega)
Consider the probability space (Ω_X, B(Ω_X), P_X) with P_X described by the pmf p_X
This is the directly-given representation for the rv X; X is the identity function on Ω_X: X(x) = x
Define the random variable g(X) on this space by g(X)(x) = e^{jux}. Then E[g(X)] = \sum_x p_X(x)\, e^{jux}, so that
M_X(ju) = E[e^{juX}]
Characteristic functions, like probabilities, can be viewed as special cases of expectation
Resembles the discrete time Fourier transform
F_\nu(p_X) = \sum_x p_X(x)\, e^{−j2\pi\nu x}
and the z-transform
Z_z(p_X) = \sum_x p_X(x)\, z^x.
M_X(ju) = F_{−u/2\pi}(p_X) = Z_{e^{ju}}(p_X)
Properties of characteristic functions follow from those of Fourier/Laplace/z/exponential transforms.
Can recover the pmf from M_X by suitable inversion. E.g., given p_X(k); k \in Z_N,
\frac{1}{2\pi} \int_{−\pi}^{\pi} M_X(ju)\, e^{−juk}\, du = \frac{1}{2\pi} \int_{−\pi}^{\pi} \left[ \sum_x p_X(x)\, e^{jux} \right] e^{−juk}\, du = \sum_x p_X(x)\, \frac{1}{2\pi} \int_{−\pi}^{\pi} e^{ju(x−k)}\, du = \sum_x p_X(x)\, \delta_{k−x} = p_X(k).
But usually invert by inspection or from tables; avoid inverse transforms
Characteristic functions and summing independent rvs
Two independent random variables X, W with pmfs p_X and p_W and characteristic functions M_X and M_W
Y = X + W
To find the characteristic function of Y,
M_Y(ju) = \sum_y p_Y(y)\, e^{juy},
use the inverse image formula
p_Y(y) = \sum_{x,w:\, x+w=y} p_{X,W}(x,w)
to obtain
M_Y(ju) = \sum_y \left[ \sum_{x,w:\, x+w=y} p_{X,W}(x,w) \right] e^{juy} = \sum_y \sum_{x,w:\, x+w=y} p_{X,W}(x,w)\, e^{juy}
= \sum_y \sum_{x,w:\, x+w=y} p_{X,W}(x,w)\, e^{ju(x+w)}
= \sum_{x,w} p_{X,W}(x,w)\, e^{ju(x+w)}
The last sum factors:
M_Y(ju) = \sum_{x,w} p_X(x)\, p_W(w)\, e^{jux} e^{juw} = \left[ \sum_x p_X(x)\, e^{jux} \right] \left[ \sum_w p_W(w)\, e^{juw} \right] = M_X(ju)\, M_W(ju),
⇒ the transform of the pmf of the sum of independent random variables is the product of their transforms
Iterate:
Theorem 1. If X_i; i = 1, . . . , N are independent random variables with characteristic functions M_{X_i}, then the characteristic function of the random variable Y = \sum_{i=1}^{N} X_i is
M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju).
If the X_i are independent and identically distributed with common characteristic function M_X, then
M_Y(ju) = M_X(ju)^N.
Example: X Bernoulli with parameter p = p_X(1) = 1 − p_X(0)
M_X(ju) = \sum_{k=0}^{1} e^{juk}\, p_X(k) = (1 − p) + p\, e^{ju}
For X_i; i = 1, . . . , n iid Bernoulli random variables and Y_n = \sum_{k=1}^{n} X_k,
M_{Y_n}(ju) = [(1 − p) + p\, e^{ju}]^n
With the binomial theorem ⇒
M_{Y_n}(ju) = \sum_{k=0}^{n} p_{Y_n}(k)\, e^{juk} = ((1 − p) + p\, e^{ju})^n = \sum_{k=0}^{n} \underbrace{\binom{n}{k} (1 − p)^{n−k} p^k}_{p_{Y_n}(k)}\, e^{juk},
Uniqueness of transforms ⇒
p_{Y_n}(k) = \binom{n}{k} (1 − p)^{n−k} p^k; \quad k \in Z_{n+1}.
The same idea works for continuous rvs
For a continuous random variable X with pdf f_X, define the characteristic function M_X of the random variable (or of the pdf) as
M_X(ju) = \int f_X(x)\, e^{jux}\, dx.
As in the discrete case,
M_X(ju) = E[e^{juX}].
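Since M_Y = M_X^n corresponds to n-fold convolution of the pmf, the binomial result above can be verified by convolving [1 − p, p] with itself n times; a sketch (the helper name `iid_sum_pmf` is mine):

```python
from fractions import Fraction
from math import comb

def iid_sum_pmf(pX, n):
    # n-fold convolution of pX with itself: the pmf whose transform is M_X(ju)^n
    pY = [Fraction(1)]
    for _ in range(n):
        out = [Fraction(0)] * (len(pY) + len(pX) - 1)
        for i, a in enumerate(pY):
            for j, b in enumerate(pX):
                out[i + j] += a * b
        pY = out
    return pY

p, n = Fraction(1, 3), 5
pY = iid_sum_pmf([1 - p, p], n)
print(pY[2])    # C(5,2) (2/3)^3 (1/3)^2 = 80/243
print(pY == [comb(n, k) * (1 - p)**(n - k) * p**k for k in range(n + 1)])
```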
Relates to the continuous-time Fourier transform
F_\nu(f_X) = \int f_X(x)\, e^{−j2\pi\nu x}\, dx
and the Laplace transform
L_s(f_X) = \int f_X(x)\, e^{−sx}\, dx
by
M_X(ju) = F_{−u/2\pi}(f_X) = L_{−ju}(f_X)
Hence can apply results from Fourier/Laplace transform theory. E.g., given a well-behaved density f_X(x); x \in R with transform M_X(ju), can invert the transform:
f_X(x) = \frac{1}{2\pi} \int_{−\infty}^{\infty} M_X(ju)\, e^{−jux}\, du.
Consider again two independent random variables X and W with pdfs f_X and f_W and characteristic functions M_X and M_W, and let Y = X + W.
Paralleling the discrete case,
M_Y(ju) = M_X(ju)\, M_W(ju).
Will later see a simple and general proof.
As in the discrete case, iterating gives the result for many independent rvs:
If X_i; i = 1, . . . , N are independent random variables with characteristic functions M_{X_i}, then the characteristic function of the random variable Y = \sum_{i=1}^{N} X_i is
M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju).
If the X_i are independent and identically distributed with common characteristic function M_X, then
M_Y(ju) = M_X(ju)^N.
Summing Independent Gaussian rvs
X ∼ N(m, σ^2)
Characteristic function found by completing the square:
M_X(ju) = E(e^{juX}) = \int_{−\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{−(x−m)^2/2\sigma^2}\, e^{jux}\, dx
= \int_{−\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{−(x^2 − 2mx − 2\sigma^2 jux + m^2)/2\sigma^2}\, dx
= \left[ \int_{−\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{−(x − (m + ju\sigma^2))^2/2\sigma^2}\, dx \right] e^{jum − u^2\sigma^2/2}
= e^{jum − u^2\sigma^2/2}.
Thus N(m, σ^2) ↔ e^{jum − u^2\sigma^2/2}
X_i; i = 1, . . . , n iid Gaussian random variables with pdfs N(m, σ^2)
Y_n = \sum_{k=1}^{n} X_k
Then
M_{Y_n}(ju) = [e^{jum − u^2\sigma^2/2}]^n = e^{ju(nm) − u^2(n\sigma^2)/2},
the characteristic function of N(nm, nσ^2)
Moral: use characteristic functions to derive distributions of sums of independent rvs.
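The correspondence M_{Y_n}(ju) = e^{ju(nm) − u²(nσ²)/2} can be checked by estimating E[e^{juY}] from samples; a sketch (the parameters, seed, and sample count are arbitrary choices):

```python
import cmath
import random

random.seed(2)
m, s, n = 0.5, 1.0, 4             # X_i ~ N(m, s^2), Y = X_1 + ... + X_n
u = 0.7
samples = 200_000
# empirical characteristic function: average of e^{juY} over sampled Y
emp = sum(cmath.exp(1j * u * sum(random.gauss(m, s) for _ in range(n)))
          for _ in range(samples)) / samples
# formula for N(nm, n s^2)
theory = cmath.exp(1j * u * n * m - u * u * n * s * s / 2)
print(abs(emp - theory))          # small: the estimate matches the formula
```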
Gaussian random vectors
A random vector is Gaussian if its density is Gaussian
Component rvs are jointly Gaussian
Description is complicated, but many nice properties
Multidimensional characteristic functions help derivation
Random vector X = (X0, . . . , Xn−1)
vector argument u = (u0, . . . , un−1)
n-dimensional characteristic function:
M_X(ju) = M_{X_0,...,X_{n−1}}(ju_0, . . . , ju_{n−1}) = E\left[e^{ju^t X}\right] = E\left[\exp\left(j \sum_{k=0}^{n−1} u_k X_k\right)\right]
Can be shown using multivariable calculus: a Gaussian rv with mean vector m and covariance matrix Λ has characteristic function
M_X(ju) = e^{ju^t m − u^t \Lambda u/2} = \exp\left\{ j \sum_{k=0}^{n−1} u_k m_k − \frac{1}{2} \sum_{k=0}^{n−1} \sum_{m=0}^{n−1} u_k \Lambda(k, m)\, u_m \right\}
Same basic form as the Gaussian pdf, but depends directly on Λ, not Λ^{−1}
So exists more generally: only need Λ to be nonnegative definite (instead of strictly positive definite). Define a Gaussian rv more generally as a rv having a characteristic function of this form (the inverse transform may have singularities)
Further examples of random processes:
Have seen two ways to define rps: indirectly in terms of an underlying probability space, or directly (Kolmogorov representation) by describing a consistent family of joint distributions (via pmfs, pdfs, or cdfs).
Used to define discrete time iid processes and processes which can be constructed from iid processes by coding or filtering.
Introduce more classes of processes and develop some properties for various examples.
In particular: Gaussian random processes and Markov processes
Gaussian random processes
A random process \{X_t; t \in T\} is Gaussian if the random vectors (X_{t_0}, X_{t_1}, . . . , X_{t_{k−1}}) are Gaussian for all positive integers k and all possible sample times t_i \in T; i = 0, 1, . . . , k − 1.
Works for continuous and discrete time.
Consistent family?
Yes, if all mean vectors and covariance matrices are drawn from a common mean function m(t); t \in T and covariance function \Lambda(t, s); t, s \in T; i.e., for any choice of sample times t_0, . . . , t_{k−1} \in T the random vector (X_{t_0}, X_{t_1}, . . . , X_{t_{k−1}}) is Gaussian with mean (m(t_0), m(t_1), . . . , m(t_{k−1})) and covariance matrix Λ = \{\Lambda(t_l, t_j); l, j \in Z_k\}.
Discrete time Markov processes
An iid process is memoryless: the present is independent of the past.
A Markov process allows dependence on the past in a structured way.
Introduce via example.
A binary Markov process
\{X_n; n = 0, 1, . . .\} is a Bernoulli process with
p_{X_n}(x) = \begin{cases} p & x = 1 \\ 1 − p & x = 0 \end{cases},
with p \in (0, 1) a fixed parameter
Since the pmf p_{X_n}(x) does not depend on n, abbreviate it to p_X:
p_X(x) = p^x (1 − p)^{1−x}; \quad x = 0, 1.
Since the process is iid,
p_{X^n}(x^n) = \prod_{i=0}^{n−1} p_X(x_i) = p^{w(x^n)} (1 − p)^{n − w(x^n)},
where w(x^n) = Hamming weight of the binary vector x^n.
Let \{X_n\} be the input to a device which produces an output binary process \{Y_n\} defined by
Y_n = \begin{cases} Y_0 & n = 0 \\ X_n ⊕ Y_{n−1} & n = 1, 2, . . . \end{cases},
where Y_0 is a binary equiprobable random variable (p_{Y_0}(0) = p_{Y_0}(1) = 0.5), independent of all of the X_n, and ⊕ is mod 2 addition
(a linear filter using mod 2 arithmetic)
Alternatively:
Y_n = \begin{cases} 1 & \text{if } X_n \ne Y_{n−1} \\ 0 & \text{if } X_n = Y_{n−1} \end{cases}.
This process is called a binary autoregressive process. As will be seen, it is also called the symmetric binary Markov process
Unlike \{X_n\}, Y_n depends strongly on past values. If p < 1/2, Y_n is more likely to equal Y_{n−1} than not
If p is small, \{Y_n\} is likely to have long runs of 0s and 1s.
Task: find the joint pmfs for the new process: p_{Y^n}(y^n) = \Pr(Y^n = y^n)
Use the inverse image formula:
p_{Y^n}(y^n) = \Pr(Y^n = y^n)
= \Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, . . . , Y_{n−1} = y_{n−1})
= \Pr(Y_0 = y_0, X_1 ⊕ Y_0 = y_1, X_2 ⊕ Y_1 = y_2, . . . , X_{n−1} ⊕ Y_{n−2} = y_{n−1})
= \Pr(Y_0 = y_0, X_1 ⊕ y_0 = y_1, X_2 ⊕ y_1 = y_2, . . . , X_{n−1} ⊕ y_{n−2} = y_{n−1})
= \Pr(Y_0 = y_0, X_1 = y_1 ⊕ y_0, X_2 = y_2 ⊕ y_1, . . . , X_{n−1} = y_{n−1} ⊕ y_{n−2})
= p_{Y_0, X_1, X_2, X_3, ..., X_{n−1}}(y_0, y_1 ⊕ y_0, y_2 ⊕ y_1, . . . , y_{n−1} ⊕ y_{n−2})
= p_{Y_0}(y_0) \prod_{i=1}^{n−1} p_X(y_i ⊕ y_{i−1}).
Used the facts that (1) a ⊕ b = c iff a = b ⊕ c, (2) Y_0, X_1, X_2, . . . , X_{n−1} are mutually independent, and (3) the X_n are iid.
Plug in the specific forms of p_{Y_0} and p_X ⇒
p_{Y^n}(y^n) = \frac{1}{2} \prod_{i=1}^{n−1} p^{y_i ⊕ y_{i−1}} (1 − p)^{1 − y_i ⊕ y_{i−1}}.
Marginal pmfs for Y_n evaluated by summing out the joints (total probability), e.g.,
p_{Y_1}(y_1) = \sum_{y_0} p_{Y_0, Y_1}(y_0, y_1) = \frac{1}{2} \sum_{y_0} p^{y_1 ⊕ y_0} (1 − p)^{1 − y_1 ⊕ y_0} = \frac{1}{2}; \quad y_1 = 0, 1.
In a similar fashion it can be shown that the marginals for Y_n are all the same:
p_{Y_n}(y) = \frac{1}{2}; \quad y = 0, 1; \; n = 0, 1, 2, . . .
Hence drop the subscript and abbreviate the pmf to p_Y
Note: would not be the same with a different initialization, e.g., Y_0 = 1
Unlike the iid \{X_n\} process,
p_{Y^n}(y^n) \ne \prod_{i=0}^{n−1} p_Y(y_i)
(provided p \ne 1/2)
\{Y_n\} is not iid
The joint is not the product of the marginals, but can use the chain rule with conditional probabilities to write it as a product of conditional pmfs, given by
p_{Y_l|Y_0, Y_1, ..., Y_{l−1}}(y_l|y_0, y_1, . . . , y_{l−1}) = \frac{p_{Y^{l+1}}(y^{l+1})}{p_{Y^l}(y^l)} = p_X(y_l ⊕ y_{l−1})
Note: the conditional probability of the current output Y_l given the entire past Y_i; i = 0, 1, . . . , l − 1 depends only on the most recent past output Y_{l−1}! This property can be summarized nicely by also deriving the conditional pmf
p_{Y_l|Y_{l−1}}(y_l|y_{l−1}) = \frac{p_{Y_{l−1}, Y_l}(y_{l−1}, y_l)}{p_{Y_{l−1}}(y_{l−1})} = p^{y_l ⊕ y_{l−1}} (1 − p)^{1 − y_l ⊕ y_{l−1}}
⇒ p_{Y_l|Y_0, Y_1, ..., Y_{l−1}}(y_l|y_0, y_1, . . . , y_{l−1}) = p_{Y_l|Y_{l−1}}(y_l|y_{l−1}).
A discrete time random process with this property is called a Markov process or Markov chain
The binary autoregressive process is a Markov process!
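The joint-pmf formula for this process, the uniform marginals, and the Markov property can all be verified by brute force over all short binary sequences; a sketch (p is an arbitrary illustrative value):

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 5)
def pX(b):                         # p_X(b) = p^b (1-p)^(1-b)
    return p if b == 1 else 1 - p

def pY(seq):                       # p_{Y^n}(y^n) = (1/2) prod p_X(y_i xor y_{i-1})
    pr = Fraction(1, 2)
    for i in range(1, len(seq)):
        pr *= pX(seq[i] ^ seq[i - 1])
    return pr

n = 4
# marginal of the last sample is 1/2
marg = sum(pY(s) for s in product((0, 1), repeat=n) if s[-1] == 1)
# conditional given the whole past reduces to p_X(y_3 xor y_2)
cond = pY((0, 0, 1, 1)) / pY((0, 0, 1))
print(marg, cond == pX(1 ^ 1))
```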
The binomial counting process
Next filter the binary Bernoulli process using ordinary arithmetic.
\{X_n\}: iid binary random process with marginal pmf p_X(1) = p = 1 − p_X(0).
Y_n = \begin{cases} Y_0 = 0 & n = 0 \\ \sum_{k=1}^{n} X_k = Y_{n−1} + X_n & n = 1, 2, . . . \end{cases}
Y_n = output of a discrete time time-invariant linear filter with Kronecker delta response h_k given by h_k = 1 for k ≥ 0 and h_k = 0 otherwise.
By definition,
Y_n = Y_{n−1} or Y_n = Y_{n−1} + 1; \quad n = 2, 3, . . .
A discrete time process with this property is called a counting process. Will later see a continuous time counting process which also can only increase by 1
To completely describe this process, need a formula for the joint pmfs
p_{Y_1, ..., Y_n}(y_1, . . . , y_n) = p_{Y_1}(y_1) \prod_{l=2}^{n} p_{Y_l|Y_1, ..., Y_{l−1}}(y_l|y_1, . . . , y_{l−1})
Already found the marginal pmf p_{Y_n}(k) using transforms to be binomial ⇒ the binomial counting process
Find the conditional pmfs, which imply the joints via the chain rule.
p_{Y_n|Y_{n−1}, ..., Y_1}(y_n|y_{n−1}, . . . , y_1)
= \Pr(Y_n = y_n \mid Y_l = y_l; l = 1, . . . , n − 1)
= \Pr(X_n = y_n − y_{n−1} \mid Y_l = y_l; l = 1, . . . , n − 1)
= \Pr(X_n = y_n − y_{n−1} \mid X_1 = y_1, X_i = y_i − y_{i−1}; i = 2, 3, . . . , n − 1)
Follows since the conditioning event \{Y_i = y_i; i = 1, 2, . . . , n − 1\} is the event \{X_1 = y_1, X_i = y_i − y_{i−1}; i = 2, 3, . . . , n − 1\} and, given this event, the event \{Y_n = y_n\} is the event \{X_n = y_n − y_{n−1}\}.
Thus
p_{Y_n|Y_{n−1}, ..., Y_1}(y_n|y_{n−1}, . . . , y_1) = p_{X_n|X_{n−1}, ..., X_2, X_1}(y_n − y_{n−1}|y_{n−1} − y_{n−2}, . . . , y_2 − y_1, y_1)
Xn iid ⇒

$$p_{Y_n|Y_{n-1},\ldots,Y_1}(y_n|y_{n-1},\ldots,y_1) = p_X(y_n - y_{n-1})$$

Hence chain rule + definition y0 = 0 ⇒

$$p_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^{n} p_X(y_i - y_{i-1})$$
For binomial counting process, use Bernoulli pX:

$$p_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^{n} p^{y_i - y_{i-1}}(1-p)^{1-(y_i - y_{i-1})},$$

where yi − yi−1 = 0 or 1, i = 1, 2, . . . , n; y0 = 0.
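The product form of the joint pmf can be checked against the binomial marginal found earlier by transforms. A small sketch (function name and the choices p = 0.3, n = 5 are illustrative):

```python
from itertools import product
from math import comb

p, n = 0.3, 5

def joint_pmf(ys):
    """Joint pmf of (Y_1, ..., Y_n): product of p^{dy}(1-p)^{1-dy} over
    the unit increments dy_i = y_i - y_{i-1}, with y_0 = 0."""
    prob, prev = 1.0, 0
    for y in ys:
        dy = y - prev
        if dy not in (0, 1):
            return 0.0
        prob *= p if dy == 1 else 1 - p
        prev = y
    return prob

# Summing the joint over all valid paths ending at Y_n = k must recover
# the binomial marginal pmf pYn(k).
for k in range(n + 1):
    total = 0.0
    for dxs in product((0, 1), repeat=n):
        ys, y = [], 0
        for d in dxs:
            y += d
            ys.append(y)
        if ys[-1] == k:
            total += joint_pmf(tuple(ys))
    assert abs(total - comb(n, k) * p**k * (1 - p)**(n - k)) < 1e-12
```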
Similar derivation ⇒

$$p_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = \Pr(Y_n = y_n \mid Y_{n-1} = y_{n-1}) = \Pr(X_n = y_n - y_{n-1} \mid Y_{n-1} = y_{n-1}).$$

The conditioning event depends only on values of Xk for k < n, which are independent of Xn, hence pYn|Yn−1(yn|yn−1) = pX(yn − yn−1) ⇒ Yn is Markov
Similar derivation works for sum of iid rvs with any pmf pX to show that

$$p_{Y_n|Y_{n-1},\ldots,Y_1}(y_n|y_{n-1},\ldots,y_1) = p_{Y_n|Y_{n-1}}(y_n|y_{n-1})$$

or, equivalently,

$$\Pr(Y_n = y_n \mid Y_i = y_i;\ i = 1,\ldots,n-1) = \Pr(Y_n = y_n \mid Y_{n-1} = y_{n-1}),$$

⇒ Markov
Discrete random walk
Slight variation: Let Xn be binary iid with alphabet {1, −1} and Pr(Xn = −1) = p
$$Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots \end{cases}$$
Also has autoregressive format
Yn = Yn−1 + Xn, n = 1, 2, . . .
Transform of the iid random variables is

$$M_X(ju) = (1-p)e^{ju} + p\,e^{-ju},$$
binomial theorem ⇒

$$\begin{aligned} M_{Y_n}(ju) &= \left((1-p)e^{ju} + p\,e^{-ju}\right)^n \\ &= \sum_{k=0}^{n} \binom{n}{k}(1-p)^{n-k}p^{k}\, e^{ju(n-2k)} \\ &= \sum_{k=-n,-n+2,\ldots,n-2,n} \underbrace{\binom{n}{(n-k)/2}(1-p)^{(n+k)/2}p^{(n-k)/2}}_{p_{Y_n}(k)}\, e^{juk}. \end{aligned}$$
⇒

$$p_{Y_n}(k) = \binom{n}{(n-k)/2}(1-p)^{(n+k)/2}p^{(n-k)/2}, \quad k = -n, -n+2, \ldots, n-2, n.$$
Note that Yn must be even or odd depending on whether n is even orodd. This follows from the nature of the increments.
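The walk pmf can be cross-checked by brute-force enumeration of all step sequences; a sketch with illustrative parameters (p = 0.3, n = 6):

```python
from itertools import product
from math import comb

p, n = 0.3, 6  # Pr(X = -1) = p, steps of +1 or -1

def p_Yn(k):
    """pmf of the walk at time n, per the slide formula: k must have the
    same parity as n and |k| <= n, else the probability is 0."""
    if (n - k) % 2 or abs(k) > n:
        return 0.0
    return comb(n, (n - k) // 2) * (1 - p)**((n + k) // 2) * p**((n - k) // 2)

# Compare against direct enumeration of all 2^n step sequences.
for k in range(-n, n + 1):
    brute = 0.0
    for steps in product((1, -1), repeat=n):
        if sum(steps) == k:
            q = 1.0
            for s in steps:
                q *= p if s == -1 else 1 - p
            brute += q
    assert abs(p_Yn(k) - brute) < 1e-12
```

The parity check in `p_Yn` mirrors the remark above: Yn is even or odd according to whether n is even or odd.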
The discrete time Wiener process
Xn iid N(0, σ²).
As with the counting process, define
$$Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots \end{cases}$$
discrete time Wiener process
Handle in essentially the same way, but use cdfs and then pdfs
Previously found marginal fYn using transforms to be N(0, nσ²)
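A Monte Carlo sanity check of the N(0, nσ²) marginal; the sample sizes, seed, and tolerances below are ad hoc choices, not from the notes:

```python
import random

sigma, n, trials = 1.0, 10, 20000
rng = random.Random(2010)

# Each sample is Y_n = X_1 + ... + X_n with X_k iid N(0, sigma^2);
# the sample mean and variance should be near 0 and n*sigma^2.
samples = [sum(rng.gauss(0.0, sigma) for _ in range(n)) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
assert abs(mean) < 0.15              # theoretical mean is 0
assert abs(var - n * sigma**2) < 1.0  # theoretical variance is n*sigma^2
```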
To find the joint pdfs use conditional pdfs and chain rule

$$f_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{l=1}^{n} f_{Y_l|Y_1,\ldots,Y_{l-1}}(y_l|y_1,\ldots,y_{l-1}).$$
To find conditional pdf fYn|Y1,...,Yn−1(yn|y1, . . . , yn−1), first find conditional cdf P(Yn ≤ yn|Yn−i = yn−i; i = 1, 2, . . . , n − 1). Analogous to the discrete case:

$$\begin{aligned} P(Y_n \le y_n \mid Y_{n-i} = y_{n-i};\ i = 1,2,\ldots,n-1) &= P(X_n \le y_n - y_{n-1} \mid Y_{n-i} = y_{n-i};\ i = 1,2,\ldots,n-1) \\ &= P(X_n \le y_n - y_{n-1}) = F_X(y_n - y_{n-1}), \end{aligned}$$
Differentiating the conditional cdf to obtain the conditional pdf ⇒

$$f_{Y_n|Y_1,\ldots,Y_{n-1}}(y_n|y_1,\ldots,y_{n-1}) = \frac{\partial}{\partial y_n} F_X(y_n - y_{n-1}) = f_X(y_n - y_{n-1}),$$

pdf chain rule ⇒

$$f_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^{n} f_X(y_i - y_{i-1}).$$
If fX = N(0, σ²)

$$f_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \frac{e^{-y_1^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} \prod_{i=2}^{n} \frac{e^{-(y_i - y_{i-1})^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\Big(\sum_{i=2}^{n}(y_i - y_{i-1})^2 + y_1^2\Big)\right).$$
This is a joint Gaussian pdf with mean vector 0 and covariance matrix KY(m, n) = σ² min(m, n), m, n = 1, 2, . . .
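The σ² min(m, n) covariance follows because Y = LX for the lower triangular matrix L of ones, so Cov(Y) = σ² LLᵀ. A small exact check (the dimension and the value of σ² are arbitrary choices):

```python
sigma2 = 2.0  # sigma^2
N = 6

# L maps X = (X_1, ..., X_N) to Y = (Y_1, ..., Y_N) via Y_m = sum_{k<=m} X_k.
L = [[1 if k <= m else 0 for k in range(N)] for m in range(N)]

# Cov(Y) = sigma^2 * L L^T, since Cov(X) = sigma^2 * I for iid X_k.
K = [[sigma2 * sum(L[m][i] * L[n][i] for i in range(N)) for n in range(N)]
     for m in range(N)]

# Entry (m, n) counts the overlapping increments: sigma^2 * min(m, n)
# with 1-based time indices.
for m in range(N):
    for n in range(N):
        assert K[m][n] == sigma2 * min(m + 1, n + 1)
```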
A similar argument implies that
fYn|Yn−1(yn|yn−1) = fX(yn − yn−1)
and hence
fYn|Y1,...,Yn−1(yn|y1, . . . , yn−1) = fYn|Yn−1(yn|yn−1).
As in discrete alphabet case, a process with this property is called a Markov process
Combine the discrete alphabet and continuous alphabet definitions into a common definition: a discrete time random process Yn is said to be a Markov process if the conditional cdfs satisfy the relation
Pr(Yn ≤ yn|Yn−i = yn−i; i = 1, 2, . . .) = Pr(Yn ≤ yn|Yn−1 = yn−1)
for all yn−1, yn−2, . . .
More specifically, such a Yn is frequently called a first-order Markov process because its conditional distribution depends on only the most recent past value. An extended definition to nth-order Markov processes can be made in the obvious fashion.