Transcript
Page 1:

much more on minimax (order bounds)

http://www-stat.stanford.edu/~imj/wald/wald1web.pdf

cf. lecture by Iain Johnstone
web.stanford.edu/class/ee378a/lecture-notes/last_lecture.pdf

Monday, June 3, 2013

Page 2: today’s lecture

• parametric estimation, Fisher information, Cramér-Rao lower bound: Ch. 4, Sec. 9.3

• information and estimation: Ch. 7

• universal denoising: Ch. 8

• (chapters and sections from new version of notes)


Page 3:


Page 4: mean squared error estimation


Page 5: bias-variance


Page 6: Fisher Information

exercise:


Page 7: exercise


Page 8: note


Page 9:


Page 10:

• r.h.s. depends on estimator

• far from tight: consider estimator identically 0
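
The derivation this slide refers to is handwritten and not captured in the text layer. As a minimal numerical companion (my own sketch, assuming the bound in question is the Cramér-Rao bound with the bias correction term), the snippet below contrasts the sample mean in a Gaussian location model, whose variance meets 1/J_n, with the estimator that is identically 0, for which the biased-bound right-hand side (1 + b′(θ))²/J_n collapses to 0 and says nothing about the actual error. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.0, 25, 200_000

# Fisher information of n i.i.d. N(theta, sigma^2) samples about the location theta
J_n = n / sigma**2

x = rng.normal(theta, sigma, size=(trials, n))

# Unbiased estimator: the sample mean. Cramer-Rao: Var >= 1/J_n, met with equality here.
mean_est = x.mean(axis=1)
print("sample mean:   Var =", mean_est.var(), "   1/J_n =", 1 / J_n)

# Estimator identically 0: bias b(theta) = -theta, so the biased-CRB right-hand side
# (1 + b'(theta))^2 / J_n equals 0 -- trivially satisfied and far from the actual MSE.
zero_est = np.zeros(trials)
print("zero estimator: MSE =", np.mean((zero_est - theta) ** 2),
      "   biased-CRB r.h.s. =", (1 + (-1)) ** 2 / J_n)
```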

note:


Page 11: multi-parameter case


Page 12: Fisher information for a “location family”
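
The body of this slide is handwritten; as a small numerical aside (not from the notes), recall that for a location family p_θ(x) = f(x − θ) the Fisher information J = ∫ f′(x)²/f(x) dx does not depend on θ. The sketch below evaluates this integral on a grid for a Gaussian f (known value 1/σ²) and for the standard logistic density (known value 1/3); the function name and grid choices are mine.

```python
import numpy as np

def fisher_location(f, x):
    """J = integral of f'(x)^2 / f(x) dx for a location family p_theta(x) = f(x - theta), on a grid."""
    df = np.gradient(f(x), x)
    return np.sum(df**2 / f(x)) * (x[1] - x[0])

x = np.linspace(-20, 20, 400_001)
sigma = 1.5
gauss = lambda t: np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
logistic = lambda t: np.exp(-t) / (1 + np.exp(-t)) ** 2

print(fisher_location(gauss, x), 1 / sigma**2)   # Gaussian location family: J = 1/sigma^2
print(fisher_location(logistic, x), 1 / 3)       # standard logistic: J = 1/3
```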


Page 13: Fisher Information and MMSE
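
The slide content itself is not in the text layer. One standard way the two quantities are connected (my formulation, not necessarily the one used in the lecture) is the identity mmse = σ² − σ⁴ · J(p_Y) for Y = X + N(0, σ²), where J(p_Y) is the Fisher information of the output density with respect to location. A grid-based check for a ±1 input, where E[X | Y = y] = tanh(y/σ²):

```python
import numpy as np

sigma2 = 0.7                                     # noise variance
y, dy = np.linspace(-20, 20, 400_001, retstep=True)

def phi(t, m, v):                                # N(m, v) density evaluated at t
    return np.exp(-0.5 * (t - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

# Output density of Y = X + N(0, sigma2) for X = +/-1 equiprobable, and its derivative.
p  = 0.5 * phi(y, 1.0, sigma2) + 0.5 * phi(y, -1.0, sigma2)
dp = 0.5 * phi(y, 1.0, sigma2) * (-(y - 1.0) / sigma2) + 0.5 * phi(y, -1.0, sigma2) * (-(y + 1.0) / sigma2)

J = np.sum(dp**2 / p) * dy                                  # Fisher information of p_Y (location)
mmse = 1.0 - np.sum(p * np.tanh(y / sigma2) ** 2) * dy      # E[(X - E[X|Y])^2], with E[X|Y] = tanh(Y/sigma2)

print(mmse, sigma2 - sigma2**2 * J)                         # the two numbers should agree
```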


Page 14:


Page 15: recall

5 Notation and Conventions

Our conventions and notation for information measures, such as mutual information and relative entropy, are standard. The initiated reader is advised to skip this section. If U, V, W are three random variables taking values in Polish spaces 𝒰, 𝒱, 𝒲, respectively, and defined on a common probability space with a probability measure P, we let P_U, P_{U,V}, etc. denote the probability measures induced on 𝒰, the pair (𝒰, 𝒱), etc., while, e.g., P_{U|V} denotes a regular version of the conditional distribution of U given V. P_{U|v} is the distribution on 𝒰 obtained by evaluating that regular version at v. If Q is another probability measure on the same measurable space we similarly denote Q_U, Q_{U|V}, etc. As usual, given two measures on the same measurable space, e.g., P and Q, define their relative entropy (divergence) by

D(P‖Q) = ∫ ( log dP/dQ ) dP    (12)

when P is absolutely continuous w.r.t. Q, defining D(P‖Q) = ∞ otherwise. An immediate consequence of the definitions of relative entropy and of the Radon-Nikodym derivative is that if f : 𝒰 → 𝒱 is measurable and one-to-one, and V = f(U), then

D(P_U‖Q_U) = D(P_V‖Q_V).    (13)

Following [5], we further use the notation

D(P_{U|V}‖Q_{U|V} | P_V) = ∫ D(P_{U|v}‖Q_{U|v}) dP_V(v),    (14)

where on the right side D(P_{U|v}‖Q_{U|v}) is a divergence in the sense of (12) between the measures P_{U|v} and Q_{U|v}. It will be convenient to write

D(P_{U|V}‖Q_{U|V})    (15)

to denote f(V) when f(v) = D(P_{U|v}‖Q_{U|v}). Thus D(P_{U|V}‖Q_{U|V}) is a random variable, while D(P_{U|V}‖Q_{U|V} | P_V) is its expectation under P. With this notation, the chain rule for relative entropy (cf., e.g., [6, Subsection D.3]) is

D(P_{U,V}‖Q_{U,V}) = D(P_U‖Q_U) + D(P_{V|U}‖Q_{V|U} | P_U)    (16)

and is valid regardless of the finiteness of both sides of the equation.

The mutual information between U and V is defined as

I(U;V) = D(P_{U,V}‖P_U × P_V),    (17)

where P_U × P_V denotes the product measure induced by P_U and P_V. We note in passing, in line with the comment on relative entropy and one-to-one transformations leading to (13), that if f and g are two measurable one-to-one transformations and A = f(U) while B = g(V), then

I(U;V) = I(A;B).    (18)

Finally, the conditional mutual information between U and V, given W, is defined as

I(U;V|W) = D(P_{U,V|W}‖P_{U|W} × P_{V|W} | P_W).    (19)

The roles of U, V, W will be played in what follows by scalar random variables, vectors, or processes.
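
Since this section is pure definitions, a small self-contained numerical sketch (mine, not part of the notes) may help fix ideas: it checks the chain rule (16) and the mutual information definition (17) on a toy pair of joint pmfs. The pmfs and variable names are arbitrary.

```python
import numpy as np

# Joint pmfs P and Q on a 2x3 alphabet for (U, V); rows index U, columns index V.
P = np.array([[0.10, 0.20, 0.10],
              [0.25, 0.15, 0.20]])
Q = np.array([[0.15, 0.15, 0.20],
              [0.20, 0.20, 0.10]])

def D(p, q):
    """Relative entropy D(p||q) in nats for discrete pmfs with q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Marginals of U and conditionals of V given U under both measures
PU, QU = P.sum(axis=1), Q.sum(axis=1)
PV_given_U = P / PU[:, None]
QV_given_U = Q / QU[:, None]

# Chain rule (16): D(P_{U,V}||Q_{U,V}) = D(P_U||Q_U) + D(P_{V|U}||Q_{V|U} | P_U)
lhs = D(P, Q)
rhs = D(PU, QU) + sum(PU[u] * D(PV_given_U[u], QV_given_U[u]) for u in range(len(PU)))
print(lhs, rhs)                      # the two numbers agree

# Mutual information (17): I(U;V) = D(P_{U,V} || P_U x P_V)
PV = P.sum(axis=0)
print(D(P, np.outer(PU, PV)))
```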

6 Relative Entropy and Mismatched Estimation

6.1 For slides

• Scalar Channel:

  X ≥ 0
  Y_γ | X ∼ Poisson(γ · X)
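
To make the scalar Poisson channel concrete, here is a small simulation of my own (the exponential prior is an illustrative choice, not from the notes): for X ∼ Exp(1) and Y | X ∼ Poisson(γX), the posterior of X given Y is Gamma(Y + 1, rate 1 + γ), so E[X | Y] = (Y + 1)/(1 + γ). The sketch checks this conditional mean empirically and compares its MSE with that of the prior mean.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma_snr, n = 3.0, 1_000_000

x = rng.exponential(1.0, size=n)            # X ~ Exp(1), a non-negative input
y = rng.poisson(gamma_snr * x)              # Y | X ~ Poisson(gamma * X)

# With an Exp(1) prior, the posterior is Gamma(Y + 1, rate 1 + gamma), so E[X|Y] = (Y + 1)/(1 + gamma).
xhat = (y + 1) / (1 + gamma_snr)

# Empirical conditional mean at one output value vs. the closed form, and the MSE gain over E[X] = 1.
k = 2
print(x[y == k].mean(), (k + 1) / (1 + gamma_snr))
print("MSE of E[X|Y]:", np.mean((x - xhat) ** 2), "   MSE of prior mean:", np.mean((x - 1.0) ** 2))
```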


Page 16: mutual information and MMSE

1 for Duncan slide

AWGN channel: dY_t = X_t dt + dW_t, 0 ≤ t ≤ T
W is standard white Gaussian noise, independent of X

[Duncan 1970]:

I(X^T; Y^T) = (1/2) E ∫_0^T (X_t − E[X_t | Y^t])^2 dt

dY_t = √γ X_t dt + dW_t, 0 ≤ t ≤ T

I(γ) = I(X^T; Y^T)
cmmse(γ) = E ∫_0^T (X_t − E[X_t | Y^t])^2 dt

[Duncan 1970]:

I(γ) = (γ/2) · cmmse(γ)

2 for GSV slide

Y = √γ · X + W
W is a standard Gaussian, independent of X

I(γ) = I(X; Y)
mmse(γ) = E[(X − E[X | Y])^2]

[Guo, Shamai and Verdu 2005]:

(d/dγ) I(γ) = (1/2) mmse(γ)

3 Introduction

In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information between the input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR), is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simple relationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as the continuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings where this relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable relationship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributed continuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean value of the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual information to both types of MMSE thus served as a bridge between the two quantities.

More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimator that would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due to the mismatch is equal to the relative entropy between the true channel output distribution and the channel output distribution under Q, at SNR = γ.
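
As a quick sanity check of the last relation (a sketch of mine, not part of the slides), take the ±1 equiprobable input, for which both I(γ) and mmse(γ) reduce to one-dimensional integrals and the conditional mean is E[X | Y] = tanh(√γ Y). A finite-difference derivative of I should then match mmse/2; grid sizes and the function name are illustrative.

```python
import numpy as np

def I_and_mmse(gamma, ngrid=200_001, ylim=20.0):
    """Mutual information (nats) and MMSE for X = +/-1 equiprobable through Y = sqrt(gamma)*X + N(0,1)."""
    y, dy = np.linspace(-ylim, ylim, ngrid, retstep=True)
    s = np.sqrt(gamma)
    phi = lambda m: np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)
    p = 0.5 * phi(s) + 0.5 * phi(-s)                      # output density of Y
    hY = -np.sum(p * np.log(p)) * dy                      # differential entropy h(Y)
    I = hY - 0.5 * np.log(2 * np.pi * np.e)               # I(X;Y) = h(Y) - h(Y|X)
    mmse = 1.0 - np.sum(p * np.tanh(s * y) ** 2) * dy     # E[(X - E[X|Y])^2] = 1 - E[tanh^2(sqrt(gamma)*Y)]
    return I, mmse

snr, d = 1.0, 1e-4
dI = (I_and_mmse(snr + d)[0] - I_and_mmse(snr - d)[0]) / (2 * d)
print(dI, 0.5 * I_and_mmse(snr)[1])                       # the two numbers should agree
```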


Page 17:


(follows from J-MMSE and de Bruijn’s identity)
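
The pointer above is to two standard identities that are not spelled out in the text layer. As an illustration of the de Bruijn half (my own sketch, assuming the usual statement d/dt h(X + √t N) = (1/2) J(X + √t N) with N standard Gaussian and J the location-Fisher information), the code below checks it numerically for a ±1 input.

```python
import numpy as np

def h_and_J(t, ngrid=400_001, ylim=25.0):
    """Differential entropy and location-Fisher information of Y = X + sqrt(t)*N, X = +/-1, N ~ N(0,1)."""
    y, dy = np.linspace(-ylim, ylim, ngrid, retstep=True)
    phi = lambda m: np.exp(-0.5 * (y - m) ** 2 / t) / np.sqrt(2 * np.pi * t)
    p  = 0.5 * phi(1.0) + 0.5 * phi(-1.0)
    dp = 0.5 * phi(1.0) * (-(y - 1.0) / t) + 0.5 * phi(-1.0) * (-(y + 1.0) / t)
    h = -np.sum(p * np.log(p)) * dy
    J = np.sum(dp**2 / p) * dy
    return h, J

t, d = 0.8, 1e-5
dh_dt = (h_and_J(t + d)[0] - h_and_J(t - d)[0]) / (2 * d)
print(dh_dt, 0.5 * h_and_J(t)[1])          # de Bruijn: d/dt h(Y_t) = (1/2) J(Y_t)
```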


Page 18: continuous time

1 for Duncan slide

AWGN channel, dY_t = √γ X_t dt + dW_t, 0 ≤ t ≤ T
W is standard white Gaussian noise, independent of X

I(γ) = I(X^T; Y^T)
cmmse(γ) = E ∫_0^T (X_t − E[X_t | Y^t])^2 dt      (causal / filtering)
mmse(γ) = E ∫_0^T (X_t − E[X_t | Y^T])^2 dt      (non-causal / smoothing)

[Duncan 1970]:

I(γ) = (γ/2) · cmmse(γ)

Verdu’s mismatch result [31] was key in [33], where it was shown that the relationship between the causal and non-causal MMSEs continues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distribution that differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown to be the sum of the mutual information and the relative entropy between the true and mismatched output distributions, this relative entropy thus quantifying the penalty due to mismatch.

Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, the input, is a non-negative random variable while the conditional distribution of the output Y given the input is given by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, the channel input is X^T = {X_t, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on X^T, the output Y^T = {Y_t, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ · X^T. Often referred to as the “ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication: the channel input represents the squared magnitude of the electric field incident on the photo-detector, while its output is the counting process describing the arrival times of the photons registered by the detector. Here the energy of the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel.
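
As a Monte Carlo illustration of Duncan’s relation (my own sketch, not from the notes), take the especially simple input X_t ≡ X with X = ±1 equiprobable; for this input the causal estimate is E[X_t | Y^t] = tanh(√γ Y_t), and I(X^T; Y^T) equals the scalar ±1-input mutual information at SNR γT, which can be computed by numerical integration. The simulation sizes are illustrative, and the comparison should hold to within Monte Carlo and discretization error (roughly a percent here).

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, T = 2.0, 1.0
paths, nsteps = 10_000, 400
dt = T / nsteps

# Constant input X_t = X with X = +/-1 equiprobable; dY_t = sqrt(gamma)*X dt + dW_t on a time grid.
X = rng.choice([-1.0, 1.0], size=(paths, 1))
dW = rng.normal(0.0, np.sqrt(dt), size=(paths, nsteps))
Y = np.cumsum(np.sqrt(gamma) * X * dt + dW, axis=1)        # Y_t at t = dt, 2dt, ..., T

# Causal estimate E[X_t | Y^t] = tanh(sqrt(gamma) * Y_t) for this input.
err2 = (X - np.tanh(np.sqrt(gamma) * Y)) ** 2
cmmse = err2.sum(axis=1).mean() * dt                       # E int_0^T (X_t - E[X_t|Y^t])^2 dt

# I(X^T; Y^T) for a constant +/-1 input equals the scalar +/-1 mutual information at SNR gamma*T.
y, dy = np.linspace(-20, 20, 200_001, retstep=True)
s = np.sqrt(gamma * T)
p = 0.5 * np.exp(-0.5 * (y - s) ** 2) / np.sqrt(2 * np.pi) \
  + 0.5 * np.exp(-0.5 * (y + s) ** 2) / np.sqrt(2 * np.pi)
I = -np.sum(p * np.log(p)) * dy - 0.5 * np.log(2 * np.pi * np.e)

print(0.5 * gamma * cmmse, I)                              # Duncan: the two numbers should be close
```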


Page 19:


I(γ) = I(X^T; Y^T)
mmse(γ) = E ∫_0^T (X_t − E[X_t | Y^T])^2 dt

[Guo, Shamai and Verdu 2005]:

(d/dγ) I(γ) = (1/2) mmse(γ)

or, in its integral version,

I(snr) = (1/2) ∫_0^snr mmse(γ) dγ
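
The same integral relation holds for the scalar channel, which makes it easy to check numerically. The sketch below (mine, not from the slides) does so for the ±1 input, integrating a grid-based mmse(γ) over γ and comparing with I(snr); the helper mirrors the one used earlier.

```python
import numpy as np

def I_and_mmse(gamma, ngrid=200_001, ylim=20.0):
    """Scalar +/-1 input through Y = sqrt(gamma)*X + N(0,1): mutual information (nats) and MMSE."""
    y, dy = np.linspace(-ylim, ylim, ngrid, retstep=True)
    s = np.sqrt(gamma)
    phi = lambda m: np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)
    p = 0.5 * phi(s) + 0.5 * phi(-s)
    I = -np.sum(p * np.log(p)) * dy - 0.5 * np.log(2 * np.pi * np.e)
    mmse = 1.0 - np.sum(p * np.tanh(s * y) ** 2) * dy
    return I, mmse

snr = 3.0
gammas = np.linspace(1e-6, snr, 301)
mmse_curve = np.array([I_and_mmse(g)[1] for g in gammas])
# Trapezoidal quadrature of (1/2) * int_0^snr mmse(gamma) dgamma
integral = 0.5 * np.sum(0.5 * (mmse_curve[1:] + mmse_curve[:-1]) * np.diff(gammas))
print(I_and_mmse(snr)[0], integral)     # I(snr) vs the integral; they should agree
```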


[Zakai 2005]:

H(X) = Σ_{x ∈ 𝒳} P(X = x) log ( 1 / P(X = x) )

For mismatch:

cmse_{P,Q}(γ) = E_P ∫_0^T (X_t − E_Q[X_t | Y^t])^2 dt

mse_{P,Q}(γ) = E_P ∫_0^T (X_t − E_Q[X_t | Y^T])^2 dt

cmse_{P,Q}(snr) = (1/snr) ∫_0^snr mse_{P,Q}(γ) dγ = (2/snr) [ I(snr) + D(P_Y ‖ Q_Y) ]

Relationship between cmse_{P,Q} and mse_{P,Q}?
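
The quantities above are defined for the continuous-time channel. In the scalar channel Y = √γ X + W the analogous objects are mmse_P(γ) = E_P[(X − E_P[X|Y])²] and mse_{P,Q}(γ) = E_P[(X − E_Q[X|Y])²]; the sketch below (my own choice of P and Q, not from the slides) takes P as the ±1 input and Q = N(0, 1), for which E_Q[X|Y] = √γ Y/(1 + γ), and shows the mismatch penalty mse_{P,Q} − mmse_P is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000_000

for gamma in (0.5, 1.0, 2.0, 4.0):
    X = rng.choice([-1.0, 1.0], size=n)            # true input law P: +/-1 equiprobable
    Y = np.sqrt(gamma) * X + rng.normal(size=n)
    xhat_P = np.tanh(np.sqrt(gamma) * Y)           # E_P[X|Y], matched to P
    xhat_Q = np.sqrt(gamma) * Y / (1 + gamma)      # E_Q[X|Y] for the mismatched prior Q = N(0,1)
    mmse_P = np.mean((X - xhat_P) ** 2)
    mse_PQ = np.mean((X - xhat_Q) ** 2)
    print(f"gamma={gamma}: mmse_P={mmse_P:.4f}  mse_P,Q={mse_PQ:.4f}  excess={mse_PQ - mmse_P:.4f}")
```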


Page 20: Duncan


Page 21: SNR in Duncan

1 for Duncan slide

AWGN channeldYt = Xtdt+ dWt, 0 ≤ t ≤ T

W is white Gaussian noise, independent of X[Duncan 1970]:

I(XT ;Y T ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

dYt =√γXtdt+ dWt, 0 ≤ t ≤ T

I(γ) = I(XT ;Y T )

cmmse(γ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

I(XT ;Y T ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

[Duncan 1970]:

I(γ) =1

2cmmse(γ)

2 Introduction

In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information betweenthe input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR),is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simplerelationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as thecontinuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings wherethis relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable rela-tionship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributedcontinuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean valueof the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual informationto both types of MMSE thus served as a bridge between the two quantities.

More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimatorthat would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due tothe mismatch is equal to the relative entropy between the true channel output distribution and the channel outputdistribution under Q, at SNR = γ.

This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEscontinues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distributionthat differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown tobe the sum of the mutual information and the relative entropy between the true and mismatched output distributions,this relative entropy thus quantifying the penalty due to mismatch.

Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, theinput, is a non-negative random variable while the conditional distribution of the output Y given the input isgiven by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, thechannel input is XT = {Xt, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on XT , the outputY T = {Yt, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ ·XT . Often referred to as the“ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication:The channel input represents the squared magnitude of the electric field incident on the photo-detector, while itsoutput is the counting process describing the arrival times of the photons registered by the detector. Here the energyof the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it

2

1 for Duncan slide

AWGN channeldYt = Xtdt+ dWt, 0 ≤ t ≤ T

W is white Gaussian noise, independent of X[Duncan 1970]:

I(XT ;Y T ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

dYt =√γXtdt+ dWt, 0 ≤ t ≤ T

I(γ) = I(XT ;Y T )

cmmse(γ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

I(XT ;Y T ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

[Duncan 1970]:

I(γ) =1

2cmmse(γ)

2 Introduction

In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information betweenthe input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR),is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simplerelationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as thecontinuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings wherethis relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable rela-tionship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributedcontinuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean valueof the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual informationto both types of MMSE thus served as a bridge between the two quantities.

More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimatorthat would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due tothe mismatch is equal to the relative entropy between the true channel output distribution and the channel outputdistribution under Q, at SNR = γ.

This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEscontinues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distributionthat differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown tobe the sum of the mutual information and the relative entropy between the true and mismatched output distributions,this relative entropy thus quantifying the penalty due to mismatch.

Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, theinput, is a non-negative random variable while the conditional distribution of the output Y given the input isgiven by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, thechannel input is XT = {Xt, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on XT , the outputY T = {Yt, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ ·XT . Often referred to as the“ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication:The channel input represents the squared magnitude of the electric field incident on the photo-detector, while itsoutput is the counting process describing the arrival times of the photons registered by the detector. Here the energyof the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it

2

1 for Duncan slide

AWGN channeldYt = Xtdt+ dWt, 0 ≤ t ≤ T

W is white Gaussian noise, independent of X[Duncan 1970]:

I(XT ;Y T ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

dYt =√γXtdt+ dWt, 0 ≤ t ≤ T

I(γ) = I(XT ;Y T )

cmmse(γ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

I(XT ;Y T ) =1

2E

�� T

0(Xt − E[Xt|Y t])2dt

[Duncan 1970]:

I(γ) =1

2cmmse(γ)


2 for GSV slide

$$Y = \sqrt{\gamma}\cdot X + W,$$
where $W$ is a standard Gaussian, independent of $X$, and
$$I(\gamma) = I(X;Y), \qquad \mathrm{mmse}(\gamma) = E\left[\big(X - E[X \mid Y]\big)^2\right].$$


[Guo, Shamai and Verdu 2005]:
$$\frac{d}{d\gamma}\, I(\gamma) = \frac{1}{2}\, \mathrm{mmse}(\gamma).$$
In the continuous-time channel (cf. [Zakai 2005]) the corresponding non-causal (smoothing) error is
$$\mathrm{mmse}(\gamma) = E\left[\int_0^T \big(X_t - E[X_t \mid Y^T]\big)^2\, dt\right],$$
or in its integral version
$$I(\mathrm{snr}) = \frac{1}{2}\int_0^{\mathrm{snr}} \mathrm{mmse}(\gamma)\, d\gamma.$$
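The scalar relation above holds for any input law; the sketch below (illustrative only — the 3-point alphabet, its weights and the SNR value are arbitrary choices, not from the notes) computes I(γ) and mmse(γ) by brute-force quadrature over y and compares a finite-difference derivative of I(γ) with mmse(γ)/2.

# Quadrature check of dI/dgamma = (1/2) * mmse(gamma) for a 3-point input distribution.
import numpy as np

xs = np.array([0.0, 1.0, 3.0])          # hypothetical input alphabet
ps = np.array([0.5, 0.3, 0.2])          # hypothetical input probabilities

y = np.linspace(-8.0, 13.0, 200001)     # wide enough for gamma around 2
dy = y[1] - y[0]

def gauss(z):
    return np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)

def mi_and_mmse(gamma):
    lik = gauss(y[None, :] - np.sqrt(gamma) * xs[:, None])     # p(y | x_i)
    py = ps @ lik                                              # output density p(y)
    mi = np.sum(ps[:, None] * lik * np.log(lik / py)) * dy     # I(X; Y) in nats
    condmean = (ps * xs) @ lik / py                            # E[X | Y = y]
    mmse = ps @ xs**2 - np.sum(condmean**2 * py) * dy          # E[X^2] - E[(E[X|Y])^2]
    return mi, mmse

g, h = 2.0, 1e-4
dI = (mi_and_mmse(g + h)[0] - mi_and_mmse(g - h)[0]) / (2 * h)
print("dI/dgamma (finite difference):", dI)
print("mmse(gamma) / 2              :", mi_and_mmse(g)[1] / 2)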

Entropy:
$$H(X) = \sum_{x\in\mathcal{X}} P(X = x)\, \log \frac{1}{P(X = x)}.$$


Relationship between cmmse and mmse? Combining [Duncan 1970] with [Guo, Shamai and Verdu 2005]:
$$\mathrm{cmmse}(\mathrm{snr}) = \frac{1}{\mathrm{snr}} \int_0^{\mathrm{snr}} \mathrm{mmse}(\gamma)\, d\gamma.$$
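For the same constant-signal toy input as in the earlier sketch (X_t ≡ X ~ N(0,1), T = 1 — an illustrative assumption), both sides are available in closed form, mmse(γ) = T/(1 + γT) and cmmse(snr) = log(1 + snr·T)/snr, so the relation can be verified directly:

# Check cmmse(snr) = (1/snr) * int_0^snr mmse(gamma) dgamma for the toy input
# X_t = X ~ N(0,1), T = 1 (illustrative assumption), using closed forms.
import numpy as np

T, snr = 1.0, 3.0
gammas = np.linspace(0.0, snr, 100001)
dg = gammas[1] - gammas[0]
mmse = T / (1.0 + gammas * T)                          # noncausal (smoothing) MMSE
cmmse = np.log(1.0 + snr * T) / snr                    # causal MMSE, int_0^T dt/(1 + snr*t)
avg_mmse = (mmse.sum() - 0.5 * (mmse[0] + mmse[-1])) * dg / snr
print("cmmse(snr)                     :", cmmse)
print("(1/snr) * int_0^snr mmse dgamma:", avg_mmse)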


Mismatch


What if X ∼ P but the estimator thinks X ∼ Q?
$$\mathrm{mse}_{P,Q}(\gamma) = E_P\left[\big(X - E_Q[X \mid Y]\big)^2\right].$$


A new representation of relative entropy [Verdu 2010]:


What is Cost of Mismatch? [Verdu 2010] (relative entropy in nats):
$$D(P \,\|\, Q) = \frac{1}{2} \int_0^{\infty} \big[\mathrm{mse}_{P,Q}(\gamma) - \mathrm{mse}_{P,P}(\gamma)\big]\, d\gamma,$$
and at a finite SNR level,
$$D\big(P_{Y_{\mathrm{snr}}} \,\|\, Q_{Y_{\mathrm{snr}}}\big) = \frac{1}{2} \int_0^{\mathrm{snr}} \big[\mathrm{mse}_{P,Q}(\gamma) - \mathrm{mse}_{P,P}(\gamma)\big]\, d\gamma, \qquad \frac{d}{d\gamma}\, D\big(P_{Y_\gamma} \,\|\, Q_{Y_\gamma}\big) = \frac{1}{2}\big[\mathrm{mse}_{P,Q}(\gamma) - \mathrm{mse}_{P,P}(\gamma)\big].$$
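As a check of the 1/2 factor above (an illustration under assumed Gaussian laws, not taken from the notes): with P = N(0, p) and Q = N(0, q), the mismatched estimator is E_Q[X|Y] = √γ·q·Y/(1 + γq), both MSEs have closed forms, and the output relative entropy is a Gaussian divergence, D(N(0, snr·p + 1) ‖ N(0, snr·q + 1)). The values of p, q and snr are arbitrary.

# Check D(P_Y || Q_Y) at snr = (1/2) * int_0^snr [mse_PQ(gamma) - mse_PP(gamma)] dgamma
# for P = N(0, p), Q = N(0, q); p, q, snr are illustrative choices.
import numpy as np

p, q, snr = 1.0, 2.0, 5.0
gammas = np.linspace(0.0, snr, 200001)
dg = gammas[1] - gammas[0]

mse_PP = p / (1.0 + gammas * p)                               # matched MMSE
mse_PQ = (p + gammas * q**2) / (1.0 + gammas * q) ** 2        # mismatched MSE under P
excess = mse_PQ - mse_PP
integral = (excess.sum() - 0.5 * (excess[0] + excess[-1])) * dg

ratio = (snr * p + 1.0) / (snr * q + 1.0)                     # ratio of output variances
D = 0.5 * (ratio - 1.0 - np.log(ratio))                       # Gaussian relative entropy (nats)

print("(1/2) * integral of excess MSE:", 0.5 * integral)
print("D(P_Y || Q_Y) at snr          :", D)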


Causal vs. Non-causal Mismatched Estimation


For mismatch:
$$\mathrm{cmse}_{P,Q}(\gamma) = E_P\left[\int_0^T \big(X_t - E_Q[X_t \mid Y^t]\big)^2\, dt\right], \qquad \mathrm{mse}_{P,Q}(\gamma) = E_P\left[\int_0^T \big(X_t - E_Q[X_t \mid Y^T]\big)^2\, dt\right].$$

Relationship between $\mathrm{cmse}_{P,Q}$ and $\mathrm{mse}_{P,Q}$?
$$\mathrm{cmse}_{P,Q}(\mathrm{snr}) = \frac{1}{\mathrm{snr}} \int_0^{\mathrm{snr}} \mathrm{mse}_{P,Q}(\gamma)\, d\gamma = \frac{2}{\mathrm{snr}}\left[I(\mathrm{snr}) + D\big(P_{Y^T} \,\|\, Q_{Y^T}\big)\right].$$
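A closed-form illustration of the last display (all model choices are hypothetical): take X_t ≡ X with X ~ N(0, p) under P and X ~ N(0, q) under Q, and T = 1. The mismatched filter is E_Q[X|Y^t] = √γ·q·Y_t/(1 + γqt), which makes cmse_{P,Q}(snr) available in closed form; Y_T is a sufficient statistic of the output path in this toy model, so D(P_{Y^T} ‖ Q_{Y^T}) reduces to a Gaussian divergence, and I(snr) = ½ log(1 + snr·p·T).

# Closed-form check of cmse_PQ(snr) = (2/snr) * [ I(snr) + D(P_{Y^T} || Q_{Y^T}) ]
# for the toy model X_t = X with X ~ N(0,p) under P and N(0,q) under Q (T = 1);
# the parameter values are illustrative.
import numpy as np

p, q, snr, T = 1.0, 2.0, 4.0, 1.0

# causal mismatched MSE, integrated in closed form
cmse_PQ = (p - q) * T / (1.0 + snr * q * T) + np.log(1.0 + snr * q * T) / snr

# mutual information and output relative entropy (Y_T is sufficient here)
I_snr = 0.5 * np.log(1.0 + snr * p * T)
ratio = (snr * p * T + 1.0) / (snr * q * T + 1.0)
D = 0.5 * (ratio - 1.0 - np.log(ratio))

print("cmse_PQ(snr)          :", cmse_PQ)
print("(2/snr) * [I(snr) + D]:", (2.0 / snr) * (I_snr + D))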


minimax estimation

Theorem 6.4 Let P and Q be two probability measures that are members of $\mathcal{P}$. For γ ≥ 0,
$$D\big(P_{Y^T_\gamma} \,\|\, Q_{Y^T_\gamma}\big) = \gamma \cdot \big[\mathrm{cmle}_{P,Q}(\gamma) - \mathrm{cmle}_{P,P}(\gamma)\big]. \tag{27}$$

Theorem 6.5 (under mild conditions)
$$D\big(P_{Y^T} \,\|\, Q_{Y^T}\big) \propto \mathrm{cmle}_{P,Q} - \mathrm{cmle}_{P,P}, \tag{28}$$
$$\mathrm{cmse}_{P,Q} - \mathrm{cmse}_{P,P} = D\big(P_{Y^T} \,\|\, Q_{Y^T}\big). \tag{29}$$

Proof ingredients:
• Girsanov-type theory for expressing $\log \frac{dQ_{Y^T}}{d(\text{law of homogeneous Poisson})}$ as a filtering integral;
• manipulating
$$D\big(P_{Y^T} \,\|\, Q_{Y^T}\big) = E_P\left[\log \frac{dP_{Y^T}}{d(\text{law of homogeneous Poisson})} - \log \frac{dQ_{Y^T}}{d(\text{law of homogeneous Poisson})}\right]$$
via ‘orthogonality’ etc.

Put together, Theorems 6.3 and 6.5 yield, for γ > 0,
$$\mathrm{cmle}_{P,Q}(\gamma) - \mathrm{cmle}_{P,P}(\gamma) = \frac{1}{\gamma} \int_0^{\gamma} \big[\mathrm{mle}_{P,Q}(\alpha) - \mathrm{mle}_{P,P}(\alpha)\big]\, d\alpha = \frac{1}{\gamma}\, D\big(P_{Y^T_\gamma} \,\|\, Q_{Y^T_\gamma}\big), \tag{30}$$
which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) are well-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 9.
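A numerical illustration of (30) in its simplest instance (everything below is an illustrative assumption, including the Poisson loss ℓ(x, x̂) = x log(x/x̂) − x + x̂, which this excerpt does not spell out): with X_t ≡ X constant over [0, 1], the output path is equivalent to a single count N ~ Poisson(γX), and (30) reduces to D(P_{Y_γ} ‖ Q_{Y_γ}) = ∫_0^γ [mle_{P,Q}(α) − mle_{P,P}(α)] dα, which the script checks for two hypothetical two-point input laws.

# Check D(P_N || Q_N) at gamma = int_0^gamma [ mle_PQ(a) - mle_PP(a) ] da for
# N ~ Poisson(gamma * X), with loss l(x, xh) = x*log(x/xh) - x + xh and
# mle_PQ(a) = E_P[ l(X, E_Q[X | N]) ].  P, Q below are illustrative two-point laws.
import numpy as np
from scipy.stats import poisson

xs = np.array([1.0, 3.0])
pP = np.array([0.6, 0.4])          # law of X under P
pQ = np.array([0.3, 0.7])          # law of X under Q
ns = np.arange(0, 200)             # truncated count alphabet; the tail is negligible here

def loss(x, xh):
    return x * np.log(x / xh) - x + xh

def posteriors(alpha):
    pmf = poisson.pmf(ns[None, :], alpha * xs[:, None])        # P(N = n | X = x_i)
    margP, margQ = pP @ pmf, pQ @ pmf
    # guard the far tail, where every pmf value underflows to exactly zero
    xhatP = np.where(margP > 0, (pP * xs) @ pmf / np.where(margP > 0, margP, 1.0), 1.0)
    xhatQ = np.where(margQ > 0, (pQ * xs) @ pmf / np.where(margQ > 0, margQ, 1.0), 1.0)
    return pmf, margP, margQ, xhatP, xhatQ

def mle_excess(alpha):             # mle_PQ(alpha) - mle_PP(alpha)
    pmf, _, _, xhatP, xhatQ = posteriors(alpha)
    per_x = np.sum(pmf * (loss(xs[:, None], xhatQ) - loss(xs[:, None], xhatP)), axis=1)
    return pP @ per_x

gamma = 6.0
alphas = np.linspace(1e-3, gamma, 3000)
vals = np.array([mle_excess(a) for a in alphas])
da = alphas[1] - alphas[0]
integral = (vals.sum() - 0.5 * (vals[0] + vals[-1])) * da

_, margP, margQ, _, _ = posteriors(gamma)
ok = margP > 0
D = np.sum(margP[ok] * np.log(margP[ok] / margQ[ok]))

print("int_0^gamma of the excess estimation loss:", integral)
print("D(P_N || Q_N) at gamma                   :", D)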

6.4 for slides: minimaxity
$$\mathrm{minimax}(\mathcal{P}, \mathrm{snr}) \triangleq \min_{\{\hat X_t(\cdot)\}_{0\le t\le T}}\; \max_{P\in\mathcal{P}}\; \left\{ E_P\left[\int_0^T \ell\big(X_t, \hat X_t(Y^t)\big)\, dt\right] - \mathrm{cmse}_{P,P}(\mathrm{snr}) \right\} = \min_{\hat X(\cdot)}\; \max_{P\in\mathcal{P}}\; \big\{ \mathrm{cmse}_{P,\hat X}(\mathrm{snr}) - \mathrm{cmse}_{P,P}(\mathrm{snr}) \big\}.$$

$$\mathrm{minimax}(\mathcal{P}, \mathrm{snr}) = \min_{Q}\; \max_{P\in\mathcal{P}}\; \big[\mathrm{cmse}_{P,Q}(\mathrm{snr}) - \mathrm{cmse}_{P,P}(\mathrm{snr})\big] \tag{31}$$
$$= \frac{2}{\mathrm{snr}}\, \min_{Q}\; \max_{P\in\mathcal{P}}\; D\big(P_{Y^T_{\mathrm{snr}}} \,\|\, Q_{Y^T_{\mathrm{snr}}}\big) \tag{32}$$
$$= \frac{2}{\mathrm{snr}}\, \max\big\{ I\big(\Theta; Y^T_{\mathrm{snr}}\big) : \Theta \text{ is a } \mathcal{P}\text{-valued RV} \big\} \tag{33}$$
$$= \frac{2}{\mathrm{snr}}\, C\big(\{P_{Y^T_{\mathrm{snr}}}\}_{P\in\mathcal{P}}\big). \tag{34}$$

Furthermore, the ‘strong redundancy-capacity’ results are directly applicable here and imply:

6.5 strong red cap
∀ ε > 0 and any filter $\{\hat X_t(\cdot)\}_{0\le t\le T}$,
$$E_P\left[\int_0^T \ell\big(X_t, \hat X_t(Y^t)\big)\, dt\right] - \mathrm{cmse}_{P,P}(\mathrm{snr}) \ge (1-\varepsilon)\cdot \mathrm{minimax}(\mathcal{P}, \mathrm{snr}) \tag{35}$$
for all P ∈ $\mathcal{P}$ with the possible exception of sources in a subset B ⊂ $\mathcal{P}$ where
$$w^*(B) \le e \cdot 2^{-\varepsilon\, C\left(\{P_{Y^T_{\mathrm{snr}}}\}_{P\in\mathcal{P}}\right)} \quad\Big(\text{equivalently, by (34), } w^*(B) \le e \cdot 2^{-\varepsilon\,(\mathrm{snr}/2)\,\mathrm{minimax}(\mathcal{P},\,\mathrm{snr})}\Big), \tag{36}$$
$w^*$ being the capacity-achieving prior.
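To make (33)–(34) concrete, the sketch below evaluates the capacity side for a hypothetical two-element family of priors in the constant-signal Gaussian toy model used earlier (so Y^T_snr reduces to the sufficient statistic Y_T): it grid-searches the weight that Θ places on the two priors, computes I(Θ; Y_T) by quadrature, and reports (2/snr)·C as the corresponding minimax regret, granting the conditions under which (34) applies. The family, the SNR and the grids are illustrative choices, not the paper's construction.

# Capacity side of (33)-(34) for a toy family of two priors, P = {N(0,1), N(0,4)},
# in the constant-signal Gaussian toy model with T = 1, where Y_T | Theta = i is
# N(0, snr * p_i + 1).  All numerical choices are illustrative.
import numpy as np

snr = 2.0
out_var = snr * np.array([1.0, 4.0]) + 1.0                     # output variances under the two priors

y = np.linspace(-40.0, 40.0, 160001)
dy = y[1] - y[0]
dens = np.exp(-0.5 * y**2 / out_var[:, None]) / np.sqrt(2 * np.pi * out_var[:, None])

def mutual_info(w):                                            # I(Theta; Y_T), prior weights (w, 1-w)
    weights = np.array([w, 1.0 - w])
    mix = weights @ dens
    h_mix = -np.sum(mix * np.log(mix)) * dy                    # differential entropy of the mixture
    h_cond = weights @ (0.5 * np.log(2 * np.pi * np.e * out_var))
    return h_mix - h_cond

C = max(mutual_info(w) for w in np.linspace(0.0, 1.0, 201))
print("C over the family (nats):", C)
print("minimax regret 2*C/snr  :", 2 * C / snr)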

Monday, June 3, 2013

Page 30: cf. lecture by Iain Johnstoneweb.stanford.edu/class/ee378a/lecture-notes/last_lecture.pdfOur conventions and notation for information measures, such as mutual information and relative

minimax estimation

classical

Theorem 6.4 Let P and Q be two probability measures that are members of P. For γ ≥ 0,

D(PY Tγ�QY T

γ) = γ · [cmleP,Q(γ)− cmleP,P (γ)] . (27)

Theorem 6.5 (under mild conditions)

D(PY T �QY T ) ∝ cmleP,Q − cmleP,P (28)

cmseP,Q − cmseP,P = D(PY T �QY T ) (29)

• Girsanov-type theory for expressing logdQY T

dlaw of homogenous Poisson as a filtering integral

• manipulating

D(PY T �QY T ) = EP

log

dPY T

dlaw of homogenous Poisson

logdQY T

dlaw of homogenous Poisson

via ‘orthogonality’ etc.

Put together, Theorem 6.3 and Theorem 6.5 yield, for γ > 0,

cmleP,Q(γ)− cmleP,P (γ) =1

γ

� γ

0[mleP,Q(α)−mleP,P (α)] dα =

1

γD(PY T

γ�QY T

γ), (30)

which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) arewell-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 9.

6.4 for slides: minimaxity

minimax(P, snr)�= min

{Xt(·)}0≤t≤T

maxP∈P

�EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr)

minimax(P, snr)�= min

X(·)maxP∈P

�cmseP,X(snr)− cmseP,P (snr)

minimax(P, snr) = minQ

maxP∈P

[cmseP,Q(snr)− cmseP,P (snr)] (31)

=2

snrminQ

maxP∈P

D�PY T

snr

��QY Tsnr

�(32)

=2

snrmax

�I�Θ;Y T

snr

�: Θ is a P-valued RV

�(33)

=2

snrC

��PY T

snr

�P∈P

�(34)

Furthermore, the ‘strong redundancy-capacity’ results are directly applicable here and imply:

6.5 strong red cap

∀ε > 0 and any filter {Xt(·)}0≤t≤T ,

EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr) ≥ (1− ε) ·minimax(P, snr) (35)

for all P ∈ P with the possible exception of sources in a subset B ⊂ P where

w∗(B) ≤ e · 2

−ε·C��

PY Tsnr

P∈P

, (36)

w∗(B) ≤ e · 2−ε·minimax(P,snr)

,

w∗ being the capacity achieving prior

11

Monday, June 3, 2013

Page 31: cf. lecture by Iain Johnstoneweb.stanford.edu/class/ee378a/lecture-notes/last_lecture.pdfOur conventions and notation for information measures, such as mutual information and relative

minimax estimation

classical

ours

Theorem 6.4 Let P and Q be two probability measures that are members of P. For γ ≥ 0,

D(PY Tγ�QY T

γ) = γ · [cmleP,Q(γ)− cmleP,P (γ)] . (27)

Theorem 6.5 (under mild conditions)

D(PY T �QY T ) ∝ cmleP,Q − cmleP,P (28)

cmseP,Q − cmseP,P = D(PY T �QY T ) (29)

• Girsanov-type theory for expressing logdQY T

dlaw of homogenous Poisson as a filtering integral

• manipulating

D(PY T �QY T ) = EP

log

dPY T

dlaw of homogenous Poisson

logdQY T

dlaw of homogenous Poisson

via ‘orthogonality’ etc.

Put together, Theorem 6.3 and Theorem 6.5 yield, for γ > 0,

cmleP,Q(γ)− cmleP,P (γ) =1

γ

� γ

0[mleP,Q(α)−mleP,P (α)] dα =

1

γD(PY T

γ�QY T

γ), (30)

which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) arewell-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 9.

6.4 for slides: minimaxity

minimax(P, snr)�= min

{Xt(·)}0≤t≤T

maxP∈P

�EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr)

minimax(P, snr)�= min

X(·)maxP∈P

�cmseP,X(snr)− cmseP,P (snr)

minimax(P, snr) = minQ

maxP∈P

[cmseP,Q(snr)− cmseP,P (snr)] (31)

=2

snrminQ

maxP∈P

D�PY T

snr

��QY Tsnr

�(32)

=2

snrmax

�I�Θ;Y T

snr

�: Θ is a P-valued RV

�(33)

=2

snrC

��PY T

snr

�P∈P

�(34)

Furthermore, the ‘strong redundancy-capacity’ results are directly applicable here and imply:

6.5 strong red cap

∀ε > 0 and any filter {Xt(·)}0≤t≤T ,

EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr) ≥ (1− ε) ·minimax(P, snr) (35)

for all P ∈ P with the possible exception of sources in a subset B ⊂ P where

w∗(B) ≤ e · 2

−ε·C��

PY Tsnr

P∈P

, (36)

w∗(B) ≤ e · 2−ε·minimax(P,snr)

,

w∗ being the capacity achieving prior

11

Monday, June 3, 2013

Page 32: cf. lecture by Iain Johnstoneweb.stanford.edu/class/ee378a/lecture-notes/last_lecture.pdfOur conventions and notation for information measures, such as mutual information and relative

minimax estimation

classical

ours

Redundancy-Capacity theory

Theorem 6.4 Let P and Q be two probability measures that are members of P. For γ ≥ 0,

D(PY Tγ�QY T

γ) = γ · [cmleP,Q(γ)− cmleP,P (γ)] . (27)

Theorem 6.5 (under mild conditions)

D(PY T �QY T ) ∝ cmleP,Q − cmleP,P (28)

cmseP,Q − cmseP,P = D(PY T �QY T ) (29)

• Girsanov-type theory for expressing logdQY T

dlaw of homogenous Poisson as a filtering integral

• manipulating

D(PY T �QY T ) = EP

log

dPY T

dlaw of homogenous Poisson

logdQY T

dlaw of homogenous Poisson

via ‘orthogonality’ etc.

Put together, Theorem 6.3 and Theorem 6.5 yield, for γ > 0,

cmleP,Q(γ)− cmleP,P (γ) =1

γ

� γ

0[mleP,Q(α)−mleP,P (α)] dα =

1

γD(PY T

γ�QY T

γ), (30)

which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) arewell-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 9.

6.4 for slides: minimaxity

minimax(P, snr)�= min

{Xt(·)}0≤t≤T

maxP∈P

�EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr)

minimax(P, snr)�= min

X(·)maxP∈P

�cmseP,X(snr)− cmseP,P (snr)

minimax(P, snr) = minQ

maxP∈P

[cmseP,Q(snr)− cmseP,P (snr)] (31)

=2

snrminQ

maxP∈P

D�PY T

snr

��QY Tsnr

�(32)

=2

snrmax

�I�Θ;Y T

snr

�: Θ is a P-valued RV

�(33)

=2

snrC

��PY T

snr

�P∈P

�(34)

Furthermore, the ‘strong redundancy-capacity’ results are directly applicable here and imply:

6.5 strong red cap

∀ε > 0 and any filter {Xt(·)}0≤t≤T ,

EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr) ≥ (1− ε) ·minimax(P, snr) (35)

for all P ∈ P with the possible exception of sources in a subset B ⊂ P where

w∗(B) ≤ e · 2

−ε·C��

PY Tsnr

P∈P

, (36)

w∗(B) ≤ e · 2−ε·minimax(P,snr)

,

w∗ being the capacity achieving prior

11

Monday, June 3, 2013

Page 33: cf. lecture by Iain Johnstoneweb.stanford.edu/class/ee378a/lecture-notes/last_lecture.pdfOur conventions and notation for information measures, such as mutual information and relative

minimax estimation

classical

ours

Redundancy-Capacity theory

Shannon

Theorem 6.4 Let P and Q be two probability measures that are members of P. For γ ≥ 0,

D(PY Tγ�QY T

γ) = γ · [cmleP,Q(γ)− cmleP,P (γ)] . (27)

Theorem 6.5 (under mild conditions)

D(PY T �QY T ) ∝ cmleP,Q − cmleP,P (28)

cmseP,Q − cmseP,P = D(PY T �QY T ) (29)

• Girsanov-type theory for expressing logdQY T

dlaw of homogenous Poisson as a filtering integral

• manipulating

D(PY T �QY T ) = EP

log

dPY T

dlaw of homogenous Poisson

logdQY T

dlaw of homogenous Poisson

via ‘orthogonality’ etc.

Put together, Theorem 6.3 and Theorem 6.5 yield, for γ > 0,

cmleP,Q(γ)− cmleP,P (γ) =1

γ

� γ

0[mleP,Q(α)−mleP,P (α)] dα =

1

γD(PY T

γ�QY T

γ), (30)

which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) arewell-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 9.

6.4 for slides: minimaxity

minimax(P, snr)�= min

{Xt(·)}0≤t≤T

maxP∈P

�EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr)

minimax(P, snr)�= min

X(·)maxP∈P

�cmseP,X(snr)− cmseP,P (snr)

minimax(P, snr) = minQ

maxP∈P

[cmseP,Q(snr)− cmseP,P (snr)] (31)

=2

snrminQ

maxP∈P

D�PY T

snr

��QY Tsnr

�(32)

=2

snrmax

�I�Θ;Y T

snr

�: Θ is a P-valued RV

�(33)

=2

snrC

��PY T

snr

�P∈P

�(34)

Furthermore, the ‘strong redundancy-capacity’ results are directly applicable here and imply:

6.5 strong red cap

∀ε > 0 and any filter {Xt(·)}0≤t≤T ,

EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr) ≥ (1− ε) ·minimax(P, snr) (35)

for all P ∈ P with the possible exception of sources in a subset B ⊂ P where

w∗(B) ≤ e · 2

−ε·C��

PY Tsnr

P∈P

, (36)

w∗(B) ≤ e · 2−ε·minimax(P,snr)

,

w∗ being the capacity achieving prior

11

Monday, June 3, 2013

Page 34: cf. lecture by Iain Johnstoneweb.stanford.edu/class/ee378a/lecture-notes/last_lecture.pdfOur conventions and notation for information measures, such as mutual information and relative

minimax estimation

classical

ours

Redundancy-Capacity theory

Shannon

Theorem 6.4 Let P and Q be two probability measures that are members of P. For γ ≥ 0,

D(PY Tγ�QY T

γ) = γ · [cmleP,Q(γ)− cmleP,P (γ)] . (27)

Theorem 6.5 (under mild conditions)

D(PY T �QY T ) ∝ cmleP,Q − cmleP,P (28)

• Girsanov-type theory for expressing logdQY T

dlaw of homogenous Poisson as a filtering integral

• manipulating

D(PY T �QY T ) = EP

log

dPY T

dlaw of homogenous Poisson

logdQY T

dlaw of homogenous Poisson

via ‘orthogonality’ etc.

Put together, Theorem 6.3 and Theorem 6.5 yield, for γ > 0,

cmleP,Q(γ)− cmleP,P (γ) =1

γ

� γ

0[mleP,Q(α)−mleP,P (α)] dα =

1

γD(PY T

γ�QY T

γ), (29)

which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) arewell-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 9.

6.4 for slides: minimaxity

minimax(P, snr)�= min

{Xt(·)}0≤t≤T

maxP∈P

�EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmseP,P (snr)

minimax(P, snr) = minQ

maxP∈P

[cmseP,Q(snr)− cmseP,P (snr)] (30)

=2

snrminQ

maxP∈P

D�PY T

snr

��QY Tsnr

�(31)

=2

snrmax

�I�Θ;Y T

snr

�: Θ is a P-valued RV

�(32)

=2

snrC

��PY T

snr

�P∈P

�(33)

Furthermore, the ‘strong redundancy-capacity’ results are directly applicable here and imply:

6.5 strong red cap

∀ε > 0 and any filter {Xt(·)}0≤t≤T ,

EP

�� T

0�(Xt, Xt(Y

t))dt

�− cmleP,P (snr) ≥ (1− ε) ·minimax(P, snr) (34)

for all P ∈ P with the possible exception of sources in a subset B ⊂ P where

w∗(B) ≤ e · 2−ε·C(P,snr)

, (35)

w∗ being the capacity achieving prior


7 Implications

7.1 Mutual Information and Minimum Mean Estimation Loss

Let X be a non-negative random variable and, for γ > 0, let Y_γ be a non-negative integer-valued random variable, jointly distributed with X such that the conditional law of Y_γ given X is Poisson(γX). When specialized to this setting, Theorem 2 of [14] gives

d/dγ I(X; Y_γ) = E[ X log X − E[X|Y_γ] log E[X|Y_γ] ].    (38)

It is instructive to observe that the right hand side of (38) is nothing but the minimum mean loss in estimating X based on Y_γ under the loss function ℓ. Indeed, denoting this minimum mean loss by mmle(γ), i.e.,

mmle(γ) ≜ E[ ℓ( X, E[X|Y_γ] ) ],    (39)

we have

E[ ℓ( X, E[X|Y_γ] ) ] = E[ X log( X / E[X|Y_γ] ) − X + E[X|Y_γ] ]    (40)
                      = E[ X log X − X log E[X|Y_γ] ]    (41)
                      = E[ X log X − E[X|Y_γ] log E[X|Y_γ] ],    (42)

where (41) uses E[E[X|Y_γ]] = E[X] and (42) follows from the tower property, log E[X|Y_γ] being a function of Y_γ. Thus, (38) can be stated as the “I-MMLE” relationship

d/dγ I(X; Y_γ) = mmle(γ),    (43)

in complete analogy with the I-MMSE relationship of [13]. To see one immediate benefit of this realization that the right hand side of (38) coincides with the minimum mean loss in the right hand side of (42), we first go through the following data processing argument: Fix γ′ < γ, let {B_i}_{i≥1} be i.i.d. Bernoulli(γ′/γ), independent of (X, Y_γ), and note that (X, Σ_{i=1}^{Y_γ} B_i) is equal in distribution to (X, Y_{γ′}). Since estimating X based on Σ_{i=1}^{Y_γ} B_i, which is a function of Y_γ and the randomization sequence {B_i}, cannot be better (in the sense of minimizing the expected loss under ℓ) than estimating X based on Y_γ, we have mmle(γ′) ≥ mmle(γ). Thus, mmle(γ) is non-increasing in γ which, when combined with (43), yields the following analogue of [13, Corollary 1]:

Corollary 7.1 I(X; Y_γ) is concave in γ.

It is also worth pointing out that the I-MMLE relationship can be viewed as a direct consequence of Theorem 6.2. Indeed, in the notation of Section 6.2, (43) is expressed as

d/dγ I_P(X; Y_γ) = mle_{P,P}(γ),    (44)
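As a sanity check of (43), here is a small numerical computation for a two-point input; the support points, probabilities, truncation level and step size are illustrative assumptions, not taken from the paper:

# Numerical check of the I-MMLE relationship (43) for a two-point input:
# X = a w.p. p, X = b w.p. 1-p (the values below are arbitrary illustrative choices).
import math

a, b, p = 0.5, 2.0, 0.3
KMAX = 200            # truncation of the Poisson support; ample for the gammas used

def pois(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def mutual_info(g):
    """I(X; Y_gamma) in nats, computed from the Poisson pmf."""
    total = 0.0
    for k in range(KMAX):
        pa, pb = pois(k, g * a), pois(k, g * b)
        pk = p * pa + (1 - p) * pb
        if pa > 0:
            total += p * pa * math.log(pa / pk)
        if pb > 0:
            total += (1 - p) * pb * math.log(pb / pk)
    return total

def mmle(g):
    """E[ l(X, E[X|Y_gamma]) ] = E[ X log X - E[X|Y] log E[X|Y] ]."""
    val = p * a * math.log(a) + (1 - p) * b * math.log(b)
    for k in range(KMAX):
        pa, pb = pois(k, g * a), pois(k, g * b)
        pk = p * pa + (1 - p) * pb
        if pk == 0.0:
            continue
        xhat = (p * a * pa + (1 - p) * b * pb) / pk
        val -= pk * xhat * math.log(xhat)
    return val

g, h = 3.0, 1e-4
lhs = (mutual_info(g + h) - mutual_info(g - h)) / (2 * h)   # d/dgamma I(X;Y_gamma)
rhs = mmle(g)
print(f"dI/dgamma ~ {lhs:.6f}   mmle(gamma) = {rhs:.6f}")   # the two should agree closely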


“Minimax Filtering via Relations Between Information and Estimation”
ISIT 2013, IEEE International Symposium on Information Theory
July 7-12, 2013, Istanbul, Turkey
Albert No and T. Weissman


lookahead


question

(for Duncan slide)

H(X) = Σ_{x∈𝒳} P(X = x) log [ 1 / P(X = x) ]

AWGN channel: dY_t = X_t dt + dW_t, 0 ≤ t ≤ T, where W is standard white Gaussian noise, independent of X.

For example, consider Duncan’s relationship [Duncan 1970]:

I(X^T; Y^T) = (1/2) E[ ∫_0^T ( X_t − E[X_t|Y^t] )² dt ].

Since the left hand side is the expected log Radon-Nikodym derivative, this says

E[ log dP_{X^T,Y^T} / d( P_{X^T} × P_{Y^T} ) ] = (1/2) E[ ∫_0^T ( X_t − E[X_t|Y^t] )² dt ],

i.e.

E[ log dP_{X^T,Y^T} / d( P_{X^T} × P_{Y^T} ) − (1/2) ∫_0^T ( X_t − E[X_t|Y^t] )² dt ] = 0.

What else can we say about the random variable

log dP_{X^T,Y^T} / d( P_{X^T} × P_{Y^T} ) − (1/2) ∫_0^T ( X_t − E[X_t|Y^t] )² dt ?

Var[ log dP_{X^T,Y^T} / d( P_{X^T} × P_{Y^T} ) − (1/2) ∫_0^T ( X_t − E[X_t|Y^t] )² dt ] = ?

Var[ log dP_{X^T,Y^T} / d( P_{X^T} × P_{Y^T} ) − (1/2) ∫_0^T ( X_t − E[X_t|Y^t] )² dt ] = 2 I(X^T; Y^T) = E[ ∫_0^T ( X_t − E[X_t|Y^t] )² dt ].

Lookahead: consider dY_t = √γ X_t dt + dW_t, 0 ≤ t ≤ T, with I(γ) = I(X^T; Y^T). For stationary X = {X_t}, let

mmse_d(X, d, γ) = Var( X_0 | Y^d_{−∞} ),

the MMSE in estimating X_0 with lookahead d, and let I(·) here be the mutual information rate.

Can I(·) determine lmmse(d, snr)? How about I(·) together with S_X(·)?

We have seen that I(·) determines both mmse_d(X, 0, γ) and mmse_d(X, ∞, γ). Does I(·) determine mmse_d(X, d, γ) in general? No: in general mmse_d(X, d, γ) ≠ mmse_d(X^{(r)}, d, γ), where X^{(r)} is the time-reversed X.
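The step that the expected log Radon-Nikodym derivative equals the mutual information is easy to sanity-check by Monte Carlo in the scalar analogue Y = X + Z (the variance identity above is a continuous-time statement and is not checked here); the Gaussian input, its variance and the sample size are illustrative assumptions:

# Monte Carlo sanity check (scalar analogue): the information density
#   i(X;Y) = log [ dP_{X,Y} / d(P_X x P_Y) ]
# has expectation I(X;Y).  Gaussian input and sample size are illustrative choices.
import math, random

random.seed(0)
s2, n = 2.0, 200_000          # input variance, number of samples

def log_gauss_pdf(y, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (y - mean) ** 2 / (2 * var)

acc = 0.0
for _ in range(n):
    x = random.gauss(0.0, math.sqrt(s2))
    y = x + random.gauss(0.0, 1.0)
    # i(x, y) = log p(y|x) - log p(y),  with Y ~ N(0, 1 + s2) marginally
    acc += log_gauss_pdf(y, x, 1.0) - log_gauss_pdf(y, 0.0, 1.0 + s2)

print(f"MC average of information density: {acc / n:.4f} nats")
print(f"I(X;Y) = 0.5*log(1+s2)            : {0.5 * math.log(1 + s2):.4f} nats")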


a time irreversible process
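For concreteness, one way to produce a time-irreversible stationary process is a Markov chain that violates detailed balance; the particular chain below is an arbitrary illustration, not one used in the lecture:

# A concrete example of a time-irreversible stationary process: a Markov chain
# whose transitions violate detailed balance.  The particular matrix is an
# arbitrary illustration (it tends to cycle 0 -> 1 -> 2 -> 0).
P = [[0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8],
     [0.8, 0.1, 0.1]]

pi = [1 / 3, 1 / 3, 1 / 3]   # doubly stochastic => uniform stationary distribution

# Detailed balance pi_i P_ij == pi_j P_ji would make the chain reversible.
violations = [(i, j) for i in range(3) for j in range(3)
              if abs(pi[i] * P[i][j] - pi[j] * P[j][i]) > 1e-12]
print("detailed-balance violations:", violations)

# The time-reversed chain has transition matrix P~_ij = pi_j P_ji / pi_i (= P transposed
# here), which differs from P, so forward and reversed dynamics, and hence lookahead-d
# estimation problems driven by them, are genuinely different.
P_rev = [[pi[j] * P[j][i] / pi[i] for j in range(3)] for i in range(3)]
print("reversed-chain transition matrix:", P_rev)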


d/dγ I(γ) = (1/2) mmse(γ),    mmse(γ) = E[ ∫_0^T ( X_t − E[X_t|Y^T] )² dt ],

or in its integral version

I(snr) = (1/2) ∫_0^snr mmse(γ) dγ,    cmmse(snr) = (1/snr) ∫_0^snr mmse(γ) dγ.

Relationship between cmmse and mmse?

What if X ∼ P but the estimator thinks X ∼ Q ?

mse_{P,Q}(γ) = E_P[ ( X − E_Q[X|Y] )² ]

What is the cost of mismatch?

D(P‖Q) = ∫_0^∞ [ mse_{P,Q}(γ) − mse_{P,P}(γ) ] dγ

D( P_{Y_{snr}} ‖ Q_{Y_{snr}} ) = ∫_0^snr [ mse_{P,Q}(γ) − mse_{P,P}(γ) ] dγ

d/dγ D( P_Y ‖ Q_Y ) = mse_{P,Q}(γ) − mse_{P,P}(γ)
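A quick numerical check of d/dγ I(γ) = (1/2) mmse(γ) in the scalar version Y = √γ X + N with an equiprobable ±1 input; the input law, the integration grid and the step sizes are illustrative assumptions:

# Numerical check of the scalar I-MMSE relation  dI/dgamma = (1/2) mmse(gamma)
# for the channel Y = sqrt(gamma) X + N, N ~ N(0,1), with equiprobable X in {-1,+1}.
import math

def phi(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

YGRID = [-10 + 0.001 * k for k in range(20001)]
DY = 0.001

def I(g):
    """Mutual information (nats) by numerical integration over the output."""
    r = math.sqrt(g)
    total = 0.0
    for y in YGRID:
        py = 0.5 * (phi(y - r) + phi(y + r))
        for x in (-1.0, 1.0):
            pyx = phi(y - r * x)
            if pyx > 0 and py > 0:
                total += 0.5 * pyx * math.log(pyx / py) * DY
    return total

def mmse(g):
    """E[(X - E[X|Y])^2] with E[X|Y=y] = tanh(sqrt(gamma) y)."""
    r = math.sqrt(g)
    total = 0.0
    for y in YGRID:
        for x in (-1.0, 1.0):
            total += 0.5 * phi(y - r * x) * (x - math.tanh(r * y)) ** 2 * DY
    return total

g, h = 1.0, 1e-3
print("dI/dgamma ~", (I(g + h) - I(g - h)) / (2 * h))
print("0.5*mmse  =", 0.5 * mmse(g))   # the two numbers should agree closely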

3 Introduction

In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information between the input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR), is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simple relationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as the continuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings where this relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable relationship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributed continuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean value of the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual information to both types of MMSE thus served as a bridge between the two quantities.

More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimator that would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due to the mismatch is equal to the relative entropy between the true channel output distribution and the channel output distribution under Q, at SNR = γ.


Poisson Channel

5 Notation and Conventions

Our conventions and notation for information measures, such as mutual information and relative entropy, are standard. The initiated reader is advised to skip this section. If U, V, W are three random variables taking values in Polish spaces 𝒰, 𝒱, 𝒲, respectively, and defined on a common probability space with a probability measure P, we let P_U, P_{U,V} etc. denote the probability measures induced on 𝒰, the pair (𝒰, 𝒱) etc., while e.g. P_{U|V} denotes a regular version of the conditional distribution of U given V. P_{U|v} is the distribution on 𝒰 obtained by evaluating that regular version at v. If Q is another probability measure on the same measurable space we similarly denote Q_U, Q_{U|V}, etc. As usual, given two measures on the same measurable space, e.g. P and Q, define their relative entropy (divergence) by

D(P‖Q) = ∫ [ log (dP/dQ) ] dP    (12)

when P is absolutely continuous w.r.t. Q, defining D(P‖Q) = ∞ otherwise. An immediate consequence of the definitions of relative entropy and of the Radon-Nikodym derivative is that if f : 𝒰 → 𝒱 is measurable and one-to-one, and V = f(U), then

D(P_U‖Q_U) = D(P_V‖Q_V).    (13)

Following [5], we further use the notation

D( P_{U|V} ‖ Q_{U|V} | P_V ) = ∫ D( P_{U|v} ‖ Q_{U|v} ) dP_V(v),    (14)

where on the right side D(P_{U|v}‖Q_{U|v}) is a divergence in the sense of (12) between the measures P_{U|v} and Q_{U|v}. It will be convenient to write

D( P_{U|V} ‖ Q_{U|V} )    (15)

to denote f(V) when f(v) = D(P_{U|v}‖Q_{U|v}). Thus D(P_{U|V}‖Q_{U|V}) is a random variable while D(P_{U|V}‖Q_{U|V}|P_V) is its expectation under P. With this notation, the chain rule for relative entropy (cf., e.g., [6, Subsection D.3]) is

D( P_{U,V} ‖ Q_{U,V} ) = D( P_U ‖ Q_U ) + D( P_{V|U} ‖ Q_{V|U} | P_U )    (16)

and is valid regardless of the finiteness of both sides of the equation.

The mutual information between U and V is defined as

I(U;V) = D( P_{U,V} ‖ P_U × P_V ),    (17)

where P_U × P_V denotes the product measure induced by P_U and P_V. We note in passing, in line with the comment on relative entropy and one-to-one transformations leading to (13), that if f and g are two measurable one-to-one transformations and A = f(U) while B = g(V), then

I(U;V) = I(A;B).    (18)

Finally, the conditional mutual information between U and V, given W, is defined as

I(U;V|W) = D( P_{U,V|W} ‖ P_{U|W} × P_{V|W} | P_W ).    (19)

The roles of U, V, W will be played in what follows by scalar random variables, vectors, or processes.
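The definitions above are easy to exercise on finite alphabets; the following minimal check of the chain rule (16) uses two arbitrary joint pmfs as an illustration:

# Finite-alphabet sanity check of the chain rule (16):
#   D(P_{U,V} || Q_{U,V}) = D(P_U || Q_U) + D(P_{V|U} || Q_{V|U} | P_U).
# The two joint pmfs are arbitrary illustrative choices.
import math

P = [[0.20, 0.10, 0.10],
     [0.15, 0.25, 0.20]]
Q = [[0.10, 0.20, 0.15],
     [0.25, 0.10, 0.20]]

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

flat = lambda M: [x for row in M for x in row]
PU = [sum(row) for row in P]
QU = [sum(row) for row in Q]

joint = kl(flat(P), flat(Q))
marginal = kl(PU, QU)
conditional = sum(PU[u] * kl([P[u][v] / PU[u] for v in range(3)],
                             [Q[u][v] / QU[u] for v in range(3)])
                  for u in range(2))

print(f"D(P_UV||Q_UV)            = {joint:.6f}")
print(f"D(P_U||Q_U) + D(.||.|P_U) = {marginal + conditional:.6f}")   # equal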

6 Relative Entropy and Mismatched Estimation

6.1 For slides

• Scalar channel: X ≥ 0, Y_γ | X ∼ Poisson(γ·X)

• Continuous-time channel: X^T a non-negative stochastic process; Y^T_γ | X^T a non-homogeneous Poisson process of intensity γ·X^T

• Note:

D( exp(λ_1) ‖ exp(λ_2) ) = (1/λ_1) · ℓ(λ_1, λ_2)

Compare with

D( N(μ_1, σ²) ‖ N(μ_2, σ²) ) = (1/(2σ²)) · (μ_1 − μ_2)²

cmmle(γ) = E[ ∫_0^T ℓ( X_t, E[X_t | Y^t_γ] ) dt ],    mmle(γ) = E[ ∫_0^T ℓ( X_t, E[X_t | Y^T_γ] ) dt ]

cmmle(snr) = (1/snr) ∫_0^snr mmle(γ) dγ

Relationship between cmmle and mmle?

For X independent of Z ∼ N(0, 1):  d/dt h( X + √t Z ) = (1/2) J( X + √t Z )

6.2 Random Variables

Suppose that X is a non-negative random variable and the conditional law of a r.v. Y_γ, given X, is Poisson(γX). If X ∼ P, denote expectation w.r.t. the corresponding joint law of X and Y_γ by E_P, the distribution of Y_γ by P_{Y_γ}, the conditional expectation by E_P[X|Y_γ], etc. We denote the mutual information by I_P(X;Y_γ) or simply I(X;Y_γ) when there is no ambiguity. Let further mle_{P,Q}(γ) denote the mean loss under ℓ in estimating X based on Y_γ using the estimator that would have been optimal had X ∼ Q when in fact X ∼ P, i.e.,

mle_{P,Q}(γ) ≜ E_P[ ℓ( X, E_Q[X|Y_γ] ) ].    (20)

The following is a new representation of relative entropy, paralleling the Gaussian channel result of [31]:

Theorem 6.1 For any pair P, Q of probability measures over [a, b], where 0 < a < b < ∞,

D(P‖Q) = ∫_0^∞ [ mle_{P,Q}(γ) − mle_{P,P}(γ) ] dγ.    (21)
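Theorem 6.1 can be checked numerically in a simple case. The sketch below takes P and Q to be two-point laws on {a, b} (the support, the masses, the Poisson truncation and the integration grid are all illustrative assumptions) and compares the integrated excess loss with D(P‖Q):

# Numerical check of Theorem 6.1 for two-point laws P and Q on {a, b}:
#   D(P||Q) =? integral over gamma of [ mle_{P,Q}(gamma) - mle_{P,P}(gamma) ].
# Support points, masses, truncation and grid below are illustrative choices.
import math

a, b = 0.5, 2.0
p, q = 0.3, 0.7          # P(X=a)=p under P,  Q(X=a)=q under Q
KMAX = 320

def pois(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def loss(x, xhat):
    return x * math.log(x / xhat) - x + xhat

def mle(g, w):
    """E_P[ l(X, E_W[X|Y_g]) ] where W puts mass w on a and 1-w on b."""
    total = 0.0
    for k in range(KMAX):
        pa, pb = pois(k, g * a), pois(k, g * b)
        den = w * pa + (1 - w) * pb
        if den == 0.0:
            continue
        xhat = (w * a * pa + (1 - w) * b * pb) / den
        total += p * pa * loss(a, xhat) + (1 - p) * pb * loss(b, xhat)
    return total

# midpoint rule over gamma in (0, 60]; the excess loss is ~0 well before gamma = 60
dg, steps, integral = 0.1, 600, 0.0
for i in range(steps):
    g = (i + 0.5) * dg
    integral += (mle(g, q) - mle(g, p)) * dg

D = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(f"integral of excess loss = {integral:.5f}")
print(f"D(P||Q)                 = {D:.5f}")     # should agree to a few decimals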

Theorem 6.1 is a direct consequence of the fact (proved in Section 9) that

lim_{γ→∞} D( P_{Y_γ} ‖ Q_{Y_γ} ) = D(P‖Q),    (22)

combined with the following result, which is the Poisson parallel of [31, Equation (24)]:


quest for


This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEs continues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distribution that differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown to be the sum of the mutual information and the relative entropy between the true and mismatched output distributions, this relative entropy thus quantifying the penalty due to mismatch.

Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, the input, is a non-negative random variable while the conditional distribution of the output Y given the input is given by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, the channel input is X^T = {X_t, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on X^T, the output Y^T = {Y_t, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ·X^T. Often referred to as the “ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication: The channel input represents the squared magnitude of the electric field incident on the photo-detector, while its output is the counting process describing the arrival times of the photons registered by the detector. Here the energy of the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it is the amplification factor γ rather than γ² that plays the role of SNR. We refer to [32] for a review of the literature on the Poisson channel and its communication theoretic significance, and to [11] and references therein for applications of Poisson channel models in other fields.

The function ℓ_0(x) = x log x − x + 1, x > 0 (where log denotes the natural logarithm throughout), being the convex conjugate of the Poisson distribution’s log moment generating function, arises naturally in analysis of Poisson and continuous time jump Markov processes in a variety of situations. These include relative entropy representation for jump Markov processes (see, e.g., equation (3.20) and Theorem 3.3 of [8]), large deviation local rate function for such processes ([8], Chapter 5 of [29]), mutual information in the Poisson channel (Section 19.5 and equation (19.135) of [20]), and logarithmic transformations in stochastic control theory (Section 3 of [9]). It is also intimately related to change-of-measure formulae for point processes in the spirit of the Girsanov transformation (Section VI.(5.5–6) of [4], [16], [28]). It is therefore not surprising that the function ℓ_0 appears in this paper in representations for relative entropy and related calculations. It is less obvious, however, that using it to define estimation loss turns out to be very useful and, in particular, gives rise to a number of results that parallel the Gaussian theory.

Enter the loss function ℓ : [0,∞) × [0,∞) → [0,∞] defined by x̂ · ℓ_0(x/x̂) or, more precisely,

ℓ(x, x̂) = x log(x/x̂) − x + x̂,    (1)

where the right hand side of (1) is well-defined as an extended non-negative real number in view of our conventions 0 log 0 = 0, 0 log(0/0) = 0, c/0 = ∞ and log(c/0) = ∞ for c > 0. In Section 2, we exhibit properties of this loss function that show it is a natural one for measuring goodness of reconstruction of non-negative objects, and that it shares some of its key properties with the squared error loss, such as optimality of the conditional expectation under the mean loss criterion.
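A minimal sketch of the loss function (1) and of the property just mentioned, namely that the (conditional) expectation minimizes the mean loss; the pmf of X and the search grid are arbitrary illustrative choices:

# The loss function (1) and the optimality of the mean under it.
# The distribution of X and the search grid are illustrative choices.
import math

def loss(x, xhat):
    """l(x, xhat) = x log(x/xhat) - x + xhat, with the conventions of (1)."""
    if x == 0:
        return xhat                    # 0 log 0 = 0, 0 log(0/0) = 0
    if xhat == 0:
        return math.inf                # log(c/0) = inf for c > 0
    return x * math.log(x / xhat) - x + xhat

xs = [0.0, 0.5, 1.0, 3.0]
ps = [0.1, 0.3, 0.4, 0.2]              # an arbitrary pmf for X

def mean_loss(xhat):
    return sum(p * loss(x, xhat) for x, p in zip(xs, ps))

grid = [0.01 * k for k in range(1, 1001)]
best = min(grid, key=mean_loss)
print("argmin of E[l(X, xhat)] ~", best)
print("E[X]                    =", sum(p * x for x, p in zip(xs, ps)))  # they match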

The goal of this paper is to show that a set of relations identical to those that hold for the Gaussian channel – ranging from Duncan’s formula [7], to the I-MMSE of [13, 34], to Verdu’s relationship between relative entropy and mismatched estimation [31], to the relationship between causal and non-causal estimation in continuous time for matched [13] and mismatched [33] filters – hold for the Poisson channel upon replacing the squared error loss by the loss function in (1).

It is instructive to note that while the relative entropy between two Gaussians of the same variance and means m_1 and m_2 is equal to (m_1 − m_2)², that between two exponentials of parameters λ_1 and λ_2 is equal to ℓ(λ_1, λ_2) (with additional multiplicative terms in both cases). Although this simple fact does not exclusively explain the Gaussian-Poissonian analogy, it lies at its heart, along with further properties of ℓ observed in Section 2.


[26] D. P. Palomar and S. Verdu, “Representation of Mutual Information via Input Estimates,” IEEE Trans. Information Theory, vol. 53, no. 2, pp. 453-470, Feb. 2007.

[27] B. Y. Ryabko, “Encoding a source with unknown but ordered probabilities,” Probl. Inf. Transm., pp. 134-139, Oct. 1979.

[28] A. Segall and T. Kailath, “Radon-Nikodym derivatives with respect to measures induced by discontinuous independent-increment processes,” Ann. Probab., vol. 3, no. 3, pp. 449-464, 1975.

[29] A. Shwartz and A. Weiss, Large Deviations for Performance Analysis: Queues, Communications, and Computing. Chapman & Hall, London, 1995.

[30] A. M. Tulino and S. Verdu, “Monotonic Decrease of the Non-Gaussianness of the Sum of Independent Random Variables: A Simple Proof,” IEEE Trans. Information Theory, vol. 52, no. 9, pp. 4295-4297, Sep. 2006.

[31] S. Verdu, “Mismatched estimation and relative entropy,” IEEE Trans. Information Theory, vol. 56, no. 8, pp. 3712-3720, Aug. 2010.

[32] S. Verdu, “Poisson communication theory,” International Technion Communication Day in Honor of Israel Bar-David, March 1999.

[33] T. Weissman, “The Relationship Between Causal and Noncausal Mismatched Estimation in Continuous-Time AWGN Channels,” IEEE Trans. Information Theory, vol. 56, no. 9, pp. 4256-4273, Sep. 2010.

[34] M. Zakai, “On mutual information, likelihood ratios, and estimation error for the additive Gaussian channel,” IEEE Trans. Information Theory, vol. 51, no. 9, pp. 3017-3024, Sep. 2005.

[Figure 1: The loss function ℓ. (a) ℓ(1, x̂); (b) ℓ(x, 1) = x log x − x + 1.]

[Figure 2: The curves mle_{P,P}(γ), cmle_{P,P}(γ), mle_{P,Q}(γ) and cmle_{P,Q}(γ), marked respectively by A, B, C, D, of the example in Section 8.1, plotted here for p = 1/2 and q = 1/5.]


An observation (and hint)

• Note:

D( Poisson(λ_1) ‖ Poisson(λ_2) ) = ℓ(λ_1, λ_2)

D( exp(λ_1) ‖ exp(λ_2) ) = (1/λ_1) · ℓ(λ_1, λ_2)

Compare with

D( N(μ_1, σ²) ‖ N(μ_2, σ²) ) = (1/(2σ²)) · (μ_1 − μ_2)²,  in particular  D( N(μ_1, 1) ‖ N(μ_2, 1) ) = (1/2) · (μ_1 − μ_2)².

• For the continuous-time channel (X^T non-negative, Y^T_γ | X^T a non-homogeneous Poisson process of intensity γ·X^T):

I( X^T ; Y^T_γ ) = γ · E[ ∫_0^T ℓ( X_t, E[X_t | Y^t_γ] ) dt ]

and

cmle_{P,Q}(snr) = (1/snr) · [ I( X^T ; Y^T_{snr} ) + D( P_{Y^T_{snr}} ‖ Q_{Y^T_{snr}} ) ].
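The first identity above, D(Poisson(λ_1)‖Poisson(λ_2)) = ℓ(λ_1, λ_2), is a one-line computation, and it is also easy to confirm numerically; the rates and the truncation point below are arbitrary illustrative choices:

# Check that D(Poisson(l1) || Poisson(l2)) = l(l1, l2) = l1 log(l1/l2) - l1 + l2.
import math

l1, l2, KMAX = 3.2, 1.4, 200

def pois(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

D = sum(pois(k, l1) * math.log(pois(k, l1) / pois(k, l2)) for k in range(KMAX))
ell = l1 * math.log(l1 / l2) - l1 + l2
print(f"D(Poisson||Poisson) = {D:.10f}")
print(f"l(l1, l2)           = {ell:.10f}")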



Punch Line

[Rami Atar and T.W. 2012]: under the above


Our emphasis is on the results for the mismatched setting, relating the cost of mismatch to relative entropy in the Poisson channel. The results for the exact (i.e., non-mismatched) setting, relating the minimum mean loss to mutual information, and causal to non-causal minimum mean estimation loss, are shown to follow as special cases. The latter results, for the exact setting, are consistent and in fact coincide with those of [14] – which considered a more general Poisson channel model that accommodates the presence of dark current – when specialized to the case of zero dark current. Our framework complements the results of [14] not only in extending the scope to the presence


and i mean everything

• i-mmse

• Duncan

• causal - non-causal

• mismatch

• minimax


the universal picture


universal denoising


universal probability assignments:

X_1, X_2, X_3, . . . , X_{i−1}, X_i, . . .
Y_1, Y_2, Y_3, . . . , Y_{i−1}, Y_i, . . .

I( X_i ; Y_i | Y^{i−1} ),    I( Y^{i−1} ; X_i | X^{i−1} )

C = lim_{n→∞} max (1/n) I( X^n → Y^n )

I( X^n → Y^n ) ≫ I( Y^{n−1} → X^n )  ⇒  “X causes Y”
I( X^n → Y^n ) ≪ I( Y^{n−1} → X^n )  ⇒  “Y causes X”
I( X^n → Y^n ) ≈ I( Y^{n−1} → X^n ) ≫ 0  ⇒  “X and Y are causing each other”
I( X^n ; Y^n ) ≈ 0  ⇒  X and Y are essentially independent

I(X → Y) = lim_{n→∞} (1/n) I( X^n → Y^n )

Q is universal if

lim_{n→∞} (1/n) D( P_{X^n} ‖ Q_{X^n} ) = 0

for every stationary P, and pointwise universal if

limsup_{n→∞} (1/n) log [ P_{X^n}(X^n) / Q_{X^n}(X^n) ] ≤ 0,  P-a.s.,

for every stationary and ergodic P.
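A classical concrete instance of a sequential probability assignment with vanishing per-symbol redundancy over the i.i.d. binary class is the Krichevsky-Trofimov (KT) estimator, which is also the building block of CTW. The sketch below only illustrates the pointwise-universality definition on one Bernoulli sample path; the source parameter and the sample size are arbitrary choices:

# The Krichevsky-Trofimov (KT) sequential probability assignment for bits:
#   Q(x_{t+1} = 1 | x^t) = (n1 + 1/2) / (t + 1).
# It is universal over i.i.d. Bernoulli sources (per-symbol redundancy ~ (log n)/(2n));
# here we just evaluate (1/n) log(P/Q) on one Bernoulli(theta) sample path.
import math, random

random.seed(1)
theta, n = 0.2, 200_000
x = [1 if random.random() < theta else 0 for _ in range(n)]

log_q = 0.0          # log Q_{X^n}(x^n), accumulated sequentially
n1 = 0
for t, bit in enumerate(x):
    p1 = (n1 + 0.5) / (t + 1)
    log_q += math.log(p1 if bit else 1.0 - p1)
    n1 += bit

log_p = sum(math.log(theta if bit else 1.0 - theta) for bit in x)   # true source

print("per-symbol log(P/Q)   :", (log_p - log_q) / n)   # -> 0 as n grows
print("reference (log n)/(2n):", math.log(n) / (2 * n))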



universal compressors (e.g.: Lempel-Ziv 78, CTW)

H(X) = Σ_x P_X(x) log [ 1 / P_X(x) ]

• I(X;Y) = I(Y;X)
• I( f(X) ; g(Y) ) = I(X;Y) if f and g are one-to-one
• chain rules
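A minimal sketch of LZ78 incremental parsing, whose phrase count is what drives the universal compression guarantee; the Bernoulli test source and the crude per-phrase codelength estimate are illustrative assumptions, and at this modest length the empirical rate only roughly approaches the entropy rate:

# A minimal LZ78 incremental parsing sketch.  For a stationary ergodic source the
# number of phrases c satisfies (c log c)/n -> entropy rate (in bits), which is what
# makes LZ78 a universal compressor.  The Bernoulli source below and the crude
# codelength estimate c*(log2(c) + 1) bits are illustrative choices only.
import math, random

random.seed(2)
theta, n = 0.2, 200_000
x = "".join("1" if random.random() < theta else "0" for _ in range(n))

dictionary = {"": 0}     # phrase -> index
phrase = ""
for ch in x:
    phrase += ch
    if phrase not in dictionary:      # end of a new phrase
        dictionary[phrase] = len(dictionary)
        phrase = ""
c = len(dictionary) - 1 + (1 if phrase else 0)   # number of parsed phrases

codelength_bits = c * (math.log2(max(c, 2)) + 1)   # ~ (prefix index + new bit) per phrase
h = -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))
print(f"phrases c = {c},  LZ78 rate ~ {codelength_bits / n:.3f} bits/symbol")
print(f"source entropy rate        = {h:.3f} bits/symbol")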


universal probability assignment

univ. sequential prob. assignment

(much more in ee376c)

univ. prediction, filtering, denoising, lossy compression
