

EE515A – Information Theory II
Spring 2012

Prof. Jeff Bilmes

University of Washington, Seattle
Department of Electrical Engineering

Spring Quarter, 2012
http://j.ee.washington.edu/~bilmes/classes/ee515a_spring_2012/

Lecture 19 - March 27th, 2012


Outstanding Reading

Read all chapters assigned from IT-I (EE514, Winter 2012).

Read Chapter 8 in the book (but what book, you might ask? You'll soon see if you don't know).


Information Theory I and II

This two-quarter course is a thorough introduction to information theory.

Information Theory I (EE514, Winter 2012): entropy, mutual information, asymptotic equipartition properties, data compression to the entropy limit (source coding theorem), Huffman codes, communication at the channel capacity limit (channel coding theorem), method of types, arithmetic coding, Fano codes.

Information Theory II (EE515, Spring 2012): Lempel-Ziv, convolutional codes, differential entropy, maximum entropy, ECC, turbo, LDPC and other codes, Kolmogorov complexity, spectral estimation, rate-distortion theory, alternating minimization for computation of the RD curve and channel capacity, more on the Gaussian channel, network information theory, information geometry, and some recent results on the use of polymatroids in information theory.

Additional topics throughout will include information theory as it applies to pattern recognition, natural language processing, computer science and complexity, biological science, and communications.


Course Web Pages

our web page (http://j.ee.washington.edu/~bilmes/classes/ee515a_spring_2012/)

our dropbox (https://catalyst.uw.edu/collectit/dropbox/bilmes/21171), which is where all homework will be due, electronically, in PDF format. No paper homework accepted.

our discussion board (https://catalyst.uw.edu/gopost/board/bilmes/27386/), which is where you can ask questions. Please use this rather than email so that all can benefit from answers to your questions.


Prerequisites

basic probability and statistics

random processes (e.g., EE505 or a Stat 5xx class).

Knowledge of MATLAB.

EE514A - Information Theory I (or equivalent).

The course is open to students in all UW departments.


Homework

There will be about 4-5 homeworks this quarter.

You will have approximately 1 to 1.5 weeks to solve the problem set.

Problem sets might also include MATLAB exercises, so you will need to have access to MATLAB; please let me know if you currently do not.

The longer problem sets will take longer to do, so please do not wait until the night before they are due to start them.


Homework Grading Strategy

We will do swap grading, i.e., you will turn in your homework, and I'll send it out randomly for others to grade.

You will receive a grade on your homework from the person who is grading it.

You will also assign a grade to your grader, for the quality of the grading (but not the grade) given to you.

This means, for each HW, you will get two grades: one for your HW, and one for the quality of the grading you do.

No collusion!

No grading your own HW.


Exams

No exams this quarter.

Final presentations: Monday, June 4th, in the late afternoon/evening (currently scheduled for 8:30am, but that is too early).


Grading

Final grades will be based on a combination of the homeworks (60%) and the final presentations (40%).

If you are active in the class (via attendance, asking good questions, etc.), this may boost your final grade.


On Final Presentations

Your task is to give a 15-20 minute presentation that summarizes 2-3 related and significant papers from IEEE Transactions on Information Theory.

The papers must not be ones that we covered in class, although they can be related.

You need to do the research to find the papers yourself (i.e., that is part of the assignment).

The papers must have been published in the last 10 years (so no oldor classic papers).

Your grade will be based on how clear, understandable, and accurate your presentation is.

This is a real challenge and will require significant work! Many of the papers are complex. To get a good grade, you will need to work to present complex ideas in a simple way.


Our Main Text

“Elements of Information Theory” by Thomas Cover and Joy Thomas, 2nd edition (2006).

It should be available at the UW bookstore, or you can get it, say, via amazon.com.

Reading assignment: Read Chapter 8.


Other Relevant Texts

“A First Course in Information Theory”, by Raymond W. Yeung, 2002. Very good chapters on information measures and network information theory.

“Elements of Information Theory”, 1991, Thomas M. Cover, Joy A. Thomas, Q360.C68 (first edition of our book).

“Information Theory and Reliable Communication”, 1968, Robert G. Gallager, Q360.G3 (classic text by a foundational researcher).

“Information Theory and Statistics”, 1968, Solomon Kullback, Math QA276.K8.

“Information Theory: Coding Theorems for Discrete Memoryless Systems”, 1981, Imre Csiszar and Janos Korner, Q360.C75 (another key book, but a little harder to read).

“Information Theory, Inference, and Learning Algorithms”, David J.C. MacKay, 2003.


Other Relevant Texts

“The Theory of Information and Coding: 2nd Edition”, Robert McEliece, Cambridge, April 2002.

“An Introduction to Information Theory: Symbols, Signals, and Noise”, John R. Pierce, Dover 1980.

“Information Theory”, Robert Ash, Dover 1965

“An Introduction to Information Theory”, Fazlollah M. Reza, Dover 1991.

“Mathematical Foundations of Information Theory”, A. Khinchin, Dover 1957 (best brief summary of key results).


Other Relevant Texts

“Convex Optimization”, Boyd and Vandenberghe

“Probability and Measure”, Billingsley

“Probability with Martingales”, Williams

“Probability: Theory and Examples”, Durrett

“Probability and Random Processes”, Grimmett and Stirzaker (great book on stochastic processes).


On Our Lecture Slides

This is the first time the class is using these lecture slides.

Slides will (hopefully) be available by the early morning before lecture.

Slides are available on the web page, as well as a printable version.

Updated slides with typo and bug fixes will be posted, as well as (buggy) ones with hand annotations/corrections/fill-ins.

Annotations are (currently) being made with Adobe Acrobat, but the iPhone/iPad PDF previewer has trouble viewing them.


Class Road Map - IT-II

L1 (3/28): Overview, Diff Entropy

L2 (3/30):

L3 (4/4):

L4 (4/6): No class.

L5 (4/11):

L6 (4/13):

L7 (4/18):

L8 (4/20): No class.

L9 (4/25):

L10 (4/27):

L11 (5/2):

L12 (5/4):

L13 (5/9):

L14 (5/11):

L15 (5/16):

L16 (5/18):

L17 (5/23):

L18 (5/25):

L19 (5/30):

L20 (6/1):

Final presentations: June 4th, 2012.


Shannon’s model of communication

[Figure: Shannon's general model of communication — source → source encoder → channel encoder → noisy channel → channel decoder → source decoder → receiver.]

This has been our guiding principle so far, and will continue to guide us for a while (the key exception is network information theory, which we will cover, where the above picture may become an arbitrary directed graph with multiple sources/receivers).

So far the channel has been discrete, but real-world channels are not discrete; we thus need differential entropy.


Sources and Channels

Source                            Channel
voice                             telephone line
text words                        disk
pictures                          disk
pictures                          Ethernet or internet
music                             mp3 file
human cells                       mitosis
population of living organisms    life

So the channel can be a typical communication link (communication over space).

The channel can also be storage (communication over time), or generations (biology).

There are many applications of IT, not just communication (e.g., Kolmogorov complexity).


This is IT-II

This is lecture 19. You know that the source compresses down to the entropy H, but no further.

[Figure: source coding — error exponent E(R) and log Pe plotted against the rate R as R → H.]

You also know that the signal may be sent through the channel at a rate no more than C.

[Figure: channel coding — error exponent E(R) and log Pe plotted against the rate R as R → C.]


This is IT-II

What if we want to compress at a rate R < H or transmit at a rate R > C? ⇒ Error.

But are all errors created equal? Are all errors as bad as each other?

We can measure errors with a distortion function, and we have generalizations of the previously stated results.

Rate-distortion curves with achievable region:

[Figure: rate R vs. distortion — the curve starts at R = H at zero distortion and separates the achievable region (above the curve) from the unachievable region (below).]

No equal rights for equal errors.


This is IT-II

Continuous Entropy

Differential Entropy

Gaussian Channel

Shannon Capacity for the Gaussian Channel

Information Measures

Generalized inequalities for entropy functions

Negative discrete information measures

The class of entropy functions

More on Compression

Lempel-Ziv compression

Kolmogorov complexity (algorithmic compression)

Philosophy


This is IT-II

Coding with errors

Rate distortion theory

Alternating minimization, Blahut-Arimoto algorithm, EM algorithm, variational EM, information geometry.

Network information theory

multi-port communication

Slepian-Wolf coding

Multiple sensors/receivers, and models for this.

Polymatroidal theory and network information theory

Codes on Graphs

Convolutional coding

Turbo codes

LDPC codes

inference and LBP


This is IT-II

And more applications in

Communications

Pattern recognition and machine learning

Computer science

Biology

Philosophy (and poetry).

Social science


Entropy

H(X) = −∑_x p(x) log p(x)

All entropic quantities we’ve encountered in IT-I have been discrete.

The world is continuous: channels are continuous, noise is continuous.

We need a theory of compression, entropy, and channel capacity that applies to such continuous domains.

We explore this next.
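To ground the discrete formula before generalizing it, here is a minimal sketch of H(X) for a finite pmf. It is written in Python/NumPy purely for illustration (the course's exercises use MATLAB), and the function name and example pmfs are ours, not from the slides.

```python
import numpy as np

def discrete_entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(discrete_entropy([0.5, 0.5]))   # 1.0 bit  (fair coin)
print(discrete_entropy([0.25] * 4))   # 2.0 bits (uniform over 4 outcomes)
```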


Continuous/Differential Entropy

Let X now be a continuous r.v. with cumulative distribution function

F(x) = Pr(X ≤ x)    (1)

and f(x) = (d/dx) F(x) is the density function.

Let S = {x : f(x) > 0} be the support set. Then:

Definition 4.1 (differential entropy h(X))

h(X) = −∫_S f(x) log f(x) dx    (2)

Since we integrate over only the support set, no worries about log 0.

Perhaps it is best to do some examples.


Continuous Entropy Of Uniform Distribution

Here, X ∼ U [0, a] with a ∈ R++.

Then

h(X) = −∫_0^a (1/a) log(1/a) dx = −log(1/a) = log a    (3)

which is negative whenever a < 1.

Note: continuous entropy can be either positive or negative.

How can entropy (which we know to mean “uncertainty” or “information”) be negative?

In fact, entropy (as we've seen perhaps once or twice) can be interpreted as the exponent of the “volume” of a typical set.

Example: 2^H(X) is the number of things that happen, on average, and we can have 2^H(X) ≪ |X|. Consider a uniform r.v. Y such that 2^H(X) = |Y|. Thus, having a negative exponent just means the volume is small.
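A small numerical sketch of equation (3), assuming Python/NumPy (names and values are illustrative): h(X) = log a in bits, which goes negative exactly when the support has length a < 1.

```python
import numpy as np

def h_uniform(a):
    """Differential entropy (bits) of X ~ U[0, a]: h(X) = -log2(1/a) = log2(a)."""
    return np.log2(a)

for a in [4.0, 1.0, 0.25]:
    print(f"a = {a:5}:  h(X) = {h_uniform(a):+.3f} bits")
# a =   4.0:  h(X) = +2.000 bits   (equivalent "side length" 2^h = 4)
# a =   1.0:  h(X) = +0.000 bits
# a =  0.25:  h(X) = -2.000 bits   (negative: small "volume")
```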


Continuous Entropy Of Normal Distribution

Normal (Gaussian) distributions are very important. We have:

X ∼ N(0, σ²)  ⇔  f(x) = (1/(2πσ²)^{1/2}) e^{−x²/(2σ²)}    (4)

Let's compute the entropy of f in nats.

h(X) = −∫ f ln f = −∫ f(x) [ −x²/(2σ²) − ln √(2πσ²) ] dx    (5)

     = E[X²]/(2σ²) + (1/2) ln(2πσ²) = 1/2 + (1/2) ln(2πσ²)    (6)

     = (1/2) ln e + (1/2) ln(2πσ²) = (1/2) ln(2πeσ²) nats × (1/ln 2 bits/nat) = (1/2) log(2πeσ²) bits    (7)

Note: this is only a function of the variance σ², not the mean. Why? So the entropy of a Gaussian is monotonically related to the variance.
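A quick sanity check of equation (7), as a sketch in Python/NumPy (not part of the course materials; the value of σ and the sample size are ours): estimate h(X) = E[−log f(X)] by Monte Carlo and compare to (1/2) log₂(2πeσ²).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Closed form from equation (7): h(X) = (1/2) log2(2*pi*e*sigma^2) bits
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# Monte Carlo: h(X) = E[-log f(X)] with X ~ N(0, sigma^2)
x = rng.normal(0.0, sigma, size=200_000)
ln_f = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)   # ln f(x) per sample
h_mc = -np.mean(ln_f) / np.log(2)                                    # nats -> bits

print(h_closed, h_mc)   # the two numbers should agree to about two decimal places
```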


AEP lives

We even have our own AEP in the continuous case, but before that, a bit more intuition.

In the discrete case, we have Pr(x_1, x_2, . . . , x_n) ≈ 2^{−nH(X)} for big n, and |A_ε^{(n)}| ≈ 2^{nH} = (2^H)^n.

Thus, 2^H can be seen as a “side length” of an n-dimensional hypercube, and 2^{nH} is like the volume of this hypercube (or the volume of the typical set).

So H being negative would just mean a small side length (2^H is small but still positive).


AEP

Things are similar for the continuous case. Indeed

Theorem 4.2

Let X_1, X_2, . . . , X_n be a sequence of r.v.'s, i.i.d. ∼ f(x). Then

−(1/n) log f(X_1, X_2, . . . , X_n) → E[−log f(X)] = h(X)    (8)

This follows from the weak law of large numbers (WLLN), just as in the discrete case.

Definition 4.3

A_ε^{(n)} = { x_{1:n} ∈ S^n : | −(1/n) log f(x_1, . . . , x_n) − h(X) | ≤ ε }

Note: f(x_1, . . . , x_n) = ∏_{i=1}^n f(x_i).

Thus, we have upper/lower bounds on the density:

2^{−n(h+ε)} ≤ f(x_{1:n}) ≤ 2^{−n(h−ε)}    (9)
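The convergence in Theorem 4.2 is easy to watch numerically. Below is a sketch (Python/NumPy; the Gaussian choice, σ, and sample sizes are our illustrative assumptions) that computes −(1/n) log₂ f(X₁, …, Xₙ) for i.i.d. N(0, σ²) draws and compares it with h(X).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # h(X) in bits, from equation (7)

for n in [10, 1_000, 100_000]:
    x = rng.normal(0.0, sigma, size=n)
    # log2 f(x_i) for each sample; the product density contributes their sum
    log2_f = -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2 / (2 * sigma**2)) / np.log(2)
    print(f"n = {n:6d}:  -(1/n) log f = {-np.mean(log2_f):.4f}   (h(X) = {h:.4f})")
```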


AEP

The volume of A ⊆ ℝ^n is well defined as:

Vol(A) = ∫_A dx_1 dx_2 . . . dx_n    (10)

Then we have:

Theorem 4.4

1. Pr(A_ε^{(n)}) > 1 − ε for n sufficiently large
2. (1 − ε) 2^{n(h(X)−ε)} ≤ Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)}

Note this is a bound on the volume Vol(A_ε^{(n)}) of the typical set.

In the discrete AEP, we bound the cardinality |A_ε^{(n)}| of the typical set, and never need entropy to be negative — H(X) ≥ 0 suffices to limit |A_ε^{(n)}| down to its lowest sensible value, namely 1.


AEP

Proof of Theorem 4.4.

1: First,

Pr(A_ε^{(n)}) = ∫_{x_{1:n} ∈ A_ε^{(n)}} f(x_1, . . . , x_n) dx_1 . . . dx_n    (11)

             = Pr( | −(1/n) log f(x_1, x_2, . . . , x_n) − h(X) | ≤ ε ) ≥ 1 − ε    (12)

for big enough n, which follows from the WLLN.

2: Next, we have

1 = ∫_{S^n} f(x_1, . . . , x_n) dx_1 . . . dx_n ≥ ∫_{A_ε^{(n)}} f(x_1, . . . , x_n) dx_1 . . . dx_n    (13)

  ≥ ∫_{A_ε^{(n)}} 2^{−n(h(X)+ε)} dx_{1:n} = 2^{−n(h(X)+ε)} Vol(A_ε^{(n)})    (14)

⇒ Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)}. . . .


AEP

Proof of Theorem 4.4, cont.

Similarly,

1 − ε ≤ Pr(A_ε^{(n)}) = ∫_{A_ε^{(n)}} f(x_{1:n}) dx_{1:n}    (15)

      ≤ ∫_{A_ε^{(n)}} 2^{−n(h(X)−ε)} dx_{1:n} = 2^{−n(h(X)−ε)} Vol(A_ε^{(n)})    (16)

which gives the lower bound Vol(A_ε^{(n)}) ≥ (1 − ε) 2^{n(h(X)−ε)}.

As in the discrete case, A_ε^{(n)} is the smallest volume that contains, essentially, all of the probability, and that volume is ≈ 2^{nh}.

If we look at (2^{nh})^{1/n}, we get a “side length” of 2^h.

So −∞ < h < ∞ is a meaningful range for entropy, since h is the exponent of the equivalent side length of the n-dimensional volume.

Large negative entropy just means small volume.


Differential vs. Discrete Entropy

Let X ∼ f(x), and divide the range of X up into bins of length ∆.

E.g., quantize the range of X using n bits, so that ∆ = 2^{−n}.

We can then view this as follows:

[Figure: a density f(x) overlaid with bins of width ∆ = 2^{−n}; within the bin [i∆, (i+1)∆) the value f(x_i) matches the average of f over the bin.]

By the mean value theorem, if f is continuous within a bin, there exists x_i such that

f(x_i) = (1/∆) ∫_{i∆}^{(i+1)∆} f(x) dx    (17)


Differential vs. Discrete Entropy

Create a quantized random variable X^∆ having those values, so that

X^∆ = x_i  if  i∆ ≤ X < (i + 1)∆    (18)

This gives a discrete distribution

Pr(X^∆ = x_i) = p_i = ∫_{i∆}^{(i+1)∆} f(x) dx = ∆ f(x_i)    (19)

and we can calculate the entropy

H(X^∆) = −∑_{i=−∞}^{∞} p_i log p_i = −∑_i f(x_i)∆ log( f(x_i)∆ )    (20)

        = −∑_i ∆ f(x_i) log f(x_i) − ∑_i f(x_i)∆ log ∆    (21)

        = −∑_i ∆ f(x_i) log f(x_i) − log ∆    (22)


Differential vs. Discrete Entropy

This follows since (as expected)

∑_i ∆ f(x_i) = ∆ ∑_i (1/∆) ∫_{i∆}^{(i+1)∆} f(x) dx = ∆ (1/∆) ∫ f(x) dx = 1    (23)

Also, as ∆ → 0, we have −log ∆ → ∞ and (assuming everything is integrable in the Riemann sense)

−∑_i ∆ f(x_i) log f(x_i) → −∫ f(x) log f(x) dx    (24)

So, H(X^∆) + log ∆ → h(f) as ∆ → 0.

Loosely, h(f) ≈ H(X^∆) + log ∆, and for an n-bit quantization with ∆ = 2^{−n}, we have

H(X^∆) ≈ h(f) − log ∆ = h(f) + n    (25)

This means that as n → ∞, H(X^∆) gets larger. Why?
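The relation H(X^∆) ≈ h(f) + n in equation (25) can be checked empirically. The sketch below (Python/NumPy; the standard-Gaussian choice, seed, and sample size are assumptions made for illustration) quantizes samples of X ∼ N(0, 1) into bins of width ∆ = 2^{−n} and estimates the discrete entropy from bin frequencies.

```python
import numpy as np

rng = np.random.default_rng(2)
h = 0.5 * np.log2(2 * np.pi * np.e)        # h(f) for N(0, 1), about 2.047 bits

x = rng.normal(size=1_000_000)
for n_bits in [2, 4, 8]:
    delta = 2.0 ** (-n_bits)
    bins = np.floor(x / delta).astype(np.int64)      # index of the bin containing each sample
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    H_quantized = -np.sum(p * np.log2(p))            # empirical H(X^Delta)
    print(f"n = {n_bits}:  H(X^D) ~ {H_quantized:.3f}   h(f) + n = {h + n_bits:.3f}")
```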


Differential vs. Discrete Entropy

This makes sense. We start with a continuous random variable X and quantize it at n-bit accuracy.

For a discrete representation of 2^n values, we expect the entropy to go up with n, and as n gets large so does the entropy, but adjusted by h(X).

H(X^∆) is the number of bits needed to describe this n-bit, equally spaced quantization of the continuous random variable X.

H(X^∆) ≈ h(f) + n says that it might take either more than n bits or fewer than n bits to describe X at n-bit accuracy.

If X is very concentrated (h(f) < 0), then fewer bits; if X is very spread out, then more than n bits.


Joint Differential Entropy

As in the discrete case, we have entropy for vectors of r.v.s.

The joint differential entropy is defined as:

h(X_1, X_2, . . . , X_n) = −∫ f(x_{1:n}) log f(x_{1:n}) dx_{1:n}    (26)

Conditional differential entropy:

h(X|Y) = −∫ f(x, y) log f(x|y) dx dy = h(X, Y) − h(Y)    (27)
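As a concrete instance of equation (27), the sketch below checks h(X|Y) = h(X, Y) − h(Y) for a bivariate Gaussian, using the Gaussian entropy formulas (the joint formula appears on the next slide). Python/NumPy and the particular covariance values are our illustrative assumptions.

```python
import numpy as np

s_xx, s_xy, s_yy = 2.0, 0.8, 1.0
Sigma = np.array([[s_xx, s_xy],
                  [s_xy, s_yy]])

def h_gauss_1d(var):
    # scalar Gaussian entropy in bits, equation (7)
    return 0.5 * np.log2(2 * np.pi * np.e * var)

h_y  = h_gauss_1d(s_yy)
h_xy = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(Sigma))   # joint entropy h(X, Y)

# For a bivariate Gaussian, X given Y = y is Gaussian with variance s_xx - s_xy^2/s_yy
# (independent of y), so h(X|Y) is the scalar formula at that conditional variance.
h_x_given_y = h_gauss_1d(s_xx - s_xy**2 / s_yy)

print(h_x_given_y, h_xy - h_y)   # the two numbers should match
```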


Entropy of a Multivariate Gaussian

When X is distributed according to a multivariate Gaussian distribution, i.e.,

X ∼ N(µ, Σ),   f(x) = (1/|2πΣ|^{1/2}) e^{−(1/2)(x−µ)ᵀ Σ^{−1} (x−µ)}    (28)

then the entropy of X has a nice form, in particular

h(X) = (1/2) log[ (2πe)^n |Σ| ] bits    (29)

Notice that the entropy is monotonically related to the determinant of the covariance matrix Σ and does not depend at all on the mean µ.

The determinant is a measure of the spread, or dispersion, of the distribution.
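A minimal sketch of equation (29) in Python/NumPy (the covariance below is an arbitrary positive-definite matrix of our choosing): it uses a log-determinant for numerical stability and makes the point that the mean never enters.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)     # an arbitrary positive-definite covariance
mu = rng.normal(size=n)             # never used below: h(X) does not depend on the mean

# h(X) = (1/2) log2[(2*pi*e)^n |Sigma|], computed via the log-determinant
sign, logdet = np.linalg.slogdet(Sigma)
h_bits = 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))
print(h_bits)
```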


Entropy of a Multivariate Gaussian: Derivation

h(X) = −∫ f(x) [ −(1/2)(x−µ)ᵀ Σ^{−1} (x−µ) − ln( (2π)^{n/2} |Σ|^{1/2} ) ] dx    (30)

     = (1/2) E_f[ tr( (x−µ)ᵀ Σ^{−1} (x−µ) ) ] + (1/2) ln[ (2π)^n |Σ| ]    (31)

     = (1/2) E_f[ tr( (x−µ)(x−µ)ᵀ Σ^{−1} ) ] + (1/2) ln[ (2π)^n |Σ| ]    (32)

     = (1/2) tr( E_f[ (x−µ)(x−µ)ᵀ ] Σ^{−1} ) + (1/2) ln[ (2π)^n |Σ| ]    (33)

     = (1/2) tr( Σ Σ^{−1} ) + (1/2) ln[ (2π)^n |Σ| ]    (34)

     = (1/2) tr( I ) + (1/2) ln[ (2π)^n |Σ| ]    (35)

     = n/2 + (1/2) ln[ (2π)^n |Σ| ] = (1/2) ln[ (2πe)^n |Σ| ]    (36)

This uses the “trace trick”, that tr(ABC) = tr(CAB).
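The key step in (33)–(35), E_f[(x−µ)ᵀ Σ^{−1} (x−µ)] = tr(I) = n, is easy to confirm by sampling. A sketch in Python/NumPy (the dimension, seed, and covariance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)                       # positive-definite covariance
mu = rng.normal(size=n)

x = rng.multivariate_normal(mu, Sigma, size=200_000)
d = x - mu
# quadratic form (x - mu)^T Sigma^{-1} (x - mu) for each sample
quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
print(np.mean(quad), "should be close to n =", n)
```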


Relative Entropy/KL-Divergence & Mutual Information

The relative entropy (or Kullback-Leibler divergence) for continuous distributions also has a familiar form:

D(f ||g) = ∫ f(x) log( f(x)/g(x) ) dx ≥ 0    (37)

We can, as in the discrete case, use Jensen's inequality to prove the non-negativity of D(f ||g).

Mutual information:

I(X; Y) = D( f(X, Y) || f(X) f(Y) ) = h(X) − h(X|Y)    (38)
        = h(Y) − h(Y|X) ≥ 0    (39)

Thus, since I(X; Y) ≥ 0, we again have that conditioning reduces entropy, i.e., h(Y) ≥ h(Y|X).
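For two univariate Gaussians, the divergence in equation (37) has a standard closed form, which makes a convenient check against a Monte Carlo estimate of the defining integral. The sketch below is in Python/NumPy; the particular means and variances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
m1, s1 = 0.0, 1.0      # f = N(0, 1)
m2, s2 = 1.0, 2.0      # g = N(1, 4)

# Monte Carlo: D(f||g) = E_f[log f(X) - log g(X)], here in nats
x = rng.normal(m1, s1, size=500_000)
ln_f = -0.5 * np.log(2 * np.pi * s1**2) - (x - m1) ** 2 / (2 * s1**2)
ln_g = -0.5 * np.log(2 * np.pi * s2**2) - (x - m2) ** 2 / (2 * s2**2)
d_mc = np.mean(ln_f - ln_g)

# Standard closed form for two univariate Gaussians (nats)
d_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

print(d_mc, d_closed)   # both about 0.44 nats; nonnegative, as equation (37) requires
```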


Chain Rules and More

We still have chain rules:

h(X_1, X_2, . . . , X_n) = ∑_i h(X_i | X_{1:i−1})    (40)

And bounds of the form:

h(X_1, X_2, . . . , X_n) ≤ ∑_i h(X_i)    (41)

For discrete entropy, we have monotonicity, i.e., H(X_1, X_2, . . . , X_k) ≤ H(X_1, X_2, . . . , X_k, X_{k+1}). More generally,

f(A) = H(X_A)    (42)

is monotone non-decreasing in the set A (i.e., f(A) ≤ f(B) for all A ⊆ B).

Is f(A) = h(X_A) monotone? No: consider the Gaussian entropy with a diagonal Σ having small diagonal values. Then h(X) = (1/2) log[ (2πe)^n |Σ| ] can get smaller as more random variables are added.

Similarly, when some variables are independent, adding independent variables with negative entropy can decrease the overall entropy.
