

EE515A – Information Theory II
Spring 2012

Prof. Jeff Bilmes

University of Washington, Seattle
Department of Electrical Engineering

Spring Quarter, 2012
http://j.ee.washington.edu/~bilmes/classes/ee515a_spring_2012/

Lecture 19 - March 27th, 2012


Outstanding Reading

Read all chapters assigned from IT-I (EE514, Winter 2012).

Read Chapter 8 in the book (but what book, you might ask? You'll soon see if you don't know).


Information Theory I and II

This two-quarter course is a thorough introduction to information theory.

Information Theory I (EE514, Winter 2012): entropy, mutual information, asymptotic equipartition properties, data compression to the entropy limit (source coding theorem), Huffman codes, communication at the channel capacity limit (channel coding theorem), method of types, arithmetic coding, Fano codes.

Information Theory II (EE515, Spring 2012): Lempel-Ziv, convolutional codes, differential entropy, maximum entropy, ECC, turbo, LDPC and other codes, Kolmogorov complexity, spectral estimation, rate-distortion theory, alternating minimization for computation of the RD curve and channel capacity, more on the Gaussian channel, network information theory, information geometry, and some recent results on the use of polymatroids in information theory.

Additional topics throughout will include information theory as it applies to pattern recognition, natural language processing, computer science and complexity, biological science, and communications.


Course Web Pages

our web page (http://j.ee.washington.edu/~bilmes/classes/ee515a_spring_2012/)

our dropbox (https://catalyst.uw.edu/collectit/dropbox/bilmes/21171), which is where all homework will be due, electronically, in PDF format. No paper homework accepted.

our discussion board (https://catalyst.uw.edu/gopost/board/bilmes/27386/), which is where you can ask questions. Please use this rather than email so that all can benefit from answers to your questions.


Prerequisites

basic probability and statistics

random processes (e.g., EE505 or a Stat 5xx class).

Knowledge of MATLAB.

EE514A - Information Theory I (or equivalent).

The course is open to students in all UW departments.


Homework

There will be about 4-5 homeworks this quarter.

You will have approximately 1 to 1.5 weeks to solve the problem set.

Problem sets might also include MATLAB exercises, so you will need to have access to MATLAB; please let me know if you currently do not.

The longer problem sets will take longer to do, so please do not wait until the night before they are due to start them.


Homework Grading Strategy

We will do swap grading, i.e., you will turn in your homework, and I'll send it out randomly for others to grade.

You will receive a grade on your homework from the person who is grading it.

You will also assign a grade to your grader, for the quality of the grading (but not the grade) given to you.

This means, for each HW, you will get two grades: one for your HW, and one for the quality of the grading you do.

No collusion!

No grading your own HW.


Exams

No exams this quarter.

Final presentations: Monday, June 4th, in the late afternoon/evening (currently scheduled for 8:30am, but that is too early).


Grading

Final grades will be based on a combination of the homeworks (60%) and the final presentations (40%).

If you are active in the class (via attendance, asking good questions, etc.), this may boost your final grade.


On Final Presentations

Your task is to give a 15-20 minute presentation that summarizes 2-3 related and significant papers from IEEE Transactions on Information Theory.

The papers must not be ones that we covered in class, although they can be related.

You need to do the research to find the papers yourself (i.e., that is part of the assignment).

The papers must have been published in the last 10 years (so no oldor classic papers).

Your grade will be based on how clear, understandable, and accurate your presentation is.

This is a real challenge and will require significant work! Many of the papers are complex. To get a good grade, you will need to work to present complex ideas in a simple way.


Our Main Text

“Elements of Information Theory” by Thomas Cover and Joy Thomas, 2nd edition (2006).

It should be available at the UW bookstore, or you can get it, say, via amazon.com.

Reading assignment: Read Chapter 8.


Other Relevant Texts

“A First Course in Information Theory”, by Raymond W. Yeung, 2002. Very good chapters on information measures and network information theory.

“Elements of Information Theory”, 1991, Thomas M. Cover, Joy A. Thomas, Q360.C68 (first edition of our book).

“Information Theory and Reliable Communication”, 1968, Robert G. Gallager, Q360.G3 (classic text by a foundational researcher).

“Information Theory and Statistics”, 1968, Solomon Kullback, Math QA276.K8.

“Information Theory: Coding Theorems for Discrete Memoryless Systems”, 1981, Imre Csiszar and Janos Korner, Q360.C75 (another key book, but a little harder to read).

“Information Theory, Inference, and Learning Algorithms”, David J.C. MacKay, 2003.


Other Relevant Texts

“The Theory of Information and Coding: 2nd Edition”, Robert McEliece, Cambridge, April 2002.

“An Introduction to Information Theory: Symbols, Signals, and Noise”, John R. Pierce, Dover 1980.

“Information Theory”, Robert Ash, Dover 1965

“An Introduction to Information Theory”, Fazlollah M. Reza, Dover 1991.

“Mathematical Foundations of Information Theory”, A. Khinchin, Dover 1957 (best brief summary of key results).


Other Relevant Texts

“Convex Optimization”, Boyd and Vandenberghe

“Probability and Measure”, Billingsley

“Probability with Martingales”, Williams

“Probability: Theory and Examples”, Durrett

“Probability and Random Processes”, Grimmett and Stirzaker (great book on stochastic processes).


On Our Lecture Slides

This is the first time the class is using these lecture slides.

Slides will (hopefully) be available by the early morning before lecture.

Slides are available on the web page, as well as a printable version.

Updated slides with typo and bug fixes will be posted, as well as (buggy) ones with hand annotations/corrections/fill-ins.

Annotations are (currently) being made with Adobe Acrobat, but the iPhone/iPad PDF previewer has trouble viewing them.


Class Road Map - IT-II

L1 (3/28): Overview, Diff Entropy

L2 (3/30):

L3 (4/4):

L4 (4/6): No class.

L5 (4/11):

L6 (4/13):

L7 (4/18):

L8 (4/20): No class.

L9 (4/25):

L10 (4/27):

L11 (5/2):

L12 (5/4):

L13 (5/9):

L14 (5/11):

L15 (5/16):

L16 (5/18):

L17 (5/23):

L18 (5/25):

L19 (5/30):

L20 (6/1):

Final presentations: June 4th, 2012.


Shannon’s model of communication

[Figure: Shannon's general model of communication — source → source encoder → channel encoder → noisy channel → channel decoder → source decoder → receiver.]

This has been our guiding principle so far, and will continue to guide us for a while (the key exception is network information theory, which we will cover, where the above picture may become an arbitrary directed graph with multiple sources/receivers).

So far the channel has been discrete, but real-world channels are not discrete; we thus need differential entropy.


Sources and Channels

Source                            Channel
voice                             telephone line
text words                        disk
pictures                          disk
pictures                          Ethernet or internet
music                             mp3 file
human cells                       mitosis
population of living organisms    life

So the channel can be a typical communication link (communication over space).

The channel can also be storage (communication over time), or generations (biology).

There are many applications of IT, not just communication (e.g., Kolmogorov complexity).


This is IT-II

This is lecture 19. You know that the source compresses down to the entropy H, but no further.

[Figure: source coding — error exponent E(R) and log Pe plotted against the rate R as R → H.]

You also know that the signal may be sent through the channel at a rate no more than C.

[Figure: channel coding — error exponent E(R) and log Pe plotted against the rate R as R → C.]


This is IT-II

What if we want to compress at a rate R < H or transmit at a rate R > C? ⇒ Error.

But are all errors created equal? Are all errors as bad as each other?

We can measure errors with a distortion function, and we have generalizations of the previously stated results.

Rate-distortion curves with achievable region:

[Figure: rate R vs. distortion — the curve starts at R = H at zero distortion and separates the achievable region (above the curve) from the unachievable region (below).]

No equal rights for equal errors.


This is IT-II

Continuous Entropy

Differential Entropy

Gaussian Channel

Shannon Capacity for the Gaussian Channel

Information Measures

Generalized inequalities for entropy functions

Negative discrete information measures

The class of entropy functions

More on Compression

Lempel-Ziv compression

Kolmogorov complexity (algorithmic compression)

Philosophy


This is IT-II

Coding with errors

Rate distortion theory

Alternating minimization, Blahut-Arimoto algorithm, EM algorithm, variational EM, information geometry.

Network information theory

multi-port communication

Slepian-Wolf coding

Multiple sensors/receivers, and models for this.

Polymatroidal theory and network information theory

Codes on Graphs

Convolutional coding

Turbo codes

LDPC codes

inference and LBP


This is IT-II

And more applications in

Communications

Pattern recognition and machine learning

Computer science

Biology

Philosophy (and poetry).

Social science


Entropy

H(X) = −∑_x p(x) log p(x)

All entropic quantities we’ve encountered in IT-I have been discrete.

The world is continuous: channels are continuous, noise is continuous.

We need a theory of compression, entropy, and channel capacity that applies to such continuous domains.

We explore this next.
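To ground the discrete formula before generalizing it, here is a minimal sketch of H(X) for a finite pmf. It is written in Python/NumPy purely for illustration (the course's exercises use MATLAB), and the function name and example pmfs are ours, not from the slides.

```python
import numpy as np

def discrete_entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(discrete_entropy([0.5, 0.5]))   # 1.0 bit  (fair coin)
print(discrete_entropy([0.25] * 4))   # 2.0 bits (uniform over 4 outcomes)
```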


Continuous/Differential Entropy

Let X now be a continuous r.v. with cumulative distribution function

F(x) = Pr(X ≤ x)    (1)

and f(x) = (d/dx) F(x) is the density function.

Let S = {x : f(x) > 0} be the support set. Then:

Definition 4.1 (differential entropy h(X))

h(X) = −∫_S f(x) log f(x) dx    (2)

Since we integrate over only the support set, no worries about log 0.

Perhaps it is best to do some examples.


Continuous Entropy Of Uniform Distribution

Here, X ∼ U [0, a] with a ∈ R++.

Then

h(X) = −∫_0^a (1/a) log(1/a) dx = −log(1/a) = log a    (3)

which is negative whenever a < 1.

Note: continuous entropy can be either positive or negative.

How can entropy (which we know to mean “uncertainty” or “information”) be negative?

In fact, entropy (as we've seen perhaps once or twice) can be interpreted as the exponent of the “volume” of a typical set.

Example: 2^H(X) is the number of things that happen, on average, and we can have 2^H(X) ≪ |X|. Consider a uniform r.v. Y such that 2^H(X) = |Y|. Thus, having a negative exponent just means the volume is small.
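A small numerical sketch of equation (3), assuming Python/NumPy (names and values are illustrative): h(X) = log a in bits, which goes negative exactly when the support has length a < 1.

```python
import numpy as np

def h_uniform(a):
    """Differential entropy (bits) of X ~ U[0, a]: h(X) = -log2(1/a) = log2(a)."""
    return np.log2(a)

for a in [4.0, 1.0, 0.25]:
    print(f"a = {a:5}:  h(X) = {h_uniform(a):+.3f} bits")
# a =   4.0:  h(X) = +2.000 bits   (equivalent "side length" 2^h = 4)
# a =   1.0:  h(X) = +0.000 bits
# a =  0.25:  h(X) = -2.000 bits   (negative: small "volume")
```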


Continuous Entropy Of Normal Distribution

Normal (Gaussian) distributions are very important. We have:

X ∼ N(0, σ²)  ⇔  f(x) = (1/(2πσ²)^{1/2}) e^{−x²/(2σ²)}    (4)

Let's compute the entropy of f in nats.

h(X) = −∫ f ln f = −∫ f(x) [ −x²/(2σ²) − ln √(2πσ²) ] dx    (5)

     = E[X²]/(2σ²) + (1/2) ln(2πσ²) = 1/2 + (1/2) ln(2πσ²)    (6)

     = (1/2) ln e + (1/2) ln(2πσ²) = (1/2) ln(2πeσ²) nats × (1/ln 2 bits/nat) = (1/2) log(2πeσ²) bits    (7)

Note: this is only a function of the variance σ², not the mean. Why? So the entropy of a Gaussian is monotonically related to the variance.
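A quick sanity check of equation (7), as a sketch in Python/NumPy (not part of the course materials; the value of σ and the sample size are ours): estimate h(X) = E[−log f(X)] by Monte Carlo and compare to (1/2) log₂(2πeσ²).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Closed form from equation (7): h(X) = (1/2) log2(2*pi*e*sigma^2) bits
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# Monte Carlo: h(X) = E[-log f(X)] with X ~ N(0, sigma^2)
x = rng.normal(0.0, sigma, size=200_000)
ln_f = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)   # ln f(x) per sample
h_mc = -np.mean(ln_f) / np.log(2)                                    # nats -> bits

print(h_closed, h_mc)   # the two numbers should agree to about two decimal places
```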


AEP lives

We even have our own AEP in the continuous case, but before that, a bit more intuition.

In the discrete case, we have Pr(x_1, x_2, . . . , x_n) ≈ 2^{−nH(X)} for big n, and |A_ε^{(n)}| ≈ 2^{nH} = (2^H)^n.

Thus, 2^H can be seen as a “side length” of an n-dimensional hypercube, and 2^{nH} is like the volume of this hypercube (or the volume of the typical set).

So H being negative would just mean a small side length (2^H is small but still positive).


AEP

Things are similar for the continuous case. Indeed

Theorem 4.2

Let X_1, X_2, . . . , X_n be a sequence of r.v.'s, i.i.d. ∼ f(x). Then

−(1/n) log f(X_1, X_2, . . . , X_n) → E[−log f(X)] = h(X)    (8)

This follows from the weak law of large numbers (WLLN), just as in the discrete case.

Definition 4.3

A_ε^{(n)} = { x_{1:n} ∈ S^n : | −(1/n) log f(x_1, . . . , x_n) − h(X) | ≤ ε }

Note: f(x_1, . . . , x_n) = ∏_{i=1}^n f(x_i).

Thus, we have upper/lower bounds on the density:

2^{−n(h+ε)} ≤ f(x_{1:n}) ≤ 2^{−n(h−ε)}    (9)
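The convergence in Theorem 4.2 is easy to watch numerically. Below is a sketch (Python/NumPy; the Gaussian choice, σ, and sample sizes are our illustrative assumptions) that computes −(1/n) log₂ f(X₁, …, Xₙ) for i.i.d. N(0, σ²) draws and compares it with h(X).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # h(X) in bits, from equation (7)

for n in [10, 1_000, 100_000]:
    x = rng.normal(0.0, sigma, size=n)
    # log2 f(x_i) for each sample; the product density contributes their sum
    log2_f = -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2 / (2 * sigma**2)) / np.log(2)
    print(f"n = {n:6d}:  -(1/n) log f = {-np.mean(log2_f):.4f}   (h(X) = {h:.4f})")
```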


AEP

The volume of A ⊆ ℝ^n is well defined as:

Vol(A) = ∫_A dx_1 dx_2 . . . dx_n    (10)

Then we have:

Theorem 4.4

1. Pr(A_ε^{(n)}) > 1 − ε for n sufficiently large
2. (1 − ε) 2^{n(h(X)−ε)} ≤ Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)}

Note this is a bound on the volume Vol(A_ε^{(n)}) of the typical set.

In the discrete AEP, we bound the cardinality |A_ε^{(n)}| of the typical set, and never need entropy to be negative — H(X) ≥ 0 suffices to limit |A_ε^{(n)}| down to its lowest sensible value, namely 1.


AEP

Proof of Theorem 4.4.

1: First,

Pr(A_ε^{(n)}) = ∫_{x_{1:n} ∈ A_ε^{(n)}} f(x_1, . . . , x_n) dx_1 . . . dx_n    (11)

             = Pr( | −(1/n) log f(x_1, x_2, . . . , x_n) − h(X) | ≤ ε ) ≥ 1 − ε    (12)

for big enough n, which follows from the WLLN.

2: Next, we have

1 = ∫_{S^n} f(x_1, . . . , x_n) dx_1 . . . dx_n ≥ ∫_{A_ε^{(n)}} f(x_1, . . . , x_n) dx_1 . . . dx_n    (13)

  ≥ ∫_{A_ε^{(n)}} 2^{−n(h(X)+ε)} dx_{1:n} = 2^{−n(h(X)+ε)} Vol(A_ε^{(n)})    (14)

⇒ Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)}. . . .


AEP

Proof of Theorem 4.4, cont.

Similarly,

1 − ε ≤ Pr(A_ε^{(n)}) = ∫_{A_ε^{(n)}} f(x_{1:n}) dx_{1:n}    (15)

      ≤ ∫_{A_ε^{(n)}} 2^{−n(h(X)−ε)} dx_{1:n} = 2^{−n(h(X)−ε)} Vol(A_ε^{(n)})    (16)

which gives the lower bound Vol(A_ε^{(n)}) ≥ (1 − ε) 2^{n(h(X)−ε)}.

As in the discrete case, A_ε^{(n)} is the smallest volume that contains, essentially, all of the probability, and that volume is ≈ 2^{nh}.

If we look at (2^{nh})^{1/n}, we get a “side length” of 2^h.

So −∞ < h < ∞ is a meaningful range for entropy, since h is the exponent of the equivalent side length of the n-dimensional volume.

Large negative entropy just means small volume.


Differential vs. Discrete Entropy

Let X ∼ f(x), and divide the range of X up into bins of length ∆.

E.g., quantize the range of X using n bits, so that ∆ = 2^{−n}.

We can then view this as follows:

[Figure: a density f(x) overlaid with bins of width ∆ = 2^{−n}; within the bin [i∆, (i+1)∆) the value f(x_i) matches the average of f over the bin.]

By the mean value theorem, if f is continuous within a bin, there exists x_i such that

f(x_i) = (1/∆) ∫_{i∆}^{(i+1)∆} f(x) dx    (17)


Differential vs. Discrete Entropy

Create a quantized random variable X^∆ having those values, so that

X^∆ = x_i  if  i∆ ≤ X < (i + 1)∆    (18)

This gives a discrete distribution

Pr(X^∆ = x_i) = p_i = ∫_{i∆}^{(i+1)∆} f(x) dx = ∆ f(x_i)    (19)

and we can calculate the entropy

H(X^∆) = −∑_{i=−∞}^{∞} p_i log p_i = −∑_i f(x_i)∆ log( f(x_i)∆ )    (20)

        = −∑_i ∆ f(x_i) log f(x_i) − ∑_i f(x_i)∆ log ∆    (21)

        = −∑_i ∆ f(x_i) log f(x_i) − log ∆    (22)


Differential vs. Discrete Entropy

This follows since (as expected)

∑_i ∆ f(x_i) = ∆ ∑_i (1/∆) ∫_{i∆}^{(i+1)∆} f(x) dx = ∆ (1/∆) ∫ f(x) dx = 1    (23)

Also, as ∆ → 0, we have −log ∆ → ∞ and (assuming everything is integrable in the Riemann sense)

−∑_i ∆ f(x_i) log f(x_i) → −∫ f(x) log f(x) dx    (24)

So, H(X^∆) + log ∆ → h(f) as ∆ → 0.

Loosely, h(f) ≈ H(X^∆) + log ∆, and for an n-bit quantization with ∆ = 2^{−n}, we have

H(X^∆) ≈ h(f) − log ∆ = h(f) + n    (25)

This means that as n → ∞, H(X^∆) gets larger. Why?
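The relation H(X^∆) ≈ h(f) + n in equation (25) can be checked empirically. The sketch below (Python/NumPy; the standard-Gaussian choice, seed, and sample size are assumptions made for illustration) quantizes samples of X ∼ N(0, 1) into bins of width ∆ = 2^{−n} and estimates the discrete entropy from bin frequencies.

```python
import numpy as np

rng = np.random.default_rng(2)
h = 0.5 * np.log2(2 * np.pi * np.e)        # h(f) for N(0, 1), about 2.047 bits

x = rng.normal(size=1_000_000)
for n_bits in [2, 4, 8]:
    delta = 2.0 ** (-n_bits)
    bins = np.floor(x / delta).astype(np.int64)      # index of the bin containing each sample
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    H_quantized = -np.sum(p * np.log2(p))            # empirical H(X^Delta)
    print(f"n = {n_bits}:  H(X^D) ~ {H_quantized:.3f}   h(f) + n = {h + n_bits:.3f}")
```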


Differential vs. Discrete Entropy

This makes sense. We start with a continuous random variable X and quantize it at n-bit accuracy.

For a discrete representation of 2^n values, we expect the entropy to go up with n, and as n gets large so does the entropy, but adjusted by h(X).

H(X^∆) is the number of bits needed to describe this n-bit, equally spaced quantization of the continuous random variable X.

H(X^∆) ≈ h(f) + n says that it might take either more than n bits or fewer than n bits to describe X at n-bit accuracy.

If X is very concentrated (h(f) < 0), then fewer bits; if X is very spread out, then more than n bits.


Joint Differential Entropy

As in the discrete case, we have entropy for vectors of r.v.s.

The joint differential entropy is defined as:

h(X_1, X_2, . . . , X_n) = −∫ f(x_{1:n}) log f(x_{1:n}) dx_{1:n}    (26)

Conditional differential entropy:

h(X|Y) = −∫ f(x, y) log f(x|y) dx dy = h(X, Y) − h(Y)    (27)
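As a concrete instance of equation (27), the sketch below checks h(X|Y) = h(X, Y) − h(Y) for a bivariate Gaussian, using the Gaussian entropy formulas (the joint formula appears on the next slide). Python/NumPy and the particular covariance values are our illustrative assumptions.

```python
import numpy as np

s_xx, s_xy, s_yy = 2.0, 0.8, 1.0
Sigma = np.array([[s_xx, s_xy],
                  [s_xy, s_yy]])

def h_gauss_1d(var):
    # scalar Gaussian entropy in bits, equation (7)
    return 0.5 * np.log2(2 * np.pi * np.e * var)

h_y  = h_gauss_1d(s_yy)
h_xy = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(Sigma))   # joint entropy h(X, Y)

# For a bivariate Gaussian, X given Y = y is Gaussian with variance s_xx - s_xy^2/s_yy
# (independent of y), so h(X|Y) is the scalar formula at that conditional variance.
h_x_given_y = h_gauss_1d(s_xx - s_xy**2 / s_yy)

print(h_x_given_y, h_xy - h_y)   # the two numbers should match
```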


Entropy of a Multivariate Gaussian

When X is distributed according to a multivariate Gaussian distribution, i.e.,

X ∼ N(µ, Σ),   f(x) = (1/|2πΣ|^{1/2}) e^{−(1/2)(x−µ)ᵀ Σ^{−1} (x−µ)}    (28)

then the entropy of X has a nice form, in particular

h(X) = (1/2) log[ (2πe)^n |Σ| ] bits    (29)

Notice that the entropy is monotonically related to the determinant of the covariance matrix Σ and does not depend at all on the mean µ.

The determinant is a measure of the spread, or dispersion, of the distribution.
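A minimal sketch of equation (29) in Python/NumPy (the covariance below is an arbitrary positive-definite matrix of our choosing): it uses a log-determinant for numerical stability and makes the point that the mean never enters.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)     # an arbitrary positive-definite covariance
mu = rng.normal(size=n)             # never used below: h(X) does not depend on the mean

# h(X) = (1/2) log2[(2*pi*e)^n |Sigma|], computed via the log-determinant
sign, logdet = np.linalg.slogdet(Sigma)
h_bits = 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))
print(h_bits)
```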


Entropy of a Multivariate Gaussian: Derivation

h(X) = −∫ f(x) [ −(1/2)(x−µ)ᵀ Σ^{−1} (x−µ) − ln( (2π)^{n/2} |Σ|^{1/2} ) ] dx    (30)

     = (1/2) E_f[ tr( (x−µ)ᵀ Σ^{−1} (x−µ) ) ] + (1/2) ln[ (2π)^n |Σ| ]    (31)

     = (1/2) E_f[ tr( (x−µ)(x−µ)ᵀ Σ^{−1} ) ] + (1/2) ln[ (2π)^n |Σ| ]    (32)

     = (1/2) tr( E_f[ (x−µ)(x−µ)ᵀ ] Σ^{−1} ) + (1/2) ln[ (2π)^n |Σ| ]    (33)

     = (1/2) tr( Σ Σ^{−1} ) + (1/2) ln[ (2π)^n |Σ| ]    (34)

     = (1/2) tr( I ) + (1/2) ln[ (2π)^n |Σ| ]    (35)

     = n/2 + (1/2) ln[ (2π)^n |Σ| ] = (1/2) ln[ (2πe)^n |Σ| ]    (36)

This uses the “trace trick”, that tr(ABC) = tr(CAB).
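The key step in (33)–(35), E_f[(x−µ)ᵀ Σ^{−1} (x−µ)] = tr(I) = n, is easy to confirm by sampling. A sketch in Python/NumPy (the dimension, seed, and covariance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)                       # positive-definite covariance
mu = rng.normal(size=n)

x = rng.multivariate_normal(mu, Sigma, size=200_000)
d = x - mu
# quadratic form (x - mu)^T Sigma^{-1} (x - mu) for each sample
quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
print(np.mean(quad), "should be close to n =", n)
```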


Relative Entropy/KL-Divergence & Mutual Information

The relative entropy (or Kullback-Leibler divergence) for continuous distributions also has a familiar form:

D(f ||g) = ∫ f(x) log( f(x)/g(x) ) dx ≥ 0    (37)

We can, as in the discrete case, use Jensen's inequality to prove the non-negativity of D(f ||g).

Mutual information:

I(X; Y) = D( f(X, Y) || f(X) f(Y) ) = h(X) − h(X|Y)    (38)
        = h(Y) − h(Y|X) ≥ 0    (39)

Thus, since I(X; Y) ≥ 0, we again have that conditioning reduces entropy, i.e., h(Y) ≥ h(Y|X).
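For two univariate Gaussians, the divergence in equation (37) has a standard closed form, which makes a convenient check against a Monte Carlo estimate of the defining integral. The sketch below is in Python/NumPy; the particular means and variances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
m1, s1 = 0.0, 1.0      # f = N(0, 1)
m2, s2 = 1.0, 2.0      # g = N(1, 4)

# Monte Carlo: D(f||g) = E_f[log f(X) - log g(X)], here in nats
x = rng.normal(m1, s1, size=500_000)
ln_f = -0.5 * np.log(2 * np.pi * s1**2) - (x - m1) ** 2 / (2 * s1**2)
ln_g = -0.5 * np.log(2 * np.pi * s2**2) - (x - m2) ** 2 / (2 * s2**2)
d_mc = np.mean(ln_f - ln_g)

# Standard closed form for two univariate Gaussians (nats)
d_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

print(d_mc, d_closed)   # both about 0.44 nats; nonnegative, as equation (37) requires
```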


Chain Rules and More

We still have chain rules:

h(X_1, X_2, . . . , X_n) = ∑_i h(X_i | X_{1:i−1})    (40)

And bounds of the form:

h(X_1, X_2, . . . , X_n) ≤ ∑_i h(X_i)    (41)

For discrete entropy, we have monotonicity, i.e., H(X_1, X_2, . . . , X_k) ≤ H(X_1, X_2, . . . , X_k, X_{k+1}). More generally,

f(A) = H(X_A)    (42)

is monotone non-decreasing in the set A (i.e., f(A) ≤ f(B) for all A ⊆ B).

Is f(A) = h(X_A) monotone? No: consider the Gaussian entropy with a diagonal Σ having small diagonal values. Then h(X) = (1/2) log[ (2πe)^n |Σ| ] can get smaller as more random variables are added.

Similarly, when some variables are independent, adding independent variables with negative entropy can decrease the overall entropy.
