
Elements of Information Theory

Second Edition

Solutions to Problems

Thomas M. Cover

Joy A. Thomas

September 22, 2006


COPYRIGHT 2006

Thomas Cover, Joy Thomas

All rights reserved


Contents

1 Introduction

2 Entropy, Relative Entropy and Mutual Information

3 The Asymptotic Equipartition Property

4 Entropy Rates of a Stochastic Process

5 Data Compression

6 Gambling and Data Compression


Preface

The problems in the book, “Elements of Information Theory, Second Edition”, were chosen from the problems used during the course at Stanford. Most of the solutions here were prepared by the graders and instructors of the course. We would particularly like to thank Prof. John Gill, David Evans, Jim Roche, Laura Ekroot and Young Han Kim for their help in preparing these solutions.

Most of the problems in the book are straightforward, and we have included hints in the problem statement for the difficult problems. In some cases, the solutions include extra material of interest (for example, the coin weighing problem in Chapter 2).

We would appreciate any comments, suggestions and corrections to this Solutions Manual.

Tom Cover
Durand 121, Information Systems Lab
Stanford University
Stanford, CA 94305.
Ph. 415-723-4505
FAX: 415-723-8473
Email: [email protected]

Joy Thomas
Stratify
701 N Shoreline Avenue
Mountain View, CA 94043.
Ph. 650-210-2722
FAX: 650-988-2159
Email: [email protected]


Chapter 1

Introduction


Chapter 2

Entropy, Relative Entropy and Mutual Information

1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.

(a) Find the entropy H(X) in bits. The following expressions may be useful:

\[ \sum_{n=0}^{\infty} r^n = \frac{1}{1-r}, \qquad \sum_{n=0}^{\infty} n r^n = \frac{r}{(1-r)^2}. \]

(b) A random variable X is drawn according to this distribution. Find an “efficient” sequence of yes-no questions of the form, “Is X contained in the set S?” Compare H(X) to the expected number of questions required to determine X .

Solution:

(a) The number X of tosses until the first head appears has the geometric distribution with parameter p = 1/2, where P(X = n) = pq^{n-1}, n ∈ {1, 2, . . .}. Hence the entropy of X is

\begin{align*}
H(X) &= -\sum_{n=1}^{\infty} pq^{n-1} \log(pq^{n-1}) \\
&= -\left[ \sum_{n=0}^{\infty} pq^n \log p + \sum_{n=0}^{\infty} n p q^n \log q \right] \\
&= \frac{-p\log p}{1-q} - \frac{pq\log q}{p^2} \\
&= \frac{-p\log p - q\log q}{p} \\
&= H(p)/p \ \text{bits.}
\end{align*}

If p = 1/2 , then H(X) = 2 bits.


(b) Intuitively, it seems clear that the best questions are those that have equally likely chances of receiving a yes or a no answer. Consequently, one possible guess is that the most “efficient” series of questions is: Is X = 1? If not, is X = 2? If not, is X = 3? . . . with a resulting expected number of questions equal to \sum_{n=1}^{\infty} n(1/2^n) = 2. This should reinforce the intuition that H(X) is a measure of the uncertainty of X . Indeed in this case, the entropy is exactly the same as the average number of questions needed to define X , and in general E(# of questions) ≥ H(X) . This problem has an interpretation as a source coding problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then the set of questions in the above procedure can be written as a collection of (X,Y ) pairs: (1, 1) , (2, 01) , (3, 001) , etc. In fact, this intuitively derived code is the optimal (Huffman) code minimizing the expected number of questions.
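As a quick numerical check (not part of the original solution), the Python sketch below truncates the geometric distribution at a large N, verifies that the entropy is close to H(p)/p = 2 bits for p = 1/2, and that the expected number of questions under the "Is X = 1? Is X = 2? ..." strategy is also 2.

    import math

    def geometric_pmf(p, N=200):
        # P(X = n) = p * q^(n-1), truncated at N (tail mass is negligible for p = 1/2)
        q = 1 - p
        return {n: p * q ** (n - 1) for n in range(1, N + 1)}

    def entropy(pmf):
        return -sum(pr * math.log2(pr) for pr in pmf.values() if pr > 0)

    p = 0.5
    pmf = geometric_pmf(p)
    binary_entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    print(entropy(pmf), binary_entropy / p)        # both close to 2.0
    # expected number of questions "Is X = 1?", "Is X = 2?", ... equals E[X]
    print(sum(n * pr for n, pr in pmf.items()))    # close to 2.0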

2. Entropy of functions. Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship of H(X) and H(Y ) if

(a) Y = 2X ?

(b) Y = cosX ?

Solution: Let y = g(x) . Then

\[ p(y) = \sum_{x:\, y = g(x)} p(x). \]

Consider any set of x 's that map onto a single y . For this set

\[ \sum_{x:\, y=g(x)} p(x) \log p(x) \le \sum_{x:\, y=g(x)} p(x) \log p(y) = p(y) \log p(y), \]

since log is a monotone increasing function and p(x) ≤ \sum_{x:\, y=g(x)} p(x) = p(y) . Extending this argument to the entire range of X (and Y ), we obtain

\begin{align*}
H(X) &= -\sum_x p(x) \log p(x) \\
&= -\sum_y \sum_{x:\, y=g(x)} p(x) \log p(x) \\
&\ge -\sum_y p(y) \log p(y) \\
&= H(Y),
\end{align*}

with equality iff g is one-to-one with probability one.

(a) Y = 2X is one-to-one and hence the entropy, which is just a function of the probabilities (and not the values of a random variable), does not change, i.e., H(X) = H(Y ) .

(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that H(X) ≥ H(Y ) , with equality if cosine is one-to-one on the range of X .
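A small numerical illustration of both cases (added here, not part of the original solution; the uniform distribution on {0, 1, 2, 3} is an arbitrary choice):

    import math
    from collections import defaultdict

    def entropy(pmf):
        return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

    def pushforward(pmf, g):
        # distribution of Y = g(X)
        out = defaultdict(float)
        for x, p in pmf.items():
            out[g(x)] += p
        return out

    px = {x: 0.25 for x in range(4)}
    print(entropy(px))                                   # 2.0
    print(entropy(pushforward(px, lambda x: 2 ** x)))    # 2.0, since x -> 2^x is one-to-one
    print(entropy(pushforward(px, lambda x: round(math.cos(math.pi * x), 6))))  # 1.0 < 2.0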


3. Minimum entropy. What is the minimum value of H(p1, ..., pn) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p 's which achieve this minimum.

Solution: We wish to find all probability vectors p = (p1, p2, . . . , pn) which minimize

\[ H(\mathbf{p}) = -\sum_i p_i \log p_i. \]

Now −pi log pi ≥ 0 , with equality iff pi = 0 or 1. Hence the only possible probability vectors which minimize H(p) are those with pi = 1 for some i and pj = 0, j ≠ i . There are n such vectors, i.e., (1, 0, . . . , 0) , (0, 1, 0, . . . , 0) , . . . , (0, . . . , 0, 1) , and the minimum value of H(p) is 0.

4. Entropy of functions of a random variable. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:

\begin{align}
H(X, g(X)) &\stackrel{\rm (a)}{=} H(X) + H(g(X) \mid X) \tag{2.1}\\
&\stackrel{\rm (b)}{=} H(X); \tag{2.2}\\
H(X, g(X)) &\stackrel{\rm (c)}{=} H(g(X)) + H(X \mid g(X)) \tag{2.3}\\
&\stackrel{\rm (d)}{\ge} H(g(X)). \tag{2.4}
\end{align}

Thus H(g(X)) ≤ H(X).

Solution: Entropy of functions of a random variable.

(a) H(X, g(X)) = H(X) +H(g(X)|X) by the chain rule for entropies.

(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence H(g(X)|X) = \sum_x p(x) H(g(X)|X = x) = \sum_x 0 = 0 .

(c) H(X, g(X)) = H(g(X)) +H(X|g(X)) again by the chain rule.

(d) H(X|g(X)) ≥ 0 , with equality iff X is a function of g(X) , i.e., g(·) is one-to-one. Hence H(X, g(X)) ≥ H(g(X)) .

Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)) .

5. Zero conditional entropy. Show that if H(Y |X) = 0 , then Y is a function of X , i.e., for all x with p(x) > 0 , there is only one possible value of y with p(x, y) > 0 .

Solution: Zero Conditional Entropy. Assume that there exists an x , say x0, and two different values of y , say y1 and y2, such that p(x0, y1) > 0 and p(x0, y2) > 0 . Then p(x0) ≥ p(x0, y1) + p(x0, y2) > 0 , and p(y1|x0) and p(y2|x0) are not equal to 0 or 1. Thus

\begin{align}
H(Y|X) &= -\sum_x p(x) \sum_y p(y|x) \log p(y|x) \tag{2.5}\\
&\ge p(x_0)\left( -p(y_1|x_0)\log p(y_1|x_0) - p(y_2|x_0)\log p(y_2|x_0) \right) \tag{2.6}\\
&> 0, \tag{2.7}
\end{align}


since −t log t ≥ 0 for 0 ≤ t ≤ 1 , and is strictly positive for t not equal to 0 or 1. Therefore the conditional entropy H(Y |X) is 0 if and only if Y is a function of X .

6. Conditional mutual information vs. unconditional mutual information. Give examples of joint random variables X , Y and Z such that

(a) I(X;Y | Z) < I(X;Y ) ,

(b) I(X;Y | Z) > I(X;Y ) .

Solution: Conditional mutual information vs. unconditional mutual information.

(a) The last corollary to Theorem 2.8.1 in the text states that if X → Y → Z , that is, if p(x, z | y) = p(x | y)p(z | y) , then I(X;Y ) ≥ I(X;Y | Z) . Equality holds if and only if I(X;Z) = 0 , i.e., X and Z are independent.

A simple example of random variables satisfying the inequality conditions above is: X a fair binary random variable, Y = X and Z = Y . In this case,

I(X;Y ) = H(X) −H(X | Y ) = H(X) = 1

and

I(X;Y | Z) = H(X | Z) − H(X | Y, Z) = 0.

So that I(X;Y ) > I(X;Y | Z) .

(b) This example is also given in the text. Let X, Y be independent fair binary random variables and let Z = X + Y . In this case we have that

I(X;Y ) = 0

and

I(X;Y | Z) = H(X | Z) = 1/2.

So I(X;Y ) < I(X;Y | Z) . Note that in this case X, Y, Z do not form a Markov chain.
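The numbers in part (b) are easy to reproduce from the joint pmf; the sketch below (an added illustration, not from the original text) computes I(X;Y) and I(X;Y|Z) for X, Y independent Bernoulli(1/2) and Z = X + Y.

    import math
    from collections import defaultdict

    def H(pmf):
        return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

    def marginal(joint, idx):
        out = defaultdict(float)
        for k, p in joint.items():
            out[tuple(k[i] for i in idx)] += p
        return out

    # joint pmf of (X, Y, Z) with X, Y iid Bernoulli(1/2) and Z = X + Y
    joint = {(x, y, x + y): 0.25 for x in (0, 1) for y in (0, 1)}

    I_XY = H(marginal(joint, [0])) + H(marginal(joint, [1])) - H(marginal(joint, [0, 1]))
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
    I_XY_given_Z = (H(marginal(joint, [0, 2])) + H(marginal(joint, [1, 2]))
                    - H(marginal(joint, [2])) - H(joint))
    print(I_XY, I_XY_given_Z)   # 0.0 and 0.5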

7. Coin weighing. Suppose one has n coins, among which there may or may not be one counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The coins are to be weighed by a balance.

(a) Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin (if any) and correctly declare it to be heavier or lighter.

(b) (Difficult) What is the coin weighing strategy for k = 3 weighings and 12 coins?

Solution: Coin weighing.

(a) For n coins, there are 2n+ 1 possible situations or “states”.

• One of the n coins is heavier.

• One of the n coins is lighter.

• They are all of equal weight.


Each weighing has three possible outcomes - equal, left pan heavier, or right pan heavier. Hence with k weighings, there are 3^k possible outcomes and hence we can distinguish between at most 3^k different “states”. Hence 2n + 1 ≤ 3^k or n ≤ (3^k − 1)/2 .

Looking at it from an information theoretic viewpoint, each weighing gives at most log2 3 bits of information. There are 2n + 1 possible “states”, with a maximum entropy of log2(2n + 1) bits. Hence in this situation, one would require at least log2(2n + 1)/log2 3 weighings to extract enough information for determination of the odd coin, which gives the same result as above.

(b) There are many solutions to this problem. We will give one which is based on the ternary number system.

We may express the numbers {−12,−11, . . . ,−1, 0, 1, . . . , 12} in a ternary number system with alphabet {−1, 0, 1} . For example, the number 8 is (−1, 0, 1), where −1×3^0 + 0×3^1 + 1×3^2 = 8. We form the matrix with the representation of the positive numbers as its columns.

          1   2   3   4   5   6   7   8   9  10  11  12
    3^0   1  -1   0   1  -1   0   1  -1   0   1  -1   0    Σ1 = 0
    3^1   0   1   1   1  -1  -1  -1   0   0   0   1   1    Σ2 = 2
    3^2   0   0   0   0   1   1   1   1   1   1   1   1    Σ3 = 8

Note that the row sums are not all zero. We can negate some columns to make the row sums zero. For example, negating columns 7, 9, 11 and 12, we obtain

          1   2   3   4   5   6   7   8   9  10  11  12
    3^0   1  -1   0   1  -1   0  -1  -1   0   1   1   0    Σ1 = 0
    3^1   0   1   1   1  -1  -1   1   0   0   0  -1  -1    Σ2 = 0
    3^2   0   0   0   0   1   1  -1   1  -1   1  -1  -1    Σ3 = 0

Now place the coins on the balance according to the following rule: For weighing #i , place coin n

• On left pan, if ni = −1 .

• Aside, if ni = 0.

• On right pan, if ni = 1.

The outcome of the three weighings will find the odd coin if any and tell whether it is heavy or light. The result of each weighing is 0 if both pans are equal, −1 if the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings give the ternary expansion of the index of the odd coin. If the expansion is the same as the expansion in the matrix, it indicates that the coin is heavier. If the expansion is of the opposite sign, the coin is lighter. For example, (0,−1,−1) indicates (0)3^0 + (−1)3^1 + (−1)3^2 = −12 , hence coin #12 is heavy, (1,0,−1) indicates #8 is light, (0,0,0) indicates no odd coin.

Why does this scheme work? It is a single error correcting Hamming code for the ternary alphabet (discussed in Section 8.11 in the book). Here are some details.

First note a few properties of the matrix above that was used for the scheme. All the columns are distinct and no two columns add to (0,0,0). Also if any coin is heavier, it will produce the sequence of weighings that matches its column in the matrix. If it is lighter, it produces the negative of its column as a sequence of weighings. Combining all these facts, we can see that any single odd coin will produce a unique sequence of weighings, and that the coin can be determined from the sequence.

One of the questions that many of you had was whether the bound derived in part (a) was actually achievable. For example, can one distinguish 13 coins in 3 weighings? No, not with a scheme like the one above. Yes, under the assumptions under which the bound was derived. The bound did not prohibit the division of coins into halves, neither did it disallow the existence of another coin known to be normal. Under both these conditions, it is possible to find the odd coin of 13 coins in 3 weighings. You could try modifying the above scheme to these cases.
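A brute-force check of the scheme (added here as an illustration; the columns below are those of the sign-adjusted matrix from the solution) confirms that all 2 × 12 + 1 = 25 possible states produce distinct triples of weighing outcomes.

    # column j is the ternary "signature" (3^0, 3^1, 3^2 digits) of coin j+1
    cols = [(1, 0, 0), (-1, 1, 0), (0, 1, 0), (1, 1, 0), (-1, -1, 1), (0, -1, 1),
            (-1, 1, -1), (-1, 0, 1), (0, 0, -1), (1, 0, 1), (1, -1, -1), (0, -1, -1)]

    def weigh(state):
        # state = (coin index, +1 heavy / -1 light), or None for "no counterfeit".
        # A heavy coin reproduces its column; a light coin reproduces its negation.
        if state is None:
            return (0, 0, 0)
        coin, sign = state
        return tuple(sign * d for d in cols[coin])

    states = [None] + [(c, s) for c in range(12) for s in (+1, -1)]
    outcomes = {weigh(st): st for st in states}
    assert len(outcomes) == 25            # all 25 states are distinguishable
    print(outcomes[(0, -1, -1)])          # (11, 1): coin #12 is heavy
    print(outcomes[(1, 0, -1)])           # (7, -1): coin #8 is light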

8. Drawing with and without replacement. An urn contains r red, w white, and b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with replacement or without replacement? Set it up and show why. (There is both a hard way and a relatively simple way to do this.)

Solution: Drawing with and without replacement. Intuitively, it is clear that if the balls are drawn with replacement, the number of possible choices for the i-th ball is larger, and therefore the conditional entropy is larger. But computing the conditional distributions is slightly involved. It is easier to compute the unconditional entropy.

• With replacement. In this case the conditional distribution of each draw is the same for every draw. Thus

\[ X_i = \begin{cases} \text{red} & \text{with prob. } \frac{r}{r+w+b} \\ \text{white} & \text{with prob. } \frac{w}{r+w+b} \\ \text{black} & \text{with prob. } \frac{b}{r+w+b} \end{cases} \tag{2.8} \]

and therefore

\begin{align}
H(X_i|X_{i-1}, \ldots, X_1) &= H(X_i) \tag{2.9}\\
&= \log(r+w+b) - \frac{r}{r+w+b}\log r - \frac{w}{r+w+b}\log w - \frac{b}{r+w+b}\log b. \tag{2.10}
\end{align}

• Without replacement. The unconditional probability of the i-th ball being red is still r/(r+w+b) , etc. Thus the unconditional entropy H(Xi) is still the same as with replacement. The conditional entropy H(Xi|Xi−1, . . . , X1) is less than the unconditional entropy, and therefore the entropy of drawing without replacement is lower.

9. A metric. A function ρ(x, y) is a metric if for all x, y ,

• ρ(x, y) ≥ 0

• ρ(x, y) = ρ(y, x)


• ρ(x, y) = 0 if and only if x = y

• ρ(x, y) + ρ(y, z) ≥ ρ(x, z) .

(a) Show that ρ(X,Y ) = H(X|Y ) + H(Y |X) satisfies the first, second and fourth properties above. If we say that X = Y if there is a one-to-one function mapping from X to Y , then the third property is also satisfied, and ρ(X,Y ) is a metric.

(b) Verify that ρ(X,Y ) can also be expressed as

ρ(X,Y ) = H(X) +H(Y ) − 2I(X;Y ) (2.11)

= H(X,Y ) − I(X;Y ) (2.12)

= 2H(X,Y ) −H(X) −H(Y ). (2.13)

Solution: A metric

(a) Let ρ(X,Y ) = H(X|Y ) + H(Y |X). (2.14)

Then

• Since conditional entropy is always ≥ 0 , ρ(X,Y ) ≥ 0 .

• The symmetry of the definition implies that ρ(X,Y ) = ρ(Y,X) .

• By problem 2.6, it follows that H(Y |X) is 0 iff Y is a function of X and H(X|Y ) is 0 iff X is a function of Y . Thus ρ(X,Y ) is 0 iff X and Y are functions of each other - and therefore are equivalent up to a reversible transformation.

• Consider three random variables X , Y and Z . Then

H(X|Y ) +H(Y |Z) ≥ H(X|Y,Z) +H(Y |Z) (2.15)

= H(X,Y |Z) (2.16)

= H(X|Z) +H(Y |X,Z) (2.17)

≥ H(X|Z), (2.18)

from which it follows that

ρ(X,Y ) + ρ(Y,Z) ≥ ρ(X,Z). (2.19)

Note that the inequality is strict unless X → Y → Z forms a Markov chain and Y is a function of X and Z .

(b) Since H(X|Y ) = H(X) − I(X;Y ) , the first equation follows. The second relation follows from the first equation and the fact that H(X,Y ) = H(X) + H(Y ) − I(X;Y ) . The third follows on substituting I(X;Y ) = H(X) + H(Y ) − H(X,Y ) .

10. Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn according to probability mass functions p1(·) and p2(·) over the respective alphabets X1 = {1, 2, . . . , m} and X2 = {m + 1, . . . , n}. Let

\[ X = \begin{cases} X_1, & \text{with probability } \alpha, \\ X_2, & \text{with probability } 1-\alpha. \end{cases} \]


(a) Find H(X) in terms of H(X1) and H(X2) and α.

(b) Maximize over α to show that 2^{H(X)} ≤ 2^{H(X1)} + 2^{H(X2)} and interpret using the notion that 2^{H(X)} is the effective alphabet size.

Solution: Entropy. We can do this problem by writing down the definition of entropy and expanding the various terms. Instead, we will use the algebra of entropies for a simpler proof.

Since X1 and X2 have disjoint support sets, we can write

\[ X = \begin{cases} X_1 & \text{with probability } \alpha \\ X_2 & \text{with probability } 1-\alpha \end{cases} \]

Define a function of X ,

\[ \theta = f(X) = \begin{cases} 1 & \text{when } X = X_1 \\ 2 & \text{when } X = X_2 \end{cases} \]

Then as in problem 1, we have

\begin{align*}
H(X) &= H(X, f(X)) = H(\theta) + H(X|\theta) \\
&= H(\theta) + p(\theta = 1) H(X|\theta = 1) + p(\theta = 2) H(X|\theta = 2) \\
&= H(\alpha) + \alpha H(X_1) + (1-\alpha) H(X_2)
\end{align*}

where H(α) = −α log α− (1 − α) log(1 − α) .

11. A measure of correlation. Let X1 and X2 be identically distributed, but not necessarily independent. Let

\[ \rho = 1 - \frac{H(X_2 \mid X_1)}{H(X_1)}. \]

(a) Show ρ = \frac{I(X_1;X_2)}{H(X_1)} .

(b) Show 0 ≤ ρ ≤ 1.

(c) When is ρ = 0?

(d) When is ρ = 1?

Solution: A measure of correlation. X1 and X2 are identically distributed and

\[ \rho = 1 - \frac{H(X_2|X_1)}{H(X_1)}. \]

(a)
\begin{align*}
\rho &= \frac{H(X_1) - H(X_2|X_1)}{H(X_1)} \\
&= \frac{H(X_2) - H(X_2|X_1)}{H(X_1)} \qquad (\text{since } H(X_1) = H(X_2)) \\
&= \frac{I(X_1;X_2)}{H(X_1)}.
\end{align*}


(b) Since 0 ≤ H(X2|X1) ≤ H(X2) = H(X1) , we have

\[ 0 \le \frac{H(X_2|X_1)}{H(X_1)} \le 1 \quad\Longrightarrow\quad 0 \le \rho \le 1. \]

(c) ρ = 0 iff I(X1;X2) = 0 iff X1 and X2 are independent.

(d) ρ = 1 iff H(X2|X1) = 0 iff X2 is a function of X1 . By symmetry, X1 is a function of X2 , i.e., X1 and X2 have a one-to-one relationship.

12. Example of joint entropy. Let p(x, y) be given by

    X \ Y    0     1
      0     1/3   1/3
      1      0    1/3

Find

(a) H(X),H(Y ).

(b) H(X | Y ),H(Y | X).

(c) H(X,Y ).

(d) H(Y ) −H(Y | X).

(e) I(X;Y ) .

(f) Draw a Venn diagram for the quantities in (a) through (e).

Solution: Example of joint entropy

(a) H(X) = (2/3) log(3/2) + (1/3) log 3 = 0.918 bits = H(Y ) .

(b) H(X|Y ) = (1/3) H(X|Y = 0) + (2/3) H(X|Y = 1) = 0.667 bits = H(Y |X) .

(c) H(X,Y ) = 3 × (1/3) log 3 = log 3 = 1.585 bits.

(d) H(Y ) −H(Y |X) = 0.251 bits.

(e) I(X;Y ) = H(Y ) −H(Y |X) = 0.251 bits.

(f) See Figure 2.1.
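These values can be reproduced mechanically from the joint pmf; the short sketch below (added for illustration, not part of the original solution) does exactly that.

    import math
    from collections import defaultdict

    def H(pmf):
        return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

    joint = {(0, 0): 1/3, (0, 1): 1/3, (1, 1): 1/3}     # p(x, y); p(1, 0) = 0

    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p

    HX, HY, HXY = H(px), H(py), H(joint)
    print(HX, HY)            # 0.918, 0.918
    print(HXY)               # 1.585
    print(HXY - HY)          # H(X|Y) = 0.667
    print(HX + HY - HXY)     # I(X;Y) = 0.251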

13. Inequality. Show ln x ≥ 1 − 1/x for x > 0.

Solution: Inequality. Using the remainder form of the Taylor expansion of ln(x) about x = 1, we have for some c between 1 and x

\[ \ln(x) = \ln(1) + \left(\frac{1}{t}\right)_{t=1}(x-1) + \left(\frac{-1}{t^2}\right)_{t=c}\frac{(x-1)^2}{2} \le x-1 \]


[Figure 2.1: Venn diagram illustrating the relationships among H(X), H(Y), H(X|Y), H(Y|X) and I(X;Y).]

since the second term is always negative. Hence letting y = 1/x , we obtain

\[ -\ln y \le \frac{1}{y} - 1 \]

or

\[ \ln y \ge 1 - \frac{1}{y} \]

with equality iff y = 1.

14. Entropy of a sum. Let X and Y be random variables that take on values x1, x2, . . . , xr and y1, y2, . . . , ys , respectively. Let Z = X + Y.

(a) Show that H(Z|X) = H(Y |X). Argue that if X, Y are independent, then H(Y ) ≤ H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables adds uncertainty.

(b) Give an example of (necessarily dependent) random variables in which H(X) > H(Z) and H(Y ) > H(Z).

(c) Under what conditions does H(Z) = H(X) +H(Y ) ?

Solution: Entropy of a sum.

(a) Z = X + Y . Hence p(Z = z|X = x) = p(Y = z − x|X = x) .

\begin{align*}
H(Z|X) &= \sum_x p(x) H(Z|X=x) \\
&= -\sum_x p(x) \sum_z p(Z=z|X=x) \log p(Z=z|X=x) \\
&= -\sum_x p(x) \sum_y p(Y=z-x|X=x) \log p(Y=z-x|X=x) \\
&= \sum_x p(x) H(Y|X=x) \\
&= H(Y|X).
\end{align*}


If X and Y are independent, then H(Y |X) = H(Y ) . Since I(X;Z) ≥ 0 , we have H(Z) ≥ H(Z|X) = H(Y |X) = H(Y ) . Similarly we can show that H(Z) ≥ H(X) .

(b) Consider the following joint distribution for X and Y. Let

\[ X = -Y = \begin{cases} 1 & \text{with probability } 1/2 \\ 0 & \text{with probability } 1/2 \end{cases} \]

Then H(X) = H(Y ) = 1 , but Z = 0 with prob. 1 and hence H(Z) = 0 .

(c) We have

H(Z) ≤ H(X,Y ) ≤ H(X) +H(Y )

because Z is a function of (X,Y ) and H(X,Y ) = H(X) + H(Y |X) ≤ H(X) + H(Y ) . We have equality iff (X,Y ) is a function of Z and H(Y ) = H(Y |X) , i.e., X and Y are independent.

15. Data processing. Let X1 → X2 → X3 → · · · → Xn form a Markov chain in this order; i.e., let

p(x1, x2, . . . , xn) = p(x1)p(x2|x1) · · · p(xn|xn−1).

Reduce I(X1;X2, . . . , Xn) to its simplest form.

Solution: Data Processing. By the chain rule for mutual information,

I(X1;X2, . . . , Xn) = I(X1;X2) + I(X1;X3|X2) + · · · + I(X1;Xn|X2, . . . , Xn−1). (2.20)

By the Markov property, the past and the future are conditionally independent given the present and hence all terms except the first are zero. Therefore

I(X1;X2, . . . , Xn) = I(X1;X2). (2.21)

16. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks down to k < n states, and then fans back to m > k states. Thus X1 → X2 → X3 , i.e., p(x1, x2, x3) = p(x1)p(x2|x1)p(x3|x2) , for all x1 ∈ {1, 2, . . . , n} , x2 ∈ {1, 2, . . . , k} , x3 ∈ {1, 2, . . . , m} .

(a) Show that the dependence of X1 and X3 is limited by the bottleneck by proving that I(X1;X3) ≤ log k.

(b) Evaluate I(X1;X3) for k = 1, and conclude that no dependence can survive such a bottleneck.

Solution:

Bottleneck.


(a) From the data processing inequality, and the fact that entropy is maximum for a uniform distribution, we get

I(X1;X3) ≤ I(X1;X2)

= H(X2) −H(X2 | X1)

≤ H(X2)

≤ log k.

Thus, the dependence between X1 and X3 is limited by the size of the bottleneck. That is, I(X1;X3) ≤ log k .

(b) For k = 1, I(X1;X3) ≤ log 1 = 0 and since I(X1;X3) ≥ 0 , I(X1;X3) = 0 . Thus, for k = 1, X1 and X3 are independent.

17. Pure randomness and bent coins. Let X1, X2, . . . , Xn denote the outcomes of independent flips of a bent coin. Thus Pr{Xi = 1} = p, Pr{Xi = 0} = 1 − p , where p is unknown. We wish to obtain a sequence Z1, Z2, . . . , ZK of fair coin flips from X1, X2, . . . , Xn . Toward this end let f : X^n → {0, 1}* (where {0, 1}* = {Λ, 0, 1, 00, 01, . . .} is the set of all finite length binary sequences) be a mapping f(X1, X2, . . . , Xn) = (Z1, Z2, . . . , ZK) , where Zi ∼ Bernoulli(1/2) , and K may depend on (X1, . . . , Xn) . In order that the sequence Z1, Z2, . . . appear to be fair coin flips, the map f from bent coin flips to fair flips must have the property that all 2^k sequences (Z1, Z2, . . . , Zk) of a given length k have equal probability (possibly 0), for k = 1, 2, . . . . For example, for n = 2, the map f(01) = 0 , f(10) = 1 , f(00) = f(11) = Λ (the null string), has the property that Pr{Z1 = 1|K = 1} = Pr{Z1 = 0|K = 1} = 1/2 .

Give reasons for the following inequalities:

\begin{align*}
nH(p) &\stackrel{\rm (a)}{=} H(X_1, \ldots, X_n) \\
&\stackrel{\rm (b)}{\ge} H(Z_1, Z_2, \ldots, Z_K, K) \\
&\stackrel{\rm (c)}{=} H(K) + H(Z_1, \ldots, Z_K \mid K) \\
&\stackrel{\rm (d)}{=} H(K) + E(K) \\
&\stackrel{\rm (e)}{\ge} EK.
\end{align*}

Thus no more than nH(p) fair coin tosses can be derived from (X1, . . . , Xn) , on the average. Exhibit a good map f on sequences of length 4.

Solution: Pure randomness and bent coins.

\begin{align*}
nH(p) &\stackrel{\rm (a)}{=} H(X_1, \ldots, X_n) \\
&\stackrel{\rm (b)}{\ge} H(Z_1, Z_2, \ldots, Z_K) \\
&\stackrel{\rm (c)}{=} H(Z_1, Z_2, \ldots, Z_K, K) \\
&\stackrel{\rm (d)}{=} H(K) + H(Z_1, \ldots, Z_K \mid K) \\
&\stackrel{\rm (e)}{=} H(K) + E(K) \\
&\stackrel{\rm (f)}{\ge} EK.
\end{align*}

(a) Since X1, X2, . . . , Xn are i.i.d. with probability of Xi = 1 being p , the entropy H(X1, X2, . . . , Xn) is nH(p) .

(b) Z1, . . . , ZK is a function of X1, X2, . . . , Xn , and since the entropy of a function of a random variable is less than the entropy of the random variable, H(Z1, . . . , ZK) ≤ H(X1, X2, . . . , Xn) .

(c) K is a function of Z1, Z2, . . . , ZK , so its conditional entropy given Z1, Z2, . . . , ZK is 0. Hence H(Z1, Z2, . . . , ZK, K) = H(Z1, . . . , ZK) + H(K|Z1, Z2, . . . , ZK) = H(Z1, Z2, . . . , ZK).

(d) Follows from the chain rule for entropy.

(e) By assumption, Z1, Z2, . . . , ZK are pure random bits (given K ), with entropy 1 bit per symbol. Hence

\begin{align}
H(Z_1, Z_2, \ldots, Z_K \mid K) &= \sum_k p(K = k)\, H(Z_1, Z_2, \ldots, Z_k \mid K = k) \tag{2.22}\\
&= \sum_k p(k)\, k \tag{2.23}\\
&= EK. \tag{2.24}
\end{align}

(f) Follows from the non-negativity of discrete entropy.

(g) Since we do not know p , the only way to generate pure random bits is to use the fact that all sequences with the same number of ones are equally likely. For example, the sequences 0001, 0010, 0100 and 1000 are equally likely and can be used to generate 2 pure random bits. An example of a mapping to generate random bits is

    0000 → Λ
    0001 → 00   0010 → 01   0100 → 10   1000 → 11
    0011 → 00   0110 → 01   1100 → 10   1001 → 11
    1010 → 0    0101 → 1
    1110 → 11   1101 → 10   1011 → 01   0111 → 00
    1111 → Λ                                          (2.25)

The resulting expected number of bits is

\begin{align}
EK &= 4pq^3 \times 2 + 4p^2q^2 \times 2 + 2p^2q^2 \times 1 + 4p^3q \times 2 \tag{2.26}\\
&= 8pq^3 + 10p^2q^2 + 8p^3q. \tag{2.27}
\end{align}


For example, for p ≈ 1/2, the expected number of pure random bits is close to 1.625. This is substantially less than the 4 pure random bits that could be generated if p were exactly 1/2.
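To double-check the map (2.25), the sketch below (an added illustration, not part of the original solution) enumerates all length-4 sequences, applies the mapping, and evaluates EK; for p = 1/2 it returns 1.625 bits.

    from itertools import product

    # the mapping of equation (2.25); Λ (no output) is represented by the empty string
    f = {"0000": "", "1111": "",
         "0001": "00", "0010": "01", "0100": "10", "1000": "11",
         "0011": "00", "0110": "01", "1100": "10", "1001": "11",
         "1010": "0", "0101": "1",
         "1110": "11", "1101": "10", "1011": "01", "0111": "00"}

    def expected_K(p):
        q = 1 - p
        return sum(p ** s.count("1") * q ** s.count("0") * len(f[s])
                   for s in map("".join, product("01", repeat=4)))

    print(expected_K(0.5))    # 1.625
    print(expected_K(0.3))    # equals 8pq^3 + 10p^2q^2 + 8p^3q at p = 0.3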

We will now analyze the efficiency of this scheme of generating random bits for long sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm that we will use is the obvious extension of the above method of generating pure bits using the fact that all sequences with the same number of ones are equally likely.

Consider all sequences with k ones. There are \binom{n}{k} such sequences, which are all equally likely. If \binom{n}{k} were a power of 2, then we could generate \log\binom{n}{k} pure random bits from such a set. However, in the general case, \binom{n}{k} is not a power of 2 and the best we can do is to divide the set of \binom{n}{k} elements into subsets of sizes which are powers of 2. The largest set would have a size 2^{\lfloor \log\binom{n}{k} \rfloor} and could be used to generate \lfloor \log\binom{n}{k} \rfloor random bits. We could divide the remaining elements into the largest set which is a power of 2, etc. The worst case would occur when \binom{n}{k} = 2^{l+1} - 1 , in which case the subsets would be of sizes 2^l, 2^{l-1}, 2^{l-2}, \ldots, 1 .

Instead of analyzing the scheme exactly, we will just find a lower bound on the number of random bits generated from a set of size \binom{n}{k}. Let l = \lfloor \log\binom{n}{k} \rfloor . Then at least half of the elements belong to a set of size 2^l and would generate l random bits, at least a quarter belong to a set of size 2^{l-1} and generate l − 1 random bits, etc. On the average, the number of bits generated is

\begin{align}
E[K \mid k \text{ 1's in sequence}] &\ge \frac{1}{2}\,l + \frac{1}{4}(l-1) + \cdots + \frac{1}{2^l}\cdot 1 \tag{2.28}\\
&= l - \frac{1}{4}\left(1 + \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \cdots + \frac{l-1}{2^{l-2}}\right) \tag{2.29}\\
&\ge l - 1, \tag{2.30}
\end{align}

since the infinite series sums to 1.

Hence the fact that \binom{n}{k} is not a power of 2 will cost at most 1 bit on the average in the number of random bits that are produced.

Hence, the expected number of pure random bits produced by this algorithm is

\begin{align}
EK &\ge \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} \left\lfloor \log\binom{n}{k} - 1 \right\rfloor \tag{2.31}\\
&\ge \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} \left( \log\binom{n}{k} - 2 \right) \tag{2.32}\\
&= \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} \log\binom{n}{k} - 2 \tag{2.33}\\
&\ge \sum_{n(p-\epsilon) \le k \le n(p+\epsilon)} \binom{n}{k} p^k q^{n-k} \log\binom{n}{k} - 2. \tag{2.34}
\end{align}


Now for sufficiently large n , the probability that the number of 1's in the sequence is close to np is near 1 (by the weak law of large numbers). For such sequences, k/n is close to p and hence there exists a δ such that

\[ \binom{n}{k} \ge 2^{n(H(\frac{k}{n}) - \delta)} \ge 2^{n(H(p) - 2\delta)} \tag{2.35} \]

using Stirling's approximation for the binomial coefficients and the continuity of the entropy function. If we assume that n is large enough so that the probability that n(p − ε) ≤ k ≤ n(p + ε) is greater than 1 − ε , then we see that EK ≥ (1 − ε)n(H(p) − 2δ) − 2 , which is very good since nH(p) is an upper bound on the number of pure random bits that can be produced from the bent coin sequence.

18. World Series. The World Series is a seven-game series that terminates as soon as either team wins four games. Let X be the random variable that represents the outcome of a World Series between teams A and B; possible values of X are AAAA, BABABAB, and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7. Assuming that A and B are equally matched and that the games are independent, calculate H(X) , H(Y ) , H(Y |X) , and H(X|Y ) .

Solution:

World Series. Two teams play until one of them has won 4 games.

There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)^4 .

There are 8 = 2\binom{4}{3} World Series with 5 games. Each happens with probability (1/2)^5 .

There are 20 = 2\binom{5}{3} World Series with 6 games. Each happens with probability (1/2)^6 .

There are 40 = 2\binom{6}{3} World Series with 7 games. Each happens with probability (1/2)^7 .

The probability of a 4 game series (Y = 4) is 2(1/2)^4 = 1/8 .

The probability of a 5 game series (Y = 5) is 8(1/2)^5 = 1/4 .

The probability of a 6 game series (Y = 6) is 20(1/2)^6 = 5/16 .

The probability of a 7 game series (Y = 7) is 40(1/2)^7 = 5/16 .

\begin{align*}
H(X) &= \sum p(x) \log\frac{1}{p(x)} \\
&= 2\,\tfrac{1}{16}\log 16 + 8\,\tfrac{1}{32}\log 32 + 20\,\tfrac{1}{64}\log 64 + 40\,\tfrac{1}{128}\log 128 \\
&= 5.8125
\end{align*}

\begin{align*}
H(Y) &= \sum p(y) \log\frac{1}{p(y)} \\
&= \tfrac{1}{8}\log 8 + \tfrac{1}{4}\log 4 + \tfrac{5}{16}\log\tfrac{16}{5} + \tfrac{5}{16}\log\tfrac{16}{5} \\
&= 1.924
\end{align*}


Y is a deterministic function of X, so if you know X there is no randomness in Y. Or, H(Y |X) = 0 .

Since H(X) + H(Y |X) = H(X,Y ) = H(Y ) + H(X|Y ) , it is easy to determine H(X|Y ) = H(X) + H(Y |X) − H(Y ) = 3.889 bits.
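The entropies above are easy to confirm by brute force; the sketch below (added here, not part of the original solution) enumerates every possible series.

    import math

    def entropy(pmf):
        return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

    def all_series(prefix=""):
        # enumerate all win sequences that end as soon as a team has 4 wins
        if prefix.count("A") == 4 or prefix.count("B") == 4:
            return [prefix]
        return all_series(prefix + "A") + all_series(prefix + "B")

    pX = {s: 0.5 ** len(s) for s in all_series()}       # P(X = s)
    pY = {}
    for s, p in pX.items():
        pY[len(s)] = pY.get(len(s), 0) + p              # P(Y = number of games)

    print(entropy(pX))                  # H(X) = 5.8125
    print(entropy(pY))                  # H(Y) ≈ 1.924
    print(entropy(pX) - entropy(pY))    # H(X|Y) = H(X) - H(Y), since H(Y|X) = 0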

19. Infinite entropy. This problem shows that the entropy of a discrete random variable can be infinite. Let A = \sum_{n=2}^{\infty} (n \log^2 n)^{-1} . (It is easy to show that A is finite by bounding the infinite sum by the integral of (x \log^2 x)^{-1} .) Show that the integer-valued random variable X defined by Pr(X = n) = (A n \log^2 n)^{-1} for n = 2, 3, . . . , has H(X) = +∞ .

Solution: Infinite entropy. By definition, p_n = Pr(X = n) = 1/(A n \log^2 n) for n ≥ 2 . Therefore

\begin{align*}
H(X) &= -\sum_{n=2}^{\infty} p(n) \log p(n) \\
&= -\sum_{n=2}^{\infty} \left(1/An\log^2 n\right) \log\left(1/An\log^2 n\right) \\
&= \sum_{n=2}^{\infty} \frac{\log(An\log^2 n)}{An\log^2 n} \\
&= \sum_{n=2}^{\infty} \frac{\log A + \log n + 2\log\log n}{An\log^2 n} \\
&= \log A + \sum_{n=2}^{\infty} \frac{1}{An\log n} + \sum_{n=2}^{\infty} \frac{2\log\log n}{An\log^2 n}.
\end{align*}

The first term is finite. For base 2 logarithms, all the elements in the sum in the last term are nonnegative. (For any other base, the terms of the last sum eventually all become positive.) So all we have to do is bound the middle sum, which we do by comparing with an integral.

\[ \sum_{n=2}^{\infty} \frac{1}{An\log n} > \int_2^{\infty} \frac{1}{Ax\log x}\, dx = K \ln\ln x \,\Big|_2^{\infty} = +\infty. \]

We conclude that H(X) = +∞ .

20. Run length coding. Let X1, X2, . . . , Xn be (possibly dependent) binary random variables. Suppose one calculates the run lengths R = (R1, R2, . . .) of this sequence (in order as they occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2) . Compare H(X1, X2, . . . , Xn) , H(R) and H(Xn, R) . Show all equalities and inequalities, and bound all the differences.

Solution: Run length coding. Since the run lengths are a function of X1, X2, . . . , Xn , H(R) ≤ H(X) . Any Xi together with the run lengths determine the entire sequence X1, X2, . . . , Xn . Hence

H(X1, X2, . . . , Xn) = H(Xi,R) (2.36)


= H(R) +H(Xi|R) (2.37)

≤ H(R) +H(Xi) (2.38)

≤ H(R) + 1. (2.39)

21. Markov's inequality for probabilities. Let p(x) be a probability mass function. Prove, for all d ≥ 0 ,

\[ \Pr\{p(X) \le d\} \log\frac{1}{d} \le H(X). \tag{2.40} \]

Solution: Markov inequality applied to entropy.

\begin{align}
P(p(X) < d) \log\frac{1}{d} &= \sum_{x:\, p(x) < d} p(x) \log\frac{1}{d} \tag{2.41}\\
&\le \sum_{x:\, p(x) < d} p(x) \log\frac{1}{p(x)} \tag{2.42}\\
&\le \sum_x p(x) \log\frac{1}{p(x)} \tag{2.43}\\
&= H(X) \tag{2.44}
\end{align}

22. Logical order of ideas. Ideas have been developed in order of need, and then generalized if necessary. Reorder the following ideas, strongest first, implications following:

(a) Chain rule for I(X1, . . . , Xn;Y ) , chain rule for D(p(x1, . . . , xn)||q(x1, x2, . . . , xn)) , and chain rule for H(X1, X2, . . . , Xn) .

(b) D(f ||g) ≥ 0 , Jensen’s inequality, I(X;Y ) ≥ 0 .

Solution: Logical ordering of ideas.

(a) The following orderings are subjective. Since I(X;Y ) = D(p(x, y)||p(x)p(y)) is a special case of relative entropy, it is possible to derive the chain rule for I from the chain rule for D .

Since H(X) = I(X;X) , it is possible to derive the chain rule for H from the chain rule for I .

It is also possible to derive the chain rule for I from the chain rule for H as was done in the notes.

(b) In class, Jensen's inequality was used to prove the non-negativity of D . The inequality I(X;Y ) ≥ 0 followed as a special case of the non-negativity of D .

23. Conditional mutual information. Consider a sequence of n binary random variables X1, X2, . . . , Xn . Each sequence with an even number of 1's has probability 2^{-(n-1)} and each sequence with an odd number of 1's has probability 0. Find the mutual informations

I(X1;X2), I(X2;X3|X1), . . . , I(Xn−1;Xn|X1, . . . , Xn−2).


Solution: Conditional mutual information.

Consider a sequence of n binary random variables X1, X2, . . . , Xn . Each sequence of length n with an even number of 1's is equally likely and has probability 2^{-(n-1)} .

Any n− 1 or fewer of these are independent. Thus, for k ≤ n− 1 ,

I(Xk−1;Xk|X1, X2, . . . , Xk−2) = 0.

However, given X1, X2, . . . , Xn−2 , we know that once we know either Xn−1 or Xn we know the other.

I(Xn−1;Xn|X1, X2, . . . , Xn−2) = H(Xn|X1, X2, . . . , Xn−2) −H(Xn|X1, X2, . . . , Xn−1)

= 1 − 0 = 1 bit.

24. Average entropy. Let H(p) = −p log2 p − (1 − p) log2(1 − p) be the binary entropy function.

(a) Evaluate H(1/4) using the fact that log2 3 ≈ 1.584 . Hint: You may wish to consider an experiment with four equally likely outcomes, one of which is more interesting than the others.

(b) Calculate the average entropy H(p) when the probability p is chosen uniformly in the range 0 ≤ p ≤ 1 .

(c) (Optional) Calculate the average entropy H(p1, p2, p3) where (p1, p2, p3) is a uniformly distributed probability vector. Generalize to dimension n .

Solution: Average Entropy.

(a) We can generate two bits of information by picking one of four equally likely alternatives. This selection can be made in two steps. First we decide whether the first outcome occurs. Since this has probability 1/4 , the information generated is H(1/4) . If not the first outcome, then we select one of the three remaining outcomes; with probability 3/4 , this produces log2 3 bits of information. Thus

H(1/4) + (3/4) log2 3 = 2

and so H(1/4) = 2 − (3/4) log2 3 = 2 − (.75)(1.585) = 0.811 bits.

(b) If p is chosen uniformly in the range 0 ≤ p ≤ 1 , then the average entropy (in nats) is

\[ -\int_0^1 \left[p\ln p + (1-p)\ln(1-p)\right] dp = -2\int_0^1 x\ln x\, dx = -2\left(\frac{x^2}{2}\ln x - \frac{x^2}{4}\right)\Big|_0^1 = \frac{1}{2}. \]

Therefore the average entropy is (1/2) log2 e = 1/(2 ln 2) = .721 bits.


(c) Choosing a uniformly distributed probability vector (p1, p2, p3) is equivalent to choosing a point (p1, p2) uniformly from the triangle 0 ≤ p1 ≤ 1 , p1 ≤ p2 ≤ 1 . The probability density function has the constant value 2 because the area of the triangle is 1/2. So the average entropy H(p1, p2, p3) is

\[ -2\int_0^1\!\!\int_{p_1}^{1} \left[p_1\ln p_1 + p_2\ln p_2 + (1-p_1-p_2)\ln(1-p_1-p_2)\right] dp_2\, dp_1. \]

After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.
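A quick numerical integration (added here as a sanity check, not part of the original solution) reproduces the 1/(2 ln 2) ≈ 0.721-bit answer of part (b).

    import math

    def H2(p):
        # binary entropy in bits; H2(0) = H2(1) = 0
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    # midpoint rule for the average of H2(p) with p uniform on [0, 1]
    N = 100000
    avg = sum(H2((i + 0.5) / N) for i in range(N)) / N
    print(avg, 1 / (2 * math.log(2)))    # both ≈ 0.7213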

25. Venn diagrams. There isn't really a notion of mutual information common to three random variables. Here is one attempt at a definition: Using Venn diagrams, we can see that the mutual information common to three random variables X , Y and Z can be defined by

I(X;Y ;Z) = I(X;Y ) − I(X;Y |Z) .

This quantity is symmetric in X , Y and Z , despite the preceding asymmetric definition. Unfortunately, I(X;Y ;Z) is not necessarily nonnegative. Find X , Y and Z such that I(X;Y ;Z) < 0 , and prove the following two identities:

(a) I(X;Y ;Z) = H(X,Y,Z)−H(X)−H(Y )−H(Z) + I(X;Y ) + I(Y ;Z) + I(Z;X)

(b) I(X;Y ;Z) = H(X,Y,Z)−H(X,Y )−H(Y,Z)−H(Z,X)+H(X)+H(Y )+H(Z)

The first identity can be understood using the Venn diagram analogy for entropy and mutual information. The second identity follows easily from the first.

Solution: Venn Diagrams. To show the first identity,

I(X;Y ;Z) = I(X;Y ) − I(X;Y |Z) by definition

= I(X;Y ) − (I(X;Y,Z) − I(X;Z)) by chain rule

= I(X;Y ) + I(X;Z) − I(X;Y,Z)

= I(X;Y ) + I(X;Z) − (H(X) +H(Y,Z) −H(X,Y,Z))

= I(X;Y ) + I(X;Z) −H(X) +H(X,Y,Z) −H(Y,Z)

= I(X;Y ) + I(X;Z) −H(X) +H(X,Y,Z) − (H(Y ) +H(Z) − I(Y ;Z))

= I(X;Y ) + I(X;Z) + I(Y ;Z) +H(X,Y,Z) −H(X) −H(Y ) −H(Z).

To show the second identity, simply substitute for I(X;Y ) , I(X;Z) , and I(Y ;Z) using equations like

I(X;Y ) = H(X) +H(Y ) −H(X,Y ) .

These two identities show that I(X;Y ;Z) is a symmetric (but not necessarily nonnegative) function of three random variables.

26. Another proof of non-negativity of relative entropy. In view of the fundamental nature of the result D(p||q) ≥ 0 , we will give another proof.

(a) Show that lnx ≤ x− 1 for 0 < x <∞ .


(b) Justify the following steps:

\begin{align}
-D(p||q) &= \sum_x p(x) \ln\frac{q(x)}{p(x)} \tag{2.45}\\
&\le \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right) \tag{2.46}\\
&\le 0 \tag{2.47}
\end{align}

(c) What are the conditions for equality?

Solution: Another proof of non-negativity of relative entropy. In view of the fundamental nature of the result D(p||q) ≥ 0 , we will give another proof.

(a) Show that ln x ≤ x − 1 for 0 < x < ∞ .

There are many ways to prove this. The easiest is using calculus. Let

f(x) = x− 1 − lnx (2.48)

for 0 < x < ∞ . Then f'(x) = 1 − 1/x and f''(x) = 1/x² > 0 , and therefore f(x) is strictly convex. Therefore a local minimum of the function is also a global minimum. The function has a local minimum at the point where f'(x) = 0 , i.e., when x = 1. Therefore f(x) ≥ f(1) , i.e.,

x− 1 − lnx ≥ 1 − 1 − ln 1 = 0 (2.49)

which gives us the desired inequality. Equality occurs only if x = 1.

(b) We let A be the set of x such that p(x) > 0 .

\begin{align}
-D_e(p||q) &= \sum_{x \in A} p(x) \ln\frac{q(x)}{p(x)} \tag{2.50}\\
&\le \sum_{x \in A} p(x)\left(\frac{q(x)}{p(x)} - 1\right) \tag{2.51}\\
&= \sum_{x \in A} q(x) - \sum_{x \in A} p(x) \tag{2.52}\\
&\le 0 \tag{2.53}
\end{align}

The first step follows from the definition of D , the second step follows from the inequality ln t ≤ t − 1 , the third step from expanding the sum, and the last step from the fact that q(A) ≤ 1 and p(A) = 1 .

(c) What are the conditions for equality?

We have equality in the inequality ln t ≤ t − 1 if and only if t = 1. Therefore we have equality in step 2 of the chain iff q(x)/p(x) = 1 for all x ∈ A . This implies that p(x) = q(x) for all x , and we have equality in the last step as well. Thus the condition for equality is that p(x) = q(x) for all x .


27. Grouping rule for entropy: Let p = (p1, p2, . . . , pm) be a probability distribution on m elements, i.e., pi ≥ 0 and \sum_{i=1}^{m} p_i = 1. Define a new distribution q on m − 1 elements as q1 = p1 , q2 = p2 , . . . , qm−2 = pm−2 , and qm−1 = pm−1 + pm , i.e., the distribution q is the same as p on {1, 2, . . . , m − 2} , and the probability of the last element in q is the sum of the last two probabilities of p . Show that

\[ H(\mathbf{p}) = H(\mathbf{q}) + (p_{m-1} + p_m)\, H\!\left(\frac{p_{m-1}}{p_{m-1}+p_m}, \frac{p_m}{p_{m-1}+p_m}\right). \tag{2.54} \]

Solution:

\begin{align}
H(\mathbf{p}) &= -\sum_{i=1}^{m} p_i \log p_i \tag{2.55}\\
&= -\sum_{i=1}^{m-2} p_i \log p_i - p_{m-1}\log p_{m-1} - p_m \log p_m \tag{2.56}\\
&= -\sum_{i=1}^{m-2} p_i \log p_i - p_{m-1}\log\frac{p_{m-1}}{p_{m-1}+p_m} - p_m\log\frac{p_m}{p_{m-1}+p_m} \tag{2.57}\\
&\qquad - (p_{m-1}+p_m)\log(p_{m-1}+p_m) \tag{2.58}\\
&= H(\mathbf{q}) - p_{m-1}\log\frac{p_{m-1}}{p_{m-1}+p_m} - p_m\log\frac{p_m}{p_{m-1}+p_m} \tag{2.59}\\
&= H(\mathbf{q}) - (p_{m-1}+p_m)\left(\frac{p_{m-1}}{p_{m-1}+p_m}\log\frac{p_{m-1}}{p_{m-1}+p_m} + \frac{p_m}{p_{m-1}+p_m}\log\frac{p_m}{p_{m-1}+p_m}\right) \tag{2.60}\\
&= H(\mathbf{q}) + (p_{m-1}+p_m)\, H_2\!\left(\frac{p_{m-1}}{p_{m-1}+p_m}, \frac{p_m}{p_{m-1}+p_m}\right), \tag{2.61}
\end{align}

where H2(a, b) = −a log a− b log b .

28. Mixing increases entropy. Show that the entropy of the probability distribution (p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m) is less than the entropy of the distribution (p_1, \ldots, \frac{p_i+p_j}{2}, \ldots, \frac{p_i+p_j}{2}, \ldots, p_m) . Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.

Solution:

Mixing increases entropy.

This problem depends on the convexity of the log function. Let

\begin{align*}
P_1 &= (p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m) \\
P_2 &= \left(p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_j + p_i}{2}, \ldots, p_m\right)
\end{align*}

Then, by the log sum inequality,

\begin{align*}
H(P_2) - H(P_1) &= -2\,\frac{p_i + p_j}{2}\log\frac{p_i + p_j}{2} + p_i \log p_i + p_j \log p_j \\
&= -(p_i + p_j)\log\frac{p_i + p_j}{2} + p_i \log p_i + p_j \log p_j \\
&\ge 0.
\end{align*}


Thus, H(P2) ≥ H(P1).

29. Inequalities. Let X , Y and Z be joint random variables. Prove the following inequalities and find conditions for equality.

(a) H(X,Y |Z) ≥ H(X|Z) .

(b) I(X,Y ;Z) ≥ I(X;Z) .

(c) H(X,Y,Z) −H(X,Y ) ≤ H(X,Z) −H(X) .

(d) I(X;Z|Y ) ≥ I(Z;Y |X) − I(Z;Y ) + I(X;Z) .

Solution: Inequalities.

(a) Using the chain rule for conditional entropy,

H(X,Y |Z) = H(X|Z) +H(Y |X,Z) ≥ H(X|Z),

with equality iff H(Y |X,Z) = 0 , that is, when Y is a function of X and Z .

(b) Using the chain rule for mutual information,

I(X,Y ;Z) = I(X;Z) + I(Y ;Z|X) ≥ I(X;Z),

with equality iff I(Y ;Z|X) = 0 , that is, when Y and Z are conditionally independent given X .

(c) Using first the chain rule for entropy and then the definition of conditional mutual information,

H(X,Y,Z) −H(X,Y ) = H(Z|X,Y ) = H(Z|X) − I(Y ;Z|X)

≤ H(Z|X) = H(X,Z) −H(X) ,

with equality iff I(Y ;Z|X) = 0 , that is, when Y and Z are conditionally independent given X .

(d) Using the chain rule for mutual information,

I(X;Z|Y ) + I(Z;Y ) = I(X,Y ;Z) = I(Z;Y |X) + I(X;Z) ,

and therefore

I(X;Z|Y ) = I(Z;Y |X) − I(Z;Y ) + I(X;Z) .

We see that this inequality is actually an equality in all cases.

30. Maximum entropy. Find the probability mass function p(x) that maximizes the entropy H(X) of a non-negative integer-valued random variable X subject to the constraint

\[ EX = \sum_{n=0}^{\infty} n\, p(n) = A \]


for a fixed value A > 0 . Evaluate this maximum H(X) .

Solution: Maximum entropy

Recall that,

\[ -\sum_{i=0}^{\infty} p_i \log p_i \le -\sum_{i=0}^{\infty} p_i \log q_i. \]

Let q_i = \alpha\beta^i . Then we have that

\begin{align*}
-\sum_{i=0}^{\infty} p_i \log p_i &\le -\sum_{i=0}^{\infty} p_i \log q_i \\
&= -\left(\log(\alpha)\sum_{i=0}^{\infty} p_i + \log(\beta)\sum_{i=0}^{\infty} i\, p_i\right) \\
&= -\log\alpha - A\log\beta
\end{align*}

Notice that the final right hand side expression is independent of {pi} , and that the inequality

\[ -\sum_{i=0}^{\infty} p_i \log p_i \le -\log\alpha - A\log\beta \]

holds for all α, β such that,

\[ \sum_{i=0}^{\infty} \alpha\beta^i = 1 = \frac{\alpha}{1-\beta}. \]

The constraint on the expected value also requires that,

\[ \sum_{i=0}^{\infty} i\,\alpha\beta^i = A = \frac{\alpha\beta}{(1-\beta)^2}. \]

Combining the two constraints we have,

\[ \frac{\alpha\beta}{(1-\beta)^2} = \left(\frac{\alpha}{1-\beta}\right)\left(\frac{\beta}{1-\beta}\right) = \frac{\beta}{1-\beta} = A, \]

which implies that,

\[ \beta = \frac{A}{A+1}, \qquad \alpha = \frac{1}{A+1}. \]


So the entropy maximizing distribution is,

\[ p_i = \frac{1}{A+1}\left(\frac{A}{A+1}\right)^i. \]

Plugging these values into the expression for the maximum entropy,

\[ -\log\alpha - A\log\beta = (A+1)\log(A+1) - A\log A. \]

The general form of the distribution,

\[ p_i = \alpha\beta^i \]

can be obtained either by guessing or by Lagrange multipliers where,

\[ F(p_i, \lambda_1, \lambda_2) = -\sum_{i=0}^{\infty} p_i \log p_i + \lambda_1\left(\sum_{i=0}^{\infty} p_i - 1\right) + \lambda_2\left(\sum_{i=0}^{\infty} i\, p_i - A\right) \tag{2.84} \]

is the function whose gradient we set to 0.

To complete the argument with Lagrange multipliers, it is necessary to show that the local maximum is the global maximum. One possible argument is based on the fact that −H(p) is convex: it has only one local minimum and no local maxima, and therefore the Lagrange multiplier method actually gives the global maximum for H(p) .
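As a check on the algebra (added here, not in the original solution; the value A = 3 and the truncation length are arbitrary choices), the sketch below evaluates the entropy of p_i = (1/(A+1))(A/(A+1))^i, verifies the mean constraint, and compares against (A+1) log(A+1) − A log A.

    import math

    A = 3.0
    alpha, beta = 1 / (A + 1), A / (A + 1)
    N = 2000                                   # truncation; the tail mass is negligible
    p = [alpha * beta ** i for i in range(N)]

    mean = sum(i * pi for i, pi in enumerate(p))
    H = -sum(pi * math.log2(pi) for pi in p)
    closed_form = (A + 1) * math.log2(A + 1) - A * math.log2(A)
    print(mean)                    # ≈ 3.0
    print(H, closed_form)          # both ≈ 3.245 bits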

31. Conditional entropy. Under what conditions does H(X | g(Y )) = H(X | Y ) ?

Solution: (Conditional Entropy). If H(X|g(Y )) = H(X|Y ) , then H(X) − H(X|g(Y )) = H(X) − H(X|Y ) , i.e., I(X; g(Y )) = I(X;Y ) . This is the condition for equality in the data processing inequality. From the derivation of the inequality, we have equality iff X → g(Y ) → Y forms a Markov chain. Hence H(X|g(Y )) = H(X|Y ) iff X → g(Y ) → Y . This condition includes many special cases, such as g being one-to-one, and X and Y being independent. However, these two special cases do not exhaust all the possibilities.

32. Fano. We are given the following joint distribution on (X,Y )

    X \ Y     a      b      c
      1      1/6    1/12   1/12
      2      1/12   1/6    1/12
      3      1/12   1/12   1/6

Let X̂(Y ) be an estimator for X (based on Y) and let Pe = Pr{X̂(Y ) ≠ X}.

(a) Find the minimum probability of error estimator X̂(Y ) and the associated Pe .

(b) Evaluate Fano’s inequality for this problem and compare.


Solution:

(a) From inspection we see that

\[ \hat{X}(y) = \begin{cases} 1 & y = a \\ 2 & y = b \\ 3 & y = c \end{cases} \]

Hence the associated Pe is the sum of P(1, b), P(1, c), P(2, a), P(2, c), P(3, a) and P(3, b). Therefore, Pe = 1/2.

(b) From Fano’s inequality we know

\[ P_e \ge \frac{H(X|Y) - 1}{\log|\mathcal{X}|}. \]

Here,

\begin{align*}
H(X|Y) &= H(X|Y=a)\Pr\{y=a\} + H(X|Y=b)\Pr\{y=b\} + H(X|Y=c)\Pr\{y=c\} \\
&= H\!\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right)\Pr\{y=a\} + H\!\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right)\Pr\{y=b\} + H\!\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right)\Pr\{y=c\} \\
&= H\!\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right)\left(\Pr\{y=a\} + \Pr\{y=b\} + \Pr\{y=c\}\right) \\
&= H\!\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right) \\
&= 1.5 \ \text{bits.}
\end{align*}

Hence

\[ P_e \ge \frac{1.5 - 1}{\log 3} = 0.316. \]

Hence our estimator X̂(Y ) is not very close to Fano's bound in this form. If X̂ ∈ X , as it does here, we can use the stronger form of Fano's inequality to get

\[ P_e \ge \frac{H(X|Y) - 1}{\log(|\mathcal{X}| - 1)}, \]

and

\[ P_e \ge \frac{1.5 - 1}{\log 2} = \frac{1}{2}. \]

Therefore our estimator X̂(Y ) is actually quite good.
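The numbers in this solution are easy to reproduce; the following sketch (added for illustration, not part of the original text) computes Pe, H(X|Y) and both forms of the Fano bound from the joint table.

    import math

    joint = {(x, y): (1/6 if x == "abc".index(y) + 1 else 1/12)
             for x in (1, 2, 3) for y in "abc"}

    # minimum-error estimator: guess the most likely x for each observed y
    Pe = 0.0
    H_X_given_Y = 0.0
    for y in "abc":
        col = {x: joint[(x, y)] for x in (1, 2, 3)}
        py = sum(col.values())
        cond = {x: p / py for x, p in col.items()}
        Pe += py * (1 - max(cond.values()))
        H_X_given_Y += py * -sum(p * math.log2(p) for p in cond.values())

    print(Pe)                                    # 0.5
    print(H_X_given_Y)                           # 1.5 bits
    print((H_X_given_Y - 1) / math.log2(3))      # weak Fano bound ≈ 0.316
    print((H_X_given_Y - 1) / math.log2(2))      # strong Fano bound = 0.5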

33. Fano's inequality. Let Pr(X = i) = pi, i = 1, 2, . . . , m and let p1 ≥ p2 ≥ p3 ≥ · · · ≥ pm. The minimal probability of error predictor of X is X̂ = 1, with resulting probability of error Pe = 1 − p1. Maximize H(p) subject to the constraint 1 − p1 = Pe to find a bound on Pe in terms of H . This is Fano's inequality in the absence of conditioning.


Solution: (Fano's Inequality.) The minimal probability of error predictor when there is no information is X̂ = 1, the most probable value of X . The probability of error in this case is Pe = 1 − p1 . Hence if we fix Pe , we fix p1 . We maximize the entropy of X for a given Pe to obtain an upper bound on the entropy for a given Pe . The entropy,

\begin{align}
H(\mathbf{p}) &= -p_1 \log p_1 - \sum_{i=2}^{m} p_i \log p_i \tag{2.62}\\
&= -p_1 \log p_1 - \sum_{i=2}^{m} P_e \frac{p_i}{P_e} \log\frac{p_i}{P_e} - P_e \log P_e \tag{2.63}\\
&= H(P_e) + P_e\, H\!\left(\frac{p_2}{P_e}, \frac{p_3}{P_e}, \ldots, \frac{p_m}{P_e}\right) \tag{2.64}\\
&\le H(P_e) + P_e \log(m-1), \tag{2.65}
\end{align}

since the maximum of H(p_2/P_e, p_3/P_e, \ldots, p_m/P_e) is attained by a uniform distribution. Hence

any X that can be predicted with a probability of error Pe must satisfy

H(X) ≤ H(Pe) + Pe log(m− 1), (2.66)

which is the unconditional form of Fano's inequality. We can weaken this inequality to obtain an explicit lower bound for Pe ,

\[ P_e \ge \frac{H(X) - 1}{\log(m-1)}. \tag{2.67} \]

34. Entropy of initial conditions. Prove that H(X0|Xn) is non-decreasing with n for any Markov chain.

Solution: Entropy of initial conditions. For a Markov chain, by the data processing theorem, we have

I(X0;Xn−1) ≥ I(X0;Xn). (2.68)

Therefore

H(X0) − H(X0|Xn−1) ≥ H(X0) − H(X0|Xn) (2.69)

or H(X0|Xn) increases with n .

35. Relative entropy is not symmetric: Let the random variable X have three possible outcomes {a, b, c} . Consider two distributions on this random variable

    Symbol   p(x)   q(x)
    a        1/2    1/3
    b        1/4    1/3
    c        1/4    1/3

Calculate H(p) , H(q) , D(p||q) and D(q||p) . Verify that in this case D(p||q) ≠ D(q||p) .

Solution:

\[ H(p) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{4}\log 4 = 1.5 \ \text{bits.} \tag{2.70} \]


\[ H(q) = \frac{1}{3}\log 3 + \frac{1}{3}\log 3 + \frac{1}{3}\log 3 = \log 3 = 1.58496 \ \text{bits.} \tag{2.71} \]

D(p||q) = 1/2 log(3/2) + 1/4 log(3/4) + 1/4 log(3/4) = log 3 − 1.5 = 1.58496 − 1.5 = 0.08496   (2.72)

D(q||p) = 1/3 log(2/3) + 1/3 log(4/3) + 1/3 log(4/3) = 5/3 − log 3 = 1.66666 − 1.58496 = 0.08170   (2.73)
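A two-line computation (added for illustration) confirms the asymmetry:

    import math

    def D(p, q):
        # relative entropy in bits; assumes p(x) > 0 implies q(x) > 0
        return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

    p = {"a": 1/2, "b": 1/4, "c": 1/4}
    q = {"a": 1/3, "b": 1/3, "c": 1/3}
    print(D(p, q), D(q, p))    # ≈ 0.08496 and ≈ 0.08170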

36. Symmetric relative entropy: Though, as the previous example shows, D(p||q) ≠ D(q||p) in general, there could be distributions for which equality holds. Give an example of two distributions p and q on a binary alphabet such that D(p||q) = D(q||p) (other than the trivial case p = q ).

Solution:

A simple case for D((p, 1 − p)||(q, 1 − q)) = D((q, 1 − q)||(p, 1 − p)) , i.e., for

\[ p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} = q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p} \tag{2.74} \]

is when q = 1 − p .

37. Relative entropy: Let X, Y, Z be three random variables with a joint probability mass function p(x, y, z) . The relative entropy between the joint distribution and the product of the marginals is

\[ D(p(x,y,z)\,||\,p(x)p(y)p(z)) = E\left[\log\frac{p(x,y,z)}{p(x)p(y)p(z)}\right] \tag{2.75} \]

Expand this in terms of entropies. When is this quantity zero?

Solution:

\begin{align}
D(p(x,y,z)\,||\,p(x)p(y)p(z)) &= E\left[\log\frac{p(x,y,z)}{p(x)p(y)p(z)}\right] \tag{2.76}\\
&= E[\log p(x,y,z)] - E[\log p(x)] - E[\log p(y)] - E[\log p(z)] \tag{2.77}\\
&= -H(X,Y,Z) + H(X) + H(Y) + H(Z) \tag{2.78}
\end{align}

We have D(p(x, y, z)||p(x)p(y)p(z)) = 0 if and only if p(x, y, z) = p(x)p(y)p(z) for all (x, y, z) , i.e., if X , Y and Z are independent.

38. The value of a question. Let X ∼ p(x) , x = 1, 2, . . . , m . We are given a set S ⊆ {1, 2, . . . , m} . We ask whether X ∈ S and receive the answer

\[ Y = \begin{cases} 1, & \text{if } X \in S \\ 0, & \text{if } X \notin S. \end{cases} \]

Suppose Pr{X ∈ S} = α . Find the decrease in uncertainty H(X) −H(X|Y ) .

Apparently any set S with a given α is as good as any other.


Solution: The value of a question.

H(X) −H(X|Y ) = I(X;Y )

= H(Y ) −H(Y |X)

= H(α) −H(Y |X)

= H(α)

since H(Y |X) = 0 .

39. Entropy and pairwise independence. Let X, Y, Z be three binary Bernoulli(1/2) random variables that are pairwise independent, that is, I(X;Y ) = I(X;Z) = I(Y ;Z) = 0 .

(a) Under this constraint, what is the minimum value for H(X,Y,Z) ?

(b) Give an example achieving this minimum.

Solution:

(a)

H(X,Y,Z) = H(X,Y ) +H(Z|X,Y ) (2.79)

≥ H(X,Y ) (2.80)

= 2. (2.81)

So the minimum value for H(X,Y,Z) is at least 2. To show that it is actually equal to 2, we show in part (b) that this bound is attainable.

(b) Let X and Y be iid Bernoulli(1/2) and let Z = X ⊕ Y , where ⊕ denotes addition mod 2 (xor).

40. Discrete entropies

Let X and Y be two independent integer-valued random variables. Let X be uniformly distributed over {1, 2, . . . , 8} , and let Pr{Y = k} = 2^{-k} , k = 1, 2, 3, . . .

(a) Find H(X)

(b) Find H(Y )

(c) Find H(X + Y,X − Y ) .

Solution:

(a) For a uniform distribution, H(X) = logm = log 8 = 3 .

(b) For a geometric distribution, H(Y ) = \sum_k k\, 2^{-k} = 2. (See the solution to Problem 2.1.)


(c) Since (X,Y ) → (X + Y, X − Y ) is a one-to-one transformation, H(X + Y, X − Y ) = H(X,Y ) = H(X) + H(Y ) = 3 + 2 = 5 .
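A numerical check (added illustration, not part of the original solution): the uniform entropy is log 8 = 3 and, for this geometric distribution, H(Y) = Σ_k 2^{-k} · k = 2, giving H(X + Y, X − Y) = H(X,Y) = 5 bits for the independent pair.

    import math

    HX = math.log2(8)                                # X uniform on {1,...,8}
    # H(Y) = Σ_k P(Y=k) * (-log2 P(Y=k)) = Σ_k 2^{-k} * k  (truncated sum)
    HY = sum(2 ** -k * k for k in range(1, 60))
    print(HX, HY, HX + HY)                           # 3.0, 2.0, 5.0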

41. Random questions

One wishes to identify a random object X ∼ p(x) . A question Q ∼ r(q) is asked at random according to r(q) . This results in a deterministic answer A = A(x, q) ∈ {a1, a2, . . .} . Suppose X and Q are independent. Then I(X;Q,A) is the uncertainty in X removed by the question-answer (Q,A) .

(a) Show I(X;Q,A) = H(A|Q) . Interpret.

(b) Now suppose that two i.i.d. questions Q1, Q2 ∼ r(q) are asked, eliciting answers A1 and A2 . Show that two questions are less valuable than twice a single question in the sense that I(X;Q1, A1, Q2, A2) ≤ 2I(X;Q1, A1) .

Solution: Random questions.

(a)

I(X;Q,A) = H(Q,A) − H(Q,A|X)

= H(Q) +H(A|Q) −H(Q|X) −H(A|Q,X)

= H(Q) +H(A|Q) −H(Q)

= H(A|Q)

The interpretation is as follows. The uncertainty removed in X given (Q,A) is the same as the uncertainty in the answer given the question.

(b) Using the result from part (a) and the fact that the questions are independent, we can easily obtain the desired relationship.

\begin{align*}
I(X;Q_1, A_1, Q_2, A_2) &\stackrel{\rm (a)}{=} I(X;Q_1) + I(X;A_1|Q_1) + I(X;Q_2|A_1, Q_1) + I(X;A_2|A_1, Q_1, Q_2) \\
&\stackrel{\rm (b)}{=} I(X;A_1|Q_1) + H(Q_2|A_1, Q_1) - H(Q_2|X, A_1, Q_1) + I(X;A_2|A_1, Q_1, Q_2) \\
&\stackrel{\rm (c)}{=} I(X;A_1|Q_1) + I(X;A_2|A_1, Q_1, Q_2) \\
&= I(X;A_1|Q_1) + H(A_2|A_1, Q_1, Q_2) - H(A_2|X, A_1, Q_1, Q_2) \\
&\stackrel{\rm (d)}{=} I(X;A_1|Q_1) + H(A_2|A_1, Q_1, Q_2) \\
&\stackrel{\rm (e)}{\le} I(X;A_1|Q_1) + H(A_2|Q_2) \\
&\stackrel{\rm (f)}{=} 2 I(X;A_1|Q_1)
\end{align*}

(a) Chain rule.
(b) X and Q1 are independent.


(c) Q2 is independent of X , Q1 , and A1 .
(d) A2 is completely determined given Q2 and X .
(e) Conditioning decreases entropy.
(f) Result from part (a).

42. Inequalities. Which of the following inequalities are generally ≥, =, ≤? Label each with ≥, =, or ≤ .

(a) H(5X) vs. H(X)

(b) I(g(X);Y ) vs. I(X;Y )

(c) H(X0|X−1) vs. H(X0|X−1, X1)

(d) H(X1, X2, . . . , Xn) vs. H(c(X1, X2, . . . , Xn)) , where c(x1, x2, . . . , xn) is the Huffman codeword assigned to (x1, x2, . . . , xn) .

(e) H(X,Y )/(H(X) +H(Y )) vs. 1

Solution:

(a) X → 5X is a one to one mapping, and hence H(X) = H(5X) .

(b) By data processing inequality, I(g(X);Y ) ≤ I(X;Y ) .

(c) Because conditioning reduces entropy, H(X0|X−1) ≥ H(X0|X−1, X1) .

(d) The Huffman code is a one-to-one mapping, so H(X1, X2, . . . , Xn) = H(c(X1, X2, . . . , Xn)) .

(e) H(X,Y ) ≤ H(X) + H(Y ) , so H(X,Y )/(H(X) + H(Y )) ≤ 1 .

43. Mutual information of heads and tails.

(a) Consider a fair coin flip. What is the mutual information between the top sideand the bottom side of the coin?

(b) A 6-sided fair die is rolled. What is the mutual information between the top sideand the front face (the side most facing you)?

Solution:

Mutual information of heads and tails.

To prove (a) observe that

I(T ;B) = H(B) −H(B|T )

= log 2 = 1

since B ∼ Ber(1/2) , and B = f(T ) . Here B, T stand for Bottom and Top respectively.

To prove (b) note that having observed a side of the cube facing us F , there are four possibilities for the top T , which are equally probable. Thus,

I(T ;F ) = H(T ) −H(T |F )

= log 6 − log 4

= log 3 − 1

since T has uniform distribution on {1, 2, . . . , 6} .


44. Pure randomness

We wish to use a 3-sided coin to generate a fair coin toss. Let the coin X have probability mass function

\[ X = \begin{cases} A, & p_A \\ B, & p_B \\ C, & p_C \end{cases} \]

where pA, pB , pC are unknown.

(a) How would you use 2 independent flips X1, X2 to generate (if possible) a Bernoulli(1/2) random variable Z ?

(b) What is the resulting maximum expected number of fair bits generated?

Solution:

(a) The trick here is to notice that for any two letters Y and Z produced by two independent tosses of our bent three-sided coin, YZ has the same probability as ZY . So we can produce B ∼ Bernoulli(1/2) coin flips by letting B = 0 when we get AB , BC or AC , and B = 1 when we get BA , CB or CA (if we get AA , BB or CC we don't assign a value to B ).

(b) The expected number of bits generated by the above scheme is as follows. We get one bit, except when the two flips of the 3-sided coin produce the same symbol. So the expected number of fair bits generated is

0 ∗ [P (AA) + P (BB) + P (CC)] + 1 ∗ [1 − P (AA) − P (BB) − P (CC)], (2.82)

or, 1 − p_A^2 − p_B^2 − p_C^2 . (2.83)
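A simulation of this scheme (added illustration; the bias probabilities below are an arbitrary choice) shows that the generated bits are close to fair even though the three-sided coin is not, and that the yield matches 1 − p_A² − p_B² − p_C² bits per pair of flips.

    import random

    probs = {"A": 0.5, "B": 0.3, "C": 0.2}     # arbitrary "unknown" bias
    letters, weights = zip(*probs.items())

    def flip():
        return random.choices(letters, weights)[0]

    random.seed(0)
    bits = []
    trials = 200000
    for _ in range(trials):
        y, z = flip(), flip()
        if y != z:
            # P(YZ) = P(ZY), so AB/BC/AC -> 0 and BA/CB/CA -> 1 is a fair assignment
            bits.append(0 if y < z else 1)

    print(len(bits) / trials)       # ≈ 1 - 0.25 - 0.09 - 0.04 = 0.62
    print(sum(bits) / len(bits))    # ≈ 0.5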

45. Finite entropy. Show that for a discrete random variable X ∈ {1, 2, . . .} , if E log X < ∞ , then H(X) < ∞ .

Solution: Let the distribution on the integers be p1, p2, . . . . Then H(p) = −\sum p_i \log p_i and E log X = \sum p_i \log i = c < ∞ .

We will now find the maximum entropy distribution subject to the constraint on the expected logarithm. Using Lagrange multipliers or the results of Chapter 12, we have the following functional to optimize

\[ J(p) = -\sum p_i \log p_i - \lambda_1 \sum p_i - \lambda_2 \sum p_i \log i \tag{2.84} \]

Differentiating with respect to p_i and setting to zero, we find that the entropy-maximizing p_i is of the form p_i = a\, i^{\lambda} , where a = 1/(\sum_i i^{\lambda}) and \lambda is chosen to meet the expected-log constraint, i.e.,

\[ \sum_i i^{\lambda} \log i = c \sum_i i^{\lambda} \tag{2.85} \]

Page 41: Solution to Information Theory

40 Entropy, Relative Entropy and Mutual Information

46. Axiomatic definition of entropy. If we assume certain axioms for our measure of infor-mation, then we will be forced to use a logarithmic measure like entropy. Shannon usedthis to justify his initial definition of entropy. In this book, we will rely more on theother properties of entropy rather than its axiomatic derivation to justify its use. Thefollowing problem is considerably more difficult than the other problems in this section.

If a sequence of symmetric functions Hm(p1, p2, . . . , pm) satisfies the following proper-ties,

• Normalization: H2

(12 ,

12

)

= 1,

• Continuity: H2(p, 1 − p) is a continuous function of p ,

• Grouping: H_m(p_1, p_2, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2)\, H_2\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right),

prove that Hm must be of the form

Hm(p1, p2, . . . , pm) = −m∑

i=1

pi log pi, m = 2, 3, . . . . (2.86)

There are various other axiomatic formulations which also result in the same definitionof entropy. See, for example, the book by Csiszar and Korner[1].

Solution: Axiomatic definition of entropy. This is a long solution, so we will first outline what we plan to do. First we will extend the grouping axiom by induction and prove that

    Hm(p1, p2, . . . , pm) = Hm−k(p1 + p2 + · · · + pk, pk+1, . . . , pm)
                            + (p1 + p2 + · · · + pk) Hk( p1/(p1 + · · · + pk), . . . , pk/(p1 + · · · + pk) ).    (2.87)

Let f(m) be the entropy of a uniform distribution on m symbols, i.e.,

    f(m) = Hm(1/m, 1/m, . . . , 1/m).    (2.88)

We will then show that for any two integers r and s, f(rs) = f(r) + f(s). We use this to show that f(m) = log m. We then show for rational p = r/s that H2(p, 1 − p) = −p log p − (1 − p) log(1 − p). By continuity, we will extend it to irrational p and finally, by induction and grouping, we will extend the result to Hm for m ≥ 2.

To begin, we extend the grouping axiom. For convenience in notation, we will let

    Sk = ∑_{i=1}^{k} pi    (2.89)

and we will denote H2(q, 1 − q) as h(q). Then we can write the grouping axiom as

    Hm(p1, . . . , pm) = Hm−1(S2, p3, . . . , pm) + S2 h(p2/S2).    (2.90)


Applying the grouping axiom again, we have

    Hm(p1, . . . , pm) = Hm−1(S2, p3, . . . , pm) + S2 h(p2/S2)    (2.91)
                      = Hm−2(S3, p4, . . . , pm) + S3 h(p3/S3) + S2 h(p2/S2)    (2.92)
                        ...    (2.93)
                      = Hm−(k−1)(Sk, pk+1, . . . , pm) + ∑_{i=2}^{k} Si h(pi/Si).    (2.94)

Now, we apply the same grouping axiom repeatedly to Hk(p1/Sk, . . . , pk/Sk) to obtain

    Hk(p1/Sk, . . . , pk/Sk) = H2(Sk−1/Sk, pk/Sk) + ∑_{i=2}^{k−1} (Si/Sk) h( (pi/Sk)/(Si/Sk) )    (2.95)
                             = (1/Sk) ∑_{i=2}^{k} Si h(pi/Si).    (2.96)

From (2.94) and (2.96), it follows that

    Hm(p1, . . . , pm) = Hm−k(Sk, pk+1, . . . , pm) + Sk Hk(p1/Sk, . . . , pk/Sk),    (2.97)

which is the extended grouping axiom.

Now we need to use an axiom that is not explicitly stated in the text, namely that the function Hm is symmetric with respect to its arguments. Using this, we can combine any set of arguments of Hm using the extended grouping axiom.

Let f(m) denote Hm(1/m, 1/m, . . . , 1/m).

Consider

    f(mn) = Hmn(1/mn, 1/mn, . . . , 1/mn).    (2.98)

By repeatedly applying the extended grouping axiom, we have

    f(mn) = Hmn(1/mn, 1/mn, . . . , 1/mn)    (2.99)
          = Hmn−n(1/m, 1/mn, . . . , 1/mn) + (1/m) Hn(1/n, . . . , 1/n)    (2.100)
          = Hmn−2n(1/m, 1/m, 1/mn, . . . , 1/mn) + (2/m) Hn(1/n, . . . , 1/n)    (2.101)
            ...    (2.102)
          = Hm(1/m, . . . , 1/m) + Hn(1/n, . . . , 1/n)    (2.103)
          = f(m) + f(n).    (2.104)


We can immediately use this to conclude that f(m^k) = k f(m).

Now, we will argue that H2(1, 0) = h(1) = 0. We do this by expanding H3(p1, p2, 0) (with p1 + p2 = 1) in two different ways using the grouping axiom:

    H3(p1, p2, 0) = H2(p1, p2) + p2 H2(1, 0)    (2.105)
                  = H2(1, 0) + (p1 + p2) H2(p1, p2).    (2.106)

Thus p2 H2(1, 0) = H2(1, 0) for all p2, and therefore H2(1, 0) = 0.

We will also need to show that f(m + 1) − f(m) → 0 as m → ∞. To prove this, we use the extended grouping axiom and write

    f(m + 1) = Hm+1(1/(m+1), . . . , 1/(m+1))    (2.107)
             = h(1/(m+1)) + (m/(m+1)) Hm(1/m, . . . , 1/m)    (2.108)
             = h(1/(m+1)) + (m/(m+1)) f(m)    (2.109)

and therefore

    f(m + 1) − (m/(m+1)) f(m) = h(1/(m+1)).    (2.110)

Thus lim [ f(m + 1) − (m/(m+1)) f(m) ] = lim h(1/(m+1)). But by the continuity of H2, it follows that the limit on the right is h(0) = 0. Thus lim h(1/(m+1)) = 0.

Let us define

    an+1 = f(n + 1) − f(n)    (2.111)

and

    bn = h(1/n).    (2.112)

Then

    an+1 = −(1/(n+1)) f(n) + bn+1    (2.113)
         = −(1/(n+1)) ∑_{i=2}^{n} ai + bn+1    (2.114)

and therefore

    (n + 1) bn+1 = (n + 1) an+1 + ∑_{i=2}^{n} ai.    (2.115)

Therefore, summing over n, we have

    ∑_{n=2}^{N} n bn = ∑_{n=2}^{N} (n an + an−1 + . . . + a2) = N ∑_{n=2}^{N} an.    (2.116)


Dividing both sides by ∑_{n=1}^{N} n = N(N + 1)/2, we obtain

    (2/(N + 1)) ∑_{n=2}^{N} an = ( ∑_{n=2}^{N} n bn ) / ( ∑_{n=2}^{N} n ).    (2.117)

Now by continuity of H2 and the definition of bn, it follows that bn → 0 as n → ∞. Since the right hand side is essentially an average of the bn's, it also goes to 0 (this can be proved more precisely using ε's and δ's). Thus the left hand side goes to 0. We can then see that

    aN+1 = bN+1 − (1/(N + 1)) ∑_{n=2}^{N} an    (2.118)

also goes to 0 as N → ∞. Thus

    f(n + 1) − f(n) → 0  as n → ∞.    (2.119)

We will now prove the following lemma.

Lemma 2.0.1 Let the function f(m) satisfy the following assumptions:

• f(mn) = f(m) + f(n) for all integers m, n,

• lim_{n→∞} (f(n + 1) − f(n)) = 0,

• f(2) = 1.

Then f(m) = log2 m.

Proof of the lemma: Let P be an arbitrary prime number and let

    g(n) = f(n) − (f(P)/log2 P) log2 n.    (2.120)

Then g(n) satisfies the first assumption of the lemma. Also g(P) = 0.

Also, if we let

    αn = g(n + 1) − g(n) = f(n + 1) − f(n) + (f(P)/log2 P) log2 (n/(n + 1)),    (2.121)

then the second assumption in the lemma implies that lim αn = 0.

For an integer n, define

    n(1) = ⌊n/P⌋.    (2.122)

Then it follows that n(1) < n/P, and

    n = n(1) P + l,    (2.123)


where 0 ≤ l < P. From the fact that g(P) = 0, it follows that g(P n(1)) = g(n(1)), and

    g(n) = g(n(1)) + g(n) − g(P n(1)) = g(n(1)) + ∑_{i=P n(1)}^{n−1} αi.    (2.124)

Just as we have defined n(1) from n, we can define n(2) from n(1). Continuing this process, we can then write

    g(n) = g(n(k)) + ∑_{j=1}^{k} ( ∑_{i=P n(j)}^{n(j−1)−1} αi ).    (2.125)

Since n(k) ≤ n/P^k, after

    k = ⌊ log n / log P ⌋ + 1    (2.126)

terms, we have n(k) = 0, and g(0) = 0 (this follows directly from the additive property of g). Thus we can write

    g(n) = ∑_{i=1}^{tn} αi,    (2.127)

a sum of tn terms, where

    tn ≤ P ( log n / log P + 1 ).    (2.128)

Since αn → 0 and g(n) is the sum of at most O(log2 n) terms αi, it follows that g(n)/log2 n → 0. Thus

    lim_{n→∞} f(n)/log2 n = f(P)/log2 P.    (2.129)

Since P was arbitrary, it follows that f(P)/log2 P = c for every prime number P. Applying the third axiom in the lemma, it follows that the constant is 1, and f(P) = log2 P.

For composite numbers N = P1 P2 . . . Pl, we can apply the first property of f and the prime number factorization of N to show that

    f(N) = ∑ f(Pi) = ∑ log2 Pi = log2 N.    (2.130)

Thus the lemma is proved.

The lemma can be simplified considerably if, instead of the second assumption, we replace it by the assumption that f(n) is monotone in n. We will now argue that the only function f(m) such that f(mn) = f(m) + f(n) for all integers m, n is of the form f(m) = loga m for some base a.

Let c = f(2). Now f(4) = f(2 × 2) = f(2) + f(2) = 2c. Similarly, it is easy to see that f(2^k) = kc = c log2 2^k. We will extend this to integers that are not powers of 2.


For any integer m, let r > 0 be another integer and let 2^k ≤ m^r < 2^{k+1}. Then by the monotonicity assumption on f, we have

    kc ≤ r f(m) < (k + 1)c,    (2.131)

or

    c (k/r) ≤ f(m) < c (k + 1)/r.    (2.132)

Now by the monotonicity of log, we have

    k/r ≤ log2 m < (k + 1)/r.    (2.133)

Combining these two equations, we obtain

    | f(m)/c − log2 m | < 1/r.    (2.134)

Since r was arbitrary, we must have

    f(m) = c log2 m,    (2.135)

and we can identify c = 1 from the last assumption of the lemma.

Now we are almost done. We have shown that for any uniform distribution on m outcomes, f(m) = Hm(1/m, . . . , 1/m) = log2 m.

We will now show that

    H2(p, 1 − p) = −p log p − (1 − p) log(1 − p).    (2.136)

To begin, let p be a rational number, r/s, say. Consider the extended grouping axiom for Hs:

    f(s) = Hs(1/s, . . . , 1/s)
         = H( 1/s, . . . , 1/s (r times), (s − r)/s ) + ((s − r)/s) f(s − r)    (2.137)
         = H2( r/s, (s − r)/s ) + (r/s) f(r) + ((s − r)/s) f(s − r).    (2.138)

Substituting f(s) = log2 s, etc., we obtain

    H2( r/s, (s − r)/s ) = −(r/s) log2 (r/s) − ((s − r)/s) log2 ((s − r)/s).    (2.139)

Thus (2.136) is true for rational p. By the continuity assumption, (2.136) is also true at irrational p.

To complete the proof, we have to extend the definition from H2 to Hm, i.e., we have to show that

    Hm(p1, . . . , pm) = −∑ pi log pi    (2.140)


for all m. This is a straightforward induction. We have just shown that this is true for m = 2. Now assume that it is true for m = n − 1. By the grouping axiom,

    Hn(p1, . . . , pn) = Hn−1(p1 + p2, p3, . . . , pn)
                         + (p1 + p2) H2( p1/(p1 + p2), p2/(p1 + p2) )    (2.141)–(2.142)
                       = −(p1 + p2) log(p1 + p2) − ∑_{i=3}^{n} pi log pi
                         − p1 log( p1/(p1 + p2) ) − p2 log( p2/(p1 + p2) )    (2.143)–(2.144)
                       = −∑_{i=1}^{n} pi log pi.    (2.145)

Thus the statement is true for m = n, and by induction, it is true for all m. Thus we have finally proved that the only symmetric function that satisfies the axioms is

    Hm(p1, . . . , pm) = −∑_{i=1}^{m} pi log pi.    (2.146)

The proof above is due to Renyi [4].

47. The entropy of a missorted file.
A deck of n cards in order 1, 2, . . . , n is provided. One card is removed at random then replaced at random. What is the entropy of the resulting deck?

Solution: The entropy of a missorted file.

The heart of this problem is simply carefully counting the possible outcome states. There are n ways to choose which card gets mis-sorted, and, once the card is chosen, there are again n ways to choose where the card is replaced in the deck. Each of these shuffling actions has probability 1/n². Unfortunately, not all of these n² actions result in a unique mis-sorted file, so we need to carefully count the number of distinguishable outcome states. The resulting deck falls into exactly one of the following three cases.

• The selected card is at its original location after the replacement.

• The selected card is exactly one location away from its original location after the replacement.

• The selected card is at least two locations away from its original location after the replacement.

To compute the entropy of the resulting deck, we need to know the probability of each case.

Case 1 (resulting deck is the same as the original): There are n ways to achieve this outcome state, one for each of the n cards in the deck. Thus, the probability associated with case 1 is n/n² = 1/n.


Case 2 (adjacent pair swapping): There are n − 1 adjacent pairs, each of which will have a probability of 2/n², since for each pair there are two ways to achieve the swap, either by selecting the left-hand card and moving it one to the right, or by selecting the right-hand card and moving it one to the left.

Case 3 (typical situation): None of the remaining actions "collapses". They all result in unique outcome states, each with probability 1/n². Of the n² possible shuffling actions, n² − n − 2(n − 1) of them result in this third case (we've simply subtracted the case 1 and case 2 situations above).

The entropy of the resulting deck can be computed as follows:

    H(X) = (1/n) log n + (n − 1)(2/n²) log(n²/2) + (n² − 3n + 2)(1/n²) log(n²)
         = ((2n − 1)/n) log n − 2(n − 1)/n².
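The counting argument is easy to verify by brute force for a small deck. The sketch below (a check, not part of the solution itself) enumerates all n² remove-and-reinsert actions, computes the entropy of the resulting distribution on decks directly, and compares it with the closed-form expression above; it assumes the reinsertion can be at any of the n positions, which is the model used in the counting.

    from collections import Counter
    from math import log2

    def missorted_entropy(n):
        """Enumerate all n^2 (remove, reinsert) actions on the deck 1..n and
        return the entropy (in bits) of the resulting deck distribution."""
        deck = tuple(range(1, n + 1))
        counts = Counter()
        for i in range(n):                      # card removed from position i
            rest = deck[:i] + deck[i + 1:]
            for j in range(n):                  # reinserted at position j
                counts[rest[:j] + (deck[i],) + rest[j:]] += 1
        total = n * n
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def formula(n):
        return (2 * n - 1) / n * log2(n) - 2 * (n - 1) / (n * n)

    for n in (3, 5, 8):
        print(n, round(missorted_entropy(n), 6), round(formula(n), 6))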

48. Sequence length.
How much information does the length of a sequence give about the content of a sequence? Suppose we consider a Bernoulli(1/2) process {Xi}. Stop the process when the first 1 appears. Let N designate this stopping time. Thus X^N is an element of the set of all finite length binary sequences {0, 1}* = {0, 1, 00, 01, 10, 11, 000, . . .}.

(a) Find I(N; X^N).

(b) Find H(X^N | N).

(c) Find H(X^N).

Let's now consider a different stopping time. For this part, again assume Xi ∼ Bernoulli(1/2), but stop at time N = 6 with probability 1/3 and stop at time N = 12 with probability 2/3. Let this stopping time be independent of the sequence X1 X2 . . . X12.

(d) Find I(N; X^N).

(e) Find H(X^N | N).

(f) Find H(X^N).

Solution:

(a)

    I(X^N; N) = H(N) − H(N | X^N)
              = H(N) − 0
              (a)= E(N)
              = 2,

where (a) comes from the fact that the entropy of a Geometric(1/2) random variable equals its mean (both are 2).

(b) Since given N we know that Xi = 0 for all i < N and XN = 1,

    H(X^N | N) = 0.

(c)

    H(X^N) = I(X^N; N) + H(X^N | N)
           = I(X^N; N) + 0
           = 2.

(d)

    I(X^N; N) = H(N) − H(N | X^N)
              = H(N) − 0
              = HB(1/3).

(e)

    H(X^N | N) = (1/3) H(X^6 | N = 6) + (2/3) H(X^12 | N = 12)
               = (1/3) H(X^6) + (2/3) H(X^12)
               = (1/3) · 6 + (2/3) · 12
               = 10.

(f)

    H(X^N) = I(X^N; N) + H(X^N | N)
           = I(X^N; N) + 10
           = HB(1/3) + 10.


Chapter 3

The Asymptotic Equipartition Property

1. Markov's inequality and Chebyshev's inequality.

(a) (Markov's inequality.) For any non-negative random variable X and any t > 0, show that

    Pr{X ≥ t} ≤ EX/t.    (3.1)

Exhibit a random variable that achieves this inequality with equality.

(b) (Chebyshev's inequality.) Let Y be a random variable with mean µ and variance σ². By letting X = (Y − µ)², show that for any ε > 0,

    Pr{|Y − µ| > ε} ≤ σ²/ε².    (3.2)

(c) (The weak law of large numbers.) Let Z1, Z2, . . . , Zn be a sequence of i.i.d. random variables with mean µ and variance σ². Let Z̄n = (1/n) ∑_{i=1}^{n} Zi be the sample mean. Show that

    Pr{ |Z̄n − µ| > ε } ≤ σ²/(nε²).    (3.3)

Thus Pr{ |Z̄n − µ| > ε } → 0 as n → ∞. This is known as the weak law of large numbers.

Solution: Markov's inequality and Chebyshev's inequality.

(a) If X has distribution F(x),

    EX = ∫_0^∞ x dF
       = ∫_0^δ x dF + ∫_δ^∞ x dF
       ≥ ∫_δ^∞ x dF
       ≥ ∫_δ^∞ δ dF
       = δ Pr{X ≥ δ}.

Rearranging sides and dividing by δ we get

    Pr{X ≥ δ} ≤ EX/δ.    (3.4)

One student gave a proof based on conditional expectations. It goes like

    EX = E(X | X ≥ δ) Pr{X ≥ δ} + E(X | X < δ) Pr{X < δ}
       ≥ E(X | X ≥ δ) Pr{X ≥ δ}
       ≥ δ Pr{X ≥ δ},

which leads to (3.4) as well.

Given δ, the distribution achieving

    Pr{X ≥ δ} = EX/δ

is

    X = { δ  with probability µ/δ
          0  with probability 1 − µ/δ,

where µ ≤ δ.

(b) Letting X = (Y − µ)² in Markov's inequality,

    Pr{(Y − µ)² > ε²} ≤ Pr{(Y − µ)² ≥ ε²}
                      ≤ E(Y − µ)²/ε²
                      = σ²/ε²,

and noticing that Pr{(Y − µ)² > ε²} = Pr{|Y − µ| > ε}, we get

    Pr{|Y − µ| > ε} ≤ σ²/ε².

(c) Letting Y in Chebyshev's inequality from part (b) equal Z̄n, and noticing that EZ̄n = µ and Var(Z̄n) = σ²/n (i.e., Z̄n is the sum of n i.i.d. random variables Zi/n, each with variance σ²/n²), we have

    Pr{|Z̄n − µ| > ε} ≤ σ²/(nε²).


2. AEP and mutual information. Let (Xi, Yi) be i.i.d. ∼ p(x, y). We form the log likelihood ratio of the hypothesis that X and Y are independent vs. the hypothesis that X and Y are dependent. What is the limit of

    (1/n) log [ p(X^n) p(Y^n) / p(X^n, Y^n) ] ?

Solution:

    (1/n) log [ p(X^n) p(Y^n) / p(X^n, Y^n) ] = (1/n) log ∏_{i=1}^{n} [ p(Xi) p(Yi) / p(Xi, Yi) ]
                                              = (1/n) ∑_{i=1}^{n} log [ p(Xi) p(Yi) / p(Xi, Yi) ]
                                              → E log [ p(Xi) p(Yi) / p(Xi, Yi) ]
                                              = −I(X; Y).

Thus p(X^n) p(Y^n) / p(X^n, Y^n) → 2^{−nI(X;Y)}, which will converge to 1 if X and Y are indeed independent.

3. Piece of cake
A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions:

    P = { (2/3, 1/3)  w.p. 3/4
          (2/5, 3/5)  w.p. 1/4 }

Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 3/5. Cutting and choosing from this piece might reduce it to size (3/5)(2/3) at time 2, and so on.

How large, to first order in the exponent, is the piece of cake after n cuts?

Solution: Let Ci be the fraction of the current piece of cake that is retained at the i-th cut, and let Tn be the fraction of cake left after n cuts. Then we have Tn = C1 C2 · · · Cn = ∏_{i=1}^{n} Ci. Hence, as in Question 2 of Homework Set #3,

    lim (1/n) log Tn = lim (1/n) ∑_{i=1}^{n} log Ci
                     = E[log C1]
                     = (3/4) log(2/3) + (1/4) log(3/5).

Thus, to first order in the exponent, the piece of cake after n cuts has size about 2^{n E[log C1]}.
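A quick simulation makes the concentration visible; this sketch (a check, not part of the solution) compares the empirical value of (1/n) log2 Tn for one long run with E[log2 C1].

    import random
    from math import log2

    def simulate(n, seed=1):
        """Return (1/n) log2 Tn for one run of n cut-and-keep steps."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n):
            c = 2/3 if rng.random() < 3/4 else 3/5   # fraction kept at this cut
            total += log2(c)
        return total / n

    expected = 0.75 * log2(2/3) + 0.25 * log2(3/5)
    print("E[log2 C1]             :", expected)
    print("(1/n) log2 Tn, n = 1e5 :", simulate(100000))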


4. AEP
Let Xi be i.i.d. ∼ p(x), x ∈ {1, 2, . . . , m}. Let µ = EX and H = −∑ p(x) log p(x). Let A^n = {x^n ∈ X^n : |−(1/n) log p(x^n) − H| ≤ ε} and B^n = {x^n ∈ X^n : |(1/n) ∑_{i=1}^{n} xi − µ| ≤ ε}.

(a) Does Pr{X^n ∈ A^n} → 1?

(b) Does Pr{X^n ∈ A^n ∩ B^n} → 1?

(c) Show |A^n ∩ B^n| ≤ 2^{n(H+ε)}, for all n.

(d) Show |A^n ∩ B^n| ≥ (1/2) 2^{n(H−ε)}, for n sufficiently large.

Solution:

(a) Yes, by the AEP for discrete random variables the probability that X^n is typical goes to 1.

(b) Yes, by the Strong Law of Large Numbers, Pr(X^n ∈ B^n) → 1. So for any ε > 0 there exists N1 such that Pr(X^n ∈ A^n) > 1 − ε/2 for all n > N1, and there exists N2 such that Pr(X^n ∈ B^n) > 1 − ε/2 for all n > N2. So for all n > max(N1, N2):

    Pr(X^n ∈ A^n ∩ B^n) = Pr(X^n ∈ A^n) + Pr(X^n ∈ B^n) − Pr(X^n ∈ A^n ∪ B^n)
                        > 1 − ε/2 + 1 − ε/2 − 1
                        = 1 − ε.

So for any ε > 0 there exists N = max(N1, N2) such that Pr(X^n ∈ A^n ∩ B^n) > 1 − ε for all n > N, and therefore Pr(X^n ∈ A^n ∩ B^n) → 1.

(c) By the law of total probability, ∑_{x^n ∈ A^n ∩ B^n} p(x^n) ≤ 1. Also, for x^n ∈ A^n, from Theorem 3.1.2 in the text, p(x^n) ≥ 2^{−n(H+ε)}. Combining these two gives 1 ≥ ∑_{x^n ∈ A^n ∩ B^n} p(x^n) ≥ ∑_{x^n ∈ A^n ∩ B^n} 2^{−n(H+ε)} = |A^n ∩ B^n| 2^{−n(H+ε)}. Multiplying through by 2^{n(H+ε)} gives the result |A^n ∩ B^n| ≤ 2^{n(H+ε)}.

(d) Since from (b) Pr{X^n ∈ A^n ∩ B^n} → 1, there exists N such that Pr{X^n ∈ A^n ∩ B^n} ≥ 1/2 for all n > N. From Theorem 3.1.2 in the text, for x^n ∈ A^n, p(x^n) ≤ 2^{−n(H−ε)}. So combining these two gives 1/2 ≤ ∑_{x^n ∈ A^n ∩ B^n} p(x^n) ≤ ∑_{x^n ∈ A^n ∩ B^n} 2^{−n(H−ε)} = |A^n ∩ B^n| 2^{−n(H−ε)}. Multiplying through by 2^{n(H−ε)} gives the result |A^n ∩ B^n| ≥ (1/2) 2^{n(H−ε)} for n sufficiently large.

5. Sets defined by probabilities.
Let X1, X2, . . . be an i.i.d. sequence of discrete random variables with entropy H(X). Let

    Cn(t) = {x^n ∈ X^n : p(x^n) ≥ 2^{−nt}}

denote the subset of n-sequences with probabilities ≥ 2^{−nt}.

(a) Show |Cn(t)| ≤ 2^{nt}.

(b) For what values of t does P({X^n ∈ Cn(t)}) → 1?

Solution:

(a) Since the total probability of all sequences is less than 1, |Cn(t)| min_{x^n ∈ Cn(t)} p(x^n) ≤ 1, and hence |Cn(t)| ≤ 2^{nt}.

(b) Since −(1/n) log p(x^n) → H, if t < H the probability that p(x^n) ≥ 2^{−nt} goes to 0, and if t > H the probability goes to 1.

6. An AEP-like limit. Let X1, X2, . . . be i.i.d. drawn according to probability mass function p(x). Find

    lim_{n→∞} [p(X1, X2, . . . , Xn)]^{1/n}.

Solution: An AEP-like limit. X1, X2, . . . are i.i.d. ∼ p(x). Hence log p(Xi) are also i.i.d. and

    lim (p(X1, X2, . . . , Xn))^{1/n} = lim 2^{(1/n) log p(X1, X2, . . . , Xn)}
                                     = 2^{lim (1/n) ∑ log p(Xi)}  a.e.
                                     = 2^{E log p(X)}  a.e.
                                     = 2^{−H(X)}  a.e.

by the strong law of large numbers (assuming of course that H(X) exists).

7. The AEP and source coding. A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 0.995. The digits are taken 100 at a time and a binary codeword is provided for every sequence of 100 digits containing three or fewer ones.

(a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer ones.

(b) Calculate the probability of observing a source sequence for which no codeword has been assigned.

(c) Use Chebyshev's inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b).

Solution: The AEP and source coding.

(a) The number of 100-bit binary sequences with three or fewer ones is

    C(100,0) + C(100,1) + C(100,2) + C(100,3) = 1 + 100 + 4950 + 161700 = 166751.

The required codeword length is ⌈log2 166751⌉ = 18. (Note that H(0.005) = 0.0454, so 18 is quite a bit larger than the 100 × 0.0454 = 4.54 bits of entropy of the whole sequence.)

(b) The probability that a 100-bit sequence has three or fewer ones is

    ∑_{i=0}^{3} C(100,i) (0.005)^i (0.995)^{100−i} = 0.60577 + 0.30441 + 0.07572 + 0.01243 = 0.99833.

Thus the probability that the sequence that is generated cannot be encoded is 1 − 0.99833 = 0.00167.

(c) In the case of a random variable Sn that is the sum of n i.i.d. random variables X1, X2, . . . , Xn, Chebyshev's inequality states that

    Pr(|Sn − nµ| ≥ ε) ≤ nσ²/ε²,

where µ and σ² are the mean and variance of Xi. (Therefore nµ and nσ² are the mean and variance of Sn.) In this problem, n = 100, µ = 0.005, and σ² = (0.005)(0.995). Note that S100 ≥ 4 if and only if |S100 − 100(0.005)| ≥ 3.5, so we should choose ε = 3.5. Then

    Pr(S100 ≥ 4) ≤ 100(0.005)(0.995)/(3.5)² ≈ 0.04061.

This bound is much larger than the actual probability 0.00167.
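The counts and probabilities above are easy to reproduce; the following short check (not part of the solution itself) prints the same quantities.

    from math import comb, ceil, log2

    n, p = 100, 0.005
    count = sum(comb(n, k) for k in range(4))            # sequences with <= 3 ones
    prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(4))
    print("number of codewords needed:", count)           # 166751
    print("codeword length:", ceil(log2(count)))          # 18
    print("P(no codeword assigned):", 1 - prob)           # about 0.00167
    print("Chebyshev bound:", n * p * (1 - p) / 3.5**2)   # about 0.0406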

8. Products. Let

    X = { 1  w.p. 1/2
          2  w.p. 1/4
          3  w.p. 1/4 }

Let X1, X2, . . . be drawn i.i.d. according to this distribution. Find the limiting behavior of the product

    (X1 X2 · · · Xn)^{1/n}.

Solution: Products. Let

    Pn = (X1 X2 . . . Xn)^{1/n}.    (3.5)

Then

    log Pn = (1/n) ∑_{i=1}^{n} log Xi → E log X,    (3.6)

with probability 1, by the strong law of large numbers. Thus Pn → 2^{E log X} with probability 1. We can easily calculate E log X = (1/2) log 1 + (1/4) log 2 + (1/4) log 3 = (1/4) log 6, and therefore Pn → 2^{(1/4) log 6} = 1.565.
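A quick simulation (a numerical check, not part of the solution) of the limiting geometric mean:

    import random
    from math import log2

    rng = random.Random(0)
    n = 200000
    log_sum = 0.0
    for _ in range(n):
        x = rng.choices([1, 2, 3], weights=[0.5, 0.25, 0.25])[0]
        log_sum += log2(x)
    print("empirical (X1...Xn)^(1/n):", 2 ** (log_sum / n))
    print("theoretical 2^(E log X)  :", 2 ** (0.25 * log2(6)))   # about 1.565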

9. AEP. Let X1, X2, . . . be independent identically distributed random variables drawn according to the probability mass function p(x), x ∈ {1, 2, . . . , m}. Thus p(x1, x2, . . . , xn) = ∏_{i=1}^{n} p(xi). We know that −(1/n) log p(X1, X2, . . . , Xn) → H(X) in probability. Let q(x1, x2, . . . , xn) = ∏_{i=1}^{n} q(xi), where q is another probability mass function on {1, 2, . . . , m}.

(a) Evaluate lim −(1/n) log q(X1, X2, . . . , Xn), where X1, X2, . . . are i.i.d. ∼ p(x).

(b) Now evaluate the limit of the log likelihood ratio (1/n) log [ q(X1, . . . , Xn)/p(X1, . . . , Xn) ] when X1, X2, . . . are i.i.d. ∼ p(x). Thus the odds favoring q are exponentially small when p is true.

Solution: (AEP).

(a) Since X1, X2, . . . , Xn are i.i.d., so are q(X1), q(X2), . . . , q(Xn), and hence we can apply the strong law of large numbers to obtain

    lim −(1/n) log q(X1, X2, . . . , Xn) = lim −(1/n) ∑ log q(Xi)    (3.7)
                                         = −E(log q(X))  w.p. 1    (3.8)
                                         = −∑ p(x) log q(x)    (3.9)
                                         = ∑ p(x) log [p(x)/q(x)] − ∑ p(x) log p(x)    (3.10)
                                         = D(p||q) + H(p).    (3.11)

(b) Again, by the strong law of large numbers,

    lim −(1/n) log [ q(X1, X2, . . . , Xn)/p(X1, X2, . . . , Xn) ] = lim −(1/n) ∑ log [ q(Xi)/p(Xi) ]    (3.12)
                                                                  = −E( log [q(X)/p(X)] )  w.p. 1    (3.13)
                                                                  = −∑ p(x) log [q(x)/p(x)]    (3.14)
                                                                  = ∑ p(x) log [p(x)/q(x)]    (3.15)
                                                                  = D(p||q).    (3.16)

10. Random box size. An n-dimensional rectangular box with sides X1, X2, X3, . . . , Xn is to be constructed. The volume is Vn = ∏_{i=1}^{n} Xi. The edge length l of an n-cube with the same volume as the random box is l = Vn^{1/n}. Let X1, X2, . . . be i.i.d. uniform random variables over the unit interval [0, 1]. Find lim_{n→∞} Vn^{1/n}, and compare to (EVn)^{1/n}. Clearly the expected edge length does not capture the idea of the volume of the box. The geometric mean, rather than the arithmetic mean, characterizes the behavior of products.

Solution: Random box size. The volume Vn = ∏_{i=1}^{n} Xi is a random variable, since the Xi are random variables uniformly distributed on [0, 1]. Vn tends to 0 as n → ∞. However,

    loge Vn^{1/n} = (1/n) loge Vn = (1/n) ∑ loge Xi → E(loge X)  a.e.

by the Strong Law of Large Numbers, since the Xi and loge(Xi) are i.i.d. and E(loge X) is finite. Now

    E(loge Xi) = ∫_0^1 loge x dx = −1.

Hence, since e^x is a continuous function,

    lim_{n→∞} Vn^{1/n} = e^{lim_{n→∞} (1/n) loge Vn} = 1/e < 1/2.

Thus the "effective" edge length of this solid is e^{−1}. Note that since the Xi's are independent, E(Vn) = ∏ E(Xi) = (1/2)^n. Also, 1/2 is the arithmetic mean of the random variable, and 1/e is the geometric mean.
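A small simulation (a check, not part of the solution) of Vn^{1/n} for uniform edge lengths:

    import random
    from math import log, exp

    rng = random.Random(0)
    n = 100000
    # log Vn = sum of log Xi; use 1 - random() so the argument is in (0, 1]
    log_v = sum(log(1.0 - rng.random()) for _ in range(n))
    print("Vn^(1/n)    :", exp(log_v / n))   # close to 1/e = 0.3679
    print("(E Vn)^(1/n):", 0.5)              # arithmetic-mean edge length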

11. Proof of Theorem 3.3.1. This problem shows that the size of the smallest "probable" set is about 2^{nH}. Let X1, X2, . . . , Xn be i.i.d. ∼ p(x). Let B^(n)_δ ⊂ X^n be such that Pr(B^(n)_δ) > 1 − δ. Fix ε < 1/2.

(a) Given any two sets A, B such that Pr(A) > 1 − ε1 and Pr(B) > 1 − ε2, show that Pr(A ∩ B) > 1 − ε1 − ε2. Hence Pr(A^(n)_ε ∩ B^(n)_δ) ≥ 1 − ε − δ.

(b) Justify the steps in the chain of inequalities

    1 − ε − δ ≤ Pr(A^(n)_ε ∩ B^(n)_δ)    (3.17)
              = ∑_{A^(n)_ε ∩ B^(n)_δ} p(x^n)    (3.18)
              ≤ ∑_{A^(n)_ε ∩ B^(n)_δ} 2^{−n(H−ε)}    (3.19)
              = |A^(n)_ε ∩ B^(n)_δ| 2^{−n(H−ε)}    (3.20)
              ≤ |B^(n)_δ| 2^{−n(H−ε)}.    (3.21)

(c) Complete the proof of the theorem.

Solution: Proof of Theorem 3.3.1.

(a) Let A^c denote the complement of A. Then

    P(A^c ∪ B^c) ≤ P(A^c) + P(B^c).    (3.22)

Since P(A) ≥ 1 − ε1, P(A^c) ≤ ε1. Similarly, P(B^c) ≤ ε2. Hence

    P(A ∩ B) = 1 − P(A^c ∪ B^c)    (3.23)
             ≥ 1 − P(A^c) − P(B^c)    (3.24)
             ≥ 1 − ε1 − ε2.    (3.25)

(b) To complete the proof, we have the following chain of inequalities:

    1 − ε − δ (a)≤ Pr(A^(n)_ε ∩ B^(n)_δ)    (3.26)
              (b)= ∑_{A^(n)_ε ∩ B^(n)_δ} p(x^n)    (3.27)
              (c)≤ ∑_{A^(n)_ε ∩ B^(n)_δ} 2^{−n(H−ε)}    (3.28)
              (d)= |A^(n)_ε ∩ B^(n)_δ| 2^{−n(H−ε)}    (3.29)
              (e)≤ |B^(n)_δ| 2^{−n(H−ε)},    (3.30)

where (a) follows from the previous part, (b) follows by definition of the probability of a set, (c) follows from the fact that the probability of an element of the typical set is bounded above by 2^{−n(H−ε)}, (d) from the definition of |A^(n)_ε ∩ B^(n)_δ| as the cardinality of the set A^(n)_ε ∩ B^(n)_δ, and (e) from the fact that A^(n)_ε ∩ B^(n)_δ ⊆ B^(n)_δ.

12. Monotonic convergence of the empirical distribution. Let p̂n denote the empirical probability mass function corresponding to X1, X2, . . . , Xn i.i.d. ∼ p(x), x ∈ X. Specifically,

    p̂n(x) = (1/n) ∑_{i=1}^{n} I(Xi = x)

is the proportion of times that Xi = x in the first n samples, where I is the indicator function.

(a) Show for X binary that

    E D(p̂2n ‖ p) ≤ E D(p̂n ‖ p).

Thus the expected relative entropy "distance" from the empirical distribution to the true distribution decreases with sample size. Hint: Write p̂2n = (1/2) p̂n + (1/2) p̂′n and use the convexity of D.

(b) Show for an arbitrary discrete X that

    E D(p̂n ‖ p) ≤ E D(p̂n−1 ‖ p).

Hint: Write p̂n as the average of n empirical mass functions with each of the n samples deleted in turn.

Solution: Monotonic convergence of the empirical distribution.

(a) Note that

    p̂2n(x) = (1/2n) ∑_{i=1}^{2n} I(Xi = x)
           = (1/2) (1/n) ∑_{i=1}^{n} I(Xi = x) + (1/2) (1/n) ∑_{i=n+1}^{2n} I(Xi = x)
           = (1/2) p̂n(x) + (1/2) p̂′n(x).

Using convexity of D(p||q) we have that

    D(p̂2n||p) = D( (1/2) p̂n + (1/2) p̂′n || (1/2) p + (1/2) p )
              ≤ (1/2) D(p̂n||p) + (1/2) D(p̂′n||p).

Taking expectations and using the fact that the Xi's are identically distributed, we get

    E D(p̂2n||p) ≤ E D(p̂n||p).


(b) The trick to this part is similar to part (a) and involves rewriting p̂n in terms of p̂n−1. We see that

    p̂n = (1/n) ∑_{i=1}^{n−1} I(Xi = x) + I(Xn = x)/n,

or in general,

    p̂n = (1/n) ∑_{i ≠ j} I(Xi = x) + I(Xj = x)/n,

where j ranges from 1 to n.

Summing over j we get

    n p̂n = ((n − 1)/n) ∑_{j=1}^{n} p̂^j_{n−1} + p̂n,

or

    p̂n = (1/n) ∑_{j=1}^{n} p̂^j_{n−1},

where

    p̂^j_{n−1} = (1/(n − 1)) ∑_{i ≠ j} I(Xi = x).

Again using the convexity of D(p||q) and the fact that the D(p̂^j_{n−1}||p) are identically distributed for all j and hence have the same expected value, we obtain the final result.

13. Calculation of typical set. To clarify the notion of a typical set A^(n)_ε and the smallest set of high probability B^(n)_δ, we will calculate the set for a simple example. Consider a sequence of i.i.d. binary random variables, X1, X2, . . . , Xn, where the probability that Xi = 1 is 0.6 (and therefore the probability that Xi = 0 is 0.4).

(a) Calculate H(X).

(b) With n = 25 and ε = 0.1, which sequences fall in the typical set A^(n)_ε? What is the probability of the typical set? How many elements are there in the typical set? (This involves computation of a table of probabilities for sequences with k 1's, 0 ≤ k ≤ 25, and finding those sequences that are in the typical set.)


The following table lists, for each possible number k of 1's in a sequence of length 25, the number of such sequences and the value of −(1/25) log p(x^25); the binomial probabilities (25 k)(0.6)^k(0.4)^{25−k} and their cumulative sums F(k), which are used below, follow directly from these entries.

    k      (25 k)      −(1/25) log p(x^25)
    0      1           1.321928
    1      25          1.298530
    2      300         1.275131
    3      2300        1.251733
    4      12650       1.228334
    5      53130       1.204936
    6      177100      1.181537
    7      480700      1.158139
    8      1081575     1.134740
    9      2042975     1.111342
    10     3268760     1.087943
    11     4457400     1.064545
    12     5200300     1.041146
    13     5200300     1.017748
    14     4457400     0.994349
    15     3268760     0.970951
    16     2042975     0.947552
    17     1081575     0.924154
    18     480700      0.900755
    19     177100      0.877357
    20     53130       0.853958
    21     12650       0.830560
    22     2300        0.807161
    23     300         0.783763
    24     25          0.760364
    25     1           0.736966

(c) How many elements are there in the smallest set that has probability 0.9?

(d) How many elements are there in the intersection of the sets in parts (b) and (c)? What is the probability of this intersection?

Solution:

(a) H(X) = −0.6 log 0.6 − 0.4 log 0.4 = 0.97095 bits.

(b) By definition, A^(n)_ε for ε = 0.1 is the set of sequences such that −(1/n) log p(x^n) lies in the range (H(X) − ε, H(X) + ε), i.e., in the range (0.87095, 1.07095). Examining the last column of the table, it is easy to see that the typical set is the set of all sequences with k, the number of ones, lying between 11 and 19.

The probability of the typical set can be calculated from the cumulative binomial probabilities. The probability that the number of 1's lies between 11 and 19 is equal to F(19) − F(10) = 0.970638 − 0.034392 = 0.936246. Note that this is greater than 1 − ε, i.e., n is large enough for the probability of the typical set to be greater than 1 − ε.

The number of elements in the typical set can be found by summing the binomial coefficients:

    |A^(n)_ε| = ∑_{k=11}^{19} (25 k) = ∑_{k=0}^{19} (25 k) − ∑_{k=0}^{10} (25 k) = 33486026 − 7119516 = 26366510.    (3.31)

Note that the upper and lower bounds for the size of A^(n)_ε can be calculated as 2^{n(H+ε)} = 2^{25(0.97095+0.1)} = 2^{26.77} = 1.147365 × 10^8, and (1 − ε) 2^{n(H−ε)} = 0.9 × 2^{25(0.97095−0.1)} = 0.9 × 2^{21.9875} = 3742308. Both bounds are very loose!

(c) To find the smallest set B^(n)_δ of probability 0.9, we can imagine that we are filling a bag with pieces such that we want to reach a certain weight with the minimum number of pieces. To minimize the number of pieces that we use, we should use the largest possible pieces. In this case, that corresponds to using the sequences with the highest probability.

Thus we keep putting the high probability sequences into this set until we reach a total probability of 0.9. The probability of an individual sequence, p(x^n) = (0.6)^k (0.4)^{25−k}, increases monotonically with k, so the set consists of sequences with k = 25, 24, . . . , until we have a total probability of 0.9.

Using the cumulative probabilities, it follows that the set B^(n)_δ consists of all sequences with k ≥ 13 and some sequences with k = 12. The sequences with k ≥ 13 provide a total probability of 1 − 0.153768 = 0.846232 to the set B^(n)_δ. The remaining probability of 0.9 − 0.846232 = 0.053768 should come from sequences with k = 12. The number of such sequences needed to fill this probability is at least 0.053768/p(x^n) = 0.053768/(1.460813 × 10^{−8}) = 3680690.1, which we round up to 3680691. Thus the smallest set with probability 0.9 has 33554432 − 16777216 + 3680691 = 20457907 sequences. Note that the set B^(n)_δ is not uniquely defined, since it could include any 3680691 sequences with k = 12. However, the size of the smallest set is a well defined number.

(d) The intersection of the sets A^(n)_ε and B^(n)_δ in parts (b) and (c) consists of all sequences with k between 13 and 19, and 3680691 sequences with k = 12. The probability of this intersection is 0.970638 − 0.153768 + 0.053768 = 0.870638, and the size of this intersection is 33486026 − 16777216 + 3680691 = 20389501.
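All of the quantities above can be regenerated directly. The following sketch (a check, not part of the original solution) recomputes, for each k, the binomial coefficient, the probability of exactly k ones, the cumulative probability F(k) and −(1/25) log2 p(x^25), and then the three counts found in parts (b), (c) and (d).

    from math import comb, log2, ceil

    n, p, eps = 25, 0.6, 0.1
    H = -p * log2(p) - (1 - p) * log2(1 - p)

    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    cdf = [sum(pmf[:k + 1]) for k in range(n + 1)]
    rate = [-(k * log2(p) + (n - k) * log2(1 - p)) / n for k in range(n + 1)]

    for k in range(n + 1):
        print(k, comb(n, k), round(pmf[k], 6), round(cdf[k], 6), round(rate[k], 6))

    typical = [k for k in range(n + 1) if abs(rate[k] - H) <= eps]
    print("typical k range :", typical[0], "to", typical[-1])
    print("P(typical set)  :", sum(pmf[k] for k in typical))
    print("|typical set|   :", sum(comb(n, k) for k in typical))

    # smallest set of probability 0.9: add sequences in order of decreasing
    # individual probability, i.e. decreasing k
    need, size = 0.9, 0
    for k in range(n, -1, -1):
        seq_prob = p**k * (1 - p)**(n - k)
        if need >= pmf[k]:
            need -= pmf[k]
            size += comb(n, k)
        else:
            size += ceil(need / seq_prob)
            break
    print("|smallest set of prob 0.9|:", size)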

Chapter 4

Entropy Rates of a Stochastic Process

1. Doubly stochastic matrices. An n × n matrix P = [Pij] is said to be doubly stochastic if Pij ≥ 0 and ∑_j Pij = 1 for all i and ∑_i Pij = 1 for all j. An n × n matrix P is said to be a permutation matrix if it is doubly stochastic and there is precisely one Pij = 1 in each row and each column.

It can be shown that every doubly stochastic matrix can be written as a convex combination of permutation matrices.

(a) Let a^t = (a1, a2, . . . , an), ai ≥ 0, ∑ ai = 1, be a probability vector. Let b = aP, where P is doubly stochastic. Show that b is a probability vector and that H(b1, b2, . . . , bn) ≥ H(a1, a2, . . . , an). Thus stochastic mixing increases entropy.

(b) Show that a stationary distribution µ for a doubly stochastic matrix P is the uniform distribution.

(c) Conversely, prove that if the uniform distribution is a stationary distribution for a Markov transition matrix P, then P is doubly stochastic.

Solution: Doubly Stochastic Matrices.

(a)

    H(b) − H(a) = −∑_j bj log bj + ∑_i ai log ai    (4.1)
                = −∑_j ∑_i ai Pij log(∑_k ak Pkj) + ∑_i ai log ai    (4.2)
                = ∑_i ∑_j ai Pij log [ ai / (∑_k ak Pkj) ]    (4.3)
                ≥ ( ∑_{i,j} ai Pij ) log ( ∑_{i,j} ai / ∑_{i,j} bj )    (4.4)
                = 1 · log(m/m)    (4.5)
                = 0,    (4.6)

where the inequality follows from the log sum inequality.

(b) If the matrix is doubly stochastic, then substituting µi = 1/m, we can easily check that it satisfies µ = µP.

(c) If the uniform distribution is a stationary distribution, then

    1/m = µi = ∑_j µj Pji = (1/m) ∑_j Pji,    (4.7)

or ∑_j Pji = 1, i.e., the matrix is doubly stochastic.

2. Time's arrow. Let {Xi}_{i=−∞}^{∞} be a stationary stochastic process. Prove that

    H(X0|X−1, X−2, . . . , X−n) = H(X0|X1, X2, . . . , Xn).

In other words, the present has a conditional entropy given the past equal to the conditional entropy given the future.

This is true even though it is quite easy to concoct stationary random processes for which the flow into the future looks quite different from the flow into the past. That is to say, one can determine the direction of time by looking at a sample function of the process. Nonetheless, given the present state, the conditional uncertainty of the next symbol in the future is equal to the conditional uncertainty of the previous symbol in the past.

Solution: Time's arrow. By the chain rule for entropy,

    H(X0|X−1, . . . , X−n) = H(X0, X−1, . . . , X−n) − H(X−1, . . . , X−n)    (4.8)
                           = H(X0, X1, X2, . . . , Xn) − H(X1, X2, . . . , Xn)    (4.9)
                           = H(X0|X1, X2, . . . , Xn),    (4.10)

where (4.9) follows from stationarity.

3. Shuffles increase entropy. Argue that for any distribution on shuffles T and any distribution on card positions X that

    H(TX) ≥ H(TX|T)    (4.11)
          = H(T^{−1}TX|T)    (4.12)
          = H(X|T)    (4.13)
          = H(X),    (4.14)

if X and T are independent.

Solution: Shuffles increase entropy.

    H(TX) ≥ H(TX|T)    (4.15)
          = H(T^{−1}TX|T)    (4.16)
          = H(X|T)    (4.17)
          = H(X).    (4.18)

The inequality follows from the fact that conditioning reduces entropy, and the first equality follows from the fact that given T, we can reverse the shuffle.

4. Second law of thermodynamics. Let X1, X2, X3, . . . be a stationary first-order Markov chain. In Section 4.4, it was shown that H(Xn | X1) ≥ H(Xn−1 | X1) for n = 2, 3, . . . . Thus conditional uncertainty about the future grows with time. This is true although the unconditional uncertainty H(Xn) remains constant. However, show by example that H(Xn|X1 = x1) does not necessarily grow with n for every x1.

Solution: Second law of thermodynamics.

    H(Xn|X1) ≥ H(Xn|X1, X2)  (conditioning reduces entropy)    (4.19)
             = H(Xn|X2)  (by Markovity)    (4.20)
             = H(Xn−1|X1)  (by stationarity)    (4.21)

Alternatively, by an application of the data processing inequality to the Markov chain X1 → Xn−1 → Xn, we have

    I(X1; Xn−1) ≥ I(X1; Xn).    (4.22)

Expanding the mutual informations in terms of entropies, we have

    H(Xn−1) − H(Xn−1|X1) ≥ H(Xn) − H(Xn|X1).    (4.23)

By stationarity, H(Xn−1) = H(Xn), and hence we have

    H(Xn−1|X1) ≤ H(Xn|X1).    (4.24)

5. Entropy of a random tree. Consider the following method of generating a random tree with n nodes. First expand the root node:

    [figure: the root node expanded into two terminal nodes]

Then expand one of the two terminal nodes at random:

    [figure: the two possible trees obtained by expanding either terminal node]

At time k, choose one of the k − 1 terminal nodes according to a uniform distribution and expand it. Continue until n terminal nodes have been generated. Thus a sequence leading to a five node tree might look like this:

    [figure: a sequence of trees with 2, 3, 4 and 5 terminal nodes]

Surprisingly, the following method of generating random trees yields the same probability distribution on trees with n terminal nodes. First choose an integer N1 uniformly distributed on {1, 2, . . . , n − 1}. We then have the picture:

    [figure: the root with subtrees of N1 and n − N1 terminal nodes]

Then choose an integer N2 uniformly distributed over {1, 2, . . . , N1 − 1}, and independently choose another integer N3 uniformly over {1, 2, . . . , (n − N1) − 1}. The picture is now:

    [figure: four subtrees with N2, N1 − N2, N3 and n − N1 − N3 terminal nodes]

Continue the process until no further subdivision can be made. (The equivalence of these two tree generation schemes follows, for example, from Polya's urn model.)

Now let Tn denote a random n-node tree generated as described. The probability distribution on such trees seems difficult to describe, but we can find the entropy of this distribution in recursive form.

First some examples. For n = 2, we have only one tree. Thus H(T2) = 0. For n = 3, we have two equally probable trees:

    [figure: the two three-leaf trees]

Thus H(T3) = log 2. For n = 4, we have five possible trees, with probabilities 1/3, 1/6, 1/6, 1/6, 1/6.

Now for the recurrence relation. Let N1(Tn) denote the number of terminal nodes of Tn in the right half of the tree. Justify each of the steps in the following:

    H(Tn) (a)= H(N1, Tn)    (4.25)
          (b)= H(N1) + H(Tn|N1)    (4.26)
          (c)= log(n − 1) + H(Tn|N1)    (4.27)
          (d)= log(n − 1) + (1/(n − 1)) ∑_{k=1}^{n−1} [H(Tk) + H(Tn−k)]    (4.28)
          (e)= log(n − 1) + (2/(n − 1)) ∑_{k=1}^{n−1} H(Tk)    (4.29)
            = log(n − 1) + (2/(n − 1)) ∑_{k=1}^{n−1} Hk.    (4.30)

(f) Use this to show that

    (n − 1)Hn = nHn−1 + (n − 1) log(n − 1) − (n − 2) log(n − 2),    (4.31)

or

    Hn/n = Hn−1/(n − 1) + cn,    (4.32)

for appropriately defined cn. Since ∑ cn = c < ∞, you have proved that (1/n)H(Tn) converges to a constant. Thus the expected number of bits necessary to describe the random tree Tn grows linearly with n.

Solution: Entropy of a random tree.

(a) H(Tn, N1) = H(Tn) + H(N1|Tn) = H(Tn) + 0, by the chain rule for entropies and since N1 is a function of Tn.

(b) H(Tn, N1) = H(N1) + H(Tn|N1) by the chain rule for entropies.

(c) H(N1) = log(n − 1) since N1 is uniform on {1, 2, . . . , n − 1}.

(d)

    H(Tn|N1) = ∑_{k=1}^{n−1} P(N1 = k) H(Tn|N1 = k)    (4.33)
             = (1/(n − 1)) ∑_{k=1}^{n−1} H(Tn|N1 = k)    (4.34)

by the definition of conditional entropy. Since, conditional on N1, the left subtree and the right subtree are chosen independently, H(Tn|N1 = k) = H(Tk, Tn−k|N1 = k) = H(Tk) + H(Tn−k), so

    H(Tn|N1) = (1/(n − 1)) ∑_{k=1}^{n−1} (H(Tk) + H(Tn−k)).    (4.35)

(e) By a simple change of variables,

    ∑_{k=1}^{n−1} H(Tn−k) = ∑_{k=1}^{n−1} H(Tk).    (4.36)

(f) Hence if we let Hn = H(Tn),

    (n − 1)Hn = (n − 1) log(n − 1) + 2 ∑_{k=1}^{n−1} Hk    (4.37)
    (n − 2)Hn−1 = (n − 2) log(n − 2) + 2 ∑_{k=1}^{n−2} Hk.    (4.38)

Subtracting the second equation from the first, we get

    (n − 1)Hn − (n − 2)Hn−1 = (n − 1) log(n − 1) − (n − 2) log(n − 2) + 2Hn−1,    (4.40)

or

    Hn/n = Hn−1/(n − 1) + log(n − 1)/n − (n − 2) log(n − 2)/(n(n − 1))    (4.41)
         = Hn−1/(n − 1) + Cn,    (4.42)

where

    Cn = log(n − 1)/n − (n − 2) log(n − 2)/(n(n − 1))    (4.43)
       = log(n − 1)/n − log(n − 2)/(n − 1) + 2 log(n − 2)/(n(n − 1)).    (4.44)

Substituting the equation for Hn−1 in the equation for Hn and proceeding recursively, we obtain a telescoping sum

    Hn/n = ∑_{j=3}^{n} Cj + H2/2    (4.45)
         = ∑_{j=3}^{n} 2 log(j − 2)/(j(j − 1)) + (1/n) log(n − 1).    (4.46)

Since lim_{n→∞} (1/n) log(n − 1) = 0,

    lim_{n→∞} Hn/n = ∑_{j=3}^{∞} (2/(j(j − 1))) log(j − 2)    (4.47)
                   ≤ ∑_{j=3}^{∞} (2/(j − 1)²) log(j − 1)    (4.48)
                   = ∑_{j=2}^{∞} (2/j²) log j.    (4.49)

For sufficiently large j, log j ≤ √j, and hence the sum in (4.49) is dominated by the sum ∑_j j^{−3/2}, which converges. Hence the above sum converges. In fact, computer evaluation shows that

    lim Hn/n = ∑_{j=3}^{∞} (2/(j(j − 1))) log(j − 2) = 1.736 bits.    (4.50)

Thus the number of bits required to describe a random n-node tree grows linearly with n.

6. Monotonicity of entropy per element. For a stationary stochastic process X1, X2, . . . , Xn, show that

(a)

    H(X1, X2, . . . , Xn)/n ≤ H(X1, X2, . . . , Xn−1)/(n − 1).    (4.51)

(b)

    H(X1, X2, . . . , Xn)/n ≥ H(Xn|Xn−1, . . . , X1).    (4.52)

Solution: Monotonicity of entropy per element.

(a) By the chain rule for entropy,

    H(X1, X2, . . . , Xn)/n = (1/n) ∑_{i=1}^{n} H(Xi|X^{i−1})    (4.53)
                            = [ H(Xn|X^{n−1}) + ∑_{i=1}^{n−1} H(Xi|X^{i−1}) ] / n    (4.54)
                            = [ H(Xn|X^{n−1}) + H(X1, X2, . . . , Xn−1) ] / n.    (4.55)

From stationarity it follows that for all 1 ≤ i ≤ n,

    H(Xn|X^{n−1}) ≤ H(Xi|X^{i−1}),

which further implies, by averaging both sides, that

    H(Xn|X^{n−1}) ≤ (1/(n − 1)) ∑_{i=1}^{n−1} H(Xi|X^{i−1})    (4.56)
                  = H(X1, X2, . . . , Xn−1)/(n − 1).    (4.57)

Combining (4.55) and (4.57) yields

    H(X1, X2, . . . , Xn)/n ≤ (1/n) [ H(X1, X2, . . . , Xn−1)/(n − 1) + H(X1, X2, . . . , Xn−1) ]    (4.58)
                            = H(X1, X2, . . . , Xn−1)/(n − 1).    (4.59)

(b) By stationarity we have, for all 1 ≤ i ≤ n,

    H(Xn|X^{n−1}) ≤ H(Xi|X^{i−1}),

which implies that

    H(Xn|X^{n−1}) = (1/n) ∑_{i=1}^{n} H(Xn|X^{n−1})    (4.60)
                  ≤ (1/n) ∑_{i=1}^{n} H(Xi|X^{i−1})    (4.61)
                  = H(X1, X2, . . . , Xn)/n.    (4.62)

7. Entropy rates of Markov chains.

(a) Find the entropy rate of the two-state Markov chain with transition matrix

    P = [ 1 − p01   p01
          p10       1 − p10 ].

(b) What values of p01, p10 maximize the rate of part (a)?

(c) Find the entropy rate of the two-state Markov chain with transition matrix

    P = [ 1 − p   p
          1       0 ].

(d) Find the maximum value of the entropy rate of the Markov chain of part (c). We expect that the maximizing value of p should be less than 1/2, since the 0 state permits more information to be generated than the 1 state.

(e) Let N(t) be the number of allowable state sequences of length t for the Markov chain of part (c). Find N(t) and calculate

    H0 = lim_{t→∞} (1/t) log N(t).

Hint: Find a linear recurrence that expresses N(t) in terms of N(t − 1) and N(t − 2). Why is H0 an upper bound on the entropy rate of the Markov chain? Compare H0 with the maximum entropy found in part (d).

Solution: Entropy rates of Markov chains.

(a) The stationary distribution is easily calculated. (See EIT pp. 62–63.)

    µ0 = p10/(p01 + p10),   µ1 = p01/(p01 + p10).

Therefore the entropy rate is

    H(X2|X1) = µ0 H(p01) + µ1 H(p10) = [p10 H(p01) + p01 H(p10)]/(p01 + p10).

(b) The entropy rate is at most 1 bit because the process has only two states. This rate can be achieved if (and only if) p01 = p10 = 1/2, in which case the process is actually i.i.d. with Pr(Xi = 0) = Pr(Xi = 1) = 1/2.

(c) As a special case of the general two-state Markov chain, the entropy rate is

    H(X2|X1) = µ0 H(p) + µ1 H(1) = H(p)/(p + 1).

(d) By straightforward calculus, we find that the maximum value of H(X) of part (c) occurs for p = (3 − √5)/2 = 0.382. The maximum value is

    H(p) = H(1 − p) = H( (√5 − 1)/2 ) = 0.694 bits.

Note that (√5 − 1)/2 = 0.618 is (the reciprocal of) the Golden Ratio.

(e) The Markov chain of part (c) forbids consecutive ones. Consider any allowable sequence of symbols of length t. If the first symbol is 1, then the next symbol must be 0; the remaining N(t − 2) symbols can form any allowable sequence. If the first symbol is 0, then the remaining N(t − 1) symbols can be any allowable sequence. So the number of allowable sequences of length t satisfies the recurrence

    N(t) = N(t − 1) + N(t − 2),   N(1) = 2, N(2) = 3.

(The initial conditions are obtained by observing that for t = 2 only the sequence 11 is not allowed. We could also choose N(0) = 1 as an initial condition, since there is exactly one allowable sequence of length 0, namely, the empty sequence.)

The sequence N(t) grows exponentially, that is, N(t) ≈ cλ^t, where λ is the maximum magnitude solution of the characteristic equation

    1 = z^{−1} + z^{−2}.

Solving the characteristic equation yields λ = (1 + √5)/2, the Golden Ratio. (The sequence {N(t)} is the sequence of Fibonacci numbers.) Therefore

    H0 = lim_{t→∞} (1/t) log N(t) = log[(1 + √5)/2] = 0.694 bits.

Since there are only N(t) possible outcomes for X1, . . . , Xt, an upper bound on H(X1, . . . , Xt) is log N(t), and so the entropy rate of the Markov chain of part (c) is at most H0. In fact, we saw in part (d) that this upper bound can be achieved.
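A short numerical confirmation of parts (c), (d) and (e) (a check, not part of the solution): it maximizes H(p)/(p + 1) over p by a crude grid search and compares the maximum with log2 of the Golden Ratio.

    from math import log2, sqrt

    def h(p):                       # binary entropy in bits
        return -p * log2(p) - (1 - p) * log2(1 - p)

    def rate(p):                    # entropy rate of the chain in part (c)
        return h(p) / (p + 1)

    best_p = max((i / 10**5 for i in range(1, 10**5)), key=rate)
    print("maximizing p      :", best_p)              # about (3 - sqrt(5))/2 = 0.382
    print("max entropy rate  :", rate(best_p))        # about 0.694 bits
    print("log2 golden ratio :", log2((1 + sqrt(5)) / 2))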


8. Maximum entropy process. A discrete memoryless source has alphabet {1, 2} where the symbol 1 has duration 1 and the symbol 2 has duration 2. The probabilities of 1 and 2 are p1 and p2, respectively. Find the value of p1 that maximizes the source entropy per unit time H(X)/ElX. What is the maximum value H?

Solution: Maximum entropy process. The entropy per symbol of the source is

    H(p1) = −p1 log p1 − (1 − p1) log(1 − p1)

and the average symbol duration (or time per symbol) is

    T(p1) = 1 · p1 + 2 · p2 = p1 + 2(1 − p1) = 2 − p1 = 1 + p2.

Therefore the source entropy per unit time is

    f(p1) = H(p1)/T(p1) = [−p1 log p1 − (1 − p1) log(1 − p1)]/(2 − p1).

Since f(0) = f(1) = 0, the maximum value of f(p1) must occur for some point p1 such that 0 < p1 < 1 and ∂f/∂p1 = 0. Now

    ∂/∂p1 [H(p1)/T(p1)] = [T(∂H/∂p1) − H(∂T/∂p1)]/T².

After some calculus, we find that the numerator of the above expression (assuming natural logarithms) is

    T(∂H/∂p1) − H(∂T/∂p1) = ln(1 − p1) − 2 ln p1,

which is zero when 1 − p1 = p1² = p2, that is, p1 = (1/2)(√5 − 1) = 0.61803, the reciprocal of the golden ratio, (1/2)(√5 + 1) = 1.61803. The corresponding entropy per unit time is

    H(p1)/T(p1) = [−p1 log p1 − p2 log p2]/(2 − p1) = [−(1 + p1²) log p1]/(1 + p1²) = −log p1 = 0.69424 bits.

Note that this result is the same as the maximum entropy rate for the Markov chain in part (d) of Problem 7 above. This is because a source in which every 1 must be followed by a 0 is equivalent to a source in which the symbol 1 has duration 2 and the symbol 0 has duration 1.

9. Initial conditions. Show, for a Markov chain, that

    H(X0|Xn) ≥ H(X0|Xn−1).

Thus initial conditions X0 become more difficult to recover as the future Xn unfolds.

Solution: Initial conditions. For a Markov chain, by the data processing theorem, we have

    I(X0; Xn−1) ≥ I(X0; Xn).    (4.63)

Therefore

    H(X0) − H(X0|Xn−1) ≥ H(X0) − H(X0|Xn),    (4.64)

or H(X0|Xn) increases with n.


10. Pairwise independence. Let X1, X2, . . . , Xn−1 be i.i.d. random variables taking values in {0, 1}, with Pr{Xi = 1} = 1/2. Let Xn = 1 if ∑_{i=1}^{n−1} Xi is odd and Xn = 0 otherwise. Let n ≥ 3.

(a) Show that Xi and Xj are independent, for i ≠ j, i, j ∈ {1, 2, . . . , n}.

(b) Find H(Xi, Xj), for i ≠ j.

(c) Find H(X1, X2, . . . , Xn). Is this equal to nH(X1)?

Solution: (Pairwise Independence) X1, X2, . . . , Xn−1 are i.i.d. Bernoulli(1/2) random variables. We will first prove that for any k ≤ n − 1, the probability that ∑_{i=1}^{k} Xi is odd is 1/2. We will prove this by induction. Clearly this is true for k = 1. Assume that it is true for k − 1. Let Sk = ∑_{i=1}^{k} Xi. Then

    P(Sk odd) = P(Sk−1 odd)P(Xk = 0) + P(Sk−1 even)P(Xk = 1)    (4.65)
              = (1/2)(1/2) + (1/2)(1/2)    (4.66)
              = 1/2.    (4.67)

Hence for all k ≤ n − 1, the probability that Sk is odd is equal to the probability that it is even. Hence,

    P(Xn = 1) = P(Xn = 0) = 1/2.    (4.68)

(a) It is clear that when i and j are both less than n, Xi and Xj are independent. The only possible problem is when j = n. Taking i = 1 without loss of generality,

    P(X1 = 1, Xn = 1) = P(X1 = 1, ∑_{i=2}^{n−1} Xi even)    (4.69)
                      = P(X1 = 1) P(∑_{i=2}^{n−1} Xi even)    (4.70)
                      = (1/2)(1/2)    (4.71)
                      = P(X1 = 1)P(Xn = 1),    (4.72)

and similarly for other possible values of the pair X1, Xn. Hence X1 and Xn are independent.

(b) Since Xi and Xj are independent and uniformly distributed on {0, 1},

    H(Xi, Xj) = H(Xi) + H(Xj) = 1 + 1 = 2 bits.    (4.73)

(c) By the chain rule and the independence of X1, X2, . . . , Xn−1, we have

    H(X1, X2, . . . , Xn) = H(X1, X2, . . . , Xn−1) + H(Xn|Xn−1, . . . , X1)    (4.74)
                          = ∑_{i=1}^{n−1} H(Xi) + 0    (4.75)
                          = n − 1,    (4.76)

since Xn is a function of the previous Xi's. The total entropy is not n, which is what would be obtained if the Xi's were all independent. This example illustrates that pairwise independence does not imply complete independence.

11. Stationary processes. Let . . . , X−1, X0, X1, . . . be a stationary (not necessarily Markov) stochastic process. Which of the following statements are true? Prove or provide a counterexample.

(a) H(Xn|X0) = H(X−n|X0).

(b) H(Xn|X0) ≥ H(Xn−1|X0).

(c) H(Xn|X1, X2, . . . , Xn−1, Xn+1) is nonincreasing in n.

(d) H(Xn|X1, . . . , Xn−1, Xn+1, . . . , X2n) is non-increasing in n.

Solution: Stationary processes.

(a) H(Xn|X0) = H(X−n|X0). This statement is true, since

    H(Xn|X0) = H(Xn, X0) − H(X0)    (4.77)
    H(X−n|X0) = H(X−n, X0) − H(X0)    (4.78)

and H(Xn, X0) = H(X−n, X0) by stationarity.

(b) H(Xn|X0) ≥ H(Xn−1|X0). This statement is not true in general, though it is true for first order Markov chains. A simple counterexample is a periodic process with period n. Let X0, X1, X2, . . . , Xn−1 be i.i.d. uniformly distributed binary random variables and let Xk = Xk−n for k ≥ n. In this case, H(Xn|X0) = 0 and H(Xn−1|X0) = 1, contradicting the statement H(Xn|X0) ≥ H(Xn−1|X0).

(c) H(Xn|X1^{n−1}, Xn+1) is non-increasing in n. This statement is true, since by stationarity H(Xn|X1^{n−1}, Xn+1) = H(Xn+1|X2^{n}, Xn+2) ≥ H(Xn+1|X1^{n}, Xn+2), where the inequality follows from the fact that conditioning reduces entropy.

12. The entropy rate of a dog looking for a bone. A dog walks on the integers, possibly reversing direction at each step with probability p = .1. Let X0 = 0. The first step is equally likely to be positive or negative. A typical walk might look like this:

    (X0, X1, . . .) = (0, −1, −2, −3, −4, −3, −2, −1, 0, 1, . . .).

(a) Find H(X1, X2, . . . , Xn).

(b) Find the entropy rate of this browsing dog.

(c) What is the expected number of steps the dog takes before reversing direction?

Solution: The entropy rate of a dog looking for a bone.

(a) By the chain rule,

    H(X0, X1, . . . , Xn) = ∑_{i=0}^{n} H(Xi|X^{i−1})
                          = H(X0) + H(X1|X0) + ∑_{i=2}^{n} H(Xi|Xi−1, Xi−2),

since, for i > 1, the next position depends only on the previous two (i.e., the dog's walk is 2nd order Markov, if the dog's position is the state). Since X0 = 0 deterministically, H(X0) = 0, and since the first step is equally likely to be positive or negative, H(X1|X0) = 1. Furthermore, for i > 1,

    H(Xi|Xi−1, Xi−2) = H(.1, .9).

Therefore,

    H(X0, X1, . . . , Xn) = 1 + (n − 1)H(.1, .9).

(b) From (a),

    H(X0, X1, . . . , Xn)/(n + 1) = [1 + (n − 1)H(.1, .9)]/(n + 1) → H(.1, .9).

(c) The dog must take at least one step to establish the direction of travel from which it ultimately reverses. Letting S be the number of steps taken between reversals, we have

    E(S) = ∑_{s=1}^{∞} s (.9)^{s−1}(.1) = 10.

Starting at time 0, the expected number of steps to the first reversal is 11.
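For concreteness, a short check (not part of the solution) of the numerical value of the entropy rate and of E(S):

    import random
    from math import log2

    H = -(0.1 * log2(0.1) + 0.9 * log2(0.9))
    print("entropy rate H(.1,.9):", H)          # about 0.469 bits per step

    # empirical mean run length between reversals
    rng = random.Random(0)
    runs, total = 100000, 0
    for _ in range(runs):
        s = 1
        while rng.random() >= 0.1:              # keep going with probability 0.9
            s += 1
        total += s
    print("average steps between reversals:", total / runs)   # about 10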

13. The past has little to say about the future. For a stationary stochastic process X1, X2, . . . , Xn, . . . , show that

    lim_{n→∞} (1/2n) I(X1, X2, . . . , Xn; Xn+1, Xn+2, . . . , X2n) = 0.    (4.79)

Thus the dependence between adjacent n-blocks of a stationary process does not grow linearly with n.

Solution:

    I(X1, . . . , Xn; Xn+1, . . . , X2n)
        = H(X1, . . . , Xn) + H(Xn+1, . . . , X2n) − H(X1, . . . , Xn, Xn+1, . . . , X2n)
        = 2H(X1, . . . , Xn) − H(X1, . . . , X2n),    (4.80)

since H(X1, . . . , Xn) = H(Xn+1, . . . , X2n) by stationarity.

Thus

    lim_{n→∞} (1/2n) I(X1, . . . , Xn; Xn+1, . . . , X2n)
        = lim_{n→∞} (1/2n) 2H(X1, . . . , Xn) − lim_{n→∞} (1/2n) H(X1, . . . , X2n)    (4.81)
        = lim_{n→∞} (1/n) H(X1, . . . , Xn) − lim_{n→∞} (1/2n) H(X1, . . . , X2n).    (4.82)

Now lim_{n→∞} (1/n) H(X1, . . . , Xn) = lim_{n→∞} (1/2n) H(X1, . . . , X2n), since both converge to the entropy rate of the process, and therefore

    lim_{n→∞} (1/2n) I(X1, . . . , Xn; Xn+1, . . . , X2n) = 0.    (4.83)

14. Functions of a stochastic process.

(a) Consider a stationary stochastic process X1, X2, . . . , Xn, and let Y1, Y2, . . . , Yn be defined by

    Yi = φ(Xi),   i = 1, 2, . . .    (4.84)

for some function φ. Prove that

    H(Y) ≤ H(X).    (4.85)

(b) What is the relationship between the entropy rates H(Z) and H(X) if

    Zi = ψ(Xi, Xi+1),   i = 1, 2, . . .    (4.86)

for some function ψ?

Solution: The key point is that functions of a random variable have lower entropy. Since (Y1, Y2, . . . , Yn) is a function of (X1, X2, . . . , Xn) (each Yi is a function of the corresponding Xi), we have (from Problem 2.4)

    H(Y1, Y2, . . . , Yn) ≤ H(X1, X2, . . . , Xn).    (4.87)

Dividing by n and taking the limit as n → ∞, we have

    lim_{n→∞} H(Y1, Y2, . . . , Yn)/n ≤ lim_{n→∞} H(X1, X2, . . . , Xn)/n    (4.88)

or

    H(Y) ≤ H(X).    (4.89)

15. Entropy rate. Let {Xi} be a discrete stationary stochastic process with entropy rate H(X). Show

    (1/n) H(Xn, . . . , X1 | X0, X−1, . . . , X−k) → H(X),    (4.90)

for k = 1, 2, . . . .

[Figure 4.1: Entropy rate of constrained sequence (the three-state diagram referenced in Problem 16)]

Solution: Entropy rate of a stationary process. By the Cesaro mean theorem, the running average of the terms tends to the same limit as the limit of the terms. Hence

    (1/n) H(X1, X2, . . . , Xn | X0, X−1, . . . , X−k) = (1/n) ∑_{i=1}^{n} H(Xi|Xi−1, Xi−2, . . . , X−k)    (4.91)
                                                      → lim H(Xn|Xn−1, Xn−2, . . . , X−k)    (4.92)
                                                      = H,    (4.93)

the entropy rate of the process.

16. Entropy rate of constrained sequences. In magnetic recording, the mechanism of record-ing and reading the bits imposes constraints on the sequences of bits that can berecorded. For example, to ensure proper sychronization, it is often necessary to limitthe length of runs of 0’s between two 1’s. Also to reduce intersymbol interference, itmay be necessary to require at least one 0 between any two 1’s. We will consider asimple example of such a constraint.

Suppose that we are required to have at least one 0 and at most two 0’s between anypair of 1’s in a sequences. Thus, sequences like 101001 and 0101001 are valid sequences,but 0110010 and 0000101 are not. We wish to calculate the number of valid sequencesof length n .

(a) Show that the set of constrained sequences is the same as the set of allowed pathson the following state diagram:

(b) Let Xi(n) be the number of valid paths of length n ending at state i . Argue that

Page 77: Solution to Information Theory

76 Entropy Rates of a Stochastic Process

X(n) = [X1(n) X2(n) X3(n)]t satisfies the following recursion:

X1(n)X2(n)X3(n)

=

0 1 11 0 00 1 0

X1(n− 1)X2(n− 1)X3(n− 1)

, (4.94)

with initial conditions X(1) = [1 1 0]t .

(c) Let

A =

0 1 11 0 00 1 0

. (4.95)

Then we have by induction

X(n) = AX(n− 1) = A2X(n− 2) = · · · = An−1X(1). (4.96)

Using the eigenvalue decomposition of A for the case of distinct eigenvalues, wecan write A = U−1ΛU , where Λ is the diagonal matrix of eigenvalues. ThenAn−1 = U−1Λn−1U . Show that we can write

X(n) = λn−11 Y1 + λn−1

2 Y2 + λn−13 Y3, (4.97)

where Y1,Y2,Y3 do not depend on n . For large n , this sum is dominated bythe largest term. Therefore argue that for i = 1, 2, 3 , we have

1

nlogXi(n) → log λ, (4.98)

where λ is the largest (positive) eigenvalue. Thus the number of sequences oflength n grows as λn for large n . Calculate λ for the matrix A above. (Thecase when the eigenvalues are not distinct can be handled in a similar manner.)

(d) We will now take a different approach. Consider a Markov chain whose state diagram is the one given in part (a), but with arbitrary transition probabilities. Therefore the probability transition matrix of this Markov chain is

            [0  1  0    ]
        P = [α  0  1 − α] .                     (4.99)
            [1  0  0    ]

    Show that the stationary distribution of this Markov chain is

        µ = [ 1/(3 − α), 1/(3 − α), (1 − α)/(3 − α) ].          (4.100)

(e) Maximize the entropy rate of the Markov chain over choices of α . What is the maximum entropy rate of the chain?

(f) Compare the maximum entropy rate in part (e) with log λ in part (c). Why are the two answers the same?


Solution:

Entropy rate of constrained sequences.

(a) The sequences are constrained to have at least one 0 and at most two 0's between two 1's. Let the state of the system be the number of 0's that has been seen since the last 1. Then a sequence that ends in a 1 is in state 1, a sequence that ends in 10 is in state 2, and a sequence that ends in 100 is in state 3. From state 1, it is only possible to go to state 2, since there has to be at least one 0 before the next 1. From state 2, we can go to either state 1 or state 3. From state 3, we have to go to state 1, since there cannot be more than two 0's in a row. Thus we obtain the state diagram given in the problem.

(b) Any valid sequence of length n that ends in a 1 must be formed by taking a valid sequence of length n − 1 that ends in a 0 and adding a 1 at the end. The number of valid sequences of length n − 1 that end in a 0 is equal to X2(n − 1) + X3(n − 1) and therefore

        X1(n) = X2(n − 1) + X3(n − 1).                         (4.101)

    By similar arguments, we get the other two equations, and we have

        [X1(n)]   [0 1 1] [X1(n−1)]
        [X2(n)] = [1 0 0] [X2(n−1)] .                          (4.102)
        [X3(n)]   [0 1 0] [X3(n−1)]

    The initial conditions are obvious, since both sequences of length 1 are valid and therefore X(1) = [1 1 0]^T .

(c) The induction step is obvious. Now using the eigenvalue decomposition of A = U^{−1} Λ U , it follows that A^2 = U^{−1} Λ U U^{−1} Λ U = U^{−1} Λ^2 U , etc., and therefore

        X(n) = A^{n−1} X(1) = U^{−1} Λ^{n−1} U X(1)                                     (4.103)
             = U^{−1} diag(λ1^{n−1}, λ2^{n−1}, λ3^{n−1}) U [1 1 0]^t                    (4.104)
             = λ1^{n−1} U^{−1} diag(1, 0, 0) U [1 1 0]^t
               + λ2^{n−1} U^{−1} diag(0, 1, 0) U [1 1 0]^t
               + λ3^{n−1} U^{−1} diag(0, 0, 1) U [1 1 0]^t                              (4.105)
             = λ1^{n−1} Y1 + λ2^{n−1} Y2 + λ3^{n−1} Y3,                                 (4.106)

    where Y1, Y2, Y3 do not depend on n . Without loss of generality, we can assume that λ1 > λ2 > λ3 . Thus

        X1(n) = λ1^{n−1} Y11 + λ2^{n−1} Y21 + λ3^{n−1} Y31                              (4.107)
        X2(n) = λ1^{n−1} Y12 + λ2^{n−1} Y22 + λ3^{n−1} Y32                              (4.108)
        X3(n) = λ1^{n−1} Y13 + λ2^{n−1} Y23 + λ3^{n−1} Y33                              (4.109)

    For large n , this sum is dominated by the largest term. Thus if Y1i > 0 , we have

        (1/n) log Xi(n) → log λ1.                                                        (4.110)

    To be rigorous, we must also show that Y1i > 0 for i = 1, 2, 3 . It is not difficult to prove that if one of the Y1i is positive, then the other two must also be positive, and therefore either

        (1/n) log Xi(n) → log λ1                                                         (4.111)

    for all i = 1, 2, 3 , or they all tend to some other value.

    The general argument is difficult since it is possible that the initial conditions of the recursion do not have a component along the eigenvector that corresponds to the maximum eigenvalue, so that Y1i = 0 and the above argument will fail. In our example, we can simply compute the various quantities, and thus

            [0 1 1]
        A = [1 0 0] = U^{−1} Λ U,                                                        (4.112)
            [0 1 0]

    where

        Λ = diag(1.3247, −0.6624 + 0.5623i, −0.6624 − 0.5623i),                          (4.113)

    and

            [ −0.5664            −0.7503            −0.4276           ]
        U = [ 0.6508 − 0.0867i   −0.3823 + 0.4234i  −0.6536 − 0.4087i ] ,                (4.114)
            [ 0.6508 + 0.0867i   −0.3823 − 0.4234i  −0.6536 + 0.4087i ]

    and therefore

             [0.9566]
        Y1 = [0.7221] ,                                                                   (4.115)
             [0.5451]

    which has all positive components. Therefore,

        (1/n) log Xi(n) → log λ1 = log 1.3247 = 0.4057 bits.                              (4.116)
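    As a quick numerical cross-check of part (c) (this sketch is not part of the original solution), one can compute the eigenvalues of A with NumPy and verify both the limiting exponent log λ1 ≈ 0.4057 bits and the growth of the sequence counts X(n):

```python
# Sketch: numerically verify the largest eigenvalue of the recursion matrix A
# and the resulting growth exponent log2(lambda_1).
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)

eigenvalues = np.linalg.eigvals(A)
lam1 = max(eigenvalues, key=lambda z: z.real).real   # largest (real, positive) eigenvalue

print(lam1)              # approximately 1.3247
print(np.log2(lam1))     # approximately 0.4057 bits

# Count valid sequences directly via X(n) = A X(n-1), starting from X(1) = [1 1 0]^t
x = np.array([1.0, 1.0, 0.0])
N = 60
for n in range(2, N + 1):
    x = A @ x
print(np.log2(x.sum()) / N)  # tends to log2(lambda_1) ~ 0.4057 as N grows
```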

(d) To verify that

        µ = [ 1/(3 − α), 1/(3 − α), (1 − α)/(3 − α) ]          (4.117)

    is the stationary distribution, we have to verify that µP = µ . But this is straightforward.


(e) The entropy rate of the Markov chain (in nats) is

        H = −∑_i µi ∑_j Pij ln Pij = (1/(3 − α)) (−α ln α − (1 − α) ln(1 − α)),          (4.118)

    and differentiating with respect to α to find the maximum, we find that

        dH/dα = (1/(3 − α)^2)(−α ln α − (1 − α) ln(1 − α)) + (1/(3 − α))(−1 − ln α + 1 + ln(1 − α)) = 0,    (4.119)

    or

        (3 − α)(ln α − ln(1 − α)) = −α ln α − (1 − α) ln(1 − α),                          (4.120)

    which reduces to

        3 ln α = 2 ln(1 − α),                                                              (4.121)

    i.e.,

        α^3 = α^2 − 2α + 1,                                                                (4.122)

    which can be solved (numerically) to give α = 0.5698 and the maximum entropy rate as 0.2812 nats = 0.4057 bits.
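    The maximizing α and the resulting entropy rate can also be checked numerically; the following sketch (not in the original solution) solves α^3 = α^2 − 2α + 1 by bisection, the iteration count being an arbitrary choice:

```python
# Sketch: solve alpha^3 = alpha^2 - 2*alpha + 1 by bisection and evaluate the
# entropy rate H(alpha) = (-alpha*ln(alpha) - (1-alpha)*ln(1-alpha)) / (3 - alpha).
import math

def g(a):
    return a**3 - a**2 + 2*a - 1   # the maximizing alpha is the root of g in (0, 1)

lo, hi = 0.0, 1.0
for _ in range(100):               # bisection
    mid = (lo + hi) / 2
    if g(lo) * g(mid) <= 0:
        hi = mid
    else:
        lo = mid
alpha = (lo + hi) / 2

H_nats = (-alpha * math.log(alpha) - (1 - alpha) * math.log(1 - alpha)) / (3 - alpha)
print(alpha)                  # ~0.5698
print(H_nats)                 # ~0.2812 nats
print(H_nats / math.log(2))   # ~0.4057 bits, matching log2(1.3247)
```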

(f) The answers in parts (c) and (e) are the same. Why? A rigorous argument is quite involved, but the essential idea is that both answers give the asymptotics of the number of sequences of length n for the state diagram in part (a). In part (c) we used a direct argument to calculate the number of sequences of length n and found that asymptotically, X(n) ≈ λ1^n .

    If we extend the ideas of Chapter 3 (typical sequences) to the case of Markov chains, we can see that there are approximately 2^{nH} typical sequences of length n for a Markov chain of entropy rate H . If we consider all Markov chains with state diagram given in part (a), the number of typical sequences should be less than the total number of sequences of length n that satisfy the state constraints. Thus, we see that 2^{nH} ≤ λ1^n or H ≤ log λ1 .

    To complete the argument, we need to show that there exists a Markov transition matrix that achieves the upper bound. This can be done by two different methods. One is to derive the Markov transition matrix from the eigenvalues, etc., of parts (a)-(c). Instead, we will use an argument from the method of types. In Chapter 12, we show that there are at most a polynomial number of types, and that therefore the largest type class has the same number of sequences (to the first order in the exponent) as the entire set. The same arguments can be applied to Markov types. There are only a polynomial number of Markov types, and therefore, of all the Markov type classes that satisfy the state constraints of part (a), at least one of them has the same exponent as the total number of sequences that satisfy the state constraint. For this Markov type, the number of sequences in the type class is 2^{nH} , and therefore for this type class, H = log λ1 .

    This result is a very curious one that connects two apparently unrelated objects: the maximum eigenvalue of a state transition matrix, and the maximum entropy rate for a probability transition matrix with the same state diagram. We don't know a reference for a formal proof of this result.

17. Waiting times are insensitive to distributions. Let X0, X1, X2, . . . be drawn i.i.d. ∼ p(x), x ∈ X = {1, 2, . . . , m} , and let N be the waiting time to the next occurrence of X0 , where N = min_n {Xn = X0} .

(a) Show that EN = m .

(b) Show that E logN ≤ H(X) .

(c) (Optional) Prove part (a) for {Xi} stationary and ergodic.

Solution: Waiting times are insensitive to distributions. Since X0, X1, X2, . . . , Xn aredrawn i.i.d. ∼ p(x) , the waiting time for the next occurrence of X0 has a geometricdistribution with probability of success p(x0) .

(a) Given X0 = i , the expected time until we see it again is 1/p(i) . Therefore,

        EN = E[E(N |X0)] = ∑_i p(X0 = i) (1/p(i)) = m.          (4.123)

(b) There is a typographical error in the problem. The problem should read E log N ≤ H(X) .

    By the same argument, given X0 = i , N has a geometric distribution with mean 1/p(i) , and

        E(N | X0 = i) = 1/p(i).                                 (4.124)

    Then using Jensen's inequality, we have

        E log N = ∑_i p(X0 = i) E(log N | X0 = i)               (4.125)
                ≤ ∑_i p(X0 = i) log E(N | X0 = i)               (4.126)
                = ∑_i p(i) log (1/p(i))                         (4.127)
                = H(X).                                         (4.128)
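    The two results EN = m and E log N ≤ H(X) are easy to confirm by simulation. The sketch below is not part of the original solution; the particular pmf p is an arbitrary example.

```python
# Sketch: empirically check EN = m and E[log N] <= H(X) for an example pmf.
import math, random

p = [0.5, 0.25, 0.15, 0.1]          # example distribution on {0,...,m-1}; m = 4
m = len(p)
H = -sum(q * math.log2(q) for q in p)

def sample():
    return random.choices(range(m), weights=p)[0]

trials = 200_000
sumN, sumlogN = 0.0, 0.0
for _ in range(trials):
    x0 = sample()
    n = 1
    while sample() != x0:            # waiting time to the next occurrence of X0
        n += 1
    sumN += n
    sumlogN += math.log2(n)

print(sumN / trials)      # close to m = 4
print(sumlogN / trials)   # at most H(X)
print(H)                  # ~1.74 bits for this example
```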

(c) The property that EN = m is essentially a combinatorial property rather than a statement about expectations. We prove this for stationary ergodic sources. In essence, we will calculate the empirical average of the waiting time, and show that this converges to m . Since the process is ergodic, the empirical average converges to the expected value, and thus the expected value must be m .

    To simplify matters, we will consider X1, X2, . . . , Xn arranged in a circle, so that X1 follows Xn . Then we can get rid of the edge effects (namely that the waiting time is not defined for Xn , etc.) and we can define the waiting time Nk at time k as min{n > k : Xn = Xk} . With this definition, we can write the empirical average of Nk for a particular sample sequence

        N̄ = (1/n) ∑_{i=1}^{n} Ni                                                   (4.129)
           = (1/n) ∑_{i=1}^{n} ∑_{j=i+1}^{min{n>i : xn = xi}} 1 .                   (4.130)

    Now we can rewrite the outer sum by grouping together all the terms which correspond to xi = l . Thus we obtain

        N̄ = (1/n) ∑_{l=1}^{m} ∑_{i : xi = l} ∑_{j=i+1}^{min{n>i : xn = l}} 1 .     (4.131)

    But the inner two sums correspond to summing 1 over all the n terms, and thus

        N̄ = (1/n) ∑_{l=1}^{m} n = m.                                               (4.132)

    Thus the empirical average of N over any sample sequence is m and thus the expected value of N must also be m .

18. Stationary but not ergodic process. A bin has two biased coins, one with probability ofheads p and the other with probability of heads 1 − p . One of these coins is chosenat random (i.e., with probability 1/2), and is then tossed n times. Let X denote theidentity of the coin that is picked, and let Y1 and Y2 denote the results of the first twotosses.

(a) Calculate I(Y1;Y2|X) .

(b) Calculate I(X;Y1, Y2) .

(c) Let H(Y) be the entropy rate of the Y process (the sequence of coin tosses). Calculate H(Y) . (Hint: Relate this to lim_{n→∞} (1/n) H(X, Y1, Y2, . . . , Yn) .)

You can check the answer by considering the behavior as p → 1/2 .

Solution:

(a) Since the coin tosses are independent conditional on the coin chosen, I(Y1; Y2|X) = 0 .

(b) The key point is that if we did not know the coin being used, then Y1 and Y2 are not independent. The joint distribution of Y1 and Y2 can be easily calculated from the following table:

    X   Y1   Y2   Probability
    1   H    H    p^2
    1   H    T    p(1 − p)
    1   T    H    p(1 − p)
    1   T    T    (1 − p)^2
    2   H    H    (1 − p)^2
    2   H    T    p(1 − p)
    2   T    H    p(1 − p)
    2   T    T    p^2

    Thus the joint distribution of (Y1, Y2) is ( (1/2)(p^2 + (1 − p)^2), p(1 − p), p(1 − p), (1/2)(p^2 + (1 − p)^2) ) , and we can now calculate

        I(X; Y1, Y2) = H(Y1, Y2) − H(Y1, Y2|X)                                            (4.133)
                     = H(Y1, Y2) − H(Y1|X) − H(Y2|X)                                      (4.134)
                     = H(Y1, Y2) − 2H(p)                                                  (4.135)
                     = H( (1/2)(p^2 + (1−p)^2), p(1−p), p(1−p), (1/2)(p^2 + (1−p)^2) ) − 2H(p)
                     = H(2p(1 − p)) + 1 − 2H(p),                                          (4.136)

    where the last step follows from using the grouping rule for entropy.
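    As a sanity check on (4.136) (not part of the original solution), one can compute I(X; Y1, Y2) directly from the joint distribution and compare it with the closed form, for an arbitrary example value of p :

```python
# Sketch: verify I(X;Y1,Y2) = H(2p(1-p)) + 1 - 2H(p) by direct computation.
import math

def h(*probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

p = 0.8                                   # example parameter
s = p**2 + (1 - p)**2                     # P(Y1 = Y2)
joint = [s / 2, p * (1 - p), p * (1 - p), s / 2]   # distribution of (Y1, Y2)

I_direct = h(*joint) - 2 * h(p, 1 - p)    # H(Y1,Y2) - H(Y1,Y2|X)
I_formula = h(2 * p * (1 - p), 1 - 2 * p * (1 - p)) + 1 - 2 * h(p, 1 - p)

print(I_direct)    # both ~0.4606 bits for p = 0.8
print(I_formula)
```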

(c)

        H(Y) = lim H(Y1, Y2, . . . , Yn)/n                                                (4.137)
             = lim [H(X, Y1, Y2, . . . , Yn) − H(X|Y1, Y2, . . . , Yn)]/n                 (4.138)
             = lim [H(X) + H(Y1, Y2, . . . , Yn|X) − H(X|Y1, Y2, . . . , Yn)]/n           (4.139)

    Since 0 ≤ H(X|Y1, Y2, . . . , Yn) ≤ H(X) ≤ 1 , we have lim (1/n) H(X) = 0 and similarly lim (1/n) H(X|Y1, Y2, . . . , Yn) = 0 . Also, H(Y1, Y2, . . . , Yn|X) = nH(p) , since the Yi 's are i.i.d. given X . Combining these terms, we get

        H(Y) = lim nH(p)/n = H(p).                                                        (4.140)

19. Random walk on graph. Consider a random walk on the graph


[Figure: an undirected graph on five labeled nodes 1-5. From the solution below, each of nodes 1-4 has 3 edges and node 5 has 4 edges, for a total of 8 edges.]

(a) Calculate the stationary distribution.

(b) What is the entropy rate?

(c) Find the mutual information I(Xn+1;Xn) assuming the process is stationary.

Solution:

(a) The stationary distribution for a connected graph of undirected edges with equal weight is given as µi = Ei/2E , where Ei denotes the number of edges emanating from node i and E is the total number of edges in the graph. Hence, the stationary distribution is [3/16, 3/16, 3/16, 3/16, 4/16] ; i.e., the four exterior nodes have steady state probability 3/16 while node 5 has steady state probability 1/4 .

(b) Thus, the entropy rate of the random walk on this graph is 4 (3/16) log2 3 + (4/16) log2 4 = (3/4) log2 3 + 1/2 = log 16 − H(3/16, 3/16, 3/16, 3/16, 1/4) .

(c) The mutual information

        I(Xn+1; Xn) = H(Xn+1) − H(Xn+1|Xn)                                                        (4.141)
                    = H(3/16, 3/16, 3/16, 3/16, 1/4) − (log 16 − H(3/16, 3/16, 3/16, 3/16, 1/4))   (4.142)
                    = 2 H(3/16, 3/16, 3/16, 3/16, 1/4) − log 16                                    (4.143)
                    = 2 ( (3/4) log(16/3) + (1/4) log 4 ) − log 16                                 (4.144)
                    = 3 − (3/2) log 3                                                              (4.145)
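    Since the stationary distribution, the entropy rate and I(Xn+1; Xn) depend only on the node degrees, they are easy to recompute numerically. The following sketch is not part of the original solution; the degrees (3, 3, 3, 3, 4) are those used in the solution above.

```python
# Sketch: entropy rate and I(X_{n+1}; X_n) for a random walk on a graph,
# computed from the node degrees only.
import math

def h(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

degrees = [3, 3, 3, 3, 4]                 # E_i for nodes 1..5
twoE = sum(degrees)                       # 2E = 16
mu = [d / twoE for d in degrees]          # stationary distribution [3/16,...,1/4]

entropy_rate = sum(m_i * math.log2(d) for m_i, d in zip(mu, degrees))
# equivalently: math.log2(twoE) - h(mu)
mutual_info = h(mu) - entropy_rate        # I(X_{n+1}; X_n) = H(mu) - H(X_{n+1}|X_n)

print(mu)               # [0.1875, 0.1875, 0.1875, 0.1875, 0.25]
print(entropy_rate)     # 0.75*log2(3) + 0.5 ~ 1.689 bits
print(mutual_info)      # 3 - 1.5*log2(3) ~ 0.623 bits
```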

20. Random walk on chessboard. Find the entropy rate of the Markov chain associated witha random walk of a king on the 3 × 3 chessboard

1 2 3

4 5 6

7 8 9


What about the entropy rate of rooks, bishops and queens? There are two types ofbishops.

Solution:

Random walk on the chessboard.

Notice that the king cannot remain where it is. It has to move from one state to the next. The stationary distribution is given by µi = Ei/E , where Ei = number of edges emanating from node i and E = ∑_{i=1}^{9} Ei . By inspection, E1 = E3 = E7 = E9 = 3 , E2 = E4 = E6 = E8 = 5 , E5 = 8 and E = 40 , so µ1 = µ3 = µ7 = µ9 = 3/40 , µ2 = µ4 = µ6 = µ8 = 5/40 and µ5 = 8/40 . In a random walk the next state is chosen with equal probability among possible choices, so H(X2|X1 = i) = log 3 bits for i = 1, 3, 7, 9 , H(X2|X1 = i) = log 5 for i = 2, 4, 6, 8 and H(X2|X1 = i) = log 8 bits for i = 5 . Therefore, we can calculate the entropy rate of the king as

        H = ∑_{i=1}^{9} µi H(X2|X1 = i)                         (4.146)
          = 0.3 log 3 + 0.5 log 5 + 0.2 log 8                   (4.147)
          = 2.24 bits.                                          (4.148)
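A short sketch (not in the original solution) that recomputes the king's entropy rate from the move counts, and also handles a general n × n board:

```python
# Sketch: entropy rate of a king's random walk on an n x n board, computed from
# the number of legal king moves from each square (n = 3 reproduces 2.24 bits).
import math

def king_entropy_rate(n):
    degree = {}
    for r in range(n):
        for c in range(n):
            moves = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) != (0, 0) and 0 <= r + dr < n and 0 <= c + dc < n:
                        moves += 1
            degree[(r, c)] = moves
    twoE = sum(degree.values())
    # H = sum_i mu_i log2(E_i) with mu_i = E_i / (2E)
    return sum((d / twoE) * math.log2(d) for d in degree.values())

print(king_entropy_rate(3))   # ~2.24 bits
print(king_entropy_rate(8))   # entropy rate on the standard 8 x 8 board
```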

21. Maximal entropy graphs. Consider a random walk on a connected graph with 4 edges.

(a) Which graph has the highest entropy rate?

(b) Which graph has the lowest?

Solution: Graph entropy.

There are five connected graphs with four edges (the figures are not reproduced here). Their entropy rates are 1/2 + (3/8) log 3 ≈ 1.094 , 1 , 0.75 , 1 and 1/4 + (3/8) log 3 ≈ 0.844 bits, respectively.

(a) From the above we see that the first graph maximizes the entropy rate, with an entropy rate of 1.094 bits.

(b) From the above we see that the third graph minimizes the entropy rate, with an entropy rate of 0.75 bits.

22. 3-D Maze.A bird is lost in a 3 × 3 × 3 cubical maze. The bird flies from room to room going toadjoining rooms with equal probability through each of the walls. To be specific, thecorner rooms have 3 exits.


(a) What is the stationary distribution?

(b) What is the entropy rate of this random walk?

Solution: 3D Maze. The entropy rate of a random walk on a graph with equal weights is given by equation 4.41 in the text:

        H(X) = log(2E) − H( E1/2E, . . . , Em/2E )

There are 8 corners, 12 edges, 6 faces and 1 center. Corner rooms have 3 edges, edge rooms have 4 edges, face rooms have 5 edges and the center room has 6 edges. Therefore, the total number of edges is E = 54 . So,

        H(X) = log(108) + 8 (3/108) log(3/108) + 12 (4/108) log(4/108) + 6 (5/108) log(5/108) + 1 (6/108) log(6/108)
             = 2.03 bits
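The same number can be recomputed mechanically from the room adjacencies; the following sketch is not part of the original solution.

```python
# Sketch: recompute the 3-D maze entropy rate log(2E) - H(E_1/2E, ..., E_m/2E)
# from the room adjacencies of the 3 x 3 x 3 cube.
import math
from itertools import product

degree = {}
for room in product(range(3), repeat=3):
    # rooms are adjacent if they differ by 1 in exactly one coordinate
    degree[room] = sum(1 for axis in range(3) for step in (-1, 1)
                       if 0 <= room[axis] + step < 3)

twoE = sum(degree.values())                        # 2E = 108
H = math.log2(twoE) + sum((d / twoE) * math.log2(d / twoE) for d in degree.values())

print(twoE // 2)   # E = 54
print(H)           # ~2.03 bits
```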

23. Entropy rate. Let {Xi} be a stationary stochastic process with entropy rate H(X) .

(a) Argue that H(X ) ≤ H(X1) .

(b) What are the conditions for equality?

Solution: Entropy Rate

(a) From Theorem 4.2.1,

        H(X) = H(X1|X0, X−1, . . .) ≤ H(X1),                    (4.149)

    since conditioning reduces entropy.

(b) We have equality only if X1 is independent of the past X0, X−1, . . . , i.e., if andonly if Xi is an i.i.d. process.

24. Entropy rates

Let {Xi} be a stationary process. Let Yi = (Xi, Xi+1) . Let Zi = (X2i, X2i+1) . LetVi = X2i . Consider the entropy rates H(X ) , H(Y) , H(Z) , and H(V) of the processes{Xi} ,{Yi} , {Zi} , and {Vi} . What is the inequality relationship ≤ , =, or ≥ betweeneach of the pairs listed below:

(a) H(X) versus H(Y) .

(b) H(X) versus H(Z) .

(c) H(X) versus H(V) .

(d) H(Z) versus H(X) .


Solution: Entropy rates

{Xi} is a stationary process, Yi = (Xi, Xi+1) , Zi = (X2i, X2i+1) , and Vi = X2i . Consider the entropy rates H(X) , H(Y) , H(Z) , and H(V) of the processes {Xi} , {Yi} , {Zi} , and {Vi} .

(a) H(X) = H(Y) , since H(X1, X2, . . . , Xn, Xn+1) = H(Y1, Y2, . . . , Yn) , and dividing by n and taking the limit, we get equality.

(b) H(X) ≤ H(Z) , since H(X1, . . . , X2n) = H(Z1, . . . , Zn) , and dividing by n and taking the limit, we get 2H(X) = H(Z) .

(c) H(X) ≤ H(V) , since H(V1|V0, V−1, . . .) = H(X2|X0, X−2, . . .) ≥ H(X2|X1, X0, X−1, . . .) = H(X) , because conditioning on the full past can only reduce entropy.

(d) H(Z) = 2H(X) , since H(X1, . . . , X2n) = H(Z1, . . . , Zn) , and dividing by n and taking the limit, we get 2H(X) = H(Z) .

25. Monotonicity.

(a) Show that I(X;Y1, Y2, . . . , Yn) is non-decreasing in n .

(b) Under what conditions is the mutual information constant for all n?

Solution: Monotonicity

(a) Since conditioning reduces entropy,

        H(X|Y1, Y2, . . . , Yn) ≥ H(X|Y1, Y2, . . . , Yn, Yn+1)           (4.150)

    and hence

        I(X; Y1, Y2, . . . , Yn) = H(X) − H(X|Y1, Y2, . . . , Yn)         (4.151)
                                 ≤ H(X) − H(X|Y1, Y2, . . . , Yn, Yn+1)   (4.152)
                                 = I(X; Y1, Y2, . . . , Yn, Yn+1)         (4.153)

(b) We have equality if and only if H(X|Y1, Y2, . . . , Yn) = H(X|Y1) for all n , i.e., ifX is conditionally independent of Y2, . . . given Y1 .

26. Transitions in Markov chains. Suppose {Xi} forms an irreducible Markov chain withtransition matrix P and stationary distribution µ . Form the associated “edge-process”{Yi} by keeping track only of the transitions. Thus the new process {Yi} takes valuesin X × X , and Yi = (Xi−1, Xi) .

For exampleX = 3, 2, 8, 5, 7, . . .

becomesY = (∅, 3), (3, 2), (2, 8), (8, 5), (5, 7), . . .

Find the entropy rate of the edge process {Yi} .

Solution: Edge Process H(X ) = H (Y) , since H(X1, X2, . . . , Xn, Xn+1) = H(Y1, Y2, . . . , Yn) ,and dividing by n and taking the limit, we get equality.


27. Entropy rate

Let {Xi} be a stationary {0, 1} valued stochastic process obeying

Xk+1 = Xk ⊕Xk−1 ⊕ Zk+1,

where {Zi} is Bernoulli(p ) and ⊕ denotes mod 2 addition. What is the entropy rateH(X ) ?

Solution: Entropy Rate

H(X ) = H(Xk+1|Xk, Xk−1, . . .) = H(Xk+1|Xk, Xk−1) = H(Zk+1) = H(p) (4.154)

28. Mixture of processes

Suppose we observe one of two stochastic processes but don’t know which. What is theentropy rate? Specifically, let X11, X12, X13, . . . be a Bernoulli process with parameterp1 and let X21, X22, X23, . . . be Bernoulli (p2) . Let

        θ = 1, with probability 1/2
            2, with probability 1/2

and let Yi = Xθi , i = 1, 2, . . . , be the observed stochastic process. Thus Y observes the process {X1i} or {X2i} . Eventually Y will know which.

(a) Is {Yi} stationary?

(b) Is {Yi} an i.i.d. process?

(c) What is the entropy rate H of {Yi} ?

(d) Does

        −(1/n) log p(Y1, Y2, . . . , Yn) −→ H ?

(e) Is there a code that achieves an expected per-symbol description length (1/n) E Ln −→ H ?

Now let θi be Bern(1/2). Observe

        Zi = Xθi i ,  i = 1, 2, . . . ,

Thus θ is not fixed for all time, as it was in the first part, but is chosen i.i.d. each time.Answer (a), (b), (c), (d), (e) for the process {Zi} , labeling the answers (a ′ ), (b ′ ), (c ′ ),(d ′ ), (e ′ ).

Solution: Mixture of processes.

(a) YES, {Yi} is stationary, since the scheme that we use to generate the Yi s doesn’tchange with time.


(b) NO, it is not IID, since there's dependence now: all the Yi 's have been generated according to the same parameter θ .

    Alternatively, we can arrive at the result by examining I(Yn+1; Y^n) . If the process were to be IID, then the expression I(Yn+1; Y^n) would have to be 0 . However, if we are given Y^n , then we can estimate what θ is, which in turn allows us to predict Yn+1 . Thus, I(Yn+1; Y^n) is nonzero.

(c) The process {Yi} is the mixture of two Bernoulli processes with different parameters, and its entropy rate is the mixture of the two entropy rates of the two processes, so it's given by

        (H(p1) + H(p2)) / 2 .

    More rigorously,

        H = lim_{n→∞} (1/n) H(Y^n)
          = lim_{n→∞} (1/n) ( H(θ) + H(Y^n|θ) − H(θ|Y^n) )
          = (H(p1) + H(p2)) / 2

    Note that only H(Y^n|θ) grows with n . The rest of the terms are finite and will go to 0 as n goes to ∞ .

(d) The process {Yi} is NOT ergodic, so the AEP does not apply and the quantity−(1/n) log P (Y1, Y2, . . . , Yn) does NOT converge to the entropy rate. (But it doesconverge to a random variable that equals H(p1) w.p. 1/2 and H(p2) w.p. 1/2.)

(e) Since the process is stationary, we can do Huffman coding on longer and longer blocks of the process. These codes will have an expected per-symbol length bounded above by (H(X1, X2, . . . , Xn) + 1)/n , and this converges to H(X) .

(a’) YES, {Zi} is stationary, since the scheme that we use to generate the Zi 's doesn't change with time.

(b’) YES, it is IID, since there's no dependence now: each Zi is generated according to an independent parameter θi , and Zi ∼ Bernoulli((p1 + p2)/2) .

(c’) Since the process is now IID, its entropy rate is

        H((p1 + p2)/2) .

(d’) YES, the limit exists by the AEP.

(e’) YES, as in (e) above.
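The contrast between the two processes can be illustrated by simulation. The sketch below is not part of the original solution, and the parameter values p1 = 0.1 and p2 = 0.4 are arbitrary; it estimates −(1/n) log p(Y^n) for the fixed-θ mixture (which converges to H(p1) or H(p2), not to their average) and for the i.i.d.-θi process (which converges to H((p1 + p2)/2)).

```python
# Sketch: compare -(1/n) log p(Y^n) for the fixed-theta mixture versus the
# i.i.d.-theta process, using arbitrary parameters p1 = 0.1, p2 = 0.4.
import math, random

p1, p2, n = 0.1, 0.4, 200_000

def h(q):
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

# Fixed theta: the whole sequence comes from one coin.
theta = random.choice([1, 2])
p = p1 if theta == 1 else p2
k = sum(random.random() < p for _ in range(n))          # number of ones
# log of the mixture probability (1/2) p1^k (1-p1)^(n-k) + (1/2) p2^k (1-p2)^(n-k)
log_terms = [math.log(0.5) + k * math.log(q) + (n - k) * math.log(1 - q) for q in (p1, p2)]
m = max(log_terms)
logp = m + math.log(sum(math.exp(t - m) for t in log_terms))  # log-sum-exp
print(-logp / (n * math.log(2)))    # ~H(p1)=0.469 or ~H(p2)=0.971, depending on theta

# i.i.d. theta_i: each symbol is Bernoulli((p1+p2)/2).
q = (p1 + p2) / 2
k = sum(random.random() < q for _ in range(n))
logp = k * math.log2(q) + (n - k) * math.log2(1 - q)
print(-logp / n)                    # ~H((p1+p2)/2) = H(0.25) ~ 0.811 bits
print(h(p1), h(p2), h(q))
```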

29. Waiting times. Let X be the waiting time for the first heads to appear in successive flips of a fair coin. Thus, for example, Pr{X = 3} = (1/2)^3 .


Let Sn be the waiting time for the nth head to appear. Thus,

S0 = 0

Sn+1 = Sn +Xn+1

where X1, X2, X3, . . . are i.i.d according to the distribution above.

(a) Is the process {Sn} stationary?

(b) Calculate H(S1, S2, . . . , Sn) .

(c) Does the process {Sn} have an entropy rate? If so, what is it? If not, why not?

(d) What is the expected number of fair coin flips required to generate a randomvariable having the same distribution as Sn ?

Solution: Waiting time process.

For the process to be stationary, the distribution must be time invariant. It turns outthat process {Sn} is not stationary. There are several ways to show this.

(a) S0 is always 0 while Si , i ≠ 0 , can take on several values. Since the marginals for S0 and S1 , for example, are not the same, the process can't be stationary.

(b) It’s clear that the variance of Sn grows with n , which again implies that themarginals are not time-invariant.

(c) Process {Sn} is an independent increment process. An independent incrementprocess is not stationary (not even wide sense stationary), since var(Sn) = var(Xn)+var(Sn−1) > var(Sn−1) .

(d) We can use the chain rule and Markov properties to obtain the following results.

        H(S1, S2, . . . , Sn) = H(S1) + ∑_{i=2}^{n} H(Si|Si−1)
                              = H(X1) + ∑_{i=2}^{n} H(Xi)
                              = ∑_{i=1}^{n} H(Xi)
                              = 2n

(e) It follows trivially from (d) that

        H(S) = lim_{n→∞} H(S1, S2, . . . , Sn)/n = lim_{n→∞} 2n/n = 2 .


    Note that the entropy rate can still exist even when the process is not stationary. Furthermore, the entropy rate (for this problem) is the same as the entropy of X .

(f) The expected number of flips required can be lower-bounded by H(Sn) and upper-bounded by H(Sn) + 2 (Theorem 5.12.3, page 115). Sn has a negative binomial distribution, i.e.,

        Pr(Sn = k) = (k−1 choose n−1) (1/2)^k    for k ≥ n .

    (We have the n th success at the k th trial if and only if we have exactly n − 1 successes in k − 1 trials and a success at the k th trial.)

    Since computing the exact value of H(Sn) is difficult (and fruitless in the exam setting), it would be sufficient to show that the expected number of flips required is between H(Sn) and H(Sn) + 2 , and set up the expression of H(Sn) in terms of the pmf of Sn .

    Note, however, that for large n the distribution of Sn will tend to Gaussian with mean n/p = 2n and variance n(1 − p)/p^2 = 2n .

    Let pk = Pr(Sn = k + E Sn) = Pr(Sn = k + 2n) . Let φ(x) be the normal density function with mean zero and variance 2n , i.e., φ(x) = exp(−x^2/2σ^2)/√(2πσ^2) , where σ^2 = 2n .

    Then for large n , since the entropy is invariant under any constant shift of a random variable and φ(x) log φ(x) is Riemann integrable,

        H(Sn) = H(Sn − E(Sn))
              = −∑ pk log pk
              ≈ −∑ φ(k) log φ(k)
              ≈ −∫ φ(x) log φ(x) dx
              = (−log e) ∫ φ(x) ln φ(x) dx
              = (−log e) ∫ φ(x) (−x^2/2σ^2 − ln √(2πσ^2)) dx
              = (log e) (1/2 + (1/2) ln 2πσ^2)
              = (1/2) log 2πeσ^2
              = (1/2) log nπe + 1 .

    (Refer to Chapter 9 for a more general discussion of the entropy of a continuous random variable and its relation to discrete entropy.)

    Here is a specific example for n = 100 . Based on the earlier discussion, Pr(S100 = k) = (k−1 choose 99) (1/2)^k . The Gaussian approximation of H(Sn) is 5.8690 while the exact value of H(Sn) is 5.8636 . The expected number of flips required is somewhere between 5.8636 and 7.8636 .
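    The n = 100 numbers quoted above can be reproduced with a few lines of code; this sketch is not part of the original solution.

```python
# Sketch: exact entropy of S_n (negative binomial) versus the Gaussian
# approximation (1/2) log2(2*pi*e*2n), for n = 100 as in the text.
import math

n = 100
H_exact = 0.0
for k in range(n, 20 * n):                       # tail beyond 20n is negligible
    logp = (math.lgamma(k) - math.lgamma(n) - math.lgamma(k - n + 1)
            - k * math.log(2))                   # log C(k-1, n-1) + k*log(1/2)
    p = math.exp(logp)
    if p > 0:
        H_exact -= p * math.log2(p)

H_gauss = 0.5 * math.log2(2 * math.pi * math.e * 2 * n)

print(H_exact)   # ~5.8636 bits
print(H_gauss)   # ~5.8690 bits
```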

30. Markov chain transitions.

                    [ 1/2  1/4  1/4 ]
        P = [Pij] = [ 1/4  1/2  1/4 ]
                    [ 1/4  1/4  1/2 ]

Let X1 be uniformly distributed over the states {0, 1, 2} . Let {Xi}_1^∞ be a Markov chain with transition matrix P ; thus P(Xn+1 = j|Xn = i) = Pij , i, j ∈ {0, 1, 2} .

(a) Is {Xn} stationary?

(b) Find lim_{n→∞} (1/n) H(X1, . . . , Xn) .

Now consider the derived process Z1, Z2, . . . , Zn, where

Z1 = X1

Zi = Xi −Xi−1 (mod 3), i = 2, . . . , n.

Thus Zn encodes the transitions, not the states.

(c) Find H(Z1, Z2, ..., Zn).

(d) Find H(Zn) and H(Xn), for n ≥ 2 .

(e) Find H(Zn|Zn−1) for n ≥ 2 .

(f) Are Zn−1 and Zn independent for n ≥ 2?

Solution:

(a) Let µn denote the probability mass function at time n . Since µ1 = (1/3, 1/3, 1/3) and µ2 = µ1 P = µ1 , we have µn = µ1 = (1/3, 1/3, 1/3) for all n , and {Xn} is stationary.

    Alternatively, the observation that P is doubly stochastic leads to the same conclusion.

(b) Since {Xn} is stationary Markov,

        lim_{n→∞} (1/n) H(X1, . . . , Xn) = H(X2|X1)
                                          = ∑_{k=0}^{2} P(X1 = k) H(X2|X1 = k)
                                          = 3 × (1/3) × H(1/2, 1/4, 1/4)
                                          = 3/2 .

(c) Since (X1, . . . , Xn) and (Z1, . . . , Zn) are one-to-one, by the chain rule of entropy and the Markovity,

        H(Z1, . . . , Zn) = H(X1, . . . , Xn)
                          = ∑_{k=1}^{n} H(Xk|X1, . . . , Xk−1)
                          = H(X1) + ∑_{k=2}^{n} H(Xk|Xk−1)
                          = H(X1) + (n − 1) H(X2|X1)
                          = log 3 + (3/2)(n − 1).

    Alternatively, we can use the results of parts (d), (e), and (f). Since Z1, . . . , Zn are independent and Z2, . . . , Zn are identically distributed with the probability distribution (1/2, 1/4, 1/4) ,

        H(Z1, . . . , Zn) = H(Z1) + H(Z2) + . . . + H(Zn)
                          = H(Z1) + (n − 1) H(Z2)
                          = log 3 + (3/2)(n − 1).

(d) Since {Xn} is stationary with µn = (1/3, 1/3, 1/3) ,

        H(Xn) = H(X1) = H(1/3, 1/3, 1/3) = log 3.

    For n ≥ 2 ,

        Zn = 0, with probability 1/2
             1, with probability 1/4
             2, with probability 1/4

    Hence, H(Zn) = H(1/2, 1/4, 1/4) = 3/2 .

(e) Due to the symmetry of P , P(Zn|Zn−1) = P(Zn) for n ≥ 2 . Hence, H(Zn|Zn−1) = H(Zn) = 3/2 .

    Alternatively, using the result of part (f), we can trivially reach the same conclusion.

(f) Let k ≥ 2 . First observe that by the symmetry of P , Zk+1 = Xk+1 − Xk is independent of Xk . Now,

        P(Zk+1|Xk, Xk−1) = P(Xk+1 − Xk|Xk, Xk−1)
                         = P(Xk+1 − Xk|Xk)
                         = P(Xk+1 − Xk)
                         = P(Zk+1),

    so Zk+1 is independent of (Xk, Xk−1) and hence independent of Zk = Xk − Xk−1 .

    For k = 1 , again by the symmetry of P , Z2 is independent of Z1 = X1 trivially.


31. Markov.

Let {Xi} ∼ Bernoulli(p) . Consider the associated Markov chain {Yi}_{i=1}^{n} where Yi = (the number of 1's in the current run of 1's). For example, if X^n = 101110 . . . , we have Y^n = 101230 . . . .

(a) Find the entropy rate of Xn .

(b) Find the entropy rate of Y n .

Solution: Markov solution.

(a) For an i.i.d. source, H(X ) = H(X) = H(p) .

(b) Observe that Xn and Y n have a one-to-one mapping. Thus, H(Y) = H(X ) =H(p) .

32. Time symmetry.Let {Xn} be a stationary Markov process. We condition on (X0, X1) and look intothe past and future. For what index k is

H(X−n|X0, X1) = H(Xk|X0, X1)?

Give the argument.

Solution: Time symmetry. The trivial solution is k = −n . To find other possible values of k we expand

    H(X−n|X0, X1) = H(X−n, X0, X1) − H(X0, X1)
                  = H(X−n) + H(X0, X1|X−n) − H(X0, X1)
                  = H(X−n) + H(X0|X−n) + H(X1|X0, X−n) − H(X0, X1)
              (a) = H(X−n) + H(X0|X−n) + H(X1|X0) − H(X0, X1)
                  = H(X−n) + H(X0|X−n) − H(X0)
              (b) = H(X0) + H(X0|X−n) − H(X0)
              (c) = H(Xn|X0)
              (d) = H(Xn|X0, X−1)
              (e) = H(Xn+1|X1, X0)

where (a) and (d) come from Markovity and (b), (c) and (e) come from stationarity. Hence k = n + 1 is also a solution. There are no other solutions, since for any other k we can construct a periodic Markov process as a counterexample. Therefore k ∈ {−n, n + 1} .


33. Chain inequality: Let X1 → X2 → X3 → X4 form a Markov chain. Show that

I(X1;X3) + I(X2;X4) ≤ I(X1;X4) + I(X2;X3) (4.155)

Solution: Chain inequality. X1 → X2 → X3 → X4 .

    I(X1; X4) + I(X2; X3) − I(X1; X3) − I(X2; X4)                                          (4.156)
      = H(X1) − H(X1|X4) + H(X2) − H(X2|X3) − (H(X1) − H(X1|X3)) − (H(X2) − H(X2|X4))      (4.157)
      = H(X1|X3) − H(X1|X4) + H(X2|X4) − H(X2|X3)                                          (4.158)
      = H(X1, X2|X3) − H(X2|X1, X3) − (H(X1, X2|X4) − H(X2|X1, X4))                        (4.159)
        + H(X1, X2|X4) − H(X1|X2, X4) − (H(X1, X2|X3) − H(X1|X2, X3))                      (4.160)
      = −H(X2|X1, X3) + H(X2|X1, X4)                                                       (4.161)
      = H(X2|X1, X4) − H(X2|X1, X3, X4)                                                    (4.162)
      = I(X2; X3|X1, X4)                                                                   (4.163)
      ≥ 0                                                                                   (4.164)

where H(X1|X2, X3) = H(X1|X2, X4) and H(X2|X1, X3) = H(X2|X1, X3, X4) by the Markovity of the random variables.

34. Broadcast channel. Let X → Y → (Z,W) form a Markov chain, i.e., p(x, y, z, w) = p(x)p(y|x)p(z, w|y) for all x, y, z, w . Show that

        I(X; Z) + I(X; W) ≤ I(X; Y) + I(Z; W)                   (4.165)

Solution: Broadcast Channel

X → Y → (Z,W) , hence by the data processing inequality, I(X; Y) ≥ I(X; (Z,W)) , and hence

    I(X; Y) + I(Z; W) − I(X; Z) − I(X; W)                                                  (4.166)
      ≥ I(X; Z,W) + I(Z; W) − I(X; Z) − I(X; W)                                            (4.167)
      = H(Z,W) + H(X) − H(X,W,Z) + H(W) + H(Z) − H(W,Z)
        − H(Z) − H(X) + H(X,Z) − H(W) − H(X) + H(W,X)                                      (4.168)
      = −H(X,W,Z) + H(X,Z) + H(X,W) − H(X)                                                 (4.169)
      = H(W|X) − H(W|X,Z)                                                                  (4.170)
      = I(W; Z|X)                                                                           (4.171)
      ≥ 0                                                                                   (4.172)

35. Concavity of second law. Let {Xn}_{−∞}^{∞} be a stationary Markov process. Show that H(Xn|X0) is concave in n . Specifically show that

        H(Xn|X0) − H(Xn−1|X0) − (H(Xn−1|X0) − H(Xn−2|X0)) = −I(X1; Xn−1|X0, Xn)    (4.173)
                                                           ≤ 0                      (4.174)


Thus the second difference is negative, establishing that H(Xn|X0) is a concave function of n .

Solution: Concavity of second law of thermodynamics

Since X0 → Xn−2 → Xn−1 → Xn is a Markov chain,

    H(Xn|X0) − H(Xn−1|X0) − (H(Xn−1|X0) − H(Xn−2|X0))                                      (4.175)
      = H(Xn|X0) − H(Xn−1|X0, X−1) − (H(Xn−1|X0, X−1) − H(Xn−2|X0, X−1))                   (4.176)
      = H(Xn|X0) − H(Xn|X1, X0) − (H(Xn−1|X0) − H(Xn−1|X1, X0))                            (4.177)
      = I(X1; Xn|X0) − I(X1; Xn−1|X0)                                                       (4.178)
      = H(X1|X0) − H(X1|Xn, X0) − H(X1|X0) + H(X1|Xn−1, X0)                                 (4.179)
      = H(X1|Xn−1, X0) − H(X1|Xn, X0)                                                       (4.180)
      = H(X1|Xn−1, Xn, X0) − H(X1|Xn, X0)                                                   (4.181)
      = −I(X1; Xn−1|Xn, X0)                                                                 (4.182)
      ≤ 0                                                                                    (4.183)

where (4.176) and (4.181) follow from Markovity and (4.177) follows from stationarity of the Markov chain.

If we define

        ∆n = H(Xn|X0) − H(Xn−1|X0)                                                          (4.184)

then the above chain of inequalities implies that ∆n − ∆n−1 ≤ 0 , which implies that H(Xn|X0) is a concave function of n .


Chapter 5

Data Compression

1. Uniquely decodable and instantaneous codes. Let L = ∑_{i=1}^{m} pi li^100 be the expected value of the 100th power of the word lengths associated with an encoding of the random variable X . Let L1 = min L over all instantaneous codes, and let L2 = min L over all uniquely decodable codes. What inequality relationship exists between L1 and L2 ?

Solution: Uniquely decodable and instantaneous codes.

        L = ∑_{i=1}^{m} pi li^100                               (5.1)
        L1 = min over instantaneous codes of L                  (5.2)
        L2 = min over uniquely decodable codes of L             (5.3)

Since all instantaneous codes are uniquely decodable, we must have L2 ≤ L1 . Any set of codeword lengths which achieves the minimum of L2 will satisfy the Kraft inequality, and hence we can construct an instantaneous code with the same codeword lengths, and hence the same L . Hence we have L1 ≤ L2 . From both these conditions, we must have L1 = L2 .

2. How many fingers has a Martian? Let

        S = ( S1, . . . , Sm )
            ( p1, . . . , pm )

The Si 's are encoded into strings from a D-symbol output alphabet in a uniquely decodable manner. If m = 6 and the codeword lengths are (l1, l2, . . . , l6) = (1, 1, 2, 3, 2, 3) , find a good lower bound on D . You may wish to explain the title of the problem.

Solution: How many fingers has a Martian?

Uniquely decodable codes satisfy Kraft's inequality. Therefore

        f(D) = D^{−1} + D^{−1} + D^{−2} + D^{−3} + D^{−2} + D^{−3} ≤ 1.    (5.4)


We have f(2) = 7/4 > 1 , hence D > 2 . We have f(3) = 26/27 < 1 . So a possiblevalue of D is 3. Our counting system is base 10, probably because we have 10 fingers.Perhaps the Martians were using a base 3 representation because they have 3 fingers.(Maybe they are like Maine lobsters ?)
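A tiny sketch (not part of the original solution) that checks the Kraft sum for increasing alphabet sizes:

```python
# Sketch: find the smallest alphabet size D for which the Kraft inequality holds
# for the codeword lengths (1, 1, 2, 3, 2, 3).
lengths = [1, 1, 2, 3, 2, 3]

def kraft_sum(D, lengths):
    return sum(D ** (-l) for l in lengths)

D = 2
while kraft_sum(D, lengths) > 1:
    D += 1

print(kraft_sum(2, lengths))   # 7/4 > 1, so D = 2 is impossible
print(kraft_sum(3, lengths))   # 26/27 <= 1
print(D)                       # 3
```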

3. Slackness in the Kraft inequality. An instantaneous code has word lengths l1, l2, . . . , lm which satisfy the strict inequality

        ∑_{i=1}^{m} D^{−li} < 1.

The code alphabet is D = {0, 1, 2, . . . , D − 1} . Show that there exist arbitrarily long sequences of code symbols in D* which cannot be decoded into sequences of codewords.

Solution: Slackness in the Kraft inequality. Instantaneous codes are prefix free codes, i.e., no codeword is a prefix of any other codeword. Let nmax = max{n1, n2, . . . , nq} . There are D^{nmax} sequences of length nmax . Of these sequences, D^{nmax − ni} start with the i-th codeword. Because of the prefix condition no two sequences can start with the same codeword. Hence the total number of sequences which start with some codeword is ∑_{i=1}^{q} D^{nmax − ni} = D^{nmax} ∑_{i=1}^{q} D^{−ni} < D^{nmax} . Hence there are sequences which do not start with any codeword. These and all longer sequences with these length-nmax sequences as prefixes cannot be decoded. (This situation can be visualized with the aid of a tree.)

Alternatively, we can map codewords onto D-adic intervals on the real line, corresponding to real numbers whose D-ary expansions start with that codeword. Since the length of the interval for a codeword of length ni is D^{−ni} , and ∑ D^{−ni} < 1 , there exists some interval(s) not used by any codeword. The sequences in these intervals do not begin with any codeword and hence cannot be decoded.

4. Huffman coding. Consider the random variable

        X = ( x1     x2     x3     x4     x5     x6     x7   )
            ( 0.49   0.26   0.12   0.04   0.04   0.03   0.02 )

(a) Find a binary Huffman code for X .

(b) Find the expected codelength for this encoding.

(c) Find a ternary Huffman code for X .

Solution: Examples of Huffman codes.

(a) The Huffman tree for this distribution is

    Codeword
    1      x1  0.49  0.49  0.49  0.49  0.49  0.51  1
    00     x2  0.26  0.26  0.26  0.26  0.26  0.49
    011    x3  0.12  0.12  0.12  0.13  0.25
    01000  x4  0.04  0.05  0.08  0.12
    01001  x5  0.04  0.04  0.05
    01010  x6  0.03  0.04
    01011  x7  0.02

(b) The expected length of the codewords for the binary Huffman code is 2.02 bits. (H(X) = 2.01 bits.)

(c) The ternary Huffman tree is

    Codeword
    0    x1  0.49  0.49  0.49  1.0
    1    x2  0.26  0.26  0.26
    20   x3  0.12  0.12  0.25
    22   x4  0.04  0.09
    210  x5  0.04  0.04
    211  x6  0.03
    212  x7  0.02

This code has an expected length 1.34 ternary symbols. (H3(X) = 1.27 ternarysymbols).
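For reference, here is a compact binary Huffman coder (a sketch, not from the text) built on Python's heapq. Ties may be broken differently than in the table above, so the codewords can differ, but the expected length of 2.02 bits is reproduced.

```python
# Sketch: binary Huffman coding via a heap; ties may be broken differently than
# in the table above, but the expected length is the same.
import heapq

def huffman(probs):
    # heap items: (probability, counter, {symbol: partial codeword})
    heap = [(p, i, {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = [0.49, 0.26, 0.12, 0.04, 0.04, 0.03, 0.02]
code = huffman(probs)
avg_len = sum(p * len(code[i]) for i, p in enumerate(probs))
print(code)
print(avg_len)   # 2.02 bits
```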

5. More Huffman codes. Find the binary Huffman code for the source with probabilities(1/3, 1/5, 1/5, 2/15, 2/15) . Argue that this code is also optimal for the source withprobabilities (1/5, 1/5, 1/5, 1/5, 1/5).

Solution: More Huffman codes. The Huffman code for the source with probabilities (1/3, 1/5, 1/5, 2/15, 2/15) has codewords {00, 10, 11, 010, 011}.

To show that this code (*) is also optimal for (1/5, 1/5, 1/5, 1/5, 1/5) we have to show that it has minimum expected length, that is, no shorter code can be constructed without violating H(X) ≤ EL .

        H(X) = log 5 = 2.32 bits.                                (5.5)
        E(L(∗)) = 2 × (3/5) + 3 × (2/5) = 12/5 bits.             (5.6)

Since

        E(L(any code)) = ∑_{i=1}^{5} li/5 = k/5 bits              (5.7)

for some integer k , the next lowest possible value of E(L) is 11/5 = 2.2 bits, which is less than H(X) = 2.32 bits and hence not achievable. Hence (*) is optimal.

Note that one could also prove the optimality of (*) by showing that the Huffman code for the (1/5, 1/5, 1/5, 1/5, 1/5) source has average length 12/5 bits. (Since each Huffman code produced by the Huffman encoding algorithm is optimal, they all have the same average length.)


6. Bad codes. Which of these codes cannot be Huffman codes for any probability assignment?

(a) {0, 10, 11} .

(b) {00, 01, 10, 110} .

(c) {01, 10} .

Solution: Bad codes

(a) {0,10,11} is a Huffman code for the distribution (1/2,1/4,1/4).

(b) The code {00,01,10, 110} can be shortened to {00,01,10, 11} without losing itsinstantaneous property, and therefore is not optimal, so it cannot be a Huffmancode. Alternatively, it is not a Huffman code because there is a unique longestcodeword.

(c) The code {01,10} can be shortened to {0,1} without losing its instantaneous prop-erty, and therefore is not optimal and not a Huffman code.

7. Huffman 20 questions. Consider a set of n objects. Let Xi = 1 or 0 accordingly asthe i-th object is good or defective. Let X1, X2, . . . , Xn be independent with Pr{Xi =1} = pi ; and p1 > p2 > . . . > pn > 1/2 . We are asked to determine the set of alldefective objects. Any yes-no question you can think of is admissible.

(a) Give a good lower bound on the minimum average number of questions required.

(b) If the longest sequence of questions is required by nature’s answers to our questions,what (in words) is the last question we should ask? And what two sets are wedistinguishing with this question? Assume a compact (minimum average length)sequence of questions.

(c) Give an upper bound (within 1 question) on the minimum average number ofquestions required.

Solution: Huffman 20 Questions.

(a) We will be using the questions to determine the sequence X1, X2, . . . , Xn , where Xi is 1 or 0 according to whether the i-th object is good or defective. Thus the most likely sequence is all 1's, with a probability of ∏_{i=1}^{n} pi , and the least likely sequence is the all 0's sequence with probability ∏_{i=1}^{n} (1 − pi) . Since the optimal set of questions corresponds to a Huffman code for the source, a good lower bound on the average number of questions is the entropy of the sequence X1, X2, . . . , Xn . But since the Xi 's are independent Bernoulli random variables, we have

        EQ ≥ H(X1, X2, . . . , Xn) = ∑ H(Xi) = ∑ H(pi).        (5.8)


(b) The last bit in the Huffman code distinguishes between the least likely sourcesymbols. (By the conditions of the problem, all the probabilities are different,and thus the two least likely sequences are uniquely defined.) In this case, thetwo least likely sequences are 000 . . . 00 and 000 . . . 01 , which have probabilities(1−p1)(1−p2) . . . (1−pn) and (1−p1)(1−p2) . . . (1−pn−1)pn respectively. Thusthe last question will ask “Is Xn = 1”, i.e., “Is the last item defective?”.

(c) By the same arguments as in Part (a), an upper bound on the minimum averagenumber of questions is an upper bound on the average length of a Huffman code,namely H(X1, X2, . . . , Xn) + 1 =

∑H(pi) + 1 .

8. Simple optimum compression of a Markov source. Consider the 3-state Markov processU1, U2, . . . , having transition matrix

        Un−1\Un   S1    S2    S3
        S1        1/2   1/4   1/4
        S2        1/4   1/2   1/4
        S3        0     1/2   1/2

Thus the probability that S1 follows S3 is equal to zero. Design 3 codes C1, C2, C3 (one for each state 1, 2 and 3), each code mapping elements of the set of Si 's into sequences of 0's and 1's, such that this Markov process can be sent with maximal compression by the following scheme:

(a) Note the present symbol Xn = i .

(b) Select code Ci.

(c) Note the next symbol Xn+1 = j and send the codeword in Ci corresponding toj .

(d) Repeat for the next symbol.

What is the average message length of the next symbol conditioned on the previousstate Xn = i using this coding scheme? What is the unconditional average numberof bits per source symbol? Relate this to the entropy rate H(U) of the Markovchain.

Solution: Simple optimum compression of a Markov source.

It is easy to design an optimal code for each state. A possible solution is

        Next state   S1    S2    S3
        code C1      0     10    11      E(L|C1) = 1.5 bits/symbol
        code C2      10    0     11      E(L|C2) = 1.5 bits/symbol
        code C3      -     0     1       E(L|C3) = 1 bit/symbol

The average message lengths of the next symbol conditioned on the previous state being Si are just the expected lengths of the codes Ci . Note that this code assignment achieves the conditional entropy lower bound.

The average message lengths of the next symbol conditioned on the previous statebeing Si are just the expected lengths of the codes Ci . Note that this code assignmentachieves the conditional entropy lower bound.


To find the unconditional average, we have to find the stationary distribution on the states. Let µ be the stationary distribution. Then

                [ 1/2  1/4  1/4 ]
        µ = µ   [ 1/4  1/2  1/4 ]                               (5.9)
                [ 0    1/2  1/2 ]

We can solve this to find that µ = (2/9, 4/9, 1/3) . Thus the unconditional average number of bits per source symbol is

        EL = ∑_{i=1}^{3} µi E(L|Ci)                              (5.10)
           = (2/9) × 1.5 + (4/9) × 1.5 + (1/3) × 1               (5.11)
           = 4/3 bits/symbol.                                    (5.12)

The entropy rate H of the Markov chain is

        H = H(X2|X1)                                             (5.13)
          = ∑_{i=1}^{3} µi H(X2|X1 = Si)                         (5.14)
          = 4/3 bits/symbol.                                     (5.15)

Thus the unconditional average number of bits per source symbol and the entropy rateH of the Markov chain are equal, because the expected length of each code Ci equalsthe entropy of the state after state i , H(X2|X1 = Si) , and thus maximal compressionis obtained.
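The stationary distribution and the average rate of 4/3 bits per symbol can be verified numerically; the sketch below is not part of the original solution.

```python
# Sketch: recompute the stationary distribution and the average code length for
# the state-dependent codes C1, C2, C3 given above.
import numpy as np

P = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

# stationary distribution: left eigenvector of P with eigenvalue 1
w, v = np.linalg.eig(P.T)
mu = np.real(v[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()

code_lengths = np.array([[1, 2, 2],    # C1: S1->0, S2->10, S3->11
                         [2, 1, 2],    # C2: S1->10, S2->0, S3->11
                         [0, 1, 1]])   # C3: S2->0, S3->1 (S1 cannot follow S3)

E_L_given_state = (P * code_lengths).sum(axis=1)
print(mu)                      # [2/9, 4/9, 1/3]
print(E_L_given_state)         # [1.5, 1.5, 1.0]
print(mu @ E_L_given_state)    # 4/3 bits/symbol
```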

9. Optimal code lengths that require one bit above entropy. The source coding theoremshows that the optimal code for a random variable X has an expected length less thanH(X) + 1 . Give an example of a random variable for which the expected length ofthe optimal code is close to H(X) + 1 , i.e., for any ε > 0 , construct a distribution forwhich the optimal code has L > H(X) + 1 − ε .

Solution: Optimal code lengths that require one bit above entropy. There is a trivialexample that requires almost 1 bit above its entropy. Let X be a binary randomvariable with probability of X = 1 close to 1. Then entropy of X is close to 0 , butthe length of its optimal code is 1 bit, which is almost 1 bit above its entropy.

10. Ternary codes that achieve the entropy bound. A random variable X takes on m values and has entropy H(X) . An instantaneous ternary code is found for this source, with average length

        L = H(X)/log2 3 = H3(X).                                 (5.16)


(a) Show that each symbol of X has a probability of the form 3−i for some i .

(b) Show that m is odd.

Solution: Ternary codes that achieve the entropy bound.

(a) We will argue that an optimal ternary code that meets the entropy bound corresponds to a complete ternary tree, with the probability of each leaf of the form 3^{−i} . To do this, we essentially repeat the arguments of Theorem 5.3.1. We achieve the ternary entropy bound only if D(p||r) = 0 and c = 1 in (5.25). Thus we achieve the entropy bound if and only if pi = 3^{−li} for all i .

(b) We will show that any distribution that has pi = 3^{−li} for all i must have an odd number of symbols. We know from Theorem 5.2.1 that, given the set of lengths li , we can construct a ternary tree with nodes at the depths li . Now, since ∑ 3^{−li} = 1 , the tree must be complete. A complete ternary tree has an odd number of leaves (this can be proved by induction on the number of internal nodes). Thus the number of source symbols is odd.

    Another simple argument is to use basic number theory. We know that for this distribution, ∑ 3^{−li} = 1 . We can write this as 3^{−lmax} ∑ 3^{lmax − li} = 1 or ∑ 3^{lmax − li} = 3^{lmax} . Each of the terms in the sum is odd, and since their sum is odd, the number of terms in the sum has to be odd (the sum of an even number of odd terms is even). Thus there are an odd number of source symbols for any code that meets the ternary entropy bound.

11. Suffix condition. Consider codes that satisfy the suffix condition, which says that nocodeword is a suffix of any other codeword. Show that a suffix condition code is uniquelydecodable, and show that the minimum average length over all codes satisfying thesuffix condition is the same as the average length of the Huffman code for that randomvariable.

Solution: Suffix condition. The fact that the codes are uniquely decodable can be seen easily by reversing the order of the code. For any received sequence, we work backwards from the end, and look for the reversed codewords. Since the codewords satisfy the suffix condition, the reversed codewords satisfy the prefix condition, and then we can uniquely decode the reversed code.

The fact that we achieve the same minimum expected length then follows directly fromthe results of Section 5.5. But we can use the same reversal argument to argue thatcorresponding to every suffix code, there is a prefix code of the same length and viceversa, and therefore we cannot achieve any lower codeword lengths with a suffix codethan we can with a prefix code.

12. Shannon codes and Huffman codes. Consider a random variable X which takes on four values with probabilities (1/3, 1/3, 1/4, 1/12) .

(a) Construct a Huffman code for this random variable.


(b) Show that there exist two different sets of optimal lengths for the codewords,namely, show that codeword length assignments (1, 2, 3, 3) and (2, 2, 2, 2) areboth optimal.

(c) Conclude that there are optimal codes with codeword lengths for some symbols that exceed the Shannon code length ⌈log(1/p(x))⌉ .

Solution: Shannon codes and Huffman codes.

(a) Applying the Huffman algorithm gives us the following table

    Code   Symbol   Probability
    0      1        1/3   1/3   2/3   1
    11     2        1/3   1/3   1/3
    101    3        1/4   1/3
    100    4        1/12

which gives codeword lengths of 1,2,3,3 for the different codewords.

(b) Both set of lengths 1,2,3,3 and 2,2,2,2 satisfy the Kraft inequality, and they bothachieve the same expected length (2 bits) for the above distribution. Thereforethey are both optimal.

(c) The symbol with probability 1/4 has a Huffman code of length 3, which is greater than ⌈log(1/p)⌉ . Thus the Huffman code for a particular symbol may be longer than the Shannon code for that symbol. But on the average, the Huffman code cannot be longer than the Shannon code.

13. Twenty questions. Player A chooses some object in the universe, and player B attemptsto identify the object with a series of yes-no questions. Suppose that player B is cleverenough to use the code achieving the minimal expected length with respect to playerA’s distribution. We observe that player B requires an average of 38.5 questions todetermine the object. Find a rough lower bound to the number of objects in theuniverse.

Solution: Twenty questions.

        37.5 = L* − 1 < H(X) ≤ log |X|                          (5.17)

and hence the number of objects in the universe > 2^{37.5} = 1.94 × 10^{11} .

14. Huffman code. Find the (a) binary and (b) ternary Huffman codes for the randomvariable X with probabilities

        p = (1/21, 2/21, 3/21, 4/21, 5/21, 6/21) .

(c) Calculate L = ∑ pi li in each case.

Solution: Huffman code.


(a) The Huffman tree for this distribution is

    Codeword
    00    x1  6/21  6/21  6/21  9/21  12/21  1
    10    x2  5/21  5/21  6/21  6/21  9/21
    11    x3  4/21  4/21  5/21  6/21
    010   x4  3/21  3/21  4/21
    0110  x5  2/21  3/21
    0111  x6  1/21

(b) The ternary Huffman tree is

    Codeword
    1    x1  6/21  6/21  10/21  1
    2    x2  5/21  5/21  6/21
    00   x3  4/21  4/21  5/21
    01   x4  3/21  3/21
    020  x5  2/21  3/21
    021  x6  1/21
    022  x7  0/21

(c) The expected length of the codewords for the binary Huffman code is 51/21 = 2.43 bits.

    The ternary code has an expected length of 34/21 = 1.62 ternary symbols.

15. Huffman codes.

(a) Construct a binary Huffman code for the following distribution on 5 symbols p =(0.3, 0.3, 0.2, 0.1, 0.1) . What is the average length of this code?

(b) Construct a probability distribution p′ on 5 symbols for which the code that youconstructed in part (a) has an average length (under p′ ) equal to its entropyH(p′) .

Solution: Huffman codes

(a) The code constructed by the standard Huffman procedure is

    Codeword   X   Probability
    10         1   0.3   0.3   0.4   0.6   1
    11         2   0.3   0.3   0.3   0.4
    00         3   0.2   0.2   0.3
    010        4   0.1   0.2
    011        5   0.1

    The average length = 2 × 0.8 + 3 × 0.2 = 2.2 bits/symbol.

(b) The code would have a rate equal to the entropy if each of the codewords were of length log(1/p(X)) . In this case, the code constructed above would be efficient for the distribution (0.25, 0.25, 0.25, 0.125, 0.125).

16. Huffman codes: Consider a random variable X which takes 6 values {A,B,C,D,E, F}with probabilities (0.5, 0.25, 0.1, 0.05, 0.05, 0.05) respectively.


(a) Construct a binary Huffman code for this random variable. What is its averagelength?

(b) Construct a quaternary Huffman code for this random variable, i.e., a code overan alphabet of four symbols (call them a, b, c and d ). What is the average lengthof this code?

(c) One way to construct a binary code for the random variable is to start with aquaternary code, and convert the symbols into binary using the mapping a→ 00 ,b → 01 , c → 10 and d → 11 . What is the average length of the binary code forthe above random variable constructed by this process?

(d) For any random variable X , let LH be the average length of the binary Huffmancode for the random variable, and let LQB be the average length code constructedby first building a quaternary Huffman code and converting it to binary. Showthat

LH ≤ LQB < LH + 2 (5.18)

(e) The lower bound in the previous example is tight. Give an example where thecode constructed by converting an optimal quaternary code is also the optimalbinary code.

(f) The upper bound, i.e., LQB < LH + 2 is not tight. In fact, a better bound isLQB ≤ LH + 1. Prove this bound, and provide an example where this bound istight.

Solution: Huffman codes: Consider a random variable X which takes 6 values {A,B,C,D,E, F}with probabilities (0.5, 0.25, 0.1, 0.05, 0.05, 0.05) respectively.

(a) Construct a binary Huffman code for this random variable. What is its averagelength?

Solution:

    Code   Source symbol   Prob.
    0      A   0.5    0.5   0.5   0.5   0.5   1.0
    10     B   0.25   0.25  0.25  0.25  0.5
    1100   C   0.1    0.1   0.15  0.25
    1101   D   0.05   0.1   0.1
    1110   E   0.05   0.05
    1111   F   0.05

    The average length of this code is 1 × 0.5 + 2 × 0.25 + 4 × (0.1 + 0.05 + 0.05 + 0.05) = 2 bits. The entropy H(X) in this case is 1.98 bits.

(b) Construct a quaternary Huffman code for this random variable, i.e., a code overan alphabet of four symbols (call them a, b, c and d ). What is the average lengthof this code?

    Solution: Since the number of symbols, i.e., 6, is not of the form 1 + k(D − 1) , we need to add a dummy symbol of probability 0 to bring it to this form. In this case, drawing up the Huffman tree is straightforward.

    Code   Symbol   Prob.
    a      A   0.5    0.5   1.0
    b      B   0.25   0.25
    d      C   0.1    0.15
    ca     D   0.05   0.1
    cb     E   0.05
    cc     F   0.05
    cd     G   0.0

    The average length of this code is 1 × 0.85 + 2 × 0.15 = 1.15 quaternary symbols.

(c) One way to construct a binary code for the random variable is to start with aquaternary code, and convert the symbols into binary using the mapping a→ 00 ,b → 01 , c → 10 and d → 11 . What is the average length of the binary code forthe above random variable constructed by this process?

Solution:The code constructed by the above process is A → 00 , B → 01 , C →11 , D → 1000 , E → 1001 , and F → 1010 , and the average length is 2 × 0.85 +4 × 0.15 = 2.3 bits.

(d) For any random variable X , let LH be the average length of the binary Huffman code for the random variable, and let LQB be the average length of the code constructed by first building a quaternary Huffman code and converting it to binary. Show that

LH ≤ LQB < LH + 2 (5.19)

Solution:Since the binary code constructed from the quaternary code is also in-stantaneous, its average length cannot be better than the average length of thebest instantaneous code, i.e., the Huffman code. That gives the lower bound ofthe inequality above.

To prove the upper bound, let LQ be the length of the optimal quaternary code. Then from the results proved in the book, we have

H4(X) ≤ LQ < H4(X) + 1 (5.20)

Also, it is easy to see that LQB = 2LQ , since each symbol in the quaternary codeis converted into two bits. Also, from the properties of entropy, it follows thatH4(X) = H2(X)/2 . Substituting these in the previous equation, we get

H2(X) ≤ LQB < H2(X) + 2. (5.21)

Combining this with the bound that H2(X) ≤ LH , we obtain LQB < LH + 2.

(e) The lower bound in the previous example is tight. Give an example where thecode constructed by converting an optimal quaternary code is also the optimalbinary code?

Solution:Consider a random variable that takes on four equiprobable values.Then the quaternary Huffman code for this is 1 quaternary symbol for each sourcesymbol, with average length 1 quaternary symbol. The average length LQB forthis code is then 2 bits. The Huffman code for this case is also easily seen to assign2 bit codewords to each symbol, and therefore for this case, LH = LQB .


(f) (Optional, no credit) The upper bound, i.e., LQB < LH +2 is not tight. In fact, abetter bound is LQB ≤ LH +1. Prove this bound, and provide an example wherethis bound is tight.

    Solution: Consider a binary Huffman code for the random variable X and consider all codewords of odd length. Append a 0 to each of these codewords, and we will obtain an instantaneous code where all the codewords have even length. Then we can use the inverse of the mapping mentioned in part (c) to construct a quaternary code for the random variable; it is easy to see that the quaternary code is also instantaneous. Let LBQ be the average length of this quaternary code. Since the lengths of the quaternary codewords of BQ are half the lengths of the corresponding binary codewords, we have

        LBQ = (1/2) ( LH + ∑_{i : li is odd} pi ) < (LH + 1)/2              (5.22)

    and since the BQ code is at best as good as the quaternary Huffman code, we have

        LBQ ≥ LQ                                                             (5.23)

    Therefore LQB = 2LQ ≤ 2LBQ < LH + 1 .

    An example where this upper bound is tight is the case when we have only two possible symbols. Then LH = 1 and LQB = 2 .

17. Data compression. Find an optimal set of binary codeword lengths l1, l2, . . . (minimizing ∑ pi li ) for an instantaneous code for each of the following probability mass functions:

(a) p = (10/41, 9/41, 8/41, 7/41, 7/41)

(b) p = (9/10, (9/10)(1/10), (9/10)(1/10)^2, (9/10)(1/10)^3, . . .)

Solution: Data compression

(a)

    Code   Source symbol   Prob.
    10     A   10/41   14/41   17/41   24/41   41/41
    00     B   9/41    10/41   14/41   17/41
    01     C   8/41    9/41    10/41
    110    D   7/41    8/41
    111    E   7/41

(b) This is a case of a Huffman code on an infinite alphabet. If we consider an initial subset of the symbols, we can see that the cumulative probability of all symbols {x : x > i} is ∑_{j>i} 0.9 (0.1)^{j−1} = 0.9 (0.1)^i / (1 − 0.1) = (0.1)^i . Since this is less than 0.9 (0.1)^{i−1} , the probability of symbol i , the cumulative sum of all the remaining terms is less than the last term used. Thus Huffman coding will always merge the last two terms. This in turn implies that the Huffman code in this case is of the form 1, 01, 001, 0001, etc.


18. Classes of codes. Consider the code {0, 01}

(a) Is it instantaneous?

(b) Is it uniquely decodable?

(c) Is it nonsingular?

Solution: Codes.

(a) No, the code is not instantaneous, since the first codeword, 0, is a prefix of thesecond codeword, 01.

(b) Yes, the code is uniquely decodable. Given a sequence of codewords, first isolateoccurrences of 01 (i.e., find all the ones) and then parse the rest into 0’s.

(c) Yes, all uniquely decodable codes are non-singular.

19. The game of Hi-Lo.

(a) A computer generates a number X according to a known probability mass functionp(x), x ∈ {1, 2, . . . , 100} . The player asks a question, “Is X = i ?” and is told“Yes”, “You’re too high,” or “You’re too low.” He continues for a total of sixquestions. If he is right (i.e., he receives the answer “Yes”) during this sequence,he receives a prize of value v(X). How should the player proceed to maximize hisexpected winnings?

(b) The above doesn’t have much to do with information theory. Consider the fol-lowing variation: X ∼ p(x), prize = v(x) , p(x) known, as before. But arbitraryYes-No questions are asked sequentially until X is determined. (“Determined”doesn’t mean that a “Yes” answer is received.) Questions cost one unit each. Howshould the player proceed? What is the expected payoff?

(c) Continuing (b), what if v(x) is fixed, but p(x) can be chosen by the computer(and then announced to the player)? The computer wishes to minimize the player’sexpected return. What should p(x) be? What is the expected return to the player?

Solution: The game of Hi-Lo.

(a) The first thing to recognize in this problem is that the player cannot cover more than 63 values of X with 6 questions. This can be easily seen by induction. With one question, there is only one value of X that can be covered. With two questions, there is one value of X that can be covered with the first question, and depending on the answer to the first question, there are two possible values of X that can be asked about in the next question. By extending this argument, we see that we can ask at most 63 different questions of the form "Is X = i ?" with 6 questions. (The fact that we have narrowed the range at the end is irrelevant, if we have not isolated the value of X .)

    Thus if the player seeks to maximize his return, he should choose the 63 most valuable outcomes for X , and play to isolate these values. The probabilities are irrelevant to this procedure. He will choose the 63 most valuable outcomes, and his first question will be "Is X = i ?" where i is the median of these 63 numbers. After isolating to either half, his next question will be "Is X = j ?", where j is the median of that half. Proceeding this way, he will win if X is one of the 63 most valuable outcomes, and lose otherwise. This strategy maximizes his expected winnings.

(b) Now if arbitrary questions are allowed, the game reduces to a game of 20 questions to determine the object. The return in this case to the player is ∑_x p(x)(v(x) − l(x)), where l(x) is the number of questions required to determine the object. Maximizing the return is equivalent to minimizing the expected number of questions, and thus, as argued in the text, the optimal strategy is to construct a Huffman code for the source and use that to construct a question strategy. His expected return is therefore between ∑ p(x)v(x) − H − 1 and ∑ p(x)v(x) − H.

(c) A computer wishing to minimize the return to the player will want to minimize ∑ p(x)v(x) − H(X) over choices of p(x). We can write this as a standard minimization problem with constraints. Let

J(p) = ∑ p_i v_i + ∑ p_i log p_i + λ ∑ p_i,    (5.24)

and differentiating and setting to 0, we obtain

v_i + log p_i + 1 + λ = 0,    (5.25)

or, after normalizing to ensure that the p_i's form a probability distribution,

p_i = 2^{−v_i} / ∑_j 2^{−v_j}.    (5.26)

To complete the proof, we let r_i = 2^{−v_i} / ∑_j 2^{−v_j}, and rewrite the return as

∑ p_i v_i + ∑ p_i log p_i = ∑ p_i log p_i − ∑ p_i log 2^{−v_i}    (5.27)
                          = ∑ p_i log p_i − ∑ p_i log r_i − log(∑ 2^{−v_j})    (5.28)
                          = D(p||r) − log(∑ 2^{−v_j}),    (5.29)

and thus the return is minimized by choosing p_i = r_i. This is the distribution that the computer must choose to minimize the return to the player.
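A small numeric illustration of (5.26)-(5.29): for hypothetical prize values v (our choice, not from the text), the minimized idealized return ∑ p_i v_i − H(p) equals −log_2 ∑_j 2^{−v_j}.

import math

v = [3.0, 2.0, 2.0, 1.0]                  # hypothetical prize values v(x)
Z = sum(2 ** (-vi) for vi in v)
p = [2 ** (-vi) / Z for vi in v]          # p_i = 2^{-v_i} / sum_j 2^{-v_j}

H = -sum(pi * math.log2(pi) for pi in p)
ret = sum(pi * vi for pi, vi in zip(p, v)) - H
print(ret, -math.log2(Z))                 # the two numbers agree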

20. Huffman codes with costs. Words like Run! Help! and Fire! are short, not because they are frequently used, but perhaps because time is precious in the situations in which these words are required. Suppose that X = i with probability p_i, i = 1, 2, . . . , m. Let l_i be the number of binary symbols in the codeword associated with X = i, and let c_i denote the cost per letter of the codeword when X = i. Thus the average cost C of the description of X is C = ∑_{i=1}^m p_i c_i l_i.


(a) Minimize C over all l_1, l_2, . . . , l_m such that ∑ 2^{−l_i} ≤ 1. Ignore any implied integer constraints on l_i. Exhibit the minimizing l*_1, l*_2, . . . , l*_m and the associated minimum value C*.

(b) How would you use the Huffman code procedure to minimize C over all uniquely decodable codes? Let C_Huffman denote this minimum.

(c) Can you show that

C* ≤ C_Huffman ≤ C* + ∑_{i=1}^m p_i c_i ?

Solution: Huffman codes with costs.

(a) We wish to minimize C = ∑ p_i c_i n_i subject to ∑ 2^{−n_i} ≤ 1. We will assume equality in the constraint and let r_i = 2^{−n_i} and let Q = ∑_i p_i c_i. Let q_i = (p_i c_i)/Q. Then q also forms a probability distribution and we can write C as

C = ∑ p_i c_i n_i    (5.30)
  = Q ∑ q_i log (1/r_i)    (5.31)
  = Q ( ∑ q_i log (q_i/r_i) − ∑ q_i log q_i )    (5.32)
  = Q (D(q||r) + H(q)).    (5.33)

Since the only freedom is in the choice of r_i, we can minimize C by choosing r = q, or

n*_i = − log ( p_i c_i / ∑ p_j c_j ),    (5.34)

where we have ignored any integer constraints on n_i. The minimum cost C* for this assignment of codewords is

C* = Q H(q).    (5.35)

(b) If we use q instead of p for the Huffman procedure, we obtain a code minimizing expected cost.

(c) Now we can account for the integer constraints. Let

n_i = ⌈− log q_i⌉.    (5.36)

Then

− log q_i ≤ n_i < − log q_i + 1.    (5.37)

Multiplying by p_i c_i and summing over i, we get the relationship

C* ≤ C_Huffman < C* + Q.    (5.38)
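The bound (5.38) is easy to check numerically. The sketch below is ours (the probabilities and costs are hypothetical examples): it computes C* = Q H(q) and then runs an ordinary Huffman construction on q.

import heapq, math

def huffman_lengths(weights):
    # Optimal prefix-code lengths for the given weights (standard Huffman merging).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    tie = len(weights)
    while len(heap) > 1:
        w0, _, s0 = heapq.heappop(heap)
        w1, _, s1 = heapq.heappop(heap)
        for i in s0 + s1:
            lengths[i] += 1
        heapq.heappush(heap, (w0 + w1, tie, s0 + s1)); tie += 1
    return lengths

p = [0.5, 0.25, 0.15, 0.10]     # hypothetical probabilities
c = [1.0, 4.0, 2.0, 1.0]        # hypothetical per-letter costs
Q = sum(pi * ci for pi, ci in zip(p, c))
q = [pi * ci / Q for pi, ci in zip(p, c)]

C_star = Q * (-sum(qi * math.log2(qi) for qi in q))          # C* = Q H(q)
n = huffman_lengths(q)                                       # Huffman code built for q
C_huff = sum(pi * ci * ni for pi, ci, ni in zip(p, c, n))
print(C_star, C_huff, C_star + Q)   # C* <= C_Huffman < C* + Q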


21. Conditions for unique decodability. Prove that a code C is uniquely decodable if (and only if) the extension

C^k(x_1, x_2, . . . , x_k) = C(x_1)C(x_2) · · · C(x_k)

is a one-to-one mapping from X^k to D* for every k ≥ 1. (The “only if” part is obvious.)

Solution: Conditions for unique decodability. If C^k is not one-to-one for some k, then C is not UD, since there exist two distinct sequences, (x_1, . . . , x_k) and (x'_1, . . . , x'_k), such that

C^k(x_1, . . . , x_k) = C(x_1) · · · C(x_k) = C(x'_1) · · · C(x'_k) = C^k(x'_1, . . . , x'_k).

Conversely, if C is not UD, then by definition there exist distinct sequences of source symbols, (x_1, . . . , x_i) and (y_1, . . . , y_j), such that

C(x1)C(x2) · · ·C(xi) = C(y1)C(y2) · · ·C(yj) .

Concatenating the input sequences (x1, . . . , xi) and (y1, . . . , yj) , we obtain

C(x1) · · ·C(xi)C(y1) · · ·C(yj) = C(y1) · · ·C(yj)C(x1) · · ·C(xi) ,

which shows that Ck is not one-to-one for k = i+ j .

22. Average length of an optimal code. Prove that L(p_1, . . . , p_m), the average codeword length for an optimal D-ary prefix code for probabilities {p_1, . . . , p_m}, is a continuous function of p_1, . . . , p_m. This is true even though the optimal code changes discontinuously as the probabilities vary.

Solution: Average length of an optimal code. The longest possible codeword in an optimal code has m − 1 binary digits; this corresponds to a completely unbalanced tree in which each codeword has a different length. Using a D-ary alphabet for codewords can only decrease the maximum length. Since we know the maximum possible codeword length, there are only a finite number of possible codes to consider. For each candidate code C, the average codeword length is determined by the probability distribution p_1, p_2, . . . , p_m:

L(C) = ∑_{i=1}^m p_i ℓ_i.

This is a linear, and therefore continuous, function of p_1, p_2, . . . , p_m. The optimal code is the candidate code with the minimum L, and its length is the minimum of a finite number of continuous functions and is therefore itself a continuous function of p_1, p_2, . . . , p_m.

23. Unused code sequences. Let C be a variable length code that satisfies the Kraft inequality with equality but does not satisfy the prefix condition.

(a) Prove that some finite sequence of code alphabet symbols is not the prefix of any sequence of codewords.

(b) (Optional) Prove or disprove: C has infinite decoding delay.


Solution: Unused code sequences. Let C be a variable length code that satisfies the Kraft inequality with equality but does not satisfy the prefix condition.

(a) When a prefix code satisfies the Kraft inequality with equality, every (infinite) sequence of code alphabet symbols corresponds to a sequence of codewords, since the probability that a randomly generated sequence begins with a codeword is

∑_{i=1}^m D^{−ℓ_i} = 1.

If the code does not satisfy the prefix condition, then at least one codeword, say C(x_1), is a prefix of another, say C(x_m). Then the probability that a randomly generated sequence begins with a codeword is at most

∑_{i=1}^{m−1} D^{−ℓ_i} ≤ 1 − D^{−ℓ_m} < 1,

which shows that not every sequence of code alphabet symbols is the beginning of a sequence of codewords.

(b) (Optional) A reference to a paper proving that C has infinite decoding delay will be supplied later. It is easy to see by example that the decoding delay cannot be finite. A simple example of a code that satisfies the Kraft inequality, but not the prefix condition, is a suffix code (see problem 11). The simplest non-trivial suffix code is one for three symbols, {0, 01, 11}. For such a code, consider decoding a string 011111 . . . 1110. If the number of ones is even, then the string must be parsed 0, 11, 11, . . . , 11, 0, whereas if the number of 1's is odd, the string must be parsed 01, 11, . . . , 11, 0. Thus the string cannot be decoded until the string of 1's has ended, and therefore the decoding delay could be infinite.

24. Optimal codes for uniform distributions. Consider a random variable with m equiprobable outcomes. The entropy of this information source is obviously log_2 m bits.

(a) Describe the optimal instantaneous binary code for this source and compute the average codeword length L_m.

(b) For what values of m does the average codeword length L_m equal the entropy H = log_2 m?

(c) We know that L < H + 1 for any probability distribution. The redundancy of a variable length code is defined to be ρ = L − H. For what value(s) of m, where 2^k ≤ m ≤ 2^{k+1}, is the redundancy of the code maximized? What is the limiting value of this worst-case redundancy as m → ∞?

Solution: Optimal codes for uniform distributions.

(a) For uniformly probable codewords, there exists an optimal binary variable length prefix code such that the longest and shortest codewords differ by at most one bit. If two codeword lengths differ by 2 bits or more, call m_s the message with the shorter codeword C_s and m_ℓ the message with the longer codeword C_ℓ. Change the codewords for these two messages so that the new codeword C'_s is the old C_s with a zero appended (C'_s = C_s 0) and C'_ℓ is the old C_s with a one appended (C'_ℓ = C_s 1). C'_s and C'_ℓ are legitimate codewords since no other codeword contained C_s as a prefix (by definition of a prefix code), so obviously no other codeword could contain C'_s or C'_ℓ as a prefix. The length of the codeword for m_s increases by 1 and the length of the codeword for m_ℓ decreases by at least 1. Since these messages are equally likely, L' ≤ L. By this method we can transform any optimal code into a code in which the lengths of the shortest and longest codewords differ by at most one bit. (In fact, it is easy to see that every optimal code has this property.)

For a source with n messages, ℓ(m_s) = ⌊log_2 n⌋ and ℓ(m_ℓ) = ⌈log_2 n⌉. Let d be the difference between n and the next smaller power of 2:

d = n − 2^{⌊log_2 n⌋}.

Then the optimal code has 2d codewords of length ⌈log_2 n⌉ and n − 2d codewords of length ⌊log_2 n⌋. This gives

L = (1/n)(2d ⌈log_2 n⌉ + (n − 2d) ⌊log_2 n⌋)
  = (1/n)(n ⌊log_2 n⌋ + 2d)
  = ⌊log_2 n⌋ + 2d/n.

Note that d = 0 is a special case in the above equation.

(b) The average codeword length equals the entropy if and only if n is a power of 2. To see this, consider the following calculation of L:

L = ∑_i p_i ℓ_i = − ∑_i p_i log_2 2^{−ℓ_i} = H + D(p‖q),

where q_i = 2^{−ℓ_i}. Therefore L = H only if p_i = q_i, that is, when all codewords have equal length, i.e., when d = 0.

(c) For n = 2^m + d, the redundancy r = L − H is given by

r = L − log_2 n
  = ⌊log_2 n⌋ + 2d/n − log_2 n
  = m + 2d/(2^m + d) − ln(2^m + d)/ln 2.

Therefore

∂r/∂d = [(2^m + d)(2) − 2d] / (2^m + d)^2 − (1/ln 2) · 1/(2^m + d).

Setting this equal to zero implies d* = 2^m(2 ln 2 − 1). Since there is only one maximum, and since the function is convex ∩, the maximizing d is one of the two integers nearest (0.3862)(2^m). The corresponding maximum redundancy is

r* ≈ m + 2d*/(2^m + d*) − ln(2^m + d*)/ln 2
   = m + 2(0.3862)(2^m)/(2^m + (0.3862)(2^m)) − ln(2^m + (0.3862)2^m)/ln 2
   = 0.0861.

This is achieved with arbitrary accuracy as n → ∞. (The quantity σ = 0.0861 is one of the lesser fundamental constants of the universe. See Robert Gallager [3].)
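The limiting value 0.0861 can be checked numerically from the formula in part (a). The sketch below is ours; the function name is an arbitrary choice.

import math

def redundancy(n):
    # Redundancy L - H of the optimal code for n equally likely messages (part (a)).
    k = int(math.log2(n))
    d = n - 2 ** k
    L = k + 2 * d / n
    return L - math.log2(n)

# Worst case over each dyadic range [2^m, 2^{m+1}); approaches 0.0861 as m grows.
for m in (4, 8, 12, 16):
    print(m, max(redundancy(n) for n in range(2 ** m, 2 ** (m + 1))))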

25. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code are complicated functions of the message probabilities {p_1, p_2, . . . , p_m}, it can be said that less probable symbols are encoded into longer codewords. Suppose that the message probabilities are given in decreasing order p_1 > p_2 ≥ · · · ≥ p_m.

(a) Prove that for any binary Huffman code, if the most probable message symbol has probability p_1 > 2/5, then that symbol must be assigned a codeword of length 1.

(b) Prove that for any binary Huffman code, if the most probable message symbol has probability p_1 < 1/3, then that symbol must be assigned a codeword of length ≥ 2.

Solution: Optimal codeword lengths. Let {c_1, c_2, . . . , c_m} be codewords of respective lengths {ℓ_1, ℓ_2, . . . , ℓ_m} corresponding to probabilities {p_1, p_2, . . . , p_m}.

(a) We prove that if p_1 > p_2 and p_1 > 2/5, then ℓ_1 = 1. Suppose, for the sake of contradiction, that ℓ_1 ≥ 2. Then there are no codewords of length 1; otherwise c_1 would not be the shortest codeword. Without loss of generality, we can assume that c_1 begins with 00. For x, y ∈ {0, 1} let C_{xy} denote the set of codewords beginning with xy. Then the sets C_{01}, C_{10}, and C_{11} have total probability 1 − p_1 < 3/5, so some two of these sets (without loss of generality, C_{10} and C_{11}) have total probability less than 2/5. We can now obtain a better code by interchanging the subtree of the decoding tree beginning with 1 with the subtree beginning with 00; that is, we replace codewords of the form 1x . . . by 00x . . . and codewords of the form 00y . . . by 1y . . . . This improvement contradicts the assumption that ℓ_1 ≥ 2, and so ℓ_1 = 1. (Note that p_1 > p_2 was a hidden assumption for this problem; otherwise, for example, the probabilities {.49, .49, .02} have the optimal code {00, 1, 01}.)

(b) The argument is similar to that of part (a). Suppose, for the sake of contradiction, that ℓ_1 = 1. Without loss of generality, assume that c_1 = 0. The total probability of C_{10} and C_{11} is 1 − p_1 > 2/3, so at least one of these two sets (without loss of generality, C_{10}) has probability greater than 1/3. We can now obtain a better code by interchanging the subtree of the decoding tree beginning with 0 with the subtree beginning with 10; that is, we replace codewords of the form 10x . . . by 0x . . . and we let c_1 = 10. This improvement contradicts the assumption that ℓ_1 = 1, and so ℓ_1 ≥ 2.

26. Merges. Companies with values W_1, W_2, . . . , W_m are merged as follows. The two least valuable companies are merged, thus forming a list of m − 1 companies. The value of the merge is the sum of the values of the two merged companies. This continues until one supercompany remains. Let V equal the sum of the values of the merges. Thus V represents the total reported dollar volume of the merges. For example, if W = (3, 3, 2, 2), the merges yield (3, 3, 2, 2) → (4, 3, 3) → (6, 4) → (10), and V = 4 + 6 + 10 = 20.

(a) Argue that V is the minimum volume achievable by sequences of pair-wise merges terminating in one supercompany. (Hint: Compare to Huffman coding.)

(b) Let W = ∑ W_i, W̄_i = W_i/W, and show that the minimum merge volume V satisfies

W H(W̄) ≤ V ≤ W H(W̄) + W.    (5.39)

Solution: Merges.

(a) We first normalize the values of the companies to add to one. The total volume of the merges is equal to the sum of the value of each company times the number of times it takes part in a merge. This is identical to the average length of a Huffman code, with a tree which corresponds to the merges. Since Huffman coding minimizes average length, this scheme of merges minimizes total merge volume.

(b) Just as in the case of Huffman coding, where

H ≤ EL < H + 1,    (5.40)

we have in this case, for the corresponding merge scheme,

W H(W̄) ≤ V ≤ W H(W̄) + W.    (5.41)
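The example W = (3, 3, 2, 2) and the bound (5.41) can be verified with a small heap-based sketch (ours, not from the text):

import heapq, math

def merge_volume(values):
    # Total reported volume when the two least valuable companies are merged repeatedly.
    heap = list(values)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b                     # the value of this merge
        heapq.heappush(heap, a + b)
    return total

W = [3, 3, 2, 2]
V = merge_volume(W)                        # 20, as in the example above
Wsum = sum(W)
H = -sum(w / Wsum * math.log2(w / Wsum) for w in W)
print(Wsum * H, V, Wsum * H + Wsum)        # W H(W) <= V <= W H(W) + W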

27. The Sardinas-Patterson test for unique decodability. A code is not uniquely decodable if and only if there exists a finite sequence of code symbols which can be resolved in two different ways into sequences of codewords. That is, a situation such as

| A1 |  A2  | A3 . . .  Am |
| B1 |  B2  | B3 . . .  Bn |

must occur where each Ai and each Bi is a codeword. Note that B1 must be a prefix of A1 with some resulting “dangling suffix.” Each dangling suffix must in turn be either a prefix of a codeword or have another codeword as its prefix, resulting in another dangling suffix. Finally, the last dangling suffix in the sequence must also be a codeword. Thus one can set up a test for unique decodability (which is essentially the Sardinas-Patterson test [5]) in the following way: Construct a set S of all possible dangling suffixes. The code is uniquely decodable if and only if S contains no codeword.


(a) State the precise rules for building the set S .

(b) Suppose the codeword lengths are l_i, i = 1, 2, . . . , m. Find a good upper bound on the number of elements in the set S.

(c) Determine which of the following codes is uniquely decodable:

i. {0, 10, 11} .

ii. {0, 01, 11} .

iii. {0, 01, 10} .

iv. {0, 01} .

v. {00, 01, 10, 11} .

vi. {110, 11, 10} .

vii. {110, 11, 100, 00, 10} .

(d) For each uniquely decodable code in part (c), construct, if possible, an infinite encoded sequence with a known starting point, such that it can be resolved into codewords in two different ways. (This illustrates that unique decodability does not imply finite decodability.) Prove that such a sequence cannot arise in a prefix code.

Solution: Test for unique decodability.

The proof of the Sardinas-Patterson test has two parts. In the first part, we will show that if there is a code string that has two different interpretations, then the code will fail the test. The simplest case is when the concatenation of two codewords yields another codeword. In this case, S_2 will contain a codeword, and hence the test will fail.

In general, the code is not uniquely decodable iff there exists a string that admits two different parsings into codewords, e.g.,

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 = x_1 x_2, x_3 x_4 x_5, x_6 x_7 x_8 = x_1 x_2 x_3 x_4, x_5 x_6 x_7 x_8.    (5.42)

In this case, S_2 will contain the string x_3 x_4, S_3 will contain x_5, and S_4 will contain x_6 x_7 x_8, which is a codeword. It is easy to see that this procedure will work for any string that has two different parsings into codewords; a formal proof is slightly more difficult and uses induction.

In the second part, we will show that if there is a codeword in one of the sets S_i, i ≥ 2, then there exists a string with two different possible interpretations, thus showing that the code is not uniquely decodable. To do this, we essentially reverse the construction of the sets. We will not go into the details; the reader is referred to the original paper.

(a) Let S_1 be the original set of codewords. We construct S_{i+1} from S_i as follows: a string y is in S_{i+1} iff there is a codeword x in S_1 such that xy is in S_i, or if there exists a z ∈ S_i such that zy is in S_1 (i.e., is a codeword). Then the code is uniquely decodable iff none of the S_i, i ≥ 2, contains a codeword. Thus the set S = ∪_{i≥2} S_i.


(b) A simple upper bound can be obtained from the fact that all strings in the sets S_i have length less than l_max, and therefore the maximum number of elements in S is less than 2^{l_max}.

(c) i. {0, 10, 11}. This code is instantaneous and hence uniquely decodable.

ii. {0, 01, 11}. This code is a suffix code (see problem 11). It is therefore uniquely decodable. The sets in the Sardinas-Patterson test are S_1 = {0, 01, 11}, S_2 = {1} = S_3 = S_4 = . . . .

iii. {0, 01, 10}. This code is not uniquely decodable. The sets in the test are S_1 = {0, 01, 10}, S_2 = {1}, S_3 = {0}, . . . . Since 0 is a codeword, this code fails the test. It is easy to see otherwise that the code is not UD: the string 010 has two valid parsings.

iv. {0, 01}. This code is a suffix code and is therefore UD. The test produces the sets S_1 = {0, 01}, S_2 = {1}, S_3 = ∅.

v. {00, 01, 10, 11}. This code is instantaneous and therefore UD.

vi. {110, 11, 10}. This code is uniquely decodable, by the Sardinas-Patterson test, since S_1 = {110, 11, 10}, S_2 = {0}, S_3 = ∅.

vii. {110, 11, 100, 00, 10}. This code is UD, because by the Sardinas-Patterson test, S_1 = {110, 11, 100, 00, 10}, S_2 = {0}, S_3 = {0}, etc.

(d) We can produce infinite strings which can be decoded in two ways only for examples where the Sardinas-Patterson test produces a repeating set. For example, in part (ii), the string 011111 . . . could be parsed either as 0, 11, 11, . . . or as 01, 11, 11, . . . . Similarly for (vii), the string 10000 . . . could be parsed as 100, 00, 00, . . . or as 10, 00, 00, . . . . For the instantaneous codes, it is not possible to construct such a string, since we can decode as soon as we see a codeword string, and there is no way that we would need to wait to decode.
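The rules stated in part (a) translate directly into code. The following sketch (ours; the termination rule of stopping once no new dangling suffixes appear is our choice) reproduces the verdicts above.

def is_uniquely_decodable(codewords):
    # Sardinas-Patterson test: the code is UD iff no dangling suffix is a codeword.
    C = set(codewords)

    def dangling(S):
        # Suffixes left over when an element of S and a codeword start the same way.
        out = set()
        for s in S:
            for c in C:
                if s != c:
                    if s.startswith(c):
                        out.add(s[len(c):])
                    elif c.startswith(s):
                        out.add(c[len(s):])
        return out

    S = dangling(C)            # this is S_2
    seen = set()
    while S:
        if S & C:              # some dangling suffix equals a codeword
            return False
        if S <= seen:          # no new suffixes can ever appear
            return True
        seen |= S
        S = dangling(S)
    return True

for code in (["0","10","11"], ["0","01","11"], ["0","01","10"],
             ["0","01"], ["110","11","10"], ["110","11","100","00","10"]):
    print(code, is_uniquely_decodable(code))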

28. Shannon code. Consider the following method for generating a code for a random variable X which takes on m values {1, 2, . . . , m} with probabilities p_1, p_2, . . . , p_m. Assume that the probabilities are ordered so that p_1 ≥ p_2 ≥ · · · ≥ p_m. Define

F_i = ∑_{k=1}^{i−1} p_k,    (5.43)

the sum of the probabilities of all symbols less than i. Then the codeword for i is the number F_i ∈ [0, 1] rounded off to l_i bits, where l_i = ⌈log(1/p_i)⌉.

(a) Show that the code constructed by this process is prefix-free and the average length satisfies

H(X) ≤ L < H(X) + 1.    (5.44)

(b) Construct the code for the probability distribution (0.5, 0.25, 0.125, 0.125) .

Solution: Shannon code.


(a) Since l_i = ⌈log(1/p_i)⌉, we have

log(1/p_i) ≤ l_i < log(1/p_i) + 1,    (5.45)

which implies that

H(X) ≤ L = ∑ p_i l_i < H(X) + 1.    (5.46)

The difficult part is to prove that the code is a prefix code. By the choice of l_i, we have

2^{−l_i} ≤ p_i < 2^{−(l_i−1)}.    (5.47)

Thus F_j, j > i, differs from F_i by at least 2^{−l_i}, and will therefore differ from F_i in at least one place in the first l_i bits of the binary expansion of F_i. Thus the codeword for F_j, j > i, which has length l_j ≥ l_i, differs from the codeword for F_i at least once in the first l_i places. Thus no codeword is a prefix of any other codeword.

(b) We build the following table:

Symbol  Probability  F_i in decimal  F_i in binary  l_i  Codeword
1       0.5          0.0             0.0            1    0
2       0.25         0.5             0.10           2    10
3       0.125        0.75            0.110          3    110
4       0.125        0.875           0.111          3    111

The Shannon code in this case achieves the entropy bound (1.75 bits) and is optimal.
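The construction in part (b) can be automated. The sketch below is ours (the function name and the bit-expansion loop are our choices); it reproduces the table for the dyadic example above.

import math

def shannon_code(probs):
    # Codeword for symbol i: F_i truncated to ceil(log 1/p_i) bits.
    probs = sorted(probs, reverse=True)
    F, words = 0.0, []
    for p in probs:
        l = math.ceil(-math.log2(p))
        bits, f = "", F                  # binary expansion of F to l bits
        for _ in range(l):
            f *= 2
            bits += "1" if f >= 1 else "0"
            f -= int(f)
        words.append(bits)
        F += p
    return words

print(shannon_code([0.5, 0.25, 0.125, 0.125]))   # ['0', '10', '110', '111']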

29. Optimal codes for dyadic distributions. For a Huffman code tree, define the probability of a node as the sum of the probabilities of all the leaves under that node. Let the random variable X be drawn from a dyadic distribution, i.e., p(x) = 2^{−i}, for some i, for all x ∈ X. Now consider a binary Huffman code for this distribution.

(a) Argue that for any node in the tree, the probability of the left child is equal to the probability of the right child.

(b) Let X_1, X_2, . . . , X_n be drawn i.i.d. ∼ p(x). Using the Huffman code for p(x), we map X_1, X_2, . . . , X_n to a sequence of bits Y_1, Y_2, . . . , Y_{k(X_1,X_2,...,X_n)}. (The length of this sequence will depend on the outcome X_1, X_2, . . . , X_n.) Use part (a) to argue that the sequence Y_1, Y_2, . . . forms a sequence of fair coin flips, i.e., that Pr{Y_i = 0} = Pr{Y_i = 1} = 1/2, independent of Y_1, Y_2, . . . , Y_{i−1}.

Thus the entropy rate of the coded sequence is 1 bit per symbol.

(c) Give a heuristic argument why the encoded sequence of bits for any code that achieves the entropy bound cannot be compressible and therefore should have an entropy rate of 1 bit per symbol.

Solution: Optimal codes for dyadic distributions.


(a) For a dyadic distribution, the Huffman code achieves the entropy bound. The code tree constructed by the Huffman algorithm is a complete tree with leaves at depth l_i with probability p_i = 2^{−l_i}.

For such a complete binary tree, we can prove the following properties:

• The probability of any internal node at depth k is 2^{−k}. We can prove this by induction. Clearly, it is true for a tree with 2 leaves. Assume that it is true for all trees with n leaves. For any tree with n + 1 leaves, at least two of the leaves have to be siblings on the tree (else the tree would not be complete). Let the level of these siblings be j. The parent of these two siblings (at level j − 1) has probability 2^{−j} + 2^{−j} = 2^{−(j−1)}. We can now replace the two siblings with their parent, without changing the probability of any other internal node. But now we have a tree with n leaves which satisfies the required property. Thus, by induction, the property is true for all complete binary trees.

• From the above property, it follows immediately that the probability of the left child is equal to the probability of the right child.

(b) For a sequence X_1, X_2, we can construct a code tree by first constructing the optimal tree for X_1, and then attaching the optimal tree for X_2 to each leaf of the optimal tree for X_1. Proceeding this way, we can construct the code tree for X_1, X_2, . . . , X_n. When the X_i are drawn i.i.d. according to a dyadic distribution, it is easy to see that the code tree constructed will also be a complete binary tree with the properties in part (a). Thus the probability of the first bit being 1 is 1/2, and at any internal node, the probability of the next bit produced by the code being 1 is equal to the probability of the next bit being 0. Thus the bits produced by the code are i.i.d. Bernoulli(1/2), and the entropy rate of the coded sequence is 1 bit per symbol.

(c) Assume that we have a coded sequence of bits from a code that met the entropy bound with equality. If the coded sequence were compressible, then we could use the compressed version of the coded sequence as our code, and achieve an average length less than the entropy bound, which would contradict the bound. Thus the coded sequence cannot be compressible, and thus must have an entropy rate of 1 bit/symbol.

30. Relative entropy is cost of miscoding: Let the random variable X have five possible outcomes {1, 2, 3, 4, 5}. Consider two distributions p(x) and q(x) on this random variable.

Symbol  p(x)   q(x)   C1(x)  C2(x)
1       1/2    1/2    0      0
2       1/4    1/8    10     100
3       1/8    1/8    110    101
4       1/16   1/8    1110   110
5       1/16   1/8    1111   111

(a) Calculate H(p) , H(q) , D(p||q) and D(q||p) .


(b) The last two columns above represent codes for the random variable. Verify that the average length of C1 under p is equal to the entropy H(p). Thus C1 is optimal for p. Verify that C2 is optimal for q.

(c) Now assume that we use code C2 when the distribution is p. What is the average length of the codewords? By how much does it exceed the entropy H(p)?

(d) What is the loss if we use code C1 when the distribution is q?

Solution: Cost of miscoding

(a) H(p) = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/16) log 16 + (1/16) log 16 = 1.875 bits.

H(q) = (1/2) log 2 + (1/8) log 8 + (1/8) log 8 + (1/8) log 8 + (1/8) log 8 = 2 bits.

D(p||q) = (1/2) log((1/2)/(1/2)) + (1/4) log((1/4)/(1/8)) + (1/8) log((1/8)/(1/8)) + (1/16) log((1/16)/(1/8)) + (1/16) log((1/16)/(1/8)) = 0.125 bits.

D(q||p) = (1/2) log((1/2)/(1/2)) + (1/8) log((1/8)/(1/4)) + (1/8) log((1/8)/(1/8)) + (1/8) log((1/8)/(1/16)) + (1/8) log((1/8)/(1/16)) = 0.125 bits.

(b) The average length of C1 for p(x) is 1.875 bits, which is the entropy of p. Thus C1 is an efficient code for p(x). Similarly, the average length of code C2 under q(x) is 2 bits, which is the entropy of q. Thus C2 is an efficient code for q.

(c) If we use code C2 for p(x), then the average length is (1/2)·1 + (1/4)·3 + (1/8)·3 + (1/16)·3 + (1/16)·3 = 2 bits. It exceeds the entropy H(p) by 0.125 bits, which is the same as D(p||q).

(d) Similarly, using code C1 for q has an average length of 2.125 bits, which exceeds the entropy of q by 0.125 bits, which is D(q||p).
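All of the numbers in parts (a)-(d) can be checked with a few lines of arithmetic; the sketch below is ours.

import math

p = [1/2, 1/4, 1/8, 1/16, 1/16]
q = [1/2, 1/8, 1/8, 1/8, 1/8]
l1 = [1, 2, 3, 4, 4]          # lengths of C1
l2 = [1, 3, 3, 3, 3]          # lengths of C2

H = lambda r: -sum(x * math.log2(x) for x in r)
D = lambda a, b: sum(x * math.log2(x / y) for x, y in zip(a, b))

print(H(p), H(q))                              # 1.875, 2.0
print(D(p, q), D(q, p))                        # 0.125, 0.125
print(sum(x * l for x, l in zip(p, l2)))       # 2.0   = H(p) + D(p||q)
print(sum(x * l for x, l in zip(q, l1)))       # 2.125 = H(q) + D(q||p)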

31. Non-singular codes: The discussion in the text focused on instantaneous codes, with extensions to uniquely decodable codes. Both these are required in cases when the code is to be used repeatedly to encode a sequence of outcomes of a random variable. But if we need to encode only one outcome and we know when we have reached the end of a codeword, we do not need unique decodability; only the fact that the code is non-singular would suffice. For example, if a random variable X takes on 3 values a, b and c, we could encode them by 0, 1, and 00. Such a code is non-singular but not uniquely decodable.

In the following, assume that we have a random variable X which takes on m values with probabilities p_1, p_2, . . . , p_m and that the probabilities are ordered so that p_1 ≥ p_2 ≥ . . . ≥ p_m.

(a) By viewing the non-singular binary code as a ternary code with three symbols, 0, 1 and “STOP”, show that the expected length of a non-singular code L_{1:1} for a random variable X satisfies the following inequality:

L_{1:1} ≥ H_2(X)/log_2 3 − 1,    (5.48)

where H_2(X) is the entropy of X in bits. Thus the average length of a non-singular code is at least a constant fraction of the average length of an instantaneous code.


(b) Let L*_INST be the expected length of the best instantaneous code and L*_{1:1} be the expected length of the best non-singular code for X. Argue that L*_{1:1} ≤ L*_INST ≤ H(X) + 1.

(c) Give a simple example where the average length of the non-singular code is less than the entropy.

(d) The set of codewords available for a non-singular code is {0, 1, 00, 01, 10, 11, 000, . . .}. Since L_{1:1} = ∑_{i=1}^m p_i l_i, show that this is minimized if we allot the shortest codewords to the most probable symbols.

Thus l_1 = l_2 = 1, l_3 = l_4 = l_5 = l_6 = 2, etc. Show that in general l_i = ⌈log(i/2 + 1)⌉, and therefore L*_{1:1} = ∑_{i=1}^m p_i ⌈log(i/2 + 1)⌉.

(e) The previous part shows that it is easy to find the optimal non-singular code for a distribution. However, it is a little more tricky to deal with the average length of this code. We now bound this average length. It follows from the previous part that L*_{1:1} ≥ L̃ := ∑_{i=1}^m p_i log(i/2 + 1). Consider the difference

F(p) = H(X) − L̃ = − ∑_{i=1}^m p_i log p_i − ∑_{i=1}^m p_i log(i/2 + 1).    (5.49)

Prove by the method of Lagrange multipliers that the maximum of F(p) occurs when p_i = c/(i + 2), where c = 1/(H_{m+2} − H_2) and H_k is the sum of the harmonic series, i.e.,

H_k := ∑_{i=1}^k 1/i.    (5.50)

(This can also be done using the non-negativity of relative entropy.)

(f) Complete the arguments for

H(X) − L*_{1:1} ≤ H(X) − L̃    (5.51)
              ≤ log(2(H_{m+2} − H_2)).    (5.52)

Now it is well known (see, e.g., Knuth, “Art of Computer Programming”, Vol. 1) that H_k ≈ ln k (more precisely, H_k = ln k + γ + 1/(2k) − 1/(12k^2) + 1/(120k^4) − ε, where 0 < ε < 1/(252k^6), and γ = Euler’s constant = 0.577 . . .). Either using this, or the simple approximation H_k ≤ ln k + 1, which can be proved by integration of 1/x, it can be shown that H(X) − L*_{1:1} < log log m + 2. Thus we have

H(X) − log log |X| − 2 ≤ L*_{1:1} ≤ H(X) + 1.    (5.53)

A non-singular code cannot do much better than an instantaneous code!

Solution:

(a) In the text, it is proved that the average length of any prefix-free code in a D-ary alphabet is at least H_D(X), the D-ary entropy. Now if we start with any binary non-singular code and add the additional symbol “STOP” at the end, the new code is prefix-free in the alphabet of 0, 1, and “STOP” (since “STOP” occurs only at the end of codewords, and every codeword has a “STOP” symbol, the only way a codeword can be a prefix of another is if they were equal). Thus each codeword in the new alphabet is one symbol longer than the binary codeword, and the average length is 1 symbol longer.

Thus we have L_{1:1} + 1 ≥ H_3(X), or L_{1:1} ≥ H_2(X)/log 3 − 1 = 0.63 H(X) − 1.

(b) Since an instantaneous code is also a non-singular code, the best non-singular code is at least as good as the best instantaneous code. Since the best instantaneous code has average length ≤ H(X) + 1, we have L*_{1:1} ≤ L*_INST ≤ H(X) + 1.

(c) For a 2-symbol alphabet, the best non-singular code and the best instantaneous code are the same. So the simplest example where they differ is when |X| = 3. In this case, the simplest (and, it turns out, optimal) non-singular code has the three codewords 0, 1, 00. Assume that each of the symbols is equally likely. Then H(X) = log 3 = 1.58 bits, whereas the average length of the non-singular code is (1/3)·1 + (1/3)·1 + (1/3)·2 = 4/3 = 1.33 < H(X). Thus a non-singular code can do better than the entropy.

(d) For a given set of codeword lengths, the fact that allotting the shortest codewords to the most probable symbols is optimal is proved in Lemma 5.8.1, part 1 of EIT.

This result is a general version of what is called the Hardy-Littlewood-Polya inequality, which says that if a < b and c < d, then ad + bc < ac + bd. The general version of the Hardy-Littlewood-Polya inequality states that if we are given two sets of numbers A = {a_j} and B = {b_j}, each of size m, and we let a_[i] be the i-th largest element of A and b_[i] be the i-th largest element of B, then

∑_{i=1}^m a_[i] b_[m+1−i] ≤ ∑_{i=1}^m a_i b_i ≤ ∑_{i=1}^m a_[i] b_[i].    (5.54)

An intuitive explanation of this inequality is that we can consider the a_i's to be the positions of hooks along a rod, and the b_i's to be weights to be attached to the hooks. To maximize the moment about one end, we should attach the largest weights to the furthest hooks.

The set of available codewords is the set of all possible sequences. Since the only restriction is that the code be non-singular, each source symbol could be allotted to any codeword in the set {0, 1, 00, . . .}.

Thus we should allot the codewords 0 and 1 to the two most probable source symbols, i.e., to probabilities p_1 and p_2. Thus l_1 = l_2 = 1. Similarly, l_3 = l_4 = l_5 = l_6 = 2 (corresponding to the codewords 00, 01, 10 and 11). The next 8 symbols will use codewords of length 3, etc.

We will now find the general form for l_i. We can prove it by induction, but we will derive the result from first principles. Let c_k = ∑_{j=1}^{k−1} 2^j. Then, by the arguments of the previous paragraph, all source symbols of index c_k + 1, c_k + 2, . . . , c_k + 2^k = c_{k+1} use codewords of length k. Now, using the formula for the sum of a geometric series, it is easy to see that

c_k = ∑_{j=1}^{k−1} 2^j = 2 ∑_{j=0}^{k−2} 2^j = 2 (2^{k−1} − 1)/(2 − 1) = 2^k − 2.    (5.55)

Thus all source symbols with index i, where 2^k − 1 ≤ i ≤ 2^k − 2 + 2^k = 2^{k+1} − 2, use codewords of length k. This corresponds to 2^k < i + 2 ≤ 2^{k+1}, or k < log(i + 2) ≤ k + 1, or k − 1 < log((i + 2)/2) ≤ k. Thus the length of the codeword for the i-th symbol is k = ⌈log((i + 2)/2)⌉. Thus the best non-singular code assigns codeword length l*_i = ⌈log(i/2 + 1)⌉ to symbol i, and therefore L*_{1:1} = ∑_{i=1}^m p_i ⌈log(i/2 + 1)⌉.

(e) Since ⌈log(i/2 + 1)⌉ ≥ log(i/2 + 1), it follows that

L*_{1:1} ≥ L̃ := ∑_{i=1}^m p_i log(i/2 + 1).

Consider the difference

F(p) = H(X) − L̃ = − ∑_{i=1}^m p_i log p_i − ∑_{i=1}^m p_i log(i/2 + 1).    (5.56)

We want to maximize this function over all probability distributions, and therefore we use the method of Lagrange multipliers with the constraint ∑ p_i = 1. Therefore let

J(p) = − ∑_{i=1}^m p_i log p_i − ∑_{i=1}^m p_i log(i/2 + 1) + λ (∑_{i=1}^m p_i − 1).    (5.57)

Then, differentiating with respect to p_i and setting to 0, we get

∂J/∂p_i = −1 − log p_i − log(i/2 + 1) + λ = 0,    (5.58)
log p_i = λ − 1 − log((i + 2)/2),    (5.59)
p_i = 2^{λ−1} · 2/(i + 2).    (5.60)

Now substituting this in the constraint that ∑ p_i = 1, we get

2^λ ∑_{i=1}^m 1/(i + 2) = 1,    (5.61)

or 2^λ = 1/(∑_i 1/(i + 2)). Now using the definition H_k = ∑_{j=1}^k 1/j, it is obvious that

∑_{i=1}^m 1/(i + 2) = ∑_{i=1}^{m+2} 1/i − 1 − 1/2 = H_{m+2} − H_2.    (5.62)

Thus 2^λ = 1/(H_{m+2} − H_2), and

p_i = (1/(H_{m+2} − H_2)) · 1/(i + 2).    (5.63)

Substituting this value of p_i in the expression for F(p), we obtain

F(p) = − ∑_{i=1}^m p_i log p_i − ∑_{i=1}^m p_i log(i/2 + 1)    (5.64)
     = − ∑_{i=1}^m p_i log( p_i (i + 2)/2 )    (5.65)
     = − ∑_{i=1}^m p_i log( 1/(2(H_{m+2} − H_2)) )    (5.66)
     = log 2(H_{m+2} − H_2).    (5.67)

Thus the extremal value of F(p) is log 2(H_{m+2} − H_2). We have not shown that it is a maximum; that can be shown by taking the second derivative. But as usual, it is easier to see it using relative entropy. Looking at the expressions above, we can see that if we define q_i = (1/(H_{m+2} − H_2)) · 1/(i + 2), then q_i is a probability distribution (i.e., q_i ≥ 0, ∑ q_i = 1). Also, (i + 2)/2 = 1/(2(H_{m+2} − H_2)) · 1/q_i, and substituting this in the expression for F(p), we obtain

F(p) = − ∑_{i=1}^m p_i log p_i − ∑_{i=1}^m p_i log(i/2 + 1)    (5.68)
     = − ∑_{i=1}^m p_i log( p_i (i + 2)/2 )    (5.69)
     = − ∑_{i=1}^m p_i log( p_i · 1/(2(H_{m+2} − H_2)) · 1/q_i )    (5.70)
     = − ∑_{i=1}^m p_i log(p_i/q_i) − ∑_{i=1}^m p_i log( 1/(2(H_{m+2} − H_2)) )    (5.71)
     = log 2(H_{m+2} − H_2) − D(p||q)    (5.72)
     ≤ log 2(H_{m+2} − H_2),    (5.73)

with equality iff p = q. Thus the maximum value of F(p) is log 2(H_{m+2} − H_2).

(f)

H(X) − L*_{1:1} ≤ H(X) − L̃    (5.74)
              ≤ log 2(H_{m+2} − H_2).    (5.75)

The first inequality follows from the definition of L̃ and the second from the result of the previous part.

To complete the proof, we will use the simple inequality H_k ≤ ln k + 1, which can be shown by integrating 1/x between 1 and k. Thus H_{m+2} ≤ ln(m + 2) + 1, and 2(H_{m+2} − H_2) = 2(H_{m+2} − 1 − 1/2) ≤ 2(ln(m + 2) + 1 − 1 − 1/2) ≤ 2 ln(m + 2) = 2 log(m + 2)/log e ≤ 2 log(m + 2) ≤ 2 log m^2 = 4 log m, where the last inequality is true for m ≥ 2. Therefore

H(X) − L*_{1:1} ≤ log 2(H_{m+2} − H_2) ≤ log(4 log m) = log log m + 2.    (5.76)

We therefore have the following bounds on the average length of a non-singular code:

H(X) − log log |X| − 2 ≤ L*_{1:1} ≤ H(X) + 1.    (5.77)

A non-singular code cannot do much better than an instantaneous code!

32. Bad wine. One is given 6 bottles of wine. It is known that precisely one bottle has gone bad (tastes terrible). From inspection of the bottles it is determined that the probability p_i that the i-th bottle is bad is given by (p_1, p_2, . . . , p_6) = (8/23, 6/23, 4/23, 2/23, 2/23, 1/23). Tasting will determine the bad wine.

Suppose you taste the wines one at a time. Choose the order of tasting to minimize the expected number of tastings required to determine the bad bottle. Remember, if the first 5 wines pass the test you don't have to taste the last.

(a) What is the expected number of tastings required?

(b) Which bottle should be tasted first?

Now you get smart. For the first sample, you mix some of the wines in a fresh glass and sample the mixture. You proceed, mixing and tasting, stopping when the bad bottle has been determined.

(c) What is the minimum expected number of tastings required to determine the bad wine?

(d) What mixture should be tasted first?

Solution: Bad Wine

(a) If we taste one bottle at a time, to minimize the expected number of tastings the order of tasting should be from the most likely wine to be bad to the least. The expected number of tastings required is

∑_{i=1}^6 p_i l_i = 1 × (8/23) + 2 × (6/23) + 3 × (4/23) + 4 × (2/23) + 5 × (2/23) + 5 × (1/23) = 55/23 = 2.39.

(b) The first bottle to be tasted should be the one with probability 8/23.

(c) The idea is to use Huffman coding. With Huffman coding, we get codeword lengths of (2, 2, 2, 3, 4, 4). The expected number of tastings required is

∑_{i=1}^6 p_i l_i = 2 × (8/23) + 2 × (6/23) + 2 × (4/23) + 3 × (2/23) + 4 × (2/23) + 4 × (1/23) = 54/23 = 2.35.


(d) The mixture of the first and second bottles should be tasted first.

33. Huffman vs. Shannon. A random variable X takes on three values with probabilities 0.6, 0.3, and 0.1.

(a) What are the lengths of the binary Huffman codewords for X? What are the lengths of the binary Shannon codewords (l(x) = ⌈log(1/p(x))⌉) for X?

(b) What is the smallest integer D such that the expected Shannon codeword length with a D-ary alphabet equals the expected Huffman codeword length with a D-ary alphabet?

Solution: Huffman vs. Shannon

(a) It is obvious that a Huffman code for the distribution (0.6, 0.3, 0.1) is (1, 01, 00), with codeword lengths (1, 2, 2). The Shannon code would use lengths ⌈log(1/p)⌉, which gives lengths (1, 2, 4) for the three symbols.

(b) For any D > 2, the Huffman codewords for the three symbols are all one character long. The Shannon codeword length ⌈log_D(1/p)⌉ would be equal to 1 for all symbols if log_D(1/0.1) ≤ 1, i.e., if D ≥ 10. Hence the smallest such D is 10, and for D ≥ 10 the Shannon code is also optimal.

34. Huffman algorithm for tree construction. Consider the following problem: m binary signals S_1, S_2, . . . , S_m are available at times T_1 ≤ T_2 ≤ . . . ≤ T_m, and we would like to find their sum S_1 ⊕ S_2 ⊕ · · · ⊕ S_m using 2-input gates, each gate with 1 time unit delay, so that the final result is available as quickly as possible. A simple greedy algorithm is to combine the earliest two results, forming the partial result at time max(T_1, T_2) + 1. We now have a new problem with S_1 ⊕ S_2, S_3, . . . , S_m, available at times max(T_1, T_2) + 1, T_3, . . . , T_m. We can now sort this list of T's, and apply the same merging step again, repeating this until we have the final result.

(a) Argue that the above procedure is optimal, in that it constructs a circuit for which the final result is available as quickly as possible.

(b) Show that this procedure finds the tree that minimizes

C(T) = max_i (T_i + l_i),    (5.78)

where T_i is the time at which the result allotted to the i-th leaf is available, and l_i is the length of the path from the i-th leaf to the root.

(c) Show that

C(T) ≥ log_2 (∑_i 2^{T_i})    (5.79)

for any tree T.

(d) Show that there exists a tree such that

C(T) ≤ log_2 (∑_i 2^{T_i}) + 1.    (5.80)

Thus log_2 (∑_i 2^{T_i}) is the analog of entropy for this problem.

Solution:

Tree construction:

(a) The proof is identical to the proof of optimality of Huffman coding. We first show that for the optimal tree, if T_i < T_j, then l_i ≥ l_j. The proof of this is, as in the case of Huffman coding, by contradiction. Assume otherwise, i.e., that T_i < T_j and l_i < l_j; then by exchanging the inputs, we obtain a tree with a lower total cost, since

max{T_i + l_i, T_j + l_j} ≥ max{T_i + l_j, T_j + l_i}.    (5.81)

Thus the longest branches are associated with the earliest times.

The rest of the proof is identical to the Huffman proof. We show that the longest branches correspond to the two earliest times, and that they can be taken as siblings (inputs to the same gate). Then we can reduce the problem to constructing the optimal tree for a smaller problem. By induction, we extend the optimality to the larger problem, proving the optimality of the above algorithm.

Given any tree of gates, the earliest that the output corresponding to a particular signal would be available is T_i + l_i, since the signal undergoes l_i gate delays. Thus max_i(T_i + l_i) is a lower bound on the time at which the final answer is available.

The fact that the tree achieves this bound can be shown by induction. For any internal node of the tree, the output is available at time equal to the maximum of the input times plus 1. Thus for the gate connected to the inputs T_i and T_j, the output is available at time max(T_i, T_j) + 1. For any node, the output is available at time equal to the maximum of the times at the leaves plus the gate delays to get from the leaf to the node. This result extends to the complete tree, and for the root, the time at which the final result is available is max_i(T_i + l_i). The above algorithm minimizes this cost.

(b) Let c_1 = ∑_i 2^{T_i} and c_2 = ∑_i 2^{−l_i}. By the Kraft inequality, c_2 ≤ 1. Now let p_i = 2^{T_i}/∑_j 2^{T_j}, and let r_i = 2^{−l_i}/∑_j 2^{−l_j}. Clearly, p_i and r_i are probability mass functions. Also, we have T_i = log(p_i c_1) and l_i = − log(r_i c_2). Then

C(T) = max_i (T_i + l_i)    (5.82)
     = max_i (log(p_i c_1) − log(r_i c_2))    (5.83)
     = log c_1 − log c_2 + max_i log(p_i/r_i).    (5.84)

Now the maximum of any random variable is at least its average under any distribution, and therefore

C(T) ≥ log c_1 − log c_2 + ∑_i p_i log(p_i/r_i)    (5.85)
     ≥ log c_1 − log c_2 + D(p||r).    (5.86)

Since − log c_2 ≥ 0 and D(p||r) ≥ 0, we have

C(T) ≥ log c_1,    (5.87)

which is the desired result.

(c) From the previous part, we achieve the lower bound if p_i = r_i and c_2 = 1. However, since the l_i's are constrained to be integers, we cannot achieve equality in all cases.

Instead, if we let

l_i = ⌈log(1/p_i)⌉ = ⌈log( ∑_j 2^{T_j} / 2^{T_i} )⌉,    (5.88)

it is easy to verify that ∑ 2^{−l_i} ≤ ∑ p_i = 1, and thus we can construct a tree that achieves

T_i + l_i ≤ log(∑_j 2^{T_j}) + 1    (5.89)

for all i. Thus this tree achieves within 1 unit of the lower bound.

Clearly, log(∑_j 2^{T_j}) is the equivalent of entropy for this problem!
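The greedy gate-merging procedure and the bounds (5.79)-(5.80) are easy to exercise numerically. The sketch below is ours; the arrival times are a hypothetical example.

import heapq, math

def earliest_completion(times):
    # Greedy circuit: repeatedly feed the two earliest available signals into a gate.
    heap = list(times)
    heapq.heapify(heap)
    while len(heap) > 1:
        t1, t2 = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, max(t1, t2) + 1)   # 2-input gate with unit delay
    return heap[0]

T = [0, 1, 1, 3, 4]                 # hypothetical arrival times
C = earliest_completion(T)
lower = math.log2(sum(2 ** t for t in T))
print(lower, C, lower + 1)          # log2(sum 2^Ti) <= C(T) <= log2(sum 2^Ti) + 1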

35. Generating random variables. One wishes to generate a random variable X with

X = 1 with probability p, and X = 0 with probability 1 − p.    (5.90)

You are given fair coin flips Z_1, Z_2, . . . . Let N be the (random) number of flips needed to generate X. Find a good way to use Z_1, Z_2, . . . to generate X. Show that EN ≤ 2.

Solution: We expand p = 0.p_1 p_2 . . . as a binary number. Let U = 0.Z_1 Z_2 . . . , the sequence Z treated as a binary number. It is well known that U is uniformly distributed on [0, 1). Thus, we generate X = 1 if U < p and 0 otherwise.

The procedure for generating X would therefore examine Z_1, Z_2, . . . and compare with p_1, p_2, . . . , and generate a 1 at the first time one of the Z_i's is less than the corresponding p_i and generate a 0 the first time one of the Z_i's is greater than the corresponding p_i. Thus the probability that X is generated after seeing the first bit of Z is the probability that Z_1 ≠ p_1, i.e., 1/2. Similarly, X is generated after 2 bits of Z if Z_1 = p_1 and Z_2 ≠ p_2, which occurs with probability 1/4. Thus

EN = 1 · (1/2) + 2 · (1/4) + 3 · (1/8) + · · ·    (5.91)
   = 2.    (5.92)

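A short simulation of this procedure (ours; the dyadic example p = 5/8 is our choice) illustrates both that the output has the right bias and that the expected number of flips stays below 2.

import random

def generate(p_bits):
    # Compare fair bits Z_i with the binary expansion of p; stop at the first disagreement.
    for n, pb in enumerate(p_bits, start=1):
        z = random.randint(0, 1)
        if z != pb:
            # U < p exactly when, at the first disagreement, Z_i = 0 and p_i = 1
            return (1 if z < pb else 0), n
    return 0, len(p_bits)    # all bits matched, so U >= p (p is dyadic here)

p_bits = [1, 0, 1]           # p = 0.101 in binary = 5/8
samples = [generate(p_bits) for _ in range(100000)]
print(sum(x for x, _ in samples) / len(samples))   # close to 5/8
print(sum(n for _, n in samples) / len(samples))   # close to 1.75; EN <= 2 in general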

36. Optimal word lengths.

(a) Can l = (1, 2, 2) be the word lengths of a binary Huffman code? What about (2, 2, 3, 3)?


(b) What word lengths l = (l1, l2, . . .) can arise from binary Huffman codes?

Solution: Optimal Word Lengths

We first answer (b) and apply the result to (a).

(b) Word lengths of a binary Huffman code must satisfy the Kraft inequality with equality, i.e., ∑_i 2^{−l_i} = 1. An easy way to see this is the following: every node in the tree has a sibling (a property of an optimal binary code), and if we assign each node at depth l_i a 'weight', namely 2^{−l_i}, then 2 × 2^{−l_i} is the weight of the father (mother) node. Thus, 'collapsing' the tree back, we have that ∑_i 2^{−l_i} = 1.

(a) Clearly, (1, 2, 2) satisfies the Kraft inequality with equality, while (2, 2, 3, 3) does not. Thus, (1, 2, 2) can arise from a Huffman code, while (2, 2, 3, 3) cannot.

37. Codes. Which of the following codes are

(a) uniquely decodable?

(b) instantaneous?

C1 = {00, 01, 0}
C2 = {00, 01, 100, 101, 11}
C3 = {0, 10, 110, 1110, . . .}
C4 = {0, 00, 000, 0000}

Solution: Codes.

(a) C1 = {00, 01, 0} is uniquely decodable (suffix free) but not instantaneous.

(b) C2 = {00, 01, 100, 101, 11} is prefix free (instantaneous).

(c) C3 = {0, 10, 110, 1110, . . .} is instantaneous.

(d) C4 = {0, 00, 000, 0000} is neither uniquely decodable nor instantaneous.

38. Huffman. Find the Huffman D-ary code for (p_1, p_2, p_3, p_4, p_5, p_6) = (6/25, 6/25, 4/25, 4/25, 3/25, 2/25) and the expected word length

(a) for D = 2.

(b) for D = 4.

Solution: Huffman Codes.

(a) D = 2

6   6   6   8   11   14   25
6   6   6   6    8   11
4   4   5   6    6
4   4   4   5
2   3   4
2   2
1

p_i   6/25  6/25  4/25  4/25  2/25  2/25  1/25
l_i    2     2     3     3     3     4     4

E(l) = ∑_{i=1}^7 p_i l_i = (1/25)(6 × 2 + 6 × 2 + 4 × 3 + 4 × 3 + 2 × 3 + 2 × 4 + 1 × 4) = 66/25 = 2.66

(b) D = 4

6   9   25
6   6
4   6
4   4
2
2
1

p_i   6/25  6/25  4/25  4/25  2/25  2/25  1/25
l_i    1     1     1     2     2     2     2

E(l) = ∑_{i=1}^7 p_i l_i = (1/25)(6 × 1 + 6 × 1 + 4 × 1 + 4 × 2 + 2 × 2 + 2 × 2 + 1 × 2) = 34/25 = 1.36

39. Entropy of encoded bits. Let C : X −→ {0, 1}* be a nonsingular but nonuniquely decodable code. Let X have entropy H(X).

(a) Compare H(C(X)) to H(X) .

(b) Compare H(C(Xn)) to H(Xn) .

Solution: Entropy of encoded bits


(a) Since the code is non-singular, the function X → C(X) is one-to-one, and hence H(X) = H(C(X)). (Problem 2.4)

(b) Since the code is not uniquely decodable, the function X^n → C(X^n) is many-to-one, and hence H(X^n) ≥ H(C(X^n)).

40. Code rate. Let X be a random variable with alphabet {1, 2, 3} and distribution

X = 1 with probability 1/2,  X = 2 with probability 1/4,  X = 3 with probability 1/4.

The data compression code for X assigns codewords

C(x) = 0 if x = 1,  10 if x = 2,  11 if x = 3.

Let X_1, X_2, . . . be independent identically distributed according to this distribution and let Z_1 Z_2 Z_3 . . . = C(X_1)C(X_2) . . . be the string of binary symbols resulting from concatenating the corresponding codewords. For example, 122 becomes 01010.

(a) Find the entropy rate H(X) and the entropy rate H(Z) in bits per symbol. Note that Z is not compressible further.

(b) Now let the code be

C(x) = 00 if x = 1,  10 if x = 2,  01 if x = 3,

and find the entropy rate H(Z).

(c) Finally, let the code be

C(x) = 00 if x = 1,  1 if x = 2,  01 if x = 3,

and find the entropy rate H(Z).

Solution: Code rate.

This is a slightly tricky question. There's no straightforward rigorous way to calculate the entropy rates, so you need to do some guessing.

(a) First, since the X_i's are independent, H(X) = H(X_1) = (1/2) log 2 + 2(1/4) log 4 = 3/2.

Now we observe that this is an optimal code for the given distribution on X, and since the probabilities are dyadic there is no gain in coding in blocks. So the resulting process has to be i.i.d. Bern(1/2) (for otherwise we could get further compression from it).

Therefore H(Z) = H(Bern(1/2)) = 1.

(b) Here it's easy:

H(Z) = lim_{n→∞} H(Z_1, Z_2, . . . , Z_n)/n = lim_{n→∞} H(X_1, X_2, . . . , X_{n/2})/n = lim_{n→∞} (n/2) H(X)/n = 3/4.

(We're being a little sloppy and ignoring the fact that n above may not be even, but in the limit as n → ∞ this doesn't make a difference.)

(c) This is the tricky part.

Suppose we encode the first n symbols X_1 X_2 · · · X_n into

Z_1 Z_2 · · · Z_m = C(X_1) C(X_2) · · · C(X_n).

Here m = L(C(X_1)) + L(C(X_2)) + · · · + L(C(X_n)) is the total length of the encoded sequence (in bits), and L is the (binary) length function. Since the concatenated codeword sequence is an invertible function of (X_1, . . . , X_n), it follows that

n H(X) = H(X_1 X_2 · · · X_n) = H(Z_1 Z_2 · · · Z_{∑_{i=1}^n L(C(X_i))}).    (5.93)

The first equality above is trivial since the X_i's are independent. Similarly, we may guess that the right-hand side above can be written as

H(Z_1 Z_2 · · · Z_{∑_{i=1}^n L(C(X_i))}) = E[∑_{i=1}^n L(C(X_i))] H(Z) = n E[L(C(X_1))] H(Z).    (5.94)

(This is not trivial to prove, but it is true.)

Combining the left-hand side of (5.93) with the right-hand side of (5.94) yields

H(Z) = H(X)/E[L(C(X_1))] = (3/2)/(7/4) = 6/7,

where E[L(C(X_1))] = ∑_{x=1}^3 p(x) L(C(x)) = 7/4.


41. Optimal codes. Let l_1, l_2, . . . , l_10 be the binary Huffman codeword lengths for the probabilities p_1 ≥ p_2 ≥ . . . ≥ p_10. Suppose we get a new distribution by splitting the last probability mass. What can you say about the optimal binary codeword lengths l̃_1, l̃_2, . . . , l̃_11 for the probabilities p_1, p_2, . . . , p_9, αp_10, (1 − α)p_10, where 0 ≤ α ≤ 1?

Solution: Optimal codes.

To construct a Huffman code, we first combine the two smallest probabilities. In this case, we would combine αp_10 and (1 − α)p_10. The result of the sum of these two probabilities is p_10. Note that the resulting probability distribution is now exactly the same as the original probability distribution. The key point is that an optimal code for p_1, p_2, . . . , p_10 yields an optimal code (when expanded) for p_1, p_2, . . . , p_9, αp_10, (1 − α)p_10. In effect, the first 9 codewords will be left unchanged, while the 2 new codewords will be XXX0 and XXX1, where XXX represents the last codeword of the original distribution.

In short, the lengths of the first 9 codewords remain unchanged, while the lengths of the last 2 codewords (the new codewords) are equal to l_10 + 1.

42. Ternary codes. Which of the following codeword lengths can be the word lengths of a 3-ary Huffman code and which cannot?

(a) (1, 2, 2, 2, 2)

(b) (2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3)

Solution: Ternary codes.

(a) The word lengths (1, 2, 2, 2, 2) CANNOT be the word lengths for a 3-ary Huffman code. This can be seen by drawing the tree implied by these lengths, and seeing that one of the codewords of length 2 can be reduced to a shorter codeword of length 1. Since the Huffman tree produces the minimum expected length tree, these codeword lengths cannot be the word lengths for a Huffman tree.

(b) The word lengths (2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3) ARE the word lengths for a 3-ary Huffman code. Again, drawing the tree will verify this. Also, ∑_i 3^{−l_i} = 8 × 3^{−2} + 3 × 3^{−3} = 1, so these word lengths satisfy the Kraft inequality with equality. Therefore the word lengths are optimal for some distribution, and are the word lengths for a 3-ary Huffman code.

43. Piecewise Huffman. Suppose the codeword that we use to describe a random variable X ∼ p(x) always starts with a symbol chosen from the set {A, B, C}, followed by binary digits {0, 1}. Thus we have a ternary code for the first symbol and binary thereafter. Give the optimal uniquely decodable code (minimum expected number of symbols) for the probability distribution

p = (16/69, 15/69, 12/69, 10/69, 8/69, 8/69).    (5.95)


Solution: Piecewise Huffman.

Codeword  Symbol
a         x1    16   16   22   31   69
b1        x2    15   16   16   22
c1        x3    12   15   16   16
c0        x4    10   12   15
b01       x5     8   10
b00       x6     8

Note that the above code is not only uniquely decodable, but it is also instantaneously decodable. Generally, given a uniquely decodable code, we can construct an instantaneous code with the same codeword lengths. This is not the case with the piecewise Huffman construction. There exists a code with smaller expected length that is uniquely decodable, but not instantaneous:

Codeword
a
b
c
a0
b0
c0

44. Huffman. Find the word lengths of the optimal binary encoding of p = (1/100, 1/100, . . . , 1/100).

Solution: Huffman.

Since the distribution is uniform, the Huffman tree will consist of word lengths of ⌈log(100)⌉ = 7 and ⌊log(100)⌋ = 6. There are 64 nodes of depth 6, of which (64 − k) will be leaf nodes; and there are k nodes of depth 6 which will form 2k leaf nodes of depth 7. Since the total number of leaf nodes is 100, we have

(64 − k) + 2k = 100 ⇒ k = 36.

So there are 64 − 36 = 28 codewords of word length 6, and 2 × 36 = 72 codewords of word length 7.
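Running an ordinary Huffman construction on 100 equal probabilities confirms the count. The sketch below is ours; since any Huffman code is optimal, the length counts do not depend on how ties are broken.

import heapq

def huffman_lengths(probs):
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p0, _, s0 = heapq.heappop(heap)
        p1, _, s1 = heapq.heappop(heap)
        for i in s0 + s1:
            lengths[i] += 1
        heapq.heappush(heap, (p0 + p1, tie, s0 + s1)); tie += 1
    return lengths

L = huffman_lengths([1 / 100] * 100)
print(L.count(6), L.count(7))   # 28 codewords of length 6, 72 of length 7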

45. Random “20” questions. Let X be uniformly distributed over {1, 2, . . . , m}. Assume m = 2^n. We ask random questions: Is X ∈ S_1? Is X ∈ S_2? . . . until only one integer remains. All 2^m subsets S of {1, 2, . . . , m} are equally likely.

(a) How many deterministic questions are needed to determine X?

(b) Without loss of generality, suppose that X = 1 is the random object. What is the probability that object 2 yields the same answers for k questions as object 1?

(c) What is the expected number of objects in {2, 3, . . . , m} that have the same answers to the questions as does the correct object 1?

(d) Suppose we ask n + √n random questions. What is the expected number of wrong objects agreeing with the answers?

(e) Use Markov's inequality Pr{X ≥ tμ} ≤ 1/t to show that the probability of error (one or more wrong objects remaining) goes to zero as n → ∞.

Solution: Random “20” questions.

(a) Obviously, Huffman codewords for X are all of length n. Hence, with n deterministic questions, we can identify an object out of 2^n candidates.

(b) Observe that the total number of subsets which include both object 1 and object 2, or neither of them, is 2^{m−1}. Hence, the probability that object 2 yields the same answers for k questions as object 1 is (2^{m−1}/2^m)^k = 2^{−k}.

More information-theoretically, we can view this problem as a channel coding problem through a noiseless channel. Since all subsets are equally likely, the probability that object 1 is in a specific random subset is 1/2. Hence, the question whether object 1 belongs to the k-th subset or not corresponds to the k-th bit of the random codeword for object 1, where the codewords X^k are Bern(1/2) random k-sequences.

Object   Codeword
1        0110 . . . 1
2        0010 . . . 0
...

Now we observe a noiseless output Y^k of X^k and figure out which object was sent. From the same line of reasoning as in the achievability proof of the channel coding theorem, i.e., joint typicality, it is obvious that the probability that object 2 has the same codeword as object 1 is 2^{−k}.

(c) Let

1_j = 1 if object j yields the same answers for k questions as object 1, and 0 otherwise,

for j = 2, . . . , m. Then

E(# of objects in {2, 3, . . . , m} with the same answers) = E(∑_{j=2}^m 1_j)
    = ∑_{j=2}^m E(1_j)
    = ∑_{j=2}^m 2^{−k}
    = (m − 1) 2^{−k}
    = (2^n − 1) 2^{−k}.

(d) Plugging k = n + √n into (c), we get an expected number of (2^n − 1) 2^{−n−√n}.

(e) Let N be the number of wrong objects remaining. Then, by Markov's inequality,

P(N ≥ 1) ≤ EN = (2^n − 1) 2^{−n−√n} ≤ 2^{−√n} → 0,

where the first equality follows from part (d).


Chapter 6

Gambling and Data Compression

1. Horse race. Three horses run a race. A gambler offers 3-for-1 odds on each of the horses. These are fair odds under the assumption that all horses are equally likely to win the race. The true win probabilities are known to be

p = (p1, p2, p3) = (1/2, 1/4, 1/4).   (6.1)

Let b = (b1, b2, b3), bi ≥ 0, Σ bi = 1, be the amount invested on each of the horses. The expected log wealth is thus

W(b) = Σ_{i=1}^3 pi log 3bi.   (6.2)

(a) Maximize this over b to find b∗ and W∗. Thus the wealth achieved in repeated horse races should grow to infinity like 2^{nW∗} with probability one.

(b) Show that if instead we put all of our money on horse 1, the most likely winner, we will eventually go broke with probability one.

Solution: Horse race.

(a) The doubling rate is

W(b) = Σ_i pi log bi oi   (6.3)
     = Σ_i pi log 3bi   (6.4)
     = Σ pi log 3 + Σ pi log pi − Σ pi log(pi/bi)   (6.5)
     = log 3 − H(p) − D(p||b)   (6.6)
     ≤ log 3 − H(p),   (6.7)

with equality iff p = b. Hence b∗ = p = (1/2, 1/4, 1/4) and W∗ = log 3 − H(1/2, 1/4, 1/4) = (1/2) log(9/8) = 0.085.

By the strong law of large numbers,

Sn = Π_j 3 b(Xj)   (6.8)
   = 2^{n ( (1/n) Σ_j log 3 b(Xj) )}   (6.9)
   → 2^{n E log 3 b(X)}   (6.10)
   = 2^{n W(b)}.   (6.11)

When b = b∗, W(b) = W∗ and Sn ≐ 2^{nW∗} = 2^{0.085n} = (1.06)^n.

(b) If we put all the money on the first horse, then the probability that we do not go broke in n races is (1/2)^n. Since this probability goes to zero with n, the probability of the set of outcomes where we do not ever go broke is zero, and we will go broke with probability 1.

Alternatively, if b = (1, 0, 0), then W(b) = −∞ and

Sn → 2^{nW} = 0  w.p. 1   (6.13)

by the strong law of large numbers.
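A short simulation (not part of the original solution; the helper simulate_wealth is made up) illustrates both conclusions: proportional betting grows wealth at roughly W∗ ≈ 0.085 doublings per race, while betting everything on horse 1 leads to ruin.

```python
import math, random

def simulate_wealth(bets, probs, odds, n_races, seed=0):
    """Return log2 of final wealth (starting from 1) after n_races independent
    races with fixed betting fractions `bets`."""
    rng = random.Random(seed)
    log_wealth = 0.0
    for _ in range(n_races):
        winner = rng.choices(range(len(probs)), weights=probs)[0]
        payoff = bets[winner] * odds[winner]
        if payoff == 0:
            return float("-inf")   # ruined: nothing was bet on the winner
        log_wealth += math.log2(payoff)
    return log_wealth

p, o, n = [0.5, 0.25, 0.25], [3, 3, 3], 10_000
print(simulate_wealth(p, p, o, n) / n)        # ≈ W* = log 3 - H(p) ≈ 0.085
print(simulate_wealth([1, 0, 0], p, o, n))    # -inf: broke once horse 2 or 3 wins
```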

2. Horse race with subfair odds. If the odds are bad (due to a track take) the gambler may wish to keep money in his pocket. Let b(0) be the amount in his pocket and let b(1), b(2), . . . , b(m) be the amount bet on horses 1, 2, . . . ,m, with odds o(1), o(2), . . . , o(m), and win probabilities p(1), p(2), . . . , p(m). Thus the resulting wealth is S(x) = b(0) + b(x)o(x), with probability p(x), x = 1, 2, . . . ,m.

(a) Find b∗ maximizing E log S if Σ 1/o(i) < 1.

(b) Discuss b∗ if Σ 1/o(i) > 1. (There isn't an easy closed-form solution in this case, but a "water-filling" solution results from the application of the Kuhn-Tucker conditions.)

Solution: (Horse race with a cash option).

Since in this case the gambler is allowed to keep some of the money as cash, the mathematics becomes more complicated. In class, we used two different approaches to prove the optimality of proportional betting when the gambler is not allowed to keep any of the money as cash. We will use both approaches for this problem. But in the case of subfair odds, the relative entropy approach breaks down, and we have to use the calculus approach.

The setup of the problem is straightforward. We want to maximize the expected log return, i.e.,

W(b, p) = E log S(X) = Σ_{i=1}^m pi log(b0 + bi oi)   (6.14)

over all choices b with bi ≥ 0 and Σ_{i=0}^m bi = 1.

Approach 1: Relative Entropy

We try to express W (b,p) as a sum of relative entropies.

W(b, p) = Σ pi log(b0 + bi oi)   (6.15)
        = Σ pi log[ (b0/oi + bi) oi ]   (6.16)
        = Σ pi log[ pi oi · (b0/oi + bi)/pi ]   (6.17)
        = Σ pi log pi oi + log K − D(p||r),   (6.18)

where

K = Σ (b0/oi + bi) = b0 Σ 1/oi + Σ bi = b0 ( Σ 1/oi − 1 ) + 1,   (6.19)

and

ri = (b0/oi + bi)/K   (6.20)

is a kind of normalized portfolio. Now both K and r depend on the choice of b. To maximize W(b, p), we must maximize log K and at the same time minimize D(p||r). Let us consider the two cases:

(a) Σ 1/oi ≤ 1. This is the case of superfair or fair odds. In these cases, it seems intuitively clear that we should put all of our money in the race. For example, in the case of a superfair gamble, one could invest any cash using a "Dutch book" (investing inversely proportional to the odds) and do strictly better with probability 1.

Examining the expression for K, we see that K is maximized for b0 = 0. In this case, setting bi = pi would imply that ri = pi and hence D(p||r) = 0. We have succeeded in simultaneously maximizing the two variable terms in the expression for W(b, p), and this must be the optimal solution.

Hence, for fair or superfair games, the gambler should invest all his money in the race using proportional gambling, and not leave anything aside as cash.

(b) Σ 1/oi > 1. In this case of subfair odds, the argument breaks down. Looking at the expression for K, we see that it is maximized for b0 = 1. However, we cannot simultaneously minimize D(p||r).

If pi oi ≤ 1 for all horses, then the first term in the expansion of W(b, p), that is, Σ pi log pi oi, is negative. With b0 = 0, the best we can achieve is proportional betting, which sets the last term to 0. Hence, with b0 = 0, we can only achieve a negative expected log return, which is strictly worse than the 0 log return achieved by setting b0 = 1. This would indicate, but not prove, that in this case one should leave all one's money as cash. A more rigorous approach using calculus will prove this.


We can, however, give a simple argument to show that in the case of subfair odds, the gambler should leave at least some of his money as cash and that there is at least one horse on which he does not bet any money. We will prove this by contradiction: starting with a portfolio that does not satisfy these criteria, we will generate one which does better with probability one.

Let the amount bet on each of the horses be (b1, b2, . . . , bm) with Σ_{i=1}^m bi = 1, so that there is no money left aside. Arrange the horses in order of decreasing bi oi, so that the m-th horse is the one with the minimum product.

Consider a new portfolio with

b'i = bi − (bm om)/oi   (6.21)

for all i. Since bi oi ≥ bm om for all i, b'i ≥ 0. We keep the remaining money, i.e.,

1 − Σ_{i=1}^m b'i = 1 − Σ_{i=1}^m ( bi − (bm om)/oi )   (6.22)
                 = Σ_{i=1}^m (bm om)/oi,   (6.23)

as cash.

The return on the new portfolio if horse i wins, including the cash retained, is

b'i oi + Σ_{j=1}^m (bm om)/oj = ( bi − (bm om)/oi ) oi + Σ_{j=1}^m (bm om)/oj   (6.24)
                             = bi oi + bm om ( Σ_{j=1}^m 1/oj − 1 )   (6.25)
                             > bi oi,   (6.26)

since Σ 1/oj > 1. Hence, irrespective of which horse wins, the new portfolio does better than the old one, and hence the old portfolio could not be optimal.

Approach 2: Calculus

We set up the functional using Lagrange multipliers as before:

J(b) = Σ_{i=1}^m pi log(b0 + bi oi) + λ ( Σ_{i=0}^m bi ).   (6.27)

Differentiating with respect to bi, we obtain

∂J/∂bi = pi oi/(b0 + bi oi) + λ = 0.   (6.28)

Differentiating with respect to b0, we obtain

∂J/∂b0 = Σ_{i=1}^m pi/(b0 + bi oi) + λ = 0.   (6.29)

Differentiating w.r.t. λ, we get the constraint

Σ bi = 1.   (6.30)

The solution to these three equations, if it exists, would give the optimal portfolio b. But substituting the first equation in the second, we obtain the following equation:

λ Σ 1/oi = λ.   (6.31)

Clearly, in the case when Σ 1/oi ≠ 1, the only solution to this equation is λ = 0, which indicates that the solution is on the boundary of the region over which the maximization is being carried out. Actually, we have been quite cavalier with the setup of the problem: in addition to the constraint Σ bi = 1, we have the inequality constraints bi ≥ 0. We should have allotted a Lagrange multiplier to each of these. Rewriting the functional with Lagrange multipliers,

J(b) = Σ_{i=1}^m pi log(b0 + bi oi) + λ ( Σ_{i=0}^m bi ) + Σ γi bi.   (6.32)

Differentiating with respect to bi, we obtain

∂J/∂bi = pi oi/(b0 + bi oi) + λ + γi = 0.   (6.33)

Differentiating with respect to b0, we obtain

∂J/∂b0 = Σ_{i=1}^m pi/(b0 + bi oi) + λ + γ0 = 0.   (6.34)

Differentiating w.r.t. λ, we get the constraint

Σ bi = 1.   (6.35)

Now, carrying out the same substitution, we get

λ + γ0 = λ Σ 1/oi + Σ γi/oi,   (6.36)

which indicates that if Σ 1/oi ≠ 1, at least one of the γ's is non-zero, so the corresponding constraint is active and the solution is on the boundary of the region.

In the case of solutions on the boundary, we have to use the Kuhn-Tucker conditions to find the maximum. These conditions are described in Gallager [2], pg. 87. The conditions describe the behavior of the derivative at the maximum of a concave function over a convex region. For any coordinate which is in the interior of the region, the derivative should be 0. For any coordinate on the boundary, the derivative should be negative in the direction towards the interior of the region. More formally, for a concave function F(x1, x2, . . . , xn) over the region xi ≥ 0,

∂F/∂xi ≤ 0 if xi = 0,  and  ∂F/∂xi = 0 if xi > 0.   (6.37)

Applying the Kuhn-Tucker conditions to the present maximization, we obtain

pi oi/(b0 + bi oi) + λ ≤ 0 if bi = 0,  and  = 0 if bi > 0,   (6.38)

and

Σ pi/(b0 + bi oi) + λ ≤ 0 if b0 = 0,  and  = 0 if b0 > 0.   (6.39)

Theorem 4.4.1 in Gallager [2] proves that if we can find a solution to the Kuhn-Tucker conditions, then the solution is the maximum of the function in the region. Let us consider the two cases:

(a) Σ 1/oi ≤ 1. In this case, we try the solution we expect, b0 = 0 and bi = pi. Setting λ = −1, we find that all the Kuhn-Tucker conditions are satisfied. Hence, this is the optimal portfolio for superfair or fair odds.

(b) Σ 1/oi > 1. In this case, we try the expected solution, b0 = 1 and bi = 0. We find that all the Kuhn-Tucker conditions are satisfied if pi oi ≤ 1 for all i. Hence under this condition, the optimum solution is to not invest anything in the race but to keep everything as cash.

In the case when some pi oi > 1, the Kuhn-Tucker conditions are no longer satisfied by b0 = 1. We should then invest some money in the race; however, since the denominator of the expressions in the Kuhn-Tucker conditions also changes, more than one horse may now violate the Kuhn-Tucker conditions. Hence, the optimum solution may involve investing in some horses with pi oi ≤ 1. There is no explicit form for the solution in this case.

The Kuhn-Tucker conditions for this case do not give rise to an explicit solution. Instead, we can formulate a procedure for finding the optimum distribution of capital:

Order the horses according to pi oi, so that

p1 o1 ≥ p2 o2 ≥ · · · ≥ pm om.   (6.40)

Define

Ck = ( 1 − Σ_{i=1}^k pi ) / ( 1 − Σ_{i=1}^k 1/oi )  for k ≥ 1,  and C0 = 1.   (6.41)

Define

t = min{ n : p_{n+1} o_{n+1} ≤ Cn }.   (6.42)

Clearly t ≥ 1 since p1 o1 > 1 = C0.


Claim: The optimal strategy for the horse race when the odds are subfair and some of the pi oi are greater than 1 is: set

b0 = Ct,   (6.43)

and for i = 1, 2, . . . , t, set

bi = pi − Ct/oi,   (6.44)

and for i = t + 1, . . . ,m, set

bi = 0.   (6.45)

The above choice of b satisfies the Kuhn-Tucker conditions with λ = −1. For b0, the Kuhn-Tucker condition is

Σ pi/(b0 + bi oi) = Σ_{i=1}^t 1/oi + Σ_{i=t+1}^m pi/Ct = Σ_{i=1}^t 1/oi + ( 1 − Σ_{i=1}^t pi )/Ct = 1.   (6.46)

For 1 ≤ i ≤ t, the Kuhn-Tucker conditions reduce to

pi oi/(b0 + bi oi) = pi oi/(pi oi) = 1.   (6.47)

For t + 1 ≤ i ≤ m, the Kuhn-Tucker conditions reduce to

pi oi/(b0 + bi oi) = pi oi/Ct ≤ 1,   (6.48)

by the definition of t. Hence the Kuhn-Tucker conditions are satisfied, and this is the optimal solution.
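The threshold procedure of equations (6.40)-(6.45) is straightforward to mechanize. The Python sketch below (the function name subfair_betting is made up, and it assumes the threshold t of (6.42) exists with a positive denominator in (6.41)) returns the cash fraction b0 and the bets bi.

```python
def subfair_betting(p, o):
    """Sketch of the threshold procedure of Eqs. (6.40)-(6.45): given win
    probabilities p and subfair odds o (sum of 1/o_i > 1), return (b0, bets)."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i] * o[i], reverse=True)   # Eq. (6.40)
    ps = [p[i] for i in order]
    qs = [o[i] for i in order]

    def C(k):                       # Eq. (6.41), with C(0) = 1
        return 1.0 if k == 0 else (1 - sum(ps[:k])) / (1 - sum(1 / x for x in qs[:k]))

    if ps[0] * qs[0] <= 1:          # no p_i o_i exceeds 1: keep everything as cash
        return 1.0, [0.0] * m

    t = m                           # Eq. (6.42): first n with p_{n+1} o_{n+1} <= C_n
    for n in range(1, m):
        if ps[n] * qs[n] <= C(n):
            t = n
            break

    b0 = C(t)                                                               # Eq. (6.43)
    bets_sorted = [ps[i] - b0 / qs[i] if i < t else 0.0 for i in range(m)]  # (6.44)-(6.45)
    bets = [0.0] * m
    for rank, i in enumerate(order):
        bets[i] = bets_sorted[rank]
    return b0, bets

# Example with subfair odds (sum of 1/o_i = 1.2 > 1):
print(subfair_betting([0.5, 0.3, 0.2], [2.5, 2.5, 2.5]))
```

For this example only the horse with pi oi > 1 receives a bet, and the rest of the wealth, b0 = Ct ≈ 0.83, stays in the pocket.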

3. Cards. An ordinary deck of cards containing 26 red cards and 26 black cards is shuffled and dealt out one card at a time without replacement. Let Xi be the color of the i-th card.

(a) Determine H(X1).

(b) Determine H(X2).

(c) Does H(Xk | X1, X2, . . . , Xk−1) increase or decrease?

(d) Determine H(X1, X2, . . . , X52).

Solution:

(a) P(first card red) = P(first card black) = 1/2. Hence H(X1) = (1/2) log 2 + (1/2) log 2 = log 2 = 1 bit.

(b) P(second card red) = P(second card black) = 1/2 by symmetry. Hence H(X2) = (1/2) log 2 + (1/2) log 2 = log 2 = 1 bit. There is no change in the probability from X1 to X2 (or to Xi, 1 ≤ i ≤ 52) since all the permutations of red and black cards are equally likely.


(c) Since all permutations are equally likely, the joint distribution of Xk and (X1, . . . , Xk−1) is the same as the joint distribution of Xk+1 and (X1, . . . , Xk−1). Therefore

H(Xk|X1, . . . , Xk−1) = H(Xk+1|X1, . . . , Xk−1) ≥ H(Xk+1|X1, . . . , Xk),   (6.49)

and so the conditional entropy decreases as we proceed along the sequence.

Knowledge of the past reduces uncertainty, and thus the conditional entropy of the k-th card's color given all the previous cards decreases as k increases.

(d) All (52 choose 26) possible sequences of 26 red cards and 26 black cards are equally likely. Thus

H(X1, X2, . . . , X52) = log (52 choose 26) = 48.8 bits (3.2 bits less than 52).   (6.50)
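A quick numeric check of (6.50), added for illustration:

```python
from math import comb, log2
print(log2(comb(52, 26)))        # ≈ 48.82 bits
print(52 - log2(comb(52, 26)))   # ≈ 3.18 bits, the "3.2 bits less than 52"
```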

4. Gambling. Suppose one gambles sequentially on the card outcomes in Problem 3. Even odds of 2-for-1 are paid. Thus the wealth Sn at time n is Sn = 2^n b(x1, x2, . . . , xn), where b(x1, x2, . . . , xn) is the proportion of wealth bet on x1, x2, . . . , xn. Find max_{b(·)} E log S52.

Solution: Gambling on red and black cards.

E[log Sn] = E[log(2^n b(X1, X2, . . . , Xn))]   (6.51)
          = n log 2 + E[log b(X)]   (6.52)
          = n + Σ_{x∈X^n} p(x) log b(x)   (6.53)
          = n + Σ_{x∈X^n} p(x) [ log(b(x)/p(x)) + log p(x) ]   (6.54)
          = n − D(p(x)||b(x)) − H(X).   (6.55)

Taking b(x) = p(x) makes D(p(x)||b(x)) = 0 and maximizes E log S52.

max_{b(x)} E log S52 = 52 − H(X)   (6.56)
                     = 52 − log( 52!/(26! 26!) )   (6.57)
                     = 3.2.   (6.58)

Alternatively, as in the horse race, proportional betting is log-optimal. Thus b(x) = p(x) and, regardless of the outcome,

S52 = 2^52 / (52 choose 26) = 9.08,   (6.59)

and hence

log S52 = max_{b(x)} E log S52 = log 9.08 = 3.2.   (6.60)
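A one-line check of (6.59)-(6.60), added for illustration:

```python
from math import comb, log2
S52 = 2 ** 52 / comb(52, 26)
print(S52, log2(S52))   # ≈ 9.08 and ≈ 3.18, matching (6.59) and (6.60)
```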


5. Beating the public odds. Consider a 3-horse race with win probabilities

(p1, p2, p3) = (1/2, 1/4, 1/4)

and fair odds with respect to the (false) distribution

(r1, r2, r3) = (1/4, 1/4, 1/2).

Thus the odds are

(o1, o2, o3) = (4, 4, 2).

(a) What is the entropy of the race?

(b) Find the set of bets (b1, b2, b3) such that the compounded wealth in repeated plays will grow to infinity.

Solution: Beating the public odds.

(a) The entropy of the race is given by

H(p) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 3/2.

(b) The compounded wealth will grow to infinity for the set of bets (b1, b2, b3) such that W(b, p) > 0, where

W(b, p) = D(p||r) − D(p||b) = Σ_{i=1}^3 pi log(bi/ri).

Calculating D(p||r) = 1/4, this criterion becomes

D(p||b) < 1/4.

6. Horse race: A 3-horse race has win probabilities p = (p1, p2, p3) and odds o = (1, 1, 1). The gambler places bets b = (b1, b2, b3), bi ≥ 0, Σ bi = 1, where bi denotes the proportion of wealth bet on horse i. These odds are very bad. The gambler gets his money back on the winning horse and loses the other bets. Thus the wealth Sn at time n resulting from independent gambles goes exponentially to zero.

(a) Find the exponent.

(b) Find the optimal gambling scheme b, i.e., the bet b∗ that maximizes the exponent.


(c) Assuming b is chosen as in (b), what distribution p causes Sn to go to zero at the fastest rate?

Solution: Minimizing losses.

(a) Despite the bad odds, the optimal strategy is still proportional gambling. Thus the optimal bets are b = p, and the exponent in this case is

W∗ = Σ_i pi log pi = −H(p).   (6.61)

(b) The optimal gambling strategy is still proportional betting.

(c) The worst distribution (the one that causes the doubling rate to be as negative as possible) is the distribution that maximizes the entropy. Thus the worst W∗ is −log 3, and the gambler's money goes to zero as 3^{−n}.

7. Horse race. Consider a horse race with 4 horses. Assume that each of the horses pays 4-for-1 if it wins. Let the probabilities of winning of the horses be {1/2, 1/4, 1/8, 1/8}. If you started with $100 and bet optimally to maximize your long-term growth rate, what are your optimal bets on each horse? Approximately how much money would you have after 20 races with this strategy?

Solution: Horse race. The optimal betting strategy is proportional betting, i.e., dividing the investment in proportion to the probabilities of each horse winning. Thus the bets on each horse should be (50%, 25%, 12.5%, 12.5%), and the growth rate achieved by this strategy is equal to log 4 − H(p) = log 4 − H(1/2, 1/4, 1/8, 1/8) = 2 − 1.75 = 0.25. After 20 races with this strategy, the wealth is approximately 2^{nW} = 2^5 = 32, and hence the wealth would grow approximately 32-fold over 20 races.
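As a small check of these numbers (illustrative only):

```python
from math import log2
p = [0.5, 0.25, 0.125, 0.125]
W = log2(4) - sum(-x * log2(x) for x in p)   # log 4 - H(p)
print(W, 100 * 2 ** (20 * W))                # 0.25 and ≈ $3200, i.e. about 32-fold growth
```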

8. Lotto. The following analysis is a crude approximation to the games of Lotto conducted by various states. Assume that the player of the game is required to pay $1 to play and is asked to choose 1 number from a range 1 to 8. At the end of every day, the state lottery commission picks a number uniformly over the same range. The jackpot, i.e., all the money collected that day, is split among all the people who chose the same number as the one chosen by the state. E.g., if 100 people played today, and 10 of them chose the number 2, and the drawing at the end of the day picked 2, then the $100 collected is split among the 10 people, i.e., each of the persons who picked 2 will receive $10, and the others will receive nothing.

The general population does not choose numbers uniformly: numbers like 3 and 7 are supposedly lucky and are more popular than 4 or 8. Assume that the fraction of people choosing the various numbers 1, 2, . . . , 8 is (f1, f2, . . . , f8), and assume that n people play every day. Also assume that n is very large, so that any single person's choice does not change the proportion of people betting on any number.

(a) What is the optimal strategy to divide your money among the various possible tickets so as to maximize your long-term growth rate? (Ignore the fact that you cannot buy fractional tickets.)


(b) What is the optimal growth rate that you can achieve in this game?

(c) If (f1, f2, . . . , f8) = (1/8, 1/8, 1/4, 1/16, 1/16, 1/16, 1/4, 1/16), and you start with $1, how long will it be before you become a millionaire?

Solution:

(a) The probability of winning does not depend on the number you choose, and therefore, irrespective of the proportions of the other players, the log-optimal strategy is to divide your money uniformly over all the tickets.

(b) If there are n people playing, and a fraction fi of them choose number i, then the number of people sharing the jackpot of n dollars is nfi, and therefore each person gets n/(nfi) = 1/fi dollars if i is picked at the end of the day. Thus the odds for number i are 1/fi, and do not depend on the number of people playing.

Using the results of Section 6.1, the optimal growth rate is given by

W∗(p) = Σ pi log oi − H(p) = Σ (1/8) log(1/fi) − log 8.   (6.62)

(c) Substituting these fractions in the previous equation, we get

W∗(p) = (1/8) Σ log(1/fi) − log 8   (6.63)
      = (1/8)(3 + 3 + 2 + 4 + 4 + 4 + 2 + 4) − 3   (6.64)
      = 0.25,   (6.65)

and therefore after N days, the amount of money you would have would be approximately 2^{0.25N}. The number of days before this crosses a million is log2(1,000,000)/0.25 = 79.7, i.e., in 80 days, you should have a million dollars.
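A quick check of (6.63)-(6.65) and of the 80-day figure (illustrative only):

```python
from math import log2
f = [1/8, 1/8, 1/4, 1/16, 1/16, 1/16, 1/4, 1/16]
W = sum((1/8) * log2(1/fi) for fi in f) - 3
print(W, log2(1_000_000) / W)   # 0.25 and ≈ 79.7 days
```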

There are many problems with the analysis, not the least of which is that the state governments take out about half the money collected, so that the jackpot is only half of the total collections. Also, there are about 14 million different possible tickets, and it is therefore possible to use a uniform distribution using $1 tickets only if we use capital of the order of 14 million dollars. And with such large investments, the proportions of money bet on the different possibilities will change, which would further complicate the analysis.

However, the fact that people's choices are not uniform does leave a loophole that can be exploited. Under certain conditions, i.e., if the accumulated jackpot has reached a certain size, the expected return can be greater than 1, and it is worthwhile to play, despite the 50% cut taken by the state. But under normal circumstances, the 50% cut of the state makes the odds in the lottery very unfair, and it is not a worthwhile investment.

9. Horse race. Suppose one is interested in maximizing the doubling rate for a horse race. Let p1, p2, . . . , pm denote the win probabilities of the m horses. When do the odds (o1, o2, . . . , om) yield a higher doubling rate than the odds (o'1, o'2, . . . , o'm)?


Solution: Horse race.

Let W and W' denote the optimal doubling rates for the odds (o1, o2, . . . , om) and (o'1, o'2, . . . , o'm) respectively. By Theorem 6.1.2 in the book,

W = Σ pi log oi − H(p),  and  W' = Σ pi log o'i − H(p),

where p is the probability vector (p1, p2, . . . , pm). Then W > W' exactly when Σ pi log oi > Σ pi log o'i; that is, when

E log oi > E log o'i.

10. Horse race with probability estimates.

(a) Three horses race. Their probabilities of winning are (1/2, 1/4, 1/4). The odds are (4-for-1, 3-for-1 and 3-for-1). Let W∗ be the optimal doubling rate. Suppose you believe the probabilities are (1/4, 1/2, 1/4). If you try to maximize the doubling rate, what doubling rate W will you achieve? By how much has your doubling rate decreased due to your poor estimate of the probabilities, i.e., what is ΔW = W∗ − W?

(b) Now let the horse race be among m horses, with probabilities p = (p1, p2, . . . , pm) and odds o = (o1, o2, . . . , om). If you believe the true probabilities to be q = (q1, q2, . . . , qm), and try to maximize the doubling rate W, what is W∗ − W?

Solution: Horse race with probability estimates

(a) If you believe that the probabilities of winning are (1/4, 1/2, 1/4), you would bet proportional to this, and would achieve a growth rate Σ pi log bi oi = (1/2) log(4 · 1/4) + (1/4) log(3 · 1/2) + (1/4) log(3 · 1/4) = (1/4) log(9/8). If you bet according to the true probabilities, you would bet (1/2, 1/4, 1/4) on the three horses, achieving a growth rate Σ pi log bi oi = (1/2) log(4 · 1/2) + (1/4) log(3 · 1/4) + (1/4) log(3 · 1/4) = (1/2) log(3/2). The loss in growth rate due to incorrect estimation of the probabilities is the difference between the two growth rates, which is (1/4) log 2 = 0.25.

(b) For m horses, the growth rate with the true distribution is Σ pi log pi oi, and with the incorrect estimate it is Σ pi log qi oi. The difference between the two is Σ pi log(pi/qi) = D(p||q).

11. The two-envelope problem: One envelope contains b dollars, the other 2b dollars. The amount b is unknown. An envelope is selected at random. Let X be the amount observed in this envelope, and let Y be the amount in the other envelope.

Adopt the strategy of switching to the other envelope with probability p(x), where p(x) = e^{−x}/(e^{−x} + e^{x}). Let Z be the amount that the player receives. Thus

(X, Y) = (b, 2b) with probability 1/2, and (2b, b) with probability 1/2,   (6.66)

Z = X with probability 1 − p(x), and Z = Y with probability p(x).   (6.67)

(a) Show that E(X) = E(Y) = 3b/2.

(b) Show that E(Y/X) = 5/4. Since the expected ratio of the amount in the other envelope to the one in hand is 5/4, it seems that one should always switch. (This is the origin of the switching paradox.) However, observe that E(Y) ≠ E(X)E(Y/X). Thus, although E(Y/X) > 1, it does not follow that E(Y) > E(X).

(c) Let J be the index of the envelope containing the maximum amount of money, and let J' be the index of the envelope chosen by the algorithm. Show that for any b, I(J; J') > 0. Thus the amount in the first envelope always contains some information about which envelope to choose.

(d) Show that E(Z) > E(X). Thus you can do better than always staying or always switching. In fact, this is true for any monotonically decreasing switching function p(x). By randomly switching according to p(x), you are more likely to trade up than to trade down.

Solution: Two envelope problem:

(a) X = b or 2b with probability 1/2 each, and therefore E(X) = 1.5b. Y has the same unconditional distribution.

(b) Given X = x, the other envelope contains 2x with probability 1/2 and contains x/2 with probability 1/2. Thus E(Y/X) = 5/4.

(c) Without any conditioning, J = 1 or 2 with probability (1/2, 1/2). By symmetry, it is not difficult to see that the unconditional probability distribution of J' is also the same. We will now show that the two random variables are not independent, and therefore I(J; J') ≠ 0. To do this, we will calculate the conditional probability P(J' = 1|J = 1).

Conditioned on J = 1, the probability that X = b or 2b is still (1/2, 1/2). However, conditioned on (J = 1, X = 2b), the player is holding the better envelope, so J' = 1 iff he does not switch, which has probability 1 − p(2b). Similarly, conditioned on (J = 1, X = b), J' = 1 iff he switches, which has probability p(b). Thus,

P(J' = 1|J = 1) = P(X = b|J = 1) P(J' = 1|X = b, J = 1) + P(X = 2b|J = 1) P(J' = 1|X = 2b, J = 1)   (6.68)
                = (1/2) p(b) + (1/2)(1 − p(2b))   (6.69)
                = 1/2 + (1/2)( p(b) − p(2b) )   (6.70)
                > 1/2,   (6.71)

since the switching function p(x) is strictly decreasing, so p(b) > p(2b).


Thus the conditional distribution is not equal to the unconditional distribution, and J and J' are not independent.

(d) We use the above calculation of the conditional distribution to calculate E(Z). Without loss of generality, we assume that J = 1, i.e., the first envelope contains 2b. Then

E(Z|J = 1) = P(X = b|J = 1) E(Z|X = b, J = 1) + P(X = 2b|J = 1) E(Z|X = 2b, J = 1)   (6.72)
           = (1/2) E(Z|X = b, J = 1) + (1/2) E(Z|X = 2b, J = 1)   (6.73)
           = (1/2)( [1 − p(b)] b + p(b) 2b ) + (1/2)( [1 − p(2b)] 2b + p(2b) b )   (6.74)
           = 3b/2 + (b/2)( p(b) − p(2b) )   (6.75)
           > 3b/2,   (6.76)

as long as p(b) − p(2b) > 0, which holds since p(x) is strictly decreasing. Thus E(Z) > E(X).
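The advantage of randomized switching can also be seen in a simulation. The Python sketch below (added for illustration; the helper name simulate is made up) compares the average amount received, E(Z), with the average amount in the envelope first opened, E(X).

```python
import math, random

def simulate(b=1.0, trials=200_000, seed=1):
    """Monte Carlo check that randomized switching with
    p(x) = e^{-x} / (e^{-x} + e^{x}) beats never switching."""
    rng = random.Random(seed)
    total_x, total_z = 0.0, 0.0
    for _ in range(trials):
        amounts = [b, 2 * b]
        rng.shuffle(amounts)
        x, y = amounts                              # x is the amount observed
        p_switch = math.exp(-x) / (math.exp(-x) + math.exp(x))
        z = y if rng.random() < p_switch else x     # switch with probability p(x)
        total_x += x
        total_z += z
    return total_x / trials, total_z / trials

print(simulate())   # E(X) ≈ 1.5b, while E(Z) is slightly larger, as shown in part (d)
```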

12. Gambling. Find the horse win probabilities p1, p2, . . . , pm

(a) maximizing the doubling rate W ∗ for given fixed known odds o1, o2, . . . , om .

(b) minimizing the doubling rate for given fixed odds o1, o2, . . . , om .

Solution: Gambling

From Theorem 6.1.2, W∗ = Σ pi log oi − H(p). We can also write this as

W∗ = Σ_i pi log pi oi   (6.78)
   = Σ_i pi log( pi/(1/oi) )   (6.79)
   = Σ_i pi log(pi/qi) − Σ_i pi log( Σ_j 1/oj )   (6.80)
   = D(p||q) − log( Σ_j 1/oj ),   (6.81)

where

qi = (1/oi) / ( Σ_j 1/oj ).   (6.82)

(a) The doubling rate is maximized by maximizing D(p||q), which occurs when all the probability is concentrated on the horse with the smallest qi, i.e., pi = 1 for the horse that provides the maximum odds.

(b) The doubling rate is minimized when D(p||q) = 0, i.e., pi = qi. This is the distribution that minimizes the growth rate, and the minimum value is −log( Σ_j 1/oj ).

13. Dutch book. Consider a horse race with m = 2 horses,

X = 1, 2

p = 1/2, 1/2

Odds (for one) = 10, 30

Bets = b, 1 − b.

The odds are super fair.

(a) There is a bet b which guarantees the same payoff regardless of which horse wins. Such a bet is called a Dutch book. Find this b and the associated wealth factor S(X).

(b) What is the maximum growth rate of the wealth for this gamble? Compare it to the growth rate for the Dutch book.

Solution: Dutch book.

(a)

10 bD = 30(1 − bD)
40 bD = 30
bD = 3/4.

Therefore,

W(bD, p) = (1/2) log(10 · 3/4) + (1/2) log(30 · 1/4) = 2.91

and

SD(X) = 2^{W(bD, p)} = 7.5.

(b) In general,

W(b, p) = (1/2) log(10b) + (1/2) log(30(1 − b)).

Setting ∂W/∂b to zero, we get

(1/2)( 10/(10b∗) ) + (1/2)( −30/(30 − 30b∗) ) = 0
1/(2b∗) − 1/(2(1 − b∗)) = 0
b∗ = 1/2.

Hence

W∗(p) = (1/2) log 5 + (1/2) log 15 = 3.11,
W(bD, p) = 2.91,

and

S∗ = 2^{W∗} = 8.66,  SD = 2^{WD} = 7.5.
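A quick numeric check of both strategies (illustrative only):

```python
from math import log2
bD = 3 / 4                                              # Dutch book: 10*bD = 30*(1 - bD)
WD = 0.5 * log2(10 * bD) + 0.5 * log2(30 * (1 - bD))
Ws = 0.5 * log2(10 * 0.5) + 0.5 * log2(30 * 0.5)        # growth-optimal b* = 1/2
print(WD, 2 ** WD)   # ≈ 2.91 and 7.5
print(Ws, 2 ** Ws)   # ≈ 3.11 and ≈ 8.66
```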

14. Horse race. Suppose one is interested in maximizing the doubling rate for a horse race. Let p1, p2, . . . , pm denote the win probabilities of the m horses. When do the odds (o1, o2, . . . , om) yield a higher doubling rate than the odds (o'1, o'2, . . . , o'm)?

Solution: Horse Race (Repeat of problem 9)

Let W and W' denote the optimal doubling rates for the odds (o1, o2, . . . , om) and (o'1, o'2, . . . , o'm) respectively. By Theorem 6.1.2 in the book,

W = Σ pi log oi − H(p),  and  W' = Σ pi log o'i − H(p),

where p is the probability vector (p1, p2, . . . , pm). Then W > W' exactly when Σ pi log oi > Σ pi log o'i; that is, when

E log oi > E log o'i.

15. Entropy of a fair horse race. Let X ∼ p(x), x = 1, 2, . . . ,m, denote the winner of a horse race. Suppose the odds o(x) are fair with respect to p(x), i.e., o(x) = 1/p(x). Let b(x) be the amount bet on horse x, b(x) ≥ 0, Σ_{x=1}^m b(x) = 1. Then the resulting wealth factor is S(x) = b(x)o(x), with probability p(x).

(a) Find the expected wealth ES(X) .


(b) Find W ∗ , the optimal growth rate of wealth.

(c) Suppose Y = 1 if X = 1 or 2, and Y = 0 otherwise. If this side information is available before the bet, how much does it increase the growth rate W∗?

(d) Find I(X;Y ) .

Solution: Entropy of a fair horse race.

(a) The expected wealth ES(X) is

ES(X) = Σ_{x=1}^m S(x) p(x)   (6.83)
      = Σ_{x=1}^m b(x) o(x) p(x)   (6.84)
      = Σ_{x=1}^m b(x)   (since o(x) = 1/p(x))   (6.85)
      = 1.   (6.86)

(b) The optimal growth rate of wealth, W∗, is achieved when b(x) = p(x) for all x, in which case

W∗ = E(log S(X))   (6.87)
   = Σ_{x=1}^m p(x) log(b(x) o(x))   (6.88)
   = Σ_{x=1}^m p(x) log(p(x)/p(x))   (6.89)
   = Σ_{x=1}^m p(x) log 1   (6.90)
   = 0,   (6.91)

so we maintain our current wealth.

(c) The increase in our growth rate due to the side information is given by I(X; Y). Let q = Pr(Y = 1) = p(1) + p(2). Then

I(X; Y) = H(Y) − H(Y|X)   (6.92)
        = H(Y)   (since Y is a deterministic function of X)   (6.93)
        = H(q).   (6.94)

(d) Already computed above.


16. Negative horse race. Consider a horse race with m horses with win probabilities p1, p2, . . . , pm. Here the gambler hopes a given horse will lose. He places bets (b1, b2, . . . , bm), Σ_{i=1}^m bi = 1, on the horses, loses his bet bi if horse i wins, and retains the rest of his bets. (No odds.) Thus S = Σ_{j≠i} bj, with probability pi, and one wishes to maximize Σ pi ln(1 − bi) subject to the constraint Σ bi = 1.

(a) Find the growth rate optimal investment strategy b∗. Do not constrain the bets to be positive, but do constrain the bets to sum to 1. (This effectively allows short selling and margin.)

(b) What is the optimal growth rate?

Solution: Negative horse race

(a) Let b'i = 1 − bi ≥ 0, and note that Σ_i b'i = m − 1. Let qi = b'i / Σ_j b'j. Then {qi} is a probability distribution on {1, 2, . . . ,m}. Now,

W = Σ_i pi log(1 − bi)
  = Σ_i pi log( qi (m − 1) )
  = log(m − 1) + Σ_i pi log( pi qi/pi )
  = log(m − 1) − H(p) − D(p||q).

Thus, W∗ is obtained upon setting D(p||q) = 0, which means making the bets such that pi = qi = b'i/(m − 1), or bi = 1 − (m − 1)pi. Alternatively, one can use Lagrange multipliers to solve the problem.

(b) From (a) we directly see that setting D(p‖q) = 0 implies W ∗ = log(m−1)−H(p) .

17. The St. Petersburg paradox. Many years ago in ancient St. Petersburg the following gambling proposition caused great consternation. For an entry fee of c units, a gambler receives a payoff of 2^k units with probability 2^{−k}, k = 1, 2, . . . .

(a) Show that the expected payoff for this game is infinite. For this reason, it was argued that c = ∞ was a "fair" price to pay to play this game. Most people find this answer absurd.

(b) Suppose that the gambler can buy a share of the game. For example, if he invests c/2 units in the game, he receives 1/2 a share and a return X/2, where Pr(X = 2^k) = 2^{−k}, k = 1, 2, . . . . Suppose X1, X2, . . . are i.i.d. according to this distribution and the gambler reinvests all his wealth each time. Thus his wealth Sn at time n is given by

Sn = Π_{i=1}^n (Xi/c).   (6.95)

Show that this limit is ∞ or 0, with probability one, according as c < c∗ or c > c∗. Identify the "fair" entry fee c∗.


More realistically, the gambler should be allowed to keep a proportion 1 − b of his money in his pocket and invest the remaining proportion b in the St. Petersburg game. His wealth at time n is then

Sn = Π_{i=1}^n ( 1 − b + b Xi/c ).   (6.96)

Let

W(b, c) = Σ_{k=1}^∞ 2^{−k} log( 1 − b + b 2^k/c ).   (6.97)

We have

Sn ≐ 2^{n W(b,c)}.   (6.98)

Let

W∗(c) = max_{0≤b≤1} W(b, c).   (6.99)

Here are some questions about W∗(c).

(c) For what value of the entry fee c does the optimizing value b∗ drop below 1?

(d) How does b∗ vary with c?

(e) How does W∗(c) fall off with c?

Note that since W∗(c) > 0 for all c, we can conclude that any entry fee c is fair.

Solution: The St. Petersburg paradox.

(a) The expected return is

EX = Σ_{k=1}^∞ P(X = 2^k) 2^k = Σ_{k=1}^∞ 2^{−k} 2^k = Σ_{k=1}^∞ 1 = ∞.   (6.100)

Thus the expected return on the game is infinite.

(b) By the strong law of large numbers, we see that

(1/n) log Sn = (1/n) Σ_{i=1}^n log Xi − log c → E log X − log c  w.p. 1,   (6.101)

and therefore Sn goes to infinity or 0 according to whether E log X is greater or less than log c. Therefore

log c∗ = E log X = Σ_{k=1}^∞ k 2^{−k} = 2,   (6.102)

so the fair entry fee is c∗ = 4 units if the gambler is forced to invest all his money.


Figure 6.1: St. Petersburg: W(b, c) as a function of b and c.

(c) If the gambler is not required to invest all his money, then the growth rate is

W(b, c) = Σ_{k=1}^∞ 2^{−k} log( 1 − b + b 2^k/c ).   (6.103)

For b = 0, W = 0, and for b = 1, W = E log X − log c = 2 − log c. Differentiating to find the optimum value of b, we obtain

∂W(b, c)/∂b = Σ_{k=1}^∞ 2^{−k} ( −1 + 2^k/c ) / ( 1 − b + b 2^k/c ).   (6.104)

Unfortunately, there is no explicit solution for the b that maximizes W for a given value of c, and we have to solve this numerically on the computer.

We have illustrated the results with three plots. The first (Figure 6.1) shows W(b, c) as a function of b and c. The second (Figure 6.2) shows b∗ as a function of c, and the third (Figure 6.3) shows W∗ as a function of c.

Figure 6.2: St. Petersburg: b∗ as a function of c.

Figure 6.3: St. Petersburg: W∗(b∗, c) as a function of c.

From Figure 6.2, it is clear that b∗ is less than 1 for c > 3. We can also see this analytically by calculating the slope ∂W(b, c)/∂b at b = 1:

∂W(b, c)/∂b |_{b=1} = Σ_{k=1}^∞ 2^{−k} ( −1 + 2^k/c ) / ( 1 − b + b 2^k/c ) |_{b=1}   (6.105)
                    = Σ_k 2^{−k} (c/2^k)( 2^k/c − 1 )   (6.106)
                    = Σ_{k=1}^∞ 2^{−k} − Σ_{k=1}^∞ c 2^{−2k}   (6.107)
                    = 1 − c/3,   (6.108)

which is positive for c < 3. Thus for c < 3, the optimal value of b lies on the boundary of the region of b's, and for c > 3, the optimal value of b lies in the interior.

(d) The variation of b∗ with c is shown in Figure 6.2. As c → ∞, b∗ → 0. We have a conjecture (based on numerical results) that b∗ → (1/√2) c 2^{−c} as c → ∞, but we do not have a proof.

(e) The variation of W ∗ with c is shown in Figure 6.3.
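The numerical computation mentioned above is easy to reproduce. The sketch below (added for illustration; W and b_star are names made up here) truncates the sum in (6.97) and maximizes the concave function W(·, c) by ternary search.

```python
from math import log2

def W(b, c, kmax=60):
    """Truncated growth rate W(b, c) of Eq. (6.97); terms beyond kmax are negligible."""
    return sum(2 ** -k * log2(1 - b + b * 2 ** k / c) for k in range(1, kmax + 1))

def b_star(c, tol=1e-9):
    """Ternary search for the maximizer of the concave function W(., c) on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if W(m1, c) < W(m2, c):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

for c in [2, 3, 4, 6, 10]:
    b = b_star(c)
    print(c, round(b, 4), round(W(b, c), 4))   # b* = 1 for c <= 3, then drops below 1
```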

18. Super St. Petersburg. Finally, we have the super St. Petersburg paradox, where Pr(X = 2^{2^k}) = 2^{−k}, k = 1, 2, . . . . Here the expected log wealth is infinite for all b > 0, for all c, and the gambler's wealth grows to infinity faster than exponentially for any b > 0. But that doesn't mean all investment ratios b are equally good. To see this, we wish to maximize the relative growth rate with respect to some other portfolio, say, b = (1/2, 1/2). Show that there exists a unique b maximizing

E ln[ ( 1 − b + bX/c ) / ( 1/2 + (1/2)X/c ) ]

and interpret the answer.

Solution: Super St. Petersburg. With Pr(X = 2^{2^k}) = 2^{−k}, k = 1, 2, . . . , we have

E log X = Σ_k 2^{−k} log 2^{2^k} = Σ_k 2^{−k} 2^k = ∞,   (6.109)

and thus with any constant entry fee, the gambler's money grows to infinity faster than exponentially, since for any b > 0,

W(b, c) = Σ_{k=1}^∞ 2^{−k} log( 1 − b + b 2^{2^k}/c ) > Σ_k 2^{−k} log( b 2^{2^k}/c ) = ∞.   (6.110)

But if we wish to maximize the wealth relative to the (1/2, 1/2) portfolio, we need to maximize

J(b, c) = Σ_k 2^{−k} log[ ( 1 − b + b 2^{2^k}/c ) / ( 1/2 + (1/2) 2^{2^k}/c ) ].   (6.111)

As in the case of the St. Petersburg problem, we cannot solve this problem explicitly. In this case, a computer solution is fairly straightforward, although there are some complications. For example, for k = 6, 2^{2^k} is outside the normal range of numbers representable on a standard computer. However, for k ≥ 6, we can approximate the ratio within the log by b/0.5 without any loss of accuracy. Using this, we can do a simple numerical computation as in the previous problem.

Figure 6.4: Super St. Petersburg: J(b, c) as a function of b and c.

As before, we have illustrated the results with three plots. The first (Figure 6.4) shows J(b, c) as a function of b and c. The second (Figure 6.5) shows b∗ as a function of c, and the third (Figure 6.6) shows J∗ as a function of c.

These plots indicate that for large values of c, the optimum strategy is not to put all the money into the game, even though the money grows at an infinite rate. There exists a unique b∗ which maximizes the expected ratio, which therefore causes the wealth to grow to infinity at the fastest possible rate. Thus there exists an optimal b∗ even when the log-optimal portfolio is undefined.


Figure 6.5: Super St. Petersburg: b∗ as a function of c.

Figure 6.6: Super St. Petersburg: J∗(b∗, c) as a function of c.
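The same kind of numerical search works here, using the b/0.5 approximation mentioned above for large k to avoid overflow. The sketch below (added for illustration; J and b_star are names made up here) locates the maximizing b for a few entry fees.

```python
from math import log2

def J(b, c, kmax=40):
    """Truncated relative growth rate J(b, c) of Eq. (6.111).  For large k the
    ratio inside the log is replaced by b/0.5, so no huge powers are needed."""
    total = 0.0
    for k in range(1, kmax + 1):
        if 2 ** k > 400:                  # 2^{2^k} would overflow a float
            ratio = b / 0.5
        else:
            x = 2.0 ** (2 ** k)
            ratio = (1 - b + b * x / c) / (0.5 + 0.5 * x / c)
        total += 2 ** -k * log2(ratio)
    return total

def b_star(c, tol=1e-9):
    """Ternary search for the maximizer of the concave function J(., c) on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if J(m1, c) < J(m2, c):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

for c in [1, 10, 100, 1000]:
    print(c, round(b_star(c), 4))   # b* < 1 for large c, as the plots indicate
```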


Bibliography

[1] I. Csiszar and J. Korner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, 1981.

[2] R.G. Gallager. Information Theory and Reliable Communication. Wiley, New York, 1968.

[3] R.G. Gallager. Variations on a theme by Huffman. IEEE Trans. Inform. Theory, IT-24:668–674, 1978.

[4] A. Renyi. Wahrscheinlichkeitsrechnung, mit einem Anhang über Informationstheorie. VEB Deutscher Verlag der Wissenschaften, Berlin, 1962.

[5] A.A. Sardinas and G.W. Patterson. A necessary and sufficient condition for the unique decomposition of coded messages. In IRE Convention Record, Part 8, pages 104–108, 1953.
