
    Entropy, Relative Entropy and Mutual Information

    Prof. Ja-Ling Wu

    Department of Computer Science and Information Engineering, National Taiwan University


    Definition: The entropy H(X) of a discrete random variable X is defined by

        H(X) = −∑_{x∈X} p(x) log p(x)

    log is taken to base 2: H(p) is measured in bits.

    Convention: 0 log 0 = 0 (since x log x → 0 as x → 0), so adding terms of zero probability does not change the entropy.
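    A small Python sketch of this definition (standard library only; the four-symbol distribution is just an illustrative example, not one taken from the slides):

        from math import log2

        def entropy(p):
            """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 are skipped (0 log 0 = 0)."""
            return -sum(px * log2(px) for px in p if px > 0)

        # Example distribution over four symbols
        p = [0.5, 0.25, 0.125, 0.125]
        print(entropy(p))   # 1.75 bits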


    Note that entropy is a function of the distribution of X. It does not depend on the actual values taken by the r.v. X, but only on the probabilities.

    Remark: If X ~ p(x), then the expected value of the r.v. g(X) is written as

        E_p g(X) = ∑_{x∈X} g(x) p(x)

    The entropy of X can be interpreted as the expected value of the self-information log(1/p(X)):

        H(X) = E_p log [1/p(X)]


    Lemma 1.1: H(X) ≥ 0

    Lemma 1.2: H_b(X) = (log_b a) H_a(X)

    Ex: Let

        X = 1 with probability p,  X = 0 with probability 1−p.

    Then

        H(X) = −p log p − (1−p) log(1−p) =: H_2(p)

    [Figure: plot of the binary entropy H_2(p) as a function of p on [0, 1]]

    1) H(X) = 1 bit when p = 1/2
    2) H(X) is a concave function of p
    3) H(X) = 0 if p = 0 or 1
    4) max H(X) occurs when p = 1/2
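    A quick numerical check of these properties (the probabilities 0.1 and 0.9 below are arbitrary example values):

        from math import log2

        def H2(p):
            """Binary entropy H_2(p) = -p log2 p - (1-p) log2 (1-p), with H2(0) = H2(1) = 0."""
            if p in (0.0, 1.0):
                return 0.0
            return -p * log2(p) - (1 - p) * log2(1 - p)

        print(H2(0.5))            # 1.0 bit: the maximum (properties 1 and 4)
        print(H2(0.0), H2(1.0))   # 0.0 0.0 (property 3)
        # Concavity at one pair of points (property 2):
        print(H2(0.5 * 0.1 + 0.5 * 0.9) >= 0.5 * H2(0.1) + 0.5 * H2(0.9))   # True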


    Joint Entropy and Conditional Entropy

    Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

        H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y) = −E log p(X, Y)

    Definition: The conditional entropy H(Y|X) is defined as

        H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
               = −∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)
               = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
               = −E_{p(x,y)} log p(Y|X)
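    A sketch that computes H(X, Y) and H(Y|X) directly from a joint table (the joint pmf below is an arbitrary illustrative example):

        from math import log2

        p = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}   # example p(x, y)

        H_XY = -sum(v * log2(v) for v in p.values() if v > 0)          # H(X, Y)

        px = {}                                                         # marginal p(x)
        for (x, y), v in p.items():
            px[x] = px.get(x, 0.0) + v

        # H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x), with p(y|x) = p(x, y)/p(x)
        H_Y_given_X = -sum(v * log2(v / px[x]) for (x, y), v in p.items() if v > 0)

        H_X = -sum(v * log2(v) for v in px.values() if v > 0)
        print(H_XY, H_X + H_Y_given_X)   # equal values, anticipating the chain rule below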


    Theorem 1.1 (Chain Rule): H(X, Y) = H(X) + H(Y|X)

    pf:
        H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
                = −∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x) p(y|x)]
                = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
                = −∑_{x∈X} p(x) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
                = H(X) + H(Y|X)

    or, equivalently, we can write

        log p(X, Y) = log p(X) + log p(Y|X)

    and take expectations of both sides.


    Corollary:

    H(X, Y|Z) = H(X|Z) + H(Y|X,Z)

    Remark:

    (i) H(Y|X) ≠ H(X|Y)

    (ii) H(X) − H(X|Y) = H(Y) − H(Y|X)


    Relative Entropy and Mutual Information

    The entropy of a random variable is a measure of the uncertainty of the random variable; it is a measure of the amount of information required on the average to describe the random variable.

    The relative entropy is a measure of the distance between two distributions. In statistics, it arises as an expected logarithm of the likelihood ratio. The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.


    Ex: If we knew the true distribution of the r.v., then we could construct a code with average description length H(p). If, instead, we used the code for a distribution q, we would need H(p) + D(p||q) bits on the average to describe the r.v.


    Definition:

    The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

        D(p||q) = ∑_{x∈X} p(x) log [p(x)/q(x)]
                = E_p log [p(X)/q(X)]
                = E_p log [1/q(X)] − E_p log [1/p(X)]


    Definition:

    Consider two r.v.s X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y), i.e.,

        I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x, y) / (p(x) p(y))]
               = D( p(x, y) || p(x) p(y) )
               = E_{p(x,y)} log [p(X, Y) / (p(X) p(Y))]
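    A sketch computing I(X;Y) as the relative entropy between the joint distribution and the product of its marginals (the joint table is an arbitrary illustrative example):

        from math import log2

        p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # example joint p(x, y)

        px, py = {}, {}
        for (x, y), v in p.items():
            px[x] = px.get(x, 0.0) + v
            py[y] = py.get(y, 0.0) + v

        I_XY = sum(v * log2(v / (px[x] * py[y])) for (x, y), v in p.items() if v > 0)
        print(I_XY)   # I(X;Y) = D(p(x, y) || p(x) p(y)), in bits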


    Ex: Let X = {0, 1} and consider two distributions p and q on X. Let p(0) = 1−r, p(1) = r, and let q(0) = 1−s, q(1) = s. Then

        D(p||q) = p(0) log [p(0)/q(0)] + p(1) log [p(1)/q(1)]
                = (1−r) log [(1−r)/(1−s)] + r log (r/s)

    and

        D(q||p) = q(0) log [q(0)/p(0)] + q(1) log [q(1)/p(1)]
                = (1−s) log [(1−s)/(1−r)] + s log (s/r)

    If r = s, then D(p||q) = D(q||p) = 0. While, in general,

        D(p||q) ≠ D(q||p)
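    A numerical illustration of this asymmetry (r = 1/2 and s = 1/4 are arbitrary example values):

        from math import log2

        def D(p, q):
            """Relative entropy D(p||q) in bits for pmfs given as lists."""
            return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

        r, s = 0.5, 0.25
        p = [1 - r, r]   # p(0), p(1)
        q = [1 - s, s]   # q(0), q(1)
        print(D(p, q), D(q, p))   # about 0.2075 and 0.1887: D(p||q) != D(q||p) in general
        print(D(p, p))            # 0.0 when the two distributions coincide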


    Relationship between Entropy and Mutual Information

    Rewrite I(X;Y) as

        I(X;Y) = ∑_{x,y} p(x, y) log [p(x, y) / (p(x) p(y))]
               = ∑_{x,y} p(x, y) log [p(x|y) / p(x)]
               = −∑_{x,y} p(x, y) log p(x) + ∑_{x,y} p(x, y) log p(x|y)
               = −∑_{x} p(x) log p(x) − ( −∑_{x,y} p(x, y) log p(x|y) )
               = H(X) − H(X|Y)


    Thus the mutual information I(X;Y) is the reduction in the uncertainty of X due to the knowledge of Y.

    By symmetry, it follows that

        I(X;Y) = H(Y) − H(Y|X)

    X says as much about Y as Y says about X.

    Since H(X, Y) = H(X) + H(Y|X), we also have I(X;Y) = H(X) + H(Y) − H(X, Y).

        I(X;X) = H(X) − H(X|X) = H(X)

    The mutual information of a r.v. with itself is the entropy of the r.v.; hence entropy is also called self-information.


    Theorem: (Mutual information and entropy):

    i. I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

             = H(X) + H(Y) − H(X, Y)

    ii. I(X;Y) = I(Y;X)

    iii. I(X;X) = H(X)

    [Figure: Venn diagram relating H(X), H(Y), H(X|Y), H(Y|X), I(X;Y), and H(X, Y)]


    Chain Rules for Entropy, Relative Entropy and Mutual Information

    Theorem: (Chain rule for entropy)

    Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n).

    Then

        H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1)
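    A sketch that checks the chain rule on a three-variable joint pmf, computing each conditional entropy H(X_i | X_{i−1}, ..., X_1) directly (the joint probabilities are arbitrary example values):

        from math import log2

        p = {(0, 0, 0): 0.10, (0, 0, 1): 0.20, (0, 1, 0): 0.05, (0, 1, 1): 0.15,
             (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.20, (1, 1, 1): 0.10}

        def marginal(dist, k):
            """Marginal pmf of the first k coordinates."""
            out = {}
            for xs, v in dist.items():
                out[xs[:k]] = out.get(xs[:k], 0.0) + v
            return out

        def H(dist):
            return -sum(v * log2(v) for v in dist.values() if v > 0)

        def cond_entropy(dist, i):
            """H(X_i | X_1, ..., X_{i-1}), with i counted from 1."""
            joint_i, past = marginal(dist, i), marginal(dist, i - 1)
            return -sum(v * log2(v / past[xs[:-1]]) for xs, v in joint_i.items() if v > 0)

        print(H(p))                                           # H(X1, X2, X3)
        print(sum(cond_entropy(p, i) for i in range(1, 4)))   # same value, via the chain rule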


    Proof

    (1)
        H(X_1, X_2) = H(X_1) + H(X_2|X_1)

        H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1)
                         = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1)
        ...
        H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_{n−1}, ..., X_1)
                              = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1)


    (2) We write

        p(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} p(x_i | x_{i−1}, ..., x_1);

    then

        H(X_1, X_2, ..., X_n)
          = −∑_{x_1, x_2, ..., x_n} p(x_1, x_2, ..., x_n) log p(x_1, x_2, ..., x_n)
          = −∑_{x_1, x_2, ..., x_n} p(x_1, x_2, ..., x_n) log ∏_{i=1}^{n} p(x_i | x_{i−1}, ..., x_1)
          = −∑_{x_1, x_2, ..., x_n} ∑_{i=1}^{n} p(x_1, x_2, ..., x_n) log p(x_i | x_{i−1}, ..., x_1)
          = −∑_{i=1}^{n} ∑_{x_1, x_2, ..., x_i} p(x_1, x_2, ..., x_i) log p(x_i | x_{i−1}, ..., x_1)
          = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1)


    Definition:

    The conditional mutual information of r.v.s X and Y given Z is defined by

        I(X;Y|Z) = H(X|Z) − H(X|Y, Z)
                 = E_{p(x,y,z)} log [p(X, Y|Z) / (p(X|Z) p(Y|Z))]


    Theorem: (Chain rule for mutual information)

        I(X_1, X_2, ..., X_n ; Y) = ∑_{i=1}^{n} I(X_i ; Y | X_1, X_2, ..., X_{i−1})

    proof:

        I(X_1, X_2, ..., X_n ; Y)
          = H(X_1, X_2, ..., X_n) − H(X_1, X_2, ..., X_n | Y)
          = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1) − ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1, Y)
          = ∑_{i=1}^{n} I(X_i ; Y | X_1, X_2, ..., X_{i−1})


    Definition:

    The conditional relative entropy D(p(y|x) || q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x).

    Theorem: (Chain rule for relative entropy)

    D(p(x,y)||q(x,y)) = D(p(x)||q(x))+ D(p(y|x)||q(y|x))

    Explicitly,

        D( p(y|x) || q(y|x) ) = ∑_x p(x) ∑_y p(y|x) log [p(y|x) / q(y|x)]
                              = E_{p(x,y)} log [p(Y|X) / q(Y|X)]


    Jensen's Inequality and Its Consequences

    Definition: A function f is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

        f(λx_1 + (1−λ)x_2) ≤ λ f(x_1) + (1−λ) f(x_2)

    A function f is said to be strictly convex if equality holds only if λ = 0 or λ = 1.

    Definition: A function f is concave if −f is convex.

    Ex: convex functions: x², |x|, e^x, x log x (for x ≥ 0)

    concave functions: log x, x^(1/2) for x ≥ 0

    both convex and concave: ax + b; linear functions


    Theorem:

    If the function f has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex).

        E X = ∑_{x∈X} x p(x)     (discrete case)

        E X = ∫ x p(x) dx        (continuous case)


    Theorem: (Jensen's inequality): If f is a convex function and X is a random variable, then E f(X) ≥ f(E X).

    Proof: For a two mass point distribution, the inequality becomes

        p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),   p_1 + p_2 = 1,

    which follows directly from the definition of convex functions. Suppose the theorem is true for distributions with K−1 mass points. (Mathematical induction.)

    Then, writing p'_i = p_i/(1−p_K) for i = 1, 2, ..., K−1, we have

        ∑_{i=1}^{K} p_i f(x_i) = p_K f(x_K) + (1−p_K) ∑_{i=1}^{K−1} p'_i f(x_i)
                               ≥ p_K f(x_K) + (1−p_K) f( ∑_{i=1}^{K−1} p'_i x_i )
                               ≥ f( p_K x_K + (1−p_K) ∑_{i=1}^{K−1} p'_i x_i )
                               = f( ∑_{i=1}^{K} p_i x_i )

    The proof can be extended to continuous distributions by continuity arguments.
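    A quick numerical check of Jensen's inequality for the convex function f(x) = x^2 (the mass points and probabilities are arbitrary example values):

        xs = [-1.0, 0.0, 2.0, 5.0]
        ps = [0.1, 0.2, 0.4, 0.3]

        f = lambda x: x * x                             # a convex function
        Ef = sum(p * f(x) for p, x in zip(ps, xs))      # E f(X)
        fE = f(sum(p * x for p, x in zip(ps, xs)))      # f(E X)
        print(Ef, fE, Ef >= fE)                         # 9.2  4.84...  True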


    Theorem: (Information inequality):

    Let p(x), q(x), x ∈ X, be two probability mass functions. Then

        D(p||q) ≥ 0

    with equality iff p(x)=q(x) for all x.

    Proof: Let A={x:p(x)>0} be the support set of p(x). Then

        D(p||q) = ∑_{x∈A} p(x) log [p(x)/q(x)]
                = −∑_{x∈A} p(x) log [q(x)/p(x)]
                ≥ −log ∑_{x∈A} p(x) [q(x)/p(x)]      (Jensen; log t is concave)
                = −log ∑_{x∈A} q(x)
                ≥ −log ∑_{x∈X} q(x)
                = −log 1 = 0


    Corollary: (Non-negativity of mutual information):

    For any two rvs., X, Y,

        I(X;Y) ≥ 0

    with equality iff X and Y are independent.

    Proof:

    I(X;Y) = D(p(x, y) || p(x) p(y)) ≥ 0, with equality iff p(x, y) = p(x) p(y), i.e., X and Y are independent.

    Corollary:

        D(p(y|x) || q(y|x)) ≥ 0

    with equality iff p(y|x) = q(y|x) for all x and y with p(x) > 0.

    Corollary:

        I(X;Y|Z) ≥ 0

    with equality iff X and Y are conditionally independent given Z.


    Theorem: H(X) ≤ log|X|, where |X| denotes the number of elements in the range of X, with equality iff X has a uniform distribution over X.

    Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then

        D(p||u) = ∑ p(x) log [p(x)/u(x)] = log|X| − H(X)

    Hence, by the non-negativity of relative entropy,

        0 ≤ D(p||u) = log|X| − H(X)


    Theorem: (conditioning reduces entropy):

        H(X|Y) ≤ H(X)

    with equality iff X and Y are independent.

    Proof: 0 ≤ I(X;Y) = H(X) − H(X|Y)

    Note that this is true only on the average; specifically, H(X|Y = y) may be greater than, less than, or equal to H(X), but on the average H(X|Y) = ∑_y p(y) H(X|Y = y) ≤ H(X).


    Ex: Let (X, Y) have the following joint distribution p(x, y):

                  X = 1    X = 2
        Y = 1       0       3/4
        Y = 2      1/8      1/8

    Then H(X) = H(1/8, 7/8) = 0.544 bits,

        H(X|Y=1) = 0 bits
        H(X|Y=2) = 1 bit  > H(X)

    However, H(X|Y) = 3/4 · H(X|Y=1) + 1/4 · H(X|Y=2) = 0.25 bits < H(X).
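    These numbers can be reproduced directly from the table (indexing the joint pmf by (x, y) as in the reconstruction above):

        from math import log2

        p = {(1, 1): 0.0, (2, 1): 3 / 4, (1, 2): 1 / 8, (2, 2): 1 / 8}   # p[(x, y)]

        px, py = {}, {}
        for (x, y), v in p.items():
            px[x] = px.get(x, 0.0) + v
            py[y] = py.get(y, 0.0) + v

        H = lambda d: -sum(v * log2(v) for v in d.values() if v > 0)
        H_X = H(px)                                                          # about 0.544 bits
        H_X_given_Y = -sum(v * log2(v / py[y]) for (x, y), v in p.items() if v > 0)
        print(H_X, H_X_given_Y)   # 0.5435...  0.25: H(X|Y) < H(X) on the average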


    Theorem: (Independence bound on entropy):

    Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

        H(X_1, X_2, ..., X_n) ≤ ∑_{i=1}^{n} H(X_i)

    with equality iff the X_i are independent.

    Proof: By the chain rule for entropies,

        H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1) ≤ ∑_{i=1}^{n} H(X_i)

    with equality iff the X_i's are independent.



    The LOG SUM INEQUALITY AND ITS APPLICATIONS

    Theorem: (Log sum inequality)

    For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

        ∑_{i=1}^{n} a_i log (a_i/b_i) ≥ ( ∑_{i=1}^{n} a_i ) log [ ( ∑_{i=1}^{n} a_i ) / ( ∑_{i=1}^{n} b_i ) ]

    with equality iff a_i/b_i = constant.

    Some conventions: 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.
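    A numerical check with arbitrary positive numbers:

        from math import log2

        a = [1.0, 2.0, 3.0]
        b = [2.0, 1.0, 4.0]

        lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
        rhs = sum(a) * log2(sum(a) / sum(b))
        print(lhs, rhs, lhs >= rhs)   # log sum inequality: lhs >= rhs, so this prints True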


    Proof: Assume w.l.o.g. that a_i > 0 and b_i > 0. The function f(t) = t log t is strictly convex, since

        f''(t) = (1/t) log e > 0

    for all positive t. Hence, by Jensen's inequality, we have

        ∑_i α_i f(t_i) ≥ f( ∑_i α_i t_i )      for α_i ≥ 0, ∑_i α_i = 1.

    Setting α_i = b_i / ∑_{j} b_j and t_i = a_i / b_i, we obtain

        ∑_i [ a_i / ∑_j b_j ] log (a_i/b_i) ≥ [ ∑_i a_i / ∑_j b_j ] log [ ∑_i a_i / ∑_j b_j ]

    Multiplying both sides by ∑_j b_j gives the log sum inequality.


    Reproving the theorem that D(p||q) ≥ 0, with equality iff p(x) = q(x):

        D(p||q) = ∑ p(x) log [p(x)/q(x)]
                ≥ ( ∑ p(x) ) log [ ∑ p(x) / ∑ q(x) ]      (from the log-sum inequality)
                = 1 · log (1/1) = 0

    with equality iff p(x)/q(x) = c. Since both p and q are probability mass functions, c = 1, i.e., p(x) = q(x) for all x.


    Theorem: D(p||q) is convex in the pair (p, q); i.e., if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

        D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λ D(p_1||q_1) + (1−λ) D(p_2||q_2)      for all 0 ≤ λ ≤ 1.

    Proof: Apply the log-sum inequality to a single term (fixed x) on the left-hand side, with a_1 = λp_1(x), a_2 = (1−λ)p_2(x), b_1 = λq_1(x), b_2 = (1−λ)q_2(x):

        [λp_1(x) + (1−λ)p_2(x)] log { [λp_1(x) + (1−λ)p_2(x)] / [λq_1(x) + (1−λ)q_2(x)] }
            ≤ λp_1(x) log [λp_1(x) / λq_1(x)] + (1−λ)p_2(x) log [(1−λ)p_2(x) / (1−λ)q_2(x)]
            = λ p_1(x) log [p_1(x)/q_1(x)] + (1−λ) p_2(x) log [p_2(x)/q_2(x)]

    Summing over all x gives

        D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λ D(p_1||q_1) + (1−λ) D(p_2||q_2)


    Theorem: (concavity of entropy):

    H(p) is a concave function of p. That is,

        H(λp_1 + (1−λ)p_2) ≥ λ H(p_1) + (1−λ) H(p_2)

    Proof: H(p) = log|X| − D(p||u),

    where u is the uniform distribution on |X| outcomes. The concavity of H then follows

    directly from the convexity of D.


    Theorem: Let (X,Y)~p(x,y) = p(x)p(y|x).

    The mutual information I(X;Y) is
    (i) a concave function of p(x) for fixed p(y|x)

    (ii) a convex function of p(y|x) for fixed p(x).

    Proof:

    (1) I(X;Y) = H(Y) − H(Y|X) = H(Y) − ∑_x p(x) H(Y|X = x)      (*)

    If p(y|x) is fixed, then p(y) is a linear function of p(x)  ( p(y) = ∑_x p(x, y) = ∑_x p(x) p(y|x) ). Hence H(Y), which is a concave function of p(y), is a concave function of p(x). The second term of (*) is a linear function of p(x). Hence the difference is a concave function of p(x).


    (2) We fix p(x) and consider two different conditional distributions p_1(y|x) and p_2(y|x). The corresponding joint distributions are p_1(x, y) = p(x) p_1(y|x) and p_2(x, y) = p(x) p_2(y|x), and their respective marginals are p(x), p_1(y) and p(x), p_2(y).

    Consider a conditional distribution

        p_λ(y|x) = λ p_1(y|x) + (1−λ) p_2(y|x)

    that is a mixture of p_1(y|x) and p_2(y|x). The corresponding joint distribution is also a mixture of the corresponding joint distributions,

        p_λ(x, y) = λ p_1(x, y) + (1−λ) p_2(x, y),

    and the distribution of Y is also a mixture, p_λ(y) = λ p_1(y) + (1−λ) p_2(y).

    Hence, if we let q_λ(x, y) = p(x) p_λ(y) (the product of the marginal distributions), then q_λ(x, y) = λ q_1(x, y) + (1−λ) q_2(x, y).

    Since I(X;Y) = D(p_λ || q_λ), and D(p||q) is convex in the pair (p, q), and since (when p(x) is fixed) both p_λ(x, y) and q_λ(x, y) are linear in p_i(y|x), the mutual information is a convex function of the conditional distribution p(y|x). Therefore, the convexity of I(X;Y) follows from that of D(p||q) w.r.t. p_i(y|x) when p(x) is fixed.


    Data processing inequality:

    No clever manipulation of the data can improve the inferences that can be made from the data.

    Definition: R.v.s X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. That is, if X → Y → Z form a Markov chain, then

    (i) p(x, y, z) = p(x) p(y|x) p(z|y)

    (ii) p(x, z|y) = p(x|y) p(z|y) : X and Z are conditionally independent given Y

    X → Y → Z implies that Z → Y → X.

    If Z = f(Y), then X → Y → Z.


    Theorem: (Data processing inequality)

    If X → Y → Z, then I(X;Y) ≥ I(X;Z). No processing of Y, deterministic or random, can increase the information that Y contains about X.

    Proof: I(X; Y, Z) = I(X;Z) + I(X;Y|Z)   : chain rule

                      = I(X;Y) + I(X;Z|Y)   : chain rule

    Since X and Z are independent given Y, we have

    I(X;Z|Y) = 0. Since I(X;Y|Z) ≥ 0, we have I(X;Y) ≥ I(X;Z),

    with equality iff I(X;Y|Z) = 0, i.e., X → Z → Y forms a Markov chain. Similarly, one can prove I(Y;Z) ≥ I(X;Z).


    Corollary: If X → Y → Z forms a Markov chain and if Z = g(Y), we have I(X;Y) ≥ I(X; g(Y))

    : functions of the data Y cannot increase the information about X.

    Corollary: If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).

    Proof: I(X; Y, Z) = I(X;Z) + I(X;Y|Z)

                      = I(X;Y) + I(X;Z|Y)

    By Markovity, I(X;Z|Y) = 0,

    and I(X;Z) ≥ 0  ⇒  I(X;Y|Z) ≤ I(X;Y). The dependence of X and Y is decreased (or remains unchanged) by the observation of a downstream r.v. Z.


    Note that it is possible that I(X;Y|Z)>I(X;Y)

    when X,Y and Z do not form a Markov chain.

    Ex: Let X and Y be independent fair binary r.v.s, and let Z = X + Y. Then I(X;Y) = 0, but

        I(X;Y|Z) = H(X|Z) − H(X|Y, Z) = H(X|Z)
                 = P(Z = 1) H(X|Z = 1) = 1/2 bit.

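    This example can be verified numerically (X and Y are independent fair bits and Z = X + Y; the mutual informations are computed from the joint pmf of (X, Y, Z)):

        from math import log2
        from itertools import product

        p = {(x, y, x + y): 0.25 for x, y in product((0, 1), repeat=2)}   # joint pmf of (X, Y, Z)

        def H(dist):
            return -sum(v * log2(v) for v in dist.values() if v > 0)

        def marg(dist, idx):
            out = {}
            for xs, v in dist.items():
                key = tuple(xs[i] for i in idx)
                out[key] = out.get(key, 0.0) + v
            return out

        I_XY = H(marg(p, (0,))) + H(marg(p, (1,))) - H(marg(p, (0, 1)))
        I_XY_given_Z = H(marg(p, (0, 2))) + H(marg(p, (1, 2))) - H(marg(p, (2,))) - H(p)
        print(I_XY, I_XY_given_Z)   # 0.0 and 0.5: conditioning on Z = X + Y creates dependence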


    Fano's inequality:

    Fano's inequality relates the probability of error in guessing the r.v. X to its conditional entropy H(X|Y).

    Note that:

    The conditional entropy of a r.v. X given another random variable Y is zero iff X is a function of Y.

    proof: HW. ⇒ We can estimate X from Y with zero probability of error iff H(X|Y) = 0.

    We expect to be able to estimate X with a low probability of error only if the conditional entropy H(X|Y) is small.

    Fano's inequality quantifies this idea.

    H(X|Y) = 0 implies there is no uncertainty about X if we know Y:

    for all y with p(y) > 0, there is only one possible value of x with p(x, y) > 0.


    Suppose we wish to estimate a r.v. X with a distribution p(x). We observe a r.v. Y which is related to X by the conditional distribution p(y|x). From Y, we calculate a function

        X̂ = g(Y)

    which is an estimate of X. We wish to bound the probability that X̂ ≠ X. We observe that

        X → Y → X̂

    forms a Markov chain. Define the probability of error

        P_e = Pr{X̂ ≠ X} = Pr{g(Y) ≠ X}


    Theorem: (Fano's inequality)

    For any estimator X̂ such that X → Y → X̂, with P_e = Pr(X̂ ≠ X), we have

        H(P_e) + P_e log(|X| − 1) ≥ H(X|Y)

    This inequality can be weakened to

        1 + P_e log|X| ≥ H(X|Y)

    or

        P_e ≥ [H(X|Y) − 1] / log|X|

    (using H(P_e) ≤ 1, since E is a binary r.v., and log(|X| − 1) ≤ log|X|).

    Remark: P_e = 0 ⇒ H(X|Y) = 0
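    A sketch checking the bound on a small example (the joint pmf is an arbitrary illustrative choice, and the estimator used is the MAP guess X̂(y) = argmax_x p(x|y), one admissible estimator among many):

        from math import log2

        p = {(0, 0): 0.30, (1, 0): 0.05, (2, 0): 0.05,     # example joint pmf p(x, y)
             (0, 1): 0.05, (1, 1): 0.25, (2, 1): 0.05,
             (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.15}

        xs = {x for x, _ in p}
        ys = {y for _, y in p}
        py = {y: sum(p[(x, y)] for x in xs) for y in ys}

        H_X_given_Y = -sum(v * log2(v / py[y]) for (x, y), v in p.items() if v > 0)
        Pe = 1.0 - sum(max(p[(x, y)] for x in xs) for y in ys)   # error of the MAP estimator

        H2 = lambda q: 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)
        print(H2(Pe) + Pe * log2(len(xs) - 1) >= H_X_given_Y)    # Fano's inequality: True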


    Proof: Define an error r.v.

        E = 1 if X̂ ≠ X,   E = 0 if X̂ = X.

    By the chain rule for entropies, we have

        H(E, X | X̂) = H(X | X̂) + H(E | X, X̂)
                    = H(E | X̂) + H(X | E, X̂)

    Since conditioning reduces entropy, H(E|X̂) ≤ H(E). Now, since E is a function of X and X̂, H(E|X, X̂) = 0. Since E is a binary-valued r.v., H(E) = H(P_e).

    The remaining term, H(X|E, X̂), can be bounded as follows:

        H(X|E, X̂) = Pr(E=0) H(X|X̂, E=0) + Pr(E=1) H(X|X̂, E=1)
                  ≤ (1 − P_e) · 0 + P_e log(|X| − 1),


    since, given E = 0, X = X̂, and given E = 1, we can upper bound the conditional entropy by the log of the number of remaining outcomes (|X| − 1).

    This gives H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂). By the data processing inequality, we have I(X;X̂) ≤ I(X;Y) since X → Y → X̂, and therefore H(X|X̂) ≥ H(X|Y). Thus we have

        H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y).

    Remark: Suppose there is no knowledge of Y. Thus X must be guessed without any information. Let X ∈ {1, 2, ..., m} and P_1 ≥ P_2 ≥ ... ≥ P_m. Then the best guess of X is X̂ = 1 and the resulting probability of error is P_e = 1 − P_1.

    Fano's inequality becomes

        H(P_e) + P_e log(m − 1) ≥ H(X)

    The probability mass function

        (P_1, P_2, ..., P_m) = (1 − P_e, P_e/(m−1), ..., P_e/(m−1))

    achieves this bound with equality.



    Some Properties of the Relative Entropy

    1. Let μ_n and μ'_n be two probability distributions on the state space of a Markov chain at time n, and let μ_{n+1} and μ'_{n+1} be the corresponding distributions at time n+1. Let the corresponding joint mass functions be denoted by p and q.

    That is,

        p(x_n, x_{n+1}) = p(x_n) r(x_{n+1} | x_n)

        q(x_n, x_{n+1}) = q(x_n) r(x_{n+1} | x_n)

    where r(·|·) is the probability transition function for the Markov chain.


    Then by the chain rule for relative entropy, we have

    the following two expansions:

        D( p(x_n, x_{n+1}) || q(x_n, x_{n+1}) )
          = D( p(x_n) || q(x_n) ) + D( p(x_{n+1}|x_n) || q(x_{n+1}|x_n) )
          = D( p(x_{n+1}) || q(x_{n+1}) ) + D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) )

    Since both p and q are derived from the same Markov chain,

        p(x_{n+1}|x_n) = q(x_{n+1}|x_n) = r(x_{n+1}|x_n),

    and hence

        D( p(x_{n+1}|x_n) || q(x_{n+1}|x_n) ) = 0


    That is,

        D( p(x_n) || q(x_n) ) = D( p(x_{n+1}) || q(x_{n+1}) ) + D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) )

    Since D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) ) ≥ 0,

        D( p(x_n) || q(x_n) ) ≥ D( p(x_{n+1}) || q(x_{n+1}) )

    or  D(μ_n || μ'_n) ≥ D(μ_{n+1} || μ'_{n+1})

    Conclusion:

    The distance between the probability mass functions is decreasing with time n for any Markov chain.


    2. The relative entropy D(μ_n || μ) between a distribution μ_n on the states at time n and a stationary distribution μ decreases with n.

    In the last equation, if we let μ'_n be any stationary distribution μ, then μ'_{n+1} is the same stationary distribution. Hence

        D(μ_n || μ) ≥ D(μ_{n+1} || μ)

    Any state distribution gets closer and closer to each stationary distribution as time passes:

        lim_{n→∞} D(μ_n || μ) = 0


    3. Def: A probability transition matrix [P_ij], P_ij = Pr{X_{n+1} = j | X_n = i}, is called doubly stochastic if

        ∑_i P_ij = 1,   j = 1, 2, ...

    and

        ∑_j P_ij = 1,   i = 1, 2, ...

    The uniform distribution is a stationary distribution of P iff the probability transition matrix is doubly stochastic.


    4. The conditional entropy H(X_n|X_1) increases with n for a stationary Markov process.

    If the Markov process is stationary, then H(X_n) is constant. So the entropy is non-increasing. However, it can be proved that H(X_n|X_1) increases with n. This implies that:

    the conditional uncertainty of the future increases.

    Proof:

        H(X_n|X_1) ≥ H(X_n|X_1, X_2)     (conditioning reduces entropy)
                   = H(X_n|X_2)          (by Markovity)
                   = H(X_{n−1}|X_1)      (by stationarity)

    Similarly: H(X_0|X_n) is increasing in n for any Markov chain.



    Sufficient Statistics

    Suppose we have a family of probability mass functions {f_θ(x)} indexed by θ, and let X be a sample from a distribution in this family. Let T(X) be any statistic (function of the sample), like the sample mean or sample variance. Then

        θ → X → T(X),

    and by the data processing inequality, we have

        I(θ; T(X)) ≤ I(θ; X)

    for any distribution on θ. However, if equality holds, no information is lost.

    A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.


    Def:

    A function T(X) is said to be a sufficientstatistic relative to the family {f(x)} if X isindependent of give T(X), i.e., T(X)Xforms a Markov chain.

    or:I(;X) = I(; T(X))

    for all distributions on

    Sufficient statistics preserve mutual information.



    Some examples of Sufficient Statistics

    1. Let X_1, X_2, ..., X_n, X_i ∈ {0, 1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter θ = Pr(X_i = 1).

    Given n, the number of 1's is a sufficient statistic for θ. Here

        T(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} X_i.

    Given T, all sequences having that many 1's are equally likely and independent of the parameter θ.


        Pr{ (X_1, X_2, ..., X_n) = (x_1, x_2, ..., x_n) | ∑_{i=1}^{n} X_i = k }
          = 1 / (n choose k)    if ∑_{i=1}^{n} x_i = k,
          = 0                   otherwise.

    Thus θ → T → (X_1, X_2, ..., X_n), and T is a sufficient statistic for θ.


    2. If X is normally distributed with mean θ and variance 1; that is, if

        f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2} = N(θ, 1),

    and X_1, X_2, ..., X_n are drawn independently according to f_θ, then a sufficient statistic for θ is the sample mean

        X̄_n = (1/n) ∑_{i=1}^{n} X_i.

    It can be verified that the conditional distribution Pr(X_1, X_2, ..., X_n | X̄_n, n) is independent of θ.


    A minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics.

    Def:

    A statistic T(X) is a minimal sufficient statistic relative to {f_θ(x)} if it is a function of every other sufficient statistic U:

        for every sufficient statistic U(X), T(X) = f(U(X)),  so that  θ → T(X) → U(X) → X.

    Hence, a minimal sufficient statistic maximally compresses the information about θ in the sample. Other sufficient statistics may contain additional irrelevant information.

    The sufficient statistics of the above examples are minimal.


    Shuffles increase Entropy:

    If T is a shuffle (permutation) of a deck of cards and X is the initial (random) position of the cards in the deck, and if the choice of the shuffle T is independent of X, then

        H(TX) ≥ H(X)

    where TX is the permutation of the deck induced by the shuffle T on the initial permutation X.

    Proof: H(TX) ≥ H(TX|T)

                 = H(T⁻¹TX|T)   (why?)

                 = H(X|T)

                 = H(X)

    if X and T are independent!


    If X and X' are i.i.d. with entropy H(X), then

        Pr(X = X') ≥ 2^{−H(X)}

    with equality iff X has a uniform distribution.

    pf: Suppose X ~ p(x). By Jensen's inequality, we have

        2^{E log p(X)} ≤ E 2^{log p(X)}

    which implies that

        2^{−H(X)} = 2^{∑ p(x) log p(x)} ≤ ∑ p(x) 2^{log p(x)} = ∑ p²(x) = Pr(X = X')

    ( Let X and X' be two i.i.d. r.v.s with entropy H(X). The probability that X = X' is given by Pr(X = X') = ∑_x p²(x). )

    Let X, X' be independent with X ~ p(x), X' ~ r(x), x, x' ∈ X. Then

        Pr(X = X') ≥ 2^{−H(p)−D(p||r)}
        Pr(X = X') ≥ 2^{−H(r)−D(r||p)}

    pf: 2^{−H(p)−D(p||r)} = 2^{∑ p(x) log p(x) + ∑ p(x) log [r(x)/p(x)]} = 2^{∑ p(x) log r(x)} ≤ ∑ p(x) 2^{log r(x)} = ∑ p(x) r(x) = Pr(X = X')

    *Notice that the function f(y) = 2^y is convex.
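    A quick check of the first inequality (the pmf is an arbitrary example):

        from math import log2

        p = [0.5, 0.25, 0.125, 0.125]            # example pmf
        H = -sum(v * log2(v) for v in p)         # H(X) = 1.75 bits
        collision = sum(v * v for v in p)        # Pr(X = X') for i.i.d. X, X'
        print(collision, 2 ** (-H), collision >= 2 ** (-H))   # 0.34375 >= 0.297...: True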