Information Theory Entropy Relative Entropy
Post on 05-Apr-2018
8/2/2019 Information Theory Entropy Relative Entropy
1/60
Entropy, Relative Entropy and Mutual Information
Prof. Ja-Ling Wu
Department of Computer Science and Information Engineering, National Taiwan University
Definition: The Entropy H(X) of a discrete
random variable X is defined by
H(X) = − Σ_{x∈X} p(x) log p(x)

log to the base 2: H(p) is measured in bits.
Convention: 0 log 0 = 0 (since x log x → 0 as x → 0); adding terms of zero probability does not change the entropy.
Note that entropy is a function of the distribution of X. It does not depend on the actual values taken by the r.v. X, but only on the probabilities.
Remark: If X ~ p(x), then the expected value of the r.v. g(X) is written

E_p g(X) = Σ_{x∈X} g(x) p(x)

The entropy of X is the expected value of log(1/p(X)) (the self-information):

H(X) = E_p log [1/p(X)]
Lemma 1.1: H(X) ≥ 0

Lemma 1.2: H_b(X) = (log_b a) H_a(X)

Ex: Let
X = 1 with probability p,
X = 0 with probability 1 − p.
Then
H(X) = −p log p − (1−p) log(1−p) ≜ H_2(p)
[Figure: the binary entropy function H_2(p) plotted against p ∈ [0, 1], rising from 0 at p = 0 to 1 bit at p = 1/2 and falling back to 0 at p = 1]
Properties of H_2(p):
1) H(X) = 1 bit when p = 1/2
2) H(X) is a concave function of p
3) H(X) = 0 if p = 0 or 1
4) max H(X) occurs when p = 1/2
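As a quick sanity check of these four properties, here is a small Python sketch (the helper name H2 and the evaluation grid are ours, not from the slides):

```python
import math

def H2(p):
    """Binary entropy H_2(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Property 1: H2(1/2) = 1 bit
print(H2(0.5))           # 1.0
# Property 3: H2(0) = H2(1) = 0
print(H2(0.0), H2(1.0))  # 0.0 0.0
# Property 4: over a grid, the maximum occurs at p = 1/2
print(max(H2(k / 100) for k in range(101)) == H2(0.5))  # True
```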
Joint Entropy and Conditional Entropy

Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)

or

H(X, Y) = −E log p(X, Y)

Definition: The conditional entropy H(Y|X) is defined as

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)
= − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x)
= − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)
= −E log p(Y|X)
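These two definitions can be exercised on a small joint table. The following Python sketch uses an arbitrary illustrative joint pmf of our own choosing and checks the chain rule H(X,Y) = H(X) + H(Y|X) numerically:

```python
import math

def entropy(pmf):
    """H = -sum p log2 p, skipping zero-probability terms (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# An arbitrary joint pmf p(x, y) on {0,1} x {0,1}, chosen for illustration:
P = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

HXY = entropy(P.values())                        # H(X,Y)
px = {x: P[(x, 0)] + P[(x, 1)] for x in (0, 1)}  # marginal p(x)
HX = entropy(px.values())                        # H(X)

# H(Y|X) = sum_x p(x) H(Y | X = x)
HY_given_X = sum(
    px[x] * entropy([P[(x, y)] / px[x] for y in (0, 1)])
    for x in (0, 1) if px[x] > 0
)

# Chain rule: H(X,Y) = H(X) + H(Y|X)
print(abs(HXY - (HX + HY_given_X)) < 1e-9)  # True
```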
Theorem 1.1 (Chain Rule): H(X, Y) = H(X) + H(Y|X)

pf:
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)
= − Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x) p(y|x)]
= − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)
= − Σ_{x∈X} p(x) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)
= H(X) + H(Y|X)

or, equivalently, we can write

log p(X, Y) = log p(X) + log p(Y|X)

and take the expectation of both sides.
Corollary:
H(X, Y|Z) = H(X|Z) + H(Y|X,Z)
Remark:
(i) H(Y|X) ≠ H(X|Y)
(ii) H(X) − H(X|Y) = H(Y) − H(Y|X)
Relative Entropy and Mutual Information
The entropy of a random variable is a measure
of the uncertainty of the random variable; it is a measure of the amount of information required on the average to describe the random variable.

The relative entropy is a measure of the distance between two distributions. In statistics, it arises as an expected logarithm of the likelihood ratio. The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.
Ex: If we knew the true distribution p of the r.v., then we could construct a code with average description length H(p). If, instead, we used the code for a distribution q, we would need H(p) + D(p||q) bits on the average to describe the r.v.
Definition:
The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

D(p||q) = Σ_{x∈X} p(x) log [p(x)/q(x)]
= E_p log [p(X)/q(X)]
= E_p log [1/q(X)] − E_p log [1/p(X)]
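This definition translates directly into code. A minimal Python sketch (the function name kl and the example distributions are ours) that also illustrates D(p||p) = 0:

```python
import math

def kl(p, q):
    """D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]      # uniform distribution on 3 symbols
print(kl(p, q) > 0)      # True: p differs from q
print(kl(p, p) == 0)     # True: D(p||p) = 0
```

Against the uniform q, this value also equals log2(3) − H(p), which previews the later theorem H(X) ≤ log|X|.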
Definition:
Consider two r.v.s X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y), i.e.,

I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
= D( p(x, y) || p(x) p(y) )
= E_{p(x,y)} log [ p(X, Y) / (p(X) p(Y)) ]
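The definition can be computed directly from a joint table. In this Python sketch the joint pmf is an arbitrary dependent example of our own choosing; the second part checks that an exactly independent pair gives I(X;Y) = 0:

```python
import math

# p(x, y) for a pair of binary r.v.s (values chosen arbitrarily for illustration)
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
px = {x: P[(x, 0)] + P[(x, 1)] for x in (0, 1)}
py = {y: P[(0, y)] + P[(1, y)] for y in (0, 1)}

# I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
I = sum(p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in P.items() if p > 0)
print(I > 0)  # True: X and Y are dependent here

# For the product distribution p(x)p(y), the mutual information is 0:
Q = {(x, y): px[x] * py[y] for x in (0, 1) for y in (0, 1)}
I0 = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in Q.items())
print(abs(I0) < 1e-12)  # True
```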
Ex: Let X = {0, 1} and consider two distributions p and q on X. Let p(0) = 1 − r, p(1) = r, and let q(0) = 1 − s, q(1) = s. Then

D(p||q) = p(0) log [p(0)/q(0)] + p(1) log [p(1)/q(1)]
= (1−r) log [(1−r)/(1−s)] + r log (r/s)

and

D(q||p) = q(0) log [q(0)/p(0)] + q(1) log [q(1)/p(1)]
= (1−s) log [(1−s)/(1−r)] + s log (s/r)

If r = s, then D(p||q) = D(q||p) = 0.
While, in general,
D(p||q) ≠ D(q||p)
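The asymmetry is easy to see numerically. A Python sketch of the two binary formulas above (the function name d_binary and the sample values r = 1/2, s = 1/4 are ours):

```python
import math

def d_binary(r, s):
    """D(Bern(r) || Bern(s)) in bits, using the 0 log 0 = 0 convention."""
    out = 0.0
    if r > 0:
        out += r * math.log2(r / s)
    if r < 1:
        out += (1 - r) * math.log2((1 - r) / (1 - s))
    return out

r, s = 1/2, 1/4
print(d_binary(r, s) != d_binary(s, r))  # True: D(p||q) != D(q||p) in general
print(d_binary(r, r) == 0.0)             # True: zero when r = s
```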
Relationship between Entropy and Mutual Information

Rewrite I(X;Y) as

I(X;Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
= Σ_{x,y} p(x, y) log [ p(x|y) / p(x) ]
= − Σ_{x,y} p(x, y) log p(x) + Σ_{x,y} p(x, y) log p(x|y)
= − Σ_x p(x) log p(x) + Σ_{x,y} p(x, y) log p(x|y)
= H(X) − H(X|Y)
Thus the mutual information I(X;Y) is the reduction in the uncertainty of X due to the knowledge of Y.

By symmetry, it follows that

I(X;Y) = H(Y) − H(Y|X)

X says as much about Y as Y says about X.

Since H(X,Y) = H(X) + H(Y|X), we also have I(X;Y) = H(X) + H(Y) − H(X,Y).

I(X;X) = H(X) − H(X|X) = H(X)

The mutual information of a r.v. with itself is the entropy of the r.v.; hence entropy is also called self-information.
Theorem: (Mutual information and entropy):
i. I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
   = H(X) + H(Y) − H(X,Y)
ii. I(X;Y) = I(Y;X)
iii. I(X;X) = H(X)

[Venn diagram: two overlapping circles H(X) and H(Y); their union is H(X,Y), the overlap is I(X;Y), and the non-overlapping parts are H(X|Y) and H(Y|X)]
Chain Rules for Entropy, Relative Entropy and Mutual Information

Theorem: (Chain rule for entropy)
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) = Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1)
Proof

(1)
H(X_1, X_2) = H(X_1) + H(X_2|X_1)
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1)
= H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1)
...
H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_{n−1}, ..., X_1)
= Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1)
(2) We write

p(x_1, x_2, ..., x_n) = Π_{i=1}^n p(x_i | x_{i−1}, ..., x_1);

then

H(X_1, X_2, ..., X_n)
= − Σ_{x_1, x_2, ..., x_n} p(x_1, ..., x_n) log p(x_1, ..., x_n)
= − Σ_{x_1, x_2, ..., x_n} p(x_1, ..., x_n) log Π_{i=1}^n p(x_i | x_{i−1}, ..., x_1)
= − Σ_{x_1, x_2, ..., x_n} Σ_{i=1}^n p(x_1, ..., x_n) log p(x_i | x_{i−1}, ..., x_1)
= − Σ_{i=1}^n Σ_{x_1, x_2, ..., x_i} p(x_1, ..., x_i) log p(x_i | x_{i−1}, ..., x_1)
= Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1)
Definition:
The conditional mutual information of r.v.s X and Y given Z is defined by

I(X;Y|Z) = H(X|Z) − H(X|Y, Z)
= E_{p(x,y,z)} log [ p(X, Y|Z) / (p(X|Z) p(Y|Z)) ]
Theorem: (chain rule for mutual information)

I(X_1, X_2, ..., X_n; Y) = Σ_{i=1}^n I(X_i; Y | X_{i−1}, ..., X_1)

proof:
I(X_1, X_2, ..., X_n; Y)
= H(X_1, X_2, ..., X_n) − H(X_1, X_2, ..., X_n | Y)
= Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1) − Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1, Y)
= Σ_{i=1}^n I(X_i; Y | X_1, X_2, ..., X_{i−1})
Definition:
The conditional relative entropy D(p(y|x) || q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x):

D(p(y|x) || q(y|x)) = Σ_x p(x) Σ_y p(y|x) log [ p(y|x) / q(y|x) ]
= E_{p(x,y)} log [ p(Y|X) / q(Y|X) ]

Theorem: (Chain rule for relative entropy)

D(p(x,y) || q(x,y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))
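The chain rule for relative entropy can be checked on a small example. In this Python sketch the two joint pmfs are arbitrary choices of ours:

```python
import math

# Two joint pmfs p(x,y), q(x,y) on {0,1}^2, chosen arbitrarily:
p = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
q = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.30}

def kl(pd, qd):
    return sum(pd[k] * math.log2(pd[k] / qd[k]) for k in pd if pd[k] > 0)

px = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}
qx = {x: q[(x, 0)] + q[(x, 1)] for x in (0, 1)}

# D(p(y|x) || q(y|x)) = sum_x p(x) sum_y p(y|x) log2 [ p(y|x) / q(y|x) ]
D_cond = sum(
    px[x] * sum(
        (p[(x, y)] / px[x]) * math.log2((p[(x, y)] / px[x]) / (q[(x, y)] / qx[x]))
        for y in (0, 1) if p[(x, y)] > 0
    )
    for x in (0, 1)
)

# Chain rule: D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
print(abs(kl(p, q) - (kl(px, qx) + D_cond)) < 1e-9)  # True
```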
Jensen's Inequality and Its Consequences

Definition: A function f is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

f(λx_1 + (1−λ)x_2) ≤ λf(x_1) + (1−λ)f(x_2)

A function f is said to be strictly convex if equality holds only if λ = 0 or λ = 1.

Definition: A function f is concave if −f is convex.

Ex: convex functions: x², |x|, e^x, x log x (for x ≥ 0)
concave functions: log x, x^{1/2} for x ≥ 0
both convex and concave: ax + b; linear functions
Theorem:
If the function f has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex).

Notation:
EX = Σ_{x∈X} x p(x) (discrete case)
EX = ∫ x p(x) dx (continuous case)
Theorem: (Jensen's inequality): If f(x) is a convex function and X is a random variable, then Ef(X) ≥ f(EX).

Proof: (Mathematical Induction) For a two-mass-point distribution, the inequality becomes

p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2), with p_1 + p_2 = 1,

which follows directly from the definition of convex functions. Suppose the theorem is true for distributions with k − 1 mass points. Then writing p_i' = p_i / (1 − p_k) for i = 1, 2, ..., k − 1, we have

Σ_{i=1}^k p_i f(x_i) = p_k f(x_k) + (1 − p_k) Σ_{i=1}^{k−1} p_i' f(x_i)
≥ p_k f(x_k) + (1 − p_k) f( Σ_{i=1}^{k−1} p_i' x_i )
≥ f( p_k x_k + (1 − p_k) Σ_{i=1}^{k−1} p_i' x_i )
= f( Σ_{i=1}^k p_i x_i )
= f(EX)

where the first inequality follows from the induction hypothesis and the second from the definition of convexity. The proof can be extended to continuous distributions by continuity arguments.
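A quick numeric illustration of Jensen's inequality with the convex function f(x) = x² (the mass points and probabilities below are arbitrary choices of ours):

```python
# f(x) = x^2 is convex, so Jensen gives E f(X) >= f(E X).
xs = [1.0, 2.0, 5.0]   # mass points
ps = [0.2, 0.5, 0.3]   # probabilities (sum to 1)

EX = sum(p * x for p, x in zip(ps, xs))
Ef = sum(p * x ** 2 for p, x in zip(ps, xs))
print(Ef >= EX ** 2)   # True

# For a degenerate distribution (X constant) the two sides coincide:
xs2 = [3.0, 3.0, 3.0]
EX2 = sum(p * x for p, x in zip(ps, xs2))
Ef2 = sum(p * x ** 2 for p, x in zip(ps, xs2))
print(abs(Ef2 - EX2 ** 2) < 1e-9)  # True
```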
Theorem: (Information inequality):

Let p(x), q(x), x ∈ X, be two probability mass functions. Then

D(p||q) ≥ 0

with equality iff p(x) = q(x) for all x.

Proof: Let A = {x : p(x) > 0} be the support set of p(x). Then

D(p||q) = Σ_{x∈A} p(x) log [p(x)/q(x)]
= − Σ_{x∈A} p(x) log [q(x)/p(x)]
= − E_p log [q(X)/p(X)]
≥ − log E_p [q(X)/p(X)]   (Jensen: −log t is convex)
= − log Σ_{x∈A} p(x) [q(x)/p(x)]
= − log Σ_{x∈A} q(x)
≥ − log Σ_{x∈X} q(x)
= − log 1 = 0
Corollary: (Non-negativity of mutual information):
For any two r.v.s X, Y,
I(X;Y) ≥ 0
with equality iff X and Y are independent.

Proof:
I(X;Y) = D(p(x,y) || p(x)p(y)) ≥ 0, with equality iff p(x,y) = p(x)p(y), i.e., X and Y are independent.

Corollary:
D(p(y|x) || q(y|x)) ≥ 0
with equality iff p(y|x) = q(y|x) for all x and y with p(x) > 0.

Corollary:
I(X;Y|Z) ≥ 0
with equality iff X and Y are conditionally independent given Z.
Theorem: H(X) ≤ log|X|, where |X| denotes the number of elements in the range of X, with equality iff X has a uniform distribution over X.

Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then

D(p||u) = Σ_x p(x) log [p(x)/u(x)] = log|X| − H(X)

Hence, by the non-negativity of relative entropy,

0 ≤ D(p||u) = log|X| − H(X)
Theorem: (conditioning reduces entropy):

H(X|Y) ≤ H(X)
with equality iff X and Y are independent.

Proof: 0 ≤ I(X;Y) = H(X) − H(X|Y)

Note that this is true only on the average; specifically, H(X|Y = y) may be greater than, less than, or equal to H(X), but on the average H(X|Y) = Σ_y p(y) H(X|Y = y) ≤ H(X).
Ex: Let (X, Y) have the following joint distribution:

p(x, y)   X=1   X=2
Y=1        0    3/4
Y=2       1/8   1/8

Then, H(X) = H(1/8, 7/8) = 0.544 bits,
H(X|Y=1) = 0 bits,
H(X|Y=2) = 1 bit > H(X).
However, H(X|Y) = 3/4 H(X|Y=1) + 1/4 H(X|Y=2) = 0.25 bits < H(X).
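The numbers in this example can be reproduced directly (the helper name H is ours):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

HX = H([1/8, 7/8])                                  # marginal of X
H_X_given_Y = 3/4 * H([0, 1]) + 1/4 * H([1/2, 1/2]) # average over Y
print(round(HX, 3))       # 0.544
print(H_X_given_Y)        # 0.25
print(H_X_given_Y < HX)   # True: conditioning reduced entropy on average
```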
Theorem: (Independence bound on entropy):
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) ≤ Σ_{i=1}^n H(X_i)

with equality iff the X_i are independent.

Proof: By the chain rule for entropies,

H(X_1, X_2, ..., X_n) = Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1) ≤ Σ_{i=1}^n H(X_i)

with equality iff the X_i's are independent.
The LOG SUM INEQUALITY AND ITS APPLICATIONS

Theorem: (Log sum inequality)
For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

Σ_{i=1}^n a_i log (a_i/b_i) ≥ ( Σ_{i=1}^n a_i ) log [ ( Σ_{i=1}^n a_i ) / ( Σ_{i=1}^n b_i ) ]

with equality iff a_i/b_i = constant.

Some conventions: 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.
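Both the inequality and its equality condition can be checked numerically. The vectors below are arbitrary positive examples of ours:

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]
lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs >= rhs)  # True: the log sum inequality holds

# Equality when a_i / b_i is constant (here a_i / b2_i = 1/2 for all i):
b2 = [2 * ai for ai in a]
lhs2 = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b2))
rhs2 = sum(a) * math.log2(sum(a) / sum(b2))
print(abs(lhs2 - rhs2) < 1e-12)  # True
```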
Proof: Assume w.l.o.g. that a_i > 0 and b_i > 0. The function f(t) = t log t is strictly convex, since

f''(t) = (1/t) log e > 0

for all positive t. Hence by Jensen's inequality, we have

Σ_i α_i f(t_i) ≥ f( Σ_i α_i t_i )   for α_i ≥ 0, Σ_i α_i = 1.

Setting α_i = b_i / Σ_{j=1}^n b_j and t_i = a_i / b_i, we obtain

Σ_i [ b_i / Σ_j b_j ] (a_i/b_i) log (a_i/b_i) ≥ [ Σ_i (b_i / Σ_j b_j)(a_i/b_i) ] log [ Σ_i (b_i / Σ_j b_j)(a_i/b_i) ]

(note that Σ_i (b_i / Σ_j b_j)(a_i/b_i) = Σ_i a_i / Σ_j b_j ≥ 0), i.e.,

Σ_i [ a_i / Σ_j b_j ] log (a_i/b_i) ≥ [ Σ_i a_i / Σ_j b_j ] log [ Σ_i a_i / Σ_j b_j ]

Multiplying both sides by Σ_j b_j gives

Σ_i a_i log (a_i/b_i) ≥ ( Σ_i a_i ) log ( Σ_i a_i / Σ_j b_j )

which is the log sum inequality.
Reproving the theorem that D(p||q) ≥ 0, with equality iff p(x) = q(x):

D(p||q) = Σ_x p(x) log [p(x)/q(x)]
≥ ( Σ_x p(x) ) log ( Σ_x p(x) / Σ_x q(x) )   (from the log-sum inequality)
= 1 · log (1/1) = 0

with equality iff p(x)/q(x) = c. Since both p and q are probability mass functions, c = 1, i.e., p(x) = q(x), ∀x.
Theorem: D(p||q) is convex in the pair (p, q); i.e., if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λD(p_1||q_1) + (1−λ)D(p_2||q_2)

for all 0 ≤ λ ≤ 1.

Proof: Apply the log-sum inequality to a term on the left-hand side, with a_1 = λp_1(x), a_2 = (1−λ)p_2(x), b_1 = λq_1(x), b_2 = (1−λ)q_2(x):

[λp_1(x) + (1−λ)p_2(x)] log { [λp_1(x) + (1−λ)p_2(x)] / [λq_1(x) + (1−λ)q_2(x)] }
≤ λp_1(x) log [ λp_1(x) / λq_1(x) ] + (1−λ)p_2(x) log [ (1−λ)p_2(x) / (1−λ)q_2(x) ]
= λp_1(x) log [ p_1(x)/q_1(x) ] + (1−λ)p_2(x) log [ p_2(x)/q_2(x) ]

Summing over x gives

D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λD(p_1||q_1) + (1−λ)D(p_2||q_2)
Theorem: (concavity of entropy):

H(p) is a concave function of p.
That is: H(λp_1 + (1−λ)p_2) ≥ λH(p_1) + (1−λ)H(p_2)

Proof: H(p) = log|X| − D(p||u)

where u is the uniform distribution on |X| outcomes. The concavity of H then follows directly from the convexity of D.
Theorem: Let (X, Y) ~ p(x, y) = p(x) p(y|x).

The mutual information I(X;Y) is
(i) a concave function of p(x) for fixed p(y|x);
(ii) a convex function of p(y|x) for fixed p(x).

Proof:

(1) I(X;Y) = H(Y) − H(Y|X) = H(Y) − Σ_x p(x) H(Y|X=x)   (*)
If p(y|x) is fixed, then p(y) is a linear function of p(x) ( p(y) = Σ_x p(x, y) = Σ_x p(x) p(y|x) ). Hence H(Y), which is a concave function of p(y), is a concave function of p(x). The second term of (*) is a linear function of p(x). Hence the difference is a concave function of p(x).
(2) We fix p(x) and consider two different conditional distributions p_1(y|x) and p_2(y|x). The corresponding joint distributions are p_1(x, y) = p(x) p_1(y|x) and p_2(x, y) = p(x) p_2(y|x), and their respective marginals are p(x), p_1(y) and p(x), p_2(y). Consider a conditional distribution

p_λ(y|x) = λp_1(y|x) + (1−λ)p_2(y|x)

that is a mixture of p_1(y|x) and p_2(y|x). Since p(x) is fixed, the corresponding joint distribution is also a mixture of the corresponding joint distributions,

p_λ(x, y) = λp_1(x, y) + (1−λ)p_2(x, y),

and the distribution of Y is also a mixture, p_λ(y) = λp_1(y) + (1−λ)p_2(y). Hence if we let q_λ(x, y) = p(x) p_λ(y) (the product of the marginal distributions), then q_λ(x, y) = λq_1(x, y) + (1−λ)q_2(x, y); i.e., with p(x) fixed, both p_λ(x, y) and q_λ(x, y) are linear in p_i(y|x).

Since I(X;Y) = D(p_λ||q_λ), and D(p||q) is convex in the pair (p, q), the mutual information is a convex function of the conditional distribution p(y|x) when p(x) is fixed.
Data processing inequality:

No clever manipulation of the data can improve the inferences that can be made from the data.

Definition: R.v.s X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. That is, if X → Y → Z form a Markov chain, then

(i) p(x, y, z) = p(x) p(y|x) p(z|y)
(ii) p(x, z|y) = p(x|y) p(z|y): X and Z are conditionally independent given Y

X → Y → Z implies that Z → Y → X.
If Z = f(Y), then X → Y → Z.
Theorem: (Data processing inequality)

If X → Y → Z, then I(X;Y) ≥ I(X;Z).
No processing of Y, deterministic or random, can increase the information that Y contains about X.

Proof: I(X;Y,Z) = I(X;Z) + I(X;Y|Z)   (chain rule)
= I(X;Y) + I(X;Z|Y)   (chain rule)

Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0. Since I(X;Y|Z) ≥ 0, we have I(X;Y) ≥ I(X;Z), with equality iff I(X;Y|Z) = 0, i.e., X → Z → Y forms a Markov chain. Similarly, one can prove I(Y;Z) ≥ I(X;Z).
Corollary: If X → Y → Z forms a Markov chain and Z = g(Y), we have I(X;Y) ≥ I(X;g(Y)):
functions of the data Y cannot increase the information about X.

Corollary: If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).
Proof: I(X;Y,Z) = I(X;Z) + I(X;Y|Z)
= I(X;Y) + I(X;Z|Y)
By Markovity, I(X;Z|Y) = 0, and I(X;Z) ≥ 0, so I(X;Y|Z) ≤ I(X;Y).
The dependence of X and Y is decreased (or remains unchanged) by the observation of a downstream r.v. Z.
Note that it is possible that I(X;Y|Z) > I(X;Y) when X, Y and Z do not form a Markov chain.

Ex: Let X and Y be independent fair binary r.v.s, and let Z = X + Y. Then I(X;Y) = 0, but

I(X;Y|Z) = H(X|Z) − H(X|Y, Z) = H(X|Z)
= P(Z = 1) H(X|Z = 1) = 1/2 bit.
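The 1/2-bit value in this example can be reproduced by enumerating the conditional distributions of X given Z (the helper name H is ours):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# X, Y independent fair bits, Z = X + Y; p(z): {0: 1/4, 1: 1/2, 2: 1/4}.
# I(X;Y) = 0 by independence, and H(X|Y,Z) = 0 because X = Z - Y,
# so I(X;Y|Z) = H(X|Z).
# Given Z = 0 or Z = 2, X is determined; given Z = 1, X is uniform on {0,1}:
H_X_given_Z = 1/4 * H([1]) + 1/2 * H([1/2, 1/2]) + 1/4 * H([1])
print(H_X_given_Z)  # 0.5, i.e. I(X;Y|Z) = 1/2 bit > I(X;Y) = 0
```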
Fano's inequality:

Fano's inequality relates the probability of error in guessing the r.v. X to its conditional entropy H(X|Y).

Note that:
The conditional entropy of a r.v. X given another random variable Y is zero iff X is a function of Y. (proof: HW)
H(X|Y) = 0 implies there is no uncertainty about X if we know Y: for all x with p(x) > 0, there is only one possible value of y with p(x, y) > 0.
Hence we can estimate X from Y with zero probability of error iff H(X|Y) = 0, and we expect to be able to estimate X with a low probability of error only if the conditional entropy H(X|Y) is small.
Fano's inequality quantifies this idea.
Suppose we wish to estimate a r.v. X with a distribution p(x). We observe a r.v. Y which is related to X by the conditional distribution p(y|x). From Y, we calculate a function X̂ = g(Y), which is an estimate of X. We wish to bound the probability that X̂ ≠ X. We observe that X → Y → X̂ forms a Markov chain.

Define the probability of error

P_e = Pr{X̂ ≠ X} = Pr{g(Y) ≠ X}
Theorem: (Fano's inequality)

For any estimator X̂ such that X → Y → X̂, with P_e = Pr(X̂ ≠ X), we have

H(P_e) + P_e log(|X| − 1) ≥ H(X|Y)

This inequality can be weakened to

1 + P_e log|X| ≥ H(X|Y)    (since H(P_e) ≤ 1 for a binary r.v. E, and log(|X| − 1) ≤ log|X|)

or

P_e ≥ [H(X|Y) − 1] / log|X|

Remark: P_e = 0 ⇒ H(X|Y) = 0
Proof: Define an error r.v.

E = 1 if X̂ ≠ X,
E = 0 if X̂ = X.

By the chain rule for entropies, we have

H(E, X | X̂) = H(X|X̂) + H(E|X, X̂)
            = H(E|X̂) + H(X|E, X̂)

Since E is a function of X and X̂, H(E|X, X̂) = 0. Since conditioning reduces entropy and E is a binary-valued r.v., H(E|X̂) ≤ H(E) = H(P_e). The remaining term, H(X|E, X̂), can be bounded as follows:

H(X|E, X̂) = Pr(E=0) H(X|X̂, E=0) + Pr(E=1) H(X|X̂, E=1)
≤ (1 − P_e) · 0 + P_e log(|X| − 1),
since given E = 0, X = X̂, and given E = 1, we can upper bound the conditional entropy by the log of the number of remaining outcomes (|X| − 1). Combining these results,

H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂).

By the data processing inequality, we have I(X;X̂) ≤ I(X;Y) since X → Y → X̂, and therefore H(X|X̂) ≥ H(X|Y). Thus we have

H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y).

Remark:
Suppose there is no knowledge of Y. Thus X must be guessed without any information. Let X ∈ {1, 2, ..., m} and P_1 ≥ P_2 ≥ ... ≥ P_m. Then the best guess of X is X̂ = 1, and the resulting probability of error is P_e = 1 − P_1. Fano's inequality becomes

H(P_e) + P_e log(m − 1) ≥ H(X)

The probability mass function

(P_1, P_2, ..., P_m) = (1 − P_e, P_e/(m−1), ..., P_e/(m−1))

achieves this bound with equality.
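The no-side-information case of Fano's inequality, and the pmf that achieves it with equality, can both be checked numerically (the example pmf [0.6, 0.3, 0.1] is our own choice):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# No side information: best guess is the most likely value, Pe = 1 - max p(x).
# Fano (with Y empty): H(Pe) + Pe * log2(m - 1) >= H(X).
p = [0.6, 0.3, 0.1]
m = len(p)
Pe = 1 - max(p)
bound = H([Pe, 1 - Pe]) + Pe * math.log2(m - 1)
print(bound >= H(p))  # True

# The pmf (1-Pe, Pe/(m-1), ..., Pe/(m-1)) meets the bound with equality:
q = [1 - Pe] + [Pe / (m - 1)] * (m - 1)
print(abs(bound - H(q)) < 1e-9)  # True
```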
Some Properties of the Relative Entropy

1. Let μ_n and μ_n' be two probability distributions on the state space of a Markov chain at time n, and let μ_{n+1} and μ_{n+1}' be the corresponding distributions at time n+1. Let the corresponding joint mass functions be denoted by p and q. That is,

p(x_n, x_{n+1}) = p(x_n) r(x_{n+1} | x_n)
q(x_n, x_{n+1}) = q(x_n) r(x_{n+1} | x_n)

where r(·|·) is the probability transition function for the Markov chain.
Then by the chain rule for relative entropy, we have the following two expansions:

D(p(x_n, x_{n+1}) || q(x_n, x_{n+1}))
= D(p(x_n) || q(x_n)) + D(p(x_{n+1}|x_n) || q(x_{n+1}|x_n))
= D(p(x_{n+1}) || q(x_{n+1})) + D(p(x_n|x_{n+1}) || q(x_n|x_{n+1}))

Since both p and q are derived from the same Markov chain,

p(x_{n+1}|x_n) = q(x_{n+1}|x_n) = r(x_{n+1}|x_n),

and hence

D(p(x_{n+1}|x_n) || q(x_{n+1}|x_n)) = 0
That is,

D(p(x_n) || q(x_n))
= D(p(x_{n+1}) || q(x_{n+1})) + D(p(x_n|x_{n+1}) || q(x_n|x_{n+1}))

Since D(p(x_n|x_{n+1}) || q(x_n|x_{n+1})) ≥ 0,

D(p(x_n) || q(x_n)) ≥ D(p(x_{n+1}) || q(x_{n+1}))

or D(μ_n || μ_n') ≥ D(μ_{n+1} || μ_{n+1}')

Conclusion:
The distance between the probability mass functions is decreasing with time n for any Markov chain.
2. The relative entropy D(μ_n || μ) between a distribution μ_n on the states at time n and a stationary distribution μ decreases with n.

In the last equation, if we let μ_n' be any stationary distribution μ, then μ_{n+1}' is the same stationary distribution. Hence

D(μ_n || μ) ≥ D(μ_{n+1} || μ)

Any state distribution gets closer and closer to each stationary distribution as time passes:

lim_{n→∞} D(μ_n || μ) = 0
3. Def: A probability transition matrix [P_ij], P_ij = Pr{x_{n+1} = j | x_n = i}, is called doubly stochastic if

Σ_i P_ij = 1, j = 1, 2, ...
and
Σ_j P_ij = 1, i = 1, 2, ...

The uniform distribution is a stationary distribution of P iff the probability transition matrix is doubly stochastic.
4. The conditional entropy H(X_n|X_1) increases with n for a stationary Markov process.

If the Markov process is stationary, then H(X_n) is constant. However, it can be proved that H(X_n|X_1) increases with n. This implies that the conditional uncertainty of the future increases.

Proof:
H(X_n|X_1) ≥ H(X_n|X_1, X_2)   (conditioning reduces entropy)
= H(X_n|X_2)   (by Markovity)
= H(X_{n−1}|X_1)   (by stationarity)

Similarly, H(X_0|X_n) is increasing in n for any Markov chain.
Sufficient Statistics

Suppose we have a family of probability mass functions {f_θ(x)} indexed by θ, and let X be a sample from a distribution in this family. Let T(X) be any statistic (function of the sample), like the sample mean or sample variance. Then

θ → X → T(X),

and by the data processing inequality, we have

I(θ; T(X)) ≤ I(θ; X)

for any distribution on θ. However, if equality holds, no information is lost.

A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.
Def:
A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if X is independent of θ given T(X), i.e., θ → T(X) → X forms a Markov chain.

Equivalently: I(θ; X) = I(θ; T(X)) for all distributions on θ.

Sufficient statistics preserve mutual information.
Some examples of Sufficient Statistics

1. Let X_1, X_2, ..., X_n, X_i ∈ {0, 1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter θ = Pr(X_i = 1).

Given n, the number of 1's,

T(X_1, X_2, ..., X_n) = Σ_{i=1}^n X_i,

is a sufficient statistic for θ.

Given T, all sequences having that many 1's are equally likely and independent of the parameter θ.
Pr{ (X_1, X_2, ..., X_n) = (x_1, x_2, ..., x_n) | Σ_{i=1}^n X_i = k }
= 1 / C(n, k)   if Σ_i x_i = k,
= 0             otherwise.

Thus θ → Σ_i X_i → (X_1, X_2, ..., X_n), and T is a sufficient statistic for θ.
2. If X is normally distributed with mean θ and variance 1, that is, if

f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2},   i.e., X ~ N(θ, 1),

and X_1, X_2, ..., X_n are drawn independently according to f_θ, then a sufficient statistic for θ is the sample mean

X̄_n = (1/n) Σ_{i=1}^n X_i.

It can be verified that the conditional distribution of X_1, X_2, ..., X_n, conditioned on X̄_n and n, is independent of θ.
The minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics.

Def:
A statistic T(X) is a minimal sufficient statistic relative to {f_θ(x)} if it is a function of every other sufficient statistic U:

θ → T(X) → U(X) → X

Hence, a minimal sufficient statistic maximally compresses the information about θ in the sample. Other sufficient statistics may contain additional irrelevant information.

The sufficient statistics of the above examples are minimal.
Shuffles increase Entropy:

If T is a shuffle (permutation) of a deck of cards and X is the initial (random) position of the cards in the deck, and if the choice of the shuffle T is independent of X, then

H(TX) ≥ H(X)

where TX is the permutation of the deck induced by the shuffle T on the initial permutation X.

Proof: H(TX) ≥ H(TX|T)   (conditioning reduces entropy)
= H(T⁻¹TX|T)   (why? given T, applying the fixed permutation T⁻¹ is one-to-one)
= H(X|T)
= H(X)   (if X and T are independent!)
If X and X' are i.i.d. with entropy H(X), then

Pr(X = X') ≥ 2^{−H(X)}

with equality iff X has a uniform distribution.

pf: Suppose X ~ p(x). (Let X and X' be two i.i.d. r.v.s with entropy H(X); the probability that X = X' is Pr(X = X') = Σ_x p²(x).) By Jensen's inequality,

2^{E log p(X)} ≤ E 2^{log p(X)}

which implies that

2^{−H(X)} = 2^{Σ_x p(x) log p(x)} ≤ Σ_x p(x) 2^{log p(x)} = Σ_x p²(x) = Pr(X = X').

Let X, X' be independent with X ~ p(x), X' ~ r(x), x, x' ∈ X. Then

Pr(X = X') ≥ 2^{−H(p) − D(p||r)}
Pr(X = X') ≥ 2^{−H(r) − D(r||p)}

pf: 2^{−H(p) − D(p||r)} = 2^{Σ_x p(x) log p(x) + Σ_x p(x) log [r(x)/p(x)]} = 2^{Σ_x p(x) log r(x)} ≤ Σ_x p(x) 2^{log r(x)} = Σ_x p(x) r(x) = Pr(X = X').
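The first bound and its equality condition can be checked numerically (the example pmf is our own choice):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

p = [1/2, 1/4, 1/4]
collision = sum(x * x for x in p)   # Pr(X = X') for i.i.d. X, X' ~ p
print(collision >= 2 ** (-H(p)))    # True

u = [1/4] * 4                       # uniform distribution: equality
print(abs(sum(x * x for x in u) - 2 ** (-H(u))) < 1e-12)  # True
```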
*Notice that the function f(y) = 2^y is convex, so Jensen's inequality gives 2^{EY} ≤ E 2^Y above.