Information Theory Entropy Relative Entropy
Post on 05-Apr-2018
8/2/2019 Information Theory Entropy Relative Entropy
1/60
Entropy, Relative Entropy and Mutual Information
Prof. Ja-Ling Wu
Department of Computer Science and Information Engineering, National Taiwan University
Definition: The Entropy H(X) of a discrete
random variable X is defined by
H(X) = − Σ_{x∈X} p(x) log p(x)

log to the base 2: H(p) is measured in bits.
Convention: 0 log 0 = 0 (since x log x → 0 as x → 0); adding terms of zero probability does not change the entropy.
Note that entropy is a function of the distribution of X. It does not depend on the actual values taken by the r.v. X, but only on the probabilities.
Remark: If X ~ p(x), then the expected value of the r.v. g(X) is written

E_p g(X) = Σ_{x∈X} g(x) p(x)

The entropy of X is the expected value of log(1/p(X)) (the self-information):

H(X) = E_p log [1/p(X)]
Lemma 1.1: H(X) ≥ 0

Lemma 1.2: H_b(X) = (log_b a) H_a(X)

Ex: Let
X = 1 with probability p,
X = 0 with probability 1 − p.
Then
H(X) = −p log p − (1−p) log(1−p) ≜ H_2(p)
[Figure: the binary entropy function H_2(p) plotted against p ∈ [0, 1], rising from 0 at p = 0 to 1 bit at p = 1/2 and falling back to 0 at p = 1]
Properties of H_2(p):
1) H(X) = 1 bit when p = 1/2
2) H(X) is a concave function of p
3) H(X) = 0 if p = 0 or 1
4) max H(X) occurs when p = 1/2
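As a quick sanity check of these four properties, here is a small Python sketch (the helper name H2 and the evaluation grid are ours, not from the slides):

```python
import math

def H2(p):
    """Binary entropy H_2(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Property 1: H2(1/2) = 1 bit
print(H2(0.5))           # 1.0
# Property 3: H2(0) = H2(1) = 0
print(H2(0.0), H2(1.0))  # 0.0 0.0
# Property 4: over a grid, the maximum occurs at p = 1/2
print(max(H2(k / 100) for k in range(101)) == H2(0.5))  # True
```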
Joint Entropy and Conditional Entropy

Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)

or

H(X, Y) = −E log p(X, Y)

Definition: The conditional entropy H(Y|X) is defined as

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)
= − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x)
= − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)
= −E log p(Y|X)
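These two definitions can be exercised on a small joint table. The following Python sketch uses an arbitrary illustrative joint pmf of our own choosing and checks the chain rule H(X,Y) = H(X) + H(Y|X) numerically:

```python
import math

def entropy(pmf):
    """H = -sum p log2 p, skipping zero-probability terms (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# An arbitrary joint pmf p(x, y) on {0,1} x {0,1}, chosen for illustration:
P = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

HXY = entropy(P.values())                        # H(X,Y)
px = {x: P[(x, 0)] + P[(x, 1)] for x in (0, 1)}  # marginal p(x)
HX = entropy(px.values())                        # H(X)

# H(Y|X) = sum_x p(x) H(Y | X = x)
HY_given_X = sum(
    px[x] * entropy([P[(x, y)] / px[x] for y in (0, 1)])
    for x in (0, 1) if px[x] > 0
)

# Chain rule: H(X,Y) = H(X) + H(Y|X)
print(abs(HXY - (HX + HY_given_X)) < 1e-9)  # True
```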
Theorem 1.1 (Chain Rule): H(X, Y) = H(X) + H(Y|X)

pf:
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)
= − Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x) p(y|x)]
= − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)
= − Σ_{x∈X} p(x) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)
= H(X) + H(Y|X)

or, equivalently, we can write

log p(X, Y) = log p(X) + log p(Y|X)

and take the expectation of both sides.
Corollary:
H(X, Y|Z) = H(X|Z) + H(Y|X,Z)
Remark:
(i) H(Y|X) ≠ H(X|Y)
(ii) H(X) − H(X|Y) = H(Y) − H(Y|X)
Relative Entropy and Mutual Information
The entropy of a random variable is a measure
of the uncertainty of the random variable; it is a measure of the amount of information required on the average to describe the random variable.

The relative entropy is a measure of the distance between two distributions. In statistics, it arises as an expected logarithm of the likelihood ratio. The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.
Ex: If we knew the true distribution p of the r.v., then we could construct a code with average description length H(p). If, instead, we used the code for a distribution q, we would need H(p) + D(p||q) bits on the average to describe the r.v.
Definition:
The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

D(p||q) = Σ_{x∈X} p(x) log [p(x)/q(x)]
= E_p log [p(X)/q(X)]
= E_p log [1/q(X)] − E_p log [1/p(X)]
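This definition translates directly into code. A minimal Python sketch (the function name kl and the example distributions are ours) that also illustrates D(p||p) = 0:

```python
import math

def kl(p, q):
    """D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]      # uniform distribution on 3 symbols
print(kl(p, q) > 0)      # True: p differs from q
print(kl(p, p) == 0)     # True: D(p||p) = 0
```

Against the uniform q, this value also equals log2(3) − H(p), which previews the later theorem H(X) ≤ log|X|.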
Definition:
Consider two r.v.s X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y), i.e.,

I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
= D( p(x, y) || p(x) p(y) )
= E_{p(x,y)} log [ p(X, Y) / (p(X) p(Y)) ]
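The definition can be computed directly from a joint table. In this Python sketch the joint pmf is an arbitrary dependent example of our own choosing; the second part checks that an exactly independent pair gives I(X;Y) = 0:

```python
import math

# p(x, y) for a pair of binary r.v.s (values chosen arbitrarily for illustration)
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
px = {x: P[(x, 0)] + P[(x, 1)] for x in (0, 1)}
py = {y: P[(0, y)] + P[(1, y)] for y in (0, 1)}

# I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
I = sum(p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in P.items() if p > 0)
print(I > 0)  # True: X and Y are dependent here

# For the product distribution p(x)p(y), the mutual information is 0:
Q = {(x, y): px[x] * py[y] for x in (0, 1) for y in (0, 1)}
I0 = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in Q.items())
print(abs(I0) < 1e-12)  # True
```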
Ex: Let X = {0, 1} and consider two distributions p and q on X. Let p(0) = 1 − r, p(1) = r, and let q(0) = 1 − s, q(1) = s. Then

D(p||q) = p(0) log [p(0)/q(0)] + p(1) log [p(1)/q(1)]
= (1−r) log [(1−r)/(1−s)] + r log (r/s)

and

D(q||p) = q(0) log [q(0)/p(0)] + q(1) log [q(1)/p(1)]
= (1−s) log [(1−s)/(1−r)] + s log (s/r)

If r = s, then D(p||q) = D(q||p) = 0.
While, in general,
D(p||q) ≠ D(q||p)
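The asymmetry is easy to see numerically. A Python sketch of the two binary formulas above (the function name d_binary and the sample values r = 1/2, s = 1/4 are ours):

```python
import math

def d_binary(r, s):
    """D(Bern(r) || Bern(s)) in bits, using the 0 log 0 = 0 convention."""
    out = 0.0
    if r > 0:
        out += r * math.log2(r / s)
    if r < 1:
        out += (1 - r) * math.log2((1 - r) / (1 - s))
    return out

r, s = 1/2, 1/4
print(d_binary(r, s) != d_binary(s, r))  # True: D(p||q) != D(q||p) in general
print(d_binary(r, r) == 0.0)             # True: zero when r = s
```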
Relationship between Entropy and Mutual Information

Rewrite I(X;Y) as

I(X;Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
= Σ_{x,y} p(x, y) log [ p(x|y) / p(x) ]
= − Σ_{x,y} p(x, y) log p(x) + Σ_{x,y} p(x, y) log p(x|y)
= − Σ_x p(x) log p(x) + Σ_{x,y} p(x, y) log p(x|y)
= H(X) − H(X|Y)
Thus the mutual information I(X;Y) is the reduction in the uncertainty of X due to the knowledge of Y.

By symmetry, it follows that

I(X;Y) = H(Y) − H(Y|X)

X says as much about Y as Y says about X.

Since H(X,Y) = H(X) + H(Y|X), we also have I(X;Y) = H(X) + H(Y) − H(X,Y).

I(X;X) = H(X) − H(X|X) = H(X)

The mutual information of a r.v. with itself is the entropy of the r.v.; hence entropy is also called self-information.
Theorem: (Mutual information and entropy):
i. I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
   = H(X) + H(Y) − H(X,Y)
ii. I(X;Y) = I(Y;X)
iii. I(X;X) = H(X)

[Venn diagram: two overlapping circles H(X) and H(Y); their union is H(X,Y), the overlap is I(X;Y), and the non-overlapping parts are H(X|Y) and H(Y|X)]
Chain Rules for Entropy, Relative Entropy and Mutual Information

Theorem: (Chain rule for entropy)
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) = Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1)
Proof

(1)
H(X_1, X_2) = H(X_1) + H(X_2|X_1)
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1)
= H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1)
...
H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_{n−1}, ..., X_1)
= Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1)
(2) We write

p(x_1, x_2, ..., x_n) = Π_{i=1}^n p(x_i | x_{i−1}, ..., x_1);

then

H(X_1, X_2, ..., X_n)
= − Σ_{x_1, x_2, ..., x_n} p(x_1, ..., x_n) log p(x_1, ..., x_n)
= − Σ_{x_1, x_2, ..., x_n} p(x_1, ..., x_n) log Π_{i=1}^n p(x_i | x_{i−1}, ..., x_1)
= − Σ_{x_1, x_2, ..., x_n} Σ_{i=1}^n p(x_1, ..., x_n) log p(x_i | x_{i−1}, ..., x_1)
= − Σ_{i=1}^n Σ_{x_1, x_2, ..., x_i} p(x_1, ..., x_i) log p(x_i | x_{i−1}, ..., x_1)
= Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1)
Definition:
The conditional mutual information of r.v.s X and Y given Z is defined by

I(X;Y|Z) = H(X|Z) − H(X|Y, Z)
= E_{p(x,y,z)} log [ p(X, Y|Z) / (p(X|Z) p(Y|Z)) ]
Theorem: (chain rule for mutual information)

I(X_1, X_2, ..., X_n; Y) = Σ_{i=1}^n I(X_i; Y | X_{i−1}, ..., X_1)

proof:
I(X_1, X_2, ..., X_n; Y)
= H(X_1, X_2, ..., X_n) − H(X_1, X_2, ..., X_n | Y)
= Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1) − Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1, Y)
= Σ_{i=1}^n I(X_i; Y | X_1, X_2, ..., X_{i−1})
Definition:
The conditional relative entropy D(p(y|x) || q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x):

D(p(y|x) || q(y|x)) = Σ_x p(x) Σ_y p(y|x) log [ p(y|x) / q(y|x) ]
= E_{p(x,y)} log [ p(Y|X) / q(Y|X) ]

Theorem: (Chain rule for relative entropy)

D(p(x,y) || q(x,y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))
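The chain rule for relative entropy can be checked on a small example. In this Python sketch the two joint pmfs are arbitrary choices of ours:

```python
import math

# Two joint pmfs p(x,y), q(x,y) on {0,1}^2, chosen arbitrarily:
p = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
q = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.30}

def kl(pd, qd):
    return sum(pd[k] * math.log2(pd[k] / qd[k]) for k in pd if pd[k] > 0)

px = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}
qx = {x: q[(x, 0)] + q[(x, 1)] for x in (0, 1)}

# D(p(y|x) || q(y|x)) = sum_x p(x) sum_y p(y|x) log2 [ p(y|x) / q(y|x) ]
D_cond = sum(
    px[x] * sum(
        (p[(x, y)] / px[x]) * math.log2((p[(x, y)] / px[x]) / (q[(x, y)] / qx[x]))
        for y in (0, 1) if p[(x, y)] > 0
    )
    for x in (0, 1)
)

# Chain rule: D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
print(abs(kl(p, q) - (kl(px, qx) + D_cond)) < 1e-9)  # True
```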
Jensen's Inequality and Its Consequences

Definition: A function f is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

f(λx_1 + (1−λ)x_2) ≤ λf(x_1) + (1−λ)f(x_2)

A function f is said to be strictly convex if equality holds only if λ = 0 or λ = 1.

Definition: A function f is concave if −f is convex.

Ex: convex functions: x², |x|, e^x, x log x (for x ≥ 0)
concave functions: log x, x^{1/2} for x ≥ 0
both convex and concave: ax + b; linear functions
Theorem:
If the function f has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex).

Notation:
EX = Σ_{x∈X} x p(x) (discrete case)
EX = ∫ x p(x) dx (continuous case)
Theorem: (Jensen's inequality): If f(x) is a convex function and X is a random variable, then Ef(X) ≥ f(EX).

Proof: (Mathematical Induction) For a two-mass-point distribution, the inequality becomes

p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2), with p_1 + p_2 = 1,

which follows directly from the definition of convex functions. Suppose the theorem is true for distributions with k − 1 mass points. Then writing p_i' = p_i / (1 − p_k) for i = 1, 2, ..., k − 1, we have

Σ_{i=1}^k p_i f(x_i) = p_k f(x_k) + (1 − p_k) Σ_{i=1}^{k−1} p_i' f(x_i)
≥ p_k f(x_k) + (1 − p_k) f( Σ_{i=1}^{k−1} p_i' x_i )
≥ f( p_k x_k + (1 − p_k) Σ_{i=1}^{k−1} p_i' x_i )
= f( Σ_{i=1}^k p_i x_i )
= f(EX)

where the first inequality follows from the induction hypothesis and the second from the definition of convexity. The proof can be extended to continuous distributions by continuity arguments.
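A quick numeric illustration of Jensen's inequality with the convex function f(x) = x² (the mass points and probabilities below are arbitrary choices of ours):

```python
# f(x) = x^2 is convex, so Jensen gives E f(X) >= f(E X).
xs = [1.0, 2.0, 5.0]   # mass points
ps = [0.2, 0.5, 0.3]   # probabilities (sum to 1)

EX = sum(p * x for p, x in zip(ps, xs))
Ef = sum(p * x ** 2 for p, x in zip(ps, xs))
print(Ef >= EX ** 2)   # True

# For a degenerate distribution (X constant) the two sides coincide:
xs2 = [3.0, 3.0, 3.0]
EX2 = sum(p * x for p, x in zip(ps, xs2))
Ef2 = sum(p * x ** 2 for p, x in zip(ps, xs2))
print(abs(Ef2 - EX2 ** 2) < 1e-9)  # True
```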
Theorem: (Information inequality):

Let p(x), q(x), x ∈ X, be two probability mass functions. Then

D(p||q) ≥ 0

with equality iff p(x) = q(x) for all x.

Proof: Let A = {x : p(x) > 0} be the support set of p(x). Then

D(p||q) = Σ_{x∈A} p(x) log [p(x)/q(x)]
= − Σ_{x∈A} p(x) log [q(x)/p(x)]
= − E_p log [q(X)/p(X)]
≥ − log E_p [q(X)/p(X)]   (Jensen: −log t is convex)
= − log Σ_{x∈A} p(x) [q(x)/p(x)]
= − log Σ_{x∈A} q(x)
≥ − log Σ_{x∈X} q(x)
= − log 1 = 0
Corollary: (Non-negativity of mutual information):
For any two r.v.s X, Y,
I(X;Y) ≥ 0
with equality iff X and Y are independent.

Proof:
I(X;Y) = D(p(x,y) || p(x)p(y)) ≥ 0, with equality iff p(x,y) = p(x)p(y), i.e., X and Y are independent.

Corollary:
D(p(y|x) || q(y|x)) ≥ 0
with equality iff p(y|x) = q(y|x) for all x and y with p(x) > 0.

Corollary:
I(X;Y|Z) ≥ 0
with equality iff X and Y are conditionally independent given Z.
Theorem: H(X) ≤ log|X|, where |X| denotes the number of elements in the range of X, with equality iff X has a uniform distribution over X.

Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then

D(p||u) = Σ_x p(x) log [p(x)/u(x)] = log|X| − H(X)

Hence, by the non-negativity of relative entropy,

0 ≤ D(p||u) = log|X| − H(X)
Theorem: (conditioning reduces entropy):

H(X|Y) ≤ H(X)
with equality iff X and Y are independent.

Proof: 0 ≤ I(X;Y) = H(X) − H(X|Y)

Note that this is true only on the average; specifically, H(X|Y = y) may be greater than, less than, or equal to H(X), but on the average H(X|Y) = Σ_y p(y) H(X|Y = y) ≤ H(X).
Ex: Let (X, Y) have the following joint distribution:

p(x, y)   X=1   X=2
Y=1        0    3/4
Y=2       1/8   1/8

Then, H(X) = H(1/8, 7/8) = 0.544 bits,
H(X|Y=1) = 0 bits,
H(X|Y=2) = 1 bit > H(X).
However, H(X|Y) = 3/4 H(X|Y=1) + 1/4 H(X|Y=2) = 0.25 bits < H(X).
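The numbers in this example can be reproduced directly (the helper name H is ours):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

HX = H([1/8, 7/8])                                  # marginal of X
H_X_given_Y = 3/4 * H([0, 1]) + 1/4 * H([1/2, 1/2]) # average over Y
print(round(HX, 3))       # 0.544
print(H_X_given_Y)        # 0.25
print(H_X_given_Y < HX)   # True: conditioning reduced entropy on average
```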
Theorem: (Independence bound on entropy):
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) ≤ Σ_{i=1}^n H(X_i)

with equality iff the X_i are independent.

Proof: By the chain rule for entropies,

H(X_1, X_2, ..., X_n) = Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1) ≤ Σ_{i=1}^n H(X_i)

with equality iff the X_i's are independent.
The LOG SUM INEQUALITY AND ITS APPLICATIONS

Theorem: (Log sum inequality)
For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

Σ_{i=1}^n a_i log (a_i/b_i) ≥ ( Σ_{i=1}^n a_i ) log [ ( Σ_{i=1}^n a_i ) / ( Σ_{i=1}^n b_i ) ]

with equality iff a_i/b_i = constant.

Some conventions: 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.
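Both the inequality and its equality condition can be checked numerically. The vectors below are arbitrary positive examples of ours:

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]
lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs >= rhs)  # True: the log sum inequality holds

# Equality when a_i / b_i is constant (here a_i / b2_i = 1/2 for all i):
b2 = [2 * ai for ai in a]
lhs2 = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b2))
rhs2 = sum(a) * math.log2(sum(a) / sum(b2))
print(abs(lhs2 - rhs2) < 1e-12)  # True
```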
Proof: Assume w.l.o.g. that a_i > 0 and b_i > 0. The function f(t) = t log t is strictly convex, since

f''(t) = (1/t) log e > 0

for all positive t. Hence by Jensen's inequality, we have

Σ_i α_i f(t_i) ≥ f( Σ_i α_i t_i )   for α_i ≥ 0, Σ_i α_i = 1.

Setting α_i = b_i / Σ_{j=1}^n b_j and t_i = a_i / b_i, we obtain

Σ_i [ b_i / Σ_j b_j ] (a_i/b_i) log (a_i/b_i) ≥ [ Σ_i (b_i / Σ_j b_j)(a_i/b_i) ] log [ Σ_i (b_i / Σ_j b_j)(a_i/b_i) ]

(note that Σ_i (b_i / Σ_j b_j)(a_i/b_i) = Σ_i a_i / Σ_j b_j ≥ 0), i.e.,

Σ_i [ a_i / Σ_j b_j ] log (a_i/b_i) ≥ [ Σ_i a_i / Σ_j b_j ] log [ Σ_i a_i / Σ_j b_j ]

Multiplying both sides by Σ_j b_j gives

Σ_i a_i log (a_i/b_i) ≥ ( Σ_i a_i ) log ( Σ_i a_i / Σ_j b_j )

which is the log sum inequality.
Reproving the theorem that D(p||q) ≥ 0, with equality iff p(x) = q(x):

D(p||q) = Σ_x p(x) log [p(x)/q(x)]
≥ ( Σ_x p(x) ) log ( Σ_x p(x) / Σ_x q(x) )   (from the log-sum inequality)
= 1 · log (1/1) = 0

with equality iff p(x)/q(x) = c. Since both p and q are probability mass functions, c = 1, i.e., p(x) = q(x), ∀x.
Theorem: D(p||q) is convex in the pair (p, q); i.e., if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λD(p_1||q_1) + (1−λ)D(p_2||q_2)

for all 0 ≤ λ ≤ 1.

Proof: Apply the log-sum inequality to a term on the left-hand side, with a_1 = λp_1(x), a_2 = (1−λ)p_2(x), b_1 = λq_1(x), b_2 = (1−λ)q_2(x):

[λp_1(x) + (1−λ)p_2(x)] log { [λp_1(x) + (1−λ)p_2(x)] / [λq_1(x) + (1−λ)q_2(x)] }
≤ λp_1(x) log [ λp_1(x) / λq_1(x) ] + (1−λ)p_2(x) log [ (1−λ)p_2(x) / (1−λ)q_2(x) ]
= λp_1(x) log [ p_1(x)/q_1(x) ] + (1−λ)p_2(x) log [ p_2(x)/q_2(x) ]

Summing over x gives

D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λD(p_1||q_1) + (1−λ)D(p_2||q_2)
Theorem: (concavity of entropy):

H(p) is a concave function of p.
That is: H(λp_1 + (1−λ)p_2) ≥ λH(p_1) + (1−λ)H(p_2)

Proof: H(p) = log|X| − D(p||u)

where u is the uniform distribution on |X| outcomes. The concavity of H then follows directly from the convexity of D.
Theorem: Let (X, Y) ~ p(x, y) = p(x) p(y|x).

The mutual information I(X;Y) is
(i) a concave function of p(x) for fixed p(y|x);
(ii) a convex function of p(y|x) for fixed p(x).

Proof:

(1) I(X;Y) = H(Y) − H(Y|X) = H(Y) − Σ_x p(x) H(Y|X=x)   (*)
If p(y|x) is fixed, then p(y) is a linear function of p(x) ( p(y) = Σ_x p(x, y) = Σ_x p(x) p(y|x) ). Hence H(Y), which is a concave function of p(y), is a concave function of p(x). The second term of (*) is a linear function of p(x). Hence the difference is a concave function of p(x).
(2) We fix p(x) and consider two different conditional distributions p_1(y|x) and p_2(y|x). The corresponding joint distributions are p_1(x, y) = p(x) p_1(y|x) and p_2(x, y) = p(x) p_2(y|x), and their respective marginals are p(x), p_1(y) and p(x), p_2(y). Consider a conditional distribution

p_λ(y|x) = λp_1(y|x) + (1−λ)p_2(y|x)

that is a mixture of p_1(y|x) and p_2(y|x). Since p(x) is fixed, the corresponding joint distribution is also a mixture of the corresponding joint distributions,

p_λ(x, y) = λp_1(x, y) + (1−λ)p_2(x, y),

and the distribution of Y is also a mixture, p_λ(y) = λp_1(y) + (1−λ)p_2(y). Hence if we let q_λ(x, y) = p(x) p_λ(y) (the product of the marginal distributions), then q_λ(x, y) = λq_1(x, y) + (1−λ)q_2(x, y); i.e., with p(x) fixed, both p_λ(x, y) and q_λ(x, y) are linear in p_i(y|x).

Since I(X;Y) = D(p_λ||q_λ), and D(p||q) is convex in the pair (p, q), the mutual information is a convex function of the conditional distribution p(y|x) when p(x) is fixed.
Data processing inequality:

No clever manipulation of the data can improve the inferences that can be made from the data.

Definition: R.v.s X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. That is, if X → Y → Z form a Markov chain, then

(i) p(x, y, z) = p(x) p(y|x) p(z|y)
(ii) p(x, z|y) = p(x|y) p(z|y): X and Z are conditionally independent given Y

X → Y → Z implies that Z → Y → X.
If Z = f(Y), then X → Y → Z.
Theorem: (Data processing inequality)

If X → Y → Z, then I(X;Y) ≥ I(X;Z).
No processing of Y, deterministic or random, can increase the information that Y contains about X.

Proof: I(X;Y,Z) = I(X;Z) + I(X;Y|Z)   (chain rule)
= I(X;Y) + I(X;Z|Y)   (chain rule)

Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0. Since I(X;Y|Z) ≥ 0, we have I(X;Y) ≥ I(X;Z), with equality iff I(X;Y|Z) = 0, i.e., X → Z → Y forms a Markov chain. Similarly, one can prove I(Y;Z) ≥ I(X;Z).
Corollary: If X → Y → Z forms a Markov chain and Z = g(Y), we have I(X;Y) ≥ I(X;g(Y)):
functions of the data Y cannot increase the information about X.

Corollary: If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).
Proof: I(X;Y,Z) = I(X;Z) + I(X;Y|Z)
= I(X;Y) + I(X;Z|Y)
By Markovity, I(X;Z|Y) = 0, and I(X;Z) ≥ 0, so I(X;Y|Z) ≤ I(X;Y).
The dependence of X and Y is decreased (or remains unchanged) by the observation of a downstream r.v. Z.
Note that it is possible that I(X;Y|Z) > I(X;Y) when X, Y and Z do not form a Markov chain.

Ex: Let X and Y be independent fair binary r.v.s, and let Z = X + Y. Then I(X;Y) = 0, but

I(X;Y|Z) = H(X|Z) − H(X|Y, Z) = H(X|Z)
= P(Z = 1) H(X|Z = 1) = 1/2 bit.
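The 1/2-bit value in this example can be reproduced by enumerating the conditional distributions of X given Z (the helper name H is ours):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# X, Y independent fair bits, Z = X + Y; p(z): {0: 1/4, 1: 1/2, 2: 1/4}.
# I(X;Y) = 0 by independence, and H(X|Y,Z) = 0 because X = Z - Y,
# so I(X;Y|Z) = H(X|Z).
# Given Z = 0 or Z = 2, X is determined; given Z = 1, X is uniform on {0,1}:
H_X_given_Z = 1/4 * H([1]) + 1/2 * H([1/2, 1/2]) + 1/4 * H([1])
print(H_X_given_Z)  # 0.5, i.e. I(X;Y|Z) = 1/2 bit > I(X;Y) = 0
```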
Fano's inequality:

Fano's inequality relates the probability of error in guessing the r.v. X to its conditional entropy H(X|Y).

Note that:
The conditional entropy of a r.v. X given another random variable Y is zero iff X is a function of Y. (proof: HW)
H(X|Y) = 0 implies there is no uncertainty about X if we know Y: for all x with p(x) > 0, there is only one possible value of y with p(x, y) > 0.
Hence we can estimate X from Y with zero probability of error iff H(X|Y) = 0, and we expect to be able to estimate X with a low probability of error only if the conditional entropy H(X|Y) is small.
Fano's inequality quantifies this idea.
Suppose we wish to estimate a r.v. X with a distribution p(x). We observe a r.v. Y which is related to X by the conditional distribution p(y|x). From Y, we calculate a function X̂ = g(Y), which is an estimate of X. We wish to bound the probability that X̂ ≠ X. We observe that X → Y → X̂ forms a Markov chain.

Define the probability of error

P_e = Pr{X̂ ≠ X} = Pr{g(Y) ≠ X}
Theorem: (Fano's inequality)

For any estimator X̂ such that X → Y → X̂, with P_e = Pr(X̂ ≠ X), we have

H(P_e) + P_e log(|X| − 1) ≥ H(X|Y)

This inequality can be weakened to

1 + P_e log|X| ≥ H(X|Y)    (since H(P_e) ≤ 1 for a binary r.v. E, and log(|X| − 1) ≤ log|X|)

or

P_e ≥ [H(X|Y) − 1] / log|X|

Remark: P_e = 0 ⇒ H(X|Y) = 0
Proof: Define an error r.v.

E = 1 if X̂ ≠ X,
E = 0 if X̂ = X.

By the chain rule for entropies, we have

H(E, X | X̂) = H(X|X̂) + H(E|X, X̂)
            = H(E|X̂) + H(X|E, X̂)

Since E is a function of X and X̂, H(E|X, X̂) = 0. Since conditioning reduces entropy and E is a binary-valued r.v., H(E|X̂) ≤ H(E) = H(P_e). The remaining term, H(X|E, X̂), can be bounded as follows:

H(X|E, X̂) = Pr(E=0) H(X|X̂, E=0) + Pr(E=1) H(X|X̂, E=1)
≤ (1 − P_e) · 0 + P_e log(|X| − 1),
since given E = 0, X = X̂, and given E = 1, we can upper bound the conditional entropy by the log of the number of remaining outcomes (|X| − 1). Combining these results,

H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂).

By the data processing inequality, we have I(X;X̂) ≤ I(X;Y) since X → Y → X̂, and therefore H(X|X̂) ≥ H(X|Y). Thus we have

H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y).

Remark:
Suppose there is no knowledge of Y. Thus X must be guessed without any information. Let X ∈ {1, 2, ..., m} and P_1 ≥ P_2 ≥ ... ≥ P_m. Then the best guess of X is X̂ = 1, and the resulting probability of error is P_e = 1 − P_1. Fano's inequality becomes

H(P_e) + P_e log(m − 1) ≥ H(X)

The probability mass function

(P_1, P_2, ..., P_m) = (1 − P_e, P_e/(m−1), ..., P_e/(m−1))

achieves this bound with equality.
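The no-side-information case of Fano's inequality, and the pmf that achieves it with equality, can both be checked numerically (the example pmf [0.6, 0.3, 0.1] is our own choice):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# No side information: best guess is the most likely value, Pe = 1 - max p(x).
# Fano (with Y empty): H(Pe) + Pe * log2(m - 1) >= H(X).
p = [0.6, 0.3, 0.1]
m = len(p)
Pe = 1 - max(p)
bound = H([Pe, 1 - Pe]) + Pe * math.log2(m - 1)
print(bound >= H(p))  # True

# The pmf (1-Pe, Pe/(m-1), ..., Pe/(m-1)) meets the bound with equality:
q = [1 - Pe] + [Pe / (m - 1)] * (m - 1)
print(abs(bound - H(q)) < 1e-9)  # True
```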
Some Properties of the Relative Entropy

1. Let μ_n and μ_n' be two probability distributions on the state space of a Markov chain at time n, and let μ_{n+1} and μ_{n+1}' be the corresponding distributions at time n+1. Let the corresponding joint mass functions be denoted by p and q. That is,

p(x_n, x_{n+1}) = p(x_n) r(x_{n+1} | x_n)
q(x_n, x_{n+1}) = q(x_n) r(x_{n+1} | x_n)

where r(·|·) is the probability transition function for the Markov chain.
Then by the chain rule for relative entropy, we have the following two expansions:

D(p(x_n, x_{n+1}) || q(x_n, x_{n+1}))
= D(p(x_n) || q(x_n)) + D(p(x_{n+1}|x_n) || q(x_{n+1}|x_n))
= D(p(x_{n+1}) || q(x_{n+1})) + D(p(x_n|x_{n+1}) || q(x_n|x_{n+1}))

Since both p and q are derived from the same Markov chain,

p(x_{n+1}|x_n) = q(x_{n+1}|x_n) = r(x_{n+1}|x_n),

and hence

D(p(x_{n+1}|x_n) || q(x_{n+1}|x_n)) = 0
That is,

D(p(x_n) || q(x_n))
= D(p(x_{n+1}) || q(x_{n+1})) + D(p(x_n|x_{n+1}) || q(x_n|x_{n+1}))

Since D(p(x_n|x_{n+1}) || q(x_n|x_{n+1})) ≥ 0,

D(p(x_n) || q(x_n)) ≥ D(p(x_{n+1}) || q(x_{n+1}))

or D(μ_n || μ_n') ≥ D(μ_{n+1} || μ_{n+1}')

Conclusion:
The distance between the probability mass functions is decreasing with time n for any Markov chain.
2. The relative entropy D(μ_n || μ) between a distribution μ_n on the states at time n and a stationary distribution μ decreases with n.

In the last equation, if we let μ_n' be any stationary distribution μ, then μ_{n+1}' is the same stationary distribution. Hence

D(μ_n || μ) ≥ D(μ_{n+1} || μ)

Any state distribution gets closer and closer to each stationary distribution as time passes:

lim_{n→∞} D(μ_n || μ) = 0
3. Def: A probability transition matrix [P_ij], P_ij = Pr{x_{n+1} = j | x_n = i}, is called doubly stochastic if

Σ_i P_ij = 1, j = 1, 2, ...
and
Σ_j P_ij = 1, i = 1, 2, ...

The uniform distribution is a stationary distribution of P iff the probability transition matrix is doubly stochastic.
4. The conditional entropy H(X_n|X_1) increases with n for a stationary Markov process.

If the Markov process is stationary, then H(X_n) is constant. However, it can be proved that H(X_n|X_1) increases with n. This implies that the conditional uncertainty of the future increases.

Proof:
H(X_n|X_1) ≥ H(X_n|X_1, X_2)   (conditioning reduces entropy)
= H(X_n|X_2)   (by Markovity)
= H(X_{n−1}|X_1)   (by stationarity)

Similarly, H(X_0|X_n) is increasing in n for any Markov chain.
Sufficient Statistics

Suppose we have a family of probability mass functions {f_θ(x)} indexed by θ, and let X be a sample from a distribution in this family. Let T(X) be any statistic (function of the sample), like the sample mean or sample variance. Then

θ → X → T(X),

and by the data processing inequality, we have

I(θ; T(X)) ≤ I(θ; X)

for any distribution on θ. However, if equality holds, no information is lost.

A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.
Def:
A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if X is independent of θ given T(X), i.e., θ → T(X) → X forms a Markov chain.

Equivalently: I(θ; X) = I(θ; T(X)) for all distributions on θ.

Sufficient statistics preserve mutual information.
Some examples of Sufficient Statistics

1. Let X_1, X_2, ..., X_n, X_i ∈ {0, 1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter θ = Pr(X_i = 1).

Given n, the number of 1's,

T(X_1, X_2, ..., X_n) = Σ_{i=1}^n X_i,

is a sufficient statistic for θ.

Given T, all sequences having that many 1's are equally likely and independent of the parameter θ.
Pr{ (X_1, X_2, ..., X_n) = (x_1, x_2, ..., x_n) | Σ_{i=1}^n X_i = k }
= 1 / C(n, k)   if Σ_i x_i = k,
= 0             otherwise.

Thus θ → Σ_i X_i → (X_1, X_2, ..., X_n), and T is a sufficient statistic for θ.
2. If X is normally distributed with mean θ and variance 1, that is, if

f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2},   i.e., X ~ N(θ, 1),

and X_1, X_2, ..., X_n are drawn independently according to f_θ, then a sufficient statistic for θ is the sample mean

X̄_n = (1/n) Σ_{i=1}^n X_i.

It can be verified that the conditional distribution of X_1, X_2, ..., X_n, conditioned on X̄_n and n, is independent of θ.
The minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics.

Def:
A statistic T(X) is a minimal sufficient statistic relative to {f_θ(x)} if it is a function of every other sufficient statistic U:

θ → T(X) → U(X) → X

Hence, a minimal sufficient statistic maximally compresses the information about θ in the sample. Other sufficient statistics may contain additional irrelevant information.

The sufficient statistics of the above examples are minimal.
Shuffles increase Entropy:

If T is a shuffle (permutation) of a deck of cards and X is the initial (random) position of the cards in the deck, and if the choice of the shuffle T is independent of X, then

H(TX) ≥ H(X)

where TX is the permutation of the deck induced by the shuffle T on the initial permutation X.

Proof: H(TX) ≥ H(TX|T)   (conditioning reduces entropy)
= H(T⁻¹TX|T)   (why? given T, applying the fixed permutation T⁻¹ is one-to-one)
= H(X|T)
= H(X)   (if X and T are independent!)
If X and X' are i.i.d. with entropy H(X), then

Pr(X = X') ≥ 2^{−H(X)}

with equality iff X has a uniform distribution.

pf: Suppose X ~ p(x). (Let X and X' be two i.i.d. r.v.s with entropy H(X); the probability that X = X' is Pr(X = X') = Σ_x p²(x).) By Jensen's inequality,

2^{E log p(X)} ≤ E 2^{log p(X)}

which implies that

2^{−H(X)} = 2^{Σ_x p(x) log p(x)} ≤ Σ_x p(x) 2^{log p(x)} = Σ_x p²(x) = Pr(X = X').

Let X, X' be independent with X ~ p(x), X' ~ r(x), x, x' ∈ X. Then

Pr(X = X') ≥ 2^{−H(p) − D(p||r)}
Pr(X = X') ≥ 2^{−H(r) − D(r||p)}

pf: 2^{−H(p) − D(p||r)} = 2^{Σ_x p(x) log p(x) + Σ_x p(x) log [r(x)/p(x)]} = 2^{Σ_x p(x) log r(x)} ≤ Σ_x p(x) 2^{log r(x)} = Σ_x p(x) r(x) = Pr(X = X').
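The first bound and its equality condition can be checked numerically (the example pmf is our own choice):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

p = [1/2, 1/4, 1/4]
collision = sum(x * x for x in p)   # Pr(X = X') for i.i.d. X, X' ~ p
print(collision >= 2 ** (-H(p)))    # True

u = [1/4] * 4                       # uniform distribution: equality
print(abs(sum(x * x for x in u) - 2 ** (-H(u))) < 1e-12)  # True
```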
*Notice that the function f(y) = 2^y is convex, so Jensen's inequality gives 2^{EY} ≤ E 2^Y above.