
    Entropy, Relative Entropy and Mutual Information

    Prof. Ja-Ling Wu

    Department of Computer Science and Information Engineering, National Taiwan University


    Definition: The entropy H(X) of a discrete random variable X is defined by

        H(X) = −∑_{x∈X} p(x) log p(x)

    log is taken to base 2: H(p) is measured in bits.

    Convention: 0 log 0 = 0 (since x log x → 0 as x → 0), so adding terms of zero probability does not change the entropy.
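    A small Python sketch of this definition (standard library only; the four-symbol distribution is just an illustrative example, not one taken from the slides):

        from math import log2

        def entropy(p):
            """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 are skipped (0 log 0 = 0)."""
            return -sum(px * log2(px) for px in p if px > 0)

        # Example distribution over four symbols
        p = [0.5, 0.25, 0.125, 0.125]
        print(entropy(p))   # 1.75 bits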


    Note that entropy is a function of the distribution of X. It does not depend on the actual values taken by the r.v. X, but only on the probabilities.

    Remark: If X ~ p(x), then the expected value of the r.v. g(X) is written as

        E_p g(X) = ∑_{x∈X} g(x) p(x)

    The entropy of X can be interpreted as the expected value of the self-information log(1/p(X)):

        H(X) = E_p log [1/p(X)]


    Lemma 1.1: H(X) ≥ 0

    Lemma 1.2: H_b(X) = (log_b a) H_a(X)

    Ex: Let

        X = 1 with probability p,  X = 0 with probability 1−p.

    Then

        H(X) = −p log p − (1−p) log(1−p) =: H_2(p)

    [Figure: plot of the binary entropy H_2(p) as a function of p on [0, 1]]

    1) H(X) = 1 bit when p = 1/2
    2) H(X) is a concave function of p
    3) H(X) = 0 if p = 0 or 1
    4) max H(X) occurs when p = 1/2
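    A quick numerical check of these properties (the probabilities 0.1 and 0.9 below are arbitrary example values):

        from math import log2

        def H2(p):
            """Binary entropy H_2(p) = -p log2 p - (1-p) log2 (1-p), with H2(0) = H2(1) = 0."""
            if p in (0.0, 1.0):
                return 0.0
            return -p * log2(p) - (1 - p) * log2(1 - p)

        print(H2(0.5))            # 1.0 bit: the maximum (properties 1 and 4)
        print(H2(0.0), H2(1.0))   # 0.0 0.0 (property 3)
        # Concavity at one pair of points (property 2):
        print(H2(0.5 * 0.1 + 0.5 * 0.9) >= 0.5 * H2(0.1) + 0.5 * H2(0.9))   # True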


    Joint Entropy and Conditional Entropy

    Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

        H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y) = −E log p(X, Y)

    Definition: The conditional entropy H(Y|X) is defined as

        H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
               = −∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)
               = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
               = −E_{p(x,y)} log p(Y|X)
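    A sketch that computes H(X, Y) and H(Y|X) directly from a joint table (the joint pmf below is an arbitrary illustrative example):

        from math import log2

        p = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}   # example p(x, y)

        H_XY = -sum(v * log2(v) for v in p.values() if v > 0)          # H(X, Y)

        px = {}                                                         # marginal p(x)
        for (x, y), v in p.items():
            px[x] = px.get(x, 0.0) + v

        # H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x), with p(y|x) = p(x, y)/p(x)
        H_Y_given_X = -sum(v * log2(v / px[x]) for (x, y), v in p.items() if v > 0)

        H_X = -sum(v * log2(v) for v in px.values() if v > 0)
        print(H_XY, H_X + H_Y_given_X)   # equal values, anticipating the chain rule below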


    Theorem 1.1 (Chain Rule): H(X, Y) = H(X) + H(Y|X)

    pf:
        H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
                = −∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x) p(y|x)]
                = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
                = −∑_{x∈X} p(x) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
                = H(X) + H(Y|X)

    or, equivalently, we can write

        log p(X, Y) = log p(X) + log p(Y|X)

    and take expectations of both sides.


    Corollary:

    H(X, Y|Z) = H(X|Z) + H(Y|X,Z)

    Remark:

    (i) H(Y|X) ≠ H(X|Y)

    (ii) H(X) − H(X|Y) = H(Y) − H(Y|X)


    Relative Entropy and Mutual Information

    The entropy of a random variable is a measure of the uncertainty of the random variable; it is a measure of the amount of information required on the average to describe the random variable.

    The relative entropy is a measure of the distance between two distributions. In statistics, it arises as an expected logarithm of the likelihood ratio. The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.


    Ex: If we knew the true distribution of the r.v., then we could construct a code with average description length H(p). If, instead, we used the code for a distribution q, we would need H(p) + D(p||q) bits on the average to describe the r.v.


    Definition:

    The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

        D(p||q) = ∑_{x∈X} p(x) log [p(x)/q(x)]
                = E_p log [p(X)/q(X)]
                = E_p log [1/q(X)] − E_p log [1/p(X)]


    Definition:

    Consider two r.v.s X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y), i.e.,

        I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x, y) / (p(x) p(y))]
               = D( p(x, y) || p(x) p(y) )
               = E_{p(x,y)} log [p(X, Y) / (p(X) p(Y))]
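    A sketch computing I(X;Y) as the relative entropy between the joint distribution and the product of its marginals (the joint table is an arbitrary illustrative example):

        from math import log2

        p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # example joint p(x, y)

        px, py = {}, {}
        for (x, y), v in p.items():
            px[x] = px.get(x, 0.0) + v
            py[y] = py.get(y, 0.0) + v

        I_XY = sum(v * log2(v / (px[x] * py[y])) for (x, y), v in p.items() if v > 0)
        print(I_XY)   # I(X;Y) = D(p(x, y) || p(x) p(y)), in bits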


    Ex: Let X = {0, 1} and consider two distributions p and q on X. Let p(0) = 1−r, p(1) = r, and let q(0) = 1−s, q(1) = s. Then

        D(p||q) = p(0) log [p(0)/q(0)] + p(1) log [p(1)/q(1)]
                = (1−r) log [(1−r)/(1−s)] + r log (r/s)

    and

        D(q||p) = q(0) log [q(0)/p(0)] + q(1) log [q(1)/p(1)]
                = (1−s) log [(1−s)/(1−r)] + s log (s/r)

    If r = s, then D(p||q) = D(q||p) = 0. While, in general,

        D(p||q) ≠ D(q||p)
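    A numerical illustration of this asymmetry (r = 1/2 and s = 1/4 are arbitrary example values):

        from math import log2

        def D(p, q):
            """Relative entropy D(p||q) in bits for pmfs given as lists."""
            return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

        r, s = 0.5, 0.25
        p = [1 - r, r]   # p(0), p(1)
        q = [1 - s, s]   # q(0), q(1)
        print(D(p, q), D(q, p))   # about 0.2075 and 0.1887: D(p||q) != D(q||p) in general
        print(D(p, p))            # 0.0 when the two distributions coincide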


    Relationship between Entropy and Mutual Information

    Rewrite I(X;Y) as

        I(X;Y) = ∑_{x,y} p(x, y) log [p(x, y) / (p(x) p(y))]
               = ∑_{x,y} p(x, y) log [p(x|y) / p(x)]
               = −∑_{x,y} p(x, y) log p(x) + ∑_{x,y} p(x, y) log p(x|y)
               = −∑_{x} p(x) log p(x) − ( −∑_{x,y} p(x, y) log p(x|y) )
               = H(X) − H(X|Y)


    Thus the mutual information I(X;Y) is the reduction in the uncertainty of X due to the knowledge of Y.

    By symmetry, it follows that

        I(X;Y) = H(Y) − H(Y|X)

    X says as much about Y as Y says about X.

    Since H(X, Y) = H(X) + H(Y|X), we also have I(X;Y) = H(X) + H(Y) − H(X, Y).

        I(X;X) = H(X) − H(X|X) = H(X)

    The mutual information of a r.v. with itself is the entropy of the r.v.; hence entropy is also called self-information.


    Theorem: (Mutual information and entropy):

    i. I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

             = H(X) + H(Y) − H(X, Y)

    ii. I(X;Y) = I(Y;X)

    iii. I(X;X) = H(X)

    [Figure: Venn diagram relating H(X), H(Y), H(X|Y), H(Y|X), I(X;Y), and H(X, Y)]


    Chain Rules for Entropy, Relative Entropy and Mutual Information

    Theorem: (Chain rule for entropy)

    Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n).

    Then

        H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1)
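    A sketch that checks the chain rule on a three-variable joint pmf, computing each conditional entropy H(X_i | X_{i−1}, ..., X_1) directly (the joint probabilities are arbitrary example values):

        from math import log2

        p = {(0, 0, 0): 0.10, (0, 0, 1): 0.20, (0, 1, 0): 0.05, (0, 1, 1): 0.15,
             (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.20, (1, 1, 1): 0.10}

        def marginal(dist, k):
            """Marginal pmf of the first k coordinates."""
            out = {}
            for xs, v in dist.items():
                out[xs[:k]] = out.get(xs[:k], 0.0) + v
            return out

        def H(dist):
            return -sum(v * log2(v) for v in dist.values() if v > 0)

        def cond_entropy(dist, i):
            """H(X_i | X_1, ..., X_{i-1}), with i counted from 1."""
            joint_i, past = marginal(dist, i), marginal(dist, i - 1)
            return -sum(v * log2(v / past[xs[:-1]]) for xs, v in joint_i.items() if v > 0)

        print(H(p))                                           # H(X1, X2, X3)
        print(sum(cond_entropy(p, i) for i in range(1, 4)))   # same value, via the chain rule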


    Proof

    (1)
        H(X_1, X_2) = H(X_1) + H(X_2|X_1)

        H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1)
                         = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1)
        ...
        H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_{n−1}, ..., X_1)
                              = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1)


    (2) We write

        p(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} p(x_i | x_{i−1}, ..., x_1);

    then

        H(X_1, X_2, ..., X_n)
          = −∑_{x_1, x_2, ..., x_n} p(x_1, x_2, ..., x_n) log p(x_1, x_2, ..., x_n)
          = −∑_{x_1, x_2, ..., x_n} p(x_1, x_2, ..., x_n) log ∏_{i=1}^{n} p(x_i | x_{i−1}, ..., x_1)
          = −∑_{x_1, x_2, ..., x_n} ∑_{i=1}^{n} p(x_1, x_2, ..., x_n) log p(x_i | x_{i−1}, ..., x_1)
          = −∑_{i=1}^{n} ∑_{x_1, x_2, ..., x_i} p(x_1, x_2, ..., x_i) log p(x_i | x_{i−1}, ..., x_1)
          = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1)


    Definition:

    The conditional mutual information of r.v.s X and Y given Z is defined by

        I(X;Y|Z) = H(X|Z) − H(X|Y, Z)
                 = E_{p(x,y,z)} log [p(X, Y|Z) / (p(X|Z) p(Y|Z))]


    Theorem: (Chain rule for mutual information)

        I(X_1, X_2, ..., X_n ; Y) = ∑_{i=1}^{n} I(X_i ; Y | X_1, X_2, ..., X_{i−1})

    proof:

        I(X_1, X_2, ..., X_n ; Y)
          = H(X_1, X_2, ..., X_n) − H(X_1, X_2, ..., X_n | Y)
          = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1) − ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1, Y)
          = ∑_{i=1}^{n} I(X_i ; Y | X_1, X_2, ..., X_{i−1})


    Definition:

    The conditional relative entropy D(p(y|x) || q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x).

    Theorem: (Chain rule for relative entropy)

    D(p(x,y)||q(x,y)) = D(p(x)||q(x))+ D(p(y|x)||q(y|x))

    Explicitly,

        D( p(y|x) || q(y|x) ) = ∑_x p(x) ∑_y p(y|x) log [p(y|x) / q(y|x)]
                              = E_{p(x,y)} log [p(Y|X) / q(Y|X)]


    Jensen's Inequality and Its Consequences

    Definition: A function f is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

        f(λx_1 + (1−λ)x_2) ≤ λ f(x_1) + (1−λ) f(x_2)

    A function f is said to be strictly convex if equality holds only if λ = 0 or λ = 1.

    Definition: A function f is concave if −f is convex.

    Ex: convex functions: x², |x|, e^x, x log x (for x ≥ 0)

    concave functions: log x, x^(1/2) for x ≥ 0

    both convex and concave: ax + b; linear functions


    Theorem:

    If the function f has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex).

        E X = ∑_{x∈X} x p(x)     (discrete case)

        E X = ∫ x p(x) dx        (continuous case)


    Theorem: (Jensen's inequality): If f is a convex function and X is a random variable, then E f(X) ≥ f(E X).

    Proof: For a two mass point distribution, the inequality becomes

        p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),   p_1 + p_2 = 1,

    which follows directly from the definition of convex functions. Suppose the theorem is true for distributions with K−1 mass points. (Mathematical induction.)

    Then, writing p'_i = p_i/(1−p_K) for i = 1, 2, ..., K−1, we have

        ∑_{i=1}^{K} p_i f(x_i) = p_K f(x_K) + (1−p_K) ∑_{i=1}^{K−1} p'_i f(x_i)
                               ≥ p_K f(x_K) + (1−p_K) f( ∑_{i=1}^{K−1} p'_i x_i )
                               ≥ f( p_K x_K + (1−p_K) ∑_{i=1}^{K−1} p'_i x_i )
                               = f( ∑_{i=1}^{K} p_i x_i )

    The proof can be extended to continuous distributions by continuity arguments.
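    A quick numerical check of Jensen's inequality for the convex function f(x) = x^2 (the mass points and probabilities are arbitrary example values):

        xs = [-1.0, 0.0, 2.0, 5.0]
        ps = [0.1, 0.2, 0.4, 0.3]

        f = lambda x: x * x                             # a convex function
        Ef = sum(p * f(x) for p, x in zip(ps, xs))      # E f(X)
        fE = f(sum(p * x for p, x in zip(ps, xs)))      # f(E X)
        print(Ef, fE, Ef >= fE)                         # 9.2  4.84...  True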


    Theorem: (Information inequality):

    Let p(x), q(x), x ∈ X, be two probability mass functions. Then

        D(p||q) ≥ 0

    with equality iff p(x)=q(x) for all x.

    Proof: Let A={x:p(x)>0} be the support set of p(x). Then

        D(p||q) = ∑_{x∈A} p(x) log [p(x)/q(x)]
                = −∑_{x∈A} p(x) log [q(x)/p(x)]
                ≥ −log ∑_{x∈A} p(x) [q(x)/p(x)]      (Jensen; log t is concave)
                = −log ∑_{x∈A} q(x)
                ≥ −log ∑_{x∈X} q(x)
                = −log 1 = 0


    Corollary: (Non-negativity of mutual information):

    For any two rvs., X, Y,

        I(X;Y) ≥ 0

    with equality iff X and Y are independent.

    Proof:

    I(X;Y) = D(p(x, y) || p(x) p(y)) ≥ 0, with equality iff p(x, y) = p(x) p(y), i.e., X and Y are independent.

    Corollary:

        D(p(y|x) || q(y|x)) ≥ 0

    with equality iff p(y|x) = q(y|x) for all x and y with p(x) > 0.

    Corollary:

        I(X;Y|Z) ≥ 0

    with equality iff X and Y are conditionally independent given Z.


    Theorem: H(X) ≤ log|X|, where |X| denotes the number of elements in the range of X, with equality iff X has a uniform distribution over X.

    Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then

        D(p||u) = ∑ p(x) log [p(x)/u(x)] = log|X| − H(X)

    Hence, by the non-negativity of relative entropy,

        0 ≤ D(p||u) = log|X| − H(X)


    Theorem: (conditioning reduces entropy):

        H(X|Y) ≤ H(X)

    with equality iff X and Y are independent.

    Proof: 0 ≤ I(X;Y) = H(X) − H(X|Y)

    Note that this is true only on the average; specifically, H(X|Y = y) may be greater than, less than, or equal to H(X), but on the average H(X|Y) = ∑_y p(y) H(X|Y = y) ≤ H(X).


    Ex: Let (X, Y) have the following joint distribution p(x, y):

                  X = 1    X = 2
        Y = 1       0       3/4
        Y = 2      1/8      1/8

    Then H(X) = H(1/8, 7/8) = 0.544 bits,

        H(X|Y=1) = 0 bits
        H(X|Y=2) = 1 bit  > H(X)

    However, H(X|Y) = 3/4 · H(X|Y=1) + 1/4 · H(X|Y=2) = 0.25 bits < H(X).
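    These numbers can be reproduced directly from the table (indexing the joint pmf by (x, y) as in the reconstruction above):

        from math import log2

        p = {(1, 1): 0.0, (2, 1): 3 / 4, (1, 2): 1 / 8, (2, 2): 1 / 8}   # p[(x, y)]

        px, py = {}, {}
        for (x, y), v in p.items():
            px[x] = px.get(x, 0.0) + v
            py[y] = py.get(y, 0.0) + v

        H = lambda d: -sum(v * log2(v) for v in d.values() if v > 0)
        H_X = H(px)                                                          # about 0.544 bits
        H_X_given_Y = -sum(v * log2(v / py[y]) for (x, y), v in p.items() if v > 0)
        print(H_X, H_X_given_Y)   # 0.5435...  0.25: H(X|Y) < H(X) on the average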


    Theorem: (Independence bound on entropy):

    Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

        H(X_1, X_2, ..., X_n) ≤ ∑_{i=1}^{n} H(X_i)

    with equality iff the X_i are independent.

    Proof: By the chain rule for entropies,

        H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1) ≤ ∑_{i=1}^{n} H(X_i)

    with equality iff the X_i's are independent.



    The LOG SUM INEQUALITY AND ITS APPLICATIONS

    Theorem: (Log sum inequality)

    For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

        ∑_{i=1}^{n} a_i log (a_i/b_i) ≥ ( ∑_{i=1}^{n} a_i ) log [ ( ∑_{i=1}^{n} a_i ) / ( ∑_{i=1}^{n} b_i ) ]

    with equality iff a_i/b_i = constant.

    Some conventions: 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.
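    A numerical check with arbitrary positive numbers:

        from math import log2

        a = [1.0, 2.0, 3.0]
        b = [2.0, 1.0, 4.0]

        lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
        rhs = sum(a) * log2(sum(a) / sum(b))
        print(lhs, rhs, lhs >= rhs)   # log sum inequality: lhs >= rhs, so this prints True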


    Proof: Assume w.l.o.g. that a_i > 0 and b_i > 0. The function f(t) = t log t is strictly convex, since

        f''(t) = (1/t) log e > 0

    for all positive t. Hence, by Jensen's inequality, we have

        ∑_i α_i f(t_i) ≥ f( ∑_i α_i t_i )      for α_i ≥ 0, ∑_i α_i = 1.

    Setting α_i = b_i / ∑_{j} b_j and t_i = a_i / b_i, we obtain

        ∑_i [ a_i / ∑_j b_j ] log (a_i/b_i) ≥ [ ∑_i a_i / ∑_j b_j ] log [ ∑_i a_i / ∑_j b_j ]

    Multiplying both sides by ∑_j b_j gives the log sum inequality.


    Reproving the theorem that D(p||q) ≥ 0, with equality iff p(x) = q(x):

        D(p||q) = ∑ p(x) log [p(x)/q(x)]
                ≥ ( ∑ p(x) ) log [ ∑ p(x) / ∑ q(x) ]      (from the log-sum inequality)
                = 1 · log (1/1) = 0

    with equality iff p(x)/q(x) = c. Since both p and q are probability mass functions, c = 1, i.e., p(x) = q(x) for all x.


    Theorem: D(p||q) is convex in the pair (p, q); i.e., if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

        D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λ D(p_1||q_1) + (1−λ) D(p_2||q_2)      for all 0 ≤ λ ≤ 1.

    Proof: Apply the log-sum inequality to a single term (fixed x) on the left-hand side, with a_1 = λp_1(x), a_2 = (1−λ)p_2(x), b_1 = λq_1(x), b_2 = (1−λ)q_2(x):

        [λp_1(x) + (1−λ)p_2(x)] log { [λp_1(x) + (1−λ)p_2(x)] / [λq_1(x) + (1−λ)q_2(x)] }
            ≤ λp_1(x) log [λp_1(x) / λq_1(x)] + (1−λ)p_2(x) log [(1−λ)p_2(x) / (1−λ)q_2(x)]
            = λ p_1(x) log [p_1(x)/q_1(x)] + (1−λ) p_2(x) log [p_2(x)/q_2(x)]

    Summing over all x gives

        D(λp_1 + (1−λ)p_2 || λq_1 + (1−λ)q_2) ≤ λ D(p_1||q_1) + (1−λ) D(p_2||q_2)


    Theorem: (concavity of entropy):

    H(p) is a concave function of p. That is,

        H(λp_1 + (1−λ)p_2) ≥ λ H(p_1) + (1−λ) H(p_2)

    Proof: H(p) = log|X| − D(p||u),

    where u is the uniform distribution on |X| outcomes. The concavity of H then follows

    directly from the convexity of D.


    Theorem: Let (X,Y)~p(x,y) = p(x)p(y|x).

    The mutual information I(X;Y) is
    (i) a concave function of p(x) for fixed p(y|x)

    (ii) a convex function of p(y|x) for fixed p(x).

    Proof:

    (1) I(X;Y) = H(Y) − H(Y|X) = H(Y) − ∑_x p(x) H(Y|X = x)      (*)

    If p(y|x) is fixed, then p(y) is a linear function of p(x)  ( p(y) = ∑_x p(x, y) = ∑_x p(x) p(y|x) ). Hence H(Y), which is a concave function of p(y), is a concave function of p(x). The second term of (*) is a linear function of p(x). Hence the difference is a concave function of p(x).


    (2) We fix p(x) and consider two different conditional distributions p_1(y|x) and p_2(y|x). The corresponding joint distributions are p_1(x, y) = p(x) p_1(y|x) and p_2(x, y) = p(x) p_2(y|x), and their respective marginals are p(x), p_1(y) and p(x), p_2(y).

    Consider a conditional distribution

        p_λ(y|x) = λ p_1(y|x) + (1−λ) p_2(y|x)

    that is a mixture of p_1(y|x) and p_2(y|x). The corresponding joint distribution is also a mixture of the corresponding joint distributions,

        p_λ(x, y) = λ p_1(x, y) + (1−λ) p_2(x, y),

    and the distribution of Y is also a mixture, p_λ(y) = λ p_1(y) + (1−λ) p_2(y).

    Hence, if we let q_λ(x, y) = p(x) p_λ(y) (the product of the marginal distributions), then q_λ(x, y) = λ q_1(x, y) + (1−λ) q_2(x, y).

    Since I(X;Y) = D(p_λ || q_λ), and D(p||q) is convex in the pair (p, q), and since (when p(x) is fixed) both p_λ(x, y) and q_λ(x, y) are linear in p_i(y|x), the mutual information is a convex function of the conditional distribution p(y|x). Therefore, the convexity of I(X;Y) follows from that of D(p||q) w.r.t. p_i(y|x) when p(x) is fixed.


    Data processing inequality:

    No clever manipulation of the data can improve the inferences that can be made from the data.

    Definition: R.v.s X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. That is, if X → Y → Z form a Markov chain, then

    (i) p(x, y, z) = p(x) p(y|x) p(z|y)

    (ii) p(x, z|y) = p(x|y) p(z|y) : X and Z are conditionally independent given Y

    X → Y → Z implies that Z → Y → X.

    If Z = f(Y), then X → Y → Z.


    Theorem: (Data processing inequality)

    If X → Y → Z, then I(X;Y) ≥ I(X;Z). No processing of Y, deterministic or random, can increase the information that Y contains about X.

    Proof: I(X; Y, Z) = I(X;Z) + I(X;Y|Z)   : chain rule

                      = I(X;Y) + I(X;Z|Y)   : chain rule

    Since X and Z are independent given Y, we have

    I(X;Z|Y) = 0. Since I(X;Y|Z) ≥ 0, we have I(X;Y) ≥ I(X;Z),

    with equality iff I(X;Y|Z) = 0, i.e., X → Z → Y forms a Markov chain. Similarly, one can prove I(Y;Z) ≥ I(X;Z).


    Corollary: If X → Y → Z forms a Markov chain and if Z = g(Y), we have I(X;Y) ≥ I(X; g(Y))

    : functions of the data Y cannot increase the information about X.

    Corollary: If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).

    Proof: I(X; Y, Z) = I(X;Z) + I(X;Y|Z)

                      = I(X;Y) + I(X;Z|Y)

    By Markovity, I(X;Z|Y) = 0,

    and I(X;Z) ≥ 0  ⇒  I(X;Y|Z) ≤ I(X;Y). The dependence of X and Y is decreased (or remains unchanged) by the observation of a downstream r.v. Z.


    Note that it is possible that I(X;Y|Z)>I(X;Y)

    when X,Y and Z do not form a Markov chain.

    Ex: Let X and Y be independent fair binary r.v.s, and let Z = X + Y. Then I(X;Y) = 0, but

        I(X;Y|Z) = H(X|Z) − H(X|Y, Z) = H(X|Z)
                 = P(Z = 1) H(X|Z = 1) = 1/2 bit.

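    This example can be verified numerically (X and Y are independent fair bits and Z = X + Y; the mutual informations are computed from the joint pmf of (X, Y, Z)):

        from math import log2
        from itertools import product

        p = {(x, y, x + y): 0.25 for x, y in product((0, 1), repeat=2)}   # joint pmf of (X, Y, Z)

        def H(dist):
            return -sum(v * log2(v) for v in dist.values() if v > 0)

        def marg(dist, idx):
            out = {}
            for xs, v in dist.items():
                key = tuple(xs[i] for i in idx)
                out[key] = out.get(key, 0.0) + v
            return out

        I_XY = H(marg(p, (0,))) + H(marg(p, (1,))) - H(marg(p, (0, 1)))
        I_XY_given_Z = H(marg(p, (0, 2))) + H(marg(p, (1, 2))) - H(marg(p, (2,))) - H(p)
        print(I_XY, I_XY_given_Z)   # 0.0 and 0.5: conditioning on Z = X + Y creates dependence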


    Fano's inequality:

    Fano's inequality relates the probability of error in guessing the r.v. X to its conditional entropy H(X|Y).

    Note that:

    The conditional entropy of a r.v. X given another random variable Y is zero iff X is a function of Y.

    proof: HW. ⇒ We can estimate X from Y with zero probability of error iff H(X|Y) = 0.

    We expect to be able to estimate X with a low probability of error only if the conditional entropy H(X|Y) is small.

    Fano's inequality quantifies this idea.

    H(X|Y) = 0 implies there is no uncertainty about X if we know Y:

    for all y with p(y) > 0, there is only one possible value of x with p(x, y) > 0.


    Suppose we wish to estimate a r.v. X with a distribution p(x). We observe a r.v. Y which is related to X by the conditional distribution p(y|x). From Y, we calculate a function

        X̂ = g(Y)

    which is an estimate of X. We wish to bound the probability that X̂ ≠ X. We observe that

        X → Y → X̂

    forms a Markov chain. Define the probability of error

        P_e = Pr{X̂ ≠ X} = Pr{g(Y) ≠ X}


    Theorem: (Fano's inequality)

    For any estimator X̂ such that X → Y → X̂, with P_e = Pr(X̂ ≠ X), we have

        H(P_e) + P_e log(|X| − 1) ≥ H(X|Y)

    This inequality can be weakened to

        1 + P_e log|X| ≥ H(X|Y)

    or

        P_e ≥ [H(X|Y) − 1] / log|X|

    (using H(P_e) ≤ 1, since E is a binary r.v., and log(|X| − 1) ≤ log|X|).

    Remark: P_e = 0 ⇒ H(X|Y) = 0
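    A sketch checking the bound on a small example (the joint pmf is an arbitrary illustrative choice, and the estimator used is the MAP guess X̂(y) = argmax_x p(x|y), one admissible estimator among many):

        from math import log2

        p = {(0, 0): 0.30, (1, 0): 0.05, (2, 0): 0.05,     # example joint pmf p(x, y)
             (0, 1): 0.05, (1, 1): 0.25, (2, 1): 0.05,
             (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.15}

        xs = {x for x, _ in p}
        ys = {y for _, y in p}
        py = {y: sum(p[(x, y)] for x in xs) for y in ys}

        H_X_given_Y = -sum(v * log2(v / py[y]) for (x, y), v in p.items() if v > 0)
        Pe = 1.0 - sum(max(p[(x, y)] for x in xs) for y in ys)   # error of the MAP estimator

        H2 = lambda q: 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)
        print(H2(Pe) + Pe * log2(len(xs) - 1) >= H_X_given_Y)    # Fano's inequality: True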


    Proof: Define an error r.v.

        E = 1 if X̂ ≠ X,   E = 0 if X̂ = X.

    By the chain rule for entropies, we have

        H(E, X | X̂) = H(X | X̂) + H(E | X, X̂)
                    = H(E | X̂) + H(X | E, X̂)

    Since conditioning reduces entropy, H(E|X̂) ≤ H(E). Now, since E is a function of X and X̂, H(E|X, X̂) = 0. Since E is a binary-valued r.v., H(E) = H(P_e).

    The remaining term, H(X|E, X̂), can be bounded as follows:

        H(X|E, X̂) = Pr(E=0) H(X|X̂, E=0) + Pr(E=1) H(X|X̂, E=1)
                  ≤ (1 − P_e) · 0 + P_e log(|X| − 1),


    since, given E = 0, X = X̂, and given E = 1, we can upper bound the conditional entropy by the log of the number of remaining outcomes (|X| − 1).

    This gives H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂). By the data processing inequality, we have I(X;X̂) ≤ I(X;Y) since X → Y → X̂, and therefore H(X|X̂) ≥ H(X|Y). Thus we have

        H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y).

    Remark: Suppose there is no knowledge of Y. Thus X must be guessed without any information. Let X ∈ {1, 2, ..., m} and P_1 ≥ P_2 ≥ ... ≥ P_m. Then the best guess of X is X̂ = 1 and the resulting probability of error is P_e = 1 − P_1.

    Fano's inequality becomes

        H(P_e) + P_e log(m − 1) ≥ H(X)

    The probability mass function

        (P_1, P_2, ..., P_m) = (1 − P_e, P_e/(m−1), ..., P_e/(m−1))

    achieves this bound with equality.



    Some Properties of the Relative Entropy

    1. Let μ_n and μ'_n be two probability distributions on the state space of a Markov chain at time n, and let μ_{n+1} and μ'_{n+1} be the corresponding distributions at time n+1. Let the corresponding joint mass functions be denoted by p and q.

    That is,

        p(x_n, x_{n+1}) = p(x_n) r(x_{n+1} | x_n)

        q(x_n, x_{n+1}) = q(x_n) r(x_{n+1} | x_n)

    where r(·|·) is the probability transition function for the Markov chain.


    Then by the chain rule for relative entropy, we have

    the following two expansions:

        D( p(x_n, x_{n+1}) || q(x_n, x_{n+1}) )
          = D( p(x_n) || q(x_n) ) + D( p(x_{n+1}|x_n) || q(x_{n+1}|x_n) )
          = D( p(x_{n+1}) || q(x_{n+1}) ) + D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) )

    Since both p and q are derived from the same Markov chain,

        p(x_{n+1}|x_n) = q(x_{n+1}|x_n) = r(x_{n+1}|x_n),

    and hence

        D( p(x_{n+1}|x_n) || q(x_{n+1}|x_n) ) = 0


    That is,

        D( p(x_n) || q(x_n) ) = D( p(x_{n+1}) || q(x_{n+1}) ) + D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) )

    Since D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) ) ≥ 0,

        D( p(x_n) || q(x_n) ) ≥ D( p(x_{n+1}) || q(x_{n+1}) )

    or  D(μ_n || μ'_n) ≥ D(μ_{n+1} || μ'_{n+1})

    Conclusion:

    The distance between the probability mass functions is decreasing with time n for any Markov chain.


    2. The relative entropy D(μ_n || μ) between a distribution μ_n on the states at time n and a stationary distribution μ decreases with n.

    In the last equation, if we let μ'_n be any stationary distribution μ, then μ'_{n+1} is the same stationary distribution. Hence

        D(μ_n || μ) ≥ D(μ_{n+1} || μ)

    Any state distribution gets closer and closer to each stationary distribution as time passes:

        lim_{n→∞} D(μ_n || μ) = 0


    3. Def: A probability transition matrix [P_ij], P_ij = Pr{X_{n+1} = j | X_n = i}, is called doubly stochastic if

        ∑_i P_ij = 1,   j = 1, 2, ...

    and

        ∑_j P_ij = 1,   i = 1, 2, ...

    The uniform distribution is a stationary distribution of P iff the probability transition matrix is doubly stochastic.


    4. The conditional entropy H(X_n|X_1) increases with n for a stationary Markov process.

    If the Markov process is stationary, then H(X_n) is constant. So the entropy is non-increasing. However, it can be proved that H(X_n|X_1) increases with n. This implies that:

    the conditional uncertainty of the future increases.

    Proof:

        H(X_n|X_1) ≥ H(X_n|X_1, X_2)     (conditioning reduces entropy)
                   = H(X_n|X_2)          (by Markovity)
                   = H(X_{n−1}|X_1)      (by stationarity)

    Similarly: H(X_0|X_n) is increasing in n for any Markov chain.



    Sufficient Statistics

    Suppose we have a family of probability mass functions {f_θ(x)} indexed by θ, and let X be a sample from a distribution in this family. Let T(X) be any statistic (function of the sample), like the sample mean or sample variance. Then

        θ → X → T(X),

    and by the data processing inequality, we have

        I(θ; T(X)) ≤ I(θ; X)

    for any distribution on θ. However, if equality holds, no information is lost.

    A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.


    Def:

    A function T(X) is said to be a sufficientstatistic relative to the family {f(x)} if X isindependent of give T(X), i.e., T(X)Xforms a Markov chain.

    or:I(;X) = I(; T(X))

    for all distributions on

    Sufficient statistics preserve mutual information.



    Some examples of Sufficient Statistics

    1. Let X_1, X_2, ..., X_n, X_i ∈ {0, 1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter θ = Pr(X_i = 1).

    Given n, the number of 1's is a sufficient statistic for θ. Here

        T(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} X_i.

    Given T, all sequences having that many 1's are equally likely and independent of the parameter θ.


        Pr{ (X_1, X_2, ..., X_n) = (x_1, x_2, ..., x_n) | ∑_{i=1}^{n} X_i = k }
          = 1 / (n choose k)    if ∑_{i=1}^{n} x_i = k,
          = 0                   otherwise.

    Thus θ → T → (X_1, X_2, ..., X_n), and T is a sufficient statistic for θ.


    2. If X is normally distributed with mean θ and variance 1; that is, if

        f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2} = N(θ, 1),

    and X_1, X_2, ..., X_n are drawn independently according to f_θ, then a sufficient statistic for θ is the sample mean

        X̄_n = (1/n) ∑_{i=1}^{n} X_i.

    It can be verified that the conditional distribution Pr(X_1, X_2, ..., X_n | X̄_n, n) is independent of θ.


    A minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics.

    Def:

    A statistic T(X) is a minimal sufficient statistic relative to {f_θ(x)} if it is a function of every other sufficient statistic U:

        for every sufficient statistic U(X), T(X) = f(U(X)),  so that  θ → T(X) → U(X) → X.

    Hence, a minimal sufficient statistic maximally compresses the information about θ in the sample. Other sufficient statistics may contain additional irrelevant information.

    The sufficient statistics of the above examples are minimal.


    Shuffles increase Entropy:

    If T is a shuffle (permutation) of a deck of cards and X is the initial (random) position of the cards in the deck, and if the choice of the shuffle T is independent of X, then

        H(TX) ≥ H(X)

    where TX is the permutation of the deck induced by the shuffle T on the initial permutation X.

    Proof: H(TX) ≥ H(TX|T)

                 = H(T⁻¹TX|T)   (why?)

                 = H(X|T)

                 = H(X)

    if X and T are independent!


    If X and X' are i.i.d. with entropy H(X), then

        Pr(X = X') ≥ 2^{−H(X)}

    with equality iff X has a uniform distribution.

    pf: Suppose X ~ p(x). By Jensen's inequality, we have

        2^{E log p(X)} ≤ E 2^{log p(X)}

    which implies that

        2^{−H(X)} = 2^{∑ p(x) log p(x)} ≤ ∑ p(x) 2^{log p(x)} = ∑ p²(x) = Pr(X = X')

    ( Let X and X' be two i.i.d. r.v.s with entropy H(X). The probability that X = X' is given by Pr(X = X') = ∑_x p²(x). )

    Let X, X' be independent with X ~ p(x), X' ~ r(x), x, x' ∈ X. Then

        Pr(X = X') ≥ 2^{−H(p)−D(p||r)}
        Pr(X = X') ≥ 2^{−H(r)−D(r||p)}

    pf: 2^{−H(p)−D(p||r)} = 2^{∑ p(x) log p(x) + ∑ p(x) log [r(x)/p(x)]} = 2^{∑ p(x) log r(x)} ≤ ∑ p(x) 2^{log r(x)} = ∑ p(x) r(x) = Pr(X = X')

    *Notice that the function f(y) = 2^y is convex.
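    A quick check of the first inequality (the pmf is an arbitrary example):

        from math import log2

        p = [0.5, 0.25, 0.125, 0.125]            # example pmf
        H = -sum(v * log2(v) for v in p)         # H(X) = 1.75 bits
        collision = sum(v * v for v in p)        # Pr(X = X') for i.i.d. X, X'
        print(collision, 2 ** (-H), collision >= 2 ** (-H))   # 0.34375 >= 0.297...: True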