Information Theory: Principles and Applications
Tiago T. V. Vinhoza
March 19, 2010
1. Course Information
2. What is Information Theory?
3. Review of Probability Theory
4. Information Measures
Course Information
Information Theory: Principles and Applications
Prof. Tiago T. V. Vinhoza
  Office: FEUP Building I, Room I322
  Office hours: Wednesdays, 14h30-15h30
  Email: [email protected]
Prof. José Vieira
Prof. Paulo Jorge Ferreira
Course Information
Information Theory: Principles and Applications
http://paginas.fe.up.pt/~vinhoza (link for Info Theory)
  Homeworks
  Other notes
Evaluation: (almost) weekly homeworks + final exam
References:
  Elements of Information Theory, Cover and Thomas, Wiley
  Information Theory and Reliable Communication, Gallager
  Information Theory, Inference, and Learning Algorithms, MacKay (available online)
What is Information Theory?
IT is a branch of math (a strictly deductive system). (C. Shannon, The Bandwagon)
A general statistical concept of communication. (N. Wiener, What is IT?)
It was built upon the work of Shannon (1948).
It answers two fundamental questions in Communications Theory:
  What is the fundamental limit for information compression?
  What is the fundamental limit on the information transmission rate over a communications channel?
What is Information Theory?
Mathematics: Inequalities
Computer Science: Kolmogorov Complexity
Statistics: Hypothesis Testing
Probability Theory: Limit Theorems
Engineering: Communications
Physics: Thermodynamics
Economics: Portfolio Theory
What is Information Theory?
Communications Systems
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. (Claude Shannon, A Mathematical Theory of Communication, 1948)
What is Information Theory?
Digital Communications Systems
Source
Source Coder: convert an analog or digital source into bits.
Channel Coder: protection against errors/erasures in the channel.
Modulator: each binary sequence is assigned to a waveform.
Channel: physical medium to send information from transmitter to receiver; a source of randomness.
Demodulator, Channel Decoder, Source Decoder, Sink.
What is Information Theory?
Digital Communications Systems
Modulator + Channel = Discrete Channel
  Binary Symmetric Channel
  Binary Erasure Channel
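A minimal Python sketch of these two discrete channel models; the crossover probability 0.1 and the erasure probability 0.2 below are illustrative assumptions, not values from the slides.

```python
import random

def bsc(bits, p):
    """Binary symmetric channel: flip each bit with crossover probability p."""
    return [b ^ 1 if random.random() < p else b for b in bits]

def bec(bits, eps):
    """Binary erasure channel: erase each bit (None) with probability eps."""
    return [None if random.random() < eps else b for b in bits]

random.seed(0)
tx = [random.randint(0, 1) for _ in range(10_000)]
flips = sum(a != b for a, b in zip(tx, bsc(tx, 0.1)))
erasures = sum(r is None for r in bec(tx, 0.2))
print(flips / len(tx))     # empirical flip rate, close to 0.1
print(erasures / len(tx))  # empirical erasure rate, close to 0.2
```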
Review of Probability Theory
Axiomatic Approach
Relative Frequency Approach
Review of Probability Theory
Axiomatic Approach
Application of a mathematical theory called Measure Theory.
It is based on a triplet

(Ω, F, P)

where
  Ω is the sample space, the set of all possible outcomes.
  F is the σ-algebra, the set of all possible events (combinations of outcomes).
  P is the probability function, a set function whose domain is F and whose range is the closed unit interval [0, 1]. It must obey the following rules:
    P(Ω) = 1.
    Let A be any event in F; then P(A) ≥ 0.
    Let A and B be two events in F such that A ∩ B = ∅; then P(A ∪ B) = P(A) + P(B).
Review of Probability Theory
Axiomatic Approach: Other properties
Probability of the complement: P(Aᶜ) = 1 − P(A).
P(A) ≤ 1.
P(∅) = 0.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Review of Probability Theory
Conditional Probability
Let A and B be two events, with P(A) > 0. The conditional probability of B given A is defined as:

P(B|A) = P(A ∩ B) / P(A)

Hence, P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).
If A ∩ B = ∅, then P(B|A) = 0.
If A ⊂ B, then P(B|A) = 1.
Review of Probability Theory
Bayes Rule
If A and B are events
P(A|B) = P(B|A)P(A) / P(B)
Review of Probability Theory
Total Probability Theorem
A set of events Bi, i = 1, . . . , n, is a partition of Ω when:
  B1 ∪ B2 ∪ . . . ∪ Bn = Ω
  Bi ∩ Bj = ∅, if i ≠ j.
Theorem: If A is an event and Bi, i = 1, . . . , n, is a partition of Ω, then:

P(A) = ∑_{i=1}^{n} P(A ∩ Bi) = ∑_{i=1}^{n} P(A|Bi)P(Bi)
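A small numeric sketch combining the total probability theorem with Bayes' rule; the priors and likelihoods below are made-up values for illustration, not from the slides.

```python
# X is a transmitted bit, Y the received bit (illustrative probabilities).
p_x = {0: 0.6, 1: 0.4}                   # P(X = x)
p_y_given_x = {0: {0: 0.9, 1: 0.1},      # P(Y = y | X = x)
               1: {0: 0.2, 1: 0.8}}

# Total probability theorem: P(Y = 1) = sum_x P(Y = 1 | X = x) P(X = x)
p_y1 = sum(p_y_given_x[x][1] * p_x[x] for x in p_x)

# Bayes rule: P(X = 1 | Y = 1) = P(Y = 1 | X = 1) P(X = 1) / P(Y = 1)
p_x1_given_y1 = p_y_given_x[1][1] * p_x[1] / p_y1

print(p_y1)           # 0.38
print(p_x1_given_y1)  # ~0.842
```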
Review of Probability Theory
Independence between Events
Two events A and B are statistically independent when
P(A ∩ B) = P(A)P(B)
Supposing that both P(A) and P(B) are greater than zero, from the above definition we have that:

P(A|B) = P(A) and P(B|A) = P(B)
Independent events and mutually exclusive events are different!
Review of Probability Theory
Independence between events
N events are statistically independent if the intersection of the events in any subset of those N events has probability equal to the product of the individual probabilities.
Example: Three events A, B and C are independent if:

P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C)

P(A ∩ B ∩ C) = P(A)P(B)P(C)
Review of Probability Theory
Random Variables
A random variable (rv) is a function that maps each ω ∈ Ω to a real number.

X : Ω → R
    ω ↦ X(ω)
Through a random variable, subsets of Ω are mapped to subsets (intervals) of the real numbers.

P(X ∈ I) = P({ω | X(ω) ∈ I})
Review of Probability Theory
Random Variables
A real random variable is a function whose domain is Ω and such that, for every real number x, the set Ax = {ω | X(ω) ≤ x} is an event.
P({ω | X(ω) = ±∞}) = 0.
Review of Probability Theory
Cumulative Distribution Function
F_X : R → [0, 1]
      x ↦ F_X(x) = P(X ≤ x) = P({ω | X(ω) ≤ x})

F_X(∞) = 1.
F_X(−∞) = 0.
If x1 < x2, then F_X(x2) ≥ F_X(x1).
F_X(x+) = lim_{ε→0+} F_X(x + ε) = F_X(x) (right-continuous).
F_X(x) − F_X(x−) = P(X = x).
Review of Probability Theory
Types of Random Variables
Discrete: the cumulative distribution function is a step function (a sum of unit step functions)

F_X(x) = ∑_i P(X = x_i) u(x − x_i)

where u(x) is the unit step function.
Example: X is the random variable that describes the outcome of the roll of a die, X ∈ {1, 2, 3, 4, 5, 6}.
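A minimal sketch of this step-function CDF for the fair-die example; the function name die_cdf is just illustrative.

```python
def die_cdf(x):
    """CDF of a fair die: F_X(x) = sum_i P(X = x_i) u(x - x_i)."""
    return sum(1 / 6 for xi in range(1, 7) if x >= xi)

# The CDF jumps by 1/6 at each integer from 1 to 6.
for x in [0.5, 1, 2.5, 6, 7]:
    print(x, die_cdf(x))  # 0, 1/6, 2/6, ~1, ~1
```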
Review of Probability Theory
Types of Random Variables
Continuous: the cumulative distribution function is a continuous function.
Mixed: neither discrete nor continuous.
Review of Probability Theory
Probability Density Function
It is the derivative of the cumulative distribution function:

p_X(x) = d/dx F_X(x)

∫_{−∞}^{x} p_X(t) dt = F_X(x).
p_X(x) ≥ 0.
∫_{−∞}^{∞} p_X(x) dx = 1.
∫_{a}^{b} p_X(x) dx = F_X(b) − F_X(a) = P(a ≤ X ≤ b).
P(X ∈ I) = ∫_{I} p_X(x) dx, I ⊂ R.
Review of Probability Theory
Discrete Random Variables
Let us now focus only on discrete random variables.
Let X be a random variable with sample space 𝒳.
The probability mass function (probability distribution function) of X is a mapping p_X(x) : 𝒳 → [0, 1] satisfying:

∑_{x∈𝒳} p_X(x) = 1

where p_X(x) := P(X = x).
Review of Probability Theory
Discrete Random Vectors
Let Z = [X, Y] be a random vector with sample space 𝒵 = 𝒳 × 𝒴.
The joint probability mass function (probability distribution function) of Z is a mapping p_Z(z) : 𝒵 → [0, 1] satisfying:

∑_{z∈𝒵} p_Z(z) = ∑_{(x,y)∈𝒳×𝒴} p_XY(x, y) = 1

where p_Z(z) := p_XY(x, y) = P(Z = z) = P(X = x, Y = y).
Review of Probability Theory
Discrete Random Vectors
Marginal distributions:

p_X(x) = ∑_{y∈𝒴} p_XY(x, y)

p_Y(y) = ∑_{x∈𝒳} p_XY(x, y)
Review of Probability Theory
Discrete Random Vectors
Conditional distributions:

p_{X|Y=y}(x) = p_XY(x, y) / p_Y(y)

p_{Y|X=x}(y) = p_XY(x, y) / p_X(x)
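A short sketch, under an assumed joint pmf (the numbers are illustrative, not from the slides), of computing marginal and conditional distributions:

```python
# Joint pmf over X in {0, 1} and Y in {0, 1}.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginals: p_X(x) = sum_y p_XY(x, y) and p_Y(y) = sum_x p_XY(x, y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# Conditional: p_{X|Y=1}(x) = p_XY(x, 1) / p_Y(1).
p_x_given_y1 = {x: p_xy[(x, 1)] / p_y[1] for x in (0, 1)}

print(p_x)           # {0: 0.5, 1: 0.5}
print(p_y)           # {0: 0.4, 1: 0.6}
print(p_x_given_y1)  # {0: 1/3, 1: 2/3}
```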
Review of Probability Theory
Discrete Random Vectors
Random variables X and Y are independent if and only if

p_XY(x, y) = p_X(x)p_Y(y)

Consequences:
  p_{X|Y=y}(x) = p_X(x)
  p_{Y|X=x}(y) = p_Y(y)
Review of Probability Theory
Moments of a Discrete Random Variable
The n-th order moment of a discrete random variable X is defined as:

E[X^n] = ∑_{x∈𝒳} x^n p_X(x)

If n = 1, we have the mean of X, m_X = E[X].
The m-th order central moment of a discrete random variable X is defined as:

E[(X − m_X)^m] = ∑_{x∈𝒳} (x − m_X)^m p_X(x)

If m = 2, we have the variance of X, σ_X^2.
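A minimal sketch of these moment definitions applied to the fair-die pmf:

```python
pmf = {x: 1 / 6 for x in range(1, 7)}  # fair die

def moment(pmf, n):
    """n-th order moment: E[X^n] = sum_x x^n p_X(x)."""
    return sum(x ** n * p for x, p in pmf.items())

def central_moment(pmf, m):
    """m-th order central moment: E[(X - m_X)^m]."""
    mean = moment(pmf, 1)
    return sum((x - mean) ** m * p for x, p in pmf.items())

print(moment(pmf, 1))          # mean m_X = 3.5
print(central_moment(pmf, 2))  # variance ~2.9167
```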
Review of Probability Theory
Moments of a Discrete Random Vector
The joint moment of n-th order with relation to X and k-th order with relation to Y:

m_nk = E[X^n Y^k] = ∑_{x∈𝒳} ∑_{y∈𝒴} x^n y^k p_XY(x, y)

The joint central moment of n-th order with relation to X and k-th order with relation to Y:

µ_nk = E[(X − m_X)^n (Y − m_Y)^k] = ∑_{x∈𝒳} ∑_{y∈𝒴} (x − m_X)^n (y − m_Y)^k p_XY(x, y)
Review of Probability Theory
Correlation and Covariance
The correlation of two random variables X and Y is the expected value of their product (joint moment of order 1 in X and order 1 in Y):

Corr(X, Y) = m_11 = E[XY]

The covariance of two random variables X and Y is the joint central moment of order 1 in X and order 1 in Y:

Cov(X, Y) = µ_11 = E[(X − m_X)(Y − m_Y)]

Cov(X, Y) = Corr(X, Y) − m_X m_Y

Correlation coefficient:

ρ_XY = Cov(X, Y) / (σ_X σ_Y), with −1 ≤ ρ_XY ≤ 1
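A short sketch, on an assumed joint pmf, that computes Cov(X, Y), checks Cov(X, Y) = Corr(X, Y) − m_X m_Y, and evaluates the correlation coefficient:

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in p_xy.items())

m_x, m_y = expect(lambda x, y: x), expect(lambda x, y: y)
corr = expect(lambda x, y: x * y)                     # Corr(X, Y) = E[XY]
cov = expect(lambda x, y: (x - m_x) * (y - m_y))      # Cov(X, Y)
var_x = expect(lambda x, y: (x - m_x) ** 2)
var_y = expect(lambda x, y: (y - m_y) ** 2)
rho = cov / math.sqrt(var_x * var_y)                  # correlation coefficient

print(cov, corr - m_x * m_y)  # both 0.1
print(rho)                    # ~0.408
```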
Information Measures
What is Information?
It is a measure that quantifies the uncertainty of an event with a given probability (Shannon, 1948).
For a discrete source with finite alphabet 𝒳 = {x0, x1, . . . , x_{M−1}}, where the probability of each symbol is given by P(X = x_k) = p_k:

I(x_k) = log(1/p_k) = − log(p_k)

If the logarithm is base 2, information is given in bits.
Information Measures
What is Information?
It represents the surprise of seeing the outcome (a highly probable outcome is not surprising).

event                                    probability       surprise
one equals one                           1                 0 bits
wrong guess on a 4-choice question       3/4               0.415 bits
correct guess on true-false question     1/2               1 bit
correct guess on a 4-choice question     1/4               2 bits
seven on a pair of dice                  6/36              2.58 bits
win any prize at Euromilhões             1/24              4.585 bits
win Euromilhões Jackpot                  ≈ 1/76 million    ≈ 26 bits
gamma ray burst mass extinction today    < 2.7 · 10^-12    > 38 bits
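A minimal sketch that reproduces a few of the surprise values above from the probabilities stated in the table:

```python
import math

def surprise_bits(p):
    """Self-information I(x) = -log2(p), in bits."""
    return -math.log2(p)

for event, p in [("correct guess on true-false question", 1 / 2),
                 ("correct guess on a 4-choice question", 1 / 4),
                 ("seven on a pair of dice", 6 / 36),
                 ("win Euromilhões Jackpot", 1 / 76_000_000)]:
    print(f"{event}: {surprise_bits(p):.3f} bits")  # 1, 2, ~2.58, ~26.2 bits
```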
Information Measures
Entropy
Expected value of the information from a source:

H(X) = E[I(X)] = ∑_{x∈𝒳} p_X(x) I(x) = − ∑_{x∈𝒳} p_X(x) log p_X(x)
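A minimal entropy sketch in Python, using the base-2 logarithm so the result is in bits:

```python
import math

def entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

print(entropy({x: 1 / 6 for x in range(1, 7)}))   # fair die: log2(6) ~ 2.585 bits
print(entropy({"a": 0.5, "b": 0.25, "c": 0.25}))  # 1.5 bits
```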
Information Measures
Entropy of binary source
Let X be a binary source with p0 and p1 being the probabilities of symbols x0 and x1, respectively.

H(X) = −p0 log p0 − p1 log p1 = −p0 log p0 − (1 − p0) log(1 − p0)
Information Measures
Entropy of binary source
[Figure: the binary entropy function H(X) plotted against p0, ranging from 0 to 1 bit and maximized at p0 = 0.5.]
Information Measures
Joint Entropy
The joint entropy of a pair of random variables X and Y is given by:

H(X, Y) = − ∑_{y∈𝒴} ∑_{x∈𝒳} p_XY(x, y) log p_XY(x, y)
Information Measures
Conditional Entropy
Average amount of information of a random variable given the occurrence of another:

H(X|Y) = ∑_{y∈𝒴} p_Y(y) H(X|Y = y)
       = − ∑_{y∈𝒴} p_Y(y) ∑_{x∈𝒳} p_{X|Y=y}(x) log p_{X|Y=y}(x)
       = − ∑_{y∈𝒴} ∑_{x∈𝒳} p_XY(x, y) log p_{X|Y=y}(x)
Information Measures
Chain Rule of Entropy
The entropy of a pair of random variables is equal to the entropy of one of them plus the conditional entropy of the other given the first.

H(X, Y) = H(X) + H(Y|X)

Corollary:

H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
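A short numerical check of the chain rule H(X, Y) = H(X) + H(Y|X), using an assumed joint pmf:

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}

# H(Y|X) = sum_x p_X(x) H(Y | X = x)
h_y_given_x = sum(p_x[x] * H({y: p_xy[(x, y)] / p_x[x] for y in (0, 1)})
                  for x in (0, 1))

print(H(p_xy))               # joint entropy H(X, Y)
print(H(p_x) + h_y_given_x)  # the same value, by the chain rule
```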
Information Measures
Chain Rule of Entropy: Generalization
H(X1, X2, . . . , XM) = ∑_{j=1}^{M} H(Xj | X1, . . . , Xj−1)
Information Measures
Relative Entropy: Kullback-Leibler Distance
It is a measure of the distance between two distributions.
The relative entropy between two probability mass functions p_X(x) and q_X(x) is defined as:

D(p_X || q_X) = ∑_{x∈𝒳} p_X(x) log [p_X(x) / q_X(x)]
Information Measures
Relative Entropy: Kullback-Leibler Distance
D(p_X || q_X) ≥ 0, with equality if and only if p_X(x) = q_X(x) for all x.
In general D(p_X || q_X) ≠ D(q_X || p_X): relative entropy is not symmetric.
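A minimal relative-entropy sketch, illustrating non-negativity and the lack of symmetry; the distributions p and q below are illustrative assumptions:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_divergence(p, q))  # ~0.737 bits
print(kl_divergence(q, p))  # ~0.531 bits -- not symmetric
print(kl_divergence(p, p))  # 0.0 -- equality iff the distributions coincide
```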
Information Measures
Mutual Information
The mutual information of two random variables X and Y is defined as the relative entropy between the joint probability mass function p_XY(x, y) and the product of the marginals p_X(x) and p_Y(y):

I(X; Y) = D(p_XY(x, y) || p_X(x)p_Y(y)) = ∑_{x∈𝒳} ∑_{y∈𝒴} p_XY(x, y) log [p_XY(x, y) / (p_X(x)p_Y(y))]
Information Measures
Mutual Information: Relations with Entropy
Reduction in the uncertainty of X due to the knowledge of Y:

I(X; Y) = H(X) − H(X|Y)

Symmetry of the relation above: I(X; Y) = H(Y) − H(Y|X)
Sum of entropies:

I(X; Y) = H(X) + H(Y) − H(X, Y)

"Self" mutual information:

I(X; X) = H(X) − H(X|X) = H(X)
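A short check, on an assumed joint pmf, that the relative-entropy definition of I(X; Y) agrees with H(X) + H(Y) − H(X, Y):

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

mi_kl = sum(p * math.log2(p / (p_x[x] * p_y[y]))
            for (x, y), p in p_xy.items() if p > 0)
mi_entropies = H(p_x) + H(p_y) - H(p_xy)

print(mi_kl, mi_entropies)  # both ~0.125 bits
```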
Information Measures
Mutual Information: Other Relations
Conditional Mutual Information:
I(X; Y|Z) = H(X|Z) − H(X|Y, Z)

Chain Rule for Mutual Information:

I(X1, X2, . . . , XM; Y) = ∑_{j=1}^{M} I(Xj; Y | X1, . . . , Xj−1)
Information Measures
Convex and Concave Functions
A function f(·) is convex over an interval (a, b) if for every x1, x2 ∈ [a, b] and 0 ≤ λ ≤ 1:

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)

A function f(·) is convex over an interval (a, b) if its second derivative is non-negative over that interval.
A function f(·) is concave if −f(·) is convex.
Examples of convex functions: x^2, |x|, e^x, and x log x for x ≥ 0.
Examples of concave functions: log x and √x, for x ≥ 0.
Information Measures
Jensen’s Inequality
If f(·) is a convex function and X is a random variable, then

E[f(X)] ≥ f(E[X])

Used to show that relative entropy and mutual information are non-negative.
Also used to show that H(X) ≤ log |𝒳|.
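A quick numerical illustration of Jensen's inequality for the convex function f(x) = x^2; the distribution of X (a fair die) is an illustrative assumption:

```python
# E[X^2] >= (E[X])^2 for a fair die, computed directly from the pmf.
pmf = {x: 1 / 6 for x in range(1, 7)}

e_x = sum(x * p for x, p in pmf.items())        # E[X] = 3.5
e_fx = sum(x ** 2 * p for x, p in pmf.items())  # E[X^2] ~ 15.17

print(e_fx, e_x ** 2)    # ~15.17 vs 12.25
print(e_fx >= e_x ** 2)  # True
```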
Information Measures
Log-Sum Inequality
For n positive numbers a1, a2, . . . , an and b1, b2, . . . , bn:

∑_{i=1}^{n} ai log(ai/bi) ≥ (∑_{i=1}^{n} ai) log [(∑_{i=1}^{n} ai) / (∑_{i=1}^{n} bi)]

with equality if and only if ai/bi = c (a constant) for all i.
This inequality is used to prove the convexity of the relative entropy and the concavity of the entropy.
Convexity/concavity of mutual information.
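A minimal numerical check of the log-sum inequality for arbitrarily chosen positive numbers:

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))

print(lhs, rhs, lhs >= rhs)  # ~-0.245, ~-1.334, True
```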
Information Measures
Data Processing Inequality
Random variables X, Y, Z are said to form a Markov chain in that order, X → Y → Z, if the conditional distribution of Z depends only on Y and is conditionally independent of X:

p_XYZ(x, y, z) = p_X(x) p_{Y|X=x}(y) p_{Z|Y=y}(z)

If X → Y → Z, then

I(X; Y) ≥ I(X; Z)

If Z = g(Y), then X → Y → g(Y) and I(X; Y) ≥ I(X; g(Y)).
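A short sketch of the data processing inequality for a Markov chain built from two cascaded binary symmetric channels; the crossover probabilities 0.1 and 0.2 are illustrative assumptions:

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# X -> Y -> Z: a uniform bit X through BSC(0.1), then Y through BSC(0.2).
p1, p2 = 0.1, 0.2
p_cascade = p1 * (1 - p2) + (1 - p1) * p2  # effective crossover probability X -> Z

i_xy = 1 - h2(p1)         # I(X; Y) for a uniform input bit
i_xz = 1 - h2(p_cascade)  # I(X; Z)

print(i_xy, i_xz, i_xy >= i_xz)  # ~0.531, ~0.174, True
```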
Information Measures
Fano’s Inequality
Suppose we know a random variable Y and we wish to guess the value of a correlated random variable X.
Fano's inequality relates the probability of error in guessing X from Y to the conditional entropy H(X|Y).
Let X̂ = g(Y) be the estimate of X. If Pe = P(X̂ ≠ X), then

H(Pe) + Pe log(|𝒳| − 1) ≥ H(X|Y)

where H(Pe) is the binary entropy function evaluated at Pe.
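A small numerical check of Fano's inequality on an assumed joint pmf, using the MAP guess X̂ = g(y) = argmax_x p_{X|Y=y}(x):

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

h_x_given_y = 0.0  # conditional entropy H(X|Y)
p_e = 0.0          # error probability of the MAP guess
for y in (0, 1):
    cond = {x: p_xy[(x, y)] / p_y[y] for x in (0, 1)}  # p_{X|Y=y}
    h_x_given_y += p_y[y] * h2(cond[1])
    p_e += p_y[y] * (1 - max(cond.values()))

lhs = h2(p_e) + p_e * math.log2(2 - 1)  # |X| = 2, so the second term is 0
print(lhs, h_x_given_y, lhs >= h_x_given_y)  # ~0.881 >= ~0.875
```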