Page 1: Information Theory: Principles and Applications


Information Theory: Principles and Applications

Tiago T. V. Vinhoza

March 19, 2010

Page 2: Information Theory: Principles and Applications

1. Course Information
2. What is Information Theory?
3. Review of Probability Theory
4. Information Measures

Page 3: Information Theory: Principles and Applications

Course Information

Information Theory: Principles and Applications

Prof. Tiago T. V. Vinhoza
Office: FEUP Building I, Room I322
Office hours: Wednesdays from 14h30-15h30
Email: [email protected]

Prof. José Vieira
Prof. Paulo Jorge Ferreira

Page 4: Information Theory: Principles and Applications

Course Information

Information Theory: Principles and Applications

http://paginas.fe.up.pt/∼vinhoza (link for Info Theory)
Homeworks
Other notes

Evaluation: (almost) weekly homeworks + final exam
References:

Elements of Information Theory, Cover and Thomas, Wiley
Information Theory and Reliable Communication, Gallager
Information Theory, Inference, and Learning Algorithms, MacKay (available online)

Page 5: Information Theory: Principles and Applications

What is Information Theory?

What is Information Theory?

IT is a branch of mathematics (a strictly deductive system). (C. Shannon, The Bandwagon)
A general statistical concept of communication. (N. Wiener, What is IT?)
It was built upon the work of Shannon (1948).
It answers two fundamental questions in communications theory:

What is the fundamental limit for information compression?
What is the fundamental limit on the information transmission rate over a communications channel?

Page 6: Information Theory: Principles and Applications

What is Information Theory?

What is Information Theory?

Mathematics: Inequalities
Computer Science: Kolmogorov Complexity
Statistics: Hypothesis Testing
Probability Theory: Limit Theorems
Engineering: Communications
Physics: Thermodynamics
Economics: Portfolio Theory

Page 7: Information Theory: Principles and Applications

What is Information Theory?

Communications Systems

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. (Claude Shannon, A Mathematical Theory of Communication, 1948)

Page 8: Information Theory: Principles and Applications

What is Information Theory?

Digital Communications Systems

Source
Source Coder: Convert an analog or digital source into bits.
Channel Coder: Protection against errors/erasures in the channel.
Modulator: Each binary sequence is assigned to a waveform.
Channel: Physical medium to send information from transmitter to receiver. Source of randomness.
Demodulator, Channel Decoder, Source Decoder, Sink.

Page 9: Information Theory: Principles and Applications

What is Information Theory?

Digital Communications Systems

Modulator + Channel = Discrete Channel.
Binary Symmetric Channel.
Binary Erasure Channel.

Page 10: Information Theory: Principles and Applications

Review of Probability Theory

Review of Probability Theory

Axiomatic Approach
Relative Frequency Approach

Page 11: Information Theory: Principles and Applications

Review of Probability Theory

Axiomatic Approach

An application of a mathematical theory called Measure Theory. It is based on a triplet

(Ω, F, P)

where
Ω is the sample space, the set of all possible outcomes.
F is the σ-algebra, the set of all possible events (combinations of outcomes).
P is the probability function, a set function whose domain is F and whose range is the closed unit interval [0, 1]. It must obey the following rules:

P(Ω) = 1.
Let A be any event in F; then P(A) ≥ 0.
Let A and B be two events in F such that A ∩ B = ∅; then P(A ∪ B) = P(A) + P(B).

Page 12: Information Theory: Principles and Applications

Review of Probability Theory

Axiomatic Approach: Other properties

Probability of the complement: P(Aᶜ) = 1 − P(A).
P(A) ≤ 1.
P(∅) = 0.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Page 13: Information Theory: Principles and Applications

Review of Probability Theory

Conditional Probability

Let A and B be two events, with P(A) > 0. The conditional probability of B given A is defined as:

P(B|A) = P(A ∩ B) / P(A)

Hence, P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).

If A ∩ B = ∅, then P(B|A) = 0.
If A ⊂ B, then P(B|A) = 1.

Page 14: Information Theory: Principles and Applications

Review of Probability Theory

Bayes Rule

If A and B are events, with P(B) > 0:

P(A|B) = P(B|A)P(A) / P(B)

Page 15: Information Theory: Principles and Applications

Review of Probability Theory

Total Probability Theorem

A set of events B_i, i = 1, ..., n, is a partition of Ω when:
∪_{i=1}^n B_i = Ω.
B_i ∩ B_j = ∅ if i ≠ j.

Theorem: If A is an event and B_i, i = 1, ..., n, is a partition of Ω, then:

P(A) = ∑_{i=1}^n P(A ∩ B_i) = ∑_{i=1}^n P(A|B_i) P(B_i)
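As a quick numeric illustration (not from the slides), here is a minimal Python sketch that checks the total probability theorem and Bayes' rule on an invented two-event partition; all the probabilities are assumptions chosen only for the example.

```python
# Hypothetical setup: B1 = "bit 0 sent", B2 = "bit 1 sent" form a partition of Omega,
# and A = "bit 0 received". All numbers below are illustrative assumptions.
p_B = {"B1": 0.6, "B2": 0.4}            # P(B_i), a partition of Omega
p_A_given_B = {"B1": 0.9, "B2": 0.2}    # P(A | B_i)

# Total probability theorem: P(A) = sum_i P(A | B_i) P(B_i)
p_A = sum(p_A_given_B[b] * p_B[b] for b in p_B)

# Bayes' rule: P(B1 | A) = P(A | B1) P(B1) / P(A)
p_B1_given_A = p_A_given_B["B1"] * p_B["B1"] / p_A

print(f"P(A) = {p_A:.3f}")                # 0.9*0.6 + 0.2*0.4 = 0.62
print(f"P(B1 | A) = {p_B1_given_A:.3f}")  # 0.54 / 0.62 ≈ 0.871
```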

Page 16: Information Theory: Principles and Applications

Review of Probability Theory

Independence between Events

Two events A and B are statistically independent when

P(A ∩ B) = P(A)P(B)

Supposing that both P(A) and P(B) are greater than zero, from the above definition we have that:

P(A|B) = P(A) and P(B|A) = P(B)

Independent events and mutually exclusive events are different!

Page 17: Information Theory: Principles and Applications

Review of Probability Theory

Independence between events

N events are statistically independent if the intersection of the events contained in any subset of those N events has probability equal to the product of the individual probabilities.
Example: Three events A, B and C are independent if:

P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C)

P(A ∩ B ∩ C) = P(A)P(B)P(C)

Page 18: Information Theory: Principles and Applications

Review of Probability Theory

Random Variables

A random variable (rv) is a function that maps each ω ∈ Ω to a real number.

X: Ω → R
ω ↦ X(ω)

Through a random variable, subsets of Ω are mapped to subsets (intervals) of the real numbers.

P(X ∈ I) = P({ω | X(ω) ∈ I})

Page 19: Information Theory: Principles and Applications

Review of Probability Theory

Random Variables

A real random variable is a function whose domain is Ω and such that for every real number x, the set A_x = {ω | X(ω) ≤ x} is an event.
P({ω | X(ω) = ±∞}) = 0.

Page 20: Information Theory: Principles and Applications

Review of Probability Theory

Cumulative Distribution Function

F_X: R → [0, 1]
x ↦ F_X(x) = P(X ≤ x) = P({ω | X(ω) ≤ x})

F_X(∞) = 1.
F_X(−∞) = 0.
If x_1 < x_2, then F_X(x_2) ≥ F_X(x_1).
F_X(x⁺) = lim_{ϵ→0⁺} F_X(x + ϵ) = F_X(x) (continuous from the right).
F_X(x) − F_X(x⁻) = P(X = x).

Page 21: Information Theory: Principles and Applications

Review of Probability Theory

Types of Random Variables

Discrete: Cumulative distribution function is a step function (a sum of unit step functions):

F_X(x) = ∑_i P(X = x_i) u(x − x_i)

where u(x) is the unit step function.
Example: X is the random variable that describes the outcome of the roll of a die, X ∈ {1, 2, 3, 4, 5, 6}.
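A small sketch (illustrative, not from the slides) that evaluates the step-function CDF for the die example, assuming the die is fair:

```python
# Fair die (assumed for illustration): p_X(x_i) = 1/6 for x_i in {1, ..., 6}.
support = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in support}

def cdf(x: float) -> float:
    """F_X(x) = sum_i P(X = x_i) u(x - x_i), with u the unit step."""
    return sum(p for xi, p in pmf.items() if xi <= x)

for x in [0.5, 1, 3.7, 6]:
    print(f"F_X({x}) = {cdf(x):.3f}")   # 0.000, 0.167, 0.500, 1.000
```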

Page 22: Information Theory: Principles and Applications

Review of Probability Theory

Types of Random Variable

Continuous: Cumulative distribution function is a continuous function.
Mixed: Neither discrete nor continuous.

Page 23: Information Theory: Principles and Applications

Review of Probability Theory

Probability Density Function

It is the derivative of the cumulative distribution function:

p_X(x) = (d/dx) F_X(x)

∫_{−∞}^{x} p_X(x) dx = F_X(x).
p_X(x) ≥ 0.
∫_{−∞}^{∞} p_X(x) dx = 1.
∫_{a}^{b} p_X(x) dx = F_X(b) − F_X(a) = P(a ≤ X ≤ b).
P(X ∈ I) = ∫_I p_X(x) dx, I ⊂ R.

Page 24: Information Theory: Principles and Applications

Review of Probability Theory

Discrete Random Variables

Let us now focus only on discrete random variables.
Let X be a random variable with sample space X.
The probability mass function (probability distribution function) of X is a mapping p_X: X → [0, 1] satisfying:

∑_{x∈X} p_X(x) = 1

The number p_X(x) := P(X = x).

Page 25: Information Theory: Principles and Applications

Review of Probability Theory

Discrete Random Vectors

Let Z = [X, Y] be a random vector with sample space Z = X × Y.
The joint probability mass function (probability distribution function) of Z is a mapping p_Z: Z → [0, 1] satisfying:

∑_{z∈Z} p_Z(z) = ∑_{(x,y)∈X×Y} p_XY(x, y) = 1

The number p_Z(z) := p_XY(x, y) = P(Z = z) = P(X = x, Y = y).

Page 26: Information Theory: Principles and Applications

Review of Probability Theory

Discrete Random Vectors

Marginal distributions:

p_X(x) = ∑_{y∈Y} p_XY(x, y)

p_Y(y) = ∑_{x∈X} p_XY(x, y)

Page 27: Information Theory: Principles and Applications

Review of Probability Theory

Discrete Random Vectors

Conditional distributions:

p_{X|Y=y}(x) = p_XY(x, y) / p_Y(y)

p_{Y|X=x}(y) = p_XY(x, y) / p_X(x)
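To make the joint, marginal and conditional distributions concrete, here is a minimal Python sketch using an invented 2x2 joint pmf; the numbers and variable names are assumptions for illustration, not from the slides.

```python
from collections import defaultdict

# Hypothetical joint pmf p_XY(x, y) over X = {0, 1}, Y = {'a', 'b'}.
p_xy = {(0, 'a'): 0.3, (0, 'b'): 0.2, (1, 'a'): 0.1, (1, 'b'): 0.4}
assert abs(sum(p_xy.values()) - 1.0) < 1e-9   # a joint pmf must sum to 1

# Marginals: p_X(x) = sum_y p_XY(x, y), p_Y(y) = sum_x p_XY(x, y)
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# Conditionals: p_{X|Y=y}(x) = p_XY(x, y) / p_Y(y)
p_x_given_y = {(x, y): p / p_y[y] for (x, y), p in p_xy.items()}

print(dict(p_x))              # {0: 0.5, 1: 0.5}
print(dict(p_y))              # {'a': 0.4, 'b': 0.6}
print(p_x_given_y[(0, 'a')])  # 0.3 / 0.4 = 0.75
```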

Page 28: Information Theory: Principles and Applications

Review of Probability Theory

Discrete Random Vectors

Random variables X and Y are independent if and only if

p_XY(x, y) = p_X(x) p_Y(y)

Consequences:

p_{X|Y=y}(x) = p_X(x)

p_{Y|X=x}(y) = p_Y(y)

Page 29: Information Theory: Principles and Applications

Review of Probability Theory

Moments of a Discrete Random Variable

The n-th order moment of a discrete random variable X is defined as:

E[X^n] = ∑_{x∈X} x^n p_X(x)

If n = 1, we have the mean of X, m_X = E[X].
The m-th order central moment of a discrete random variable X is defined as:

E[(X − m_X)^m] = ∑_{x∈X} (x − m_X)^m p_X(x)

If m = 2, we have the variance of X, σ_X².
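A minimal sketch of these definitions, reusing the assumed fair-die pmf from the earlier example:

```python
# n-th moment and m-th central moment of a discrete random variable.
pmf = {x: 1 / 6 for x in range(1, 7)}   # fair die, assumed for illustration

def moment(pmf: dict, n: int) -> float:
    """E[X^n] = sum_x x^n p_X(x)."""
    return sum((x ** n) * p for x, p in pmf.items())

def central_moment(pmf: dict, m: int) -> float:
    """E[(X - m_X)^m], where m_X is the mean."""
    mean = moment(pmf, 1)
    return sum(((x - mean) ** m) * p for x, p in pmf.items())

print(moment(pmf, 1))          # mean m_X = 3.5
print(central_moment(pmf, 2))  # variance sigma_X^2 ≈ 2.917
```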

Page 30: Information Theory: Principles and Applications

Review of Probability Theory

Moments of a Discrete Random Vector

The joint moment of order n in X and order k in Y:

m_nk = E[X^n Y^k] = ∑_{x∈X} ∑_{y∈Y} x^n y^k p_XY(x, y)

The joint central moment of order n in X and order k in Y:

µ_nk = E[(X − m_X)^n (Y − m_Y)^k] = ∑_{x∈X} ∑_{y∈Y} (x − m_X)^n (y − m_Y)^k p_XY(x, y)

Page 31: Information Theory: Principles and Applications

Review of Probability Theory

Correlation and Covariance

The correlation of two random variables X and Y is the expected value of their product (joint moment of order 1 in X and order 1 in Y):

Corr(X, Y) = m_11 = E[XY]

The covariance of two random variables X and Y is the joint central moment of order 1 in X and order 1 in Y:

Cov(X, Y) = µ_11 = E[(X − m_X)(Y − m_Y)]

Cov(X, Y) = Corr(X, Y) − m_X m_Y

Correlation coefficient:

ρ_XY = Cov(X, Y) / (σ_X σ_Y), with −1 ≤ ρ_XY ≤ 1
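A short sketch computing correlation, covariance and the correlation coefficient for an assumed numeric joint pmf (the values are illustrative only):

```python
import math

# Hypothetical joint pmf over numeric X and Y.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def expect(f) -> float:
    """E[f(X, Y)] under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in p_xy.items())

m_x, m_y = expect(lambda x, y: x), expect(lambda x, y: y)
corr = expect(lambda x, y: x * y)                    # Corr(X, Y) = E[XY]
cov = expect(lambda x, y: (x - m_x) * (y - m_y))     # Cov(X, Y)
var_x = expect(lambda x, y: (x - m_x) ** 2)
var_y = expect(lambda x, y: (y - m_y) ** 2)
rho = cov / math.sqrt(var_x * var_y)                 # correlation coefficient

print(corr, cov, corr - m_x * m_y)   # checks Cov = Corr - m_X m_Y
print(rho)                           # lies in [-1, 1]
```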

Page 32: Information Theory: Principles and Applications

Information Measures

What is Information?

It is a measure that quantifies the uncertainty of an event with a given probability (Shannon, 1948).
For a discrete source with finite alphabet X = {x_0, x_1, ..., x_{M−1}}, where the probability of each symbol is given by P(X = x_k) = p_k:

I(x_k) = log(1/p_k) = − log(p_k)

If the logarithm is base 2, the information is given in bits.
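A minimal sketch of the self-information formula with base-2 logarithms; the example probabilities mirror some rows of the table on the next slide:

```python
import math

def self_information(p: float) -> float:
    """I(x) = -log2(p): information, in bits, of an outcome with probability p."""
    return -math.log2(p)

print(self_information(1 / 2))   # 1 bit  (correct guess on a true-false question)
print(self_information(1 / 4))   # 2 bits (correct guess on a 4-choice question)
print(self_information(6 / 36))  # ≈ 2.585 bits (seven on a pair of dice)
```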

Page 33: Information Theory: Principles and Applications

Information Measures

What is Information?

It represents the surprise of seeing the outcome (a highly probable outcome is not surprising).

event                                     probability        surprise
one equals one                            1                  0 bits
wrong guess on a 4-choice question        3/4                0.415 bits
correct guess on true-false question      1/2                1 bit
correct guess on a 4-choice question      1/4                2 bits
seven on a pair of dice                   6/36               2.58 bits
win any prize at Euromilhões              1/24               4.585 bits
win Euromilhões Jackpot                   ≈ 1/76 million     ≈ 26 bits
gamma ray burst mass extinction today     < 2.7 · 10⁻¹²      > 38 bits

Page 34: Information Theory: Principles and Applications

Information Measures

Entropy

Expected value of information from a source.

H(X) = E[I(X)] = ∑_{x∈X} p_X(x) I(x) = − ∑_{x∈X} p_X(x) log p_X(x)

Page 35: Information Theory: Principles and Applications

Information Measures

Entropy of binary source

Let X be a binary source with p_0 and p_1 being the probabilities of symbols x_0 and x_1, respectively.

H(X) = −p_0 log p_0 − p_1 log p_1
     = −p_0 log p_0 − (1 − p_0) log(1 − p_0)
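A small sketch of the entropy formula and the binary entropy function in bits; the helper names are mine, not the slides', and the fair-die call just reuses the earlier assumed example:

```python
import math

def entropy(pmf: dict) -> float:
    """H(X) = -sum_x p_X(x) log2 p_X(x), in bits (terms with p = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def binary_entropy(p0: float) -> float:
    """Entropy of a binary source with symbol probabilities p0 and 1 - p0."""
    return entropy({0: p0, 1: 1 - p0})

print(entropy({x: 1 / 6 for x in range(1, 7)}))  # fair die: log2(6) ≈ 2.585 bits
print(binary_entropy(0.5))                       # maximum: 1 bit
print(binary_entropy(0.1))                       # ≈ 0.469 bits
```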

Page 36: Information Theory: Principles and Applications

Information Measures

Entropy of binary source

[Figure: plot of the binary entropy H(X) as a function of p0, for p0 from 0 to 1.]

Page 37: Information Theory: Principles and Applications

Information Measures

Joint Entropy

The joint entropy of a pair of random variables X and Y is given by:

H(X, Y) = − ∑_{y∈Y} ∑_{x∈X} p_XY(x, y) log p_XY(x, y)

Page 38: Information Theory: Principles and Applications

Information Measures

Conditional Entropy

Average amount of information of a random variable given the occurrence of another.

H(X|Y) = ∑_{y∈Y} p_Y(y) H(X|Y = y)
       = − ∑_{y∈Y} p_Y(y) ∑_{x∈X} p_{X|Y=y}(x) log p_{X|Y=y}(x)
       = − ∑_{y∈Y} ∑_{x∈X} p_XY(x, y) log p_{X|Y=y}(x)

Page 39: Information Theory: Principles and Applications

Information Measures

Chain Rule of Entropy

The entropy of a pair of random variables is equal to the entropy of one of them plus the conditional entropy of the other:

H(X, Y) = H(X) + H(Y|X)

Corollary:

H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
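A numeric check of the chain rule on the same assumed 2x2 joint pmf used in the earlier sketches (illustrative only, not part of the slides):

```python
import math
from collections import defaultdict

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # assumed joint pmf

def H(pmf) -> float:
    """Entropy in bits of a pmf given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Marginal p_X and conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
p_x = defaultdict(float)
for (x, _), p in p_xy.items():
    p_x[x] += p
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(H(p_xy))                 # H(X, Y) ≈ 1.846 bits
print(H(p_x) + H_Y_given_X)    # H(X) + H(Y|X): equal by the chain rule
```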

Page 40: Information Theory: Principles and Applications

Information Measures

Chain Rule of Entropy: Generalization

H(X_1, X_2, ..., X_M) = ∑_{j=1}^{M} H(X_j | X_1, ..., X_{j−1})

Page 41: Information Theory: Principles and Applications

Information Measures

Relative Entropy: Kullback-Leibler Distance

It is a measure of the "distance" between two distributions.
The relative entropy between two probability distributions p_X(x) and q_X(x) is defined as:

D(p_X || q_X) = ∑_{x∈X} p_X(x) log( p_X(x) / q_X(x) )
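A minimal sketch of the relative entropy between two made-up distributions on the same alphabet (both distributions are assumptions for illustration):

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """D(p || q) = sum_x p(x) log2(p(x) / q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}      # hypothetical source distribution
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # uniform reference distribution

print(kl_divergence(p, q))   # >= 0, and 0 only if p == q
print(kl_divergence(q, p))   # generally a different value: D is not symmetric
```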

Page 42: Information Theory: Principles and Applications

Information Measures

Relative Entropy: Kullback-Leibler Distance

D(p_X || q_X) ≥ 0, with equality if and only if p_X(x) = q_X(x) for all x.
D(p_X || q_X) ≠ D(q_X || p_X) in general: relative entropy is not symmetric, so it is not a true distance.

Page 43: Information Theory: Principles and Applications

Information Measures

Mutual Information

The mutual information of two random variables X and Y is defined as the relative entropy between the joint distribution p_XY(x, y) and the product of the marginals p_X(x) and p_Y(y):

I(X; Y) = D(p_XY(x, y) || p_X(x) p_Y(y))
        = ∑_{x∈X} ∑_{y∈Y} p_XY(x, y) log( p_XY(x, y) / (p_X(x) p_Y(y)) )

Page 44: Information Theory: Principles and Applications

Information Measures

Mutual Information: Relations with Entropy

Reducing the uncertainty of X due to the knowledge of Y:

I(X; Y) = H(X) − H(X|Y)

Symmetry of the relation above: I(X; Y) = H(Y) − H(Y|X)

Sum of entropies:

I(X; Y) = H(X) + H(Y) − H(X, Y)

"Self" mutual information:

I(X; X) = H(X) − H(X|X) = H(X)
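The sketch below (illustrative, reusing the same assumed 2x2 joint pmf) computes I(X;Y) from the definition and checks it against H(X) + H(Y) − H(X,Y):

```python
import math
from collections import defaultdict

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # assumed joint pmf

def H(pmf) -> float:
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# Definition: I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

print(I)                             # mutual information in bits (≈ 0.125)
print(H(p_x) + H(p_y) - H(p_xy))     # same value: H(X) + H(Y) - H(X, Y)
```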

Page 45: Information Theory: Principles and Applications

Information Measures

Mutual Information: Other Relations

Conditional Mutual Information:

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)

Chain Rule for Mutual Information:

I(X_1, X_2, ..., X_M; Y) = ∑_{j=1}^{M} I(X_j; Y | X_1, ..., X_{j−1})

Page 46: Information Theory: Principles and Applications

Information Measures

Convex and Concave Functions

A function f(·) is convex over an interval (a, b) if for every x_1, x_2 ∈ [a, b] and 0 ≤ λ ≤ 1:

f(λ x_1 + (1 − λ) x_2) ≤ λ f(x_1) + (1 − λ) f(x_2)

A function f(·) is convex over an interval (a, b) if its second derivative is non-negative over that interval.
A function f(·) is concave if −f(·) is convex.
Examples of convex functions: x², |x|, eˣ, and x log x for x ≥ 0.
Examples of concave functions: log x and √x, for x ≥ 0.

Page 47: Information Theory: Principles and Applications

Information Measures

Jensen’s Inequality

If f (·) is a convex function and X is a random variable

E[f(X)] ≥ f(E[X])

Used to show that the relative entropy and the mutual information are non-negative.
Also used to show that H(X) ≤ log |X|.
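A quick numeric illustration of Jensen's inequality for the convex function f(x) = x² and an assumed small pmf (not from the slides):

```python
# Jensen's inequality: for convex f, E[f(X)] >= f(E[X]).
pmf = {1: 0.2, 2: 0.5, 4: 0.3}   # hypothetical pmf, for illustration only
f = lambda x: x ** 2             # a convex function

E_X = sum(x * p for x, p in pmf.items())
E_fX = sum(f(x) * p for x, p in pmf.items())

print(E_fX, f(E_X))   # 7.0 >= 5.76, as Jensen's inequality requires
```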

Page 48: Information Theory: Principles and Applications

Information Measures

Log-Sum Inequality

For n positive numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n:

∑_{i=1}^{n} a_i log(a_i / b_i) ≥ ( ∑_{i=1}^{n} a_i ) log( (∑_{i=1}^{n} a_i) / (∑_{i=1}^{n} b_i) )

with equality if and only if a_i/b_i = c for all i.
This inequality is used to prove the convexity of the relative entropy and the concavity of the entropy, as well as the convexity/concavity properties of mutual information.

Page 49: Information Theory: Principles and Applications

Information Measures

Data Processing Inequality

Random variables X, Y, Z are said to form a Markov chain in that order, X → Y → Z, if the conditional distribution of Z depends only on Y and is conditionally independent of X:

p_XYZ(x, y, z) = p_X(x) p_{Y|X=x}(y) p_{Z|Y=y}(z)

If X → Y → Z, then

I(X; Y) ≥ I(X; Z)

Let Z = g(Y); since X → Y → g(Y), we have I(X; Y) ≥ I(X; g(Y)).
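A sketch checking the data processing inequality on an assumed Markov chain X → Y → Z, where Z = g(Y) is a deterministic (lossy) function of Y; the joint pmf and the quantizer g are inventions for illustration.

```python
import math
from collections import defaultdict

def mutual_information(p_joint) -> float:
    """I(A; B) in bits from a joint pmf given as {(a, b): prob}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in p_joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_joint.items() if p > 0)

# Assumed joint pmf of (X, Y); Z = g(Y) = y // 2 acts as a coarse quantizer of Y.
p_xy = {(0, 0): 0.2, (0, 1): 0.2, (1, 2): 0.25, (1, 3): 0.25, (0, 3): 0.1}
g = lambda y: y // 2

p_xz = defaultdict(float)
for (x, y), p in p_xy.items():
    p_xz[(x, g(y))] += p

print(mutual_information(p_xy))        # I(X; Y)
print(mutual_information(dict(p_xz)))  # I(X; Z) <= I(X; Y), as the DPI predicts
```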

Page 50: Information Theory: Principles and Applications

Information Measures

Fano’s Inequality

Suppose we know a random variable Y and we wish to guess the value of a correlated random variable X.
Fano's inequality relates the probability of error in guessing X from Y to the conditional entropy H(X|Y).
Let X̂ = g(Y). If P_e = P(X̂ ≠ X), then

H(P_e) + P_e log(|X| − 1) ≥ H(X|Y)

where H(P_e) is the binary entropy function evaluated at P_e.
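As a last sketch (assumed numbers, not from the slides), a numeric check of Fano's inequality where the guess X̂ = g(Y) is taken as the most likely x given y:

```python
import math
from collections import defaultdict

# Assumed joint pmf over X in {0, 1, 2} and Y in {0, 1}.
p_xy = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
        (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}

p_y = defaultdict(float)
for (x, y), p in p_xy.items():
    p_y[y] += p

# Guess X_hat = g(y): the most likely x given y (a MAP-style guess).
xs = {x for (x, _) in p_xy}
g = {y: max(xs, key=lambda x: p_xy.get((x, y), 0.0)) for y in p_y}

Pe = sum(p for (x, y), p in p_xy.items() if g[y] != x)   # P(X_hat != X)

# Conditional entropy H(X|Y) = -sum p(x,y) log2 p(x|y)
H_X_given_Y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

H_Pe = -Pe * math.log2(Pe) - (1 - Pe) * math.log2(1 - Pe)  # binary entropy at Pe
lhs = H_Pe + Pe * math.log2(len(xs) - 1)

print(f"{lhs:.3f} >= {H_X_given_Y:.3f}")   # Fano: H(Pe) + Pe log(|X|-1) >= H(X|Y)
```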
