Information Theory: Principles and Applications
Tiago T. V. Vinhoza
March 19, 2010
1. Course Information
2. What is Information Theory?
3. Review of Probability Theory
4. Information Measures
Course Information
Information Theory: Principles and Applications
Prof. Tiago T. V. Vinhoza
  Office: FEUP Building I, Room I322
  Office hours: Wednesdays, 14h30-15h30
  Email: [email protected]
Prof. José Vieira
Prof. Paulo Jorge Ferreira
Course Information
Information Theory: Principles and Applications
http://paginas.fe.up.pt/~vinhoza (link for Info Theory)
  Homeworks
  Other notes
Evaluation: (almost) weekly homeworks + final exam
References:
  Elements of Information Theory, Cover and Thomas, Wiley
  Information Theory and Reliable Communication, Gallager
  Information Theory, Inference, and Learning Algorithms, MacKay (available online)
What is Information Theory?
IT is a branch of math (a strictly deductive system). (C. Shannon, The Bandwagon)
A general statistical concept of communication. (N. Wiener, What is IT?)
It was built upon the work of Shannon (1948).
It answers two fundamental questions in Communications Theory:
  What is the fundamental limit for information compression?
  What is the fundamental limit on the information transmission rate over a communications channel?
What is Information Theory?
Mathematics: Inequalities
Computer Science: Kolmogorov Complexity
Statistics: Hypothesis Testing
Probability Theory: Limit Theorems
Engineering: Communications
Physics: Thermodynamics
Economics: Portfolio Theory
What is Information Theory?
Communications Systems
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. (Claude Shannon, A Mathematical Theory of Communication, 1948)
What is Information Theory?
Digital Communications Systems
Source
Source Coder: convert an analog or digital source into bits.
Channel Coder: protection against errors/erasures in the channel.
Modulator: each binary sequence is assigned to a waveform.
Channel: physical medium to send information from transmitter to receiver; a source of randomness.
Demodulator, Channel Decoder, Source Decoder, Sink.
What is Information Theory?
Digital Communications Systems
Modulator + Channel = Discrete Channel
  Binary Symmetric Channel
  Binary Erasure Channel
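A minimal Python sketch of these two discrete channel models; the crossover probability 0.1 and the erasure probability 0.2 below are illustrative assumptions, not values from the slides.

```python
import random

def bsc(bits, p):
    """Binary symmetric channel: flip each bit with crossover probability p."""
    return [b ^ 1 if random.random() < p else b for b in bits]

def bec(bits, eps):
    """Binary erasure channel: erase each bit (None) with probability eps."""
    return [None if random.random() < eps else b for b in bits]

random.seed(0)
tx = [random.randint(0, 1) for _ in range(10_000)]
flips = sum(a != b for a, b in zip(tx, bsc(tx, 0.1)))
erasures = sum(r is None for r in bec(tx, 0.2))
print(flips / len(tx))     # empirical flip rate, close to 0.1
print(erasures / len(tx))  # empirical erasure rate, close to 0.2
```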
Review of Probability Theory
Axiomatic Approach
Relative Frequency Approach
Review of Probability Theory
Axiomatic Approach
Application of a mathematical theory called Measure Theory.
It is based on a triplet

(Ω, F, P)

where
  Ω is the sample space, the set of all possible outcomes.
  F is the σ-algebra, the set of all possible events (combinations of outcomes).
  P is the probability function, a set function whose domain is F and whose range is the closed unit interval [0, 1]. It must obey the following rules:
    P(Ω) = 1.
    Let A be any event in F; then P(A) ≥ 0.
    Let A and B be two events in F such that A ∩ B = ∅; then P(A ∪ B) = P(A) + P(B).
Review of Probability Theory
Axiomatic Approach: Other properties
Probability of the complement: P(Aᶜ) = 1 − P(A).
P(A) ≤ 1.
P(∅) = 0.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Review of Probability Theory
Conditional Probability
Let A and B be two events, with P(A) > 0. The conditional probability of B given A is defined as:

P(B|A) = P(A ∩ B) / P(A)

Hence, P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).
If A ∩ B = ∅, then P(B|A) = 0.
If A ⊂ B, then P(B|A) = 1.
Review of Probability Theory
Bayes Rule
If A and B are events
P(A|B) = P(B|A)P(A) / P(B)
Review of Probability Theory
Total Probability Theorem
A set of events Bi, i = 1, . . . , n, is a partition of Ω when:
  B1 ∪ B2 ∪ . . . ∪ Bn = Ω
  Bi ∩ Bj = ∅, if i ≠ j.
Theorem: If A is an event and Bi, i = 1, . . . , n, is a partition of Ω, then:

P(A) = ∑_{i=1}^{n} P(A ∩ Bi) = ∑_{i=1}^{n} P(A|Bi)P(Bi)
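A small numeric sketch combining the total probability theorem with Bayes' rule; the priors and likelihoods below are made-up values for illustration, not from the slides.

```python
# X is a transmitted bit, Y the received bit (illustrative probabilities).
p_x = {0: 0.6, 1: 0.4}                   # P(X = x)
p_y_given_x = {0: {0: 0.9, 1: 0.1},      # P(Y = y | X = x)
               1: {0: 0.2, 1: 0.8}}

# Total probability theorem: P(Y = 1) = sum_x P(Y = 1 | X = x) P(X = x)
p_y1 = sum(p_y_given_x[x][1] * p_x[x] for x in p_x)

# Bayes rule: P(X = 1 | Y = 1) = P(Y = 1 | X = 1) P(X = 1) / P(Y = 1)
p_x1_given_y1 = p_y_given_x[1][1] * p_x[1] / p_y1

print(p_y1)           # 0.38
print(p_x1_given_y1)  # ~0.842
```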
Review of Probability Theory
Independence between Events
Two events A and B are statistically independent when
P(A ∩ B) = P(A)P(B)
Supposing that both P(A) and P(B) are greater than zero, from the above definition we have that:

P(A|B) = P(A) and P(B|A) = P(B)
Independent events and mutually exclusive events are different!
Review of Probability Theory
Independence between events
N events are statistically independent if the intersection of the events in any subset of those N events has probability equal to the product of the individual probabilities.
Example: Three events A, B and C are independent if:

P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C)

P(A ∩ B ∩ C) = P(A)P(B)P(C)
Review of Probability Theory
Random Variables
A random variable (rv) is a function that maps each ω ∈ Ω to a real number.

X : Ω → R
    ω ↦ X(ω)
Through a random variable, subsets of Ω are mapped to subsets (intervals) of the real numbers.

P(X ∈ I) = P({ω | X(ω) ∈ I})
Review of Probability Theory
Random Variables
A real random variable is a function whose domain is Ω and such that, for every real number x, the set Ax = {ω | X(ω) ≤ x} is an event.
P({ω | X(ω) = ±∞}) = 0.
Review of Probability Theory
Cumulative Distribution Function
F_X : R → [0, 1]
      x ↦ F_X(x) = P(X ≤ x) = P({ω | X(ω) ≤ x})

F_X(∞) = 1.
F_X(−∞) = 0.
If x1 < x2, then F_X(x2) ≥ F_X(x1).
F_X(x+) = lim_{ε→0+} F_X(x + ε) = F_X(x) (right-continuous).
F_X(x) − F_X(x−) = P(X = x).
Review of Probability Theory
Types of Random Variables
Discrete: the cumulative distribution function is a step function (a sum of unit step functions)

F_X(x) = ∑_i P(X = x_i) u(x − x_i)

where u(x) is the unit step function.
Example: X is the random variable that describes the outcome of the roll of a die, X ∈ {1, 2, 3, 4, 5, 6}.
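A minimal sketch of this step-function CDF for the fair-die example; the function name die_cdf is just illustrative.

```python
def die_cdf(x):
    """CDF of a fair die: F_X(x) = sum_i P(X = x_i) u(x - x_i)."""
    return sum(1 / 6 for xi in range(1, 7) if x >= xi)

# The CDF jumps by 1/6 at each integer from 1 to 6.
for x in [0.5, 1, 2.5, 6, 7]:
    print(x, die_cdf(x))  # 0, 1/6, 2/6, ~1, ~1
```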
Review of Probability Theory
Types of Random Variables
Continuous: the cumulative distribution function is a continuous function.
Mixed: neither discrete nor continuous.
Review of Probability Theory
Probability Density Function
It is the derivative of the cumulative distribution function:

p_X(x) = d/dx F_X(x)

∫_{−∞}^{x} p_X(t) dt = F_X(x).
p_X(x) ≥ 0.
∫_{−∞}^{∞} p_X(x) dx = 1.
∫_{a}^{b} p_X(x) dx = F_X(b) − F_X(a) = P(a ≤ X ≤ b).
P(X ∈ I) = ∫_{I} p_X(x) dx, I ⊂ R.
Review of Probability Theory
Discrete Random Variables
Let us now focus only on discrete random variables.
Let X be a random variable with sample space 𝒳.
The probability mass function (probability distribution function) of X is a mapping p_X(x) : 𝒳 → [0, 1] satisfying:

∑_{x∈𝒳} p_X(x) = 1

where p_X(x) := P(X = x).
Review of Probability Theory
Discrete Random Vectors
Let Z = [X, Y] be a random vector with sample space 𝒵 = 𝒳 × 𝒴.
The joint probability mass function (probability distribution function) of Z is a mapping p_Z(z) : 𝒵 → [0, 1] satisfying:

∑_{z∈𝒵} p_Z(z) = ∑_{(x,y)∈𝒳×𝒴} p_XY(x, y) = 1

where p_Z(z) := p_XY(x, y) = P(Z = z) = P(X = x, Y = y).
Review of Probability Theory
Discrete Random Vectors
Marginal distributions:

p_X(x) = ∑_{y∈𝒴} p_XY(x, y)

p_Y(y) = ∑_{x∈𝒳} p_XY(x, y)
Review of Probability Theory
Discrete Random Vectors
Conditional distributions:

p_{X|Y=y}(x) = p_XY(x, y) / p_Y(y)

p_{Y|X=x}(y) = p_XY(x, y) / p_X(x)
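A short sketch, under an assumed joint pmf (the numbers are illustrative, not from the slides), of computing marginal and conditional distributions:

```python
# Joint pmf over X in {0, 1} and Y in {0, 1}.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginals: p_X(x) = sum_y p_XY(x, y) and p_Y(y) = sum_x p_XY(x, y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# Conditional: p_{X|Y=1}(x) = p_XY(x, 1) / p_Y(1).
p_x_given_y1 = {x: p_xy[(x, 1)] / p_y[1] for x in (0, 1)}

print(p_x)           # {0: 0.5, 1: 0.5}
print(p_y)           # {0: 0.4, 1: 0.6}
print(p_x_given_y1)  # {0: 1/3, 1: 2/3}
```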
Review of Probability Theory
Discrete Random Vectors
Random variables X and Y are independent if and only if

p_XY(x, y) = p_X(x)p_Y(y)

Consequences:
  p_{X|Y=y}(x) = p_X(x)
  p_{Y|X=x}(y) = p_Y(y)
Review of Probability Theory
Moments of a Discrete Random Variable
The n-th order moment of a discrete random variable X is defined as:

E[X^n] = ∑_{x∈𝒳} x^n p_X(x)

If n = 1, we have the mean of X, m_X = E[X].
The m-th order central moment of a discrete random variable X is defined as:

E[(X − m_X)^m] = ∑_{x∈𝒳} (x − m_X)^m p_X(x)

If m = 2, we have the variance of X, σ_X^2.
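A minimal sketch of these moment definitions applied to the fair-die pmf:

```python
pmf = {x: 1 / 6 for x in range(1, 7)}  # fair die

def moment(pmf, n):
    """n-th order moment: E[X^n] = sum_x x^n p_X(x)."""
    return sum(x ** n * p for x, p in pmf.items())

def central_moment(pmf, m):
    """m-th order central moment: E[(X - m_X)^m]."""
    mean = moment(pmf, 1)
    return sum((x - mean) ** m * p for x, p in pmf.items())

print(moment(pmf, 1))          # mean m_X = 3.5
print(central_moment(pmf, 2))  # variance ~2.9167
```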
Review of Probability Theory
Moments of a Discrete Random Vector
The joint moment of n-th order with relation to X and k-th order with relation to Y:

m_nk = E[X^n Y^k] = ∑_{x∈𝒳} ∑_{y∈𝒴} x^n y^k p_XY(x, y)

The joint central moment of n-th order with relation to X and k-th order with relation to Y:

µ_nk = E[(X − m_X)^n (Y − m_Y)^k] = ∑_{x∈𝒳} ∑_{y∈𝒴} (x − m_X)^n (y − m_Y)^k p_XY(x, y)
Review of Probability Theory
Correlation and Covariance
The correlation of two random variables X and Y is the expected value of their product (joint moment of order 1 in X and order 1 in Y):

Corr(X, Y) = m_11 = E[XY]

The covariance of two random variables X and Y is the joint central moment of order 1 in X and order 1 in Y:

Cov(X, Y) = µ_11 = E[(X − m_X)(Y − m_Y)]

Cov(X, Y) = Corr(X, Y) − m_X m_Y

Correlation coefficient:

ρ_XY = Cov(X, Y) / (σ_X σ_Y), with −1 ≤ ρ_XY ≤ 1
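A short sketch, on an assumed joint pmf, that computes Cov(X, Y), checks Cov(X, Y) = Corr(X, Y) − m_X m_Y, and evaluates the correlation coefficient:

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in p_xy.items())

m_x, m_y = expect(lambda x, y: x), expect(lambda x, y: y)
corr = expect(lambda x, y: x * y)                     # Corr(X, Y) = E[XY]
cov = expect(lambda x, y: (x - m_x) * (y - m_y))      # Cov(X, Y)
var_x = expect(lambda x, y: (x - m_x) ** 2)
var_y = expect(lambda x, y: (y - m_y) ** 2)
rho = cov / math.sqrt(var_x * var_y)                  # correlation coefficient

print(cov, corr - m_x * m_y)  # both 0.1
print(rho)                    # ~0.408
```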
Information Measures
What is Information?
It is a measure that quantifies the uncertainty of an event with a given probability (Shannon, 1948).
For a discrete source with finite alphabet 𝒳 = {x0, x1, . . . , x_{M−1}}, where the probability of each symbol is given by P(X = x_k) = p_k:

I(x_k) = log(1/p_k) = − log(p_k)

If the logarithm is base 2, information is given in bits.
Information Measures
What is Information?
It represents the surprise of seeing the outcome (a highly probable outcome is not surprising).

event                                    probability       surprise
one equals one                           1                 0 bits
wrong guess on a 4-choice question       3/4               0.415 bits
correct guess on true-false question     1/2               1 bit
correct guess on a 4-choice question     1/4               2 bits
seven on a pair of dice                  6/36              2.58 bits
win any prize at Euromilhões             1/24              4.585 bits
win Euromilhões Jackpot                  ≈ 1/76 million    ≈ 26 bits
gamma ray burst mass extinction today    < 2.7 · 10^-12    > 38 bits
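A minimal sketch that reproduces a few of the surprise values above from the probabilities stated in the table:

```python
import math

def surprise_bits(p):
    """Self-information I(x) = -log2(p), in bits."""
    return -math.log2(p)

for event, p in [("correct guess on true-false question", 1 / 2),
                 ("correct guess on a 4-choice question", 1 / 4),
                 ("seven on a pair of dice", 6 / 36),
                 ("win Euromilhões Jackpot", 1 / 76_000_000)]:
    print(f"{event}: {surprise_bits(p):.3f} bits")  # 1, 2, ~2.58, ~26.2 bits
```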
Information Measures
Entropy
Expected value of the information from a source:

H(X) = E[I(X)] = ∑_{x∈𝒳} p_X(x) I(x) = − ∑_{x∈𝒳} p_X(x) log p_X(x)
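A minimal entropy sketch in Python, using the base-2 logarithm so the result is in bits:

```python
import math

def entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

print(entropy({x: 1 / 6 for x in range(1, 7)}))   # fair die: log2(6) ~ 2.585 bits
print(entropy({"a": 0.5, "b": 0.25, "c": 0.25}))  # 1.5 bits
```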
Information Measures
Entropy of binary source
Let X be a binary source with p0 and p1 being the probabilities of symbols x0 and x1, respectively.

H(X) = −p0 log p0 − p1 log p1 = −p0 log p0 − (1 − p0) log(1 − p0)
Information Measures
Entropy of binary source
[Figure: the binary entropy function H(X) plotted against p0, ranging from 0 to 1 bit and maximized at p0 = 0.5.]
Information Measures
Joint Entropy
The joint entropy of a pair of random variables X and Y is given by:

H(X, Y) = − ∑_{y∈𝒴} ∑_{x∈𝒳} p_XY(x, y) log p_XY(x, y)
Information Measures
Conditional Entropy
Average amount of information of a random variable given the occurrence of another:

H(X|Y) = ∑_{y∈𝒴} p_Y(y) H(X|Y = y)
       = − ∑_{y∈𝒴} p_Y(y) ∑_{x∈𝒳} p_{X|Y=y}(x) log p_{X|Y=y}(x)
       = − ∑_{y∈𝒴} ∑_{x∈𝒳} p_XY(x, y) log p_{X|Y=y}(x)
Information Measures
Chain Rule of Entropy
The entropy of a pair of random variables is equal to the entropy of one of them plus the conditional entropy of the other given the first.

H(X, Y) = H(X) + H(Y|X)

Corollary:

H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
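A short numerical check of the chain rule H(X, Y) = H(X) + H(Y|X), using an assumed joint pmf:

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}

# H(Y|X) = sum_x p_X(x) H(Y | X = x)
h_y_given_x = sum(p_x[x] * H({y: p_xy[(x, y)] / p_x[x] for y in (0, 1)})
                  for x in (0, 1))

print(H(p_xy))               # joint entropy H(X, Y)
print(H(p_x) + h_y_given_x)  # the same value, by the chain rule
```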
Information Measures
Chain Rule of Entropy: Generalization
H(X1, X2, . . . , XM) = ∑_{j=1}^{M} H(Xj | X1, . . . , Xj−1)
Information Measures
Relative Entropy: Kullback-Leibler Distance
It is a measure of the distance between two distributions.
The relative entropy between two probability mass functions p_X(x) and q_X(x) is defined as:

D(p_X || q_X) = ∑_{x∈𝒳} p_X(x) log [p_X(x) / q_X(x)]
Information Measures
Relative Entropy: Kullback-Leibler Distance
D(p_X || q_X) ≥ 0, with equality if and only if p_X(x) = q_X(x) for all x.
In general D(p_X || q_X) ≠ D(q_X || p_X): relative entropy is not symmetric.
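A minimal relative-entropy sketch, illustrating non-negativity and the lack of symmetry; the distributions p and q below are illustrative assumptions:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_divergence(p, q))  # ~0.737 bits
print(kl_divergence(q, p))  # ~0.531 bits -- not symmetric
print(kl_divergence(p, p))  # 0.0 -- equality iff the distributions coincide
```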
Information Measures
Mutual Information
The mutual information of two random variables X and Y is defined as the relative entropy between the joint probability mass function p_XY(x, y) and the product of the marginals p_X(x) and p_Y(y):

I(X; Y) = D(p_XY(x, y) || p_X(x)p_Y(y)) = ∑_{x∈𝒳} ∑_{y∈𝒴} p_XY(x, y) log [p_XY(x, y) / (p_X(x)p_Y(y))]
Information Measures
Mutual Information: Relations with Entropy
Reduction in the uncertainty of X due to the knowledge of Y:

I(X; Y) = H(X) − H(X|Y)

Symmetry of the relation above: I(X; Y) = H(Y) − H(Y|X)
Sum of entropies:

I(X; Y) = H(X) + H(Y) − H(X, Y)

"Self" mutual information:

I(X; X) = H(X) − H(X|X) = H(X)
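A short check, on an assumed joint pmf, that the relative-entropy definition of I(X; Y) agrees with H(X) + H(Y) − H(X, Y):

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

mi_kl = sum(p * math.log2(p / (p_x[x] * p_y[y]))
            for (x, y), p in p_xy.items() if p > 0)
mi_entropies = H(p_x) + H(p_y) - H(p_xy)

print(mi_kl, mi_entropies)  # both ~0.125 bits
```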
Information Measures
Mutual Information: Other Relations
Conditional Mutual Information:
I(X; Y|Z) = H(X|Z) − H(X|Y, Z)

Chain Rule for Mutual Information:

I(X1, X2, . . . , XM; Y) = ∑_{j=1}^{M} I(Xj; Y | X1, . . . , Xj−1)
Information Measures
Convex and Concave Functions
A function f(·) is convex over an interval (a, b) if for every x1, x2 ∈ [a, b] and 0 ≤ λ ≤ 1:

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)

A function f(·) is convex over an interval (a, b) if its second derivative is non-negative over that interval.
A function f(·) is concave if −f(·) is convex.
Examples of convex functions: x^2, |x|, e^x, and x log x for x ≥ 0.
Examples of concave functions: log x and √x, for x ≥ 0.
Information Measures
Jensen’s Inequality
If f(·) is a convex function and X is a random variable, then

E[f(X)] ≥ f(E[X])

Used to show that relative entropy and mutual information are non-negative.
Also used to show that H(X) ≤ log |𝒳|.
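A quick numerical illustration of Jensen's inequality for the convex function f(x) = x^2; the distribution of X (a fair die) is an illustrative assumption:

```python
# E[X^2] >= (E[X])^2 for a fair die, computed directly from the pmf.
pmf = {x: 1 / 6 for x in range(1, 7)}

e_x = sum(x * p for x, p in pmf.items())        # E[X] = 3.5
e_fx = sum(x ** 2 * p for x, p in pmf.items())  # E[X^2] ~ 15.17

print(e_fx, e_x ** 2)    # ~15.17 vs 12.25
print(e_fx >= e_x ** 2)  # True
```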
Information Measures
Log-Sum Inequality
For n positive numbers a1, a2, . . . , an and b1, b2, . . . , bn:

∑_{i=1}^{n} ai log(ai/bi) ≥ (∑_{i=1}^{n} ai) log [(∑_{i=1}^{n} ai) / (∑_{i=1}^{n} bi)]

with equality if and only if ai/bi = c (a constant) for all i.
This inequality is used to prove the convexity of the relative entropy and the concavity of the entropy.
Convexity/concavity of mutual information.
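A minimal numerical check of the log-sum inequality for arbitrarily chosen positive numbers:

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))

print(lhs, rhs, lhs >= rhs)  # ~-0.245, ~-1.334, True
```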
Information Measures
Data Processing Inequality
Random variables X, Y, Z are said to form a Markov chain in that order, X → Y → Z, if the conditional distribution of Z depends only on Y and is conditionally independent of X:

p_XYZ(x, y, z) = p_X(x) p_{Y|X=x}(y) p_{Z|Y=y}(z)

If X → Y → Z, then

I(X; Y) ≥ I(X; Z)

If Z = g(Y), then X → Y → g(Y) and I(X; Y) ≥ I(X; g(Y)).
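A short sketch of the data processing inequality for a Markov chain built from two cascaded binary symmetric channels; the crossover probabilities 0.1 and 0.2 are illustrative assumptions:

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# X -> Y -> Z: a uniform bit X through BSC(0.1), then Y through BSC(0.2).
p1, p2 = 0.1, 0.2
p_cascade = p1 * (1 - p2) + (1 - p1) * p2  # effective crossover probability X -> Z

i_xy = 1 - h2(p1)         # I(X; Y) for a uniform input bit
i_xz = 1 - h2(p_cascade)  # I(X; Z)

print(i_xy, i_xz, i_xy >= i_xz)  # ~0.531, ~0.174, True
```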
Information Measures
Fano’s Inequality
Suppose we know a random variable Y and we wish to guess the value of a correlated random variable X.
Fano's inequality relates the probability of error in guessing X from Y to the conditional entropy H(X|Y).
Let X̂ = g(Y) be the estimate of X. If Pe = P(X̂ ≠ X), then

H(Pe) + Pe log(|𝒳| − 1) ≥ H(X|Y)

where H(Pe) is the binary entropy function evaluated at Pe.
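A small numerical check of Fano's inequality on an assumed joint pmf, using the MAP guess X̂ = g(y) = argmax_x p_{X|Y=y}(x):

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

h_x_given_y = 0.0  # conditional entropy H(X|Y)
p_e = 0.0          # error probability of the MAP guess
for y in (0, 1):
    cond = {x: p_xy[(x, y)] / p_y[y] for x in (0, 1)}  # p_{X|Y=y}
    h_x_given_y += p_y[y] * h2(cond[1])
    p_e += p_y[y] * (1 - max(cond.values()))

lhs = h2(p_e) + p_e * math.log2(2 - 1)  # |X| = 2, so the second term is 0
print(lhs, h_x_given_y, lhs >= h_x_given_y)  # ~0.881 >= ~0.875
```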