Multimedia Communication Lec02: Info Theory and Entropy
CS/EE 5590 / ENG 401 Special Topics
(Class Ids: 17804, 17815, 17803)
Lec 02
Entropy and Lossless Coding I
Zhu Li
Outline
Lecture 01 Recap
Info Theory on Entropy
Lossless Entropy Coding
Video Compression in Summary
Video Coding Standards: Rate-Distortion Performance
Pre-HEVC
PSS over managed IP networks
Managed mobile core IP networks
MPEG DASH – OTT
HTTP Adaptive Streaming of Video
Outline
Lecture 01 Recap
Info Theory on Entropy
  Self-information of an event
  Entropy of the source
  Relative entropy
  Mutual information
Entropy Coding
Thanks to SFU's Prof. Jie Liang for his slides!
Entropy and its Application
Entropy coding: the last part of a compression system
Losslessly represent symbols. Key idea:
  Assign short codes for common symbols
  Assign long codes for rare symbols
Question: How to evaluate a compression method?
Need to know the lower bound we can achieve: the entropy.
[Encoder block diagram: Transform → Quantization → Entropy coding → output bitstream, e.g., 0100100101111]
Claude Shannon: 1916-2001
A distant relative of Thomas Edison
1932: Went to the University of Michigan
1937: Master's thesis at MIT became the foundation of digital circuit design: "the most important, and also the most famous, master's thesis of the century"
1940: PhD, MIT
1940-1956: Bell Labs (back to MIT after that)
1948: The birth of information theory: "A Mathematical Theory of Communication," Bell System Technical Journal
Axiomatic Definition of Information
Information is a measure of uncertainty or surprise
Axiom 1: Information of an event is a function of its probability:
i(A) = f (P(A)). What’s the expression of f()?
Axiom 2: Rare events have high information content
Water found on Mars!!!
Common events have low information content: It's raining in Vancouver.
Information should be a decreasing function of the probability:
Still numerous choices of f().
Axiom 3: Information of two independent events = sum of individual information:
If P(AB) = P(A)P(B), then i(AB) = i(A) + i(B).
Only the logarithmic function satisfies these conditions.
Self-information
Shannon's definition [1948]:
X: discrete random variable with alphabet {A1, A2, …, AN}
Probability mass function: p(x) = Pr{X = x}
Self-information of an event X = x:
    i(x) = log_b( 1 / p(x) ) = -log_b p(x)
If b = 2, the unit of information is the bit.
Self-information indicates the number of bits needed to represent an event.
[Plot: i(x) = -log_b P(x) as P(x) ranges over (0, 1]: it decreases from infinity toward 0.]
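A minimal Python sketch of this definition, using made-up event probabilities:

```python
import math

def self_information(p, base=2):
    """Self-information i(x) = -log_b p(x) of an event with probability p."""
    return -math.log(p, base)

# A rare event carries more information than a common one.
print(self_information(0.5))    # 1.0 bit
print(self_information(0.01))   # ~6.64 bits
```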
Entropy of a Random Variable
Recall the mean of a function g(X):
    E_p[ g(X) ] = ∑_x p(x) g(x)
Entropy is the expected self-information of the r.v. X:
    H(X) = ∑_x p(x) log( 1 / p(x) ) = E_p[ log( 1 / p(X) ) ] = -E_p[ log p(X) ]
The entropy represents the minimal number of bits needed to losslessly represent one output of the source.
Also written as H(p): it is a function of the distribution of X, not of the values of X.
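A minimal Python sketch of the entropy formula, with toy distributions:

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log_b p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# Uniform source over 4 symbols needs 2 bits/symbol; a skewed source needs less.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```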
Joint encoding needs the joint probability: difficult.
Sequential encoding only needs conditional entropy; we can use local neighbors to approximate the conditional entropy → context-adaptive arithmetic coding.
In many cases we can approximate the conditional probability with some nearest neighbors (contexts):
    p(x_i | x_1, …, x_{i-1}) ≈ p(x_i | x_{i-L}, …, x_{i-1})
The low-dimensional conditional probability is more manageable.
How to measure the quality of the approximation? Relative entropy.
[Example sequences: 0 1 1 0 1 0 1 …  and  a b c b c a b c b a b c b a]
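A rough Python sketch of this idea on the slide's toy symbol sequence (written without spaces): it compares a marginal entropy estimate with a first-order (one-neighbor context) conditional entropy estimate.

```python
from collections import Counter
import math

def entropy_from_counts(counts):
    """Empirical entropy (bits) of a Counter of observations."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)

seq = "abcbcabcbabcba"  # toy sequence from the slide example

# Zeroth-order (marginal) entropy estimate.
h0 = entropy_from_counts(Counter(seq))

# First-order conditional entropy: H(X_i | X_{i-1}) = H(X_{i-1}, X_i) - H(X_{i-1}).
pairs = Counter(zip(seq[:-1], seq[1:]))
h1 = entropy_from_counts(pairs) - entropy_from_counts(Counter(seq[:-1]))

print(f"H(X) ~ {h0:.3f} bits, H(X | previous symbol) ~ {h1:.3f} bits")
```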
Relative Entropy – Cost of Coding with the Wrong Distribution
Also known as the Kullback-Leibler (K-L) distance, information divergence, or information gain.
A measure of the "distance" between two distributions.
In many applications, the true distribution p(X) is unknown, and we only know an estimated distribution q(X). What is the inefficiency in representing X?
    D(p||q) = ∑_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
The true entropy (best achievable rate): R1 = -∑_x p(x) log p(x)
The actual rate when coding with q:     R2 = -∑_x p(x) log q(x)
The difference: R2 - R1 = D(p||q)
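A small Python sketch, with made-up distributions p and q, showing that the extra rate paid for coding with the wrong distribution equals D(p||q):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average rate when the true source is p but code lengths follow q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # assumed (wrong) distribution

r1 = entropy(p)           # R1: achievable rate with the true model (1.75)
r2 = cross_entropy(p, q)  # R2: actual rate with the wrong model (2.0)
print(r1, r2, r2 - r1, kl_divergence(p, q))  # R2 - R1 == D(p||q) == 0.25
```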
Relative Entropy
    D(p||q) = ∑_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
Properties:
  D(p||q) ≥ 0, with equality if and only if q = p (proved later).
  What if p(x) > 0 but q(x) = 0 for some x? Then D(p||q) = ∞.
Caution: D(p||q) is not a true distance:
  Not symmetric in general: D(p||q) ≠ D(q||p)
  Does not satisfy the triangle inequality.
Relative Entropy
How to make it symmetric? Many possibilities, for example:
    (1/2) [ D(p||q) + D(q||p) ]
    1 / ( 1/D(p||q) + 1/D(q||p) )
Such symmetrized combinations of D(p||q) and D(q||p) can be useful for pattern classification.
Mutual Information
Mutual information between two events:
    i(x; y) = i(x) - i(x | y) = log( p(x | y) / p(x) ) = log( p(x, y) / ( p(x) p(y) ) )
where i(x | y) = -log p(x | y) is the conditional self-information.
A measure of the amount of information that one event contains about another one, or the reduction in the uncertainty of one event due to the knowledge of the other.
Note: i(x; y) can be negative, if p(x | y) < p(x).
Mutual Information
I(X; Y): mutual information between two random variables:
    I(X; Y) = ∑_{x,y} p(x, y) i(x; y) = ∑_{x,y} p(x, y) log( p(x, y) / ( p(x) p(y) ) )
Mutual information is a relative entropy:
    I(X; Y) = D( p(x, y) || p(x) p(y) ) = E[ log( p(X, Y) / ( p(X) p(Y) ) ) ]
If X and Y are independent, p(x, y) = p(x) p(y), so I(X; Y) = 0: knowing X does not reduce the uncertainty of Y.
Different from i(x; y), I(X; Y) ≥ 0 (due to averaging).
But it is symmetric: I(X; Y) = I(Y; X).
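A short Python sketch computing I(X; Y) directly from a joint pmf; the joint tables here are made-up examples:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

dependent   = [[0.4, 0.1], [0.1, 0.4]]      # X and Y tend to agree
independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y)
print(mutual_information(dependent))    # > 0
print(mutual_information(independent))  # 0.0
```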
Entropy and Mutual Information
1. I(X; Y) = H(X) - H(X | Y)
Proof: expand the definition:
    I(X; Y) = ∑_{x,y} p(x, y) log( p(x, y) / ( p(x) p(y) ) ) = ∑_{x,y} p(x, y) log( p(x | y) / p(x) )
            = ∑_{x,y} p(x, y) log p(x | y) - ∑_{x,y} p(x, y) log p(x)
            = H(X) - H(X | Y)
2. Similarly: I(X; Y) = H(Y) - H(Y | X)
3. I(X; Y) = H(X) + H(Y) - H(X, Y)
Proof:
    I(X; Y) = ∑_{x,y} p(x, y) [ log p(x, y) - log p(x) - log p(y) ] = H(X) + H(Y) - H(X, Y)
Entropy and Mutual Information
[Venn diagram: total area H(X, Y); the two circles are H(X) and H(Y); their overlap is I(X; Y); the remaining parts are H(X | Y) and H(Y | X).]
It can be seen from this figure that I(X; X) = H(X).
Proof:
Let X = Y in I(X; Y) = H(X) + H(Y) - H(X, Y),
or in I(X; Y) = H(X) - H(X | Y) (and use H(X | X) = 0).
Application of Mutual Information
Example sequence: a b c b c a b c b a b c b a
Mutual information can be used in the optimization of context quantization.
Example: if each neighbor has 26 possible values (a to z), then 5 neighbors have 26^5 combinations: too many conditional probabilities to estimate.
To reduce the number, we can group similar data patterns together → context quantization:
    p(x_i | x_1, …, x_{i-1}) ≈ p(x_i | f(x_1, …, x_{i-1}))
Application of Mutual Information
We need to design the function f( ) to minimize the conditional entropy H(X_i | f(X_1, …, X_{i-1})) in
    p(x_i | x_1, …, x_{i-1}) ≈ p(x_i | f(x_1, …, x_{i-1}))
since, by the chain rule,
    H(X_1, X_2, …, X_n) = ∑_{i=1}^{n} H(X_i | X_1, …, X_{i-1})
But H(X | Y) = H(X) - I(X; Y), so the problem is equivalent to maximizing the mutual information between X_i and f(X_1, …, X_{i-1}).
For further info: Liu and Karam, "Mutual Information-Based Analysis of JPEG2000 Contexts," IEEE Trans. Image Processing, vol. 14, no. 4, Apr. 2005, pp. 411-422.
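As a rough illustration of context quantization (not the JPEG2000 design), the sketch below estimates the conditional entropy before and after a hypothetical quantizer f that collapses a two-symbol context down to one symbol; the data and the quantizer are made up for illustration:

```python
from collections import Counter
import math

def cond_entropy(pairs):
    """Empirical H(X | C) from a list of (context, symbol) pairs."""
    ctx_counts = Counter(c for c, _ in pairs)
    joint = Counter(pairs)
    n = len(pairs)
    h = 0.0
    for (c, x), cnt in joint.items():
        h -= (cnt / n) * math.log2(cnt / ctx_counts[c])
    return h

# Hypothetical binary data; the context is the two previous samples.
data = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
full_ctx = [((data[i - 2], data[i - 1]), data[i]) for i in range(2, len(data))]

# Context quantizer f: keep only the previous sample (4 contexts -> 2).
quantized = [((c[1],), x) for c, x in full_ctx]

print("H(X | full context)      ~", round(cond_entropy(full_ctx), 3))
print("H(X | quantized context) ~", round(cond_entropy(quantized), 3))
```

Fewer contexts are easier to estimate, at the price of a (hopefully small) increase in conditional entropy.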
Design the mapping from source symbols to codewords
Lossless mapping
Different codewords may have different lengths
Goal: minimizing the average codeword length
The entropy is the lower bound.
Classes of Codes
Non-singular code: Different inputs are mapped to different codewords (invertible).
Uniquely decodable code: any encoded string has only one possible source string, but may need delay to decode.
Prefix-free code (or simply prefix, or instantaneous):
No codeword is a prefix of any other codeword. This is the focus of our studies.
Questions: What characterizes them? How can we design them? Are they optimal?
[Nested code classes: prefix-free codes ⊂ uniquely decodable codes ⊂ non-singular codes ⊂ all codes]
Prefix Code
Examples:
  X | Singular | Non-singular, but not uniquely decodable | Uniquely decodable, but not prefix-free | Prefix-free
  1 |    0     |   0   |   0    |  0
  2 |    0     |  010  |   01   |  10
  3 |    0     |  01   |   011  |  110
  4 |    0     |  10   |  0111  |  111
The non-singular code needs punctuation to parse strings such as …01011…
The uniquely decodable (but not prefix-free) code needs to look at the next bit(s) to decode the previous codeword.
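A small Python check of the prefix-free property, applied to two of the example codes above:

```python
def is_prefix_free(codewords):
    """A code is prefix-free if no codeword is a prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_free(["0", "10", "110", "111"]))   # True  (prefix-free column)
print(is_prefix_free(["0", "01", "011", "0111"]))  # False (uniquely decodable only)
```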
Carter-Gill's Conjecture [1974]
Every uniquely decodable code can be replaced by a prefix-free code with the same set of codeword compositions.
So we only need to study prefix-free codes.
Prefix-free Code
Can be uniquely decoded.
No codeword is a prefix of another one.
Also called a prefix code.
Goal: construct a prefix code with minimal expected length.
Can put all codewords in a binary tree:
[Binary code tree with 0/1 branches, a root node, internal nodes, and leaf nodes; the codewords 0, 10, 110, 111 sit at leaves.]
A prefix-free code contains leaves only.
How to express the requirement mathematically?
Kraft-McMillan Inequality
The characteristic of prefix-free codes:
The codeword lengths l_i, i = 1, …, N of a prefix code over an alphabet of size D (here D = 2) satisfy the inequality
    ∑_{i=1}^{N} D^(-l_i) ≤ 1
Conversely, if a set of lengths {l_i} satisfies the inequality above, then there exists a prefix code with codeword lengths l_i, i = 1, …, N.
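A minimal Python check of the Kraft-McMillan sum for a given set of codeword lengths; the second length set corresponds to the invalid code shown two slides later:

```python
def kraft_sum(lengths, D=2):
    """Kraft-McMillan sum: a prefix code with these lengths exists iff the sum <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))     # 1.0  -> prefix code exists, e.g. {0, 10, 110, 111}
print(kraft_sum([1, 2, 2, 3, 3]))  # 1.25 -> no prefix code with these lengths
```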
Kraft-McMillan Inequality
Consider D = 2 and expand the binary code tree to full depth L = max(l_i).
Example: {0, 10, 110, 111}, so L = 3.
Number of nodes in the last level: 2^L = 2^3 = 8.
Each codeword corresponds to a sub-tree, and the number of its offspring in the last level is 2^(L - l_i); for this example the counts are 4, 2, 1, 1.
The number of L-th level offspring of all codewords cannot exceed 2^L:
    ∑_i 2^(L - l_i) ≤ 2^L   ⟺   ∑_i 2^(-l_i) ≤ 1
which is the K-M inequality.
Kraft-McMillan Inequality
Invalid code: {0, 10, 11, 110, 111} (here 11 is a prefix of 110 and 111).
It leads to more than 2^L offspring at level L = 3: 4 + 2 + 2 + 1 + 1 = 10 > 2^3 = 8,
so the K-M inequality is violated:
    ∑_i 2^(-l_i) = 1/2 + 1/4 + 1/4 + 1/8 + 1/8 = 1.25 > 1
Extended Kraft Inequality
A countably infinite prefix code (infinitely many codewords) also satisfies the Kraft inequality:
    ∑_{i=1}^{∞} D^(-l_i) ≤ 1
Example: 0, 10, 110, 1110, 11110, …, 11…10, … (Golomb-Rice code, next lecture).
Each codeword can be mapped to a subinterval of [0, 1) that is disjoint from the others (revisited in arithmetic coding):
    0 → [0, 0.5),  10 → [0.5, 0.75),  110 → [0.75, 0.875), …
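A tiny Python sketch of this codeword-to-interval mapping for the binary case, where a codeword of length l occupies an interval of width 2^(-l):

```python
def codeword_interval(codeword):
    """Map a binary codeword to its dyadic subinterval of [0, 1)."""
    low = sum(int(bit) * 2 ** -(i + 1) for i, bit in enumerate(codeword))
    return low, low + 2 ** -len(codeword)

# For the prefix code 0, 10, 110 the intervals are disjoint.
for cw in ["0", "10", "110"]:
    print(cw, codeword_interval(cw))
# 0   -> (0.0, 0.5)
# 10  -> (0.5, 0.75)
# 110 -> (0.75, 0.875)
```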
Optimal Codes (Advanced Topic)
How to design the prefix code with the minimal expected length?
Optimization problem: find {l_i} to
    minimize ∑_i p_i l_i   subject to   ∑_i D^(-l_i) ≤ 1
Lagrangian solution:
  Ignore the integer codeword-length constraint for now.
  Assume equality holds in the Kraft inequality.
Minimize
    J = ∑_i p_i l_i + λ ∑_i D^(-l_i)
Optimal Codes
    J = ∑_i p_i l_i + λ ∑_i D^(-l_i)
Let
    ∂J/∂l_i = p_i - λ (ln D) D^(-l_i) = 0   ⟹   D^(-l_i) = p_i / (λ ln D)
Substituting into ∑_i D^(-l_i) = 1 gives λ ln D = 1, so
    D^(-l_i) = p_i,   i.e.,   l_i* = -log_D p_i
The optimal codeword length is the self-information of the event.
Expected codeword length:
    L* = ∑_i p_i l_i* = -∑_i p_i log_D p_i = H_D(X), the entropy of X!
Optimal Code
Theorem: The expected length L of any prefix code is greater than or equal to the entropy:
    L ≥ H_D(X)
with equality iff D^(-l_i) = p_i, i.e., the p_i are dyadic (1/2, 1/4, 1/8, 1/16, …) for D = 2.
Note that l_i* = -log_D p_i is not an integer in general.
Proof:
    L - H_D(X) = ∑_i p_i l_i + ∑_i p_i log_D p_i
               = -∑_i p_i log_D D^(-l_i) + ∑_i p_i log_D p_i
               = ∑_i p_i log_D( p_i / D^(-l_i) )
This reminds us of the definition of the relative entropy D(p||q), but we need to normalize D^(-l_i).
Optimal Code
Define the distribution q_i = D^(-l_i) / ∑_j D^(-l_j). Then
    L - H_D(X) = ∑_i p_i log_D( p_i / D^(-l_i) )
               = ∑_i p_i log_D( p_i / q_i ) + log_D( 1 / ∑_j D^(-l_j) )
               = D(p || q) + log_D( 1 / ∑_j D^(-l_j) )
               ≥ 0,
because D(p||q) ≥ 0 and ∑_{i=1}^{N} D^(-l_i) ≤ 1 for a prefix code.
The equality holds iff both terms are 0:
    D^(-l_i) = p_i,   i.e.,   log_D p_i is an integer.
Optimal Code
D-adic: a probability distribution is called D-adic if each probability equals D^(-n) for some integer n. Example: {1/2, 1/4, 1/8, 1/8}.
Therefore optimality can be achieved by a prefix code iff the distribution is D-adic.
Previous example: -log_D p_i = {1, 2, 3, 3}; possible codewords: {0, 10, 110, 111}.
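A quick numerical check in Python that for a dyadic distribution the optimal lengths -log2 p_i are integers and the average codeword length equals the entropy:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]                       # dyadic distribution
lengths = [int(round(-math.log2(pi))) for pi in p]  # {1, 2, 3, 3}: all integers

avg_len = sum(pi * li for pi, li in zip(p, lengths))
h = -sum(pi * math.log2(pi) for pi in p)
print(lengths, avg_len, h)   # [1, 2, 3, 3] 1.75 1.75 -> L* = H(X)
```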
Shannon Code: Bounds on Optimal Code
l_i* = -log_D p_i is not an integer in general; practical codeword lengths have to be integers.
Shannon code:
    l_i = ⌈ log_D( 1 / p_i ) ⌉
Is this a valid prefix code? Check the Kraft inequality:
    ∑_i D^(-l_i) = ∑_i D^(-⌈log_D(1/p_i)⌉) ≤ ∑_i D^(-log_D(1/p_i)) = ∑_i p_i = 1.   Yes!
Since log_D( 1 / p_i ) ≤ l_i < log_D( 1 / p_i ) + 1, the expected length satisfies
    H_D(X) ≤ L < H_D(X) + 1
This is just one choice; it may not be optimal (see example later).
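A small Python sketch of Shannon code lengths for a made-up non-dyadic distribution, verifying H(X) ≤ L < H(X) + 1:

```python
import math

def shannon_lengths(p, D=2):
    """Shannon code lengths l_i = ceil(log_D(1/p_i)); they always satisfy Kraft."""
    return [math.ceil(math.log(1 / pi, D)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]                    # non-dyadic distribution
lengths = shannon_lengths(p)                # [2, 2, 3, 4]
avg = sum(pi * li for pi, li in zip(p, lengths))
h = -sum(pi * math.log2(pi) for pi in p)
print(lengths, round(avg, 3), round(h, 3))  # H(X) <= avg < H(X) + 1
```

For this distribution the Shannon code is not optimal: a better integer-length prefix code with lengths [1, 2, 3, 3] exists, illustrating the remark above.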
Optimal Code
The optimal code with integer lengths is at least as good as the Shannon code, so
    H_D(X) ≤ L* < H_D(X) + 1
To reduce the overhead per symbol: encode a block of symbols {x1, x2, …, xn} together.
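A rough Python sketch (assuming an i.i.d. source with a made-up distribution) showing how the per-symbol rate of a Shannon code applied to blocks of n symbols approaches the entropy as n grows:

```python
import math
from itertools import product

def shannon_rate_per_symbol(p, n):
    """Per-symbol rate of a Shannon code applied to blocks of n i.i.d. symbols."""
    total = 0.0
    for block in product(range(len(p)), repeat=n):
        pb = math.prod(p[s] for s in block)      # block probability
        total += pb * math.ceil(-math.log2(pb))  # Shannon length of the block
    return total / n

p = [0.4, 0.3, 0.2, 0.1]
h = -sum(pi * math.log2(pi) for pi in p)
print("H(X) ~", round(h, 3))
for n in (1, 2, 4):
    # Per-symbol rate is below H(X) + 1/n, so it approaches H(X) as n grows.
    print(n, round(shannon_rate_per_symbol(p, n), 3))
```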