
Lecture #1: From 0-th order entropy compression to k-th order entropy compression.

Apr 01, 2015

Transcript
Page 1

Lecture #1

From 0-th order entropy compression to k-th order entropy compression

Page 2

Entropy (Shannon, 1948)

For a source S emitting symbols with probability p(s), the self information of s is:

i(s) = log2 (1/p(s)) bits

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

H(S) = ∑ s∈S p(s) log2 (1/p(s)) = ∑ s∈S p(s) i(s)

H0 = 0-th order empirical entropy (of a string, where p(s) = freq(s))
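As a quick illustration (not part of the original slides), here is a short Python sketch that computes the 0-th order empirical entropy of a string; the helper name H0 and the example text "mississippi" are my own choices.

```python
import math
from collections import Counter

def H0(T):
    """0-th order empirical entropy of T, with p(s) = freq(s)/|T|."""
    n = len(T)
    return sum((f / n) * math.log2(n / f) for f in Counter(T).values())

T = "mississippi"
print(H0(T))            # ~1.823 bits per symbol
print(len(T) * H0(T))   # ~20.05 bits: the H0 lower bound for compressing T
```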

Page 3

Performance

Compression ratio =

#bits in output / #bits in input

Compression performance: we compare the entropy against the compression ratio:

H0(T)  vs  |C(T)| / |T|     or equivalently     |T| H0(T)  vs  |C(T)|

Page 4

Huffman Code

Invented by Huffman as a class assignment in the ’50s.

Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, …

Properties: generates optimal prefix codes; fast to encode and decode.

We can prove that (n=|T|): n H(T) ≤ |Huff(T)| < n H(T) + n

This means that it loses < 1 bit per symbol on average!

Good or bad?
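To make the bound concrete, here is a minimal Python sketch of Huffman-tree construction (my own illustration, not the lecture's code); the helper huffman_code, its tie-breaking counter, and the example text are assumptions for demonstration.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code for the symbols of text: dict symbol -> bit-string."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # heap items: (frequency, tie-breaker, tree); a tree is a symbol or a pair
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)      # merge the two least frequent trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    code = {}
    def walk(node, prefix):                  # assign 0/1 along the tree paths
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix
    walk(heap[0][2], "")
    return code

T = "mississippi"
code = huffman_code(T)
print(code, sum(len(code[c]) for c in T))   # total bits: between n*H0(T) and n*H0(T)+n
```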

Page 5

Arithmetic coding

Given a text of n symbols, it takes n H0 + 2 bits, vs. (n H0 + n) bits for Huffman.

Used in PPM, JPEG/MPEG (as option), …

More time costly than Huffman, but integer implementation is “not bad”.

Page 6

Symbol interval

Assign each symbol a sub-interval of [0, 1), of length equal to its probability.

p(a) = .2, p(b) = .5, p(c) = .3; cumulative values f(a) = .0, f(b) = .2, f(c) = .7, so the intervals are a → [.0, .2), b → [.2, .7), c → [.7, 1.0).

e.g. the symbol interval for b is [.2,.7)

Page 7

Encoding a sequence of symbols

Coding the sequence: bac

The final sequence interval is [.27,.3)

With p(a) = .2, p(b) = .5, p(c) = .3:

b: [0, 1) → [.2, .7), of length .5
a: [.2, .7) → [.2, .3), of length (0.7 − 0.2) · 0.2 = 0.1
c: [.2, .3) → [.27, .3), of length (0.3 − 0.2) · 0.3 = 0.03

At each step the current interval is subdivided in proportion to the probabilities: [.2, .7) into a → [.2, .3), b → [.3, .55), c → [.55, .7); then [.2, .3) into a → [.2, .22), b → [.22, .27), c → [.27, .3).

Page 8

The algorithm

To code a sequence of symbols:

Set l_0 = 0 and s_0 = 1. For i = 1, …, n:

l_i = l_{i-1} + s_{i-1} · f(T_i)
s_i = s_{i-1} · p(T_i)

The final interval is [l_n, l_n + s_n), with s_n = ∏ i=1,n p(T_i). Pick a number inside it.

Example (P(a) = .2, P(b) = .5, P(c) = .3, sequence bac): after “ba” we have l_2 = 0.2 and s_2 = 0.1; then, for c,

s_3 = 0.1 · 0.3 = 0.03
l_3 = 0.2 + 0.1 · (0.2 + 0.5) = 0.27

so the sequence interval is [0.27, 0.3).
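A minimal sketch of the interval computation above, using exact fractions; the probability tables P and F and the helper name sequence_interval are my own illustration choices.

```python
from fractions import Fraction

# Symbol probabilities p(s) and cumulative values f(s), as in the slides.
P = {'a': Fraction(2, 10), 'b': Fraction(5, 10), 'c': Fraction(3, 10)}
F = {'a': Fraction(0), 'b': Fraction(2, 10), 'c': Fraction(7, 10)}

def sequence_interval(text):
    """Return (l_n, s_n): the left end and size of the final sequence interval."""
    l, s = Fraction(0), Fraction(1)
    for ch in text:
        l = l + s * F[ch]   # l_i = l_{i-1} + s_{i-1} * f(T_i)
        s = s * P[ch]       # s_i = s_{i-1} * p(T_i)
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)   # 27/100 3/10  ->  the interval [.27, .3)
```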

Page 9

Decoding Example

Decoding the number .49, knowing the input text to be decoded is of length 3:

The message is bbc.

With the same intervals (a → [0, .2), b → [.2, .7), c → [.7, 1.0)):

.49 lies in [.2, .7) ⇒ b. Subdivide [.2, .7) into a → [.2, .3), b → [.3, .55), c → [.55, .7):
.49 lies in [.3, .55) ⇒ b. Subdivide [.3, .55) into a → [.3, .35), b → [.35, .475), c → [.475, .55):
.49 lies in [.475, .55) ⇒ c.
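A matching decoding sketch under the same toy model; the names decode and SYMS are assumptions for illustration only.

```python
from fractions import Fraction

# Same toy model as in the encoding sketch: symbols in a fixed order.
SYMS = [('a', Fraction(2, 10)), ('b', Fraction(5, 10)), ('c', Fraction(3, 10))]

def decode(x, n):
    """Decode n symbols from a number x in [0, 1), mirroring the figure."""
    out, l, s = [], Fraction(0), Fraction(1)
    for _ in range(n):
        for sym, p in SYMS:
            if x < l + s * p:      # x falls in this symbol's sub-interval
                out.append(sym)
                s = s * p          # shrink the current interval to it
                break
            l = l + s * p          # otherwise skip past this sub-interval
    return ''.join(out)

print(decode(Fraction(49, 100), 3))   # -> 'bbc'
```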

Page 10

How do we encode that number?

Binary fractional representation:

FractionalEncode(x)
1. x = 2 * x
2. If x < 1, output 0, goto 1
3. Else x = x − 1; output 1, goto 1

x = .b1 b2 b3 b4 b5 …  =  b1·2^-1 + b2·2^-2 + b3·2^-3 + b4·2^-4 + …

Example: 1/3 = .010101… in binary (incremental generation):
2 · (1/3) = 2/3 < 1, output 0
2 · (2/3) = 4/3 > 1, output 1; 4/3 − 1 = 1/3, and so on.
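A direct transcription of FractionalEncode into Python (my own sketch), using exact rationals to avoid floating-point drift; fractional_encode and nbits are assumed names.

```python
from fractions import Fraction

def fractional_encode(x, nbits):
    """Emit the first nbits bits of the binary expansion of x in [0, 1)."""
    bits = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            bits.append(0)       # step 2: x stayed below 1
        else:
            x -= 1               # step 3: subtract 1 and emit 1
            bits.append(1)
    return bits

print(fractional_encode(Fraction(1, 3), 8))   # [0, 1, 0, 1, 0, 1, 0, 1]
```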

Page 11

Which number do we encode?

Truncate the encoding to its first d = ⌈log2 (2/sn)⌉ bits.

Truncation gets a smaller number… how much smaller?

Compression = Truncation

x = 0.b1 b2 b3 … bd bd+1 bd+2 …  is truncated to  0.b1 b2 b3 … bd 0 0 0 …

The truncation error is at most 2^-d = 2^-⌈log2 (2/sn)⌉ ≤ sn/2. So, picking the midpoint ln + sn/2 of [ln, ln + sn) and truncating it to d bits still yields a number inside [ln, ln + sn).

Page 12

Bound on code length

Theorem: For a text T of length n, the Arithmetic encoder emits at most

d = ⌈log2 (2/sn)⌉ < 1 + log2 (2/sn) = 1 + (1 − log2 sn)
= 2 − log2 (∏ i=1,n p(Ti))
= 2 − ∑ i=1,n log2 p(Ti)
= 2 − ∑ s∈Σ occ(s) log2 p(s)
= 2 + n ∑ s∈Σ p(s) log2 (1/p(s))
= 2 + n H0(T) bits

n H0 + 0.02 n bits in practice, because of rounding.

Example: for T = aaba, ∑ i log2 p(Ti) = 3 · log2 p(a) + 1 · log2 p(b).

Page 13

Where is the problem ?

Take the text T = a^n b^n, hence H0 = (1/2) log2 2 + (1/2) log2 2 = 1 bit,

so the compression ratio would be about 1/8 if the input were ASCII-encoded, or there would be no compression at all if a, b were already encoded with 1 bit each.

We would like to exploit repetitions:

• wherever they occur
• whatever their length

Any permutation of T, even a random one, gets the same bound.

Page 14

Data Compression

Can we use simpler repetition-detectors?

Page 15

Simple compressors: too simple?

Move-to-Front (MTF): as a frequency-sorting approximator, as a caching strategy, as a compressor.

Run-Length-Encoding (RLE): FAX compression

Page 16

γ-code for integer encoding

For x > 0, Length = ⌊log2 x⌋ + 1. The code is (Length − 1) zeros, followed by x in binary.

e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
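A small sketch of γ-encoding and decoding (gamma_encode and gamma_decode are hypothetical helper names, not from the slides).

```python
def gamma_encode(x):
    """Elias gamma code: (Length-1) zeros, then x in binary (x > 0)."""
    b = bin(x)[2:]                      # binary representation of x
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode one gamma-coded integer from a bit-string."""
    zeros = 0
    while bits[zeros] == '0':           # count the leading zeros = Length - 1
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(9))            # '0001001'
print(gamma_decode('0001001'))    # 9
```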

Page 17

Move to Front Coding

Transforms a char sequence into an integer sequence that can then be variable-length coded.

Start with the list of symbols L = [a, b, c, d, …]. For each input symbol s:
1) output the position of s in L
2) move s to the front of L

Properties: It is a dynamic code, with memory (unlike Arithmetic)

X = 1^n 2^n 3^n … n^n:  Huff = O(n² log n) bits,  MTF = O(n log n) + n² bits

In fact, Huffman takes ≈ log n bits per symbol, since the symbols are equiprobable.

MTF uses O(1) bits per symbol occurrence, but O(log n) bits for the first occurrence of each symbol.
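A minimal MTF encoder following the two steps above; mtf_encode and the chosen starting alphabet are illustrative assumptions.

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: output each symbol's position in L, then move it to the front."""
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)     # 1) position of s in the current list
        out.append(i)
        L.pop(i)
        L.insert(0, s)     # 2) move s to the front of L
    return out

print(mtf_encode("abbbaacccca", "abc"))   # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
```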

Page 18

Run Length Encoding (RLE)

If spatial locality is very high, then

abbbaacccca => (a,1), (b,3), (a,2), (c,4), (a,1)

In the case of binary strings, just the run lengths and one starting bit suffice.

Properties: it is a dynamic code, with memory (unlike Arithmetic).

X = 1^n 2^n 3^n … n^n:  Huff(X) = O(n² log n)  >  Rle(X) = O(n (1 + log n))

RLE uses O(log n) bits per symbol-block, by γ-coding its run length.
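A minimal RLE sketch reproducing the example above; rle_encode is an assumed helper name, and the run lengths could then be γ-coded as the slide suggests.

```python
def rle_encode(text):
    """Run-Length Encoding: collapse each maximal run into (symbol, run length)."""
    out = []
    for s in text:
        if out and out[-1][0] == s:
            out[-1] = (s, out[-1][1] + 1)   # extend the current run
        else:
            out.append((s, 1))              # start a new run
    return out

print(rle_encode("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```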

Page 19

Data Compression

Burrows-Wheeler Transform

Page 20

The big (unconscious) step...

Page 21

The Burrows-Wheeler Transform (1994)

We are given a text T = mississippi#. Write down all the cyclic rotations of T, then sort the rows lexicographically. F is the first column and L the last column of the sorted matrix; L is the BWT of T.

[Figure: the sorted rotation matrix of mississippi#; its first column is F = # i i i i m p p s s s s and its last column is L = i p s s m # p i s s i i.]

Page 22

A famous example

Much longer...

Page 23

Compressing L seems promising...

Key observation: L is locally homogeneous ⇒ L is highly compressible.

Algorithm Bzip:
1. Move-to-Front coding of L
2. Run-Length coding
3. Statistical coder

Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression!

Page 24

BWT matrix

How to compute the BWT?

[Figure: the sorted rotation matrix of mississippi#, with its suffix array alongside.]

L  = i p s s m # p i s s i i
SA = 12 11 8 5 2 1 10 9 7 4 6 3

L[3] = T[ 8 - 1 ]

We said that: L[i] precedes F[i] in T

Given SA and T, we have L[i] = T[SA[i]-1]

This is one of the main reasons for the many publications on Suffix Array construction that appeared in ’94–’10.
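A toy construction of the BWT via a naive suffix array (directly sorting the suffixes, not one of the efficient constructions the slide alludes to); bwt_via_sa is my own name.

```python
def bwt_via_sa(T):
    """BWT of T (which must end with a unique smallest char, e.g. '#'),
    computed as L[i] = T[SA[i] - 1]."""
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])   # naive O(n^2 log n) suffix sorting
    return ''.join(T[i - 1] for i in sa)         # T[-1] wraps to the last char when SA[i] = 0

print(bwt_via_sa("mississippi#"))   # 'ipssm#pissii'
```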

Page 25

A useful tool: the L→F mapping

[Figure: the sorted rotation matrix of mississippi#, with the F and L columns highlighted.]

Can we map L's chars onto F's chars? We need to distinguish equal chars.

Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order in F!

Rank(char, pos) and Select(char, pos) are the key operations nowadays.

Page 26

The BWT is invertible

Two key properties:
1. The LF mapping maps L's chars onto F's chars.
2. L[i] precedes F[i] in T.

[Figure: the sorted rotation matrix with columns F and L; T = …# is unknown and is rebuilt right to left.]

Reconstruct T backward: … i p p i #

Several issues about efficiency in time and space
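A sketch of BWT inversion using exactly the two properties above; inverse_bwt and its naive counting are illustrative and not an efficient Rank/Select implementation.

```python
def inverse_bwt(L, end='#'):
    """Invert the BWT via the LF mapping: equal chars keep the same relative order
    in L and F, so LF[i] = (#chars smaller than L[i]) + (rank of L[i] in L[0..i]) - 1."""
    n = len(L)
    smaller = {c: sum(1 for d in L if d < c) for c in set(L)}   # chars strictly smaller than c
    seen, LF = {}, [0] * n
    for i, c in enumerate(L):
        seen[c] = seen.get(c, 0) + 1
        LF[i] = smaller[c] + seen[c] - 1
    i = L.index(end)          # the row ending with '#' is the rotation equal to T
    out = []
    for _ in range(n):        # rebuild T backward: L[i] precedes F[i] in T
        out.append(L[i])
        i = LF[i]
    return ''.join(reversed(out))

print(inverse_bwt("ipssm#pissii"))   # 'mississippi#'
```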

Page 27

You find this in your Linux distribution

Page 28

Suffix Array construction

Page 29

Data Compression

What about achieving high-order entropy?

Page 30

Recall that

Compression ratio =

#bits in output / #bits in input

Compression performance: we compare the entropy against the compression ratio:

H0(T)  vs  |C(T)| / |T|     or equivalently     |T| H0(T)  vs  |C(T)|

Page 31

The empirical entropy Hk

Hk(T) = (1/|T|) ∑|w|=k | T[w] | H0(T[w])

Example: Given T = “mississippi”, we have

T[w] = string of symbols that precede the substring w in T

T[“is”] = ms

Compress T up to Hk(T) ⇒ compress each T[w] up to its H0 (use Huffman or Arithmetic).

How much is this “operational”?

The distinct substrings w for H2(T), with (|T[w]|, T[w]), are

{ i_ (1, p), ip (1, s), is (2, ms), pi (1, p), pp (1, i), mi (1, _), si (2, ss), ss (2, ii) }

H2(T) = (1/11) · [1 · H0(p) + 1 · H0(s) + 2 · H0(ms) + 1 · H0(p) + …]
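A small sketch computing Hk empirically with the “preceding symbols” convention used here; H0 and Hk are assumed helper names, and the text-boundary contexts (like i_ and mi) are ignored since their single-character pieces contribute 0 anyway.

```python
import math
from collections import Counter, defaultdict

def H0(s):
    """0-th order empirical entropy of a string, with p(c) = freq(c)/|s|."""
    if not s:
        return 0.0
    return sum(f / len(s) * math.log2(len(s) / f) for f in Counter(s).values())

def Hk(T, k):
    """Hk(T) = (1/|T|) * sum over contexts w of |T[w]| * H0(T[w]),
    where T[w] collects the chars preceding each occurrence of w."""
    if k == 0:
        return H0(T)
    ctx = defaultdict(list)
    for i in range(1, len(T) - k + 1):
        ctx[T[i:i + k]].append(T[i - 1])   # char preceding the occurrence of w
    return sum(len(v) * H0(''.join(v)) for v in ctx.values()) / len(T)

T = "mississippi"
print(H0(T), Hk(T, 2))   # only T["is"] = "ms" contributes, so H2 = 2/11
```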

Page 32

BWT versus Hk

[Figure: the sorted rotation matrix of mississippi# = Bwt(T); the rows whose first k = 2 chars equal a context w form a contiguous block, and the corresponding chars of L are the symbols preceding w.]

T = m i s s i s s i p p i #   (positions 1 2 3 4 5 6 7 8 9 10 11 12),   T[w = is] = “ms”

The block of L corresponding to a context w is a permutation of T[w], but H0 does not change under permutation!

|T| H2(T) = ∑ |w|=2 |T[w]| · H0(T[w])

So, compressing the pieces of the BWT up to their H0, we achieve H2(T): we have a workable way to approximate Hk via BWT-partitions.

Page 33

Let C be a compressor achieving H0

Arithmetic(a) ≤ |a| H0(a) + 2 bits

An interesting approach: compute bwt(T), get the partition P induced by the length-k contexts, and apply C to each piece of P. The space used is

∑ |w|=k |C(T[w])| ≤ ∑ |w|=k ( |T[w]| H0(T[w]) + 2 ) ≤ |T| Hk(T) + 2 gk

where gk is the number of pieces. The partition depends on k; the approximation of Hk(T) depends on C and on gk.

Operationally, the Compression Booster [J. ACM ’05] computes the optimal partition P (the one minimizing |C(P)|) in O(n) time, and its Hk-bound holds simultaneously for every k ≥ 0.