Page 1

Data Compression and Huffman’s Algorithm
15-211 Fundamental Data Structures and Algorithms
Klaus Sutner
February 3, 2004

Page 2

Announcements

• Homework Number 4 is on its way… It's a bit hard conceptually, so don't procrastinate.

• Read Chapters 7 and 12

Page 3

Data Compression

Page 4

Data compression

• Is one of the fundamental technologies of the Internet.

• Is necessary for faster data transmission.

• Useful even locally, to keep files small or to back up data.

Page 5

Data compression

• Types of compression:
Lossless – encodes the original information exactly.
Lossy – approximates the original information.

• Uses of compression:
Images over the web: JPEG
Music: MP3
General-purpose: ZIP, GZIP, JAR, …

Page 6

Lossy vs. Lossless

• What is the practical impact of lossy compression?

Page 7

Compare two images

One image is 400K, the other is 1100K. Which is which?

Page 8

So where is the difference?

Page 9

Another Example - SVD

(Image panels: Rank 1, Rank 8, Rank 16, Original; file sizes shown include 2231 bytes and 4549 bytes. Images omitted from the transcript.)

Page 10

What can we conclude?

• There is definitely a trade-off.

• Lossless may not perform so well, but it retains 100% of the information.

• Lossy can perform extremely well, but is the compression worth the loss of information?

• So how do we decide which one to use?

Page 11

Some Considerations

• What types of files would you use a lossless algorithm on?

• What types of files would you use a lossy algorithm on?

Page 12

Some Considerations

• What types of files would you use a lossless algorithm on? Discrete data (e.g. text files).

• What types of files would you use a lossy algorithm on? Analog data (e.g. images, music).

Page 13

Question #1

• Is there a lossless compression algorithm that can compress any file?

Page 14

Answer

• Absolutely not!

• Why not?

Count binary strings of length N: there are 2^N of them, but only 2^N − 1 binary strings of length less than N, so no lossless scheme can map every length-N file to a strictly shorter one.
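The counting argument can be checked numerically. A small sketch (my own illustration, not from the slides):

```python
# There are 2**n bit strings of length exactly n, but only
# 2**n - 1 bit strings of length strictly less than n (counting
# the empty string). So an injective compressor cannot shrink
# every length-n file: it is one output short.
def count_strictly_shorter(n):
    return sum(2 ** k for k in range(n))  # 2**0 + ... + 2**(n-1) = 2**n - 1

n = 10
print(2 ** n, count_strictly_shorter(n))  # 1024 1023
```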

Page 15

Question #2

• Is there a best possible way to compress files?

• Is there an algorithm that always produces the smallest compressed file possible?

No!

Page 16

No optimal compression

• Suppose you wish to compress the first 10,000 digits of Pi.

• In case they slipped your mind…

Page 17

Pi 10000

31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185950244594553469083026425223082533446850352619311881710100031378387528865875332083814206171776691473035982534904287554687311595628638823537875937519577818577805321712268066130019278766111959092164201989380952572010654858632788659361533818279682303019520353018529689957736225994138912497217752834791315155748572424541506959508295331168617278558890750983817546374649393192550604009277016711390098488240128583616035637076601047101819429555961989467678374494482553797747268471040475346462080466842590694912933136770289891521047521620569660240580381501935112533824300355876402474964732639141992726042699227967823547816360093417216412199245863150302861829745557067498385054945885869269956909272107975093029553211653449872027559602364806654991198818347977535663698074265425278625518184175746728909777727938000816470600161452491921732172147723501414419735685481613611573525521334757418494684385233239073941433345477624168625189835694855620992192221842725502542568876717904946016534668049886272327917860857843838279679766814541009538837863609506800642251252051173929848960841284886269456042419652850222106611863067442786220391949450471237137869609563643719172874677646575739624138908658326459958133904780275900
99465764078951269468398352595709825822620522489407726719478268482601476990902640136394437455305068203496252451749399651431429809190659250937221696461515709858387410597885959772975498930161753928468138268683868942774155991855925245953959431049972524680845987273644695848653836736222626099124608051243884390451244136549762780797715691435997700129616089441694868555848406353422072225828488648158456028506016842739452267467678895252138522549954666727823986456596116354886230577456498035593634568174324112515076069479451096596094025228879710893145669136867228748940560101503308617928680920874760917824938589009714909675985261365549781893129784821682998948722658804857564014270477555132379641451523746234364542858444795265867821051141354735739523113427166102135969536231442952484937187110145765403590279934403742007310578539062198387447808478489683321445713868751943506430218453191048481005370614680674919278191197939952061419663428754440643745123718192179998391015919561814675142691239748940907186494231961567945208095146550225231603881930142093762137855956638937787083039069792077346722182562599661501421503068038447734549202605414665925201497442850732518666002132434088190710486331734649651453905796268561005508106658796998163574736384052571459102897064140110971206280439039759515677157700420337869936007230558763176359421873125147120532928191826186125867321579198414848829164470609575270695722091756711672291098169091528017350671274858322287183520935396572512108357915136988209144421006751033467110314126711136990865851639831501970165151168517143765761835155650884909989859982387345528331635507647918535893226185489632132933089857064204675259070915481416549859461637180270981994309924488957571282890592323326097299712084433573265489382391193259746366730583604142813883032038249037589852437441702913276561809377344403070746921120191302033038019762110110044929321516084244485963766983895228684783123552658213144957685726243344189303968642624341077322697802807318915441101044682325271620105265227211166039
6665573092547110557853763466820653109896526918620564769312570586356620185581007293606598764861179104533488503461136576867532494416680396265797877185560845529654126654085306143444318586769751456614068007002378776591344017127494704205622305389945613140711270004078547332699390814546646458807972708266830634328587856983052358089330657574067954571637752542021149557615814002501262285941302164715509792592309907965473761255176567513575178296664547791745011299614890304639947132962107340437518957359614589019389713111790429782856475032031986915140287080859904801094121472213179476477726224142548545403321571853061422881375850430633217518297986622371721591607716692547487389866549494501146540628433663937900397692656721463853067360965712091807638327166416274888800786925602902284721040317211860820419000422966171196377921337575114959501566049631862947265473642523081770367515906735023507283540567040386743513622224771589150495309844489333096340878076932599397805419341447377441842631298608099888

Page 18

How about a program?

long a[35014],b,c=35014,d,e,f=1e4,g,h;main(){ for(;b=c-=14;h=printf("%04ld",e+d/f)) for(e=d%=f;g=--b*2;d/=g) d=d*b+f*(h?a[b]:f/5), a[b]=d%--g;}

Page 19

pitiny.c

• This C program is just 143 characters long!

• And it “decompresses” into the first 10,000 digits of Pi.

long a[35014],b,c=35014,d,e,f=1e4,g,h; main(){for(;b=c-=14;h=printf("%04ld",e+d/f)) for(e=d%=f;g=--b*2;d/=g) d=d*b+f*(h?a[b]:f/5), a[b]=d%--g;}

Page 20

Program Size Complexity

• There is an interesting idea here: find the shortest program that computes a certain output.

• A very important idea in theoretical computer science.

• Can be used to define incompressible data (no shorter program will produce these data).

• An excellent source of examples/counterexamples.

Page 21

PSC versus Physics

In fact, PSC pops up naturally when one studies the physical limits of computation.

Crucial problem: How much heat must we dissipate when we perform a computation?

This is a HUGE problem for super-computers: detractors would say that a super-computer is a big refrigerator plus a few chips and disks.

Page 22

PSC versus Physics

Surprisingly, we only need to dissipate energy when we erase a bit.

Everything else can be done without energy cost (reversible computation), or at least with little cost.

Erasing information cannot be avoided in general.

BUT: before you erase, you can compress garbage bits, thus lowering the thermodynamic cost.

The limit for compression is given by PSC.

Page 23

PSC and Compression

Unfortunately, there is no algorithm that, given some binary string x, would compute the shortest program p(x) that generates x.

Also note that the shortest program might take a long time to generate x.

So, for data compression PSC is quite useless.

Page 24

Extra credit

• Come up with the (a) shortest Java program that computes the first 10,000 digits of Pi and writes them to the screen.

• Incidentally, I don’t know how or why pitiny.c works.

Page 25

Getting close

• In practice, the best we can hope for is a program that does good compression in interesting cases:
Text files
Numerical data
Voice
Music
Images
Video
…

Page 26

How does compression work?

• Lossy algorithms are generally mathematically based. They work by applying transforms, e.g. JPEG uses the discrete cosine transform.

• Lossy algorithms attempt to approximate the original data.

• Lossless algorithms cannot do that since they need to maintain the original data.

• So what can they do?

Page 27

How does compression work?

• They need to analyze the file and take advantage of certain properties it might have.

• Or its structure.

• We’ll look at two important lossless compression methods: Huffman compression and LZW compression.

Page 28

Interlude: Bit-level Representation of Data

Page 29

Bits and Bytes

• All data is stored on a computer as a sequence of 0’s and 1’s, called bits.

• This is a very natural way to represent data, for the following reason:

• A computer cannot, in general, infer 10 different values from the intensity of a signal.

• It can, however, infer 2 different values very easily, i.e. whether the signal is high or low.

Page 30

Bits and Bytes

• The problem: If we use sequences of just 0’s and 1’s instead of 0…9 to represent data, regardless of the convenience, aren’t we using a lot more space?

• To address this issue, let’s consider a specific question…

Page 31

Quiz Break

Page 32

Bits and Bytes

• Suppose you had a text file (say, the complete works of Shakespeare) and you know that it has 32 different symbols and a total of 100,000 characters.

• How much space would be needed to represent this in base 10?

• How about base 2?

• In big-Oh terms, how much more space is needed by the base 2 representation?
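One way to work the quiz numbers (my own arithmetic sketch; the slide leaves the question open). With 32 symbols, each character needs 2 decimal digits in base 10 but 5 bits in base 2, so the blow-up is a constant factor:

```python
import math

symbols, chars = 32, 100_000
digits_per_char = math.ceil(math.log10(symbols))  # 2 decimal digits suffice for 32 symbols
bits_per_char = math.ceil(math.log2(symbols))     # 5 bits suffice

print(chars * digits_per_char)  # 200000 digits in base 10
print(chars * bits_per_char)    # 500000 bits in base 2
# The blow-up factor is at most log2(10) ~ 3.32, a constant,
# so in big-Oh terms base 2 needs no more space than base 10.
```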

Page 33

Bits and Bytes

• Okay, so we’ve established that it’s easiest to store data as a sequence of 0’s and 1’s, but how does that help us?

• In particular, how do I take a text file and store it on the computer?

• To do this we need to invent a code.

Page 34

Codes

Page 35

Codes

Fix some alphabet A. The elements of A are characters or letters.

A (binary) code for A is a map C from A to binary sequences.

Apply C pointwise to define the code of a word over A.

Thus any word over A is transformed by C to a binary sequence.

Page 36

Example

• Suppose we have A = {a,b,c,d,e}.

• We can use the following 3-bit code:

Symbols:   a    b    c    d    e
Codewords: 000  001  010  011  100

• “badcae” maps to 001 000 011 010 000 100

• Really just 001000011010000100
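Applying the code pointwise is a one-liner. A quick sketch of encoding “badcae” with the 3-bit code from this slide:

```python
code = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100'}

def encode(word, code):
    # Apply the code letter by letter and concatenate the codewords.
    return ''.join(code[ch] for ch in word)

print(encode('badcae', code))  # 001000011010000100
```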

Page 37

Decoding

We need to be able to go from binary sequences back to words over alphabet A.

Note that not every binary sequence may be the code of a word over A.

What properties must C have so we can decode?

Clearly, any two words over A must translate into different binary sequences under C.

Page 38

Fixed Length Codes

Easy case: All codewords C(a) have the same length.

Important Example: ASCII (7-bit and 8-bit)

Can use specialized hardware to digest whole blocks of bits.

Very simple, but not particularly flexible.

Page 39

Decoding

Is fixed length necessary for decoding?

Clearly not: the following table defines a code and all codewords have different lengths.

In fact, this code is instantaneously decodable: as soon as we have read enough bits for a letter we can determine the right letter.

Symbols:   a   b    c     d
Codewords: 01  001  0001  00001
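Taking the code on this slide to be a = 01, b = 001, c = 0001, d = 00001 (reconstructed from the transcript's garbled table, so treat the exact assignment as an assumption), every codeword ends at its first 1, so the decoder can emit a letter the moment a codeword is complete:

```python
code = {'a': '01', 'b': '001', 'c': '0001', 'd': '00001'}
table = {cw: sym for sym, cw in code.items()}

def decode_instantly(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in table:            # codeword complete: emit immediately
            out.append(table[buf])
            buf = ''
    return ''.join(out)

print(decode_instantly('01001000100001'))  # abcd
```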

Page 40

Decoding

Is instantaneous decodability necessary?

Or may it happen that we have to read a large part of the coded message before we can determine the first letter?

Note that this would probably cause a number of efficiency problems.

Symbols:   a   b   c   d    e
Codewords: 00  01  11  001  011

Page 41

Decoding

Try to decode

00001011010011

You'll need to do some look-ahead.

Symbols:   a   b   c   d    e
Codewords: 00  01  11  001  011
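Assuming the reconstructed table a = 00, b = 01, c = 11, d = 001, e = 011 (the transcript's table is garbled, so treat this assignment as an assumption), greedy scanning fails because some codewords are prefixes of others; a decoder needs look-ahead, sketched here as backtracking search:

```python
code = {'a': '00', 'b': '01', 'c': '11', 'd': '001', 'e': '011'}

def decode(bits):
    # Depth-first search with backtracking: '00' could be an 'a'
    # or the start of a 'd', so a greedy left-to-right scan can
    # get stuck and must be able to revise earlier choices.
    if not bits:
        return ''
    for sym, cw in code.items():
        if bits.startswith(cw):
            rest = decode(bits[len(cw):])
            if rest is not None:
                return sym + rest
    return None

print(decode('00001011010011'))  # adebac
```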

Page 42

Prefix (Free) Codes

There is a nice class of codes that are easily decodable:

No codeword C(a) is allowed to be a prefix of another codeword C(b) where a and b are letters.

How would you construct a decoder for a prefix code?
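One natural decoder walks a binary trie built from the codewords: follow edges bit by bit, and whenever you land on a leaf, emit its letter and jump back to the root. A sketch (my own illustration), using the small prefix code a = 1, b = 001, c = 000, d = 01 from the tree-representation slide later in the deck:

```python
def build_trie(code):
    root = {}
    for sym, cw in code.items():
        node = root
        for bit in cw:
            node = node.setdefault(bit, {})
        node['sym'] = sym  # mark the leaf with its letter
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if 'sym' in node:       # reached a leaf: emit, restart at the root
            out.append(node['sym'])
            node = root
    return ''.join(out)

trie = build_trie({'a': '1', 'b': '001', 'c': '000', 'd': '01'})
print(decode('1001000011', trie))  # abcda
```

The prefix condition is exactly what guarantees that letters sit only at leaves, so the walk never has to guess.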

Page 43

Good Prefix Codes

If we know nothing about the text to be encoded, we may as well use a fixed length code.

But if we are given the frequency distribution of the letters in A we can do better:

Frequent letters should get short codewords.

And, of course, we are not allowed to violate the prefix condition.

Page 44

Example

Symbols:                a    b    c    d    e    Total
Frequency:              50   25   15   40   75   205 chars
Fixed-length code:      000  001  010  011  100  615 bits
Prefix code (optimal):  10   001  000  01   11   450 bits
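The totals on this slide can be checked mechanically. The per-symbol prefix codewords below are my reconstruction of the transcript's garbled table, so treat that assignment as an assumption; the totals 205 / 615 / 450 are from the slide:

```python
freq = {'a': 50, 'b': 25, 'c': 15, 'd': 40, 'e': 75}
fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100'}
prefix = {'a': '10', 'b': '001', 'c': '000', 'd': '01', 'e': '11'}

def total_bits(freq, code):
    # Sum of frequency x codeword length over the alphabet.
    return sum(freq[c] * len(code[c]) for c in freq)

print(sum(freq.values()))        # 205 chars
print(total_bits(freq, fixed))   # 615 bits
print(total_bits(freq, prefix))  # 450 bits
```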

Page 45

Huffman’s Algorithm

Page 46

Tree representation

• Represent prefix-free codes as full binary trees.

• Full: every node is a leaf, or has exactly 2 children.

• The encoding is then a (unique) path from the root to a leaf.

(Tree diagram: left edges labeled 0, right edges labeled 1; the leaves are a, b, c, d, and each root-to-leaf path spells out a codeword.)

a=1, b=001, c=000, d=01

Page 47

Why a full binary tree?

• A node with no sibling can be moved up 1 level, improving the code.

• An optimal code for a string can always be represented by a full binary tree.

(Diagrams: a code tree containing a node with only one child, and the improved tree obtained by moving that child up one level.)

Page 48

Encoding cost

• Alphabet: A
Symbol: c
Symbol frequency: f(c)
Depth in tree T: d(c) (d(c) is also the number of bits needed to encode c)

• Encoding cost: K(T) = Σ_{c ∈ A} f(c) · d(c)

• Q: How to construct a full binary tree that minimizes K?

Page 49

Huffman’s Algorithm

• Huffman’s algorithm will give you an optimal prefix-free code by constructing an appropriate tree.

• Data structure used: A Priority Queue.

• insert(element, priority) inserts an element with a given priority into the queue.

• deleteMin() removes and returns the element with the least priority.

Page 50

Huffman’s Algorithm

1. Compute f(c) for every symbol c ∈ C

2. insert(c, f(c)) into priority queue Q, for each c

3. for i = 1 to |C| − 1 (i.e. while Q holds more than one tree)

4. z = new TreeNode()

5. x = z.left = Q.deleteMin()

6. y = z.right = Q.deleteMin()

7. f(z) = f(x) + f(y)

8. Q.insert(z, f(z))

9. return Q.deleteMin()
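The pseudocode translates almost line for line into a runnable sketch (illustrative Python using heapq as the priority queue, not the course's Java implementation; the tiebreak counter is my addition to keep heap comparisons well defined):

```python
import heapq
from collections import Counter

def huffman_code(text):
    freq = Counter(text)
    # Heap of (frequency, tiebreak, tree); a tree is a symbol (leaf)
    # or a (left, right) pair of subtrees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)    # x = z.left  = Q.deleteMin()
        fy, _, y = heapq.heappop(heap)    # y = z.right = Q.deleteMin()
        heapq.heappush(heap, (fx + fy, counter, (x, y)))  # f(z) = f(x) + f(y)
        counter += 1
    code = {}
    def walk(tree, path):
        if isinstance(tree, tuple):       # internal node: 0 left, 1 right
            walk(tree[0], path + '0')
            walk(tree[1], path + '1')
        else:
            code[tree] = path or '0'      # edge case: one-symbol alphabet
    walk(heap[0][2], '')
    return code

code = huffman_code('aaaabbc')
print({s: len(cw) for s, cw in sorted(code.items())})  # {'a': 1, 'b': 2, 'c': 2}
```

Exact codewords depend on how ties are broken, but the codeword lengths (and hence the total cost) are the optimal ones.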

Pages 51–56

Example: step-by-step construction of the Huffman tree (figures omitted from the transcript)

Page 57

Huffman’s Algorithm

• Is a greedy algorithm that constructs an optimal prefix free code for a given piece of data

• Does it really generate an optimal prefix free code?

• Yes, but the proof is beyond the scope of today’s lecture. But see it in recitation…

Page 58

Huffman’s Algorithm

• Why is it greedy?

• Because at each iteration of the loop, it picks the two “cheapest” trees in the priority queue and merges them, without considering the implications from a global standpoint.

Page 59

Back to Bits and Bytes

• Notice that Huffman’s algorithm, in the setting we studied it, can only compress files of characters since it needs to know what the alphabet is in order to count the frequencies.

• Do we need to modify the algorithm in order to compress arbitrary files?

• Take a minute to think about this.

Page 60

Bits and Bytes

• No, we don’t!

• Suppose we have a file F to compress. We can treat F as a stream of bits.

• So we read it one byte at a time and consider each byte in the context of our predefined alphabet, ASCII in this case.

• Implicitly, we then end up treating every file as a text file.

• Is that a good idea? What about images?

Page 61

Bits and Bytes

• It doesn’t matter!

• So long as we reproduce the original bit sequence after decompression.

• We can treat the file as containing just the characters {a,b,c,d} if we want, it won’t affect the correctness of our algorithm.

• It will, however, affect the performance.

• Why?

Page 62

Huffman compression

• Huffman trees provide a straightforward method for file compression:
1. Read the file and compute frequencies.
2. Use the frequencies to build Huffman codes.
3. Encode the file using the codes.
4. Write the codes (or the tree) and the encoded file into the output file.

Sometimes students find this to be tricky…
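A toy in-memory roundtrip of those four steps (an illustrative sketch; in a real compressor the code table and the bits are written to a file, which is the part students find tricky). The code table here is hand-picked for the sample text, not necessarily what Huffman's algorithm would build:

```python
def encode(text, code):
    return ''.join(code[ch] for ch in text)

def decode(bits, code):
    # Invert the table; with a prefix code the first match is the only match.
    table = {cw: sym for sym, cw in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in table:
            out.append(table[buf])
            buf = ''
    return ''.join(out)

text = 'abracadabra'
code = {'a': '0', 'b': '10', 'r': '110', 'c': '1110', 'd': '1111'}
bits = encode(text, code)
print(len(bits), 'bits, versus', 8 * len(text), 'bits as 8-bit characters')
assert decode(bits, code) == text  # lossless: the roundtrip is exact
```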

Page 63

Variations

• Reading the file twice is a pain: once to compute frequencies, and again to do the compression.

• It is possible to build an adaptive Huffman tree that adjusts itself as more data becomes available.

Page 64

Beating Huffman

• How about doing better than Huffman!

• Impossible! Huffman’s algorithm gives the optimal prefix code!

• Right. But who says we have to use a prefix code?

Page 65

Example

• Suppose we have a file containing abcdabcdabcdabcdabcdabcd…abcdabcd

• This could be expressed very compactly as abcd1000
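The trick can be sketched as storing the repeated block plus a count (an illustrative toy, not a real compressor):

```python
data = 'abcd' * 1000  # abcdabcd...abcd, 4000 characters

# Store the block and the repeat count instead of the raw text.
compressed = ('abcd', 1000)

def expand(block, count):
    return block * count

assert expand(*compressed) == data
print(len(data), '->', len(compressed[0]) + len(str(compressed[1])))  # 4000 -> 8
```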