Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004.



What is Data Compression?

• Compression is either lossless or lossy; either way, the file size is reduced

• This saves both space (storage) and time (transmission), both at a premium

• Compression algorithms are more successful when they are based on statistical analysis of the frequency of the data and on the accuracy needed to represent it

Examples in computers

• jpeg is a compressed image file

• mp3 is a compressed audio file

• zip is a compressed archive of files

• there are many encoding algorithms; we will look at Huffman’s Algorithm

(see our textbook, pp. 357-362)

What is a Greedy Algorithm?

• Solve a problem in stages

• Make a locally optimum decision

• The algorithm works well if the local optimum is equal to the global optimum

Examples of Greedy

• Dijkstra, Prim, Kruskal

• Bin Packing problem

• Huffman Code

Problem with Greedy

• The Greedy Algorithm does not always work with the set of data; there can be conflicts between local and global optima

• What if all characters are equally distributed?

• What if characters are very unequally distributed?

• A problem from our textbook: if we had such a thing as a 12-cent coin and we are asked to make 15 cents of change,

Greedy Algorithm would produce: 1 (12-cent) + 3 (penny) = 15 cents, using 4 coins (suboptimal)
Optimal answer: 1 (dime) + 1 (nickel) = 15 cents, using 2 coins
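The textbook's coin example can be sketched as code (the function and the coin set, which includes the hypothetical 12-cent coin, are my own illustration):

```python
def greedy_change(coins, amount):
    """Greedy change-making: always take the largest coin that still fits.

    coins must be sorted in descending order. This is a sketch of the
    textbook example, not an optimal change-making algorithm.
    """
    used = []
    for coin in coins:
        while amount >= coin:
            amount -= coin
            used.append(coin)
    return used

# With a hypothetical 12-cent coin in the set, greedy spends 4 coins...
greedy = greedy_change([25, 12, 10, 5, 1], 15)   # -> [12, 1, 1, 1]
# ...but the optimal answer is 2 coins: a dime and a nickel.
```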

David Huffman

• Paper published in 1952

• “A Method for the Construction of Minimum-Redundancy Codes”

• What we call “Data Compression” is what he termed “Minimum Redundancy”

ASCII Code

• 128 characters, including punctuation

• log2 128 = 7 bits

• 1 byte = 8 bits

• All characters are 8 bits long

• “Fixed-Length Encoding”

• “Etaoin shrdlu”: the twelve most common letters in English, in order of frequency!

Intro to Huffman Algorithm

• Method of construction for an encoding tree

• Full Binary Tree Representation

• Each edge of the tree has a value

(0 for the edge to the left child, 1 for the edge to the right child)

• Data is at the leaves, not internal nodes

• Result: encoding tree

• “Variable-Length Encoding”

Huffman Algorithm (English)

• 1. Maintain a forest of trees

• 2. Weight of a tree = sum of the frequencies of its leaves

• 3. Repeat N-1 times:
  – Select the two smallest-weight trees
  – Merge them to form a new tree

Huffman Algorithm (Technical)

• n ← |C|

• Q ← C

• for i ← 1 to n - 1
  – z ← AllocateNode()
  – x ← left[z] ← ExtractMin(Q)
  – y ← right[z] ← ExtractMin(Q)
  – f[z] ← f[x] + f[y]
  – Insert(Q, z)

• return ExtractMin(Q)
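The pseudocode above can be sketched in Python with the standard library's heapq as the priority queue Q. The sequence counter is a tie-breaker I've added so heap comparisons never reach the tree objects; exact tie-breaking order is an assumption, though any order yields a code of the same optimal total length.

```python
import heapq

def huffman(freqs):
    """Build a Huffman code table from a {symbol: frequency} dict.

    A tree is either a symbol (leaf) or a (left, right) pair (internal
    node). Repeatedly extract the two minimum-weight trees and merge
    them under a new node of weight f[x] + f[y].
    """
    q = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(q)
    count = len(q)                       # unique tie-breaker for pushes
    while len(q) > 1:
        fx, _, x = heapq.heappop(q)      # x <- ExtractMin(Q)
        fy, _, y = heapq.heappop(q)      # y <- ExtractMin(Q)
        count += 1
        heapq.heappush(q, (fx + fy, count, (x, y)))  # f[z] = f[x] + f[y]
    root = q[0][2]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node
            walk(node[0], prefix + "0")  # 0 on the left edge
            walk(node[1], prefix + "1")  # 1 on the right edge
        else:                            # leaf: record its codeword
            codes[node] = prefix or "0"
    walk(root, "")
    return codes

# The frequencies from the worked example in the following slides:
freqs = {'q': 10, 'w': 20, 'e': 5, 'r': 15, 't': 25, 'y': 1, 'u': 16}
codes = huffman(freqs)
```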

Ambiguity in using code?

• What if you have an encoded string:

• 000010101101011000110001110

• How do you know where to break it up?

• Prefix Coding Rule:
  – No code is a prefix of another
  – The way the tree is built disallows this
  – If there is a “00” code, there cannot be a “0” code
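To see why the prefix rule matters, consider a code where "0" is a prefix of "00". A hypothetical helper (my own, not from the slides) counts the distinct ways a bit string can be split into codewords:

```python
def parses(bits, table):
    """Count the ways bits can be split into codewords from table
    (keys are codewords). More than one parse means the code is
    ambiguous; a prefix-free code always has at most one."""
    if not bits:
        return 1
    return sum(parses(bits[len(c):], table)
               for c in table if bits.startswith(c))

bad = {"0": "a", "00": "b"}   # "0" is a prefix of "00"
parses("000", bad)            # -> 3 parses: a|a|a, a|b, b|a
```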

Step 0: start with a forest of single-node trees:

  q=10  w=20  e=5  r=15  t=25  y=1  u=16

Step 1: merge the two smallest, y (1) and e (5), into a tree of weight 6:

     ( )6
     /  \
   (y)  (e)

Step 2: merge the 6-tree and q (10) into a tree of weight 16:

      ( )16
      /   \
    ( )6  (q)
    /  \
  (y)  (e)

Step 3: merge r (15) and u (16) into a tree of weight 31:

     ( )31
     /   \
   (r)   (u)

Step 4: merge the 16-tree and w (20) into a tree of weight 36:

        ( )36
        /   \
     ( )16  (w)
     /   \
   ( )6  (q)
   /  \
 (y)  (e)

Step 5: merge t (25) and the 31-tree into a tree of weight 56:

      ( )56
      /   \
    (t)   ( )31
          /   \
        (r)   (u)

Step 6: merge the 36-tree and the 56-tree into the final tree of weight 92:

              ( )92
             /     \
        ( )36       ( )56
        /   \       /   \
     ( )16  (w)   (t)   ( )31
     /   \              /   \
   ( )6  (q)          (r)   (u)
   /  \
 (y)  (e)

When the tree is used to encode a file, it is written as a header above the body of the encoded bits of text.

• 0 is the left edge, 1 is the right edge

• use a stack to traverse the tree and write out each leaf’s code

Table:

  0000 → y      01  → w
  0001 → e      10  → t
  001  → q      110 → r
                111 → u

Header:

0000y0001e001q01w10t110r111u
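Prefix-freeness is what makes the encoded body decodable without separators. A minimal decoding sketch (the helper is my own, not from the slides), applied to the example bit string from the earlier "Ambiguity" slide using the table above:

```python
def decode(bits, table):
    """Decode a bit string with a prefix code: since no codeword is a
    prefix of another, scan left to right and emit a symbol as soon as
    the buffer matches a codeword."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in table:
            out.append(table[buf])
            buf = ""
    assert buf == "", "leftover bits: not a valid encoding"
    return "".join(out)

# The code table from the slides, keyed by codeword:
table = {"0000": "y", "0001": "e", "001": "q", "01": "w",
         "10": "t", "110": "r", "111": "u"}
decode("000010101101011000110001110", table)  # -> "yttrtrqtqr"
```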

Proof: part 1

• Lemma:
  – Let C be an alphabet in which each character c in C has frequency f[c]
  – Let x and y be two characters in C having the lowest frequencies
  – Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit

Proof: part 2

• Lemma:
  – Let T be a full binary tree representing an optimal prefix code over an alphabet C
  – Let z be the parent of two leaf characters x and y
  – Then T′ = T − {x, y} represents an optimal prefix code for C′ = (C − {x, y}) ∪ {z}

Lengths of Encoding Set

A perfectly balanced tree with 8 leaves, all at depth 3:

              root
            /      \
          ( )      ( )
         /   \    /   \
       ( )  ( ) ( )   ( )
       / \  / \  / \  / \
      1  2 3  4 5  6 7  8

Length of the set is: (8 codes) × (3 bits) = 24 bits

This is what you would get if the symbols are mostly random and equal in probability.

Lengths of Encoding Set

A completely skewed tree, with one leaf per level:

          root
          /  \
        ( )   8
        /  \
      ( )   7
      /  \
    ( )   6
    /  \
  ( )   5
  /  \
( )    4
/  \
( )  3
/ \
1  2

Length of the set is: 7 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 35 bits

This is what you would get if the symbols vary the most in probability.

Expected Value / character

• In example 1 (balanced tree):

• 8 * (1/2^3 * 3) = 3 bits

• In example 2 (skewed tree):

• 2 * (1/2^7 * 7) + (1/2^6 * 6) + (1/2^5 * 5) + (1/2^4 * 4) + (1/2^3 * 3) + (1/2^2 * 2) + (1/2^1 * 1) ≈ 1.98 bits
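Both expected values can be checked directly (a quick sketch; the variable names are mine):

```python
# Example 1: 8 symbols, each with probability 1/8 and a 3-bit code.
ex1 = 8 * (1/2**3) * 3            # 3.0 bits per symbol

# Example 2: the skewed tree. The two deepest leaves have probability
# 1/2**7 and 7-bit codes; above them, depth d has probability 1/2**d.
ex2 = 2 * (1/2**7) * 7 + sum((1/2**d) * d for d in range(1, 7))
# about 1.98 bits per symbol
```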

Main Point

• Statistical methods work better when the symbols in the data set have varying probabilities.

• Otherwise you need to use a different method of compression (for example, JPEG)

Image Compression

• “Lossy” – meaning details are lost

• An approximation of original image is made where large areas of similar color are combined into a single block

• This introduces a certain amount of error, which is the tradeoff: a smaller file for lower fidelity

Steps to Image Compression

• Specify requested output file size

• Divide image into several areas

• Divide file size by the # of areas

• Quantize each area (information lost here)

• Encode each area separately, write to file
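The steps above can be sketched on a grayscale image stored as a 2-D list. This toy version simply averages each block; real formats such as JPEG quantize DCT coefficients instead, so this only illustrates where information is lost:

```python
def quantize(image, block):
    """Replace each block x block area of a grayscale image (2-D list
    of ints) with its average value. The averaging is the lossy step:
    the original pixel values cannot be recovered afterwards."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            vals = [image[y][x] for y in ys for x in xs]
            avg = sum(vals) // len(vals)    # information is lost here
            for y in ys:
                for x in xs:
                    out[y][x] = avg
    return out

quantize([[0, 2], [4, 6]], 2)   # -> [[3, 3], [3, 3]]
```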

Image Decomposition

References

• Weiss, Mark Allen. Data Structures and Algorithm Analysis.

• Cormen, Thomas H., et al. Introduction to Algorithms.
