Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004
Page 1: Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004.

Data Compression

Gabriel Laden

CS146 – Dr. Sin-Min Lee

Spring 2004

Page 2:

What is Data Compression?

• Compression can be lossless or lossy; either way, the file size is reduced

• This saves both time and space (both at a premium)

• Data compression algorithms are more successful when they are based on a statistical analysis of the frequency of the data and the accuracy needed to represent it

Page 3:

Examples in computers

• jpeg is a compressed image file

• mp3 is a compressed audio file

• zip is a compressed archive of files

• There are many encoding algorithms; we will look at Huffman's algorithm

(see our textbook pp.357-362)

Page 4:

What is a Greedy Algorithm?

• Solve a problem in stages

• At each stage, make the locally optimal decision

• The algorithm is good if the local optimum equals the global optimum

Page 5:

Examples of Greedy

• Dijkstra, Prim, Kruskal

• Bin Packing problem

• Huffman Code

Page 6:

Problem with Greedy

• A greedy algorithm does not always work for a given data set; there can be conflicts

• What if all characters are equally distributed?

• What if characters are very unequally distributed?

• A problem from our textbook: if we had such a thing as a 12-cent coin and were asked to make 15 cents change, the greedy algorithm would produce 1 (12-cent) + 3 (penny) = 15, using four coins; the optimal answer is 1 (dime) + 1 (nickel) = 15, using two
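The coin-change failure above is easy to demonstrate. A minimal sketch (the function name and coin lists are ours, chosen to match the textbook example):

```python
# Greedy coin change: always take the largest coin that still fits.
def greedy_change(amount, coins):
    used = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            used.append(coin)
    return used

# With a hypothetical 12-cent coin, greedy picks 12 first and needs 4 coins:
print(greedy_change(15, [1, 5, 10, 12]))   # [12, 1, 1, 1]
# Without it, greedy happens to find the 2-coin optimum:
print(greedy_change(15, [1, 5, 10]))       # [10, 5]
```

The local optimum (grab the 12-cent coin) blocks the global optimum (dime + nickel), which is exactly the conflict the slide describes.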

Page 7:

David Huffman

• Paper published in 1952

• “A Method for the Construction of Minimum-Redundancy Codes”

• What we call “Data Compression” is what he termed “Minimum Redundancy”

Page 8:

ASCII Code

• 128 characters, including punctuation

• log₂ 128 = 7 bits

• 1 byte = 8 bits

• All characters are 8 bits long

• “Fixed-Length Encoding”

• “Etaoin shrdlu” — the most common letters in English, in order!
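The slide's arithmetic can be checked directly; a quick sketch:

```python
import math

# 128 distinct characters need log2(128) = 7 bits to tell apart...
assert math.log2(128) == 7

# ...but ASCII stores each character in a full 8-bit byte,
# a fixed-length encoding regardless of how common the character is.
print(format(ord("A"), "08b"))  # 01000001
```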

Page 9:

Intro to Huffman Algorithm

• Method of construction for an encoding tree

• Full Binary Tree Representation

• Each edge of the tree has a value

(0 for an edge to a left child, 1 for an edge to a right child)

• Data is at the leaves, not internal nodes

• Result: encoding tree

• “Variable-Length Encoding”

Page 10:

Huffman Algorithm (English)

• 1. Maintain a forest of trees

• 2. The weight of a tree is the sum of the frequencies of its leaves

• 3. Repeat N − 1 times:
– Select the two smallest-weight trees
– Merge them into a new tree

Page 11:

Huffman Algorithm (Technical)

n ← |C|
Q ← C
for i ← 1 to n − 1
  do z ← AllocateNode()
     x ← left[z] ← ExtractMin(Q)
     y ← right[z] ← ExtractMin(Q)
     f[z] ← f[x] + f[y]
     Insert(Q, z)
return ExtractMin(Q)
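The pseudocode above maps naturally onto a binary heap as the priority queue Q. A runnable sketch (our own names; trees are represented as nested `(left, right)` tuples, with a counter used only to break frequency ties):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build Huffman codes for a non-empty string: repeatedly extract the
    two minimum-frequency trees, merge them, and reinsert the result."""
    freq = Counter(text)
    # Heap entries are (frequency, tiebreaker, tree); a tree is either a
    # character (leaf) or a (left, right) tuple (internal node).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)          # x <- ExtractMin(Q)
        fy, _, y = heapq.heappop(heap)          # y <- ExtractMin(Q)
        count += 1
        heapq.heappush(heap, (fx + fy, count, (x, y)))  # f[z] = f[x] + f[y]

    # Walk the finished tree: 0 for each left edge, 1 for each right edge.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"         # lone character: 1-bit code
    _, _, root = heap[0]
    walk(root, "")
    return codes
```

For example, `huffman_codes("aaaabbc")` gives the most frequent character a 1-bit code and the two rarer characters 2-bit codes.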

Page 12:

Ambiguity in using code?

• What if you have an encoded string:

• 000010101101011000110001110

• How do you know where to break it up?

• Prefix Coding Rule
– No code is a prefix of another
– The way the tree is built disallows this
– If there is a “00” code, there cannot be a “0” code
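The prefix rule is what makes the encoded string above unambiguous: reading left to right, the first complete codeword you see is always the right one. A small sketch (the code table is the one built later in this deck):

```python
def decode(bits, codes):
    """Decode a prefix-coded bit string greedily: because no code is a
    prefix of another, a completed codeword can be emitted immediately."""
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:      # a complete codeword ends here
            out.append(inverse[current])
            current = ""
    return "".join(out)

# The table from the later slide: y=0000, e=0001, q=001, w=01, t=10, r=110, u=111
codes = {"y": "0000", "e": "0001", "q": "001", "w": "01",
         "t": "10", "r": "110", "u": "111"}
print(decode("000010101101011000110001110", codes))  # yttrtrqtqr
```

The slide's 27-bit example string breaks up in exactly one way, so no separators between codewords are needed.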

Page 13:

Step 0: seven single-node trees, one per character:

q=10  w=20  e=5  r=15  t=25  y=1  u=16

Step 1: merge the two smallest, y(1) and e(5), into a tree of weight 6:

  ( 6)
  /  \
(y)  (e)

Remaining roots: q=10, w=20, r=15, t=25, u=16, (6)

Step 2: merge (6) and q(10) into a tree of weight 16:

   (16)
   /  \
 ( 6)  (q)
 /  \
(y)  (e)

Remaining roots: w=20, r=15, t=25, u=16, (16)

Step 3: merge r(15) and u(16) into a tree of weight 31:

  (31)
  /  \
(r)  (u)

Remaining roots: w=20, t=25, (16), (31)

Page 14:

Step 4: merge (16) and w(20) into a tree of weight 36:

      (36)
      /  \
   (16)  (w)
   /  \
 ( 6)  (q)
 /  \
(y)  (e)

Remaining roots: t=25, (31), (36)

Step 5: merge t(25) and (31) into a tree of weight 56:

  (56)
  /  \
(t)  (31)
      /  \
    (r)  (u)

Remaining roots: (36), (56)

Step 6: merge (36) and (56) to form the final tree of weight 92:

          (92)
         /    \
     (36)      (56)
     /  \      /  \
  (16)  (w)  (t)  (31)
  /  \             /  \
( 6)  (q)        (r)  (u)
/  \
(y)  (e)

Page 15:


When the tree is used to encode a file, it is written as a header above the body of the encoded bits of text.

• 0 is a left edge, 1 is a right edge

• use a stack to do this

Table:

y 0000
e 0001
q 001
w 01
t 10
r 110
u 111

Header:

0000y0001e001q01w10t110r111u
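The "use a stack" remark can be made concrete with an explicit-stack traversal that assigns 0 along left edges and 1 along right edges. A sketch, with the slide's final tree written as nested tuples (our own representation, not from the slides):

```python
def code_table(tree):
    """Build the code table by walking the tree with an explicit stack:
    append '0' descending to a left child, '1' to a right child."""
    table = {}
    stack = [(tree, "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):          # internal node: (left, right)
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:                                # leaf: a character
            table[node] = prefix
    return table

# The final tree from the construction steps: y/e under a weight-6 node,
# then q, then w on the left side; t and r/u on the right side.
tree = (((("y", "e"), "q"), "w"), ("t", ("r", "u")))
table = code_table(tree)
# table["y"] == "0000", table["q"] == "001", table["w"] == "01", ...
```

The resulting table matches the slide: frequent characters (w, t) get short codes, rare ones (y, e) get long codes.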

Page 16:

Proof: part 1

• Lemma:
– Let C be an alphabet in which each character c in C has frequency f[c]
– Let x and y be two characters in C having the lowest frequencies
– Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit

Page 17:

Proof: part 2

• Lemma:
– Let T be a full binary tree representing an optimal prefix code over an alphabet C
– Let z be the parent of two leaf characters x and y, with f[z] = f[x] + f[y]
– Then T′ = T − {x, y} represents an optimal prefix code for C′ = (C − {x, y}) ∪ {z}

Page 18:

Lengths of Encoding Set

[Diagram: a perfectly balanced tree with eight leaves, labeled 1 through 8, each at depth 3 below the root]

Length of the set is: (8 leaves) × (3 bits each) = 24 bits

This is what you would get if the symbols are mostly random and equal in probability.

Page 19:

Lengths of Encoding Set

[Diagram: a maximally skewed tree, with leaf 8 at depth 1, leaf 7 at depth 2, leaf 6 at depth 3, leaf 5 at depth 4, leaf 4 at depth 5, leaf 3 at depth 6, and leaves 1 and 2 at depth 7]

Length of the set is: 7+7+6+5+4+3+2+1 = 35 bits

This is what you would get if the symbols vary the most in probability.

Page 20:

Expected Value / character

• In example 1 (the balanced tree):

8 × (1/2³ × 3) = 3 bits

• In example 2 (the skewed tree):

2 × (1/2⁷ × 7) + (1/2⁶ × 6) + (1/2⁵ × 5) + (1/2⁴ × 4) + (1/2³ × 3) + (1/2² × 2) + (1/2¹ × 1) ≈ 1.98 bits
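The two expected values can be verified in a couple of lines, weighting each leaf's depth by its probability 1/2^depth:

```python
# Balanced tree: 8 leaves, each with probability 1/8 and a 3-bit code.
balanced = 8 * (1 / 2**3) * 3

# Skewed tree: leaf depths 7, 7, 6, 5, 4, 3, 2, 1; the two deepest
# leaves share depth 7, and each leaf at depth d has probability 1/2**d.
skewed = 2 * (1 / 2**7) * 7 + sum((1 / 2**d) * d for d in range(1, 7))

print(balanced)            # 3.0
print(round(skewed, 2))    # 1.98
```

So when probabilities are highly unequal, the variable-length code averages well under 3 bits per character, which is the whole point of the skewed tree.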

Page 21:

Main Point

• Statistical methods work better when the symbols in the data set have varying probabilities.

• Otherwise you need to use a different method of compression (for example, JPEG)

Page 22:

Image Compression

• “Lossy” – meaning details are lost

• An approximation of original image is made where large areas of similar color are combined into a single block

• This introduces a certain amount of error, which is a tradeoff

Page 23:

Steps to Image Compression

• Specify requested output file size

• Divide image into several areas

• Divide file size by the # of areas

• Quantize each area (information lost here)

• Encode each area separately, write to file
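The "quantize each area" step can be illustrated with a toy version: split a grayscale image into 2×2 blocks and replace every pixel by its block average. This is only a sketch of where information is lost; real formats like JPEG quantize frequency-domain coefficients, not raw pixels.

```python
def quantize_blocks(image, size=2):
    """Replace each size-by-size block of a grayscale image (list of
    lists of ints) with the block's average value: information is lost
    here, because the original pixels cannot be recovered."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for by in range(0, h, size):
        for bx in range(0, w, size):
            block = [image[y][x]
                     for y in range(by, min(by + size, h))
                     for x in range(bx, min(bx + size, w))]
            avg = sum(block) // len(block)
            for y in range(by, min(by + size, h)):
                for x in range(bx, min(bx + size, w)):
                    out[y][x] = avg
    return out

img = [[10, 12, 200, 202],
       [11, 13, 201, 203],
       [90, 92, 50, 52],
       [91, 93, 51, 53]]
print(quantize_blocks(img))
```

Large areas of similar color collapse into single repeated values, which then compress very well, at the cost of a certain amount of error.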

Page 24:

Image Decomposition

Page 25:

References

• Mark Allen Weiss, Data Structures & Algorithm Analysis

• Thomas H. Cormen et al., Introduction to Algorithms