Top Banner
CPS 100, Fall 2011 12.1 PF-AT-EOS Data compression Seriously important, why? Priority Queue Seriously important, why? Assignment overview Huff _______ Bits, Bytes, Atoms What's an int? ASCII? double? CPS 100, Fall 2011 12.2 PQ Application: Data Compression Compression is a high-profile application .zip, .mp3, .jpg, .gif, .gz, … What property of MP3 was a significant factor in what made Napster work (why did Napster ultimately fail?) Who invented Napster, how old, when? Why do we care? Secondary storage capacity doubles every year Disk space fills up quickly on every computer system More data to compress than ever before Will we ever need to stop worrying about storage? CPS 100, Fall 2011 12.3 More on Compression Different compression techniques .mp3 files and .zip files? .gif and .jpg? Lossless and lossy Impossible to compress/lossless everything: Why? Lossy methods Good for pictures, video, and audio (JPEG, MPEG, etc.) Lossless methods Run-length encoding, Huffman, LZW, … 11 3 5 3 2 6 2 6 5 3 5 3 5 3 10 CPS 100, Fall 2011 12.4 Huffman Coding Understand Huffman Coding Data compression Priority Queue Bits and Bytes Greedy Algorithm Many compression algorithms Huffman is optimal, per-character compression Still used, e.g., basis of Burrows-Wheeler Other compression 'better', sometimes slower? LZW, GZIP, BW, …
10

PF-AT-EOS PQ Application: Data Compression

Feb 12, 2022

Download

Documents

dariahiddleston
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.1

PF-AT-EOS   Data compression

  Seriously important, why?

  Priority Queue   Seriously important, why?

  Assignment overview   Huff _______

  Bits, Bytes, Atoms   What's an int? ASCII? double?

CPS 100, Fall 2011 12.2

PQ Application: Data Compression   Compression is a high-profile application

  .zip, .mp3, .jpg, .gif, .gz, …   What property of MP3 was a significant factor in what

made Napster work (why did Napster ultimately fail?)   Who invented Napster, how old, when?

  Why do we care?   Secondary storage capacity doubles every year   Disk space fills up quickly on every computer system   More data to compress than ever before   Will we ever need to stop worrying about storage?

CPS 100, Fall 2011 12.3

More on Compression   Different compression techniques

  .mp3 files and .zip files?   .gif and .jpg?   Lossless and lossy

  Impossible to compress/lossless everything: Why?   Lossy methods

  Good for pictures, video, and audio (JPEG, MPEG, etc.)

  Lossless methods   Run-length encoding, Huffman, LZW, …

11 3 5 3 2 6 2 6 5 3 5 3 5 3 10

CPS 100, Fall 2011 12.4

Huffman Coding   Understand Huffman Coding

  Data compression   Priority Queue   Bits and Bytes   Greedy Algorithm

  Many compression algorithms   Huffman is optimal, per-character compression   Still used, e.g., basis of Burrows-Wheeler   Other compression 'better', sometimes slower?   LZW, GZIP, BW, …

Page 2: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.5

Compression and Coding   What gets compressed?

  Save on storage, why is this a good idea?   Save on data transmission, how and why?

  What is information, how is it compressible?   Exploit redundancy, without that, hard to

compress   Represent information: code (Morse cf. Huffman)

  Dots and dashes or 0's and 1's   How to construct code?

CPS 100, Fall 2011 12.6

Coding/Compression/Concepts   For ASCII we use 8 bits, for Unicode 16 bits

  Minimum number of bits to represent N values?   Representation of genomic data (a, c ,g, t)?   What about noisy genomic data?

  We can use a variable-length encoding, e.g., Huffman   How do we decide on lengths? How do we decode?   Values for Morse code encodings, why?   … - - - …

CPS 100, Fall 2011 12.7

Huffman Coding   D.A Huffman in early 1950’s: story of invention

  Analyze and process data before compression   Not developed to compress data “on-the-fly”

  Represent data using variable length codes   Each letter/chunk assigned a codeword/bitstring   Codeword for letter/chunk is produced by traversing

the Huffman tree   Property: No codeword produced is the prefix of

another   Frequent letters/chunk have short encoding, while

those that appear rarely have longer ones   Huffman coding is optimal per-character coding method

CPS 100, Fall 2011 12.8

Mary Shaw   Software engineering and

software architecture   Tools for constructing

large software systems   Development is a small

piece of total cost, maintenance is larger, depends on well-designed and developed techniques

  Interested in computer science, programming, curricula, and canoeing, health-care costs

  ACM Fellow, Alan Perlis Professor of Compsci at CMU

Page 3: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.9

Huffman coding: go go gophers   choose two smallest weights

  combine nodes + weights   Repeat   Priority queue?

  Encoding uses tree:   0 left/1 right   How many bits?

ASCII 3 bits g 103 1100111 000 ?? o 111 1101111 001 ?? p 112 1110000 010 h 104 1101000 011 e 101 1100101 100 r 114 1110010 101 s 115 1110011 110 sp. 32 1000000 111

g o e r s * 3 3

h 1 2 1 1 1

2

p 1

h 1

2

e 1

r 1

3

s 1

* 2

2

p 1

h 1

2

e 1

r 1

4

g

3

o

3

6

1

p

CPS 100, Fall 2011 12.10

Huffman coding: go go gophers   Encoding uses tree/trie:

  0 left/1 right   How many bits? 37!!   Savings? Worth it?

ASCII 3 bits g 103 1100111 000 00 o 111 1101111 001 01 p 112 1110000 010 1100 h 104 1101000 011 1101 e 101 1100101 100 1110 r 114 1110010 101 1111 s 115 1110011 110 100 sp. 32 1000000 111 101

3

s 1

* 2

2

p 1

h 1

2

e 1

r 1

4 g

3

o

3

6

3

2

p 1

h 1

2

e 1

r 1

4

s 1

* 2

7

g 3

o 3

6

13

CPS 100, Fall 2011 12.11

Building a Huffman tree   Begin with a forest of single-node trees/tries (leaves)

  Each node/tree/leaf is weighted with character count   Node stores two values: character and count

  Repeat until there is only one node left: root of tree   Remove two minimally weighted trees from forest   Create new tree/internal node with minimal trees as

children, • Weight is sum of children’s weight (no char)

  How does process terminate? Finding minimum?   Remove minimal trees, hummm……

CPS 100, Fall 2011 12.12

How do we create Huffman Tree/Trie?   Insert weighted values into priority queue

  What are initial weights? Why?  

  Remove minimal nodes, weight by sums, re-insert   Total number of nodes?

PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>(); for(int k=0; k < freq.length; k++){ pq.add(new TreeNode(k,freq[k],null,null)); } while (pq.size() > 1){ TreeNode left = pq.remove(); TreeNode right = pq.remove(); pq.add(new TreeNode(0,left.weight+right.weight, left,right)); } TreeNode root = pq.remove();

Page 4: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.13

Creating compressed file   Once we have new encodings, read every character

  Write encoding, not the character, to compressed file   Why does this save bits?   What other information needed in compressed file?

  How do we uncompress?   How do we know foo.hf represents compressed file?   Is suffix sufficient? Alternatives?

  Why is Huffman coding a two-pass method?   Alternatives?

CPS 100, Fall 2011 12.14

Uncompression with Huffman

  We need the trie to uncompress   000100100010011001101111

  As we read a bit, what do we do?   Go left on 0, go right on 1   When do we stop? What to do?

  How do we get the trie?   How did we get it originally? Store 256 int/counts

• How do we read counts?   How do we store a trie? 20 Questions relevance?

• Reading a trie? Leaf indicator? Node values?

CPS 100, Fall 2011 12.15

Other Huffman Issues   What do we need to decode?

  How did we encode? How will we decode?   What information needed for decoding?

  Reading and writing bits: chunks and stopping   Can you write 3 bits? Why not? Why?   PSEUDO_EOF   BitInputStream and BitOutputStream: API

  What should happen when the file won’t compress?   Silently compress bigger? Warn user? Alternatives?

CPS 100, Fall 2011 12.16

Huffman Complexities   How do we measure? Size of input file, size of alphabet

  Which is typically bigger?

  Accumulating character counts: ______   How can we do this in O(1) time, though not really

  Building the heap/priority queue from counts ____   Initializing heap guaranteed

  Building Huffman tree ____   Why?

  Create table of encodings from tree ____   Why?

  Write tree and compressed file _____

Page 5: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.17

Good Compsci 100 Assignment?   Array of character/chunk counts, or is this a map?

  Map character/chunk to count, why array?   Priority Queue for generating tree/trie

  Do we need a heap implementation? Why?   Tree traversals for code generation, uncompression

  One recursive, one not, why and which?   Deal with bits and chunks rather than ints and chars

  The good, the bad, the ugly   Create a working compression program

  How would we deploy it? Make it better?   Benchmark for analysis

  What’s a corpus? CPS 100, Fall 2011 12.18

Anita Borg 1949-2003   “Dr. Anita Borg tenaciously

envisioned and set about to change the world for women and for technology. … she fought tirelessly for the development technology with positive social and human impact.”

  “Anita Borg sought to revolutionize the world and the way we think about technology and its impact on our lives.”

  http://www.youtube.com/watch?v=1yPxd5jqz_Q

CPS 100, Fall 2011 12.19

YAQ, YAQ, haha! (Yet Another Queue)   What is the dequeue policy for a Queue?

  Why do we implement Queue with LinkedList •  Interface and class in java.util

  Can we remove an element other than first?

  How does queue help word-ladder/shortest path?   First item enqueued/added is the one we want   What if different element is “best”?

  PriorityQueue has a different dequeue policy   Best item is dequeued, queue manages itself to ensure

operations are efficient

CPS 100, Fall 2011 12.20

PriorityQueue raison d’être   Algorithms Using PQ for efficiency

  Shortest Path: Google Maps/Garmin to Internet Routing •  How is this like word-ladder? How different?

  Event based simulation •  Coping with explosion in number of particles or things

  Optimal A* search, game-playing, AI, •  Can't explore entire search space, can estimate good move

  Data compression facilitated by priority queue   Alltime best assignment in a Compsci 100 course?

•  Subject to debate, of course

  From A-Z, soup-to-nuts, bits to abstractions

Page 6: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.21

Priority Queue   Compression motivates ADT priority queue

  Supports two basic operations • add/insert -– an element into the priority queue • remove/delete – the minimal element from the

priority queue   Implementations allow getmin/peek as well as delete

• Analogous to top/pop, peek/dequeue in stacks, queues

  Think about implementing the ADT, choices?   Add compared to min/remove   Balanced search tree is ok, but can we do better?

CPS 100, Fall 2011 12.22

Priority Queue sorting   See PQDemo.java,

  code below sorts, complexity? String[] array = {...}; // array filled with data PriorityQueue<String> pq = new PriorityQueue<String>(); for(String s : array) pq.add(s); for(int k=0; k < array.length; k++){ array[k] = pq.remove(); }   Bottlenecks, operations in code above

  Add words one-at-a-time to PQ v. all-at-once   What if PQ is an array, add or remove fast/slow?   We’d like PQ to have tree characteristics, why?

CPS 100, Fall 2011 12.23

Priority Queue top-M sorting   What if we have lots and lots and lots of data

  code below sorts top-M elements, complexity? Scanner s = … // initialize; PriorityQueue<String> pq = new PriorityQueue<String>(); while (s.hasNext()) { pq.add(s.next()); if (pq.size() > M) pq.remove(); }   What’s advantageous about this code?

  Store everything and sort everything?   Store everything, sort first M?   What is complexity of sort: O(n log n)

CPS 100, Fall 2011 12.24

Priority Queue implementations   Priority queues: average and worst case

Insert average

Getmin (delete)

Insert worst

Getmin (delete)

Unsorted list

Sorted list

Search tree

Balanced tree

Heap

log n log n O(n) O(n)

O(1) log n log n log n

log n log n log n log n

O(n) O(1) O(n) O(1)

O(n) O(1) O(n) O(1)

  Heap has O(n) build heap from n elements

Page 7: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.25

PriorityQueue.java [java.util]   What about objects inserted into pq?

  Comparable, e.g., essentially sortable   How can we change what minimal means?   Implementation uses heap, tree stored in an array

  Use a Comparator for comparing entries we can make a min-heap act like a max-heap, see PQDemo   Where is class Comparator declaration? How used?   What if we didn't know about Collections.reverseOrder?

•  How do we make this ourselves?

CPS 100, Fall 2011 12.26

Big-Oh and a tighter look at inserts   log(1) + log(2) + log(3) + … + log(n)

  Property of logs, log(a) + log(b) = log(a*b)   log(1*2*3*…*n) = log(n!)

  We can show using Sterling’s formula:

  log(n!) = c1*log(n) + nlog(n) – c2*n   We can get O(n log n) easily, this is a tight bound,

lower, Ω(n log n) as well

CPS 100, Fall 2011 12.27

Priority Queue implementation   Heap data structure is fast and reasonably simple

  Why not use inheritance hierarchy as was used with Map?   Trade-offs when using HashMap and TreeMap:

•  Time, space, ordering properties, TreeMap support?

  Changing comparison when calculating priority?   Create object to replace, or in lieu of compareTo

•  Comparable interface compares this to passed object •  Comparator interface compares two passed objects

  Both comparison methods: compareTo() and compare() •  Compare two objects (parameters or self and parameter) •  Returns –1, 0, +1 depending on <, ==, >

CPS 100, Fall 2011 12.28

Creating Heaps   Heap: array-based implementation of binary tree

used for implementing priority queues:   add/insert, peek/getmin, remove/deletemin, O(???)

  Array minimizes storage (no explicit pointers), faster too, contiguous (cache) and indexing

  Heap has shape property and heap/value property   shape: tree filled at all levels (except perhaps last) and

filled left-to-right (complete binary tree)   each node has value smaller than both children

Page 8: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.29

Array-based heap   store “node values” in array

beginning at index 1   for node with index k

  left child: index 2*k   right child: index 2*k+1

  why is this conducive for

maintaining heap shape?   what about heap property?   is the heap a search tree?   where is minimal node?   where are nodes added?

deleted?

0 1 2 3 4 5 6 7 8 9 10

6 10 7 17 13 25 9 21 19

6

10 7

17 13 9 21

19 25

CPS 100, Fall 2011 12.30

Thinking about heaps   Where is minimal element?

  Root, why?   Where is maximal element?

  Leaves, why?   How many leaves are there in

an N-node heap (big-Oh)?   O(n), but exact?

  What is complexity of find max in a minheap? Why?   O(n), but ½ N?

  Where is second smallest element? Why?   Near root?

6

10 7

17 13 9 21

19 25

0 1 2 3 4 5 6 7 8 9 10

6 10 7 17 13 25 9 21 19

CPS 100, Fall 2011 12.31

Adding values to heap   to maintain heap shape, must

add new value in left-to-right order of last level   could violate heap property   move value “up” if too

small

  change places with parent if heap property violated   stop when parent is smaller   stop when root is reached

  pull parent down, swapping isn’t necessary (optimization)

13

6 10 7

17 9 21 19 25 8

13

6 10 7

17 9 21 19 25

6 10 7

17 9 21 19 25 13

8

insert 8

bubble 8 up

6 7

17 9 21 19 25

8

13

10

CPS 100, Fall 2011 12.32

Adding values, details (pseudocode) void add(Object elt) { // add elt to heap in myList myList.add(elt); int loc = myList.size()-1; while (1 < loc && elt < myList.get(loc/2)){ myList.set(loc,myList.get(loc/2)); loc = loc/2; // go to parent } // what’s true here? myList.set(loc,elt); }

13

6 10 7

17 9 21 19 25

8

13

6 10 7

17 9 21 19 25

0 1 2 3 4 5 6 7 8 9 10

6 10 7 17 13 25 9 21 19

array myList

Page 9: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.33

Removing minimal element   Where is minimal element?

  If we remove it, what changes, shape/property?

  How can we maintain shape?   “last” element moves to root   What property is violated?

  After moving last element, subtrees of root are heaps, why?   Move root down (pull child

up) does it matter where?   When can we stop “re-heaping”?

  Less than both children   Reach a leaf

13

6 10 7

17 9 21 19 25

13

25 10 7

17 9 21 19

13

7 10 25

17 9 21 19

13

7 10 9

17 25 21 19

CPS 100, Fall 2011 12.34

Views of programming   Writing code from the method/function view is

pretty similar across languages   Organizing methods is different, organizing code is

different, not all languages have classes,   Loops, arrays, arithmetic, …

  Program using abstractions and high level concepts   Do we need to understand 32-bit twos-complement storage

to understand x =x+1?   Do we need to understand how arrays map to contiguous

memory to use ArrayLists?

CPS 100, Fall 2011 12.35

Bottom up meets Top down   We teach programming by teaching abstractions

  What's an array? What’s an ArrayList?   What's an array in C?   What's a map, Hashmap? Treemap?

  How are programs stored and executed in the JVM?   Differences on different architectures?   Similarities between Java and C++ and Python and PHP?

  What about how the CPU works? How memory works? Shouldn't we study this too?

CPS 100, Fall 2011 12.36

From bit to byte to char to int to long   Ultimately everything is stored as either a 0 or 1

  Bit is binary digit a byte is a binary term (8 bits)   We should be grateful we can deal with Strings rather than

sequences of 0's and 1's.   We should be grateful we can deal with an int rather than

the 32 bits that comprise an int

  If we have 255 values for R, G, B, how can we pack this into an int?   Why should we care, can’t we use one int per color?   How do we do the packing and unpacking?

Page 10: PF-AT-EOS PQ Application: Data Compression

CPS 100, Fall 2011 12.37

More information on bit, int, long   int values are stored as two's complement numbers

with 32 bits, for 64 bits use the type long, a char is 16 bits   Standard in Java, different in C/C++   Facilitates addition/subtraction for int values   We don't need to worry about this, except to note:

•  Infinity + 1 = - Infinity (see Integer.MAX_VALUE) •  Math.abs(-Infinity) > Infinity

  Java byte, int, long are signed values, char unsigned   What are values for 16-bit char? 8-bit byte?   Why will this matter in Burrows Wheeler?

CPS 100, Fall 2011 12.38

More details about bits   How is 13 represented?

  … _0_ _0_ _1_ _1_ _0_ _1_ 24 23 22 21 20

  Total is 8+4+1 = 13

  What is bit representation of 32? Of 15? Of 1023?   What is bit-representation of 2n - 1?   What is bit-representation of 0? Of -1?

•  Study later, but -1 is all 1’s, left-most bit determines < 0

  Java signed byte: -128..127, # bits?   What if we only want 0-255? (Huff, pixels, …)   Convert negative values or use char, trade-offs?

  Java char unsigned: 0..65,536 # bits?   Why is char unsigned? Why not as in C++/C?

CPS 100, Fall 2011 12.39

How are data stored?   To facilitate Huffman coding we need to read/write

one bit   Why do we need to read one bit?   Why do we need to write one bit?   When do we read 8 bits at a time? 32 bits?

  We can't actually write one bit-at-a-time. We can't really write one char at a time either.   Output and input are buffered,minimize memory accesses

and disk accesses   Why do we care about this when we talk about data

structures and algorithms? •  Where does data come from?

CPS 100, Fall 2011 12.40

How do we buffer char output?   Done for us as part of InputStream and Reader

classes   InputStreams are for reading bytes   Readers are for reading char values   Why do we have both and how do they interact? Reader r = new InputStreamReader(System.in);   Do we need to flush our buffers?

  In the past Java IO has been notoriously slow   Do we care about I? About O?   This is changing, and the java.nio classes help

•  Map a file to a region in memory in one operation