Data Compression with Finite Windows

Fiala and Greene

Speaker: Giora Alexandron

Overview:-----------------------

Our main purpose:

See how a suffix tree supports a compression algorithm.

What we will see:

A data compression method that works by substituting text. It uses a modification of the basic suffix tree to support cyclic maintenance of the most recent strings seen in the file.

Outline------------------------

1. Compression:
   - In general
   - Our algorithm

2. Data Structure:
   - Modification of the suffix tree.

3. Theoretical Considerations:
   - Proofs.

4. Improvements.

Compression-------------------------------

What is Compression:

Compression is the coding of data to minimize its representation. We will focus on lossless, adaptive, one-pass methods.

Main approaches:

- Statistical approach: try to predict the next symbol.

- Substitutional approach: replace blocks of text with references to earlier occurrences of identical text.

**We will focus on a substitutional method**

Compression-cont.------------------------------

What characterizes a good compressor:

- Good compression ratio.

- Runs fast in compression.

- Uses a minimum of space.

- Runs fast in expansion.

There are trade-offs between all of those. Naturally, we want to achieve them all:

a good algorithm + a matching data structure.

Substitutional Compressing---------------------------------------

Consider the following basic scheme:

The compressed file contains two types of codewords:

literal x    pass the next x characters directly to the output.

copy x, y    go back y characters and copy x characters, starting at that position.

Example------------------------------------------------

..it was the best of times, it was the worst of times..

Would compress to:

(literal 26)it was the best of times, (copy 11 -26)(literal 3)wor(copy 11 -27)

The first copy goes back 26 characters and takes the 11 characters "it was the "; the second goes back 27 characters and takes the 11 characters "st of times".

Example-cont.------------------------------------------------

And we get a very simple lossless method:

The compression achieved depends on the size of the copy and literal codewords.

[Figure: "..it was the best of times, it was the worst of times." goes through Compression to become (literal 26)it was the best of times, (copy 11 -26)(literal 3)wor(copy 11 -27), and through Expansion to become the original text again.]
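Expansion is the easy direction, and a few lines are enough to sketch it. The tuple representation of codewords and the name `expand` are assumptions made for this illustration only; they are not part of the talk.

```python
def expand(tokens):
    """Expand a list of copy/literal codewords back into text."""
    out = []
    for tok in tokens:
        if tok[0] == "literal":
            # ("literal", s): pass the characters of s straight to the output.
            out.extend(tok[1])
        else:
            # ("copy", x, y): go back y characters and copy x characters.
            _, length, back = tok
            start = len(out) - back
            for k in range(length):
                # Copy one character at a time so overlapping copies also work.
                out.append(out[start + k])
    return "".join(out)


tokens = [
    ("literal", "it was the best of times, "),
    ("copy", 11, 26),
    ("literal", "wor"),
    ("copy", 11, 27),
]
text = expand(tokens)
```

The expander needs no search structure at all, only the text it has already produced.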

A1------------------------------------------------------

The encoding of A1:

- 8 bits for a literal codeword (bits 0..7): 0000xxxx, where xxxx gives the literal length [1..16].

- 16 bits for a copy codeword (bits 0..15): xxxxyy..yy, where xxxx gives the copy length [2..16] and the 12 bits yy..yy give the displacement [1..4096].

(Can you figure out the logic behind this?)

A1-cont.------------------------------------------------------

And we get a compression of 51 to 36 bytes:

(literal 16)it was the best (literal 10)of times, (copy 11 -26)(literal 3)wor(copy 11 -27)

That is 17 + 11 + 2 + 4 + 2 = 36 bytes: each literal costs one codeword byte plus its characters, and each copy costs two bytes.
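The bit layout can be made concrete with a small packing sketch. The slides give only the field widths and ranges; storing each field as its value minus one, the big-endian byte order, and the names `pack_literal`, `pack_copy` and `unpack` are assumptions made for this illustration.

```python
def pack_literal(chars):
    """One byte 0000xxxx (xxxx = length - 1, an assumption), then the characters."""
    n = len(chars)
    if not 1 <= n <= 16:
        raise ValueError("literal length must be in 1..16")
    return bytes([n - 1]) + chars


def pack_copy(length, displacement):
    """Two bytes: 4 bits of length - 1 (never zero), 12 bits of displacement - 1."""
    if not 2 <= length <= 16 or not 1 <= displacement <= 4096:
        raise ValueError("copy out of range")
    word = ((length - 1) << 12) | (displacement - 1)
    return word.to_bytes(2, "big")


def unpack(data):
    """Split a byte stream back into codewords."""
    tokens, i = [], 0
    while i < len(data):
        high = data[i] >> 4
        if high == 0:
            # Top four bits are zero: a literal codeword.
            n = (data[i] & 0x0F) + 1
            tokens.append(("literal", data[i + 1:i + 1 + n]))
            i += 1 + n
        else:
            # Top four bits are non-zero: a copy codeword.
            word = int.from_bytes(data[i:i + 2], "big")
            tokens.append(("copy", high + 1, (word & 0x0FFF) + 1))
            i += 2
    return tokens


stream = (pack_literal(b"it was the best ") + pack_literal(b"of times, ")
          + pack_copy(11, 26) + pack_literal(b"wor") + pack_copy(11, 27))
```

Under these assumptions the first four bits alone tell the expander which kind of codeword follows, which is one way to read the "0000" prefix of the literal.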

A1’s policy----------------------------

If the compressor is idle (it has just finished a codeword):

look for a copy of length >= 2;

otherwise, start a literal.

If the compressor is in the middle of a literal:

extend it until a copy of length >= 3 is found (or the literal reaches its maximum length of 16).
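The policy can be tried out with a deliberately slow sketch that finds the longest match by brute force instead of a suffix tree. The function names and the tuple form of the codewords are assumptions made for illustration, and closing a literal when it reaches 16 characters follows from the literal length range rather than from an explicit statement in the policy.

```python
MAX_LEN = 16    # longest copy or literal
WINDOW = 4096   # largest displacement


def longest_match(data, pos):
    """Brute-force search of the window; returns (length, displacement)."""
    best_len, best_disp = 0, 0
    for start in range(max(0, pos - WINDOW), pos):
        k = 0
        while (k < MAX_LEN and pos + k < len(data)
               and data[start + k] == data[pos + k]):
            k += 1
        if k > best_len:
            best_len, best_disp = k, pos - start
    return best_len, best_disp


def a1_tokens(data):
    """Apply the idle / in-literal policy and return the list of codewords."""
    tokens, literal, pos = [], "", 0
    while pos < len(data):
        length, disp = longest_match(data, pos)
        needed = 3 if literal else 2   # in a literal we ask for a longer copy
        if length >= needed:
            if literal:
                tokens.append(("literal", literal))
                literal = ""
            tokens.append(("copy", length, disp))
            pos += length
        else:
            literal += data[pos]
            pos += 1
            if len(literal) == MAX_LEN:   # literal is full: close it, go idle
                tokens.append(("literal", literal))
                literal = ""
    if literal:
        tokens.append(("literal", literal))
    return tokens


def a1_size(tokens):
    """Bytes used: 1 + length for a literal, 2 for a copy."""
    return sum(1 + len(t[1]) if t[0] == "literal" else 2 for t in tokens)


tokens = a1_tokens("it was the best of times, it was the worst of times")
```

On the Dickens sentence this reproduces the two literals, the copy, the short literal "wor" and the final copy, for 36 bytes in total.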


Where do we stand?------------------------

1. Compression (in general, our algorithm): done.

2. Data structure: we are here.

The Data Structure-----------------------------------------

What do we need?

Find the current longest match (for copy).

What could we use?

Naive solution---------------------------------

A suffix tree with all strings of length <= 16 in the previous 4096-byte window.

[Figure: the 4096 bytes before the current position, with strings of length 16 marked inside the window.]
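A minimal sketch of the naive structure, assuming a plain character trie in which every node remembers the start of the most recent string that passed through it. Deletion of strings that leave the window is left out, and the names `TrieNode`, `insert` and `longest_match` are hypothetical.

```python
MAX_LEN = 16  # longest string kept in the tree


class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.pos = -1        # start of the most recent string through this node


def insert(root, data, start):
    """Insert data[start:start+MAX_LEN]; one step per level descended."""
    node = root
    for k in range(start, min(start + MAX_LEN, len(data))):
        node = node.children.setdefault(data[k], TrieNode())
        node.pos = start


def longest_match(root, data, start):
    """Return (length, earlier position) of the longest stored prefix of data[start:]."""
    node, length = root, 0
    for k in range(start, min(start + MAX_LEN, len(data))):
        child = node.children.get(data[k])
        if child is None:
            break
        node, length = child, length + 1
    return length, node.pos


data = "it was the best of times, it was the worst of times"
root = TrieNode()
for j in range(26):
    insert(root, data, j)
match = longest_match(root, data, 26)   # the second "it was the "
```

Every insertion starts again at the root and walks down up to 16 levels, which is exactly the cost the following slides set out to remove.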

The cost------------------------------------------

If we descended d levels to insert the string starting at position j, we will descend at least d-1 levels to insert the string starting at j+1.

So the cost is O(nd) for insertion.

But we want to eliminate d.

[Figure: inside the 4096-byte window, the string starting at j reaches depth d and the string starting at j+1 reaches depth at least d-1.]

Modifications------------------------------------

a. Suffix links:

Each node that represents a string aX has a pointer to the node that represents the string X.

Immediate advantage:

We don't need to return to the root after each insertion.

[Figure: node aX with a suffix link to node X; an arc labeled Y hangs below each of them.]

Suffix Links------------------------------------

How we use and create suffix links when the text reads ..aXYb..:

1. Create a new node at the end of aXY, and insert b below it.

2. Use the suffix link to insert XYb:

   a.1 Go up to node aX and cross to node X using the suffix link.

   a.2 Rescan Y down from node X to node XY (which does not necessarily exist). If it doesn't exist, create it! Rescan means we don't need to check the string again, but can go straight to the end of Y.

   a.3 Scan from node XY to insert XYb.

3. Add the new node's suffix link, and we are finished with the insertion!

[Figure: nodes aX and X joined by a suffix link, an arc Y below each, the new leaf b below aXY, and arrows marking the rescan and scan steps.]

Invariant kept: every internal node has a suffix link (except the one just created).
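The rescan step is the part that saves work, so it is worth a small sketch. It assumes a tree whose arcs carry their label as a string and whose children are indexed by the first character of the arc; the class and function names are hypothetical.

```python
class Node:
    def __init__(self, parent=None, label=""):
        self.parent = parent
        self.label = label      # characters on the arc from parent to this node
        self.children = {}      # first character of an arc -> child node
        if parent is not None:
            parent.children[label[0]] = self


def rescan(node, text, start, length):
    """Follow text[start:start+length] down from node without comparing characters.

    The string is known to be in the tree, so only the first character of each
    arc is used to pick the child, and whole arcs are skipped by their length.
    Returns (node, child, offset): the string ends exactly at node when child is
    None, otherwise offset characters down the arc that leads to child.
    """
    while length > 0:
        child = node.children[text[start]]
        arc = len(child.label)
        if arc > length:
            return node, child, length   # ends inside this arc
        node, start, length = child, start + arc, length - arc
    return node, None, 0


root = Node()
ab = Node(root, "ab")
abcd = Node(ab, "cd")
```

When the string ends inside an arc, that is the point where the missing node would be created before the scan continues.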

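Demands from the data structure:

[Figure: a 4096-byte window slides over the text ……gffghk……; the match is found inside the window, a string is inserted at one end and a string is deleted at the other.]

We explained insertion.

What about deletion?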

Modifications-cont.------------------------------------

Deletion:

b. Leaves in a circular buffer: identify the oldest leaf and delete it.

c. 'Son count': when it falls to one, delete the node and combine the arcs.

[Figure: the tree with its leaves held in a circular buffer with slots 1..4096; the node above the leaves has son count = 3.]

Is it enough?------------------------------------

NO.

We still have a problem.

Position pointers in higher nodes can become out of date.

But climbing up to update those pointers would take away the advantages of using the suffix links!

Modifications-Last------------------------------------

d. Percolating updates:

Each internal node has an update bit (true/false).

Percolating updates------------------------------------

d. Percolating updates:

When updating a node:

- If its bit is true: 1. set the bit to false; 2. propagate the update to the parent.

- If its bit is false: 1. set the bit to true; 2. stop the update.

Percolating updates-cont.-------------------------------------------

Effect:

Keeps every internal node's position pointer within the current 4096-byte window of the file.

Cost:

- Worst case: an update propagates all the way to the root.

- Amortized: summing over all new leaves, we get constant cost per leaf.
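The update rule is short enough to write out. This is a sketch under two assumptions: the position field is refreshed on every node the update visits, and the function returns the number of nodes touched so the cost can be observed; `Node` and `percolate` are hypothetical names.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.pos = 0          # a text position inside the window for this node
        self.update = False   # the update bit


def percolate(node, pos):
    """Percolate an update upward from node; return how many nodes were touched."""
    touched = 0
    while node is not None:
        node.pos = pos
        touched += 1
        if node.update:
            node.update = False     # bit was true: clear it and go on to the parent
            node = node.parent
        else:
            node.update = True      # bit was false: set it and stop
            break
    return touched


# A chain of ten internal nodes; updates always enter at the deepest one.
root = Node()
deepest = root
for _ in range(9):
    deepest = Node(deepest)

first = percolate(deepest, 1)    # bit false -> true, stops at once
second = percolate(deepest, 2)   # bit true -> false, parent is updated too
```

The bits behave like the digits of a binary counter: every second update at a node is passed on to its parent.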

Summary of the inner loop---------------------------------------------------------

The operations:

1. Insert:
   a. Insert the previous string.
   b. Use the suffix link to insert the next string.

2. Percolate an update from the leaf:
   - If the bit is true: set the position field of the node to the current position, set the bit to false, and propagate to the parent.
   - If the bit is false: set it to true, and stop.

Summary-cont.---------------------------------------------------------

3. Circular buffer:
   a. Replace the oldest leaf with the new one.
   b. If its parent has only one remaining son:
      1. Delete the parent, and attach the remaining son to the grandparent.
      2. Percolate the deleted node's position (*special case: comparative percolation).
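Step 3 can be sketched on a small tree. The sketch keeps only the circular buffer and the son-count rule; suffix links and the percolation of the deleted node's position are left out, and all names are hypothetical.

```python
class Node:
    def __init__(self, parent=None, label=""):
        self.parent = parent
        self.label = label      # characters on the arc from parent to this node
        self.children = {}      # first character of an arc -> child node
        if parent is not None:
            parent.children[label[0]] = self


def delete_leaf(leaf):
    """Remove a leaf; splice out its parent if the son count falls to one."""
    parent = leaf.parent
    del parent.children[leaf.label[0]]
    if len(parent.children) == 1 and parent.parent is not None:
        (only,) = parent.children.values()
        only.label = parent.label + only.label         # combine the two arcs
        only.parent = parent.parent
        parent.parent.children[only.label[0]] = only   # takes the parent's place


def replace_oldest(buffer, pos, new_leaf):
    """Store new_leaf in the circular buffer, deleting the leaf it overwrites."""
    slot = pos % len(buffer)
    if buffer[slot] is not None:
        delete_leaf(buffer[slot])
    buffer[slot] = new_leaf


# A window of two leaves instead of 4096, to keep the example small.
root = Node()
inner = Node(root, "ab")
oldest = Node(inner, "c")
keep = Node(inner, "d")
buffer = [None, None]
replace_oldest(buffer, 0, oldest)
replace_oldest(buffer, 1, keep)
newest = Node(root, "z")
replace_oldest(buffer, 2, newest)   # slot 0 again: the oldest leaf is deleted
```

After the last call the internal node has only one son left, so it disappears and its arc "ab" is merged with the surviving arc "d".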


Where do we stand?------------------------

1. Compression: done.

2. Data structure: done.

3. Theoretical considerations: we are here.

Theoretical Considerations----------------------------------------------------

Correctness and linearity of suffix tree construction-

we already saw that.

We need to be convinced about destruction:

Theorem 1:

Deleting leaves in FIFO order and deleting internal nodes with single sons will never leave dangling suffix pointers.

Proof:

Assume the contrary: the suffix link of some node points to a node that was deleted.

The existence of the first node means that two strings in the window agree for l characters and differ at l+1:

……df..gb…df..gz..

Dropping the first character of each, we get two strings that agree for l-1 characters and differ at l. Both start later in the text than the original two, so FIFO deletion has not removed them yet, and the node the suffix link points to still has at least two sons.

This contradicts the assumption that it had only one son and was therefore deleted.

Theoretical Considerations-----------------------------------------------------

Theorem 2:

Each percolated update has constant amortized cost.

Proof:

Assume a 'credit' on each internal node whose 'update' flag is true.

A new leaf is added with two 'credits':

- One is spent to update the parent.

- The second is given to the parent, and we terminate (the parent's flag was false);

- or, together with the credit already on the parent, we obtain two credits on the parent and continue (the parent's flag was true). Apply recursively on the parent.

Result:

The invariant is kept, and we get an amortized cost of two updates per new leaf.
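The credit argument can also be written with a potential function. This is a restatement for illustration, not taken from the slides; the symbols $\Phi$, $k_i$ and $\hat{c}_i$ are introduced here, and the tree is assumed to start with all update bits false.

```latex
\[
\Phi = \#\{\text{internal nodes whose update bit is true}\}, \qquad \Phi_0 = 0 .
\]
A percolation that touches $k$ nodes clears at least $k-1$ bits and sets at most one, so
\[
\Delta\Phi \le 1 - (k-1) = 2 - k,
\qquad
\hat{c} = k + \Delta\Phi \le 2 .
\]
Summing over $n$ new leaves,
\[
\sum_{i=1}^{n} k_i \;=\; \sum_{i=1}^{n} \hat{c}_i - (\Phi_n - \Phi_0) \;\le\; 2n .
\]
```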

Theoretical Considerations-----------------------------------------------------

Theorem 3 (effectiveness):

Using the percolating update, every internal node will be updated at least once in a period (4096).

Proof:

We will prove that every internal node is updated at least twice in a period, and thus propagates at least one update up.

(By contradiction.) Find the farthest node from the root that doesn't propagate an update to its parent.

3 cases:

a. It has two (or more) remaining* children: both are farther from the root, so each of them updated it.

b. It has only one remaining child: one update comes from it, the second from a new child when it was created (a new arc causes the son to update the parent).

c. It has two new children: similar.

In all cases, the node receives two updates during a period, and thus propagates an update. Contradiction.

* A remaining child is one that has remained for the entire period.


Other Theoretical Considerations (bounds on the compression)-----------------------------------------------------------

We have focused on the data structure.

There are other questions, about the compression.

But those are for another time (and another course)!

We will only mention them briefly:

Consider the following:

Position:                j    j+1   j+2   j+3   j+4   j+5
Copy length available:   1    3     16    15    14    13

The encoder is at position j.

A1:       (literal 1)x (copy 3 y) (copy 14 y)    6 bytes
Optimal:  (literal 2)xx (copy 16 y)              5 bytes

How bad can it get?
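The two byte counts follow directly from A1's codeword sizes, as a short calculation shows. Codewords are written here as hypothetical (kind, length) pairs, since the displacement does not affect the size.

```python
def a1_bytes(codewords):
    """Size in bytes of a codeword sequence under A1's encoding."""
    total = 0
    for kind, length in codewords:
        if kind == "literal":
            total += 1 + length   # one codeword byte plus the characters themselves
        else:
            total += 2            # a copy is always two bytes
    return total


greedy = [("literal", 1), ("copy", 3), ("copy", 14)]   # what A1 emits
optimal = [("literal", 2), ("copy", 16)]               # what foresight would emit
sizes = (a1_bytes(greedy), a1_bytes(optimal))
```

Both sequences cover the same 18 characters; the greedy choice pays for one extra copy codeword and saves only one literal character.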

Heuristic vs. Optimal-------------------------------

Foresight algorithms:

They must make more than one pass, and we pay heavily for that.

And the gain? (Optimal vs. A1)

- On average: about 1% better.

- In the worst case: 20%.

Back to our business.

A1’s virtues-------------------------

- A simple, one-pass, adaptive, lossless method.

- A natural fit for 8-bit-per-character data.

Performance:

- Compression ratio: up to 1/8 (a 2-byte copy codeword can stand for 16 characters).

- Expander: fast, simple, small storage requirements.

- Compressor: much slower and larger.

(All in comparison to other copy/literal methods.)

Improvements--------------------------------

- Enlarge the window: gain compression ratio, pay in space and speed.

- Enlarge the copy length: same.

- Change the encoding: gain performance, pay in simplicity.

- Change the update policy: gain compression speed, pay in space and expansion speed.

Summary--------------------------------

We introduced the compression problem and proposed a simple substitutional compression algorithm, based on copy/literal codewords.

Our main interest was the data structure. We saw how a modification of the basic suffix tree meets the algorithm's demands, and at what cost.
