Data Compression with Finite Windows
Fiala and Greene
Speaker: Giora Alexandron
Jan 21, 2016
Overview:-----------------------
Our main purpose:
See how a suffix tree supports a compression algorithm.
What we will see:
A data compression method that works by substituting text. It uses a modification of the basic suffix tree to support cyclic maintenance of the most recent strings seen in the file.
Outline------------------------
1. Compression: - In General
                - Our Algorithm
2. Data Structure: - Modification of the suffix tree.
3. Theoretical Considerations: - Proofs.
4. Improvements.
Compression-------------------------------
What is Compression:
Compression is the coding of data to minimize its representation. We will focus on lossless, adaptive, one-pass methods.
Main approaches:
- Statistical approach: try to predict the next symbol.
- Substitutional approach: replace blocks of text with references to earlier occurrences of identical text.
**We will focus on a substitutional method**
Compression-cont.------------------------------
What characterizes a good compressor:
- Good compression ratio.
- Runs fast in compression.
- Uses a minimum of space.
- Runs fast in expansion.
There are trade-offs between all of those. Naturally, we want to achieve them all:
a good algorithm + a matching data structure.
Substitutional Compressing---------------------------------------
Consider the following basic scheme:
The compressed file contains two types of codewords:
literal x: pass the next x characters directly to the output.
copy x, -y: go back y characters and copy the next x characters, starting at that position.
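The expansion side of this scheme can be sketched in a few lines. This is a minimal illustration, not the authors' code: the tuple representation of codewords and the function name `expand` are assumptions made here.

```python
def expand(codewords):
    """Expand ('literal', text) and ('copy', length, back) codewords into text."""
    out = []
    for word in codewords:
        if word[0] == "literal":
            # literal x: pass the next x characters straight to the output
            out.extend(word[1])
        else:
            # copy x, -y: go back y characters and copy x characters from there
            _, length, back = word
            start = len(out) - back
            # Copy one character at a time so a copy may overlap its own output.
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)


codewords = [
    ("literal", "it was the best of times, "),
    ("copy", 11, 26),
    ("literal", "wor"),
    ("copy", 11, 27),
]
print(expand(codewords))
```

The expander needs no search structure at all, only the text it has already produced.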
Example------------------------------------------------
..it was the best of times, it was the worst of times..
compresses to:
(literal 26) it was the best of times,
(copy 11, -26) (literal 3) wor (copy 11, -27)
The literal emits 26 characters (+26), including the space after the comma. The first copy goes back 26 characters (-26) and copies the 11 characters "it was the " (+11). After the literal "wor", the second copy goes back 27 characters (-27) and copies the 11 characters "st of times" (+11).
Example-cont.------------------------------------------------
And we get a very simple lossless method:
..it was the best of times, it was the worst of times..
   -> Compression ->
(literal 26) it was the best of times, (copy 11, -26) (literal 3) wor (copy 11, -27)
   -> Expansion ->
..it was the best of times, it was the worst of times..
The compression achieved depends on the size of the copy and literal codewords.
A1------------------------------------------------------
The encoding of A1:
- 8 bits for a literal codeword
- 16 bits for a copy codeword
(Can you figure out the logic behind it?)
Literal codeword, bits 0..7:   0000xxxx
   xxxx = literal length [1..16]
Copy codeword, bits 0..15:     xxxxyy..yy
   xxxx = length [2..16], yy..yy = displacement [1..4096]
A1-cont.------------------------------------------------------
And we get a compression of 51 bytes to 36:
(literal 16) it was the best  (literal 10) of times,
(copy 11, -26) (literal 3) wor (copy 11, -27)
That is 17 + 11 + 2 + 4 + 2 = 36 bytes: each literal costs one header byte plus its characters, and each copy costs two bytes.
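The byte count can be checked with a small packing sketch. The field layout follows the slide (a 0000 prefix marks a literal, a nonzero 4-bit prefix marks a copy), but storing each field as "value minus one" is an assumption made here, as are the function names.

```python
def encode_literal(text):
    """Header byte 0000xxxx (assumed: xxxx = length - 1), then the raw characters."""
    data = text.encode("latin-1")
    assert 1 <= len(data) <= 16
    return bytes([len(data) - 1]) + data


def encode_copy(length, displacement):
    """16 bits: 4-bit length field (assumed length - 1, never 0), 12-bit displacement - 1."""
    assert 2 <= length <= 16 and 1 <= displacement <= 4096
    word = ((length - 1) << 12) | (displacement - 1)
    return word.to_bytes(2, "big")


def decode(stream):
    """Expand a packed A1-style byte stream back into text."""
    out = bytearray()
    i = 0
    while i < len(stream):
        high = stream[i] >> 4
        if high == 0:                      # literal codeword
            n = (stream[i] & 0x0F) + 1
            out += stream[i + 1:i + 1 + n]
            i += 1 + n
        else:                              # copy codeword
            length = high + 1
            displacement = (((stream[i] & 0x0F) << 8) | stream[i + 1]) + 1
            start = len(out) - displacement
            for k in range(length):
                out.append(out[start + k])
            i += 2
    return out.decode("latin-1")


packed = (
    encode_literal("it was the best ")
    + encode_literal("of times, ")
    + encode_copy(11, 26)
    + encode_literal("wor")
    + encode_copy(11, 27)
)
print(len(packed))
```

Because a copy length is at least 2, the 4-bit length field of a copy is never 0000, which is what lets the first four bits tell the two codeword types apart.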
A1’s policy----------------------------
If the compressor is idle (just finished a codeword):
   look for a copy of length >= 2;
   otherwise, start a literal.
If the compressor is in the middle of a literal:
   extend it until a copy of length >= 3 is found.
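The policy can be tried out with a brute-force search standing in for the suffix tree. This is a sketch under assumptions: `longest_match` and `compress` are hypothetical names, the search is quadratic on purpose, and a literal is closed when it reaches the 16-character limit of the A1 literal codeword.

```python
WINDOW, MAX_LEN = 4096, 16


def longest_match(text, pos):
    """Brute-force stand-in for the suffix tree: best (length, displacement) at pos."""
    best_len, best_disp = 0, 0
    limit = min(MAX_LEN, len(text) - pos)
    for start in range(max(0, pos - WINDOW), pos):
        n = 0
        while n < limit and text[start + n] == text[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_disp = n, pos - start
    return best_len, best_disp


def compress(text):
    words, pos, literal = [], 0, ""
    while pos < len(text):
        length, disp = longest_match(text, pos)
        # Idle: accept a copy of length >= 2. Inside a literal: require >= 3.
        threshold = 3 if literal else 2
        if length >= threshold:
            if literal:
                words.append(("literal", literal))
                literal = ""
            words.append(("copy", length, disp))
            pos += length
        else:
            literal += text[pos]
            pos += 1
            if len(literal) == MAX_LEN:    # literal codeword is full
                words.append(("literal", literal))
                literal = ""
    if literal:
        words.append(("literal", literal))
    return words


text = "it was the best of times, it was the worst of times"
for word in compress(text):
    print(word)
```

The higher threshold inside a literal reflects the codeword sizes: interrupting a literal for a 2-character copy costs two bytes and may force a new literal header afterwards, so it does not pay off.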
Where do we stand?
1. Compression: - In General
                - Our Algorithm                          (done)
2. Data Structure: - Modification of the suffix tree.    (here)
3. Theoretical Considerations: - Proofs.
The Data Structure-----------------------------------------
What do we need?
Find the current longest match (for copy).
What could we use?
Naive solution---------------------------------
A suffix tree with all strings of length <= 16 in the previous 4096-byte window.
[Figure: the 4096-byte window before the current position, with strings of length 16 marked inside it.]
The cost --------------------------------------------
If we descended d levels to insert the string starting at position j,
we will descend at least d-1 levels to insert the string starting at j+1.
So the cost of insertion is O(nd).
But we want to eliminate d.
[Figure: the 4096-byte window, with a descent of d levels for position j and of at least d-1 levels for position j+1.]
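The d versus d-1 observation can be checked on a toy version of the naive structure. This sketch swaps in an uncompressed trie built from nested dicts (not the compact suffix tree of the slides) and ignores the sliding of the window; `insert_suffix` is a hypothetical name.

```python
def insert_suffix(trie, text, start, max_len=16):
    """Insert text[start:start+max_len]; return how many levels already existed."""
    node, matched = trie, 0
    for ch in text[start:start + max_len]:
        if ch in node:
            matched += 1              # descended one level that was already there
        node = node.setdefault(ch, {})
    return matched


text = "it was the best of times, it was the worst of times"
trie = {}
# Every insertion restarts at the root: this is the O(nd) behaviour.
depths = [insert_suffix(trie, text, j) for j in range(len(text))]
print(depths[26])
```

Since each insertion walks down again from the root, the work repeated between position j and position j+1 is exactly what suffix links are meant to save.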
Modifications------------------------------------
a. Suffix links:
Each node representing the string aX has a pointer to the node representing the string X.
Immediate advantage:
We don’t need to return to the root after each insertion.
[Figure: nodes aX and X, each with an arc Y below it, and leaf k.]
Suffix Links------------------------------------
How we use and create suffix links:
..aXYb..
1. Create a new node for aXY, and insert b below it.
2. Use the suffix link to insert XYb:
   a.1 go up to the node aX and cross to the node X using the suffix link.
   a.2 rescan Y from the node X down to the node XY (which does not necessarily exist). If it doesn’t exist, create it!
       Rescan means we don’t need to check the string again, but go straight to the node XY.
   a.3 scan from the node XY to insert XYb.
3. Add the new node’s suffix link, from aXY to XY (and we finish with the insertion!).
[Figure: nodes aX and X joined by a suffix link, an arc Y below each, the new leaf b next to leaf k, with the rescan and scan paths marked.]
Invariant kept: every internal node has a suffix link (except the one just created).
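The role of suffix links can be seen in a deliberately simplified setting. The sketch below builds an uncompressed suffix trie online (one node per substring, so there are no arcs to split and no rescan step), which is a swapped-in technique and not the compact tree described above. The names `Node`, `build_suffix_trie` and `find` are introduced here for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}   # next character -> Node
        self.link = None     # suffix link: the node for aX points to the node for X


def build_suffix_trie(text):
    root = Node()
    deepest = root                       # node spelling the whole text read so far
    for ch in text:
        node, previous_new = deepest, None
        # Follow suffix links from suffix to shorter suffix instead of
        # returning to the root for each one.
        while node is not None and ch not in node.children:
            child = Node()
            node.children[ch] = child
            if previous_new is not None:
                previous_new.link = child
            previous_new = child
            node = node.link
        # Close the chain: link the last new node to an existing node, or the root.
        previous_new.link = root if node is None else node.children[ch]
        deepest = deepest.children[ch]
    return root


def find(root, s):
    """Return the node spelling s, or None if s is not a substring."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


root = build_suffix_trie("it was the best of times")
print(find(root, "the").link is find(root, "he"))
```

The trie has one node per distinct substring, so it is far too large for real use; the compact tree keeps the same link structure with a number of nodes linear in the window size.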
Demands from the DS:
……………………gffghk……
[Figure: the 4096-byte window over the text, with a match inside it, a delete at one end and an insert at the other.]
We explained insertion.
What about deletion?
Modifications- cont.------------------------------------
Deletion:
b. Leaves in a circular buffer:
   identify the oldest leaf and delete it.
c. ‘Son count’:
   when it falls to one, delete the node and combine the arcs.
[Figure: nodes aX and X with arcs Y and leaves b and k; a node with son count = 3; the leaves held in a circular buffer with slots 1 to 4096.]
Is it enough?------------------------------------
No.
We still have a problem.
Position pointers in higher nodes can become out of date.
But climbing up to update those pointers would take away the advantages of using the suffix links!
[Figure: nodes aX and X with arcs Y and leaves b and k, over the text ..fkjg…]
Modifications- Last ------------------------------------
d. Percolating updates:
Each internal node has an update bit (true/false).
Percolating updates ------------------------------------
When updating a node:
If bit = true:
   1. set the bit to false.
   2. propagate the update to the parent.
If bit = false:
   1. set the bit to true.
   2. stop the update.
[Figure: nodes aX and X with arcs Y and leaf k; each internal node carries a true/false bit.]
Percolating updates-cont.-------------------------------------------
Effect:
Keeps all internal position pointers within the 4096-byte window in the file.
Cost:
worst case: the update propagates all the way to the root.
amortized: summing over all new leaves, we get constant cost per leaf.
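The update rule and its cost can be simulated directly. In this sketch the class `Internal` and the function `percolate` are hypothetical names, and refreshing the position field on every update received (not only when the bit is true) is an assumption made here.

```python
class Internal:
    def __init__(self, parent=None):
        self.parent = parent
        self.position = 0
        self.bit = False


def percolate(node, position):
    """Deliver an update to node; return the number of nodes touched."""
    touched = 0
    while node is not None:
        touched += 1
        node.position = position     # assumption: refreshed on every update received
        if node.bit:
            node.bit = False         # bit was true: clear it and pass the update up
            node = node.parent
        else:
            node.bit = True          # bit was false: remember the update and stop
            break
    return touched


# A chain of 10 internal nodes: the worst shape for a single update.
root = Internal()
node = root
for _ in range(9):
    node = Internal(parent=node)
deepest = node

updates = 1000
total = sum(percolate(deepest, p) for p in range(1, updates + 1))
print(total, root.position)
```

The bits along a path behave like a binary counter: a single update may ripple all the way up, but the total work stays within two node visits per update.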
Summary of the inner loop---------------------------------------------------------
The operations:
1. Insert:
   a. insert the previous string.
   b. use the suffix link to insert the next string.
2. Percolate update from the leaf:
   if the bit is true: set the position field of the node to the current position, set the bit to false and propagate to the parent.
   if the bit is false: set it to true, and stop.
Summary- cont---------------------------------------------------------
3. Circular buffer:
   a. replace the oldest leaf with the new one.
   b. if its parent has only one remaining son:
      1. delete the parent, and attach the remaining son to the grandparent.
      2. percolate the deleted node’s position
         (*special case: comparative percolation).
Where do we stand?
1. Compression: - In General
                - Our Algorithm                          (done)
2. Data Structure: - Modification of the suffix tree.    (done)
3. Theoretical Considerations: - Proofs.                 (here)
Theoretical Considerations----------------------------------------------------
Correctness and linearity of suffix tree construction:
we already saw that.
We need to be convinced about destruction:
Theorem 1:
Deleting leaves in FIFO order and deleting internal nodes with single sons will never leave dangling suffix pointers.
Proof:
Assume the contrary: some node has a suffix link that points to a node that was deleted.
The existence of the first node means that two strings in the window agree for l symbols and differ at l+1:
……df..gb…df..gz..
Dropping the first symbol of each gives two strings that agree for l-1 symbols and differ at l. They start one position later, so under FIFO deletion they are still in the tree.
So the target of the link has at least two sons (b and z). This contradicts the assumption that it had one son and was therefore deleted.
Theoretical Considerations-----------------------------------------------------
Theorem 2:
Each percolated update has constant amortized cost.
Proof:
Assume a ‘credit’ on each internal node whose ‘update’ flag is true.
A new leaf is added with two ‘credits’:
One is spent to update the parent.
The second:
- if the parent’s flag is false, give it to the parent and terminate (the flag becomes true);
- if the parent’s flag is true, the parent now holds two credits (its own and the new one), so continue: apply the same step recursively on the parent.
Result:
the invariant is kept, and we get an amortized cost of two updates per new leaf.
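The credit argument can also be written as a potential-function bound. This is a sketch in notation introduced here, not taken from the slides: $\Phi$ counts the internal nodes whose update bit is true, $t$ is the number of true nodes one update passes through, $c$ is the number of nodes it touches, and $\hat{c}$ is its amortized cost.

```latex
\begin{align*}
\Phi &= \#\{\text{internal nodes whose update bit is true}\},\\
c &\le t + 1, \qquad \Delta\Phi \le -t + 1,\\
\hat{c} &= c + \Delta\Phi \le (t + 1) + (1 - t) = 2 .
\end{align*}
```

The $t$ true bits are cleared on the way up, and at most one false bit is set where the update stops, which gives the bound of two per new leaf.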
Theoretical Considerations-----------------------------------------------------
Theorem 3 (effectiveness):
Using the percolating update, every internal node will be updated at least once in a period (4096).
Proof:
We will prove that every internal node is updated at least twice in a period, and thus propagates at least one update up.
(By contradiction.) Find the farthest node from the root that doesn’t propagate an update to its parent.
3 cases:
a. It has two (or more) remaining* children: both are farther from the root, so each of them updated it.
b. It has only one remaining* child: one update comes from it. The second comes from a new child when it is created
   (a new arc causes the son to update the parent).
c. It has two new children: similar.
In all cases, the node receives two updates during a period, and thus propagates an update. Contradiction.
* A remaining child is one that has remained for the entire period.
Other Theoretical Considerations (bounds on the compression)
-----------------------------------------------------------
We have focused on the data structure.
There are other questions, about the compression.
But about that, another time! (And in another course.)
We will only mention them briefly:
Consider the following:
Position:                j    j+1   j+2   j+3   j+4   j+5
Copy length available:   1    3     16    15    14    13
The encoder is at position j.
A1:       (literal 1) x (copy 3, -y) (copy 14, -y)     6 bytes
Optimal:  (literal 2) xx (copy 16, -y)                 5 bytes
How bad can it get?
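The two byte counts follow from the A1 codeword sizes (a literal costs one header byte plus its characters, a copy costs two bytes). A small check, with `cost` and `covered` as hypothetical helper names and codewords reduced to (kind, length) pairs:

```python
def cost(words):
    """Bytes used: 1 header byte + n characters per literal, 2 bytes per copy."""
    return sum(1 + n if kind == "literal" else 2 for kind, n in words)


def covered(words):
    """Number of source characters the codewords account for."""
    return sum(n for _, n in words)


a1 = [("literal", 1), ("copy", 3), ("copy", 14)]
optimal = [("literal", 2), ("copy", 16)]
print(cost(a1), cost(optimal), covered(a1), covered(optimal))
```

Both encodings cover the same 18 characters. A1 takes the length-3 copy at j+1 as soon as it is allowed to, and so misses the length-16 copy that starts one position later.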
Heuristic vs. Optimal-------------------------------
Foresight algorithms:
They need more than one pass: we pay big time.
And the gain?
(Optimal vs. A1)
On average: about 1% better.
In the worst case: 20%.
Back to our business
A1’s virtues-------------------------
- Simple one-pass, adaptive, lossless method.
- Natural approach for 8 bits per character.
Performance:
- Compression ratio: up to 1/8.
- Expander: fast, simple, small storage requirements.
- Compressor: much slower and larger.
(all in comparison to other copy/literal methods)
Improvements--------------------------------
- Enlarge the window: gain compression ratio, pay in space and speed.
- Enlarge the copy length: same.
- Change the encoding: gain performance, pay in simplicity.
- Change the update policy: gain compression speed, pay in space and expansion speed.
Summary--------------------------------
We introduced the compression problem and proposed a simple substitutional compression algorithm based on copy/literal codewords.
Our main interest was the data structure. We saw how a modification of the basic suffix tree answers the algorithm’s demands, and at what cost.