On Compression and Indexing: two sides of the same coin
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Transcript
Page 1:

On Compression and Indexing:

two sides of the same coin

Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

Page 2:

What do we mean by “Indexing” ?

Types of data: linguistic or tokenizable text vs. raw sequence of characters or bytes (DNA sequences, audio-video files, executables)

Types of query: word-based query vs. character-based query (arbitrary substring, complex match)

Two indexing approaches:

• Word-based indexes, where a notion of “word” must be devised! » Inverted files, Signature files, Bitmaps.

• Full-text indexes, no constraint on texts and queries! » Suffix Array, Suffix tree, String B-tree [Ferragina-Grossi, JACM 99].

Page 3:

What do we mean by “Compression” ?

Compression has two positive effects:

Space saving (or, double memory at the same cost)

Performance improvement

Better use of memory levels close to processor

Increased disk and memory bandwidth

Reduced (mechanical) seek time

Since March 2001 the Memory eXpansion Technology (MXT) has been available on IBM eServer x330 machines: the same performance as a PC with double the memory, but at half the cost.

Moral: it is more economical to store data in compressed form than uncompressed

» CPU speed nowadays makes (de)compression “costless” !!

Page 4:

Compression and Indexing: Two sides of the same coin !

Do we witness a paradoxical situation ?

An index injects redundant data, in order to speed up pattern searches

Compression removes redundancy, in order to squeeze the space occupancy

NO, new results proved a mutual reinforcement behaviour!

Better indexes can be designed by exploiting compression techniques (in terms of space occupancy)

Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio)

Moral: CPM researchers must have a multidisciplinary background, ranging from data structure design to data compression, from architectural knowledge to database principles, to algorithmic engineering and more...

Page 5:

Our journey, today...

Suffix Array (1990)

Index design (Weiner ’73)   Compressor design (Shannon ’48)

Burrows-Wheeler Transform (1994)

Compressed Index: space close to gzip and bzip, query time close to O(|P|)

Compression Booster: a tool to transform a poor compressor into a better compression algorithm

Improved Indexes and Compressors

Wavelet Tree

Page 6:

The Suffix Array [Baeza-Yates and Gonnet, ’87; Manber and Myers, ’90]

Prop 1. All suffixes in SUF(T) having prefix P are contiguous

P=si

T = mississippi#

SUF(T), the lexicographically sorted set of suffixes of T:
#, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
(storing the suffixes explicitly takes Θ(N^2) space)

SA = 12 11 8 5 2 1 10 9 7 4 6 3   (each entry is a suffix pointer into T = mississippi#)

SA + T occupy Θ(N log2 N) bits

Prop 2. These suffixes follow P’s lexicographic position

Page 7:

Searching in the Suffix Array [Manber and Myers, ’90]

T = mississippi#   SA = 12 11 8 5 2 1 10 9 7 4 6 3

P = si  →  final range: the suffixes sippi# and sissippi#


Suffix Array search:
• O(log2 N) binary-search steps
• each step takes O(|P|) character comparisons
⇒ overall, O(|P| log2 N) time
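A minimal Python sketch of this binary search (illustration only: it builds the SA by explicitly sorting suffixes, and each comparison slices a suffix, so a step costs up to O(|P|) as stated above):

```python
# Sketch: suffix-array construction by explicit sorting, plus the binary search
# that locates the contiguous SA range of suffixes prefixed by P.
def suffix_array(T):
    # 1-based suffix pointers, as in the slides
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_range(T, SA, P):
    """Return (fr, lr) such that SA[fr:lr] lists the suffixes starting with P."""
    lo, hi = 0, len(SA)
    while lo < hi:                               # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:] < P:
            lo = mid + 1
        else:
            hi = mid
    fr = lo
    lo, hi = fr, len(SA)
    while lo < hi:                               # leftmost suffix whose |P|-prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:SA[mid] - 1 + len(P)] <= P:
            lo = mid + 1
        else:
            hi = mid
    return fr, lo

T = "mississippi#"
SA = suffix_array(T)
print(SA)                    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
fr, lr = sa_range(T, SA, "si")
print(SA[fr:lr])             # [7, 4]: "si" occurs at positions 7 and 4
```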

The suffix permutation cannot be an arbitrary permutation of {1, ..., N}:

# binary texts = 2^N ≪ N! = # permutations of {1, 2, ..., N}

⇒ N log2 N bits is not a lower bound on the space occupancy

O(|P| + log2 N) time

O(|P|/B + logB N) I/Os [JACM 99]

Self-adjusting version on disk [FOCS 02]

Page 8:

The Burrows-Wheeler Transform (1994)

We are given the text T = mississippi#

Take all cyclic rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows lexicographically. F is the first column, L the last column:

F              L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i
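A minimal sketch of the transform exactly as drawn above (quadratic space; real implementations derive L from the suffix array instead of materializing the rotations):

```python
# Sketch: BWT by sorting all cyclic rotations and reading off the last column L.
def bwt(T):
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

print(bwt("mississippi#"))   # ipssm#pissii
```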

Page 9:

Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

T = m i s s i s s i p p i #

Why is L so interesting for compression?

[Same sorted-rotation matrix as on the previous slide: F = first column, L = last column = i p s s m # p i s s i i]

A key observation: L is locally homogeneous

Algorithm Bzip:
1. Move-to-Front coding of L
2. Run-Length coding
3. Statistical coder: Arithmetic or Huffman

L is highly compressible
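A minimal sketch of step 1 of the pipeline above, Move-to-Front coding: locally homogeneous runs in L become runs of small integers (mostly 0s), which the run-length and statistical coders then squeeze well.

```python
# Sketch: Move-to-Front coding of L over its own alphabet.
def mtf_encode(L, alphabet):
    table = sorted(alphabet)              # current symbol ranking
    out = []
    for c in L:
        r = table.index(c)                # rank of c in the current table
        out.append(r)
        table.insert(0, table.pop(r))     # move c to the front
    return out

L = "ipssm#pissii"
print(mtf_encode(L, set(L)))   # [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0]
```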


Building the BWT ⇔ SA construction

Inverting the BWT ⇔ array visit

...overall O(N) time...

Page 10:

Suffix Array vs. BW-transform

SA = 12 11 8 5 2 1 10 9 7 4 6 3   (the sorted rotations of T = mississippi#)

L = i p s s m # p i s s i i

L includes SA and T. Can we search within L?

Page 11:

A compressed index [Ferragina-Manzini, IEEE Focs 2000]

Bridging data-structure design and compression techniques: the suffix array + the Burrows-Wheeler Transform

The corollary is that: the Suffix Array is compressible, and it is a self-index

In practice, the index is very appealing: space close to the best known compressors (i.e., bzip), query time of a few milliseconds on hundreds of MBs

The theoretical result:

Query complexity: O(p + occ log N) time

Space occupancy: O(N Hk(T)) + o(N) bits, where Hk is the k-th order empirical entropy

o(N) if T compressible

A plethora of papers on compressed (Hk-bounded) indexes: Grossi-Gupta-Vitter (03), Sadakane (02), ...; now more than 20 papers with more than 20 authors on related subjects

Index does not depend on k

Bound holds for all k, simultaneously

Page 12:

A useful tool: the L → F mapping

[Sorted-rotation matrix of T = mississippi#: F = first column, L = last column]

How do we map L’s chars onto F’s chars?

... we need to distinguish equal chars ...
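A minimal sketch of the LF mapping: equal characters appear in the same relative order in L and in F, so the i-th occurrence of c in L corresponds to the i-th row of c’s block in F. Here C[c] counts the characters of T smaller than c, a standard bookkeeping choice made for this sketch.

```python
# Sketch: compute LF[i] = row of F that holds the character L[i].
from collections import Counter

def lf_mapping(L):
    counts = Counter(L)
    C, tot = {}, 0
    for c in sorted(counts):              # C[c] = number of chars smaller than c
        C[c] = tot
        tot += counts[c]
    seen, LF = Counter(), []
    for c in L:
        LF.append(C[c] + seen[c])         # seen[c] = occurrences of c already read in L
        seen[c] += 1
    return LF

L = "ipssm#pissii"                        # bwt(mississippi#)
print(lf_mapping(L))                      # [1, 6, 8, 9, 5, 0, 7, 2, 10, 11, 3, 4]
```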

Page 13:

Substring search in T (count the occurrences of P = si)

Available info: bwt(T) = L = i p s s m # p i s s i i, plus the array C (shown in the figure as: # 0, i 1, m 6, p 7, s 9)

First step: fr and lr delimit the block of rows prefixed by the last character of P (here, the rows prefixed by ‘i’)

Inductive step: given fr, lr for P[j+1, p], take c = P[j]; find the first and the last occurrence of c in L[fr, lr]; the L-to-F mapping of these two characters gives the new fr, lr (the rows prefixed by P[j, p])

Final step: fr, lr delimit the rows prefixed by “si”, so occ = lr − fr + 1 = 2
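A hedged sketch of this counting procedure (backward search), using a plain string for L and naive counting in place of the compressed rank structures of the real index:

```python
# Sketch: count the occurrences of P in T given only L = bwt(T).
from collections import Counter

def count(L, P):
    counts = Counter(L)
    C, tot = {}, 0
    for c in sorted(counts):                  # C[c] = number of chars smaller than c
        C[c] = tot
        tot += counts[c]
    fr, lr = 0, len(L) - 1                    # start with the full range of rows
    for c in reversed(P):                     # process P backwards
        if c not in C:
            return 0
        fr = C[c] + L[:fr].count(c)           # L-to-F map the first/last c in L[fr..lr]
        lr = C[c] + L[:lr + 1].count(c) - 1
        if fr > lr:
            return 0
    return lr - fr + 1                        # = occ

L = "ipssm#pissii"                            # bwt(mississippi#)
print(count(L, "si"), count(L, "iss"), count(L, "ix"))   # 2 2 0
```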

Page 14:

Many details are missing...

• The column L is actually kept compressed: we still guarantee O(p) time to count P’s occurrences

• The Locate operation takes O(log N) time, using some additional data structures in o(N) bits

• Efficient and succinct index construction [Hon et al., FOCS 03]

• Bio-application: fit a Human-genome index in a PC [Sadakane et al., 02]

Interesting issues:

• What about arbitrary alphabets? [Grossi et al., 03; Ferragina et al., 04]

• What about disk-aware, or cache-oblivious, or self-adjusting versions?

• What about challenging applications: bio, DB, data mining, handheld PCs, ...

Page 15:

We investigated the reinforcement relation:

Compression ideas → Index design

Let’s now turn to the other direction:

Indexing ideas → Compressor design

Where we are ...

Booster

Page 16:

What do we mean by “boosting” ?

It is a technique that takes a poor compressor A and turns it into a compressor with a better performance guarantee

A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman)

It incurs some obvious limitations. Consider T = a^n b^n versus T’ = a random string of length 2n with the same number of a’s and b’s: a memoryless compressor treats them identically, although T is far more compressible.

Page 17:

Qualitatively, we would like to achieve:

• c’ is shorter than c, if T is compressible

• Time(A^boost) = O(Time(A)), i.e., no slowdown

• A is used as a black-box

What do we mean by “boosting” ?

[Diagram: T → compressor A → c → Booster → c’]

The more compressible T is, the shorter c’ is


Two Key Components: Burrows-Wheeler Transform and Suffix Tree

Page 18:

The empirical entropy H0

H0(T) = − ∑i (ni/n) log2 (ni/n), where ni/n is the frequency in T of the i-th symbol

|T| · H0(T) is the best you can hope for from a memoryless compressor; e.g., Huffman or Arithmetic coders come close to this bound

We get better compression by using a codeword that depends on the k symbols preceding the one to be compressed (its context)
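A small sketch of the formula above:

```python
# Sketch: 0-th order empirical entropy of a string, in bits per symbol.
from collections import Counter
from math import log2

def H0(T):
    n = len(T)
    return -sum((c / n) * log2(c / n) for c in Counter(T).values())

print(round(H0("mississippi"), 3))       # 1.823
print(round(H0("a" * 5 + "b" * 5), 3))   # 1.0, same as for a random 50/50 a-b string
```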

Page 19:

The empirical entropy Hk

Hk(T) = (1/|T|) ∑|ω|=k |T[ω]| H0(T[ω])

Example: Given T = “mississippi”, we have

T[ω] = string of symbols that precede the occurrences of context ω in T

T[i] = mssp,  T[is] = ms
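A sketch of Hk following this formula, under the convention used in the example above (T[ω] collects the symbols that precede each occurrence of the k-long context ω):

```python
# Sketch: k-th order empirical entropy via the contexts T[w].
from collections import Counter, defaultdict
from math import log2

def H0(s):
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values()) if n else 0.0

def contexts(T, k):
    Tw = defaultdict(str)
    for i in range(1, len(T) - k + 1):       # T[i-1] precedes the context T[i:i+k]
        Tw[T[i:i + k]] += T[i - 1]
    return Tw

def Hk(T, k):
    Tw = contexts(T, k)
    return sum(len(s) * H0(s) for s in Tw.values()) / len(T)

T = "mississippi"
print(contexts(T, 1)["i"], contexts(T, 2)["is"])   # mssp ms
print(round(Hk(T, 1), 3), round(Hk(T, 2), 3))      # H1(T) and H2(T)
```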

Problems with this approach:
• How do we go from all the T[ω] back to the string T?
• How do we choose the best k efficiently?

Compress T up to Hk(T)  ⇔  compress all the T[ω] up to their H0
(use Huffman or Arithmetic coding within each k-long context ω)

The two key tools: BWT and Suffix Tree

Page 20:

T = mississippi#

[Sorted-rotation matrix of T, as before: bwt(T) = L = i p s s m # p i s s i i]

Use the BWT to approximate Hk:

Compress T up to Hk(T)  ⇔  compress the pieces of bwt(T) up to their H0

Remember that Hk(T) = (1/|T|) ∑|ω|=k |T[ω]| H0(T[ω]), and that T[ω] is a permutation of a piece of bwt(T); e.g. T[is] = ms is a permutation of the piece of L lying in the rows prefixed by “is”
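A sketch that checks this claim on T = mississippi# (contexts are taken cyclically, which is harmless once the end-marker # is present):

```python
# Sketch: the characters of bwt(T) lying in the rows prefixed by a context w
# form a permutation of T[w], the symbols preceding the occurrences of w.
from collections import defaultdict

def bwt_rows(T):
    return sorted(T[i:] + T[:i] for i in range(len(T)))

def bwt_pieces(T, k):
    pieces = defaultdict(str)
    for row in bwt_rows(T):
        pieces[row[:k]] += row[-1]            # piece of L for the context row[:k]
    return pieces

def T_contexts(T, k):
    Tw = defaultdict(str)
    for i in range(len(T)):
        Tw[(T + T)[i:i + k]] += T[i - 1]      # symbol preceding the context at i
    return Tw

T = "mississippi#"
pieces, Tw = bwt_pieces(T, 2), T_contexts(T, 2)
print(pieces["is"], Tw["is"])                 # sm ms, the same multiset
print(all(sorted(pieces[w]) == sorted(Tw[w]) for w in Tw))   # True
```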

Page 21:

What are the pieces of bwt(T) to compress?

[Sorted-rotation matrix of T, partitioned according to the context that prefixes each row]

Compressing the pieces induced by the 1-long contexts up to their H0, we achieve H1(T); compressing the pieces induced by the 2-long contexts, we achieve H2(T)

So we have a workable way to approximate Hk

Recall that each piece is a permutation of the corresponding T[ω]

Page 22:

Finding the “best pieces” to compress...

Goal: find the best BWT-partition induced by a leaf cover of the suffix tree! Some leaf covers are “related” to Hk!

[Figure: the suffix tree of T = mississippi# with its leaves labelled by the corresponding SA positions; two leaf covers, L1 and L2, induce the BWT-partitions that correspond to H1(T) and H2(T)]

Page 23:

Let A be the compressor we wish to boost

Let bwt(T) = t1, …, tr be the partition induced by the leaf cover L, and define cost(L, A) = ∑j |A(tj)|

Goal: find the leaf cover L* of minimum cost

A post-order visit of the suffix tree suffices, hence linear time

A compression booster [Ferragina-Manzini, SODA 04]

We have: cost(L*, A) ≤ cost(Lk, A) ≈ |T| Hk(T), for every k

Technically, we show that if A satisfies |c| ≤ λ |s| H0(s) + f(|s|) on every string s, then the boosted compressor satisfies |c’| ≤ λ |s| Hk(s) + log2 |s| + gk, simultaneously for every k

Researchers may now concentrate on the “apparently” simpler task of designing 0-th order compressors

[further results joint with Giancarlo-Sciortino]
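The linear-time suffix-tree visit is not reproduced here; the following sketch only illustrates the objective cost(L, A) = ∑j |A(tj)|, with zlib standing in for the black-box compressor A and a few hand-picked partitions in place of the leaf covers:

```python
# Sketch: evaluate cost(partition, A) = sum of the compressed sizes of the pieces.
import zlib

def cost(partition, A=lambda s: zlib.compress(s.encode())):
    return sum(len(A(piece)) for piece in partition)

bwt_like = "ipssm#pissii" * 500                # a toy BWT-like string
candidates = {
    "whole string": [bwt_like],
    "two pieces":   [bwt_like[: len(bwt_like) // 2], bwt_like[len(bwt_like) // 2:]],
    "tiny pieces":  [bwt_like[i:i + 10] for i in range(0, len(bwt_like), 10)],
}
for name, parts in candidates.items():
    print(name, cost(parts))
# The booster chooses, among the partitions induced by suffix-tree leaf covers,
# the one of minimum cost, and does so in linear time via a post-order visit.
```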

Page 24:

May we close “the mutual reinforcement cycle” ?

The Wavelet Tree [Grossi-Gupta-Vitter, Soda 03]

Using the WT within each piece of the optimal BWT-partition [joint with Giancarlo, Manzini, Makinen, Navarro, Sciortino], we get:

• A compressed index that scales well with the alphabet size

• A reduction of the compression problem to achieving H0 on binary strings

Interesting issues:

What about the space-efficient construction of the BWT?

What about these tools for XML or images?

Other application contexts: bio, DB, data mining, network, ...

From Theory to Technology! Libraries, libraries, ... [e.g. LEDA]

Page 25:

Page 26:

Page 27:

A historical perspective

Shannon showed a “narrower” result for a stationary ergodic source S

Idea: compress groups of k chars of the string T

Result: the compression ratio tends to the entropy of S, as k → ∞

Various limitations:

It works for a source S

It must modify A’s structure, because of the alphabet change

For a given string T, the best k is found by trying k = 0, 1, …, |T|  ⇒  a Θ(|T|^2) time slowdown

k is eventually fixed and this is not an optimal choice !

The booster, by contrast, works on any string s, uses A as a black-box, runs in O(|s|) time, and exploits variable-length contexts

Two Key Components: Burrows-Wheeler Transform and Suffix Tree

Page 28:

How do we find the “best” partition (i.e., the best k)?

“Approximate” it via MTF [Burrows-Wheeler, ’94]

MTF is efficient in practice [bzip2]

Theory and practice showed that we can aim for more!

Use Dynamic Programming [Giancarlo-Sciortino, CPM ’03]

It finds the optimal partition

Very slow: the time complexity is cubic in |T|. Surprisingly, full-text indexes help in finding the optimal partition in optimal linear time!

Page 29:

Example: no single k is best

s = (bad)^n (cad)^n (xy)^n (xz)^n

1-long vs. 2-long contexts:

s_x = y^n z^n  >  s_yx = y^(n-1), s_zx = z^(n-1)

s_a = d^(2n)  <  s_ba = d^n, s_ca = d^n

Different parts of s call for contexts of different lengths.

Page 30:

Word-based compressed index

T = …bzip…bzip2unbzip2unbzip…

What about word-based occurrences of P ?

The FM-index can be adapted to support word-based searches: Preprocess T and transform it into a “digested” text DT

P = bzip may match a whole word, a prefix (bzip2), a substring (unbzip2), or a suffix (unbzip) of the words of T

...the post-processing phase can be time consuming !

Use the FM-index over the “digested” DT

Word-search in T  ⇔  Substring-search in DT

Page 31:

The WFM-index

Variant of Huffman algorithm:

The symbols of the Huffman tree are the words of T

The Huffman tree has fan-out 128

Codewords are byte-aligned and tagged
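A simplified sketch of byte-aligned tagged codewords: 7 payload bits per byte, with the high bit tagging the first byte of each codeword. For brevity it codes word ranks with a plain variable-byte scheme rather than the 128-ary Huffman tree of the WFM-index.

```python
# Sketch: tag bit = 1 on the first byte of a codeword, 0 on continuation bytes.
def encode_word(rank):
    out, first = [], True
    while True:
        out.append((rank & 0x7F) | (0x80 if first else 0x00))
        rank >>= 7
        first = False
        if rank == 0:
            return bytes(out)

def encode_text(words, dictionary):
    return b"".join(encode_word(dictionary[w]) for w in words)

dictionary = {"bzip": 0, " ": 1, "or": 2, "not": 3}
DT = encode_text(["bzip", " ", "or", " ", "not", " ", "bzip"], dictionary)
print(DT.hex(" "))                             # 80 81 82 81 83 81 80
# Because codeword starts are tagged, a byte-level match can only begin at a
# codeword boundary; the prefix-freeness of a Huffman code then rules out the
# remaining false matches, so P can be searched directly inside DT.
```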

[Figure: a codeword is a sequence of bytes; 7 bits of each byte carry the Huffman code, and the leading tag bit (1 on the first byte, 0 on the others) marks where a codeword starts]

The WFM-index consists of:
1. a Dictionary of words
2. the Huffman tree
3. an FM-index built on DT

[Figure: the Huffman tree over the words of T = “bzip or not bzip”; DT is the sequence of codewords, e.g. [bzip] = 10]

space ≈ 22%, word search ≈ 4 ms

Query P = bzip: retrieve its codeword from the dictionary and search for it, as a substring, in the FM-index built on DT

Page 32:

The BW-Transform is invertible

[Sorted-rotation matrix of T = mississippi#: F = first column, L = last column]

Two key properties:

1. We can map L’s chars to F’s chars

2. T = .... L[i] F[i] ...,  i.e. L[i] precedes F[i] in T

Reconstruct T backward, starting from the row that ends with # and repeatedly applying the L-to-F mapping

Building the BWT ⇔ SA construction

Inverting the BWT ⇔ array visit

...overall O(N) time...
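A minimal sketch of the inversion, built on the two properties above: compute the LF mapping from L, then emit T backwards starting from the row that ends with the end-marker #.

```python
# Sketch: rebuild T from L = bwt(T) by walking the LF mapping backwards.
from collections import Counter

def inverse_bwt(L, end="#"):
    counts = Counter(L)
    C, tot = {}, 0
    for c in sorted(counts):               # C[c] = number of chars smaller than c
        C[c] = tot
        tot += counts[c]
    seen, LF = Counter(), []
    for c in L:                            # LF[i] = row of F holding L[i]
        LF.append(C[c] + seen[c])
        seen[c] += 1
    row, out = L.index(end), []            # the row whose last character is '#'
    for _ in range(len(L)):
        out.append(L[row])                 # L[row] precedes F[row] in T
        row = LF[row]
    return "".join(reversed(out))

print(inverse_bwt("ipssm#pissii"))   # mississippi#
```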

Page 33:

The Wavelet Tree [Grossi-Gupta-Vitter, SODA 03]

[collaboration: Giancarlo, Manzini, Makinen, Navarro, Sciortino]

Wavelet tree of “mississippi” over the alphabet {i, m, p, s}:

root:        mississippi  →  bitvector 00110110110   (0 = {i, m}, 1 = {p, s})
left child:  miiii        →  bitvector 10000         (0 = i, 1 = m)
right child: sssspp       →  bitvector 111100        (0 = p, 1 = s)

• Use the WT within each BWT piece  →  alphabet independence; binary-string compression/indexing
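A minimal wavelet-tree sketch matching the figure above (recursive halving of the alphabet), supporting rank(c, i), the operation needed on each BWT piece:

```python
# Sketch: wavelet tree with rank(c, i) = occurrences of c among the first i symbols.
class WaveletTree:
    def __init__(self, S, alphabet=None):
        self.alphabet = sorted(set(S)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:            # leaf: a run of a single symbol
            self.bits = self.left = self.right = None
            return
        half = len(self.alphabet) // 2
        left_set = set(self.alphabet[:half])
        self.bits = [0 if c in left_set else 1 for c in S]
        self.left = WaveletTree([c for c in S if c in left_set], self.alphabet[:half])
        self.right = WaveletTree([c for c in S if c not in left_set], self.alphabet[half:])

    def rank(self, c, i):
        if self.bits is None:
            return i if c in self.alphabet else 0
        half = len(self.alphabet) // 2
        if c in self.alphabet[:half]:
            return self.left.rank(c, i - sum(self.bits[:i]))   # follow the 0s
        return self.right.rank(c, sum(self.bits[:i]))          # follow the 1s

wt = WaveletTree("mississippi")
print("".join(map(str, wt.bits)))   # 00110110110, the root bitvector in the figure
print(wt.rank("s", 7))              # 4 occurrences of 's' in "mississ"
```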