Compact Data Structures (To compress is to Conquer). Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto. 23rd August 2017. 3rd KEYSTONE Training School: Keyword search in Big Linked Data
Compact Data Structures

(To compress is to Conquer)

Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto

23rd August 2017

3rd KEYSTONE Training School: Keyword search in Big Linked Data

Agenda

Introduction

Basic compression

Sequences

Bit sequences

Integer sequences

A brief Review about Indexing

Introduction to Compact Data Structures

Compact data structures lie at the intersection of Data Structures (indexing) and Information Theory (compression): one looks for data representations that not only use space close to the minimum possible (as in compression) but also allow some operations on the data to be carried out efficiently.

Introduction: Why compression?

Disks are cheap, but they are also slow!

Compression can help more data to fit in main memory.

(access to memory is around 10^6 times faster than HDD)

CPU speed is increasing faster

We can trade processing time (needed to uncompress data) for space.

Introduction: Why compression?

Compression does not only reduce space! It also reduces:

I/O access on disks and networks

Processing time* (less data has to be processed)

* … if appropriate methods are used

For example: methods that allow handling the data in compressed form all the time.

[Figure: the same documents (Doc 1, Doc 2, Doc 3, …, Doc n) shown as a text collection (100%), as a compressed text collection (30%), and as a compressed text collection (20%; p7zip, others). Query: let’s search for “Keystone”.]

Introduction: Why indexing?

Indexing permits sublinear search time

[Figure: a text collection (100%) or compressed text collection (30%) of documents Doc 1, Doc 2, Doc 3, …, Doc n, plus an index (> 5-30%) with entries term 1, …, Keystone, …, term n. Query: let’s search for “Keystone”.]

Introduction: Why compact data structures?

Self-indexes:

sublinear search time

Text implicitly kept

[Figure: a text collection (Doc 1, Doc 2, Doc 3, …, Doc n) with a separate index (> 5-30%) over term 1, …, Keystone, …, term n, next to a self-index (WT, WCSA, …) drawn as a tree of bitmaps over the same terms. Query: let’s search for “Keystone”.]

Compression

Compression aims at representing data in less space. How does it work? Which are the most traditional compression techniques?

Basic Compression: Modeling & Coding

A compressor could use as a source alphabet:

A fixed number of symbols (statistical compressors)

1 char, 1 word

A variable number of symbols (dictionary-based compressors)

1st occ of ‘a’ encoded alone, 2nd occ encoded together with the next symbol, ‘ax’

Codes are built using symbols of a target alphabet:

Fixed length codes (10 bits, 1 byte, 2 bytes, …)

Variable length codes (1,2,3,4 bits/bytes …)

Classification (fixed-to-variable, variable-to-fixed, …):

Input alphabet   Target alphabet: fixed   Target alphabet: var
fixed            --                       statistical
var              dictionary               var2var

Basic Compression: Main families of compressors

Taxonomy

Dictionary based (gzip, compress, p7zip… )

Grammar based (BPE, Repair)

Statistical compressors (Huffman, arithmetic, Dense, PPM,… )

Statistical compressors

Gather the frequencies of the source symbols.

Assign shorter codewords to the most frequent symbols.

Obtain compression

Basic Compression: Dictionary-based compressors

How do they achieve compression?

Assign fixed-length codewords to variable-length symbols (text substrings)

The longer the replaced substring, the better the compression

Well-known representatives: Lempel-Ziv family

LZ77 (1977): GZIP, PKZIP, ARJ, P7zip

LZ78 (1978)

LZW (1984): Compress, GIF images

Basic Compression: LZW

Starts with an initial dictionary D (contains the symbols of the source alphabet)

For a given position of the text:

While D contains w, keep reading the prefix w = w0 w1 w2 …

If w0 … wk wk+1 is not in D (but w0 … wk is!)

output i = entryPos(w0 … wk) (Note: codeword length = ⌈log2 |D|⌉ bits)

Add w0 … wk wk+1 to D

Continue from wk+1 on (included)

Dictionary has limited length? Policies: LRU, truncate & go, …

[Figure: worked example]
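The LZW loop above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of a Python dict for D, and the unbounded dictionary (no LRU or truncate policy) are assumptions, and the input is assumed to use only symbols of the given alphabet.

```python
def lzw_compress(text, alphabet):
    # D maps each known phrase to its entry position (assumed layout).
    dictionary = {ch: i for i, ch in enumerate(sorted(alphabet))}
    out, w = [], ""
    for ch in text:
        if w + ch in dictionary:
            w += ch                                 # keep extending the prefix
        else:
            out.append(dictionary[w])               # output entryPos(w0..wk)
            dictionary[w + ch] = len(dictionary)    # add w0..wk wk+1 to D
            w = ch                                  # continue from wk+1 (included)
    if w:
        out.append(dictionary[w])
    return out


def lzw_decompress(codes, alphabet):
    phrases = sorted(alphabet)                      # entry position -> phrase
    if not codes:
        return ""
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(phrases):
            entry = phrases[code]
        else:
            # The code refers to the phrase that this very step defines.
            entry = prev + prev[0]
        out.append(entry)
        phrases.append(prev + entry[0])
        prev = entry
    return "".join(out)


codes = lzw_compress("abababa", "ab")
print(codes)
print(lzw_decompress(codes, "ab"))
```

The decompressor rebuilds the same dictionary on the fly, so only the initial alphabet has to be shared.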

Basic Compression: Grammar-based (BPE, Repair)

Replaces pairs of symbols by a new one, until no pair repeats twice

Adds a rule to a Dictionary.

Source sequence: A B C D E A B D E F D E D E F A B E C D

After DE → G: A B C G A B G F G G F A B E C D

After AB → H: H C G H G F G G F H E C D

After GF → I (final Repair sequence): H C G H I G I H E C D

Dictionary of rules: DE → G, AB → H, GF → I
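The pair-replacement loop can be sketched in a few lines. This is a simplified illustration under assumptions: the function name `repair` and the `fresh_symbols` parameter are hypothetical, pairs are counted naively (overlapping runs such as AAA are over-counted), and the quadratic loop ignores the priority-queue machinery a real implementation would use.

```python
from collections import Counter


def repair(seq, fresh_symbols):
    seq = list(seq)
    rules = {}
    fresh = iter(fresh_symbols)          # names for the new symbols
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:                     # no pair repeats: stop
            break
        new = next(fresh)
        rules[new] = pair                # add the rule to the dictionary
        out, i = [], 0
        while i < len(seq):              # replace left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


final, rules = repair("ABCDEABDEFDEDEFABECD", "GHI")
print("".join(final))
print(rules)
```

On the slide's sequence the most frequent pair is unique at every step, so the sketch picks DE, then AB, then GF.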

Basic Compression: Statistical compressors

Assign shorter codewords to the most frequent symbols

Must gather the frequency of each symbol c of the alphabet of S.

Compression is lower bounded by the (zero-order) empirical entropy of the sequence S:

$H_0(S) = \sum_{c} \frac{n_c}{n} \log_2 \frac{n}{n_c}$

n = number of symbols in S

nc = occurrences of symbol c

H0(S) <= log2 (alphabet size)

n H0(S) = lower bound of the size of S compressed with a zero-order compressor

Most representative method: Huffman coding
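As a quick check of the definition, the zero-order empirical entropy can be computed directly from the symbol counts. The function name `h0` is an assumption; the example string is a 16-symbol sequence with counts 9, 4, 2 and 1.

```python
from collections import Counter
from math import log2


def h0(seq):
    # H0(S) = sum over symbols c of (nc / n) * log2(n / nc)
    n = len(seq)
    return sum((nc / n) * log2(n / nc) for nc in Counter(seq).values())


s = "4144441424112344"
print(round(h0(s), 2))      # bits per symbol
print(h0(s) * len(s))       # lower bound in bits for a zero-order compressor
```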

Basic Compression: Statistical compressors: Huffman coding

Optimal prefix-free coding

No codeword is a prefix of another.

Decoding requires no look-ahead!

Asymptotically optimal: |Huffman(S)| <= n(H0(S)+1)

Typically using bit-wise codewords

Yet D-ary Huffman variants exist (D=256 byte-wise)

Builds a Huffman tree to generate codewords

Basic Compression: Statistical compressors: Huffman coding (example)

Sort symbols by frequency: S = ADBAAAABBBBCCCCDDEEE

Bottom-up tree construction

Branch labeling

Code assignment

Compression of sequence S = ADB…

ADB… → 01 000 10 …

[Figures: step-by-step Huffman tree construction, branch labeling and code assignment for S]
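The bottom-up construction can be sketched with a priority queue. This is an illustrative sketch, not the authors' code: the names are assumptions, symbols are assumed to be single characters, and ties are broken by insertion order, so the individual codewords may differ from the ones on the slide even though the total encoded length is the same.

```python
import heapq
from collections import Counter


def huffman_code(seq):
    freq = Counter(seq)
    # Heap entries: (weight, tie-breaker, subtree); a subtree is a symbol or a pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:                       # merge the two lightest subtrees
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
        next_id += 1
    codes = {}

    def walk(node, prefix):                    # branch labeling: 0 left, 1 right
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def huffman_decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:                             # prefix-free: no look-ahead needed
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


S = "ADBAAAABBBBCCCCDDEEE"
codes = huffman_code(S)
encoded = "".join(codes[c] for c in S)
print(codes, len(encoded))
```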

Basic Compression: Burrows-Wheeler Transform (BWT)

Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M with all circular rotations of S, (2) sorting the rows of M, and (3) taking the last column.

Rotations of S     Sorted rows
mississippi$       $mississippi
$mississippi       i$mississipp
i$mississipp       ippi$mississ
pi$mississip       issippi$miss
ppi$mississi       ississippi$m
ippi$mississ       mississippi$
sippi$missis       pi$mississip
ssippi$missi       ppi$mississi
issippi$miss       sippi$missis
sissippi$mis       sissippi$mis
ssissippi$mi       ssippi$missi
ississippi$m       ssissippi$mi

F = first column of the sorted rows = $iiiimppssss

L = BWT(S) = last column of the sorted rows = ipssm$pissii
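The three steps translate almost literally into code. This naive sketch (the name `bwt` is an assumption) materializes all rotations, so it takes quadratic space and is only meant to mirror the definition; practical tools derive the BWT from a suffix array instead.

```python
def bwt(s):
    # (1) all circular rotations, (2) sorted, (3) last column
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("mississippi$"))
```

The terminator `$` is assumed to be smaller than every other symbol, which holds for its character code.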

Basic Compression: Burrows-Wheeler Transform is reversible (BWT⁻¹)

Given L = BWT(S), we can recover S = BWT⁻¹(L)

Steps:

1. Sort L to obtain F

2. Build the LF mapping so that

If L[i] = ‘c’, and

k = the number of times ‘c’ occurs in L[1..i], and

j = position in F of the kth occurrence of ‘c’

Then set LF[i] = j

Example: L[7] = ‘p’ is the 2nd ‘p’ in L, so LF[7] = 8, which is the position of the 2nd occurrence of ‘p’ in F

i    Sorted row (F … L)   LF[i]
1    $mississippi         2
2    i$mississipp         7
3    ippi$mississ         9
4    issippi$miss         10
5    ississippi$m         6
6    mississippi$         1
7    pi$mississip         8
8    ppi$mississi         3
9    sippi$missis         11
10   sissippi$mis         12
11   ssippi$missi         4
12   ssissippi$mi         5

3. Recover the source sequence S in n steps:

Initially p = 6 (position of $ in L); i = 0; n = 12

In each step: S[n-i] = L[p]; p = LF[p]; i = i+1

Step i=0: S[12] = L[6] = ‘$’; p = LF[6] = 1; i = 1

Step i=1: S[11] = L[1] = ‘i’; p = LF[1] = 2; i = 2

…

After the 12 steps: S = m i s s i s s i p p i $
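The inversion can be sketched as below. Assumptions: the function name is hypothetical, positions are 0-based inside the code (the slides count from 1), and LF is obtained with a stable sort of the positions of L, which pairs the kth occurrence of each symbol in L with its kth occurrence in F.

```python
def inverse_bwt(L, end="$"):
    n = len(L)
    # order[j] = position in L of the symbol that lands at F[j] (stable sort).
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j
    p = L.index(end)              # start at the position of $ in L
    S = [None] * n
    for i in range(n):            # fill S from its last symbol backwards
        S[n - 1 - i] = L[p]
        p = LF[p]
    return "".join(S), LF


text, LF = inverse_bwt("ipssm$pissii")
print(text)
print([j + 1 for j in LF])        # shifted to the 1-based numbering of the slides
```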

Basic Compression: Bzip2, built on the Burrows-Wheeler Transform (BWT)

BWT. Many similar symbols appear adjacent

MTF (move-to-front).

Output the position of the current symbol within Σ′

Keep the alphabet Σ′ = {a,b,c,d,e,… } ordered so that the last used symbol is moved to the beginning of Σ′.

RLE (run-length encoding).

If a value (0) appears several times (000000: 6 times)

replace it by a pair <value,times>: <0,6>

Huffman stage.

Why does it work? In a text it is likely that “he” is preceded by “t”, “ssisii” by “i”, …
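The MTF and RLE stages can be sketched as follows. This follows the simplified description on the slide; the function names are assumptions, and the actual bzip2 format uses more elaborate run-length schemes than the plain <value,times> pairs shown here.

```python
def move_to_front(seq, alphabet):
    table = list(alphabet)
    out = []
    for sym in seq:
        pos = table.index(sym)             # position of the symbol in the table
        out.append(pos)
        table.insert(0, table.pop(pos))    # move the symbol to the front
    return out


def run_length(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1               # extend the current run
        else:
            runs.append([v, 1])            # start a new run
    return [tuple(r) for r in runs]


mtf = move_to_front("aaabbb", "ab")
print(mtf)
print(run_length(mtf))
```

Runs of equal symbols in the BWT output become runs of zeros after MTF, which is what the RLE and Huffman stages exploit.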

Sequences

We want to represent a sequence of elements compactly and to handle it efficiently.

(Who is in the 2nd position? How many Barts up to position 5? Where is the 3rd Bart?)

[Figure: a sequence of elements at positions 1 2 3 4 5 6 7 8 9]

Sequences: Plain Representation of Data

Given a sequence of

n integers

m = maximum value

We can represent it with n ⌈log2(m+1)⌉ bits

16 symbols x 3 bits per symbol = 48 bits → array of two 32-bit ints

Direct access (access to an integer + bit operations)

Position  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Symbol    4   1   4   4   4   4   1   4   2   4   1   1   2   3   4   4
Bits      100 001 100 100 100 100 001 100 010 100 001 001 010 011 100 100
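A fixed-width packed array along these lines might look as follows. The class name, the least-significant-bit-first layout inside each 32-bit word and the bit-by-bit loops are assumptions chosen for readability; a tuned version would read each value with one or two shifts and masks.

```python
class PackedArray:
    def __init__(self, values):
        self.n = len(values)
        # ceil(log2(m + 1)) bits per value, where m is the maximum value
        self.width = max(1, max(values).bit_length())
        nbits = self.n * self.width
        self.words = [0] * ((nbits + 31) // 32)    # array of 32-bit words
        for i, v in enumerate(values):
            for b in range(self.width):
                if (v >> b) & 1:
                    pos = i * self.width + b
                    self.words[pos >> 5] |= 1 << (pos & 31)

    def access(self, i):
        # Direct access: locate the word(s) and extract width bits (0-based i).
        v = 0
        for b in range(self.width):
            pos = i * self.width + b
            v |= ((self.words[pos >> 5] >> (pos & 31)) & 1) << b
        return v


seq = [4, 1, 4, 4, 4, 4, 1, 4, 2, 4, 1, 1, 2, 3, 4, 4]
pa = PackedArray(seq)
print(pa.width, len(pa.words), [pa.access(i) for i in range(len(seq))])
```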

Sequences: Compressed Representation of Data (H0)

Is it compressible?

H0(S) = 1.59 (bits per symbol)

Huffman: 1.62 bits per symbol

26 bits: No direct access!

(but we could add sampling)

Symbol            4   1   2   3
Occurrences (nc)  9   4   2   1
Huffman code      1   01  000 001

[Figure: Huffman tree. The root (16) has symbol 4 (9) on branch 1 and an inner node (7) on branch 0; that node has symbol 1 (4) on branch 1 and an inner node (3) on branch 0, whose children are symbol 2 (branch 0) and symbol 3 (branch 1).]

Symbol        4 1  4 4 4 4 1  4 2   4 1  1  2   3   4 4
Huffman bits  1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1   (26 bits)

Sequences: Summary: plain/compressed access/rank/select

Operations of interest:

Access(i): value of the ith symbol

Rank_s(i): number of occurrences of symbol s up to position i (count)

Select_s(i): where is the ith occurrence of symbol s? (locate)

Position        1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Symbol          4   1   4   4   4   4   1   4   2   4   1   1   2   3   4   4
Plain bits      100 001 100 100 100 100 001 100 010 100 001 001 010 011 100 100
First bit       1   4   7   10  13  16  19  22  25  28  31  34  37  40  43  46
Huffman bits    1   01  1   1   1   1   01  1   000 1   01  01  000 001 1   1

Bit Sequences: access/rank/select on bitmaps

Position   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
B =        0  1  0  0  1  1  0  1  1  0  0  0  0  0  0  1  0  0  0  0  0

rank1(6) = 3

rank0(10) = 5

select0(10) = 15

access(19) = 0

see [Navarro 2016]

Bit Sequences: Applications

Bitmaps are a basic part of most Compact Data Structures

Example: (We will see it later in the CSA)

S: AAABBCCCCCCCCDDDEEEEEEEEEEFG    n log σ bits
B: 1001010000000100100000000011    n bits
D: ABCDEFG                         σ log σ bits

Saves space: B and D together take n + σ log σ bits instead of n log σ bits

Fast access/rank/select is of interest !!

Where is the 2nd C?

How many Cs up to position k?

(See also the HDT bitmaps from Javi's talk.)
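The two questions can be answered from B and D alone. The sketch below assumes, as in the example, that S is sorted so each symbol forms a single run; all function names are hypothetical, positions are 1-based as on the slides, and rank/select are naive linear scans standing in for the constant-time structures.

```python
def build(S):
    # B marks the first position of every run; D lists the run symbols.
    B = [1 if i == 0 or S[i] != S[i - 1] else 0 for i in range(len(S))]
    D = [S[i] for i in range(len(S)) if B[i]]
    return B, D


def rank1(B, i):
    return sum(B[:i])                       # ones in B[1..i]


def select1(B, k):
    seen = 0
    for pos, bit in enumerate(B, 1):
        seen += bit
        if bit and seen == k:
            return pos
    return len(B) + 1                       # sentinel: one past the end


def access(B, D, i):
    return D[rank1(B, i) - 1]               # symbol of the run covering i


def select_symbol(B, D, c, k):
    # Position of the kth c (assumes c occurs at least k times).
    return select1(B, D.index(c) + 1) + k - 1


def rank_symbol(B, D, c, i):
    r = D.index(c) + 1
    start, end = select1(B, r), select1(B, r + 1)   # run occupies [start, end)
    return max(0, min(i + 1, end) - start)


B, D = build("AAABBCCCCCCCCDDDEEEEEEEEEEFG")
print(select_symbol(B, D, "C", 2), rank_symbol(B, D, "C", 10), access(B, D, 7))
```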

Bit Sequences: Reaching O(1) rank & o(n) bits of extra space

Jacobson, Clark, Munro

Variant by Fariña et al.

Assuming 32 bit machine-word

Step 1: Split the bitmap into superblocks of 256 bits, and store the number of 1s up to positions 1+256k (k = 0, 1, 2, …)

O(1) time to superblock. Space: n/256 superblocks and 1 int each

[Figure: superblocks 1–256 (35 bits set to 1), 257–512 (27 bits set to 1), 513–768 (45 bits set to 1), …; cumulative counter array Ds = 0, 35, 62, …]

Bit Sequences: Reaching O(1) rank & o(n) bits of extra space

Step 2: For each superblock of 256 bits

Divide it into 8 blocks of 32 bits each (machine word size)

Store the number of ones from the beginning of the superblock

O(1) time to the blocks, 8 blocks per superblock, 1 byte each

[Figure: the first superblock (bits 1–256, 35 bits set to 1) split into blocks 1–32 (4 bits set to 1), 33–64 (6 bits set to 1), …, 225–256 (8 bits set to 1); per-block counter array Db = 0, 4, 10, …, restarting at 0 in every superblock.]

Bit Sequences: Reaching O(1) rank & o(n) bits of extra space

Step 3: Rank within a 32 bit block

Finally solving:

rank1(D, p) = Ds[⌊p / 256⌋] + Db[⌊p / 32⌋] + rank1(blk, i)

where i = p mod 32 and blk is the 32-bit block that contains position p

Ex: rank1(D, 300) = 35 + 4 + 4 = 43

Yet, how to compute rank1(blk, i) in constant time?

blk = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1   (bit positions 1 to 32)

Bit Sequences: Reaching O(1) rank & o(n) bits of extra space

How to compute rank1(blk, i) in constant time?

Option 1: popcount within a machine word

Option 2: Universal table onesInByte (solution for each byte)

Only 256 entries storing values [0..8]

Finally, sum the onesInByte values of the 4 bytes in blk

Overall space: 1.375 n bits

Example: rank1(blk, 12)

blk  = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1

Shift 32 – 12 = 20 positions, so that only the first 12 bits of blk remain:

blks = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0

Val   binary     onesInByte
0     00000000   0
1     00000001   1
2     00000010   1
3     00000011   2
...   ...        ...
252   11111100   6
253   11111101   7
254   11111110   7
255   11111111   8
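Steps 1 to 3 can be put together as in the sketch below. It is an illustration under assumptions, not the authors' implementation: the class and method names are hypothetical, bits are stored least-significant-bit first (so the in-word step uses a mask where the slide uses a shift), and `rank1(p)` counts the ones among the first p bits, which equals the 1-based rank up to position p.

```python
ONES_IN_BYTE = [bin(v).count("1") for v in range(256)]   # universal table


def popcount32(word):
    # Option 2 of the slide: sum the table entries of the 4 bytes.
    return (ONES_IN_BYTE[word & 0xFF] + ONES_IN_BYTE[(word >> 8) & 0xFF]
            + ONES_IN_BYTE[(word >> 16) & 0xFF] + ONES_IN_BYTE[(word >> 24) & 0xFF])


class RankBitmap:
    def __init__(self, bits):
        self.n = len(bits)
        # One spare word keeps rank1(n) in range when n is a multiple of 32.
        self.words = [0] * (self.n // 32 + 1)
        for pos, bit in enumerate(bits):
            if bit:
                self.words[pos >> 5] |= 1 << (pos & 31)
        self.Ds, self.Db = [], []       # superblock and block counters
        total = in_super = 0
        for w, word in enumerate(self.words):
            if w % 8 == 0:              # a new superblock of 8 * 32 = 256 bits
                self.Ds.append(total)
                in_super = 0
            self.Db.append(in_super)    # at most 224, so it fits in one byte
            ones = popcount32(word)
            total += ones
            in_super += ones

    def rank1(self, p):
        in_word = self.words[p >> 5] & ((1 << (p & 31)) - 1)
        return self.Ds[p >> 8] + self.Db[p >> 5] + popcount32(in_word)

    def rank0(self, p):
        return p - self.rank1(p)


B = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
rb = RankBitmap(B)
print(rb.rank1(6), rb.rank0(10))
```

The extra space matches the slide's count: 32 bits per 256-bit superblock plus 8 bits per 32-bit block add 0.125 n + 0.25 n bits to the n bits of the bitmap.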

Bit Sequences: Select1 in O(log n) with the same structures

select1(p)

In practice, binary search using rank
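Select by binary search over rank can be sketched as follows. The helper name `select_by_rank` and its parameters are assumptions; it works with any monotone rank function, so the same routine serves select1 and select0. The prefix-sum list below is only a stand-in for a constant-time rank structure.

```python
from itertools import accumulate


def select_by_rank(rank, n, k):
    # Smallest 1-based position p with rank(p) == k, or None if there is none.
    if k < 1 or rank(n) < k:
        return None
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if rank(mid) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo


B = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
prefix = [0] + list(accumulate(B))          # prefix[p] = ones in B[1..p]


def rank1(p):
    return prefix[p]


def rank0(p):
    return p - prefix[p]


print(select_by_rank(rank0, len(B), 10), select_by_rank(rank1, len(B), 3))
```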

Bit Sequences: Compressed representations

Compressed Bit-Sequence representations exist !!

Compressed [Raman et al, 2002]

For very sparse bitmaps [Okanohara and Sadakane, 2007]

... see [Navarro 2016]

Integer Sequences: access/rank/select on general sequences

Position   1  2  3  4  5  6  7  8  9 10 11 12 13 14
S =        4  4  3  2  6  2  4  2  4  1  1  2  3  5

rank2(9) = 3

select4(3) = 7

access(13) = 3

see [Navarro 2016]

Integer Sequences: Wavelet tree (construction)

[Grossi et al 2003]

Given a sequence of symbols and an encoding

The bits of the code of each symbol are distributed along the different levels of the tree

SYMBOL  CODE
A       00
B       01
C       10
D       11

DATA: A B A C D A C (codes: 00 01 00 10 11 00 10)

WAVELET TREE

Root:       A B A C D A C  →  0 0 0 1 1 0 1
0-branch:   A B A A        →  0 1 0 0
1-branch:   C D C          →  0 1 0

Integer Sequences: Wavelet tree (select)

Searching for the 1st occurrence of ‘D’?

DATA: A B A C D A C (codes: A = 00, B = 01, C = 10, D = 11)

Broot: A B A C D A C  →  0 0 0 1 1 0 1
B0:    A B A A        →  0 1 0 0
B1:    C D C          →  0 1 0

Where is the 1st ‘1’ in B1? At pos 2: D is the 2nd bit in B1.

Where is the 2nd ‘1’ in Broot? At pos 5.

Integer Sequences: Wavelet tree (access)

Recovering Data: extracting the next symbol

Which symbol appears in the 6th position?

DATA: A B A C D A C (codes: A = 00, B = 01, C = 10, D = 11)

Broot: A B A C D A C  →  0 0 0 1 1 0 1
B0:    A B A A        →  0 1 0 0
B1:    C D C          →  0 1 0

Broot[6] is 0. How many ‘0’s are there up to pos 6? It is the 4th ‘0’.

Which bit occurs at position 4 in B0? It is set to 0.

The codeword read is ‘00’ → A

Integer Sequences: Wavelet tree (access)

Recovering Data: extracting the next symbol

Which symbol appears in the 7th position?

TEXT: A B A C D A C (codes: A = 00, B = 01, C = 10, D = 11)

Broot: A B A C D A C  →  0 0 0 1 1 0 1
B0:    A B A A        →  0 1 0 0
B1:    C D C          →  0 1 0

Broot[7] is 1. How many ‘1’s are there up to pos 7? It is the 3rd ‘1’.

Which bit occurs at position 3 in B1? It is set to 0.

The codeword read is ‘10’ → C

Integer Sequences: Wavelet tree (rank)

How many C’s are there up to position 7?

TEXT: A B A C D A C (codes: A = 00, B = 01, C = 10, D = 11)

Broot: A B A C D A C  →  0 0 0 1 1 0 1
B0:    A B A A        →  0 1 0 0
B1:    C D C          →  0 1 0

How many ‘1’s are there up to pos 7 in Broot? 3 (position 7 holds the 3rd ‘1’).

How many 0s up to position 3 in B1? 2 !!

Access and Rank: traverse the tree from the root down.

Select (locate symbol): traverse the tree from the leaf up.
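The three operations can be sketched on a pointer-based balanced wavelet tree. This is a teaching sketch under assumptions: the class layout is hypothetical, the alphabet is split in halves (which gives the codes A = 00, B = 01, C = 10, D = 11 for this example), queried symbols are assumed to belong to the alphabet, and the per-node bitmaps use naive linear rank/select instead of the o(n)-space structures.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else list(alphabet)
        self.bits = None                       # None marks a leaf
        if len(self.alphabet) <= 1:
            return
        half = (len(self.alphabet) + 1) // 2
        self.left_syms = set(self.alphabet[:half])
        self.bits = [0 if s in self.left_syms else 1 for s in seq]
        self.left = WaveletTree([s for s in seq if s in self.left_syms],
                                self.alphabet[:half])
        self.right = WaveletTree([s for s in seq if s not in self.left_syms],
                                 self.alphabet[half:])

    def _rank(self, b, i):                     # occurrences of bit b in bits[1..i]
        return sum(1 for x in self.bits[:i] if x == b)

    def _select(self, b, k):                   # position of the kth bit b
        count = 0
        for pos, x in enumerate(self.bits, 1):
            if x == b:
                count += 1
                if count == k:
                    return pos
        raise ValueError("fewer than k occurrences")

    def access(self, i):                       # top-down, 1-based
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i - 1]
        child = self.right if b else self.left
        return child.access(self._rank(b, i))

    def rank(self, sym, i):                    # top-down
        if self.bits is None:
            return i
        b = 0 if sym in self.left_syms else 1
        child = self.right if b else self.left
        return child.rank(sym, self._rank(b, i))

    def select(self, sym, k):                  # bottom-up (via recursion)
        if self.bits is None:
            return k
        b = 0 if sym in self.left_syms else 1
        child = self.right if b else self.left
        return self._select(b, child.select(sym, k))


wt = WaveletTree("ABACDAC")
print(wt.select("D", 1), wt.access(6), wt.access(7), wt.rank("C", 7))
```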

Integer Sequences: Wavelet tree (space and times)

Each level contains n + o(n) bits

Rank/select/access expected O(log σ) time

DATA: A B A C D A C (codes: A = 00, B = 01, C = 10, D = 11) → 00 01 00 10 11 00 10

WAVELET TREE

Level 1:  A B A C D A C      →  0 0 0 1 1 0 1        n + o(n) bits
Level 2:  A B A A | C D C    →  0 1 0 0 | 0 1 0      n + o(n) bits

Total: n ⌈log σ⌉ (1 + o(1)) bits

Integer Sequences: Huffman-shaped (or others) Wavelet tree

Using Huffman coding (or others) → unbalanced tree

Rank/select/access O(H0(S)) time

nH0(S) + o(n) bits

SYMBOL  CODE
A       1
B       000
C       01
D       001

DATA: A B A C D A C → 1 000 1 01 001 1 01

WAVELET TREE

Root:                 A B A C D A C  →  1 0 1 0 0 1 0
1-branch:             A A A (leaf)
0-branch:             B C D C        →  0 1 0 1
0-branch, 1-branch:   C C (leaf)
0-branch, 0-branch:   B D            →  0 1

A brief review about indexing

Inverted indexes are the best-known indexes for text […]

Suffix arrays are powerful but huge full-text indexes. Self-indexes trade performance for a more compact space

A brief Review about Indexing: Text indexing: well-known structures from the Web

Traditional indexes (with or without compression): auxiliary structure + explicit text

Inverted Indexes, Suffix Arrays,...

Compressed Self-indexes: implicit text

Wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, …


A brief Review about Indexing

Inverted indexes

Space-time trade-off

Vocabulary: DCC, communications, compression, image, data, information, Cliff, Lodge

Posting lists: one list per vocabulary word, either with full positional information (offsets into the indexed text) or doc-addressing (identifiers of the documents, Doc1 and Doc2, that contain the word)

Indexed text:

DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, [...] ... also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI).

Searches

- Word: posting list of that word

- Phrase: intersection of postings

Compression

- Indexed text (Huffman,...)

- Posting lists (Rice,...)

[Figure: vocabulary and posting lists pointing into the indexed text, shown once as a full-positional inverted index and once as a doc-addressing inverted index]
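The two search types above can be sketched as follows. This is an illustrative sketch under stated assumptions: the split of the DCC text into two documents, the tokenizer and all function names are hypothetical, and positions are word offsets rather than the character offsets a real index might store.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase alphanumeric tokens (an assumption; the slide does not fix a tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """word -> {doc_id: [word positions]}, i.e. full positional information."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs, start=1):
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def doc_postings(index, word):
    """Doc-addressing view: sorted ids of the documents containing the word."""
    return sorted(index.get(word, {}))

def intersect(a, b):
    """Merge-style intersection of two sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def phrase_search(index, phrase):
    """Return {doc_id: [start positions]} where the words of the phrase are consecutive."""
    words = tokenize(phrase)
    if not words:
        return {}
    docs = doc_postings(index, words[0])
    for w in words[1:]:
        docs = intersect(docs, doc_postings(index, w))
    result = {}
    for d in docs:
        starts = index[words[0]][d]
        for k, w in enumerate(words[1:], start=1):
            # Shift the k-th word's positions back by k so matches line up on the start.
            starts = intersect(starts, [p - k for p in index[w][d]])
        if starts:
            result[d] = starts
    return result

docs = [
    "DCC is held at the Cliff Lodge convention center. It is an international "
    "forum for current work on data compression and related applications.",
    "DCC addresses not only compression methods for specific types of data "
    "(text, image, video, audio, space, graphics, web content).",
]
index = build_positional_index(docs)
```

A word query is a single lookup in the vocabulary; a phrase query first intersects document lists and then checks positions only inside the surviving documents.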


A brief Review about Indexing

Inverted indexes

Lists contain increasing integers

Gaps between integers are smaller in the longest lists

Position                1   2   3   4   5   6   7   8   9  10  11  12
Original posting list   4  10  15  25  29  40  46  54  57  70  79  82
Differences             4   6   5  10   4  11   6   8   3  13   9   3

Var-length coding (complete decompression):

c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3

Absolute sampling + var-length coding (direct access, partial decompression):

4 c6 c5 c10 | 29 c11 c6 c8 | 57 c13 c9 c3
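The sampling scheme above can be sketched in a few lines. This is a hedged illustration, not the slide's exact encoder: the slide only says "var-length coding", so the byte-oriented code used here (`vbyte_encode`) is an assumption, as are the function names and the sampling period `k = 4` read off the example.

```python
def vbyte_encode(n):
    """Variable-byte code: 7 data bits per byte, high bit set on the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(buf, pos):
    """Decode one integer starting at buf[pos]; return (value, next position)."""
    n = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if b & 0x80:
            return n, pos
        shift += 7

def to_gaps(postings):
    """First value as is, then differences between consecutive values."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def compress(postings, k):
    """Every k-th value is kept as an absolute sample; the others are coded as gaps.
    offsets[j] is the position in data where the gaps of block j start."""
    samples, offsets, data = [], [], bytearray()
    for i, v in enumerate(postings):
        if i % k == 0:
            samples.append(v)
            offsets.append(len(data))
        else:
            data += vbyte_encode(v - postings[i - 1])
    return samples, offsets, bytes(data)

def access(samples, offsets, data, k, i):
    """Return the i-th posting (0-based), decoding at most k-1 gaps."""
    block, r = divmod(i, k)
    value, pos = samples[block], offsets[block]
    for _ in range(r):
        gap, pos = vbyte_decode(data, pos)
        value += gap
    return value

postings = [4, 10, 15, 25, 29, 40, 46, 54, 57, 70, 79, 82]
samples, offsets, data = compress(postings, 4)
```

Without samples, reaching the i-th posting means decoding every gap before it; with a sample every k entries, only the gaps inside one block are decoded, at the cost of storing the samples and their offsets.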


A brief Review about Indexing

Suffix Arrays

Sorting all the suffixes of T lexicographically

i     1   2   3   4   5   6   7   8   9  10  11  12
T =   a   b   r   a   c   a   d   a   b   r   a   $
A =  12  11   8   1   4   6   9   2   5   7  10   3

Suffixes in the order given by A:

A[1]  = 12  $
A[2]  = 11  a$
A[3]  =  8  abra$
A[4]  =  1  abracadabra$
A[5]  =  4  acadabra$
A[6]  =  6  adabra$
A[7]  =  9  bra$
A[8]  =  2  bracadabra$
A[9]  =  5  cadabra$
A[10] =  7  dabra$
A[11] = 10  ra$
A[12] =  3  racadabra$
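As a quick check of the definition, the array A can be reproduced with a naive construction that sorts suffix start positions by the suffix they denote. This is only a sketch: the function name is illustrative, positions are 1-based to match the slide, and the approach relies on `$` comparing smaller than the letters (true for ASCII). Practical builders use far faster algorithms than sorting whole suffixes.

```python
def suffix_array(t):
    """Naive suffix array: 1-based start positions of the suffixes of t in sorted order."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

T = "abracadabra$"
A = suffix_array(T)
print(A)  # [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
```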


A brief Review about Indexing

Suffix Arrays

Binary search for any pattern: “ab”

P = a b

i     1   2   3   4   5   6   7   8   9  10  11  12
T =   a   b   r   a   c   a   d   a   b   r   a   $
A =  12  11   8   1   4   6   9   2   5   7  10   3

Noccs = (4-3)+1 = 2

Occs (locations) = A[3] .. A[4] = {8, 1}

Fast: O(m lg n) to count, O(m lg n + noccs) to locate

Space: O(4n) + |T|
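The search can be sketched with two binary searches that delimit the range of suffixes starting with P. The function name and the 1-based inclusive return convention are assumptions chosen to match the slide's A[3] .. A[4]; each comparison looks at up to m characters, which gives the O(m lg n) bound.

```python
def sa_range(t, sa, p):
    """Return the 1-based inclusive range (lo, hi) of entries of sa whose suffix
    starts with p, or None if p does not occur in t."""
    m = len(p)

    def prefix(k):
        # First m characters of the suffix referenced by the k-th entry (0-based k).
        start = sa[k] - 1
        return t[start:start + m]

    # Lower bound: first entry whose m-prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first entry whose m-prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    last = lo  # exclusive, 0-based

    if first == last:
        return None
    return first + 1, last

T = "abracadabra$"
A = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
lo, hi = sa_range(T, A, "ab")
occs = A[lo - 1:hi]   # positions in T where "ab" starts
noccs = hi - lo + 1
```

Counting needs only the two bounds; locating additionally reads the noccs entries of A inside the range.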


A brief Review about Indexing

BWT → FM-index

BWT(S) + other structures → it is an index

• C[c]: for each char c in S, stores the number of occurrences in S of the chars that are lexicographically smaller than c.

C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8

• Occ(c, k): number of occurrences of char c in the prefix of L: L[1..k]

For k in [1..12]:

Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1

Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4

Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1

Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2

Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4

• Char L[i] occurs in F at position LF(i):

LF(i) = C[L[i]] + Occ(L[i], i)
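The three definitions can be checked with a small sketch. The example string is an assumption: S = "mississippi$" is not named on this slide, but it reproduces the C and Occ values listed above. Function names are illustrative, the BWT is built by sorting rotations (fine for a toy input only), and Occ is computed by scanning rather than with a precomputed table.

```python
def bwt(s):
    """L = last column of the sorted rotations of s (s ends with a unique smallest '$')."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def build_C(s):
    """C[c] = number of characters in s that are lexicographically smaller than c."""
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return C

def occ(L, c, k):
    """Occurrences of c in L[1..k] (1-based, inclusive)."""
    return L[:k].count(c)

def LF(L, C, i):
    """Position in F of the character L[i] (1-based)."""
    c = L[i - 1]
    return C[c] + occ(L, c, i)

S = "mississippi$"   # assumed example string, consistent with the tables above
L = bwt(S)
C = build_C(S)
F = "".join(sorted(S))
```

For every i, F[LF(i)] is the same character as L[i], which is what lets the index walk the text backwards using only L, C and Occ.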


A brief Review about Indexing

BWT → FM-index

Count (S[1,u], P[1,p])

• Count (S, “issi”)

[Figure: backward search over the pattern “issi”, one character per step, from the last character to the first]

C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8

Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1

Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4

Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1

Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2

Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
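A minimal sketch of Count as the usual FM-index backward search, using the C values above and an L string that matches the Occ tables. The variable names `sp` and `ep` and the scanning `occ` are assumptions for illustration; a real index answers Occ in constant or logarithmic time.

```python
L = "ipssm$pissii"                      # BWT consistent with the Occ tables above
C = {"$": 0, "i": 1, "m": 5, "p": 6, "s": 8}

def occ(c, k):
    """Occurrences of c in L[1..k] (1-based, inclusive); occ(c, 0) is 0."""
    return L[:k].count(c)

def count(p):
    """Number of occurrences of pattern p, processing p from its last character."""
    sp, ep = 1, len(L)                  # current range of rows, 1-based inclusive
    for c in reversed(p):
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp - 1) + 1
        ep = C[c] + occ(c, ep)
        if sp > ep:
            return 0
    return ep - sp + 1

# Ranges visited for "issi": [2,5] -> [9,10] -> [11,12] -> [4,5]
print(count("issi"))  # 2
```

Each pattern character costs two Occ queries, so counting takes a number of steps proportional to the pattern length, independent of the text length.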


A brief Review about Indexing

BWT → FM-index

Representing L with a wavelet tree, Occ is “compressed”: Occ(c, k) is answered as rank_c(L, k)




(Thanks: slides partially by Susana Ladra, E. Rodríguez and José R. Paramá)