10/7/16
Advanced Algorithmics (6EAP) MTAT.03.238
Succinct Trees
Jaak Vilo
Thanks to S. Srinivasa Rao
2015
succinct
Suppose that Z is the information-theoretical optimal number of bits needed to store some data. A representation of this data is called
• implicit if it takes Z + O(1) bits of space,
• succinct if it takes Z + o(Z) bits of space, and
• compact if it takes O(Z) bits of space.
http://en.wikipedia.org/wiki/Succinct_data_structure
• In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. The concept was originally introduced by Jacobson [1] to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, in which the size of the data structure depends upon the particular data being represented.
Succinct Representations of Trees
S. Srinivasa Rao
Seoul National University
Outline
• Succinct data structures
  - Introduction
  - Examples
• Tree representations
  - Motivation
  - Heap-like representation
  - Jacobson's representation
  - Parenthesis representation
  - Partitioning method
  - Comparison and Applications
• Rank and Select on bit vectors
Succinct data structures
• Goal: represent the data in close to optimal space, while supporting the operations efficiently.
  (optimal = information-theoretic lower bound)
  Introduced by [Jacobson, FOCS '89]
• An "extension" of data compression.
  (Data compression:
  - Achieve close to optimal space
  - Queries need not be supported efficiently)
Applications
• Potential applications where
  - memory is limited: small memory devices like PDAs, mobile phones etc.
  - massive amounts of data: DNA sequences, geographical/astronomical data, search engines etc.
Examples
• Trees, Graphs
• Bit vectors, Sets
• Dynamic arrays
• Text indexes
  - suffix trees/suffix arrays etc.
• Permutations, Functions
• XML documents, File systems (labeled, multi-labeled trees)
• DAGs and BDDs
• …
Example: Text Indexing
• A text string T of length n over an alphabet Σ can be represented using
  - n log |Σ| + o(n log |Σ|) bits
    (or even the k-th order entropy of T)
  to support the following pattern matching queries (given a pattern P of length m):
  - count the number of occurrences of P in T,
  - report all the occurrences of P in T,
  - output a substring of T of given length
  in almost optimal time.
Example: Compressed Suffix Trees
• Given a text string T of length n over an alphabet Σ, one can store it using O(n log |Σ|) bits, to support all the operations supported by a standard suffix tree, such as pattern matching queries, suffix links, string depths, lowest common ancestors etc., with a slight slowdown.
• Note that standard suffix trees use O(n log n) bits.
Example: Permutations
A permutation p of 1,…,n

A simple representation:
• n lg n bits
  - p(i) in O(1) time
  - p^{-1}(i) in O(n) time

Succinct representation:
• (1+ε) n lg n bits
  - p(i) in O(1) time
  - p^{-1}(i) in O(1/ε) time (`optimal' trade-off)
  - p^k(i) in O(1/ε) time (for any positive or negative integer k)
• lg(n!) + o(n) (< n lg n) bits (optimal space)
  - p^k(i) in O(lg n / lg lg n) time

Example:
i:    1 2 3 4 5 6 7 8
p(i): 6 5 2 8 1 3 4 7
p(1)=6, p^{-1}(1)=5, p^2(1)=3, p^{-2}(1)=2, …
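The trade-off between extra space and inverse time can be illustrated with a small sketch. The idea assumed here (not spelled out in the slides) is to walk each cycle of p and, at every t-th element, store a pointer back to the element t steps earlier; an inverse query then needs roughly 2t steps instead of a full cycle traversal. The names `build_back_pointers`, `inverse`, `power` and the parameter `t` are illustrative only, the code works on 0-indexed lists, and `power` is a naive loop rather than the O(1/ε) method.

```python
def build_back_pointers(p, t):
    """For a 0-indexed permutation p, return {element: element t steps earlier on its cycle}.

    Only every t-th element of each cycle gets a pointer (assumption: t >= 1).
    """
    back = {}
    seen = [False] * len(p)
    for start in range(len(p)):
        if seen[start]:
            continue
        cycle = []
        j = start
        while not seen[j]:          # collect the cycle containing `start`
            seen[j] = True
            cycle.append(j)
            j = p[j]
        marked = cycle[::t]         # cycle positions 0, t, 2t, ...
        for idx, elem in enumerate(marked):
            back[elem] = marked[idx - 1]   # position 0 wraps around to the last marked element
    return back


def inverse(p, back, i):
    """Return j with p[j] == i using forward steps and a single back pointer."""
    j = i
    while j not in back:            # walk forward to the next marked element
        j = p[j]
    j = back[j]                     # jump behind i on the cycle
    while p[j] != i:                # walk forward to the predecessor of i
        j = p[j]
    return j


def power(p, back, i, k):
    """Naive p^k(i): apply p (or its inverse) |k| times."""
    for _ in range(abs(k)):
        i = p[i] if k > 0 else inverse(p, back, i)
    return i


# The permutation from the slide, shifted to 0-based indices.
SLIDE_P = [x - 1 for x in [6, 5, 2, 8, 1, 3, 4, 7]]
SLIDE_BACK = build_back_pointers(SLIDE_P, t=2)
```

With 1-based numbering restored, `inverse(SLIDE_P, SLIDE_BACK, 0) + 1` corresponds to p^{-1}(1) and `power(SLIDE_P, SLIDE_BACK, 0, 2) + 1` to p^2(1). A larger `t` stores fewer pointers but makes each inverse query walk further.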
Memory model
• Word RAM model with word size Θ(log n) supporting
  - read/write
  - addition, subtraction, multiplication, division
  - left/right shifts
  - AND, OR, XOR, NOT
  operations on words in constant time.
  (n is the "problem size")
Succinct Tree Representations
Motivation
Trees are used to represent:
- Directories (Unix, all the rest)
- Search trees (B-trees, binary search trees, digital trees or tries)
- Graph structures (we do a tree based search)
- Search indexes for text (including DNA)
- Suffix trees
- XML documents
- …
Drawbacks of standard representations
• Standard representations of trees support very few operations. To support other useful queries, they require a large amount of extra space.
• In various applications, one would like to support operations like "subtree size" of a node, "least common ancestor" of two nodes, "height", "depth" of a node, "ancestor" of a node at a given level etc.
Drawbacks of standard representations
• The space used by the tree structure could be the dominating factor in some applications.
  - E.g. more than half of the space used by a standard suffix tree representation is used to store the tree structure.
• "A pointer-based implementation of a suffix tree requires more than 20n bytes. A more sophisticated solution uses at least 12n bytes in the worst case, and about 8n bytes in the average. For example, a suffix tree built upon 700Mb of DNA sequences may take 40Gb of space."
  -- Handbook of Computational Molecular Biology, 2006
Standard representation
Binary tree: each node has two pointers to its left and right children.
An n-node tree takes 2n pointers or 2n lg n bits (can be easily reduced to n lg n + O(n) bits).
Supports finding left child or right child of a node (in constant time).
For each extra operation (e.g. parent, subtree size) we have to pay, roughly, an additional n lg n bits.
Can we improve the space bound?
• There are fewer than 2^{2n} distinct binary trees on n nodes.
  - "The Art of Computer Programming", Volume 4, Fascicle 4: Generating all trees
  - http://www-cs-faculty.stanford.edu/~uno/fasc4a.ps.gz
• 2n bits are enough to distinguish between any two different binary trees.
• Can we represent an n node binary tree using 2n bits?
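The counting argument can be checked numerically. The number of distinct binary trees on n nodes is the Catalan number C_n = binom(2n, n) / (n + 1) (a standard fact, not stated on the slide), and the sketch below compares it with 2^{2n}. The function names are illustrative.

```python
import math


def count_binary_trees(n):
    """Catalan number C_n = binom(2n, n) / (n + 1): binary trees on n nodes."""
    return math.comb(2 * n, n) // (n + 1)


def bits_needed(n):
    """ceil(log2(C_n)): fewest bits that give every n-node binary tree a distinct code."""
    return (count_binary_trees(n) - 1).bit_length()


# Compare the information-theoretic minimum with the 2n-bit budget.
for n in (1, 8, 64, 1024):
    print(n, bits_needed(n), 2 * n)
```

The gap between `bits_needed(n)` and 2n grows only logarithmically in n, which is why 2n bits is regarded as essentially optimal here.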
http://cs.lmu.edu/~ray/notes/binarytrees/
Heap-like notation for a binary tree
[Figure: a binary tree with external nodes added; internal nodes are labeled 1 and external nodes 0]
Add external nodes
Label internal nodes with a 1 and external nodes with a 0
Write the labels in level order:
1 1 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0
One can reconstruct the tree from this sequence.
An n node binary tree can be represented in 2n+1 bits.
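A minimal sketch of this encoding follows. A tree is modeled as `None` (external node) or a `(left, right)` tuple, which is an assumption made for illustration. The navigation formulas used (left child at position 2·rank1(x), right child at 2·rank1(x)+1, parent at select1(⌊x/2⌋), with 1-based positions) are the usual ones for this level-order encoding and are not stated on the slide; `rank1` and `select1` are plain linear scans here, not constant-time structures.

```python
from collections import deque


def encode_level_order(root):
    """Return the level-order bit list: 1 for an internal node, 0 for an external one."""
    bits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            bits.append(0)
        else:
            bits.append(1)
            queue.append(node[0])   # left subtree
            queue.append(node[1])   # right subtree
    return bits


def rank1(bits, i):
    """Number of 1s among positions 1..i (1-based, inclusive)."""
    return sum(bits[:i])


def select1(bits, j):
    """1-based position of the j-th 1."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        count += bit
        if bit and count == j:
            return pos
    raise ValueError("fewer than j ones in the bit vector")


def left_child(bits, x):
    return 2 * rank1(bits, x)       # the bit there is 0 if x has no left child


def right_child(bits, x):
    return 2 * rank1(bits, x) + 1   # the bit there is 0 if x has no right child


def parent(bits, x):
    return select1(bits, x // 2)    # valid for x >= 2


# A tree shaped so that it yields the sequence from the slide.
LEAF = (None, None)
SLIDE_TREE = (((None, LEAF), None), (LEAF, (LEAF, None)))
SLIDE_BITS = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
```

A node is identified by the position of its 1 bit, so `left_child(SLIDE_BITS, 1)` is the position of the root's left child, and a 0 at the returned position means the child is absent.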
• If the bit vector is read-only, any index (auxiliary structure) that supports rank or select in constant time (in fact in O(log m) bit probes) has size Ω(m log log m / log m).
  [Miltersen '05] [Golynski '06]
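To make the notion of an index over a read-only bit vector concrete, here is a deliberately simple sketch: it stores one cumulative count per block and scans inside a block. It assumes the usual definitions (rank1(i) is the number of 1s among the first i bits, select1(j) is the position of the j-th 1), which the slides do not spell out. The class name `SampledRank` and the `block` parameter are illustrative, and this one-level scheme is neither constant-time nor o(m) bits for a fixed block size.

```python
class SampledRank:
    """Read-only bit list plus one cumulative 1-count per block."""

    def __init__(self, bits, block=64):
        self.bits = bits
        self.block = block
        self.samples = [0]          # samples[b] = number of 1s before block b
        for start in range(0, len(bits), block):
            self.samples.append(self.samples[-1] + sum(bits[start:start + block]))

    def rank1(self, i):
        """Number of 1s among the first i bits, for 0 <= i <= len(bits)."""
        b = i // self.block
        return self.samples[b] + sum(self.bits[b * self.block:i])

    def select1(self, j):
        """1-based position of the j-th 1, found by binary search on rank1."""
        if j < 1 or j > self.rank1(len(self.bits)):
            raise ValueError("j is out of range")
        lo, hi = 1, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid) >= j:
                hi = mid
            else:
                lo = mid + 1
        return lo
```

The constant-time structures cited in these slides refine this idea with a second level of smaller blocks and lookup tables; the sketch only shows the division between the stored bits and the auxiliary counts.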
Space measures
• Bit-vector (BV):
  - space used should be m + o(m) bits.
• Bit-vector index:
  - bit-sequence stored in read-only memory
  - index of o(m) bits to assist operations
• Compressed bit-vector: with n 1's
  - space used should be B(m,n) + o(m) bits.
$B(m,n) = \left\lceil \log \binom{m}{n} \right\rceil$
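To get a feel for how B(m,n) compares with the m bits of the plain bit vector, the bound can be evaluated directly. This is a small sketch assuming base-2 logarithms; the function name `compressed_bound` is illustrative.

```python
import math


def compressed_bound(m, n):
    """B(m, n) = ceil(log2(binom(m, n))) for a length-m bit vector with n ones."""
    return (math.comb(m, n) - 1).bit_length()


# Sparse vectors need far fewer than m bits; balanced ones need almost m.
for m, n in ((1024, 16), (1024, 512)):
    print(m, n, compressed_bound(m, n))
```

The bound is largest when n is about m/2 and shrinks quickly as the vector becomes sparse or dense, which is the case the compressed bit-vector measure is meant to capture.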
Results on Bitvectors
• Elias (JACM 74)
• Jacobson (FOCS 89)
• Clark+Munro (SODA 96)
• Pagh (SICOMP 01)
• Raman et al. (SODA 02)
• Miltersen (SODA 04)
• Golynski (ICALP 06)
• Gupta et al.
Implementations:
  - Geary et al. (TCS 06)
  - Kim et al. (WEA 05)
  - Delpratt et al. (WEA 06, SOFSEM 07)
  - Okanohara+Sadakane (ALENEX 07)
(Entry in Encyclopaedia of Algorithms)
Ordered trees
A rooted ordered tree (on n nodes):
[Figure: a rooted ordered tree showing a node x with parent a, first child b and next sibling c]
Navigational operations:
- parent(x) = a
- first child(x) = b
- next sibling(x) = c
Other useful operations:
- degree(x) = 2
- subtree size(x) = 4
Ordered trees
• A binary tree representation taking 2n+o(n) bits that supports parent, left child and right child operations in constant time.
• There is a one-to-one correspondence between binary trees (on n nodes) and rooted ordered trees (on n+1 nodes).
• Gives an ordered tree representation taking 2n+o(n) bits that supports first child, next sibling (but not parent) operations in constant time.
• We will now consider ordered tree representations that support more operations.
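The correspondence can be sketched with the usual first-child/next-sibling transformation, which is assumed here because the slides do not name the mapping. An ordered-tree node is modeled as the list of its children, and a binary tree as `None` or a `(left, right)` tuple; both models and all function names are illustrative.

```python
def forest_to_binary(children):
    """Map an ordered list of sibling subtrees to a binary tree.

    left  = subtree built from the first sibling's own children (first child)
    right = subtree built from the remaining siblings (next sibling)
    """
    if not children:
        return None
    first, rest = children[0], children[1:]
    return (forest_to_binary(first), forest_to_binary(rest))


def binary_to_forest(node):
    """Inverse mapping: rebuild the ordered list of sibling subtrees."""
    if node is None:
        return []
    left, right = node
    return [binary_to_forest(left)] + binary_to_forest(right)


def ordered_size(children):
    """Number of nodes in an ordered tree whose root has these children."""
    return 1 + sum(ordered_size(c) for c in children)


def binary_size(node):
    """Number of internal nodes of a binary tree."""
    return 0 if node is None else 1 + binary_size(node[0]) + binary_size(node[1])


# Root with three children: one with two leaf children, a leaf, one with one leaf child.
EXAMPLE_TREE = [[[], []], [], [[]]]
EXAMPLE_BINARY = forest_to_binary(EXAMPLE_TREE)
```

The root of the ordered tree has no counterpart in the binary tree, which accounts for the n+1 versus n node counts. Under this mapping, first child and next sibling become left child and right child, while the ordered-tree parent is not a single binary-tree step.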
Level-order degree sequence
[Figure: a rooted ordered tree on 12 nodes, each node labeled with its degree]
Write the degree sequence in level order:
3 2 0 3 0 1 0 2 0 0 0 0
But, this still requires n lg n bits
Solution: write them in unary
1 1 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0
Takes 2n-1 bits
A tree is uniquely determined by its degree sequence
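The unary encoding and the reconstruction claim can be sketched as follows. Each degree d is written as d ones followed by a zero, and the tree is rebuilt by handing out child identifiers in level order. The function names and the convention of numbering nodes 0..n-1 in level order are assumptions made for illustration.

```python
def degrees_to_unary(degrees):
    """Write each degree d as d ones followed by a single zero."""
    bits = []
    for d in degrees:
        bits.extend([1] * d)
        bits.append(0)
    return bits


def unary_to_degrees(bits):
    """Inverse of degrees_to_unary: count the ones before each zero."""
    degrees, run = [], 0
    for bit in bits:
        if bit:
            run += 1
        else:
            degrees.append(run)
            run = 0
    return degrees


def tree_from_degrees(degrees):
    """Rebuild child lists; node v is the v-th node in level order (0-based)."""
    children = []
    next_id = 1                     # node 0 is the root
    for d in degrees:
        children.append(list(range(next_id, next_id + d)))
        next_id += d
    return children


SLIDE_DEGREES = [3, 2, 0, 3, 0, 1, 0, 2, 0, 0, 0, 0]
SLIDE_UNARY = [1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
```

The bit count is (n-1) ones, one per non-root node, plus n zeros, one per node, which gives the 2n-1 bits stated above.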
• level ancestor
• LCA
• leftmost/rightmost leaf
• number of leaves in the subtree
• next node in the level
• pre/post order number
• i-th child
[Munro-Raman '97] [Munro et al. '01] [Sadakane '03] [Lu-Yeh '08]
[Implementation: Geary et al., CPM-04]
A different approach
• If we group k nodes into a block, then pointers within the block can be stored using only lg k bits.
• For example, if we can partition the tree into n/k blocks, each of size k, then we can store it using (n/k) lg n + (n/k) k lg k = (n/k) lg n + n lg k bits.
A careful two-level `tree covering' method achieves a space bound of 2n+o(n) bits.
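The one-level space formula is easy to evaluate for concrete values. The sketch below simply plugs numbers into (n/k) lg n + n lg k and compares the result with the 2n lg n bits of the pointer-based representation; the function names and the sample values of n and k are illustrative, and one pointer per block and per node is assumed, as in the formula.

```python
import math


def pointer_bits(n):
    """Two lg n-bit pointers per node: 2n lg n bits."""
    return 2 * n * math.log2(n)


def blocked_bits(n, k):
    """(n/k) lg n bits for inter-block pointers plus n lg k bits inside blocks."""
    inter_block = (n / k) * math.log2(n)
    intra_block = (n / k) * k * math.log2(k)
    return inter_block + intra_block


n = 2 ** 20
for k in (4, 16, 64):
    print(k, blocked_bits(n, k), pointer_bits(n))
```

Larger blocks shrink the inter-block term but raise the n lg k term, so a single level cannot reach 2n + o(n) bits on its own; that is where the two-level tree covering method comes in.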