Introduction to Tree Grammar Compression Yoh Okuno
May 22, 2015
Summary
● Trie contains repeated substructures.
● Idea: share frequent patterns inside trie.
● Approach: extend grammar compression.
● TreeRePair: detect frequent pairs recursively.
Method  TreeRePair  Succinct  Pointer
Size    1/43        1/7       1
Time    14          3         1
[Lohrey 2011]
Background: Repeated Substructures in Trie
● going
● having
● making

● pretended
● attendance
● tendency
RePair: String Grammar Compression
Represent a string by a small context-free grammar that generates only that string.

Initialize the grammar to the input
Loop:
    Find the most frequent pair of adjacent symbols
    Replace the pair by a new symbol
    Add a new rule to the grammar G
Until no repeated pair exists
[Larsson 2000]
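The loop above can be sketched in Python as follows. This is a minimal illustration, not the linear-time algorithm of [Larsson 2000]: it recounts all pairs on every iteration, and the names `count_pairs`, `replace_pair`, `repair`, and `expand`, as well as the `("R", n)` naming of new symbols, are assumptions made for this example.

```python
from collections import Counter


def count_pairs(seq):
    """Count adjacent pairs, skipping overlaps such as the middle of 'aaa'."""
    counts = Counter()
    last = {}  # pair -> index of its most recently counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last.get(pair) == i - 1:
            continue  # overlaps the occurrence counted one position earlier
        counts[pair] += 1
        last[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Return (sequence, rules); each rule maps a new symbol to a pair."""
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no repeated pair exists
        symbol = ("R", len(rules))  # assumed naming; tuples never clash with input characters
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


def expand(seq, rules):
    """Regenerate the original string from a sequence and its rules."""
    return "".join(expand(rules[s], rules) if s in rules else s for s in seq)
```

For the input `abababab`, this sketch first replaces the pair (a, b) and then the pair of the resulting new symbols, leaving a sequence of two symbols and two rules.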
TreeRePair: Tree Grammar Compression [Lohrey 2011]
Represent a tree by a small context-free tree grammar that generates only that tree.

Initialize the grammar to the input
Loop:
    Find the most frequent pair of adjacent nodes
    Replace the pair by a new symbol
    Add a new rule to the grammar G
Until no repeated pair exists
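A rough Python sketch of the two inner steps, counting pairs of adjacent nodes and replacing one pair, is given here. It assumes trees are nested tuples `(label, child, ...)` and that a pair is the triple (parent label, child index, child label). These representations and the function names are assumptions for illustration, not the data structures of [Lohrey 2011]. The naive count also overcounts overlapping same-label pairs such as (f, 1, f), which a real implementation has to handle.

```python
from collections import Counter


def chain_tree():
    """Build f(a,f(c,f(a,f(c,f(a,f(c,f(a,f(c,b)))))))) as nested tuples."""
    tree = ("b",)
    for symbol in "cacacaca":  # innermost left child first
        tree = ("f", (symbol,), tree)
    return tree


def count_digrams(tree, counts=None):
    """Count (parent label, child index, child label) triples, overlaps included."""
    if counts is None:
        counts = Counter()
    label, *children = tree
    for index, child in enumerate(children):
        counts[(label, index, child[0])] += 1
        count_digrams(child, counts)
    return counts


def replace_digram(tree, digram, new_label):
    """Replace occurrences of digram bottom-up by a node labelled new_label.

    The new node keeps the remaining children of the parent and all children
    of the absorbed child, so it corresponds to a rule with parameters.
    """
    parent, index, child_label = digram
    label, *children = tree
    children = [replace_digram(c, digram, new_label) for c in children]
    if label == parent and index < len(children) and children[index][0] == child_label:
        absorbed = children[index]
        merged = children[:index] + list(absorbed[1:]) + children[index + 1:]
        return (new_label, *merged)
    return (label, *children)


def show(tree):
    """Render a nested-tuple tree in the f(a,b) notation used on the slides."""
    label, *children = tree
    if not children:
        return label
    return f"{label}({','.join(show(c) for c in children)})"
```

Applying `replace_digram` to `chain_tree()` with the pairs (f, 0, a), (f, 0, c), (A, 0, B), and (C, 0, C) in turn reproduces the right-hand sides of S shown in Example 1.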
Example 1 - Step 1
[Figure: input tree]
S -> f(a,f(c,f(a,f(c,f(a,f(c,f(a,f(c,b))))))))
Example 1 - Step 2
[Figure: tree for each grammar below]
S -> A(f(c,A(f(c,A(f(c,A(f(c,b))))))))
A(y) -> f(a,y)

S -> A(B(A(B(A(B(A(B(b))))))))
A(y) -> f(a,y)
B(y) -> f(c,y)
Example 1 - Step 3
[Figure: tree for each grammar below]
S -> A(B(A(B(A(B(A(B(b))))))))
A(y) -> f(a,y)
B(y) -> f(c,y)

S -> C(C(C(C(b))))
A(y) -> f(a,y)
B(y) -> f(c,y)
C(y) -> A(B(y))
Example 1 - Step 4
[Figure: tree for each grammar below]
S -> C(C(C(C(b))))
A(y) -> f(a,y)
B(y) -> f(c,y)
C(y) -> A(B(y))

S -> D(D(b))
A(y) -> f(a,y)
B(y) -> f(c,y)
C(y) -> A(B(y))
D(y) -> C(C(y))
Example 1 - Finished
[Figure: final tree D(D(b))]
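As a sanity check, the final grammar of Example 1 can be expanded back into the input tree. The sketch below assumes nested tuples for trees and the bare string `"y"` for the rule parameter. `RULES`, `START`, `substitute`, `expand`, and `show` are names chosen for this illustration only.

```python
# Rules of Example 1; the bare string "y" marks the single parameter.
RULES = {
    "A": ("f", ("a",), "y"),
    "B": ("f", ("c",), "y"),
    "C": ("A", ("B", "y")),
    "D": ("C", ("C", "y")),
}
START = ("D", ("D", ("b",)))  # S -> D(D(b))


def substitute(body, arg):
    """Replace the parameter y in a rule body by the argument tree."""
    if body == "y":
        return arg
    label, *children = body
    return (label, *(substitute(c, arg) for c in children))


def expand(tree):
    """Apply rules until only terminal symbols (f, a, c, b) remain."""
    label, *children = tree
    if label in RULES:
        (arg,) = children  # every rule here takes exactly one argument
        return expand(substitute(RULES[label], arg))
    return (label, *(expand(c) for c in children))


def show(tree):
    """Render a nested-tuple tree in the f(a,b) notation used on the slides."""
    label, *children = tree
    if not children:
        return label
    return f"{label}({','.join(show(c) for c in children)})"


def count_nodes(tree):
    """Number of nodes in a nested-tuple tree."""
    return 1 + sum(count_nodes(c) for c in tree[1:])
```

Here `show(expand(START))` gives back the right-hand side of S from Step 1, a tree of 17 nodes, while the start rule S -> D(D(b)) has only 3 nodes.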
Experiment
Dataset: average over 24 XML documents
Time: pre-order traversal time
Size: memory consumption
OSS: https://code.google.com/p/treerepair/

Method     TreeRePair  DAG    Succinct  Pointer
Time (ms)  771         3,220  164       56
Size (KB)  463         3,070  2,724     19,995
[Lohrey 2011]
DAG as Tree Grammar
● Sharing subtrees can compress a trie
● The result is a DAG (directed acyclic graph)
● Can be seen as a kind of tree grammar

[Figure: DAG with nodes a, b, c, d, e, f]
S -> a(b(D), c(D))
D -> d(e,f)
Symbols have no parameters
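One common way to obtain such a DAG is to give identical subtrees the same identifier (hash-consing). The sketch below assumes nested tuples for trees; `to_dag` and `count_nodes` are hypothetical helper names, and this is not claimed to be the construction used in the experiments.

```python
def to_dag(tree, table):
    """Return the id of tree, registering each distinct subtree once.

    table maps (label, child ids...) to a node id, so two identical
    subtrees always receive the same id and are stored only once.
    """
    label, *children = tree
    key = (label, *(to_dag(c, table) for c in children))
    if key not in table:
        table[key] = len(table)
    return table[key]


def count_nodes(tree):
    """Number of nodes in the original, unshared tree."""
    return 1 + sum(count_nodes(c) for c in tree[1:])


# The tree generated by S -> a(b(D), c(D)), D -> d(e,f).
D = ("d", ("e",), ("f",))
TREE = ("a", ("b", D), ("c", D))
```

Calling `to_dag(TREE, table)` with an empty dict leaves 6 entries in `table` for a tree of 9 nodes, because the subtree d(e,f) is stored once.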
Distinguish shared nodes
How to distinguish shared nodes?
● Consider pre-order identifiers in the original tree
● Store the size of the subgraph for each rule
● Sum the sizes of skipped subtrees during search
Example 2
S -> a(B,e(B))    |S| = 8
B -> b(c,d)       |B| = 3
[Figure: expanded tree with pre-order identifiers a=0, b=1, c=2, d=3, e=4, b=5, c=6, d=7]
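The lookup in Example 2 can be sketched as follows: to find the node with a given pre-order identifier, descend from S and skip whole subtrees by subtracting their stored sizes. Nested tuples, the `RULES` dict, and the names `rule_size`, `size`, and `label_at` are assumptions made for this illustration.

```python
from functools import lru_cache

# Grammar of Example 2; a leaf whose label is a key of RULES is a shared subtree.
RULES = {
    "S": ("a", ("B",), ("e", ("B",))),
    "B": ("b", ("c",), ("d",)),
}


@lru_cache(maxsize=None)
def rule_size(name):
    """Size of the tree generated by a rule, computed once per rule."""
    return size(RULES[name])


def size(node):
    """Number of nodes in the expanded subtree rooted at node."""
    label, *children = node
    if label in RULES:
        return rule_size(label)
    return 1 + sum(size(c) for c in children)


def label_at(node, k):
    """Label of the node with pre-order offset k inside the subtree of node."""
    label, *children = node
    if label in RULES:
        return label_at(RULES[label], k)  # enter the shared subtree
    if k == 0:
        return label
    k -= 1  # account for the current node
    for child in children:
        s = size(child)
        if k < s:
            return label_at(child, k)
        k -= s  # skip the whole subtree without expanding it
    raise IndexError("pre-order identifier out of range")
```

For instance, `label_at(("S",), 5)` skips the first copy of B (size 3), enters e(B), and returns `b`, the root of the second copy.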
Example 3
S -> a(B(d,e(f)),g(B(h,i)))    |S| = 11
B(x,y) -> b(c(x,y))            |B(x,y)| = |x|+|y|+2
[Figure: expanded tree with pre-order identifiers 0 to 10]
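With parameters, the size of a rule application is the number of non-parameter nodes in the rule body plus the sizes of its arguments. A minimal sketch follows, assuming every parameter is used exactly once and rule bodies contain only terminal symbols, as in Example 3. The representation and the names `body_size` and `size` are hypothetical.

```python
# Grammar of Example 3; parameters appear as bare strings in the rule body.
RULES = {
    "B": (("x", "y"), ("b", ("c", "x", "y"))),
}
S = ("a",
     ("B", ("d",), ("e", ("f",))),
     ("g", ("B", ("h",), ("i",))))


def body_size(body):
    """Number of non-parameter nodes in a rule body."""
    if isinstance(body, str):  # a parameter such as "x" contributes no node
        return 0
    label, *children = body
    return 1 + sum(body_size(c) for c in children)


def size(node):
    """Size of the expanded tree: |B(x,y)| = |x| + |y| + body_size."""
    label, *children = node
    args = sum(size(c) for c in children)
    if label in RULES:
        _params, body = RULES[label]
        # Assumption: each parameter occurs exactly once in the body.
        return body_size(body) + args
    return 1 + args
```

Here `body_size` of the rule for B is 2, which is the constant in |B(x,y)| = |x|+|y|+2, and `size(S)` evaluates to 11.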
Open questions
● When to share nodes, and when not to?
    ○ Pruning / rank / overlap variations
    ○ Smallest grammar is NP-hard [Charikar 2005]
● How to encode a tree grammar efficiently?
    ○ Compression and encoding are separate problems
    ○ Extend succinct CFG [Tabei 2013] to trees?
● Can we use a deterministic CFG for strings?
    ○ Construct a DCFG which accepts the words in a lexicon
    ○ Use a shift-reduce parser to test membership
    ○ Start symbols are unique word identifiers
(ノ゚Д)ノ== ┻━┻
Fin.
Reference
● [Lohrey 2011] Tree structure compression with RePair.
● [Larsson 2000] Off-line dictionary-based compression.
● [Charikar 2005] The Smallest Grammar Problem.
● [Tabei 2013] A Succinct Grammar Compression.