The SBC-Tree: An Index for Run-Length Compressed Sequences Mohamed El-tabakh 1 , Wing-Kia Hon 2 Rahul Shah 3 , Walid Aref 1 , Jeffrey Vitter 1 1 Department of Computer Science, Purdue University 2 Department of Computer Science, National Tsing Hua University 3 Department of Computer Science, Louisiana State University
The SBC-Tree: An Index for Run-Length Compressed Sequences
Mohamed El-tabakh 1, Wing-Kia Hon 2, Rahul Shah 3, Walid Aref 1, Jeffrey Vitter 1
1 Department of Computer Science, Purdue University
2 Department of Computer Science, National Tsing Hua University
3 Department of Computer Science, Louisiana State University
Outline
Introduction
Related Work
SBC-Tree Structure
SBC-Tree Operations
Theoretical and Experimental Analysis
Summary
Introduction: Why Compression?
We deal with massive amounts of data (scientific databases, …)
Text and sequence formats are very common
Compression techniques are gaining importance because they achieve:
• Significant storage reduction
• Reduced buffer requirements
• Fewer I/Os
>>> Enhanced overall system performance
Introduction: Objective
Current databases do not support data compression; they operate over the raw data
• Today: store, index, and search the decompressed sequences
• Objective: compress, then store, index, and search the compressed sequences
The main challenge is how to operate on the compressed data without decompressing it
This is more challenging for external memory processing
Processing Compressed Sequences: Related Work(1)
A. Amir and G. Benson. Efficient two-dimensional compressed matching. In DCC, 1992.
A. Amir, G. Benson, and M. Farach. Let sleeping files lie: pattern matching in z-compressed files. In SODA, 1994.
A. Apostolico, G. M. Landau, and S. Skiena. Matching for run-length encoded strings. Journal of Complexity, 1999.
T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh. Searching BWT compressed text with the Boyer-Moore algorithm and binary search. In DCC, 2002.
V. Freschi and A. Bogliolo. Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism. Information Processing Letters, 2004.
• Searching compressed data is addressed in main memory
• Substring matching, longest common subsequence, edit distance
1. Processing compressed data in main memory
Processing Compressed Sequences: Related Work(2)
M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A column-oriented DBMS. In VLDB, 2005.
D. Abadi, S. Madden. Compression in Column Oriented Databases. In SIGMOD, 2006.
Run-Length encoding of a column in a database table: the value 20 repeated 100 times is stored as (20, 100)
Operations such as SUM can be applied directly over the compressed data
More complex operations have not been addressed yet:
• Indexing RLE-compressed sequences
• Substring searching
2. Processing compressed data in DBMSs
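The SUM remark can be illustrated with a minimal sketch. The helper names `rle_encode_column` and `rle_sum` are hypothetical and are not taken from C-Store or any DBMS; the point is only that each run contributes value × run length, so the column never has to be expanded.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (value, run_length); hypothetical representation


def rle_encode_column(values: List[int]) -> List[Run]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Run] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs


def rle_sum(runs: List[Run]) -> int:
    """SUM computed directly on the runs, without decompressing."""
    return sum(value * length for value, length in runs)


# The slide's example: the value 20 repeated 100 times (plus a second run).
column = [20] * 100 + [7] * 3
runs = rle_encode_column(column)
total = rle_sum(runs)
```

Here `rle_sum` touches two runs instead of 103 values.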
What is SBC-Tree?
SBC-Tree (String B-tree for Compressed sequences)
An index for Run-Length Encoding (RLE) compressed sequences
Supports prefix, range, and substring matching
Optimal theoretical bounds for:
• External memory space complexity
• Search I/O requirements
>> Relative to the size of the compressed sequences
SBC-Tree: An Index for RLE Compressed Sequences
S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH
RLE(S) = L10 E6 L4 E3 H18 (each run, such as L10, is an RLE-char)
>> S has 41 suffixes
>> RLE(S) has 5 RLE-suffixes:
L10 E6 L4 E3 H18
E6 L4 E3 H18
L4 E3 H18
E3 H18
H18
1. Store the compressed sequences
2. Index the RLE-suffixes
3. Perform substring operations efficiently
Run-Length Encoding (RLE)
• Replaces each run of consecutive repeated characters with the character and its frequency
• Effective with small alphabets
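The example on this slide can be reproduced with a short sketch. The function names are illustrative only and are not part of the SBC-tree implementation.

```python
from itertools import groupby


def rle_encode(s):
    """Return the RLE-chars of s as (character, frequency) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_suffixes(runs):
    """One RLE-suffix per RLE-char: the runs from position i to the end."""
    return [runs[i:] for i in range(len(runs))]


def show(runs):
    """Format runs in the slide's notation, e.g. 'L10 E6'."""
    return " ".join(f"{ch}{f}" for ch, f in runs)


S = "L" * 10 + "E" * 6 + "L" * 4 + "E" * 3 + "H" * 18
runs = rle_encode(S)
suffixes = [show(r) for r in rle_suffixes(runs)]
```

The raw sequence has one suffix per character, while the compressed form has one RLE-suffix per run, which is why the index can be sized relative to the compressed length.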
SBC-tree Structure
Two-level index structure:
• String B-tree: indexes the RLE-suffixes
• Two-dimensional index: built on top of the leaves of the String B-tree
[Figure: String B-tree (root) with a two-dimensional index (e.g., R-tree) built over its leaves. Axes: tags (numeric tag assigned to each suffix) and preceding character.]
String B-tree Overview [P. Ferragina and R. Grossi, Journal of the ACM, 1999]
S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH
1. Generate all suffixes of S
2. Insert the suffixes into the String B-tree (ordered alphabetically)
3. Store logical pointers, such as (S,11), instead of the key sequences themselves
4. Several optimizations achieve optimal theoretical bounds for:
• External memory space complexity
• Search I/O requirements
>> Relative to the size of the raw (decompressed) sequences
[Figure: String B-tree leaf entries shown as logical pointers (S,21), (S,12), (S,11), (S,13), with offsets 11 and 21 marked on S.]
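The idea of storing logical pointers rather than keys can be sketched in memory as follows. This is not a String B-tree: it is a plain sorted list of 1-based offsets into S (an assumption about how the pointers (S,k) are numbered), with none of the paging or optimizations of the real structure.

```python
def build_pointers(s):
    """1-based start offsets of all suffixes of s, ordered alphabetically
    by the suffix each offset points to. Only offsets are stored."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find(s, pointers, pattern):
    """Offsets of all occurrences of pattern, via binary search on pointers."""
    k = len(pattern)
    lo, hi = 0, len(pointers)
    while lo < hi:  # lower bound: first suffix whose k-prefix is >= pattern
        mid = (lo + hi) // 2
        start = pointers[mid] - 1
        if s[start:start + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(pointers) and s.startswith(pattern, pointers[lo] - 1):
        hits.append(pointers[lo])  # matching suffixes are consecutive
        lo += 1
    return sorted(hits)


S = "L" * 10 + "E" * 6 + "L" * 4 + "E" * 3 + "H" * 18
pointers = build_pointers(S)
occurrences = find(S, pointers, "LEEE")
```

Because every suffix of the raw sequence is indexed, the answers to a substring query always form one consecutive range of pointers, a property the later slides show is lost when only RLE-suffixes are indexed.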
String B-tree over RLE-suffixes
The String B-tree CANNOT be used directly to index RLE-suffixes: the RLE-suffixes are only a subset of all suffixes
RLE-suffixes:
L10 E6 L4 E3 H18
E6 L4 E3 H18
L4 E3 H18
E3 H18
H18
• We indexed only a subset of the suffixes (the RLE-suffixes)
• Searching for “L10 E6 L3”: Found
• Searching for “L5 E6 L3”: Not Found (implicit in L10 E6 L4 E3 H18)
• Searching for “E3 L4”: Not Found (implicit in E6 L4 E3 H18)
[Figure: String B-tree over the RLE-suffixes of L10 E6 L4 E3 H18, showing their sorted order and the leaf pointers (S,6), (S,8), (S,4), (S,1), (S,10).]
SBC-Tree over RLE-suffixes
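Query Pattern Mapping Rule: a substring query pattern $P = x_1^{f_1} x_2^{f_2} \dots x_n^{f_n}$ is mapped into $P' = x_1^{f_1+} x_2^{f_2} \dots x_n^{f_n}$, where $x_1^{f_1+}$ matches a run of $x_1$ with frequency at least $f_1$
RLE-suffixes:
L10 E6 L4 E3 H18
E6 L4 E3 H18
L4 E3 H18
E3 H18
H18
• Searching for “L5 E6 L3” (L5+ E6 L3): Found
• Searching for “E3 L4” (E3+ L4): Found
Challenge: the answer set is no longer consecutive in the index tree
>> Unbounded number of I/Os to answer a query
Example: for L5+ E6 L3, the matching suffixes L5 E6 L3 and L6 E6 L3 are separated in the index order by L5 H2 and L5 K10 L3, which are not part of the query answer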
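The non-consecutive answer set can be checked with a small sketch built on the four suffixes from the slide's example. The `matches` function is a hypothetical in-memory reading of the mapping rule (first run at least $f_1$, middle runs exact, last run allowed to continue), and ordering the suffixes by their decompressed text is an assumption about the index order.

```python
def parse(text):
    """Parse 'L5 E6 L3' into [('L', 5), ('E', 6), ('L', 3)].
    Single-letter symbols are assumed."""
    return [(tok[0], int(tok[1:])) for tok in text.split()]


def expand(runs):
    """Decompress runs; used here only to order the suffixes."""
    return "".join(ch * f for ch, f in runs)


def matches(suffix, pattern):
    """True if the mapped pattern P' matches at the start of suffix."""
    m = len(pattern)
    if len(suffix) < m:
        return False
    for j in range(m):
        (pc, pf), (sc, sf) = pattern[j], suffix[j]
        if pc != sc:
            return False
        if j == 0 or j == m - 1:
            if sf < pf:        # first run: the "+"; last run: may continue
                return False
        elif sf != pf:         # middle runs must match exactly
            return False
    return True


indexed = ["L5 E6 L3", "L6 E6 L3", "L5 H2", "L5 K10 L3"]
ordered = sorted(indexed, key=lambda s: expand(parse(s)))
pattern = parse("L5 E6 L3")
hits = [pos for pos, s in enumerate(ordered) if matches(parse(s), pattern)]
```

The two matching suffixes land at positions 0 and 3 of the ordered list, with two non-answers between them, so a range scan between the first and last match would read entries that are not part of the result.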
SBC-tree: Insertion Procedure
Given an RLE sequence $S = \Omega^1 x_1^{f_1} x_2^{f_2} \dots x_n^{f_n}$
1. Insert S as the first suffix into the SBC-tree first level
2. For each $1 \le i \le n$, insert the RLE-suffix $x_i^1 x_{i+1}^{f_{i+1}} \dots x_n^{f_n}$ into the SBC-tree first level and assign it a position tag T (the tag assignment problem)
3. Insert the point $(T, f_i)$ into the SBC-tree second level
[Figure: String B-tree (root) with a two-dimensional index (e.g., R-tree) built over its leaves. Axes: tags (numeric tag assigned to each suffix) and preceding character.]
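Steps 2 and 3 of the insertion procedure can be sketched in memory as follows, under several stated assumptions: the tag T is simply the rank of the suffix in sorted order (a static stand-in for the tag assignment problem), keys are decompressed for comparison to keep the code short, and step 1 with the $\Omega^1$ sentinel is omitted. The name `build_levels` is hypothetical.

```python
def parse(text):
    """Parse 'L10 E6' into [('L', 10), ('E', 6)]; single-letter symbols assumed."""
    return [(tok[0], int(tok[1:])) for tok in text.split()]


def expand(runs):
    """Decompress runs; a shortcut for comparing keys in this sketch."""
    return "".join(ch * f for ch, f in runs)


def build_levels(runs):
    """Return (keys, points).

    keys:   first level, the sorted suffix keys x_i^1 x_{i+1}^{f_{i+1}} ... x_n^{f_n}
    points: second level, one (T, f_i) pair per suffix, where T is the
            1-based rank of the key (assumption: static tags)
    """
    entries = sorted(
        (expand([(ch, 1)] + runs[i + 1:]), f)  # first run forced to frequency 1
        for i, (ch, f) in enumerate(runs)
    )
    keys = [key for key, _ in entries]
    points = [(tag, f) for tag, (_, f) in enumerate(entries, start=1)]
    return keys, points


keys, points = build_levels(parse("L10 E6 L4 E3 H18"))
```

Forcing the first run to frequency 1 makes every suffix that starts inside a run of the same character share a key prefix, while the original frequency $f_i$ moves to the second coordinate of the point.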
SBC-tree: Substring Searching
Given a query $Q = y_1^{f_1} y_2^{f_2} \dots y_m^{f_m}$
1. Map Q into $Q' = y_1^{f_1+} y_2^{f_2} \dots y_m^{f_m}$
2. Search the String B-tree for $Q'' = y_1^1 y_2^{f_2} \dots y_m^{f_m}$
   Returns (min_tag, max_tag) as a contiguous range
3. Search the SBC-tree second level for suffixes with tags in (min_tag, max_tag) and frequency >= $f_1$
[Figure: two-dimensional index (e.g., R-tree) built over the String B-tree (root). Axes: suffix tag, with Min_tag and Max_tag marked, and preceding RLE-char, with f1 marked. The answer set is the region bounded by these values.]
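The three search steps can be sketched end to end on the running example. This is an in-memory approximation under stated assumptions: a sorted list stands in for the String B-tree, a linear filter stands in for the two-dimensional index, tags are ranks, and keys are decompressed for comparison. The names `build` and `search` are hypothetical.

```python
from bisect import bisect_left


def parse(text):
    """Parse 'L5 E6 L3' into [('L', 5), ('E', 6), ('L', 3)]."""
    return [(tok[0], int(tok[1:])) for tok in text.split()]


def expand(runs):
    return "".join(ch * f for ch, f in runs)


def build(runs):
    """First level: sorted keys. Second level: (tag, f_i, run index) points."""
    entries = sorted(
        (expand([(ch, 1)] + runs[i + 1:]), f, i)
        for i, (ch, f) in enumerate(runs)
    )
    keys = [key for key, _, _ in entries]
    points = [(tag, f, i) for tag, (_, f, i) in enumerate(entries, start=1)]
    return keys, points


def search(index, query):
    """Return the 0-based run indices where the query occurs, in tag order."""
    keys, points = index
    (y1, f1), rest = query[0], query[1:]
    q2 = expand([(y1, 1)] + rest)            # step 2: Q'' with first run set to 1
    lo = bisect_left(keys, q2)
    hi = lo
    while hi < len(keys) and keys[hi].startswith(q2):
        hi += 1                              # prefix matches are contiguous
    if lo == hi:
        return []
    min_tag, max_tag = lo + 1, hi            # tags are 1-based ranks
    # step 3: 3-sided query, min_tag <= tag <= max_tag and frequency >= f1
    return [i for tag, f, i in points if min_tag <= tag <= max_tag and f >= f1]


index = build(parse("L10 E6 L4 E3 H18"))
found_a = search(index, parse("L5 E6 L3"))
found_b = search(index, parse("E3 L4"))
```

Both queries that the plain String B-tree over RLE-suffixes missed are found here, and the frequency filter is what rejects a query such as L11 E6, whose first run is longer than any run of L followed by E6.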
SBC-Tree: Example
P = A5 E3 B4
P’ = A5+ E3 B4
P’’ = A1 E3 B4
SBC-Tree Variants
3-sided structure [L. Arge, V. Samoladas, J. Vitter, PODS 1999]
• External memory structure based on priority search tree and B-tree
• Answers 3-sided range queries in 2D space
• Provides optimal worst-case theoretical bounds for external memory space complexity, insertion and deletion, and 3-sided range queries
R-tree
• Widely available in DBMSs
• Provides good performance in practice
• Does not have worst-case theoretical bounds for searching
One-Level SBC-tree
• Removes the second-level structure
• Disadvantage: queries scan many tuples outside the answer set
SBC-tree Theoretical Bounds
Optimal external-memory space complexity: $O(N/B)$
Optimal substring, prefix, and range searching in $O(\log_B N + (|p| + T)/B)$ I/O operations
Insertion and deletion in $O(m \log_B (N + m))$ amortized I/O operations

Parameter   Definition
B           Disk page size
N           Total length of the RLE-compressed sequences
T           Query output size
|p|         Length of the RLE-compressed query pattern
m           Length of the RLE-compressed sequence to be inserted or deleted
SBC-tree Implementation
SBC-tree (R-tree variant) is implemented inside PostgreSQL