Speeding up pattern matching by text compression
Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa
Department of Informatics, Kyushu University, Japan
Department of AI, Kyushu Institute of Technology, Japan
Pattern matching algorithm on BPE compressed text.
Experimental result.
Conclusion.
Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. Judged by the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.
Pattern Matching Problem
Pattern: matching
Text: the passage above, in which the occurrences of the pattern are to be found
Knuth-Morris-Pratt (1974)
Boyer-Moore (1977)
Aho-Corasick (1975)
Shift-Or (1992)
Pattern Matching on Compressed Text
Without compression: original text on secondary disk storage → file transfer → original text in memory → search
With compression: compressed text on secondary disk storage → file transfer → compressed text in memory → expand → original text in memory → search
Expanding the compressed text requires extra time and space.
Pattern Matching on Compressed Text
Compressed text on secondary disk storage → file transfer → compressed text in memory → search directly
Goal 1: To search compressed text faster than a regular decompression followed by an ordinary search.
Goal 2: To search compressed text faster than an ordinary search in the original text.
Speeding up pattern matching by text compression
Previous Results (1)
Year  Researchers                              Compression
1988  Eilam-Tzoreff and Vishkin                run-length
1992  Amir, Landau, and Vishkin                two-dimensional run-length
1992  Amir and Benson                          two-dimensional run-length
1994  Amir, Benson, and Farach                 two-dimensional run-length
1994  Manber                                   original compression scheme
1995  Farach and Thorup                        LZ77
1996  Amir, Benson, and Farach                 LZW
1996  Gasieniec, et al.                        LZ77
1997  Karpinski, Rytter, and Shinohara         straight-line programs
1997  Miyazaki, Shinohara, and Takeda          straight-line programs
1997  Takeda                                   finite state encoding
1998  Shibata                                  byte pair encoding
1998  Fukamachi, Shinohara, and Takeda         Huffman encoding
1998  Kida, et al.                             LZW
Previous Results (2)
Year  Researchers                              Compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates   word based encoding
1999  Kida, Takeda, Shinohara, and Arikawa     LZW
1999  Shibata, Takeda, Shinohara, and Arikawa  antidictionary based
1999  Navarro and Raffinot                     LZ family
1999  Kida, et al.                             dictionary based methods (collage system): the unifying framework
2000  Shibata, et al.                          byte pair encoding: today's talk
A Unifying Framework for Compressed Pattern Matching
Previous: each compression method had its own pattern matching (PM) algorithm.
Compression A → PM Algorithm A
Compression B → PM Algorithm B
Compression C → PM Algorithm C
Kida et al. [1999]: Compression A, B, and C are each expressed as a collage system, and a single pattern matching algorithm runs on this unifying framework.
Collage System
Definition and Several Examples
Dictionary Based Compression
Original text → factorize into a series of phrases → dictionary structure and encoding → compressed text
Design choices:
• How to choose the phrases.
• How to design the data structure of the dictionary.
• How to encode phrases.
Collage System
A collage system is a pair $\langle D, S \rangle$.
D: a sequence of assignments (dictionary structure)
$X_1 = expr_1;\ X_2 = expr_2;\ \cdots;\ X_n = expr_n$
where each $expr_k$ is one of:
• $a$ for $a \in \Sigma \cup \{\varepsilon\}$ (primitive assignment)
• $X_i \cdot X_j$ for $i, j < k$ (concatenation)
• $(X_i)^j$ for $i < k$ and integer $j$ ($j$ times repetition)
• $^{[j]}X_i$ for $i < k$ and integer $j$ (prefix truncation)
• $X_i^{[j]}$ for $i < k$ and integer $j$ (suffix truncation)
S: a sequence of variables defined in D (compressed text)
$S = X_{i_1}, X_{i_2}, \cdots, X_{i_l}$ (each $X_{i_k}$ defined in $D$)
$\|D\| = n$: number of assignments in D
$|S| = l$: number of variables in S
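Example of Collage System
D:
$X_1 = a;\ X_2 = b;$
$X_3 = X_1 \cdot X_2;$
$X_4 = X_2 \cdot X_1;$
$X_5 = (X_3)^3;$
$X_6 = {}^{[3]}X_5;$
$X_7 = X_6 \cdot X_4;$
S: $X_3, X_6, X_4, X_7$
Original text: abbabbababba
[Figure: derivation tree $T(X_7)$. The root $X_7$ has children $X_6$ and $X_4$; $X_6$ is the prefix truncation of $X_5$; $X_5$ is the 3 times repetition of $X_3$; $X_3 = X_1 \cdot X_2$ and $X_4 = X_2 \cdot X_1$. That is, $X_7 = {}^{[3]}((ab)^3) \cdot (ba)$.]
height($X_7$) = 4
height($D$) = 4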
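The example can be checked mechanically. Below is a minimal Python sketch that expands the variables of a collage system and computes heights. The tuple encoding of assignments and the function names `expand` and `height` are hypothetical, chosen only for this illustration. It assumes that prefix truncation drops the first j characters and suffix truncation drops the last j characters, which is the reading under which the example reproduces the text abbabbababba, and that primitive assignments have height 0.

```python
# Hypothetical encoding of assignments, used only in this sketch:
#   ("prim", "a")     primitive assignment
#   ("cat", i, j)     X_i . X_j (concatenation)
#   ("rep", i, j)     (X_i)^j (j times repetition)
#   ("ptrunc", j, i)  [j]X_i, assumed to drop the first j characters
#   ("strunc", i, j)  X_i[j], assumed to drop the last j characters

def expand(d, k):
    """Return the string that variable X_k stands for."""
    e = d[k]
    if e[0] == "prim":
        return e[1]
    if e[0] == "cat":
        return expand(d, e[1]) + expand(d, e[2])
    if e[0] == "rep":
        return expand(d, e[1]) * e[2]
    if e[0] == "ptrunc":
        return expand(d, e[2])[e[1]:]
    if e[0] == "strunc":
        s = expand(d, e[1])
        return s[:len(s) - e[2]]
    raise ValueError(f"unknown assignment: {e!r}")

def height(d, k):
    """Height of the derivation tree of X_k (primitives have height 0)."""
    e = d[k]
    if e[0] == "prim":
        return 0
    if e[0] == "cat":
        return 1 + max(height(d, e[1]), height(d, e[2]))
    if e[0] == "ptrunc":
        return 1 + height(d, e[2])
    return 1 + height(d, e[1])  # "rep" and "strunc"

# The collage system from the example slide.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),
    6: ("ptrunc", 3, 5),
    7: ("cat", 6, 4),
}
S = [3, 6, 4, 7]

text = "".join(expand(D, k) for k in S)
height_x7 = height(D, 7)
height_d = max(height(D, k) for k in D)
print(text, height_x7, height_d)
```

Expanding every variable like this takes time proportional to the original text, so it is only a way to check a dictionary, not a way to search it.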
Pattern Matching Algorithm on a Collage System
Compressed pattern matching on a collage system
$m$: pattern length
$r$: number of pattern occurrences
$\|D\|$: number of assignments in D
$|S|$: number of variables in S
Theorem [Kida et al. 1999]: The problem of compressed pattern matching can be solved in $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time using $O(\|D\| + m^2)$ space. If D contains no truncation, it can be solved in $O(\|D\| + |S| + m^2 + r)$ time.
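Basic Idea
Pattern π = ababb
[Figure: KMP automaton for π with states 0 to 5, starting in state 0. Goto function: 0 -a→ 1 -b→ 2 -a→ 3 -b→ 4 -b→ 5. The failure function is drawn as a second kind of edge.]
Original text:  a b a b a b b a
State sequence: 1 2 3 4 3 4 5 1
Compressed text S: $X_{i_1}\ X_{i_2}\ X_{i_3}\ X_{i_4}$, which together spell abababba. The idea is to process one variable at a time instead of one character at a time.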
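The state sequence on this slide can be reproduced with an ordinary KMP automaton. The following Python sketch is a standard textbook construction, not code from the talk; the names `failure_function`, `step`, and `trace` are made up for the illustration.

```python
def failure_function(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[q - 1] != p[k]:
            k = fail[k]
        if p[q - 1] == p[k]:
            k += 1
        fail[q] = k
    return fail

def step(p, fail, q, c):
    """One transition of the KMP automaton from state q on character c."""
    while True:
        if q < len(p) and p[q] == c:
            return q + 1          # goto edge
        if q == 0:
            return 0              # no prefix of p matches
        q = fail[q]               # failure edge

def trace(p, text):
    """States visited while reading text, starting from state 0."""
    fail = failure_function(p)
    q, states = 0, []
    for c in text:
        q = step(p, fail, q, c)
        states.append(q)
    return states

states = trace("ababb", "abababba")
print(states)
```

State 5 is reached after the seventh character, so the pattern ababb ends at position 7 of the text.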
Jump and Output
The function $\mathrm{Jump}(j, u) = \delta_{\mathrm{KMP}}(j, u)$
• The domain is $Q \times D$.
• It simulates the sequence of state transitions for $u$.
• Reply in $O(1)$ time.
The set $\mathrm{Output}(j, u) = \{\, 1 \le i \le |u| \mid P \text{ is a suffix of } P[1:j] \cdot u[1:i] \,\}$
• This set contains the pattern occurrences.
• Reply in $O(l)$ time, where $l$ is the size of the set Output.
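To make the two definitions concrete, here is a small Python sketch that evaluates them literally on plain strings. It is a brute-force reading of the definitions, not the constant-time realization the talk is about, and the function names are hypothetical. It relies on the standard fact that the KMP state after reading u from state j is the length of the longest prefix of P that is a suffix of P[1:j]·u.

```python
def jump_by_definition(p, j, u):
    """Jump(j, u): KMP state reached from state j after reading u,
    computed as the longest prefix of p that is a suffix of p[:j] + u."""
    s = p[:j] + u
    for k in range(min(len(p), len(s)), -1, -1):
        if s.endswith(p[:k]):     # p[:0] is empty and always matches
            return k

def output_by_definition(p, j, u):
    """Output(j, u): every i in 1..|u| such that p is a suffix of
    p[:j] + u[:i], i.e. an occurrence of p ends at position i of u."""
    return [i for i in range(1, len(u) + 1)
            if (p[:j] + u[:i]).endswith(p)]

P = "ababb"
print(jump_by_definition(P, 4, "abb"))    # state after reading "abb" from state 4
print(output_by_definition(P, 4, "abb"))  # positions in "abb" where P ends
```

With the pattern ababb, reading the phrase abb from state 4 ends in state 5 and reports one occurrence ending at offset 3 of the phrase, which matches the state sequence on the Basic Idea slide.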
Realization of Jump and Output
For Jump(q, $X_k$), if $X_k$ is ...
• $a$ (primitive): $O(1)$ time.
• $X_i \cdot X_j$ (concatenation): if the factor concatenation problem for a string of length $m$ can be solved in $O(1)$ time, Jump can be answered in $O(1)$ time.
For Output(q, $X_k$), if $X_k$ is ...
• $a$ (primitive): $O(1)$ time.
• $X_i \cdot X_j$ (concatenation): the set can be enumerated in $O(l)$ time from the Output sets of $X_i$ and $X_j$, where $l$ is the size of the set Output.
Factor Concatenation Problem
Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If 'yes', then return its node number.
Example: P = COPACABANA, x = OPA, y = CABAN. Concatenating gives OPACABAN = P[2:9], so the answer is 'Yes'.
Solution to the problem
• Using a suffix trie, it can be solved in $O(m)$ time after preprocessing of $O(m^2)$ time and space.
• Using a two-dimensional lookup table, it can be solved in $O(1)$ time, but we need $O(m^4)$ time and space for preprocessing.
Our result: it can be solved in $O(1)$ time after $O(m^2)$ time and space preprocessing.
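As an illustration of the problem statement only, the Python sketch below answers factor concatenation queries with a dictionary of all factors of P. It is a naive stand-in: factors are identified by the 1-indexed interval of their first occurrence instead of a suffix trie node, each query builds the string xy, and the table is far from the $O(m^2)$ preprocessing bound stated above. The names `build_factor_table` and `factor_concat` are hypothetical.

```python
def build_factor_table(p):
    """Map every non-empty factor of p to the 1-indexed (start, end)
    interval of its first occurrence in p."""
    table = {}
    for start in range(len(p)):
        for end in range(start + 1, len(p) + 1):
            table.setdefault(p[start:end], (start + 1, end))
    return table

def factor_concat(table, x, y):
    """Return the interval of xy if xy is a factor, otherwise None."""
    if x not in table or y not in table:
        raise ValueError("x and y must both be factors of the pattern")
    return table.get(x + y)

table = build_factor_table("COPACABANA")
print(factor_concat(table, "OPA", "CABAN"))  # interval of OPACABAN
print(factor_concat(table, "CA", "CA"))      # CACA is not a factor
```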
Outline of Our Algorithm
Input: pattern P and collage system 〈D, S〉 (S = X_{i_1}, X_{i_2}, ..., X_{i_n})
Output: all occurrences of the pattern.
    /* preprocessing of D and P */
    preprocess(D); preprocess(P);
    l := 0; q := 0;
    for j := 1 to n do begin
        for each d ∈ Output(q, X_{i_j}) do
            report 'pattern occurs at position l + d';
        q := Jump(q, X_{i_j});    /* state transition */
        l := l + |X_{i_j}|;       /* calculation of the offset */
    end
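The loop above can be run end to end with a simplified realization of Jump and Output. The Python sketch below is an assumption-laden stand-in, not the method of the talk: it handles only primitive (single character) and concatenation assignments, and it tabulates Jump and Output for every state by composing the tables of the two operands, which costs O(m) per assignment instead of the O(1) obtained via factor concatenation. The tuple encoding and all function names are hypothetical.

```python
def kmp_delta(p, alphabet):
    """DFA form of the KMP automaton: delta[q][c] for states 0..m.
    Built naively from the longest-prefix-that-is-a-suffix property."""
    m = len(p)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            s = p[:q] + c
            k = min(m, len(s))
            while not s.endswith(p[:k]):
                k -= 1
            delta[q][c] = k
    return delta

def preprocess(p, d, alphabet):
    """Tabulate |X_k|, Jump(q, X_k) and Output(q, X_k) for all states q."""
    m = len(p)
    delta = kmp_delta(p, alphabet)
    length, jump, output = {}, {}, {}
    for k in sorted(d):                 # operands always have smaller indices
        e = d[k]
        if e[0] == "prim":              # ("prim", c) with a single character c
            length[k] = 1
            jump[k] = [delta[q][e[1]] for q in range(m + 1)]
            output[k] = [[1] if jump[k][q] == m else [] for q in range(m + 1)]
        else:                           # ("cat", i, j): X_k = X_i . X_j
            _, i, j = e
            length[k] = length[i] + length[j]
            jump[k] = [jump[j][jump[i][q]] for q in range(m + 1)]
            output[k] = [output[i][q] +
                         [length[i] + t for t in output[j][jump[i][q]]]
                         for q in range(m + 1)]
    return length, jump, output

def search(p, d, s, alphabet):
    """Main loop of the outline: returns end positions of all occurrences."""
    length, jump, output = preprocess(p, d, alphabet)
    l, q, found = 0, 0, []
    for k in s:
        found.extend(l + t for t in output[k][q])  # report l + d
        q = jump[k][q]                             # state transition
        l += length[k]                             # offset
    return found

# Hypothetical collage system for the text abababba: ab, ab, abb, a.
D = {1: ("prim", "a"), 2: ("prim", "b"), 3: ("cat", 1, 2), 4: ("cat", 3, 2)}
S = [3, 3, 4, 1]
occurrences = search("ababb", D, S, "ab")
print(occurrences)
```

The reported values are end positions, as in the outline ('position l + d'). Storing explicit Output lists per state is convenient for a sketch but does not meet the space bound of the theorem.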