Multiple Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan
Nagano   Fukuoka
Previous result vs Our result

Amir, Benson, and Farach's algorithm (JCSS 1996)
"Let sleeping files lie: Pattern matching in Z-compressed files"
- deals with only a single pattern.
- can find only the first occurrence of the pattern.
- takes O(n + m^2) time and space.
n: length of the compressed text, m: length of the pattern.

Our algorithm
- deals with multiple patterns.
- can find all occurrences of the patterns.
- takes O(n + m^2 + r) time and O(n + m^2) space.
m: total length of the patterns, r: number of pattern occurrences.
original text: [example string over the alphabet {a, b}; the slide figure did not survive extraction]
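An LZW compressor turns a text into a sequence of dictionary codes, and that code sequence is what the matching algorithm reads. The following is a minimal Python sketch under stated assumptions: the function names `lzw_compress` and `lzw_decompress`, the fixed two-letter alphabet, and the sample strings are all hypothetical, and codes are plain integers. The UNIX `compress` command additionally packs codes into variable-width bit fields and may reset its dictionary, which this sketch leaves out.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW codes for `text` (assumed to use only `alphabet`)."""
    # The dictionary starts with one entry per alphabet symbol.
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    phrase = ""
    for ch in text:
        if phrase + ch in table:
            phrase += ch                       # keep extending the current phrase
        else:
            codes.append(table[phrase])        # emit the longest known phrase
            table[phrase + ch] = len(table)    # register phrase + next character
            phrase = ch
    if phrase:
        codes.append(table[phrase])
    return codes


def lzw_decompress(codes, alphabet):
    """Rebuild the text from LZW codes, re-creating the dictionary on the fly."""
    table = list(alphabet)
    out = []
    prev = None
    for code in codes:
        if code < len(table):
            entry = table[code]
        else:
            # The code names the phrase that is being defined right now.
            entry = prev + prev[0]
        if prev is not None:
            table.append(prev + entry[0])
        out.append(entry)
        prev = entry
    return "".join(out)


if __name__ == "__main__":
    sample = "ababbabab"                       # hypothetical sample text
    codes = lzw_compress(sample, "ab")
    print(codes)
    print(lzw_decompress(codes, "ab") == sample)
```

Each code stands for a whole phrase, so a matcher that handles one code per step can skip over several text characters at once.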
Our Algorithm

Input.  Π : set of patterns,
        u_1, u_2, …, u_n : LZW compressed text.
Output. All occurrences of the patterns.

Construct from Π the AC machine and the generalized suffix trie.
Initialize the dictionary trie, Next and Output;
l := 0; state := q_0;
for i := 1 to n do begin
    for each 〈d, π〉 ∈ Output(state, u_i) do
        report "pattern π occurs at position l + d";
    state := Next(state, u_i);
    l := l + |u_i|;
    Update the dictionary trie, Next and Output
end.
O(n + r)   O(n)
O(m^2)
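The shape of the main loop can be sketched in Python. This is a naive illustration, not the method from the slides: `Next(state, u)` and `Output(state, u)` are obtained by running the AC machine over the decoded phrase and memoizing the result per (state, code) pair, so the O(1)-per-code bound does not apply. The names `build_ac`, `step`, and `search_lzw` are hypothetical, the dictionary is a list of strings rather than a trie, the dictionary update happens before each code is processed so that a code defined in the same step resolves correctly, and d is assumed to count the characters of u_i consumed when an occurrence ends.

```python
from collections import deque


def build_ac(patterns):
    """Build Aho-Corasick goto, failure and output tables (state 0 is q_0)."""
    goto, out = [{}], [set()]
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({})
                out.append(set())
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        out[q].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:  # breadth-first: fail[q] is final before q's children are seen
        q = queue.popleft()
        for ch, r in goto[q].items():
            queue.append(r)
            f = fail[q]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[r] = goto[f].get(ch, 0)
            out[r] |= out[fail[r]]
    return goto, fail, out


def step(goto, fail, q, ch):
    """One character transition of the AC machine."""
    while q and ch not in goto[q]:
        q = fail[q]
    return goto[q].get(ch, 0)


def search_lzw(codes, alphabet, patterns):
    """Report (end position, pattern) pairs, reading one LZW code per step."""
    goto, fail, out = build_ac(patterns)
    phrases = list(alphabet)   # code -> phrase (stand-in for the dictionary trie)
    cache = {}                 # (state, code) -> (next state, [(d, pattern), ...])
    state, l, prev = 0, 0, None
    results = []
    for code in codes:
        # Resolve u_i; a code not yet in the dictionary is prev + prev[0].
        u = phrases[code] if code < len(phrases) else prev + prev[0]
        if prev is not None:
            phrases.append(prev + u[0])
        key = (state, code)
        if key not in cache:   # naive Next/Output: simulate the AC machine on u
            q, hits = state, []
            for d, ch in enumerate(u, start=1):
                q = step(goto, fail, q, ch)
                hits.extend((d, p) for p in sorted(out[q]))
            cache[key] = (q, hits)
        nxt, hits = cache[key]
        for d, p in hits:
            results.append((l + d, p))   # "pattern p occurs at position l + d"
        state = nxt
        l += len(u)
        prev = u
    return results


if __name__ == "__main__":
    # [0, 1, 2] is the LZW code sequence of "abab" over the alphabet "ab".
    print(search_lzw([0, 1, 2], "ab", ["ab", "bab"]))
```

The loop body mirrors the pseudocode line by line; only the two lookups are slower here, because they are computed by simulation instead of being read from precomputed tables.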
Ok! Let's go!

State Transition Function Next(q, u)
[Table residue: header "state" followed by the strings a, b, c, ab, ba, bb, bc, ca, aba, abb, abc, bab, bca, abab, abca, babb, ababb]
Table of N1(q, u) ・ u --- O(m × m)
Ancestor(q, k): the ancestor of node q with distance k in the trie of AC machine.
u : one of the explicit descendants of node u in the generalized suffix trie.
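To make the Ancestor(q, k) definition concrete, here is a small Python sketch that stores a parent pointer for every trie node and walks k steps upward. The names `build_trie` and `ancestor` and the sample patterns are hypothetical, and the O(k) walk is only the simplest way to realize the definition; the slides do not say how the lookup is implemented.

```python
def build_trie(patterns):
    """Build the pattern trie; return (parent, children) arrays, root is node 0."""
    parent = [None]
    children = [{}]
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in children[q]:
                children.append({})
                parent.append(q)
                children[q][ch] = len(parent) - 1
            q = children[q][ch]
    return parent, children


def ancestor(parent, q, k):
    """Return the ancestor of node q at distance k, walking parent pointers."""
    for _ in range(k):
        if parent[q] is None:
            raise ValueError("k exceeds the depth of the node")
        q = parent[q]
    return q


if __name__ == "__main__":
    parent, children = build_trie(["abca", "bab"])   # hypothetical patterns
    # Follow "abca" from the root to find its node.
    q = 0
    for ch in "abca":
        q = children[q][ch]
    print(q, ancestor(parent, q, 2), ancestor(parent, q, 4))
```

In this trie a node at depth d spells a pattern prefix of length d, so its ancestor at distance k spells the prefix of length d - k.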
Original Text: "The Brown corpus", 6.8 Mbytes
Compressed Text: 3.4 Mbytes, produced by compress (UNIX command)
Language: C++ (gcc without optimization)
Machine: Sun SPARCstation 20
Result of the Experiment
(number of pattern occurrences / original text length)
Previous Result
- deals with only a single pattern
- can find only the first occurrence of the pattern
- takes O(n + m^2) time and space
- no practical evaluation

Our Result
- deals with multiple patterns
- can find all occurrences of the patterns
- takes O(n + m^2) space; can answer in O(n + m^2 + r) time
- about twice as fast as decompression followed by using the AC machine