Real time pattern matching. Benny Porat, Ely Porat. Bar-Ilan University.

Post on 04-Jan-2016



Pattern Matching

Given a text T and a pattern P, the problem is to find all the substrings of T that are equal to P.

[Figure: a text T and a pattern P]
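A minimal sketch of the offline problem, assuming T and P are ordinary Python strings. The function name `find_all` is hypothetical and the scan is the naive O(nm) one, not any algorithm from the talk.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every 0-based index i with text[i:i+m] == pattern (naive scan)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(find_all("abracadabra", "abra"))  # [0, 7]
```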

Online pattern matching

We get the text character by character, and report each match with P as soon as its last character arrives.
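A small sketch of the online setting, assuming we may keep the last m characters in working memory. The name `online_matches` is hypothetical, and the per-character cost here is O(m), so this only illustrates the interface, not the time bounds discussed later.

```python
from collections import deque
from typing import Iterable, Iterator


def online_matches(stream: Iterable[str], pattern: str) -> Iterator[int]:
    """Yield the 0-based end position of each occurrence when its last character arrives."""
    m = len(pattern)
    window: deque[str] = deque(maxlen=m)  # O(m) working memory: the last m characters
    target = tuple(pattern)
    for i, ch in enumerate(stream):
        window.append(ch)
        if len(window) == m and tuple(window) == target:
            yield i


print(list(online_matches(iter("abracadabra"), "abra")))  # [3, 10]
```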

Outline

Motivation

Presentation of 3 online models

Space lower bound

A black box algorithm

Exact and approximate pattern matching in the streaming model

Motivation

Monitoring internet traffic

Stock market

Espionage

Viruses and malware

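3 online models

First model: read-only memory of size m, for saving the pattern; working memory O(m).

Second model: read-only memory of size m, for saving the pattern; working memory O(poly(log m)).

Third model: no read-only memory, so we can't save the pattern; working memory O(poly(log m)).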

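Space lower bound (deterministic)

Assume an algorithm A uses o(m) space for solving the online pattern matching problem.

Alice holds a string S = s1,s2,s3,…,sm. She runs A with S as the pattern and sends the state of A to Bob.

Bob runs over all the strings Q = q1,q2,…,qm and inserts each Q as the text for A.

A reports a match exactly when Q = S, so Bob recovers S from a message of only o(m) space, which is impossible for an arbitrary S of length m.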
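The argument can be acted out in a few lines. This is only an illustration: `StoredPatternMatcher` is a hypothetical stand-in for A whose whole state is the pattern itself, and `bob_recovers` is Bob's brute-force enumeration over a small assumed alphabet.

```python
from itertools import product


class StoredPatternMatcher:
    """Hypothetical stand-in for algorithm A: its whole state is the stored pattern."""

    def __init__(self, pattern: str):
        self.state = pattern

    def matches(self, text: str) -> bool:
        return text == self.state


def bob_recovers(matcher: StoredPatternMatcher, alphabet: str, m: int) -> str:
    """Bob feeds every string Q of length m to A; the one reported as a match is S."""
    for q in product(alphabet, repeat=m):
        candidate = "".join(q)
        if matcher.matches(candidate):
            return candidate
    raise ValueError("no candidate matched")


alice_state = StoredPatternMatcher("abba")  # Alice runs A on S and ships its state
print(bob_recovers(alice_state, "ab", 4))   # abba
```

Since Bob can always reconstruct S from the state alone, the state has to distinguish all possible patterns of length m.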

A black box for online approximate pattern matching

Raphaël Clifford, Benny Porat, Ely Porat

CPM 2008

Black box for the first model

First model: read-only memory of size m, for saving the pattern; working memory O(m).

Problem definition

There are a lot of offline pattern matching algorithms.

We want a black box algorithm that takes most offline pattern matching algorithms and converts them to be pseudo real time.

Pseudo real time: take the best running time of the offline algorithm and divide it by n. This bounds the time per character.

Not amortized!!

Result

For example, we can apply our algorithm to the following problems, with these times per character:

Hamming norm: $O(\sqrt{m \log m})$

k-mismatch: $O(\sqrt{k \log k}\,\log m)$

Matching under L2: $O(\log^2 m)$

Matching under L1: $O(\sqrt{m \log m})$

Online convolution: $O(\log^2 m)$

…: $O(\log^2 m)$

Exact And Approximate Pattern Matching In The Streaming Model

Benny Porat, Ely Porat

FOCS 2009

Solution for the third model

Third model: no read-only memory, so we can't save the pattern; working memory O(poly(log m)).

Pattern matching

Pattern matching up to k mistakes

It's not minor!

Cache works much faster than RAM. Now the algorithm can fit in it!

Antivirus on routers

Researchers thought that there was a lower bound and that it couldn't be done.

Randomized algorithm (RK)

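The signature of a window of the text, written here as $\phi$:

$\phi(T_{i+1,i+m}) = t_{i+1} r^{m-1} + t_{i+2} r^{m-2} + \dots + t_{i+m}$

Pattern: pm-1,…,p2,p1,p0

Text: t1,t2,t3,…,ti+1,ti+2,…,tm,…,tn

$\phi(P) = p_0 + p_1 r + p_2 r^2 + \dots + p_{m-1} r^{m-1}$

$\phi(T_{i,i+m-1}) = t_i r^{m-1} + t_{i+1} r^{m-2} + \dots + t_{i+m-1}$

How can I calculate $\phi(T_{i+1,i+m})$ from $\phi(T_{i,i+m-1})$ without remembering ti?

[Figure: the window slides by one position; ti leaves and a new character (tm+1 on the slide) enters]

All the calculations are in Fq.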
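For reference, here is a sketch of the classic Karp-Rabin rolling update, which does keep the text around and therefore can subtract the outgoing character. The modulus `Q` and base `R` are assumptions for the illustration (in the randomized setting r is drawn at random from Fq), and the function names are hypothetical.

```python
Q = (1 << 61) - 1  # assumed prime modulus standing in for F_q
R = 256            # assumed base r; normally drawn at random


def fingerprint(s: str) -> int:
    """s[0]*R^(m-1) + ... + s[m-1] mod Q, computed by Horner's rule."""
    h = 0
    for ch in s:
        h = (h * R + ord(ch)) % Q
    return h


def rk_matches(text: str, pattern: str) -> list[int]:
    """Start positions whose window fingerprint equals the pattern fingerprint."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = fingerprint(pattern)
    top = pow(R, m - 1, Q)  # weight of the outgoing character
    h = fingerprint(text[:m])
    hits = []
    for i in range(n - m + 1):
        if h == target:
            hits.append(i)  # fingerprints agree; false positives are possible in general
        if i + m < n:
            # drop text[i] (the step that needs t_i), then append text[i + m]
            h = ((h - ord(text[i]) * top) * R + ord(text[i + m])) % Q
    return hits


print(rk_matches("abracadabra", "abra"))  # [0, 7]
```

The update line is exactly where the stored character t_i is used, which is what a streaming algorithm with polylogarithmic memory cannot afford.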

Streaming pattern matching

The pattern starts with z, and there are no more z's in the pattern.

[Figure: P begins with z. Each time a z arrives in the text T, start signing; after m characters, compare the signature with the signature of P.]
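A sketch of this easy case, under the assumption that the first character of the pattern occurs nowhere else in it. Only a signature, a counter and a flag are kept while reading the text. The constants `Q` and `R` and the name `stream_unique_first` are hypothetical choices for the illustration.

```python
from typing import Iterable, Iterator

Q = (1 << 61) - 1  # assumed prime modulus
R = 256            # assumed base; a random r in the randomized setting


def stream_unique_first(stream: Iterable[str], pattern: str) -> Iterator[int]:
    """Yield end positions of matches, keeping only a signature and a counter."""
    z, m = pattern[0], len(pattern)
    if pattern.count(z) != 1:
        raise ValueError("sketch only covers patterns whose first character is unique")
    target = 0
    for ch in pattern:  # preprocessing: signature of P, the pattern is not kept
        target = (target * R + ord(ch)) % Q
    sig, count, signing = 0, 0, False
    for i, ch in enumerate(stream):
        if ch == z:  # a new z: (re)start signing from here
            sig, count, signing = ord(ch) % Q, 1, True
        elif signing:
            sig, count = (sig * R + ord(ch)) % Q, count + 1
        if signing and count == m:
            if sig == target:
                yield i
            signing = False  # nothing can match again before the next z


print(list(stream_unique_first("xzabzabczab", "zab")))  # [3, 6, 10]
```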

No z

There is a prefix U such that U appears only once in the pattern.

[Figure: P begins with U. Each time U is found in the text T, start signing.]

|U| <= m/2. Seek U in recursion.

No small U

Look at the first m/2 characters of P. They appear again somewhere in the pattern.

Option 1: P = v v v v v v v v followed by a prefix of v.

Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w.

|v| <= m/2

Solving this case

Option 2: P = v v v v w, |v| <= m/2

Search in recursion for v, and count how many times you found it.

Sign on w.

[Figure: the text T contains v v v v; start signing after the last v and compare the signature with that of w.]

Using O(log m) signatures and counters in the worst case.

Time = O(log m) in the worst case.

[Figure: the repetitions of v span more than m/2; the part that is signed is shorter than m/2.]

Pattern matching up to k mistakes

1-mismatch

Chinese Remainder Theorem

Let n and m be coprime.

If a mod n = b mod n and a mod m = b mod m, then a mod nm = b mod nm.
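The consequence that matters for mismatches is the contrapositive: two different positions below the product of the moduli must differ modulo at least one of them. A small check, with the hypothetical helper `separating_modulus`:

```python
from math import gcd
from typing import Optional


def separating_modulus(a: int, b: int, moduli: list[int]) -> Optional[int]:
    """Return the first modulus q with a % q != b % q, or None if they agree on all."""
    for q in moduli:
        if a % q != b % q:
            return q
    return None


moduli = [2, 3, 5]  # pairwise coprime, product 30
assert all(gcd(x, y) == 1 for x in moduli for y in moduli if x != y)
print(separating_modulus(1, 7, moduli))   # 5
print(separating_modulus(0, 30, moduli))  # None: 0 and 30 agree modulo 2, 3 and 5
```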

1-mismatch

p1,p2,p3,…,pm

mod 2: p1,p3,p5,… and p2,p4,p6,…

mod 3: p1,p4,p7,… and p2,p5,p8,… and p3,p6,p9,…

Take primes q1,q2,q3,…,ql such that $\prod_{i=1}^{l} q_i \ge m$.
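A sketch of the splitting step, assuming the primes are simply the smallest ones whose product reaches m. Both function names are hypothetical, and positions are 0-based here while the slide counts from 1.

```python
def first_primes_covering(m: int) -> list[int]:
    """Smallest primes 2, 3, 5, ... whose product is at least m."""
    primes, prod, cand = [], 1, 2
    while prod < m:
        if all(cand % p for p in primes):  # not divisible by any smaller prime
            primes.append(cand)
            prod *= cand
        cand += 1
    return primes


def subpatterns(pattern: str, q: int) -> list[str]:
    """Group for modulus q: subpattern s holds the characters at positions s, s+q, s+2q, ..."""
    return [pattern[s::q] for s in range(q)]


print(first_primes_covering(10))    # [2, 3, 5]
print(subpatterns("abcdefg", 2))    # ['aceg', 'bdf']
print(subpatterns("abcdefg", 3))    # ['adg', 'be', 'cf']
```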

1-mismatch

mod 2: compare p1,p3,p5,… and p2,p4,p6,… with t1,t3,t5,… and t2,t4,t6,…

mod 3: p1,p4,p7,… and p2,p5,p8,… and p3,p6,p9,…

Overall, the number of subpatterns is the sum of all the primes.

Problem

[Figure: for mod 2, p1,p3,p5,… is compared with t1,t3,t5,… and p2,p4,p6,… with t2,t4,t6,…; after the text moves one position, p1,p3,p5,… has to be compared with t2,t4,t6,….]

When do we compare?

For each qi we start to compare at each alignment 0 ≤ σ < qi.

Space complexity

For each qi we run our algorithm qi times, once for each alignment.

For each alignment we run it again qi times, once for each shift.

Overall: $\sum_{i=1}^{l} q_i^2 \log m \in O\left(\frac{\log^4 m}{\log\log m}\right)$

Time complexity

Each character goes to just one alignment for each shift.

Overall: $\sum_{i=1}^{l} q_i \log m \in O\left(\frac{\log^3 m}{\log\log m}\right)$

1-mismatch

Lemma 1: There is exactly one mismatch if and only if there is exactly one subpattern in each group that does not match.

Proof idea: C.R.T.
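The lemma can be checked offline on a single window. This sketch compares the subpatterns directly instead of comparing their signatures, so it only illustrates the counting argument; the function names are hypothetical, the window is assumed to have the same length as the pattern, and positions are 0-based.

```python
from typing import Optional


def mismatching_classes(pattern: str, window: str, q: int) -> list[int]:
    """Residues s (mod q) whose subpattern differs from the same positions of the window."""
    return [s for s in range(q) if pattern[s::q] != window[s::q]]


def locate_single_mismatch(pattern: str, window: str, primes: list[int]) -> Optional[int]:
    """Return the mismatch position if every group has exactly one bad subpattern, else None."""
    residues = []
    for q in primes:
        bad = mismatching_classes(pattern, window, q)
        if len(bad) != 1:
            return None
        residues.append((q, bad[0]))
    # the product of the primes is at least m, so the residues fix one position below m
    for pos in range(len(pattern)):
        if all(pos % q == s for q, s in residues):
            return pos
    return None


primes = [2, 3, 5]  # product 30 >= m = 10
print(locate_single_mismatch("abcdefghij", "abcdeXghij", primes))  # 5
print(locate_single_mismatch("abcdefghij", "XbcdefYhij", primes))  # None (two mismatches)
```

In the second call the two mismatches (positions 0 and 6) fall into the same subpattern modulo 2 and modulo 3, and only the group for 5 separates them.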

Pattern matching up to k mistakes

Group testing/ Random selector…

A black box for online approximate pattern matching


The idea

We will split the pattern into log(m) consecutive subpatterns:

P = p1, p2, p3, …, pm-3, pm-2, pm-1, pm

P1 = pm

P2 = pm-2, pm-1

P4 = pm-6, pm-5, pm-4, pm-3

…

Pm/2 = p1, p2, p3, …, pm/2
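A sketch of this split, assuming blocks are cut from the right end with lengths 1, 2, 4, and so on. The name `doubling_suffix_split` is hypothetical, and letting the leftmost block take whatever remains is a choice made here for patterns whose length is not of the form 2^k - 1.

```python
def doubling_suffix_split(pattern: str) -> list[str]:
    """Split from the right into blocks of length 1, 2, 4, ...; the leftmost block takes the rest."""
    blocks, end, size = [], len(pattern), 1
    while end > 0:
        start = max(0, end - size)
        blocks.append(pattern[start:end])
        end, size = start, size * 2
    return blocks


print(doubling_suffix_split("abcdefg"))   # ['g', 'ef', 'abcd']
print(doubling_suffix_split("abcdefgh"))  # ['h', 'fg', 'bcde', 'a']
```

The number of blocks is O(log m), which is where the log(m) subpatterns come from.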

Bring it online

Let's look at a subpattern of length m', Pm'.

When we get to the i'th character of the text, where is Pm' aligned?

[Figure: alignment of pm, pm-1, pm-2, …, Pm', … against the text at ti]

Conclusion 1: We need to know DIFF(Pm',T(i-m',i)) only at position i+m' of the text.

The idea…

For each subpattern of length m', we partition the text into overlapping substrings of length 2m'.

[Figure: the text cut into pieces of length m'; every two consecutive pieces form one substring of length 2m'.]

The idea…

For each subpattern of length m' we run the offline algorithm on each partition of the text separately.

This ensures that we get the differences on time.

If i = 2lm' or i = 2lm'+m' for some l, run the offline algorithm on the last 2m' characters.

We will get all the differences for this section.
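A sketch of the overlapping-block idea for one subpattern. The offline algorithm is replaced by a naive Hamming-distance routine (`hamming_offline`, a stand-in chosen for this illustration), and the text length is assumed to be a multiple of m' so that the last block is complete.

```python
def hamming_offline(text: str, sub: str) -> list[int]:
    """Stand-in offline algorithm: Hamming distance of sub against every window of text."""
    k = len(sub)
    return [sum(a != b for a, b in zip(text[i:i + k], sub))
            for i in range(len(text) - k + 1)]


def blockwise_distances(text: str, sub: str) -> dict[int, int]:
    """Distance for each window end position (0-based), computed block by block.

    Whenever the number of characters read is a multiple of m' (and at least 2m'),
    the offline algorithm runs on the last 2m' characters.
    """
    k, out = len(sub), {}
    for read in range(2 * k, len(text) + 1, k):
        block_start = read - 2 * k
        for off, d in enumerate(hamming_offline(text[block_start:read], sub)):
            out[block_start + off + k - 1] = d
    return out


text, sub = "abcabdabcabc", "abc"
full = {i + len(sub) - 1: d for i, d in enumerate(hamming_offline(text, sub))}
print(blockwise_distances(text, sub) == full)  # True
```

Because consecutive blocks overlap by m' characters, every window of length m' lies entirely inside some block, so no alignment is missed.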

Running Time

T(n,m) = nT(m) is the running time of the offline algorithm.

For each subpattern of length m' we get $\frac{n}{m'} - 1$ overlapping partitions.

Total time for each subpattern: $O\left(\frac{n}{m'} \cdot 2m' \cdot T(m')\right) = O(nT(m'))$

Total time: $O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1})\right)$

The problem

We saw that overall the time is good. But:

[Figure: for m' = m/2 the partitions have length 2m' = m; the character ti falls between positions 2tm'+m' and 2(t+1)m'.]

We must wait until the run of the offline algorithm on Pm/2 and the last m characters finishes before we can return the answer for ti. => (m/2)T(m) time!

The solution

We will split the text into partitions of length 1.5m'.

[Figure: the text cut into pieces of length m'; the overlapping partitions have length 1.5m'.]

The solution…

The latest we will get DIFF(Pm',T(i-m',i)) is at index i+m'/2.

By Conclusion 1, we need this difference only at position i+m' of the text, so we can wait m'/2 characters before we need it.

Spreading the work

So we can spread the work over the next m'/2 characters.

[Figure: the text in pieces of length m'/2; work on p1, then work on p2, then work on p3, with the difference of P1 needed only after the work on p1 is done.]

Overall, we can spread the work for a specific subpattern evenly between all the characters of the text.

All that is left is to check that the running time does not change.
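One way to picture the spreading is with a generator that produces the offline results one unit at a time, advanced by a fixed budget on every arriving character. This is a sketch under simplifying assumptions: one unit of work is one window distance, and `offline_units` and `run_spread` are hypothetical names, not part of the black box itself.

```python
from itertools import islice
from typing import Iterator


def offline_units(block: str, sub: str) -> Iterator[int]:
    """Offline Hamming distances for one block, produced one window (one unit) at a time."""
    k = len(sub)
    for i in range(len(block) - k + 1):
        yield sum(a != b for a, b in zip(block[i:i + k], sub))


def run_spread(job: Iterator[int], total_units: int, arrivals: int):
    """Advance `job` over `arrivals` characters, at most ceil(total_units / arrivals) units each."""
    budget = -(-total_units // arrivals)
    results, per_arrival = [], []
    for _ in range(arrivals):
        chunk = list(islice(job, budget))  # bounded work for this character
        results.extend(chunk)
        per_arrival.append(len(chunk))
    return results, per_arrival


sub, block = "abab", "ababab"            # m' = 4, block of length 1.5 m'
units = len(block) - len(sub) + 1        # 3 windows
print(run_spread(offline_units(block, sub), units, len(sub) // 2))  # ([0, 4, 0], [2, 1])
```

No arrival does more than the fixed budget, which is the non-amortized guarantee the slides ask for.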

Running Time

T(n,m) = nT(m) is the running time of the offline algorithm.

For each subpattern of length m' we now get $\frac{n}{m'/2}$ overlapping partitions.

Total time for each subpattern: $O\left(\frac{n}{m'} \cdot 2m' \cdot T(m')\right) = O(nT(m'))$

Total time for all the text: $O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1})\right)$

No change!

Running Time…

By spreading the work we get a total running time for each character of $O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1}) / n\right)$.
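As a worked instance of this bound, assume an offline algorithm with $T(n,m) = O(n \log m)$. This FFT-style running time is an assumption made for the example, not a figure taken from the slides. Then the sum telescopes into a sum of the exponents:

```latex
\sum_{j=1}^{\log m} \frac{T(n, 2^{j-1})}{n}
  = \sum_{j=1}^{\log m} O\!\left(\log 2^{j-1}\right)
  = O\!\left(\sum_{j=1}^{\log m} (j-1)\right)
  = O(\log^2 m)
```

So an offline $O(n \log m)$ algorithm becomes an online one with $O(\log^2 m)$ time per character, an extra logarithmic factor over the offline time per character.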

Conclusion

We give a space lower bound for deterministic online pattern matching.

We give a black box algorithm that can adapt any offline algorithm to an online algorithm, using only O(m) space and taking $O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1}) / n\right)$ time per character.
