DICTIONARY MATCHING WITH ONE GAP
Amihood Amir, Avivit Levy, Ely Porat and B. Riva Shalom
CPM 2014

Dec 15, 2015


CPM 2014 - MOSCOW


MIND THE GAP!


OUTLINE

The DMG (Dictionary Matching with one Gap) Problem
Motivation
Previous Work
Bidirectional Suffix Trees Solution
Lookup Table Addition
Open Problems


THE DMG PROBLEM

A gapped pattern is a pattern P of the form:
P_1 {α_1,β_1} P_2 {α_2,β_2} … P_{k-1} {α_{k-1},β_{k-1}} P_k

Each P_j is a string over alphabet Σ, and {α_j,β_j} is a sequence of at least α_j and at most β_j don't cares (written @).

Example: aba{3,6}cbb matches
aba@@@cbb
aba@@@@cbb
aba@@@@@cbb
aba@@@@@@cbb
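The example above can be reproduced with a small sketch. The helper names `expand_gapped` and `gapped_regex` are hypothetical and only illustrate the single-gap case: one lists the don't-care expansions, the other builds an equivalent regular expression with a bounded repetition.

```python
import re

def expand_gapped(first: str, second: str, alpha: int, beta: int,
                  dont_care: str = "@") -> list[str]:
    """List every expansion of first{alpha,beta}second, one per gap length."""
    return [first + dont_care * g + second for g in range(alpha, beta + 1)]

def gapped_regex(first: str, second: str, alpha: int, beta: int) -> re.Pattern:
    """Compile a regex matching first, then alpha..beta arbitrary characters, then second."""
    gap = ".{%d,%d}" % (alpha, beta)
    return re.compile(re.escape(first) + gap + re.escape(second), re.DOTALL)

expansions = expand_gapped("aba", "cbb", 3, 6)
for e in expansions:
    print(e)
```

The regex form is convenient for checking a single pattern against a text, but it does not scale to a whole dictionary, which is the setting the rest of the talk addresses.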


THE DMG PROBLEM

The DMG problem is:

Preprocess: A dictionary D of d gapped patterns P_1, …, P_d over alphabet Σ.

Query: A text T of length n over alphabet Σ.

Output: All locations in T where a dictionary gapped pattern ends.

We focus on DMG with a single gap.


EXAMPLE

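Dictionary:
P1 = aba {3,6} cbb
P2 = ab {3,6} bbac
P3 = aa {3,6} ac

Query text:
position: 1 2 3 4 5 6 7 8 9 10 11
text:     a b a a b a c b b a  c

P_{i,1} and P_{i,2} denote the first and second subpattern of P_i; for example, P_{1,1} = aba and P_{1,2} = cbb.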


First = ∪_{1≤i≤d} {P_{i,1}}
Second = ∪_{1≤i≤d} {P_{i,2}}
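As a baseline for this example, here is a brute-force sketch that splits the dictionary into First and Second and tries every pattern, every occurrence of its first part, and every gap length. It is not the algorithm of the talk; the function names and the representation of a pattern as a `(first, second)` pair are assumptions made for illustration.

```python
def split_dictionary(dictionary):
    """Return the sets First and Second of a single-gap dictionary."""
    first = {p1 for p1, _ in dictionary}
    second = {p2 for _, p2 in dictionary}
    return first, second

def naive_dmg(text, dictionary, alpha, beta):
    """Return sorted (end_position, pattern_index) pairs, both 1-based."""
    hits = set()
    n = len(text)
    for i, (p1, p2) in enumerate(dictionary, start=1):
        for s in range(n - len(p1) + 1):            # 0-based start of the first part
            if not text.startswith(p1, s):
                continue
            for g in range(alpha, beta + 1):        # every allowed gap length
                t = s + len(p1) + g                 # 0-based start of the second part
                if text.startswith(p2, t):
                    hits.add((t + len(p2), i))      # 1-based end position
    return sorted(hits)

dictionary = [("aba", "cbb"), ("ab", "bbac"), ("aa", "ac")]
print(split_dictionary(dictionary))
print(naive_dmg("abaabacbbac", dictionary, 3, 6))
```

On the query text above this reports P1 ending at position 9 and both P2 and P3 ending at position 11.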


MOTIVATION

Computational biology.

A renewed interest due to cyber security: network intrusion detection systems perform protocol analysis, content searching and content matching to detect harmful software.

Malware may appear in several packets!


PREVIOUS WORK

The gapped pattern matching problem has been studied for a few decades, e.g. [Myers, JACM 1992], [Navarro & Raffinot, Algorithmica 2004], [Bille & Thorup, ICALP 2009], [Bille & Thorup, SODA 2010], [Morgante et al., JCB 2005], [Rahman et al., COCOON 2006], [Bille et al., TCS 2012].

The DMG problem has not been studied enough! [Kucherov & Rusinowitch, TCS 1997], [Zhang et al., IPL 2010]: no bounds on the length of the gap.


BI-DIRECTIONAL SUFFIX TREES ALGORITHM

Gapped pattern: a b{3,6}b b a c

Query: a b a a b a c b b a c


BI-DIRECTIONAL SUFFIX TREES ALGORITHM

Idea: view as in [Amir et al., JAL 2000]

Gapped patterns:
P1 = aba{3,6}abac
P2 = aba{3,6}bba
P3 = ab{3,6}baa

Query: a b a a b a c b b a c

Use suffix tree T_S of Second (for the text to the right of the gap).
Use suffix tree T_F^R of First^R, the reversed First subpatterns (for the text to the left of the gap).


BI-DIRECTIONAL SUFFIX TREES ALGORITHM

For each text location l:
  Insert t_l t_{l+1} … t_n into T_S (reaching node h) to find the labels on the path to h.
  This finds every P_{i,2} starting at location l.
  For f = l - β - 1 to l - α - 1:
    Insert t_f t_{f-1} … t_1 into T_F^R (reaching node g) to find the labels on the path to g.
    This finds every P_{i,1} ending at location f.
    Output the intersection (for end locations).
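The loop above can be sketched in a few lines. This is a simplification under stated assumptions: plain tries built from nested dicts stand in for the suffix trees T_S and T_F^R, the intersection is a direct set intersection rather than a range query or table lookup, and all function names are hypothetical.

```python
def build_trie(indexed_words):
    """Trie as nested dicts; the key None stores indices of patterns ending at a node."""
    root = {}
    for idx, word in indexed_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault(None, set()).add(idx)
    return root

def labels_on_path(trie, chars):
    """Walk the trie along chars and collect the pattern indices met on the path."""
    found, node = set(), trie
    for ch in chars:
        node = node.get(ch)
        if node is None:
            break
        found |= node.get(None, set())
    return found

def bidirectional_dmg(text, dictionary, alpha, beta):
    """Return sorted (end_position, pattern_index) pairs, both 1-based."""
    t_second = build_trie((i, p2) for i, (_, p2) in enumerate(dictionary, 1))
    t_first_rev = build_trie((i, p1[::-1]) for i, (p1, _) in enumerate(dictionary, 1))
    len2 = {i: len(p2) for i, (_, p2) in enumerate(dictionary, 1)}
    hits = set()
    for l in range(1, len(text) + 1):               # l: 1-based start of a second part
        seconds = labels_on_path(t_second, text[l - 1:])
        if not seconds:
            continue
        # f runs over l - beta - 1 .. l - alpha - 1, clipped to valid positions
        for f in range(max(1, l - beta - 1), l - alpha):
            firsts = labels_on_path(t_first_rev, reversed(text[:f]))
            for i in seconds & firsts:
                hits.add((l + len2[i] - 1, i))
    return sorted(hits)

dictionary = [("aba", "cbb"), ("ab", "bbac"), ("aa", "ac")]
print(bidirectional_dmg("abaabacbbac", dictionary, 3, 6))
```

The set intersection `seconds & firsts` is the step that the following slides speed up, first with range queries and then with a lookup table.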


BI-DIRECTIONAL SUFFIX TREES ALGORITHM - INTERSECTION

Patterns: {(1,4), (2,9), (3,7), …, (6,5), …}

[Figure: T_F^R with nodes 1, 3, 6, 9 and query node g (Range: [1,9]); T_S with nodes 2, 5, 7 and query node h (Range: [2,7]).]


BI-DIRECTIONAL SUFFIX TREES ALGORITHM (CONTINUED)

Intersection via range queries:

Range: [1,9]
Range: [2,7]

Points on the grid: (1,4), (3,7), (6,5), (8,8), (2,9)
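The grid view can be illustrated with a small orthogonal range-reporting sketch. It assumes that the first coordinate of each point is constrained by [1,9] and the second by [2,7], which is one reading of the figure. It uses a sorted list with binary search on x and a linear filter on y, not the data structure of [Chan et al., SoCG 2011]; the class name `Grid` is hypothetical.

```python
from bisect import bisect_left, bisect_right

class Grid:
    """Points sorted by x; a query narrows by x with binary search, then filters by y."""

    def __init__(self, points):
        self.points = sorted(points)
        self.xs = [x for x, _ in self.points]

    def report(self, x_range, y_range):
        lo = bisect_left(self.xs, x_range[0])
        hi = bisect_right(self.xs, x_range[1])
        return [(x, y) for x, y in self.points[lo:hi]
                if y_range[0] <= y <= y_range[1]]

grid = Grid([(1, 4), (3, 7), (6, 5), (8, 8), (2, 9)])
print(grid.report((1, 9), (2, 7)))
```

Under that reading, the points (8,8) and (2,9) fall outside the rectangle because their second coordinate exceeds 7.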


TIME & SPACE

Preprocessing Time:
  Dictionary segments suffix tree and reverse suffix tree: O(|D|)
  Preprocessing grid for range queries: O(d log d) [Chan et al., SoCG 2011]

Preprocessing Space:
  Dictionary segments suffix tree and reverse suffix tree: O(|D|)
  Space for grid: O(d log d) [Chan et al., SoCG 2011]


TIME & SPACE

Query Time:
  For each end text location, we try every gap size: a factor of (β - α).
  The number of range queries is the number of vertical paths in a given path: O(log^2 min{d, log |D|}).
  A range query costs: O(log log d + occ) [Chan et al., SoCG 2011].

Total: O(n(β - α) log log d log^2 min{d, log |D|} + occ).
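Writing α and β for the gap bounds, the total can be read as a product of the factors listed above, with the output size occ added once over all queries. The grouping below is one way to see how the terms combine, not a derivation taken from the talk.

```latex
\[
\underbrace{n}_{\text{text locations}}
\cdot
\underbrace{(\beta-\alpha)}_{\text{gap sizes}}
\cdot
\underbrace{O\!\left(\log^{2}\min\{d,\log|D|\}\right)}_{\text{range queries}}
\cdot
\underbrace{O(\log\log d)}_{\text{cost per query}}
\;+\; occ
\;=\;
O\!\left(n(\beta-\alpha)\,\log\log d\,\log^{2}\min\{d,\log|D|\} + occ\right)
\]
```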


LOOKUP TABLE ALGORITHM

Idea: Instead of using range queries in a grid to compute the intersection, we use a pre-computed lookup table.

Enables intersection in O(occ) time.

Total query time becomes: O(n(β - α) + occ).


LOOKUP TABLE ALGORITHM

Inter[g,h] = all i such that P_{i,1}^R appears on the path from the root of T_F^R to node g and P_{i,2} appears on the path from the root of T_S to node h.

[Figure: T_F^R with nodes 1, 3, 6, 9 and node g marked; T_S with nodes 2, 5, 7 and node h marked.]

P1 = (1,4), P2 = (2,9), P3 = (3,7), P4 = (3,2), …, P6 = (6,5), P7 = (9,6)

Inter[3,5] = {4}


LOOKUP TABLE ALGORITHM

P1 = (1,4), P2 = (2,9), P3 = (3,7), P4 = (3,2), …, P6 = (6,5), P7 = (9,6)

Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}
Inter[9,7] = {3,4,6}


LOOKUP TABLE ALG.


P1 = (1,4), P2 = (2,9), P3 = (3,7), P4 = (3,2), …, P6 = (6,5), P7 = (9,6)

Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}

[Figure: the lookup table Inter, indexed by a node g of T_F^R and a node h of T_S.]


LOOKUP TABLE ALGORITHM

Preprocessing:
  Time: The table can be computed using dynamic programming in O(d^2 ovr + |D|) time, where ovr is the number of subpatterns that include another subpattern as a prefix or suffix.
  Space: O(d^2 + |D|).

Query time: O(n(β - α) + occ).
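One simple way to fill such a table is a recurrence over parent nodes: Inter[g,h] is the union of Inter[parent(g),h], Inter[g,parent(h)] and the patterns sitting exactly at (g,h). The sketch below applies it to the node pairs of the earlier example. Several things here are assumptions: the paths root, 1, 3, 6, 9 in T_F^R and root, 2, 5, 7 in T_S are a reading of the figure, node 0 is a made-up root, the function name is hypothetical, and the recurrence is not necessarily the dynamic program used by the authors. Storing a full set per cell also does not meet the stated space bound.

```python
def build_inter(parent_f, parent_s, patterns):
    """
    parent_f, parent_s: child -> parent maps of the two trees (the root maps to None).
    patterns: {i: (g_i, h_i)}, the nodes where P_{i,1}^R and P_{i,2} end.
    Returns a dict mapping (g, h) to the set of pattern indices in Inter[g, h].
    """
    exact = {}
    for i, (g, h) in patterns.items():
        exact.setdefault((g, h), set()).add(i)

    def depth(parent, v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    # Process nodes top-down so both predecessor cells exist before they are read.
    order_f = sorted(parent_f, key=lambda v: depth(parent_f, v))
    order_s = sorted(parent_s, key=lambda v: depth(parent_s, v))

    inter = {}
    for g in order_f:
        for h in order_s:
            above_g = inter.get((parent_f[g], h), set())
            above_h = inter.get((g, parent_s[h]), set())
            inter[g, h] = above_g | above_h | exact.get((g, h), set())
    return inter

parent_f = {0: None, 1: 0, 3: 1, 6: 3, 9: 6}   # assumed path in T_F^R, 0 = root
parent_s = {0: None, 2: 0, 5: 2, 7: 5}         # assumed path in T_S, 0 = root
patterns = {1: (1, 4), 2: (2, 9), 3: (3, 7), 4: (3, 2), 6: (6, 5), 7: (9, 6)}

inter = build_inter(parent_f, parent_s, patterns)
print(sorted(inter[3, 5]), sorted(inter[3, 7]), sorted(inter[6, 7]), sorted(inter[9, 7]))
```

At query time, once the nodes g and h are known, reading `inter[g, h]` returns the matching patterns directly, which is what removes the range-query factors from the query bound.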


OUR RESULTS

Bi-directional suffix trees & range queries:
  Preprocessing time: O(d log d + |D|).
  Space: O(d log d + |D|).
  Query time: O(n(β - α) log log d log^2(min{d, log |D|}) + occ).

Bi-directional suffix trees & lookup table:
  Preprocessing time: O(d^2 ovr + |D|).
  Space: O(d^2 + |D|).
  Query time: O(n(β - α) + occ).


OPEN PROBLEMS

Generalizing to k gaps

Reducing the dependency on the gap size (β - α)

Scalability to different gap bounds in the dictionary

Online algorithm


THANK YOU!
