Exact and Approximate Pattern in the Streaming Model. Presented by Tanushree Mitra. Benny Porat and Ely Porat, FOCS 2009.
Jan 16, 2016

Page 1: Exact and Approximate Pattern in the Streaming Model

Exact and Approximate Pattern in the Streaming Model

Presented by - Tanushree Mitra

Benny Porat and Ely Porat 2009 FOCS

Page 2: Exact and Approximate Pattern in the Streaming Model

Problem Statement

• Find all occurrences of a pattern P of length m as a contiguous substring of a text string T of length n, where m < n.

Page 3: Exact and Approximate Pattern in the Streaming Model

Contributions

• Exact pattern matching - A fully online randomized algorithm for the classical pattern matching problem.

Time complexity - O(log m) per character that arrives

Space complexity - O(log m), breaking the O(m) barrier that held for this problem for a long time.

• Approximate pattern matching - An algorithm for the pattern matching with k mismatches problem.

Time complexity - O(k^2 poly(log m)) per character

Space complexity - O(k^3 poly(log m))

Page 4: Exact and Approximate Pattern in the Streaming Model

Applications

• Monitoring Internet traffic

• Computational Biology

• Large Scale web searching

• Viruses and Malware detection

• Automatic Stock market analysis

• Robotics

Page 5: Exact and Approximate Pattern in the Streaming Model

Background

Brute Force Algorithm

– Slide the pattern along the text

– Compare it to the corresponding portion of the text

Time Complexity – O(mn)

A speedup is possible in each of these 2 steps.

• Sliding step speedup by pre-processing the pattern

– Knuth-Morris-Pratt algorithm

– Boyer-Moore algorithm

– Ukkonen’s algorithm to construct suffix trees

• Comparison step speedup – Rabin-Karp algorithm.
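As a point of reference for the two steps named above, here is a minimal Python sketch of the brute-force algorithm. The function name `brute_force_matches` is an illustrative choice, not something from the slides.

```python
from typing import List


def brute_force_matches(text: str, pattern: str) -> List[int]:
    """Return every 0-based start index where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):            # sliding step: try each alignment
        if text[start:start + m] == pattern:  # comparison step: up to m character checks
            matches.append(start)
    return matches
```

Each of the roughly n alignments can cost up to m character comparisons, which is where the O(mn) bound comes from.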

Page 6: Exact and Approximate Pattern in the Streaming Model

Quick History

Page 7: Exact and Approximate Pattern in the Streaming Model

The Intuition

• Combine the key features of KMP and the Rabin-Karp algorithms to achieve an online algorithm that uses less space.

The Idea

• When Rabin-Karp’s algorithm is done with the i’th character, and advances to the next position in the text, it does not use any of the information gathered.

• The KMP algorithm, on the other hand, puts that information to good use.
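To make "puts that information to good use" concrete, the following sketch shows a character-at-a-time KMP matcher. It is only a baseline: it stores the whole pattern and its failure table, so it uses O(m) space, not the O(log m) of the algorithm presented here. The class name `StreamingKMP` and the method `feed` are hypothetical names chosen for this illustration, and the pattern is assumed to be non-empty.

```python
class StreamingKMP:
    """Online KMP baseline: O(m) space, one character consumed per call."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.fail = self._failure(pattern)
        self.state = 0  # length of the longest pattern prefix ending at the current text position
        self.pos = 0    # number of text characters consumed so far

    @staticmethod
    def _failure(p):
        # fail[L] = length of the longest proper border of p[:L]
        fail = [0] * (len(p) + 1)
        k = 0
        for i in range(1, len(p)):
            while k > 0 and p[i] != p[k]:
                k = fail[k]
            if p[i] == p[k]:
                k += 1
            fail[i + 1] = k
        return fail

    def feed(self, ch):
        """Consume one character; return the start index of a match ending here, else None."""
        p, k = self.pattern, self.state
        if k == len(p):          # after a full match, fall back to the longest border
            k = self.fail[k]
        while k > 0 and ch != p[k]:
            k = self.fail[k]     # reuse what is already known instead of rescanning
        if ch == p[k]:
            k += 1
        self.state = k
        self.pos += 1
        return self.pos - len(p) if k == len(p) else None
```

Feeding the text `abababa` one character at a time to `StreamingKMP("aba")` reports matches starting at indices 0, 2 and 4.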

Page 8: Exact and Approximate Pattern in the Streaming Model

Definitions - Fingerprints

String S → Fingerprint → ф(S)

Polynomial Fingerprint

ф_{r,p}(S) = (s_1 r + s_2 r^2 + … + s_l r^l) mod p, where p ∈ Θ(n^4) and r ∈ F_p

False Positives

If S1 ≠ S2, then the probability that ф_{r,p}(S1) = ф_{r,p}(S2) is < 1/n^3

Sliding Fingerprint
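The polynomial fingerprint above can be kept up to date as a window slides. The sketch below is a Rabin-Karp style illustration, not the streaming algorithm itself: it keeps the text in memory so that it can look up the character leaving the window. The names `SlidingFingerprint`, `find_matches` and `PRIME` are assumptions made for this example, and a fixed Mersenne prime stands in for a prime p ∈ Θ(n^4). It assumes a non-empty pattern and Python 3.8+ (for `pow(r, -1, p)`).

```python
import random

PRIME = (1 << 61) - 1  # known Mersenne prime, standing in for p in Theta(n^4)


class SlidingFingerprint:
    """Maintains (s_1*r + s_2*r^2 + ... + s_l*r^l) mod p for a window s_1..s_l."""

    def __init__(self, r, p=PRIME):
        self.r, self.p = r, p
        self.r_inv = pow(r, -1, p)  # modular inverse of r
        self.value = 0
        self.length = 0
        self.next_power = r % p     # r^(length+1) mod p

    def append(self, symbol):
        # The new symbol becomes s_{l+1} and is weighted by r^(l+1).
        self.value = (self.value + symbol * self.next_power) % self.p
        self.next_power = (self.next_power * self.r) % self.p
        self.length += 1

    def drop_first(self, symbol):
        # Remove s_1*r, then divide by r so every remaining exponent drops by one.
        self.value = ((self.value - symbol * self.r) * self.r_inv) % self.p
        self.next_power = (self.next_power * self.r_inv) % self.p
        self.length -= 1


def find_matches(text, pattern, r=None, p=PRIME):
    """Report start indices whose window fingerprint equals the pattern's."""
    if r is None:
        r = random.randrange(2, p)  # random r is what gives the false-positive bound
    m = len(pattern)
    target = SlidingFingerprint(r, p)
    for ch in pattern:
        target.append(ord(ch))
    window = SlidingFingerprint(r, p)
    matches = []
    for i, ch in enumerate(text):
        window.append(ord(ch))
        if window.length > m:
            window.drop_first(ord(text[i - m]))
        if window.length == m and window.value == target.value:
            matches.append(i - m + 1)
    return matches
```

Every true occurrence is always reported; a reported position that is not an occurrence is a false positive, which happens only with small probability over the random choice of r.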

Page 9: Exact and Approximate Pattern in the Streaming Model

Definitions - period_{P_l}

• Period - A prefix S_p = s_1, s_2, …, s_l of a string S = s_1, s_2, …, s_n is defined to be a period of S iff s_i = s_{i+l} for all 1 ≤ i ≤ n - l

• period_{P_l} - For a pattern P = p_1, p_2, …, p_m, the prefix of length l is P_l = p_1, p_2, …, p_l, 0 ≤ l ≤ m. The shortest period of P_l is period_{P_l}

• If P_l matches the text at a given index i, then P_l cannot match at any index strictly between i and i + |period_{P_l}|

Put the information to good use
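The length of the shortest period of every prefix can be read off the KMP failure (border) table, since a prefix of length l with longest proper border of length b has shortest period length l - b. The sketch below uses this standard fact; the function name `shortest_period_lengths` is an illustrative assumption.

```python
def shortest_period_lengths(pattern):
    """Return a list where entry l is the shortest period length of pattern[:l]."""
    m = len(pattern)
    border = [0] * (m + 1)  # border[l] = longest proper border of pattern[:l]
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k]
        if pattern[i] == pattern[k]:
            k += 1
        border[i + 1] = k
    # Shortest period length = prefix length minus its longest border.
    return [l - border[l] for l in range(m + 1)]
```

For `abab` the prefixes `a`, `ab`, `aba`, `abab` have shortest period lengths 1, 2, 2, 2, so after a match of `abab` the next possible match is 2 positions later.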

Page 10: Exact and Approximate Pattern in the Streaming Model

The Idea

• A match at the i’th index indicates that we know the last m characters, so is there any point in saving them?

• Preprocessing phase

– Calculate the sliding fingerprint on the pattern, ф_P, and on its shortest period, ф_{period P}

• Online phase

– Slide the fingerprint ф over the entire text.

– While ф = ф_P, slide ф by |period_{P_l}| characters

– If we do not reach the end of the text, abort

False positives? Slide over |period_{P_l}| positions to one that could be a match. Very low probability of false positives.

The text and pattern should satisfy stringent restrictions

Page 11: Exact and Approximate Pattern in the Streaming Model

Go for subpatterns

• log m subpatterns

[Figure: the pattern p_1, p_2, p_3, …, p_{m-1}, p_m divided into subpatterns labeled P_1, P_2, P_4, …, P_{m/2}, drawn as blocks of length 1 (p_m), 2 (p_{m-2}, p_{m-1}), 4 (p_{m-6}, …, p_{m-3}), …, m/2 (p_1, …, p_{m/2}).]

• Starting point – Find a position in which the smallest subpattern matches the text. The smallest subpattern has length 1, so this is easy to find.

Page 12: Exact and Approximate Pattern in the Streaming Model

Algorithm

• Guidelines

– Find a position where P_i is a match, then try to match P_{i+1} from the same starting point as P_i

– If P_{i+1} does not match, use the information that P_i is a match.

– Check in jumps of |period_{P_i}| until there is no overlap with the area where P_i matches.

PROCESS

1. Initialize an empty sliding fingerprint ф.

2. For each character that arrives:

– Extend ф to include the new character

– If |ф| = 2^i and ф = ф_i for some 0 ≤ i ≤ log m:

• If ф overlaps the last match by at least |period_{P_{i-1}}| characters, slide ф by |period_{P_{i-1}}| characters.

• Else, abort.

What if there is a match that starts in the substring of the 1st process and ends in the substring of the 2nd process?

Page 13: Exact and Approximate Pattern in the Streaming Model

Exact_PM final Algorithm

Introduce Checkpoint

Checkpoint - Start a new process in the last checkpoint of each process

Algorithm

• Preprocessing

– Initialize an empty sliding fingerprint ф.

– For each 0 ≤ i ≤ log m, calculate the sliding fingerprint ф_i of P_i and ф_{i,period} of the period of P_i
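The preprocessing step could look roughly like the sketch below. It rests on an assumption drawn from the condition |ф| = 2^i in the process description: that P_i is the prefix of the pattern of length 2^i. A pattern whose length is not a power of two would need its full length handled as one extra entry, which this sketch leaves out. The names `preprocess`, `fingerprint`, `border_lengths` and `PRIME` are hypothetical.

```python
PRIME = (1 << 61) - 1  # known Mersenne prime, standing in for p in Theta(n^4)


def fingerprint(s, r, p=PRIME):
    """(s_1*r + s_2*r^2 + ... + s_l*r^l) mod p."""
    value, power = 0, r % p
    for ch in s:
        value = (value + ord(ch) * power) % p
        power = (power * r) % p
    return value


def border_lengths(pattern):
    """border[l] = length of the longest proper border of pattern[:l]."""
    border = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k]
        if pattern[i] == pattern[k]:
            k += 1
        border[i + 1] = k
    return border


def preprocess(pattern, r, p=PRIME):
    """For each prefix of length 2^i (assumed P_i), store its fingerprint and its period's."""
    border = border_lengths(pattern)
    table = []
    length = 1
    while length <= len(pattern):
        period_len = length - border[length]  # shortest period length of the prefix
        table.append({
            "length": length,
            "fp": fingerprint(pattern[:length], r, p),
            "period_len": period_len,
            "period_fp": fingerprint(pattern[:period_len], r, p),
        })
        length *= 2
    return table
```

The table has about log m entries of constant size each, in line with the O(log m) space quoted for the preprocessing fingerprints.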

Page 14: Exact and Approximate Pattern in the Streaming Model

Final Algorithm – Online Phase

• Online Phase

– Start a new process

– For each character that arrives, send it to all the processes

– If some process aborts, start a new process

– If some process A reaches a checkpoint:

• Stop the ‘son process’ of A (if it has one)

• Start a new ‘son process’ of A

Page 15: Exact and Approximate Pattern in the Streaming Model

Complexity

• Space

– All fingerprints from preprocessing use O(log m) space.

– Each process saves another fingerprint, and there can be at most log m processes in parallel

– OVERALL usage – O(log m) space

• Time

– Each process spends O(1) time for each new character that arrives

– At any time there are at most 3 log m processes running (1. process A, 2. son-process of A, 3. grandson-process of A; A has to die when the great-grandson of A is created)

– OVERALL running time – O(log m) per character

Page 16: Exact and Approximate Pattern in the Streaming Model

Pattern Matching (1-Mismatch)

• Partition the pattern and the text

• We need to align every partition of the pattern, P_{q_i,j}, to q_i text shifts

Page 17: Exact and Approximate Pattern in the Streaming Model

Intuition

• For each P_{q_i,j}, run q_i processes of Exact_PM.

• Process_{q_i,j,σ} - the σ’th process of the subpattern P_{q_i,j}, for 0 ≤ σ < q_i. It tries to match P_{q_i,j} to the text by considering the text as if it starts from the σ’th character. (τ mod q_i = j – σ)

• If for all q_i,

– numOfNotMatch_{q_i,σ} = 0, report ‘match’.

– numOfNotMatch_{q_i,σ} = 1, report ‘exactly 1-mismatch’.

– Otherwise, report ‘more than 1-mismatch’.
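The decision rule above can be illustrated offline for a single alignment. The sketch below is not the streaming procedure, which runs Exact_PM processes instead of comparing strings directly. It only shows why counting non-matching partitions per prime separates the three cases, assuming the product of the primes exceeds the pattern length so that any two distinct positions fall in different residue classes for at least one prime. The function name `classify_alignment` is a hypothetical choice.

```python
def classify_alignment(pattern, window, primes):
    """Classify one alignment by counting non-matching residue classes per prime."""
    if len(pattern) != len(window):
        raise ValueError("window must have the same length as the pattern")
    counts = []
    for q in primes:
        not_matching = 0
        for j in range(q):
            # Partition j for prime q: the characters at positions congruent to j mod q.
            if pattern[j::q] != window[j::q]:
                not_matching += 1
        counts.append(not_matching)
    if all(c == 0 for c in counts):
        return "match"
    if all(c == 1 for c in counts):
        return "exactly 1-mismatch"
    return "more than 1-mismatch"
```

With a single mismatch, exactly one partition fails for every prime. With two mismatches at positions a and b, some prime q has a mod q ≠ b mod q, so that prime sees two failing partitions.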

Page 18: Exact and Approximate Pattern in the Streaming Model

Complexity

• FACTS

– Run Σ_{i=1}^{l} q_i^2 processes of Exact_PM

– There exists a constant c such that for any x, there exist (x / log m) prime numbers between x and cx

– We have q_1, q_2, …, q_l groups of partitions. Each q_i is a prime number

• Space - O(log^4 m / log log m)

• Time - O(log^3 m / log log m)
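A back-of-the-envelope reading of how these bounds could arise, offered as a hedged sketch and not as the paper's proof. It assumes l = O(log m / log log m) primes, each of size O(log m), and that every Exact_PM process costs O(log m) space and O(log m) time per character, as on the earlier complexity slide. It also assumes each arriving character is forwarded to q_i processes per prime, because the rule τ mod q_i = j – σ fixes j once σ is chosen.

```latex
\begin{align*}
\text{processes} &= \sum_{i=1}^{l} q_i^2
  = O\!\left(\frac{\log m}{\log\log m}\cdot\log^2 m\right)
  = O\!\left(\frac{\log^3 m}{\log\log m}\right),\\
\text{space} &= O(\log m)\cdot O\!\left(\frac{\log^3 m}{\log\log m}\right)
  = O\!\left(\frac{\log^4 m}{\log\log m}\right),\\
\text{time per character} &= O(\log m)\cdot\sum_{i=1}^{l} q_i
  = O(\log m)\cdot O\!\left(\frac{\log^2 m}{\log\log m}\right)
  = O\!\left(\frac{\log^3 m}{\log\log m}\right).
\end{align*}
```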

Page 19: Exact and Approximate Pattern in the Streaming Model

Pattern Matching (k-Errors)

• Preprocessing Phase – Initialize a process Process_{q_i,j,σ} of 1-mismatch for each q_i ∈ {q_1, q_2, …, q_l}, 0 ≤ j < q_i and 0 ≤ σ < q_i

• Online Phase – Send the τ’th character to each Process_{q_i,j,σ} such that τ mod q_i = j – σ

• d = all mismatches from all processes that return ‘exactly 1-mismatch’

– If d > k, report more than k mismatches

Page 20: Exact and Approximate Pattern in the Streaming Model

Complexity

• Space

– Run Σ_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m) processes of 1-mismatch in parallel.

– Each process requires log^4 m space.

– OVERALL - O(k^3 poly(log m))

• Time

– The number of processes of the 1-mismatch algorithm is bounded by Σ_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m)

– Running time for each character - O(log^3 m)

– OVERALL - O(k^2 poly(log m))

Page 21: Exact and Approximate Pattern in the Streaming Model

Concluding Discussion

• The Two-Dimensional String-Matching Problem

• The String-Matching Problem with Wild Characters – Example: pattern P = {abc#abc#} is found in texts T1 = {abcdcadbaccabc}, T2 = {abcabc}

• String matching with weighted mismatch