Source: web.stanford.edu/class/cs97si/10-string-algorithms.pdf

String Algorithms

Jaehyun Park

CS 97SI, Stanford University

June 30, 2015

Page 2: String Algorithms - Stanford Universityweb.stanford.edu/class/cs97si/10-string-algorithms.pdf · String Algorithms Jaehyun Park CS 97SI Stanford University June 30, 2015. Outline

Outline

String Matching Problem

Hash Table

Knuth-Morris-Pratt (KMP) Algorithm

Suffix Trie

Suffix Array

String Matching Problem

◮ Given a text T and a pattern P, find all occurrences of P within T

◮ Notation:

– n and m: lengths of T and P, respectively

– Σ: the alphabet (of constant size)

– $P_i$: the ith letter of P (1-indexed)

– a, b, c: single letters in Σ

– x, y, z: strings

Example

◮ T = AGCATGCTGCAGTCATGCTTAGGCTA

◮ P = GCT

◮ P appears three times in T

◮ A naive method takes O(mn) time

– Initiate a string comparison at every starting point

– Each comparison takes O(m) time

◮ We can do much better!
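The naive method can be sketched as follows. This is a minimal illustration, assuming 0-indexed C++ strings (the slides use 1-indexed positions); the function name `naive_match` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Naive matcher: try every starting point in T and compare up to m letters.
// Returns the 0-indexed starting positions of P in T.
std::vector<int> naive_match(const std::string& T, const std::string& P) {
    assert(!P.empty());
    std::vector<int> occurrences;
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    for (int start = 0; start + m <= n; ++start) {
        int k = 0;
        // Each comparison costs O(m) in the worst case.
        while (k < m && T[start + k] == P[k]) ++k;
        if (k == m) occurrences.push_back(start);
    }
    return occurrences;
}
```

On the example above, `naive_match("AGCATGCTGCAGTCATGCTTAGGCTA", "GCT")` reports the three occurrences at 0-indexed positions 5, 16, and 22.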

Hash Function

◮ A function that takes a string and outputs a number

◮ A good hash function has few collisions

– i.e., if x ≠ y, then H(x) ≠ H(y) with high probability

◮ An easy and powerful hash function is a polynomial mod some prime p

– Consider each letter as a number (ASCII value is fine)

– $H(x_1 \ldots x_k) = x_1 a^{k-1} + x_2 a^{k-2} + \cdots + x_{k-1} a + x_k \pmod{p}$

– How do we find $H(x_2 \ldots x_{k+1})$ from $H(x_1 \ldots x_k)$?
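One possible answer to the last question: subtract $x_1 a^{k-1}$, multiply by a, and add $x_{k+1}$, all mod p. The sketch below illustrates this; the values of `BASE` and `MOD` and the function names are assumptions for illustration, since the slides do not fix a or p.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed parameters for illustration; the slides leave a and p open.
constexpr std::uint64_t BASE = 257;           // a
constexpr std::uint64_t MOD = 1000000007ULL;  // prime p

// H(x_1 ... x_k) for the k letters of s starting at index start (0-indexed),
// evaluated with Horner's rule.
std::uint64_t poly_hash(const std::string& s, std::size_t start, std::size_t k) {
    assert(start + k <= s.size());
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < k; ++i)
        h = (h * BASE + static_cast<unsigned char>(s[start + i])) % MOD;
    return h;
}

// a^e mod p by repeated multiplication.
std::uint64_t power_mod(std::size_t e) {
    std::uint64_t r = 1;
    while (e-- > 0) r = r * BASE % MOD;
    return r;
}

// From h = H(x_1 ... x_k), compute H(x_2 ... x_{k+1}) in O(1):
// subtract x_1 a^{k-1}, multiply by a, then add x_{k+1}.
// a_pow must be a^{k-1} mod p.
std::uint64_t roll(std::uint64_t h, char x1, char x_next, std::uint64_t a_pow) {
    std::uint64_t lead = static_cast<unsigned char>(x1) * a_pow % MOD;
    std::uint64_t rest = (h + MOD - lead) % MOD;
    return (rest * BASE + static_cast<unsigned char>(x_next)) % MOD;
}
```

With `a_pow = power_mod(k - 1)` computed once, sliding the window across a string of length n costs O(n) in total instead of O(nk).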

Hash Table

◮ Main idea: preprocess T to speed up queries

– Hash every substring of length k

– k is a small constant

◮ For each query P, hash the first k letters of P to retrieve all the positions in T where those k letters occur

◮ Don’t forget to check for collisions!
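The idea on this slide could look roughly like the sketch below. The type `KGramIndex`, its members, and the hash parameters are hypothetical names chosen for illustration. Each length-k substring is hashed from scratch here for brevity, which is acceptable when k is a small constant.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Index of every length-k substring of T, keyed by a polynomial hash.
struct KGramIndex {
    std::string T;
    std::size_t k;
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> buckets;

    // Polynomial hash of s[start .. start+len); base and modulus are assumed values.
    static std::uint64_t hash_of(const std::string& s, std::size_t start, std::size_t len) {
        std::uint64_t h = 0;
        for (std::size_t i = 0; i < len; ++i)
            h = (h * 257 + static_cast<unsigned char>(s[start + i])) % 1000000007ULL;
        return h;
    }

    KGramIndex(std::string text, std::size_t k_) : T(std::move(text)), k(k_) {
        assert(k > 0);
        for (std::size_t i = 0; i + k <= T.size(); ++i)
            buckets[hash_of(T, i, k)].push_back(i);
    }

    // All 0-indexed occurrences of P in T; requires |P| >= k.
    std::vector<std::size_t> find_all(const std::string& P) const {
        assert(P.size() >= k);
        std::vector<std::size_t> result;
        auto it = buckets.find(hash_of(P, 0, k));
        if (it == buckets.end()) return result;
        for (std::size_t pos : it->second) {
            // Verify the whole pattern: equal hashes may be collisions,
            // and only the first k letters of P were hashed.
            if (T.compare(pos, P.size(), P) == 0) result.push_back(pos);
        }
        return result;
    }
};
```

The verification step is what "check for collisions" amounts to in this sketch: a candidate position is reported only after a full comparison against P.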

Hash Table

◮ Pros:

– Easy to implement

– Significant speedup in practice

◮ Cons:

– Doesn’t help the asymptotic efficiency: can still take Θ(nm) time if hashing is terrible or the data is difficult

– A lot of memory consumption

Knuth-Morris-Pratt (KMP) Matcher

◮ A linear time (!) algorithm that solves the string matching problem by preprocessing P in Θ(m) time

– Main idea is to skip some comparisons by using the previous comparison result

◮ Uses an auxiliary array π that is defined as follows:

– π[i] is the largest integer smaller than i such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$

◮ ... It’s better to see an example than the definition

π Table Example (from CLRS)

◮ π[i] is the largest integer smaller than i such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$

– e.g., π[6] = 4 since abab is a suffix of ababab

– e.g., π[9] = 0 since no prefix of length ≤ 8 ends with c

◮ Let’s see why this is useful

Using the π Table

◮ T = ABC ABCDAB ABCDABCDABDE

◮ P = ABCDABD

◮ π = (0, 0, 0, 0, 1, 2, 0)

◮ Start matching at the first position of T:

◮ Mismatch at the 4th letter of P!

Using the π Table

◮ We matched k = 3 letters so far, and π[k] = 0

– Thus, there is no point in starting the comparison at $T_2$, $T_3$ (crucial observation)

◮ Shift P by k − π[k] = 3 letters

◮ Mismatch at $T_4$ again!

Using the π Table

◮ We matched k = 0 letters so far

◮ Shift P by k − π[k] = 1 letter (we define π[0] = −1)

◮ Mismatch at $T_{11}$!

Using the π Table

◮ π[6] = 2 means $P_1 P_2$ is a suffix of $P_1 \ldots P_6$

◮ Shift P by 6 − π[6] = 4 letters

◮ Again, no point in shifting P by 1, 2, or 3 letters

Using the π Table

◮ Mismatch at $T_{11}$ again!

◮ Currently 2 letters are matched

◮ Shift P by 2 − π[2] = 2 letters

Using the π Table

◮ Mismatch at $T_{11}$ yet again!

◮ Currently no letters are matched

◮ Shift P by 0 − π[0] = 1 letter

Using the π Table

◮ Mismatch at $T_{18}$

◮ Currently 6 letters are matched

◮ Shift P by 6 − π[6] = 4 letters

Using the π Table

◮ Finally, there it is!

◮ Currently all 7 letters are matched

◮ After recording this match (at $T_{16} \ldots T_{22}$), we shift P again in order to find other matches

– Shift by 7 − π[7] = 7 letters

Computing π

◮ Observation 1: if $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$, then $P_1 \ldots P_{\pi[i]-1}$ is a suffix of $P_1 \ldots P_{i-1}$

– Well, obviously...

◮ Observation 2: all the prefixes of P that are a suffix of $P_1 \ldots P_i$ can be obtained by recursively applying π to i

– e.g., $P_1 \ldots P_{\pi[i]}$, $P_1 \ldots P_{\pi[\pi[i]]}$, and $P_1 \ldots P_{\pi[\pi[\pi[i]]]}$ are all suffixes of $P_1 \ldots P_i$

Computing π

◮ A non-obvious conclusion:

– First, let’s write $\pi^{(k)}[i]$ as π[·] applied k times to i

– e.g., $\pi^{(2)}[i] = \pi[\pi[i]]$

– π[i] is equal to $\pi^{(k)}[i-1] + 1$, where k is the smallest integer that satisfies $P_{\pi^{(k)}[i-1]+1} = P_i$

– If there is no such k, π[i] = 0

◮ Intuition: we look at all the prefixes of P that are suffixes of $P_1 \ldots P_{i-1}$, and find the longest one whose next letter matches $P_i$

Implementation

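// P is 1-indexed (P[1..m]); pi has m+1 entries, pi[0..m].
pi[0] = -1;
int k = -1;
for (int i = 1; i <= m; i++) {
    // Fall back to shorter prefixes that are also suffixes
    // until the next letter P[k+1] matches P[i].
    while (k >= 0 && P[k+1] != P[i])
        k = pi[k];
    pi[i] = ++k;
}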
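A self-contained variant of the same loop, adapted to 0-indexed `std::string` while keeping the 1-indexed π convention with π[0] = −1. The function name `compute_pi` is hypothetical. The second pattern in the usage note, `ababababca`, is an assumption: it is not given in this transcript, but it agrees with the two values quoted on the π table slide.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns pi[0..m] for pattern P, with pi[0] = -1 and pi[i] equal to the
// length of the longest proper prefix of P_1..P_i that is also its suffix.
std::vector<int> compute_pi(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> pi(m + 1);
    pi[0] = -1;
    int k = -1;
    for (int i = 1; i <= m; ++i) {
        // In 0-indexed storage, P[k] is the letter P_{k+1} and P[i-1] is P_i.
        while (k >= 0 && P[k] != P[i - 1]) k = pi[k];
        pi[i] = ++k;
    }
    return pi;
}
```

For `ABCDABD` this yields π[1..7] = (0, 0, 0, 0, 1, 2, 0), matching the table used in the walkthrough. For the assumed pattern `ababababca` it gives π[6] = 4 and π[9] = 0.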

Pattern Matching Implementation

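// T and P are 1-indexed; pi is the table computed for P.
int k = 0;
for (int i = 1; i <= n; i++) {
    // Invariant: P[1..k] matches the k letters of T ending at T[i-1].
    while (k >= 0 && P[k+1] != T[i])
        k = pi[k];
    k++;
    if (k == m) {
        // P matches T[i-m+1..i]
        k = pi[k];
    }
}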
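Putting the two loops together gives a complete matcher. The sketch below is one way to write it with 0-indexed `std::string` storage; the function name `kmp_search` is hypothetical, and reported positions are 1-indexed to agree with the walkthrough.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the 1-indexed starting positions of every occurrence of P in T.
std::vector<int> kmp_search(const std::string& T, const std::string& P) {
    assert(!P.empty());
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());

    // Build the pi table (pi[0] = -1, as in the slides).
    std::vector<int> pi(m + 1);
    pi[0] = -1;
    for (int i = 1, k = -1; i <= m; ++i) {
        while (k >= 0 && P[k] != P[i - 1]) k = pi[k];
        pi[i] = ++k;
    }

    // Scan T once; k is the number of letters of P currently matched.
    std::vector<int> matches;
    int k = 0;
    for (int i = 1; i <= n; ++i) {
        while (k >= 0 && P[k] != T[i - 1]) k = pi[k];
        ++k;
        if (k == m) {
            matches.push_back(i - m + 1);  // P matches T[i-m+1..i]
            k = pi[k];                     // keep going to find overlaps
        }
    }
    return matches;
}
```

On the walkthrough input, `kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD")` reports a single match starting at position 16. Since k is reset to π[m] immediately after a match, the index `P[k]` never runs past the end of the pattern.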

Suffix Trie

◮ Suffix trie of a string T is a rooted tree that stores all the suffixes (thus all the substrings)

◮ Each node corresponds to some substring of T

◮ Each edge is associated with a letter of the alphabet

◮ For each node that corresponds to ax, there is a special pointer called suffix link that leads to the node corresponding to x

◮ Surprisingly easy to implement!

Example

(Figure modified from Ukkonen’s original paper)

Incremental Construction

◮ Given the suffix trie for $T_1 \ldots T_n$

– Then we append $T_{n+1} = a$ to T, creating necessary nodes

◮ Start at node u corresponding to $T_1 \ldots T_n$

– Create an a-transition to a new node v

◮ Take the suffix link at u to go to u′, corresponding to $T_2 \ldots T_n$

– Create an a-transition to a new node v′

– Create a suffix link from v to v′

Incremental Construction

◮ Repeat the previous process:

– Take the suffix link at the current node

– Make a new a-transition there

– Create the suffix link from the previous node

◮ Stop if the node already has an a-transition

– Because from this point, all nodes that are reachable via suffix links already have an a-transition
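The incremental construction described on these two slides can be sketched as below. The struct and member names (`SuffixTrie`, `append`, `find`, `deepest`) are hypothetical. Two details are assumptions not spelled out in the slides: the root has no suffix link (stored as -1), and the last node created in a step links either to the existing a-child where the walk stopped or to the root when the walk falls off the root.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct SuffixTrie {
    struct Node {
        std::map<char, int> next;  // outgoing edges, one letter each
        int link = -1;             // suffix link; stays -1 only for the root
    };
    std::vector<Node> nodes;  // node 0 is the root (empty string)
    int deepest = 0;          // node for the whole string read so far

    SuffixTrie() : nodes(1) {}

    // Append letter a, following suffix links from the deepest node.
    void append(char a) {
        int u = deepest, prev = -1, first = -1;
        while (u != -1 && nodes[u].next.count(a) == 0) {
            int v = static_cast<int>(nodes.size());
            nodes.emplace_back();
            nodes[u].next[a] = v;                 // new a-transition
            if (prev == -1) first = v;            // node for the new whole string
            else nodes[prev].link = v;            // link from the previous new node
            prev = v;
            u = nodes[u].link;                    // take the suffix link
        }
        // Stopped at a node with an a-transition, or fell off the root.
        nodes[prev].link = (u == -1) ? 0 : nodes[u].next.at(a);
        deepest = first;
    }

    // Node reached by spelling P from the root, or -1 if we get stuck.
    int find(const std::string& P) const {
        int u = 0;
        for (char c : P) {
            auto it = nodes[u].next.find(c);
            if (it == nodes[u].next.end()) return -1;
            u = it->second;
        }
        return u;
    }
};
```

Building the trie for `aba` and then appending `c` creates four new nodes (for `abac`, `bac`, `ac`, and `c`), for 10 nodes in total including the root.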

Construction Example

Given the suffix trie for aba

We want to add a new letter c

Construction Example

◮ Construction time is linear in the tree size

◮ But the tree size can be quadratic in n

– e.g., T = aa…abb…b

Construction Example

◮ To find P, start at the root and keep following edges labeled with $P_1$, $P_2$, etc.

◮ Got stuck? Then P doesn’t exist in T
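To experiment with the search and with the quadratic size, a suffix trie can also be built the slow way, by inserting every suffix separately. The sketch below does that rather than the incremental construction; the name `NaiveSuffixTrie` and its members are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Suffix trie built by inserting each suffix of T in turn.
struct NaiveSuffixTrie {
    std::vector<std::map<char, int>> next;  // next[u][c] = child of u along c

    explicit NaiveSuffixTrie(const std::string& T) : next(1) {
        for (std::size_t s = 0; s < T.size(); ++s) {
            int u = 0;  // start each suffix at the root
            for (std::size_t i = s; i < T.size(); ++i) {
                auto it = next[u].find(T[i]);
                if (it != next[u].end()) {
                    u = it->second;
                } else {
                    int v = static_cast<int>(next.size());
                    next.emplace_back();
                    next[u][T[i]] = v;
                    u = v;
                }
            }
        }
    }

    // Follow edges labeled P_1, P_2, ...; getting stuck means P is not in T.
    bool contains(const std::string& P) const {
        int u = 0;
        for (char c : P) {
            auto it = next[u].find(c);
            if (it == next[u].end()) return false;
            u = it->second;
        }
        return true;
    }

    std::size_t size() const { return next.size(); }  // nodes, root included
};
```

For T = `aaabbb` (three a's followed by three b's) the trie has 16 nodes, and for `aaaabbbb` it has 25: with k copies of each letter the node count is (k+1)², which is quadratic in n = 2k.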

Suffix Array

◮ Memory usage is O(n)

◮ Has the same computational power as suffix trie

◮ Can be constructed in O(n) time (!)

– But it’s hard to implement

◮ There is an approachable $O(n \log^2 n)$ algorithm

– If you want to see how it works, read the paper on the course website

– http://cs97si.stanford.edu/suffix-array.pdf
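For orientation, here is a sketch of one common $O(n \log^2 n)$ construction, prefix doubling with a comparison sort in each round. It is not necessarily the algorithm in the linked paper, and the function names `build_suffix_array` and `find_all` are hypothetical. Positions are 0-indexed.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Suffix array by prefix doubling: sort suffixes by their first 2, 4, 8, ...
// letters. Each round is one O(n log n) sort and there are O(log n) rounds.
std::vector<int> build_suffix_array(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(s[i]);
    for (int len = 1;; len *= 2) {
        // Key of suffix i: rank of its first len letters, then of the next len.
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + len < n ? rank[i + len] : -1);
        };
        std::sort(sa.begin(), sa.end(),
                  [&](int a, int b) { return key(a) < key(b); });
        tmp[sa[0]] = 0;
        for (int j = 1; j < n; ++j)
            tmp[sa[j]] = tmp[sa[j - 1]] + (key(sa[j - 1]) < key(sa[j]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: done
    }
    return sa;
}

// Occurrences of P are the suffixes that start with P; they form one
// contiguous block of the suffix array, found by binary search.
std::vector<int> find_all(const std::string& s, const std::vector<int>& sa,
                          const std::string& P) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), P,
        [&](int suf, const std::string& p) { return s.compare(suf, p.size(), p) < 0; });
    auto hi = std::upper_bound(lo, sa.end(), P,
        [&](const std::string& p, int suf) { return s.compare(suf, p.size(), p) > 0; });
    std::vector<int> result(lo, hi);
    std::sort(result.begin(), result.end());
    return result;
}
```

For `banana` the suffix array is (5, 3, 1, 0, 4, 2). Each query costs O(m log n) for the binary searches plus the size of the output.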

Notes on String Problems

◮ Always be aware of the null-terminators

◮ Simple hashing works well in many problems

◮ If a problem involves rotations of some string, consider concatenating it with itself and see if it helps

◮ Stanford team notebook has implementations of suffix arrays and the KMP matcher
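As a small illustration of the rotation tip: b is a rotation of a exactly when the two have the same length and b occurs in a concatenated with itself. The function name `is_rotation` is hypothetical, and `std::string::find` stands in for any matcher; a KMP matcher would make the check linear time.

```cpp
#include <cassert>
#include <string>

// True if b is a rotation of a: same length, and b occurs inside a + a.
bool is_rotation(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    return (a + a).find(b) != std::string::npos;
}
```

For example, `cdeab` is a rotation of `abcde` because it appears inside `abcdeabcde`, while `abced` is not.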
