Top Banner
Forbidden Patterns Johannes Fischer 1,? , Travis Gagie 2,?? , Tsvi Kopelowitz 3 , Moshe Lewenstein 4 , Veli M¨ akinen 5,??? , Leena Salmela 5, , and Niko V¨ alim¨ aki 5, 1 KIT, Karlsruhe, Germany, [email protected] 2 Aalto University, Espoo, Finland, [email protected] 3 Weizmann Institute of Science, Rehovot, Israel, [email protected] 4 Bar-Ilan University, Ramat Gan, Israel, [email protected] 5 University of Helsinki, Helsinki, Finland, {vmakinen|leena.salmela|nvalimak}@cs.helsinki.fi Abstract. We consider the problem of indexing a collection of docu- ments (a.k.a. strings) of total length n such that the following kind of queries are supported: given two patterns P + and P - , list all n match documents containing P + but not P - . This is a natural extension of the classic problem of document listing as considered by Muthukrishnan [SODA’02], where only the positive pattern P + is given. Our main solu- tion is an index of size O(n 3/2 ) bits that supports queries in O(|P + | + |P - | + n match + n) time. 1 Introduction In recent years, the pattern matching community has paid a considerable amount of attention to document retrieval tasks, where, in contrast to traditional indexed pattern matching, the task is to output each document containing a search pat- tern even just once, and in particular without spending time proportional to the number of total occurrences of the pattern in that document. Starting with Muthukrishnan’s seminal paper [14] (building in fact on an earlier paper by the same author with colleagues [12]), an abundance of articles on variations of this scheme emerged, including: space reductions of the underlying data struc- tures [7, 16, 17], ranking of the output [9, 11, 15], two-pattern queries [3, 5], and perhaps many more. A different possible, and indeed very natural, extension of the basic problem is to exclude some documents from the output. In this setting, the user specifies, ? Supported by the German Research Foundation (DFG). ?? Supported by Academy of Finland grant 134287. ??? Partially funded by the Academy of Finland grant 1140727. Also affiliated with Helsinki Institute for Information Technology (HIIT). Supported by Academy of Finland grant 118653 (ALGODAN). Partially funded by the Academy of Finland grant 118653 (ALGODAN) and Helsinki Doctoral Programme in Computer Science (Hecse).
12

Forbidden Patterns - CiteSeerX

Apr 22, 2023

Download

Documents

Khang Minh
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Forbidden Patterns - CiteSeerX

Forbidden Patterns

Johannes Fischer1,?, Travis Gagie2,??, Tsvi Kopelowitz3, Moshe Lewenstein4,Veli Makinen5,? ? ?, Leena Salmela5,†, and Niko Valimaki5,‡

1 KIT, Karlsruhe, Germany, [email protected] Aalto University, Espoo, Finland, [email protected]

3 Weizmann Institute of Science, Rehovot, Israel, [email protected] Bar-Ilan University, Ramat Gan, Israel, [email protected]

5 University of Helsinki, Helsinki, Finland,vmakinen|leena.salmela|[email protected]

Abstract. We consider the problem of indexing a collection of docu-ments (a.k.a. strings) of total length n such that the following kind ofqueries are supported: given two patterns P+ and P−, list all nmatch

documents containing P+ but not P−. This is a natural extension ofthe classic problem of document listing as considered by Muthukrishnan[SODA’02], where only the positive pattern P+ is given. Our main solu-tion is an index of size O(n3/2) bits that supports queries in O(|P+| +|P−|+ nmatch +

√n) time.

1 Introduction

In recent years, the pattern matching community has paid a considerable amountof attention to document retrieval tasks, where, in contrast to traditional indexedpattern matching, the task is to output each document containing a search pat-tern even just once, and in particular without spending time proportional tothe number of total occurrences of the pattern in that document. Starting withMuthukrishnan’s seminal paper [14] (building in fact on an earlier paper bythe same author with colleagues [12]), an abundance of articles on variations ofthis scheme emerged, including: space reductions of the underlying data struc-tures [7, 16, 17], ranking of the output [9, 11, 15], two-pattern queries [3, 5], andperhaps many more.

A different possible, and indeed very natural, extension of the basic problemis to exclude some documents from the output. In this setting, the user specifies,

? Supported by the German Research Foundation (DFG).?? Supported by Academy of Finland grant 134287.

? ? ? Partially funded by the Academy of Finland grant 1140727. Also affiliated withHelsinki Institute for Information Technology (HIIT).† Supported by Academy of Finland grant 118653 (ALGODAN).‡ Partially funded by the Academy of Finland grant 118653 (ALGODAN) and Helsinki

Doctoral Programme in Computer Science (Hecse).

Page 2: Forbidden Patterns - CiteSeerX

in addition to the query pattern P+, a negative pattern P− that should notoccur in the retrieved documents. Formally, this problem of forbidden patternscan be modeled as follows:6

Given: A collection of static documents D = D1, . . . , Dd over an alphabet Σof total length n =

∑i≤d |Di|.

Compute: An index that, given two patterns P+ and P− online, allows us tocompute all nmatch documents containing P+, but not P−.

The best we can hope for is certainly an index of linear size having a querytime of O(|P+|+ |P−|+ nmatch), as this is the time just to read the input andwrite the output of the query. However, achieving anything close to this optimumseems completely out of reach (at least at the current state of research), as theforbidden pattern queries can be regarded as set difference queries, which arearguably at least as hard as set intersection queries. In the realm of documentretrieval, those latter queries correspond to the case where two positive patternsP+1 and P+

2 are given (and one is interested in all documents containing bothpositive patterns); we are aware of three indexes that address this problem: (1)O(n3/2) words of space with query time O(|P+

1 |+ |P+2 |+ nmatch +

√n) [5], (2)

O(n log n) words and O(|P+1 |+ |P

+2 |+(

√nmatch n log n+nmatch) log2 n) time [3],

and (3) O(n) words and O(|P+1 |+|P

+2 |+(

√nmatch n log n+nmatch) log n)) time [8]

(this improves on [3] in both time and space and has the further advantage thatit generalizes to more than two patterns). In Appendix A of this paper, weshow that this problem of two positive patterns is indeed harder than documentretrieval with just one pattern.7

However, the main body of this paper is devoted to the forbidden patternsproblem. The following theorem summarizes our main result.

Theorem 1. For a text collection of total length n, there exists a data struc-ture of size O(n3/2) bits such that subsequent forbidden pattern queries can beanswered in O(|P+|+ |P−|+ nmatch +

√n) time.

The rest of this article is structured as follows. Sect. 2 introduces knownresults that form the basic building blocks of our solution, including a descriptionof the preprocessing algorithm for document retrieval with just one positivepattern [14]. In Sect. 3, we then give data structures for the forbidden patternsproblem, where, apart from proving Thm. 1, we also look at the variation of justcounting the number documents. Finally, Sect. 4 concludes the paper.

6 Muthukrishnan [14] already considered the case where just a negative pattern P− isgiven, and has an optimal solution that outputs all documents not containing P−.

7 Unfortunately, we could not come up with any meaningful lower bound for theforbidden patterns problem.

2

Page 3: Forbidden Patterns - CiteSeerX

2 Preliminaries

2.1 Succinct Data Structures

Consider a bit-string S[1, n] of length n. We define the fundamental rank - andselect-operations on S as follows: rank1(S, i) gives the number of 1’s in the prefixS[1, i], and select1(S, i) gives the position of the i’th 1 in S, reading S from leftto right (1 ≤ i ≤ n). The following lemma summarizes a by-now classic result(see, e.g., [13]):

Lemma 1. A bit-string of length n can be represented in n+o(n) bits such thatrank- and select-operations are supported in O(1) time.

2.2 Range Minimum Queries

A basic building block for our solution is a space-efficient preprocessing schemefor O(1) range minimum queries. For a static array E[1, n] of n objects from atotally ordered universe and two indices i and j with 1 ≤ i ≤ j ≤ n, a rangeminimum query rmqE(i, j) returns the position of a minimum element in thesub-array E[i, j]; in symbols: rmqE(i, j) = argmin

E[k] | i ≤ k ≤ j

. We state

the following result [6, Thm. 5.8]:

Lemma 2. A static array E[1, n] can be preprocessed in O(n) time into a datastructure of size 2n+ o(n) bits such that subsequent range minimum queries onE can be answered in O(1) time, without consulting E at query time. This sizeis asymptotically optimal.

2.3 Document Retrieval

We now explain Muthukrishnan’s solution [14] for document retrieval with onlyone positive pattern P (in fact, we describe a variant [16] of the original algo-rithm that is more convenient for our purposes). The overall idea is to build ageneralized suffix tree ST for the collection of documents D = D1, . . . , Dd,and enhance it with additional information for reporting the documents. A gen-eralized suffix tree for D is a suffix tree for the text T := D1#1D2#2 . . . Dd#d,where the #i’s are distinct characters not appearing elsewhere in D. A suffixtree ST on T , in turn, is a compacted trie on all suffixes of T , and consists ofonly O(|T |) nodes. Every such node v is said to correspond to a substring α ofT iff the letters on the root-to-v path are exactly α.

A suffix tree ST for T allows us to locate all occ occurrences of a search patternP in T in optimal O(|P |+ occ) time (with perfect hashing; otherwise the searchtakes O(|P | log |Σ|+ occ) time). This search proceeds in two steps: it first findsin O(|P |) time the node v in ST such that all leaves below v correspond to the

3

Page 4: Forbidden Patterns - CiteSeerX

occ suffixes that are prefixed by P . In a second step, the starting points of allthese suffixes are reported in additional O(occ) time. For document retrieval, itshould be clear that we can reuse only the first part of this search, but mustmodify the second step such that it uses O(nmatch) instead of O(occ) time.

To this end, Muthukrishnan’s solution proceeds as follows. Consider the leavesof ST in lexicographic order. The positions in T of their corresponding suffixesform a permutation of the numbers [1, n′] (n′ = n+d = O(n) being the size of T ),the so-called suffix-array A[1, n′]. Define a document array D[1, n′] of the samesize as A, such that D[i] holds the document number of the lexicographicallyi’th suffix. More formally, D[i] = j iff #j is the first document separator inT [A[i], n′]. We now chain suffixes from the same document in a new array E[1, n′]by defining E[i] = max

j < i | D[j] = D[i]

, where the maximum of the

empty set is assumed to be −∞. Array E is prepared for constant-time rangeminimum queries using Lemma 2. With these data structures, we can obtainoptimal O(|P |+ nmatch) listing time, as explained next. We first use ST to findin O(|P |) time the interval [`, r] in A such that the suffixes in A[`, r] are exactlythose that are prefixed by P . We then call the recursive procedure list in Alg. 1,initially invoked by list(`, r) and assuming V [i] = 0 for all 1 ≤ i ≤ d just beforethat first call:

Algorithm 1: List all documents in D[i, j] not occurring in D[`, i− 1].

procedure list(i, j)m← rmqE(i, j)if V [D[m]] = 0 then

output D[m]V [D[m]]← 1list(i,m− 1)list(m + 1, j)

The idea of procedure list is that each distinct document identifier in D[`, r]is listed only at the place m of its leftmost occurrence in the interval [`, r];such places are conveniently located by range minimum queries on E. To avoidduplicate outputs, we mark all documents found by a ’1’ in an additional arrayV [1, d], which is initialized with all 0’s in the preprocessing phase. Wheneverthe smallest element in E[i, j] comes from a document already reported (henceV [D[m]] = 1), the recursion can be stopped since every document in D[i, j] isreported when visiting distinct intervals [i′, j′] with ` ≤ i′ ≤ j′ < i. Hence, theoverall running time is O(nmatch). At the end of the reporting phase, we need toreset V [·] to 0 for all documents in the output. This takes additional O(nmatch)time.

Apart from the suffix tree ST , the space for this solution is dominated by then′ words needed for storing the document array D.

4

Page 5: Forbidden Patterns - CiteSeerX

3 Document Retrieval with Forbidden Patterns

We now come to the description of our solution to the problem of forbiddenpatterns, as presented in the introduction. We proceed by first presenting a rathersimple solution (Sect. 3.1), which is then subsequently refined (Sect. 3.2–3.3).

3.1 O(n2) Words of Space

We first show how to achieve optimal (|P+| + |P−| + nmatch) query time withO(n2) space. The idea is again to store a generalized suffix tree ST for the set ofdocuments D and enhance it with additional information. In particular, for everynode v in ST corresponding to string α, we store a copy of ST that excludesthe documents containing α. We call that copy ST v. Every ST v is prepared for“normal” document listing (Sect. 2.3).

When a query arrives, we first match P− in ST until reaching node v (if thematching ends on an edge, we take the following node). We then jump to ST v,where we match P+, and list all nmatch documents in optimal O(nmatch) time.

The space for this solution is clearly O(n2) words, as the number of nodes inST is O(n), and for each such node v we build another generalized suffix treeST v, all of which could contain O(n) nodes in the worst case.

3.2 O(n2) Bits of Space

We now reduce the solution from the previous section to O(n2) bits. Our aimis to reuse the full suffix tree ST also when matching the positive pattern P+,and use a modified RMQ-structure when reporting documents by procedurelist (see Sect. 2.3). To this end, let v be a node in ST , and let Dv be theset of documents containing the string represented by v (hence Dv is the set ofdocuments in D[`, r] if A[`, r] is the suffix array interval for v in the sense ofSect. 2.3). In a (conceptual) copy Ev of the global chaining array E, we blankout all entries corresponding to documents in Dv by setting the correspondingvalues to +∞. More precisely,

Ev[i] =

+∞ if D[i] ∈ Dv, and

E[i] otherwise.

Now each such Ev is prepared for range minimum queries using Lemma 2, andonly this RMQ-structure (not the array Ev itself!) is stored at node v. A furtherbit-vector Bv[1, n′] at node v marks those positions with a ’1’ that correspond todocuments in Dv, in symbols: Bv[i] = 1 if Ev[i] = +∞, and Bv[i] = 0 otherwise.Hence, the total space needed is 3n′ + o(n′) = O(n) bits per node in ST . Wealso store the global document array D[1, n′], plus the bit-vector V [1, d] needed

5

Page 6: Forbidden Patterns - CiteSeerX

Algorithm 2: Modified procedure for document listing.

procedure list′(i, j)m← rmqEv

(i, j)if V [D[m]] = 0 and Bv[m] = 0 then

output D[m]V [D[m]]← 1list′(i,m− 1)list′(m + 1, j)

by Alg. 1, needing O(n log n) and d = O(n) bits, respectively. The space for theentire data structure thereby amounts to O(n2) bits.

The query processing starts as in the previous section: we first match P− inST until reaching node v (and again take the following node if we end on anedge). Now instead of jumping to ST v (which is no longer stored), we use STagain to find the interval [`, r] in the suffix array A such that the suffixes in A[`, r]are exactly those that are prefixed by P+. We then call procedure list(`, r),but using the RMQ-structure for Ev instead of E (corresponding to the negativepattern P−). We need to further modify that procedure such that it does notlist those documents in Dv; this can be accomplished by checking if the m’th bitin Bv is set to ’0’. If, on the other hand, Bv[m] = 1, we can stop the recursion,as in that case all other entries in Ev[i, j] must also be +∞ (and hence comefrom documents in Dv). The complete modified algorithm list′ can be seen inAlg. 2. As before, after having listed all nmatch documents, we need to unmarkthe listed documents in V in additional O(nmatch) time in order to prepare forthe next query.

3.3 O(n3/2) Bits of Space

We now present a space/time tradeoff for the solution given in the previoussection. Our general idea is to store the RMQ-structures only at a selectedsubset of nodes in ST , thereby possibly listing false documents that need tobe filtered at query time. For what follows, the reader should also consult theexample shown in Fig. 1.

We assign a weight wv to each node v in ST as follows. As before, let Dv

denote the set of documents “below” v; i.e., the set of documents in D thatcontain the string represented by v. Then the weight of v is defined to be thenumber of documents in Dv, wv = |Dv|.

In ST , we mark certain nodes as important. The RMQ-structures will only bestored at important nodes. We will make sure that each node v has an importantsuccessor u such that wv ≤ wu + s, for some integer s to be determined later.At v, we also store a pointer to this important successor u. Let pv = u denotethis pointer (for important nodes v we define pv = v). At query time, when thesearch for P− ends at v, we use the algorithm from the previous section, but

6

Page 7: Forbidden Patterns - CiteSeerX

T = turing#1tape#2enigma#3apple#4

#3

p

ple#4

e#2

#2

#4

nigma#3

a e g i

#1ma#3

gma

#3

ng#1

le#4

ma#3

n p t

u

g#1

igma#3

e#2 l

e#4

ple#4

ring#1

ring#1

`0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

ape

#2

uring#1

22

22233

2

4

Fig. 1. Generalized suffix tree (with super-leaf `) for the collection of documentsD = turing, tape, enigma, apple (with irrelevant parts pruned). The number in-side a node v denotes its weight wv. Important nodes (assuming s = 2) are shown inbold. Hence, in this example only 3 RMQ-structures are stored.

now with the RMQ-structure for Epv . This reports at most s false documents,which need to be discarded from the output.

It thus remains to identify the false documents. For this, we need not store anyadditional data structures, as explained next. Let α denote the string representedby node pv. Observe that the false documents are exactly those that contain P−,but not α; but this corresponds to a forbidden pattern query, with P− as thepositive pattern, and α as the negative one! And for this query all necessary datastructures are at hand, because at pv there exists an RMQ-structure that filtersexactly those documents containing α.

In summary, to answer the query “P+ and not P−,” we first match P− inST up to node v, where we follow the pointer pv to an important successorrepresenting string α. At pv, we answer the query “P− and not α” with thealgorithm from Sect. 3.2 to identify the set of false documents, which we mark ina bit-vector F [1, d]. Finally, we answer the query “P+ and not α,” again by usingthe algorithm from Sect. 3.2, but this time outputting only those documents notmarked in F . In the end, all marked documents are unmarked to prepare forthe next query. As by definition the marked (=false) documents are at most sin number, the total query time is O(|P+|+ |P−|+ nmatch + s).

Identifying Important Nodes. It remains to show how the important nodescan be identified. We do this in a bottom-up traversal of ST . We first enhanceST with a super-leaf ` that is the single child of all original leaves in ST . This

7

Page 8: Forbidden Patterns - CiteSeerX

node ` has weight w` = 0, is marked as important, and stores an RMQ-structurefor the original array E.

During the bottom-up traversal of all original nodes in ST (excluding thesuper-leaf `), let us assume we arrive at a node v with children v1, . . . , vk. Byinduction, all vi’s have already been assigned an important successor pvi . Let vmbe the child of v having an important successor occurring in most documents,m = argmaxwpvi

| 1 ≤ i ≤ k. Then if wv − wpvm≤ s, we set pv to pvm .

Otherwise, we mark v as important, create an RMQ-structure for Ev, and setpv = v.

Space Analysis. To analyze the space, consider the subtree ST I of ST con-sisting only of important nodes and their mutual lowest common ancestors (inthe terminology of Cole et al. [4], ST I is the subtree of ST induced by the im-portant nodes). The nodes in ST I are further divided into two different classes:(1) non-branching internal, and (2) other. A node belongs to class non-branchinginternal iff it has exactly one child in ST I ; otherwise it belongs to class other.

We analyze the number of non-branching internal and other nodes separately.Let us first consider the other nodes. They form yet another induced subtree ofST I , let us call it ST ′I . A leaf v in ST ′I covers at least s (original) leaves inST , as its weight wv must by definition be at least wv > s, and in order to covers documents at least s suffixes from T must be covered. Hence, the number ofleaves in ST ′I is bounded from above by n/s, and because ST ′I is compact, thetotal number of nodes in ST ′I is O(n/s).

Now look at the non-branching internal nodes of ST I . Because every suchnode v must have wv − wu > s for its single child u in ST I , this increase inweight can only come from at least s (original) leaves of ST for which v is theirnearest important ancestor. As every leaf in ST can contribute to at most onesuch non-branching internal node, the number of non-branching internal nodesin also bounded by O(n/s).

In total, we store the RMQ-structures (using O(n) bits each) only at O(n/s)nodes of ST . By setting s =

√n, we obtain Thm. 1.

3.4 Approximate Counting Queries

If only the number of documents containing P+ but not P− matters, we canobtain a faster data structure than that of Thm. 1. We first explain Sadakane’ssuccinct data structure for document counting in the presence of just a positivepattern P [16, Sect. 5.2]. Without going too much into the details, the idea ofthat algorithm is as follows: for each node v in ST we store in c(v) how many“duplicate” suffixes from the same document occur in its subtree. When a querypattern P arrives, we first locate the node v representing P and its suffix arrayinterval [`, r]. Then r− `+1−c(v) gives the number of documents containing P ,as r− `+ 1 is the total number of occurrences, and c(v) is the right “correction

8

Page 9: Forbidden Patterns - CiteSeerX

term” that needs to be subtracted. (This idea was first used by Hui [10], whoalso shows how to compute the correction terms in overall linear time by lowestcommon ancestor queries.) The c(v)-numbers are not stored directly in eachnode (this would need O(n log n) bits overall), but rather in a bit-vector H ′ oflength O(n) such that c(v) can be computed by a constant number of rank- andselect-queries on H ′ (in fact by counting the 0’s in a certain interval in H ′), givenv’s suffix array interval [`, r]. In total, this solution uses O(n) bits (see Lemma1).

For forbidden patterns, we modify the data structure from Sect. 3.3 and storeonly the following information at important nodes v:

– Vector Bv, marking the documents in Dv, as defined in Sect. 3.2. Each Bv

is prepared for constant-time rank1-queries.– Vector H ′v, which is the bit-vector H ′ as defined in the preceding paragraph,

but modified to exclude those documents from Dv. Using H ′v we can computefor any node u a modified correction term cv(u) that excludes the documentscontaining the string represented by v. Vector H ′v is at most as long as H ′,hence using only O(n) bits.

Finally, each node v in ST stores the pointer pv to its nearest important suc-cessor.

When we need to answer a query of the form “what is the approximate num-ber of documents containing P+ but not P−?” we first match P− in ST untilreaching node v. We then match P+ in ST , say until reaching u. Assume thatu corresponds to the suffix array interval [`, r]. Now observe that the number of1’s in Bpv [`, r] gives the number of suffixes below u that correspond to forbid-den patterns, and that this number can be computed by f = rank1(Bpv , r) −rank1(Bpv

, `−1). So r−`+1−f is the number of suffixes below u excluding thosedocuments containing P−. Hence, we return n′match = r−`+1−f−cpv

(u) as theapproximate answer to nmatch, which satisfies nmatch ≤ n′match ≤ nmatch +

√n

by the definition of important nodes (with s =√n).

Theorem 2. For a text collection of total length n, there exists a data structureof size O(n3/2) bits such that subsequent queries asking for the approximatenumber of documents containing P+ but not P− can be answered in O(|P+| +|P−|) time, with an additive error of at most

√n.

4 Conclusions

We initiated the study of document retrieval in the presence of forbidden pat-terns. Apart from improving on the space consumption and/or query time ofour data structures, we point out the following subjects deserving further in-vestigation: (1) document retrieval with more than two patterns, both positiveand negative, (2) lower bounds for forbidden patterns, possibly in the spirit of

9

Page 10: Forbidden Patterns - CiteSeerX

Appendix A, but also in more realistic machine models such as the word-RAM,and (3) algorithmic engineering and comparison with inverted indexes.

References

1. B. Chazelle. Lower bounds for orthogonal range searching, I: The reporting case.J. ACM, 37:200–212, 1990.

2. Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Geometric Burrows-Wheelertransform: Linking range searching and text indexing. In Proc. DCC, pages 252–261. IEEE Press, 2008.

3. H. Cohen and E. Porat. Fast set intersection and two-patterns matching. Theor.Comput. Sci., 411(40–42):3795–3800, 2010.

4. R. Cole, M. Farach-Colton, R. Hariharan, T. M. Przytycka, and M. Thorup. AnO(n logn) algorithm for the maximum agreement subtree problem for binary trees.SIAM J. Comput., 30(5):1385–1404, 2000.

5. P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensionalsubstring indexing. J. Comput. Syst. Sci., 66(4):763–774, 2003.

6. J. Fischer and V. Heun. Space efficient preprocessing schemes for range minimumqueries on static arrays. SIAM J. Comput., 40(2):465–492, 2011.

7. T. Gagie, S. J. Puglisi, and A. Turpin. Range quantile queries: Another virtue ofwavelet trees. In Proc. SPIRE, volume 5721 of LNCS, pages 1–6. Springer, 2009.

8. W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. String retrieval for multi-pattern queries. In Proc. SPIRE, volume 6393 of LNCS, pages 55–66. Springer,2010.

9. W. K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k stringretrieval problems. In Proc. FOCS, pages 713–722. IEEE Computer Society, 2009.

10. L. C. K. Hui. Color set size problem with application to string matching. In Proc.CPM, volume 644 of LNCS, pages 230–243. Springer, 1992.

11. M. Karpinski and Y. Nekrich. Top-k color queries for document retrieval. In Proc.SODA, pages 401–411. ACM/SIAM, 2011.

12. Y. Matias, S. Muthukrishnan, S. C. Sahinalp, and J. Ziv. Augmenting suffix trees,with applications. In Proc. ESA, volume 1461 of LNCS, pages 67–78. Springer,1998.

13. J. I. Munro and V. Raman. Succinct representation of balanced parentheses andstatic trees. SIAM J. Comput., 31(3):762–776, 2001.

14. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc.SODA, pages 657–666. ACM/SIAM, 2002.

15. G. Navarro and Y. Nekrich. Top-k document retrieval in optimal time and linearspace. In Proc. SODA. ACM/SIAM, to appear 2012.

16. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. DiscreteAlgorithms, 5(1):12–22, 2007.

17. N. Valimaki and V. Makinen. Space-efficient algorithms for document retrieval.In Proc. CPM, volume 4580 of LNCS, pages 205–215. Springer, 2007.

Appendix

A A Lower Bound for Two Positive Patterns

Consider the two-pattern matching problem in [3]:

10

Page 11: Forbidden Patterns - CiteSeerX

Given: A collection of static documents D = D1, . . . , Dd over an alphabet Σof total length n =

∑i≤d |Di|.

Compute: An index that, given two patterns P+1 and P+

2 online, allows us tocompute all nmatch documents containing P+

1 and P+2 .

We give a new lower bound result for the problem using the technique in-troduced in [2]. The reduction is from 4-dimensional range queries: Given setS of n points in R4 each represented with 4h′ bits, preprocess S to find pointss | s ∈ S ∩ ([x`, xr] × [y`, yr] × [z`, zr] × [t`, tr]), where x, y, z, and t denotethe 4 coordinate ranges. Chazelle [1] showed that on a pointer machine, an in-dex supporting d-dimensional range searching in O(polylog(n)+occ) query timerequires Ω(n(log n/ log log n)d−1) words of storage.

First, we can assume that points in S are from a [1, n]× [1, n]× [1, n]× [1, n]grid, since sorting and mapping n points in R4 to their ranks in each coordinatetake O(nh′) space and O(n log n) time. Later, a range query can be cast to thecorresponding one on the ranks in O(log n) time.

From the i’th point s = (x, y, z, t) ∈ S we create the document

Di =←−−〈yi〉#1〈xi〉#2

←−〈ti〉#3〈zi〉, (1)

where 〈c〉 denotes the h = Θ(log n)-bit binary representation of integer c, and←−〈c〉 its reverse (and the #i’s are again new characters).

Consider a balanced binary tree on values 1, 2, . . . , n in its leaves in this order.Associating 0 with left-branches and 1 with right-branches, paths from the rootto the leaves define prefix codes for the values. An interval [c, d] can be parti-tioned into O(log n) intervals such that each interval corresponds to a differentsubtree; denote by P (c, d) the set of O(log n) prefix codes defined by paths fromthe root to the roots of these O(log n) subtrees. We cast a given 4-dimensionalrange query [x`, xr]× [y`, yr]× [z`, zr]× [t`, tr] into O(log4 n) two-pattern queries

P+1 =

←−〈y〉#1〈x〉 and P+

2 =←−〈t〉#3〈z〉 (2)

for all (x, y, z, t) ∈ P (x`, xr)× P (y`, yr)× P (z`, zr)× P (t`, tr). One can now seethat these O(log4 n) two-pattern queries on D = D1, . . . , Dn of total lengthN = Θ(n log n) bits, constructed using Eq. (1), solve the 4-dimensional rangereporting query.

Theorem 3. On a pointer machine, an index on a document collection of totallength n supporting two-pattern matching in O(|P+

1 |+ |P+2 |+nmatch +polylog n)

time requires Ω(n(log n/ log log n)3) bits in the worst case.

Proof. The reduction showed the connection between an N bit collection and4-dimensional range queries on Θ(N/ logN) points, so recasting the result witha collection of length n gives

Ω(n/ log n(log(n/ log n)/ log log(n/ log n))4−1)

11

Page 12: Forbidden Patterns - CiteSeerX

words, i.e. Ω(n(log n/ log log n)3) bits. ut

12