TinyLex: Static N-Gram Index Pruning with Perfect Recall
Derrick Coetzee, Microsoft Research
CC0 waiver: To the extent possible under law, I waive all copyright and related or neighboring rights to all content in this presentation.

Jan 15, 2016

Transcript
Page 1: TinyLex: Static N-Gram Index Pruning with Perfect Recall

Derrick Coetzee, Microsoft Research

CC0 waiver: To the extent possible under law, I waive all copyright and related or neighboring rights to all content in this presentation.

Page 2: Motivation

Consider searching for a subsequence in a collection of genome sequences: …gcaagctttatagtgacaacaataaggtatcactcggtt…

N-gram inverted indexes are the traditional solution, but have 10-100 times more terms than ordinary word-based inverted indexes.

TinyLex indexes achieve similar query performance with 7-17 times fewer terms.

TinyLex provides good worst-case query performance.

Page 3: Inverted indexes

1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.

each: {1, 2, 3}
had: {1, 2, 3}
seven: {1, 2, 3}
wife: {1, 4}
sack: {1, 2, 4}
cat: {2, 3, 4}
kit: {3, 4}

Page 4: Inverted indexes

1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.

Query: sack and cat
sack: {1, 2, 4}
cat: {2, 3, 4}
{1, 2, 4} ∩ {2, 3, 4} = {2, 4}
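
The lookup above can be sketched in a few lines of Python. This is only an illustration: the documents are tokenized and stemmed by hand so that they match the posting lists on the slide (the talk does not say which stemmer or stop-word list was used), and the names `build_index` and `query_and` are made up for this sketch.

```python
from collections import defaultdict

# Documents from the slide, pre-tokenized and stemmed by hand (an assumption).
# The word "and" in document 4 is dropped because the slide does not index it.
DOCS = {
    1: ["each", "wife", "had", "seven", "sack"],
    2: ["each", "sack", "had", "seven", "cat"],
    3: ["each", "cat", "had", "seven", "kit"],
    4: ["kit", "cat", "sack", "wife"],
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)

def query_and(index, terms):
    """Return the documents containing every term (posting-list intersection)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

index = build_index(DOCS)
result = query_and(index, ["sack", "cat"])
print(sorted(result))
```

A term missing from the index contributes an empty posting list, so the conjunctive query returns no documents.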

Page 5: Limitations of inverted indexes

Partial word or punctuation queries
◦ Searching a dictionary for all words ending in “ment”
◦ Searching for <b> in HTML files
◦ Searching for "%s" in C source files
◦ Searching for x^2/2 in LaTeX source files
Searching East Asian language text
◦ No spaces, word extraction is complex
Phrase searching

Page 6: Limitations of inverted indexes

Genome sequences:
1. gcaagctttatagtgacaac...
2. aataaggtatcactcggtta...
3. caattacccccacttcccct...
4. cattataaagaaatgatcaa...

Example query: Documents containing subsequence “cact”

Page 7: N-gram inverted indexes

Simplified example: Two-letter alphabet
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

aaa: {2}
aab: {2, 3, 4}
aba: {1, 2, 3}
abb: {1, 2, 4}
baa: {2, 3, 4}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}

Page 8: N-gram inverted indexes

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Query: aaba
aaba → aab and aba

Page 9: N-gram inverted indexes

1. babbbbabab
2. aababaaabb
3. babababaab (false positive)
4. bbbbaabbbb

Query: aaba → aab and aba
aab: {2, 3, 4}
aba: {1, 2, 3}
{2, 3, 4} ∩ {1, 2, 3} = {2, 3}
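
As a rough sketch of the query above, the following Python builds the 3-gram index for the four example documents, intersects the posting lists for the 3-grams of "aaba", and then checks each candidate against the document text to separate true matches from false positives. The function names are illustrative and not taken from the talk.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

def ngrams(text, n):
    """All distinct substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_ngram_index(docs, n):
    """Map each n-gram to the set of documents it occurs in."""
    index = {}
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index

def candidates(index, query, n):
    """Intersect the posting lists of every n-gram of the query.

    Queries shorter than n have no n-grams and cannot be answered this way;
    this sketch simply returns an empty set for them.
    """
    result = None
    for gram in ngrams(query, n):
        postings = index.get(gram, set())
        result = set(postings) if result is None else result & postings
    return result if result is not None else set()

index = build_ngram_index(DOCS, 3)
cands = candidates(index, "aaba", 3)
matches = {d for d in cands if "aaba" in DOCS[d]}
false_positives = cands - matches
print(sorted(cands), sorted(matches), sorted(false_positives))
```

Document 3 contains both "aab" and "aba" but never adjacent in the order "aaba", which is why it survives the intersection and has to be rejected by reading the document.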

Page 10: Selecting n-gram length

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

length = 1

a: {1, 2, 3, 4}
b: {1, 2, 3, 4}

Small number of terms
Slow queries
• Long posting lists
• Too many false positives

Page 11: Selecting n-gram length

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

length = 6

aababa: {2}
aabbbb: {4}
abaaab: {2}
ababaa: {2, 3}
ababab: {3}
abbbba: {1}
baaabb: {2}
baabbb: {4}
babaaa: {2}
babaab: {3}
bababa: {3}
babbbb: {1}
bbaabb: {4}
bbabab: {1}
bbbaab: {4}
bbbaba: {1}
bbbbaa: {4}
bbbbab: {1}

Fast queries
Too many terms
Queries must be ≥6 characters
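
To make the trade-off concrete, this small sketch counts the distinct n-grams (index terms) in the four example documents for each fixed length from 1 to 6. The helper name `count_terms` is an assumption made for illustration.

```python
DOCS = ["babbbbabab", "aababaaabb", "babababaab", "bbbbaabbbb"]

def count_terms(docs, n):
    """Number of distinct n-grams of length n across all documents."""
    return len({doc[i:i + n] for doc in docs for i in range(len(doc) - n + 1)})

counts = {n: count_terms(DOCS, n) for n in range(1, 7)}
for n, count in counts.items():
    print(f"length = {n}: {count} terms")
```

For this toy collection, length 1 gives 2 terms and length 6 gives the 18 terms listed on the slide, even though each document is only ten characters long.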

Page 12: Overview

Review of inverted n-gram indexes
Example TinyLex index
TinyLex index construction
Results
Disadvantages
Questions

Page 13: TinyLex

Goal: fewer terms without sacrificing query performance

Consider the n-grams “juggl” and “uggle”
◦ Almost exactly the same posting list in a typical English language collection
◦ Just put the n-gram “uggl” in the index, and leave out “juggl” and “uggle”

juggl: {2, 7, 33}
uggle: {2, 7, 33}

uggl: {2, 7, 33}

Page 14: TinyLex

Insight: The more false positives a term produces when it is queried for, the more information it adds when it is added to the index.

Choose a false positive threshold t and choose the smallest possible set of index terms that satisfies it.

Allow variable-length n-grams.

Page 15: TinyLex: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

aa: {2, 3, 4}
bb: {1, 2, 4}
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
aaba: {2}
baab: {3, 4}
babb: {1}

In this example t = 1. At most 1 false positive is allowed for any query.

Only 10 terms!

Page 16: TinyLex: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Query: abaab → aba and baab
aba: {1, 2, 3}
baab: {3, 4}
{1, 2, 3} ∩ {3, 4} = {3}
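
A minimal sketch of querying a variable-length index such as the one in this example. The index is copied from the example slide. As a simplifying assumption, the sketch intersects the posting list of every index term that occurs anywhere in the query, whereas the slide picks just "aba" and "baab"; the talk does not spell out how terms are selected, so treat `candidates` as illustrative only.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

# Index copied from the TinyLex example slide (t = 1).
TINYLEX_INDEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4}, "aaa": {2}, "aba": {1, 2, 3},
    "bab": {1, 2, 3}, "bba": {1, 4}, "bbb": {1, 4}, "aaba": {2},
    "baab": {3, 4}, "babb": {1},
}

def candidates(index, query, all_docs):
    """Intersect posting lists of all index terms that are substrings of query.

    If no index term occurs in the query, every document is a candidate.
    """
    result = set(all_docs)
    for term, postings in index.items():
        if term in query:
            result &= postings
    return result

cands = candidates(TINYLEX_INDEX, "abaab", DOCS)
print(sorted(cands))
```

Here "aa", "aba", and "baab" all occur in "abaab"; adding "aa" to the intersection does not change the result on the slide.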

Page 17: TinyLex: Nonoccurring terms

The construction guarantees that if the query term occurs in the collection, it will have at most t – 1 false positives (zero in this case).

If we observe t false positives, we can halt immediately.

Page 18: TinyLex: Nonoccurring terms

Query: bbbbb → bbb and bbb and bbb
bbb: {1, 4}
{1, 4} ∩ {1, 4} ∩ {1, 4} = {1, 4}

1. babbbbabab (false positive)

...can’t happen unless the query result is empty. Halt.
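
The early halt can be sketched as follows. The index is the one from the example slide with t = 1; the candidate selection (intersecting every index term found in the query) and the order in which candidate documents are read are assumptions of this sketch, not details given in the talk.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

# Index copied from the TinyLex example slide (t = 1).
INDEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4}, "aaa": {2}, "aba": {1, 2, 3},
    "bab": {1, 2, 3}, "bba": {1, 4}, "bbb": {1, 4}, "aaba": {2},
    "baab": {3, 4}, "babb": {1},
}

def candidates(index, query, all_docs):
    """Intersect posting lists of all index terms that are substrings of query."""
    result = set(all_docs)
    for term, postings in index.items():
        if term in query:
            result &= postings
    return result

def search(index, docs, query, t):
    """Return (matching doc ids, number of documents read)."""
    matches, false_positives, docs_read = set(), 0, 0
    for doc_id in sorted(candidates(index, query, docs)):
        docs_read += 1
        if query in docs[doc_id]:
            matches.add(doc_id)
        else:
            false_positives += 1
            if false_positives >= t:
                # An occurring query has at most t - 1 false positives,
                # so this query cannot occur anywhere: stop reading.
                return set(), docs_read
    return matches, docs_read

missing = search(INDEX, DOCS, "bbbbb", t=1)
present = search(INDEX, DOCS, "abaab", t=1)
print(missing, present)
```

For "bbbbb" the candidates are documents 1 and 4, but the search stops after reading document 1, since a single false positive already reaches the threshold.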

Page 19: TinyLex: Benefits

Achieves query performance similar to that of classical n-gram indexes that have a much larger number of terms

Worst-case bound on number of false positives

Query can be any length

Page 20: Overview

Review of inverted n-gram indexes
Example TinyLex index
TinyLex index construction
Results
Disadvantages
Questions

Page 21: Constructing a TinyLex index

The problem:
◦ Input: a set of documents, a threshold t
◦ Output: a list of terms such that any query for a term occurring in the collection will have at most t – 1 false positives

Page 22: Constructing a TinyLex index

Basic construction: For each n-gram length from 1 to max:
◦ Make a list of all n-grams in the collection and what documents they occur in.
◦ Perform a query on each term using the partially constructed index.
◦ If a term has too many false positives (at least t), add it to the index.

Page 23: Construction: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

t = 1

If the difference between the query result size and the actual posting list size is at least 1, add the term to the index.

Index: (index empty)

1-grams   Query result   Actual
a         {1,2,3,4}      {1,2,3,4}
b         {1,2,3,4}      {1,2,3,4}

Page 24: Construction: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index: (index empty)

2-grams   Query result   Actual
aa        {1,2,3,4}      {2,3,4}
ab        {1,2,3,4}      {1,2,3,4}
ba        {1,2,3,4}      {1,2,3,4}
bb        {1,2,3,4}      {1,2,4}

Page 25: Construction: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index:
aa: {2,3,4}
bb: {1,2,4}

2-grams   Query result   Actual
aa        {1,2,3,4}      {2,3,4}
ab        {1,2,3,4}      {1,2,3,4}
ba        {1,2,3,4}      {1,2,3,4}
bb        {1,2,3,4}      {1,2,4}

Page 26: Construction: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index:
aa: {2,3,4}
bb: {1,2,4}

3-grams   Query result   Actual
aaa       {2,3,4}        {2}
aab       {2,3,4}        {2,3,4}
aba       {1,2,3,4}      {1,2,3}
abb       {1,2,4}        {1,2,4}
baa       {2,3,4}        {2,3,4}
bab       {1,2,3,4}      {1,2,3}
bba       {1,2,4}        {1,4}
bbb       {1,2,4}        {1,4}

Page 27: Construction: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}

3-grams   Query result   Actual
aaa       {2,3,4}        {2}
aab       {2,3,4}        {2,3,4}
aba       {1,2,3,4}      {1,2,3}
abb       {1,2,4}        {1,2,4}
baa       {2,3,4}        {2,3,4}
bab       {1,2,3,4}      {1,2,3}
bba       {1,2,4}        {1,4}
bbb       {1,2,4}        {1,4}

Page 28: Construction: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}

4-grams   Query result   Actual
aaab      {2}            {2}
aaba      {2,3}          {2}
aabb      {2,4}          {2,4}
abaa      {2,3}          {2,3}
abab      {1,2,3}        {1,2,3}
abbb      {1,4}          {1,4}
baaa      {2}            {2}
baab      {2,3,4}        {3,4}
baba      {1,2,3}        {1,2,3}
babb      {1,2}          {1}
bbaa      {4}            {4}
bbab      {1}            {1}
bbba      {1,4}          {1,4}
bbbb      {1,4}          {1,4}

Page 29: Construction: Example

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
aaba: {2}
baab: {3,4}
babb: {1}

4-grams   Query result   Actual
aaab      {2}            {2}
aaba      {2,3}          {2}
aabb      {2,4}          {2,4}
abaa      {2,3}          {2,3}
abab      {1,2,3}        {1,2,3}
abbb      {1,4}          {1,4}
baaa      {2}            {2}
baab      {2,3,4}        {3,4}
baba      {1,2,3}        {1,2,3}
babb      {1,2}          {1}
bbaa      {4}            {4}
bbab      {1}            {1}
bbba      {1,4}          {1,4}
bbbb      {1,4}          {1,4}
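
The basic construction walked through above can be sketched in Python as follows. This is a straightforward reading of the slides, not the author's implementation: it assumes a query intersects the posting lists of every index term that occurs in the queried string, that an empty intersection set-up means every document is a candidate, and that the maximum n-gram length is 4 as in the example. The names `build_tinylex` and `query_candidates` are hypothetical.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

def ngrams(text, n):
    """All distinct substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def query_candidates(index, query, all_docs):
    """Intersect posting lists of all index terms that are substrings of query."""
    result = set(all_docs)
    for term, postings in index.items():
        if term in query:
            result &= postings
    return result

def build_tinylex(docs, t, max_len):
    """Add every n-gram whose query yields at least t false positives."""
    index = {}
    for n in range(1, max_len + 1):
        # Actual posting list of every n-gram of this length.
        actual = {}
        for doc_id, text in docs.items():
            for gram in ngrams(text, n):
                actual.setdefault(gram, set()).add(doc_id)
        # Query each n-gram against the index built from shorter lengths.
        additions = {}
        for gram, postings in actual.items():
            cands = query_candidates(index, gram, docs)
            if len(cands) - len(postings) >= t:
                additions[gram] = postings
        index.update(additions)
    return index

index = build_tinylex(DOCS, t=1, max_len=4)
for term in sorted(index, key=lambda s: (len(s), s)):
    print(term, sorted(index[term]))
```

With t = 1 and a maximum length of 4, this reproduces the ten terms of the example index. It recomputes every posting list from scratch at each length, so it is only suitable for toy inputs.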

Page 30: Overview

Review of inverted n-gram indexes
Example TinyLex index
TinyLex index construction
Results
Disadvantages
Questions

Page 31: Results

Test set: 100MB TREC WSJ collection
◦ 37000 documents, English text
Same query performance with 7-17 times fewer terms

[Figure: mean query time (ms, 0 to 600) versus number of terms (1E+3 to 1E+6) for the TinyLex index and the classical n-gram index]

Page 32: Results

Overall compressed index size is 2-20% smaller
TinyLex index has more information per term

[Figure: mean query time (ms, 0 to 600) versus index size (MB, 0 to 100) for the TinyLex index and the classical n-gram index]

Page 33: Results

Dramatic 50x improvement in worst-case query performance for long queries

[Figure: worst query time (ms, 0 to 4000) versus query length in characters (0 to 90) for 6-grams and a TinyLex index of the same size]

Page 34: See paper for

Applications to phrase searching using variable-length word n-grams
Making the construction more efficient
Performance on genome sequences
Empirical evaluation of scaling

Page 35: Related work

Suffix arrays (Manber and Myers 1991)
◦ Faster queries, but indexes 3-10 times larger
agrep and GLIMPSE (Wu and Manber 1994)
◦ More general queries, but relies on a word concept
n-Gram/2L (Kim et al 2005)
◦ Orthogonal; examines fewer document offsets
“Growing an n-gram language model” (Siivola and Pellom 2005)
◦ Similar idea applied to language modeling

Page 36: Future work

Faster construction time
◦ Currently about 10 times slower to construct than a classical n-gram index.
Queries for nonoccurring terms are more expensive than with classical n-gram indexes (t documents must be read).
Generalize to dynamic collections

Page 37: Conclusions

N-gram indexes enable practical queries for subsequences

TinyLex indexes achieve similar query performance to classical n-gram indexes with 7-17 times fewer terms

TinyLex yields good worst-case query performance by placing an upper bound on the number of false positives

Page 38: Questions?