Page 1: IR2

Advanced Topics in Information Systems: Information Retrieval

Jun.Prof. Alexander Markowetz

Slides modified from Christopher Manning and Prabhakar Raghavan

Page 2: IR2

DICTIONARY DATA STRUCTURES

Page 3: IR2

Hashes

Each vocabulary term is hashed to an integer
(We assume you’ve seen hashtables before)
Pros: lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants: judgment/judgement
No prefix search [tolerant retrieval]
If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

Sec. 3.1

Page 4: IR2

Tree: binary tree

[Figure: binary search tree over the dictionary. The root splits a-m / n-z; internal nodes split a-hu, hy-m, n-sh, si-z; leaves hold terms such as aardvark, huygens, sickle, zygot.]

Sec. 3.1

Page 5: IR2

Tree: B-tree

Definition: every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].

[Figure: B-tree whose root splits the ranges a-hu, hy-m, n-z.]

Sec. 3.1

Page 6: IR2

Trees

Simplest: binary tree
More usual: B-trees
Trees require a standard ordering of characters and hence strings … but we standardly have one
Pros:
Solves the prefix problem (terms starting with hyp)
Cons:
Slower: O(log M) [and this requires a balanced tree]
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem

Sec. 3.1

Page 7: IR2

WILD-CARD QUERIES

Page 8: IR2

Wild-card queries: *

mon*: find all docs containing any word beginning with “mon”.
Easy with a binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
*mon: find words ending in “mon”: harder
Maintain an additional B-tree for terms written backwards.
Can retrieve all words in range: nom ≤ w < non.

Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Sec. 3.2
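Not on the original slides: a minimal Python sketch of the range lookup described above, using a sorted list in place of a B-tree; the helper names are mine, and the upper-bound trick ignores the edge case of a prefix ending in 'z'.

```python
import bisect

lexicon = sorted(["moan", "mon", "monday", "money", "month", "moo", "moon"])

def prefix_range(lex, prefix):
    """Return all terms w with prefix <= w < (prefix with last char bumped)."""
    lo = bisect.bisect_left(lex, prefix)
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "mon" -> "moo"
    hi = bisect.bisect_left(lex, upper)
    return lex[lo:hi]

print(prefix_range(lexicon, "mon"))  # ['mon', 'monday', 'money', 'month']
```

A *mon query would run the same lookup on a lexicon of reversed terms, searching for the prefix "nom".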

Page 9: IR2

Query processing

At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
We still have to look up the postings for each enumerated term.
E.g., consider the query: se*ate AND fil*er
This may result in the execution of many Boolean AND queries.

Sec. 3.2

Page 10: IR2

B-trees handle *’s at the end of a query term

How can we handle *’s in the middle of a query term? co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets – expensive
The solution: transform wild-card queries so that the *’s occur at the end
This gives rise to the Permuterm Index.

Sec. 3.2

Page 11: IR2

Permuterm index

For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.

Queries:
X → lookup on X$
X* → lookup on $X*
*X → lookup on X$*
*X* → lookup on X*
X*Y → lookup on Y$X*
X*Y*Z → ??? Exercise!

Query = hel*o: X = hel, Y = o
Lookup o$hel*

Sec. 3.2.1
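A short sketch (mine, not from the slides) of the permuterm rotations and the query rotation for a single-* wildcard:

```python
def permuterm_rotations(term):
    """All rotations of term+'$' (hello -> hello$, ello$h, ..., $hello)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(query):
    """Rotate a single-* wildcard so the * lands at the end (X*Y -> Y$X*)."""
    x, y = query.split("*")
    return y + "$" + x + "*"

print(permuterm_rotations("hello"))  # includes 'o$hell' and '$hello'
print(rotate_query("hel*o"))         # 'o$hel*' -- then a B-tree prefix lookup
```

Note that the same rotation also reproduces the table above: "hel*" becomes "$hel*" and "*mon" becomes "mon$*".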

Page 12: IR2

Permuterm query processing

Rotate the query wild-card to the right
Now use B-tree lookup as before.
Permuterm problem: ≈ quadruples lexicon size
Empirical observation for English.

Sec. 3.2.1

Page 13: IR2

Bigram (k-gram) indexes

Enumerate all k-grams (sequences of k chars) occurring in any term
e.g., from text “April is the cruelest month” we get the 2-grams (bigrams)
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,h$
$ is a special word boundary symbol
Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

Sec. 3.2.2

Page 14: IR2

Bigram index example

The k-gram index finds terms based on a query consisting of k-grams (here k=2).

[Figure: excerpt of the bigram index, e.g., $m → mace, madden; mo → among, amortize; on → among, around.]

Sec. 3.2.2

Page 15: IR2

Processing wild-cards

Query mon* can now be run as
$m AND mo AND on
Gets terms that match the AND version of our wildcard query.
But we’d also enumerate moon.
Must post-filter these terms against the query.
Surviving enumerated terms are then looked up in the term-document inverted index.
Fast, space efficient (compared to permuterm).

Sec. 3.2.2
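A minimal end-to-end sketch of the scheme above (not from the slides; the dictionary and helper names are mine): build the bigram index, run the AND, then post-filter so moon is discarded.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """k-grams of $term$, with $ as the word-boundary symbol."""
    t = "$" + term + "$"
    return {t[i:i+k] for i in range(len(t) - k + 1)}

dictionary = ["moon", "month", "monday", "amortize", "mace"]
index = defaultdict(set)                 # bigram -> terms containing it
for term in dictionary:
    for g in kgrams(term):
        index[g].add(term)

# mon* -> $m AND mo AND on; moon survives the AND but fails the wildcard.
candidates = index["$m"] & index["mo"] & index["on"]
matches = [t for t in candidates if t.startswith("mon")]
print(sorted(candidates))  # ['monday', 'month', 'moon']
print(sorted(matches))     # ['monday', 'month']
```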

Page 16: IR2

Processing wild-card queries

As before, we must execute a Boolean query for each enumerated, filtered term.
Wild-cards can result in expensive query execution (very large disjunctions…)
pyth* AND prog*
If you encourage “laziness”, people will respond!

[Mock search box: “Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.”]

Which web search engines allow wildcard queries?

Sec. 3.2.2

Page 17: IR2

SPELLING CORRECTION

Page 18: IR2

Spell correction

Two principal uses:
Correcting document(s) being indexed
Correcting user queries to retrieve “right” answers

Two main flavors:
Isolated word: check each word on its own for misspelling. Will not catch typos resulting in correctly spelled words, e.g., from → form
Context-sensitive: look at surrounding words, e.g., I flew form Heathrow to Narita.

Sec. 3.3

Page 19: IR2

Document correction

Especially needed for OCR’ed documents
Correction algorithms are tuned for this: rn/m
Can use domain-specific knowledge
E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).
But also: web pages and even printed material have typos
Goal: the dictionary contains fewer misspellings
But often we don’t change the documents but aim to fix the query-document mapping

Sec. 3.3

Page 20: IR2

Query mis-spellings

Our principal focus here
E.g., the query Alanis Morisett
We can either
Retrieve documents indexed by the correct spelling, OR
Return several suggested alternative queries with the correct spelling
Did you mean … ?

Sec. 3.3

Page 21: IR2

Isolated word correction

Fundamental premise – there is a lexicon from which the correct spellings come
Two basic choices for this:
A standard lexicon, such as Webster’s English Dictionary or an “industry-specific” lexicon – hand-maintained
The lexicon of the indexed corpus
E.g., all words on the web
All names, acronyms etc.
(Including the mis-spellings)

Sec. 3.3.2

Page 22: IR2

Isolated word correction

Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
What’s “closest”? We’ll study several alternatives:
Edit distance (Levenshtein distance)
Weighted edit distance
n-gram overlap

Sec. 3.3.2

Page 23: IR2

Edit distance

Given two strings S1 and S2, the minimum number of operations to convert one to the other
Operations are typically character-level: Insert, Delete, Replace, (Transposition)
E.g., the edit distance from dof to dog is 1
From cat to act is 2 (Just 1 with transpose.)
From cat to dog is 3.
Generally found by dynamic programming.
See http://www.merriampark.com/ld.htm for a nice example plus an applet.

Sec. 3.3.3
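A sketch of the dynamic program mentioned above (plain Levenshtein, with insert/delete/replace each costing 1; transposition omitted):

```python
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # d[i][j] = distance between the prefixes s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                           # delete everything
    for j in range(n + 1):
        d[0][j] = j                           # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # replace (or match)
    return d[m][n]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "act"))  # 2 (1 if transposition were allowed)
print(edit_distance("cat", "dog"))  # 3
```

The weighted variant on the next slide replaces the constant costs with entries from a weight matrix.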

Page 24: IR2

Weighted edit distance

As above, but the weight of an operation depends on the character(s) involved
Meant to capture OCR or keyboard errors, e.g. m is more likely to be mis-typed as n than as q
Therefore, replacing m by n is a smaller edit distance than replacing m by q
This may be formulated as a probability model
Requires a weight matrix as input
Modify the dynamic programming to handle weights

Sec. 3.3.3

Page 25: IR2

Using edit distances

Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
Intersect this set with the list of “correct” words
Show the terms you found to the user as suggestions
Alternatively,
We can look up all possible corrections in our inverted index and return all docs … slow
We can run with a single most likely correction
The alternatives disempower the user, but save a round of interaction with the user

Sec. 3.3.4

Page 26: IR2

Edit distance to all dictionary terms?

Given a (mis-spelled) query – do we compute its edit distance to every dictionary term?
Expensive and slow. Alternative?
How do we cut the set of candidate dictionary terms?
One possibility is to use n-gram overlap for this
This can also be used by itself for spelling correction.

Sec. 3.3.4

Page 27: IR2

n-gram overlap

Enumerate all the n-grams in the query string as well as in the lexicon
Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
Threshold by number of matching n-grams
Variants – weight by keyboard layout, etc.

Sec. 3.3.4

Page 28: IR2

Example with trigrams

Suppose the text is november
Trigrams are nov, ove, vem, emb, mbe, ber.
The query is december
Trigrams are dec, ece, cem, emb, mbe, ber.
So 3 trigrams overlap (of 6 in each term)
How can we turn this into a normalized measure of overlap?

Sec. 3.3.4

Page 29: IR2

One option – Jaccard coefficient

A commonly-used measure of overlap
Let X and Y be two sets; then the J.C. is
|X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements and zero when they are disjoint
X and Y don’t have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if J.C. > 0.8, declare a match

Sec. 3.3.4
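A small sketch (mine) combining the trigram example and the Jaccard threshold above:

```python
def trigrams(term):
    return {term[i:i+3] for i in range(len(term) - 2)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

q, cand = trigrams("december"), trigrams("november")
print(sorted(q & cand))            # ['ber', 'emb', 'mbe'] -- 3 shared trigrams
print(round(jaccard(q, cand), 2))  # 3 shared of 9 distinct -> 0.33; below a
                                   # 0.8 threshold, so no match is declared
```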

Page 30: IR2

Matching trigrams

Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

[Figure: postings in the bigram index. lo → alone, lord, sloth; or → border, lord, morbid; rd → ardent, border, card.]

Standard postings “merge” will enumerate …
Adapt this to using Jaccard (or another) measure.

Sec. 3.3.4

Page 31: IR2

Context-sensitive spell correction

Text: I flew from Heathrow to Narita.
Consider the phrase query “flew form Heathrow”
We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.

Sec. 3.3.5

Page 32: IR2

Context-sensitive correction

Need surrounding context to catch this.
First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
Now try all possible resulting phrases with one word “fixed” at a time
flew from heathrow
fled form heathrow
flea form heathrow
Hit-based spelling correction: suggest the alternative that has lots of hits.

Sec. 3.3.5

Page 33: IR2

Exercise

Suppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in this scheme?

Sec. 3.3.5

Page 34: IR2

Another approach

Break the phrase query into a conjunction of biwords (Lecture 2).
Look for biwords that need only one term corrected.
Enumerate phrase matches and … rank them!

Sec. 3.3.5

Page 35: IR2

General issues in spell correction

We enumerate multiple alternatives for “Did you mean?”
Need to figure out which to present to the user
Use heuristics:
The alternative hitting most docs
Query log analysis + tweaking
For especially popular, topical queries
Spell-correction is computationally expensive
Avoid running routinely on every query?
Run only on queries that matched few docs

Sec. 3.3.5

Page 36: IR2

SOUNDEX

Page 37: IR2

Soundex

Class of heuristics to expand a query into phonetic equivalents
Language specific – mainly for names
E.g., chebyshev → tchebycheff
Invented for the U.S. census … in 1918

Sec. 3.4

Page 38: IR2

Soundex – typical algorithm

Turn every token to be indexed into a 4-character reduced form
Do the same with query terms
Build and search an index on the reduced forms (when the query calls for a soundex match)

http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

Sec. 3.4

Page 39: IR2

Soundex – typical algorithm

1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6

Sec. 3.4

Page 40: IR2

Soundex continued

4. Remove all pairs of consecutive digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.

Will hermann generate the same code?

Sec. 3.4
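A compact sketch of the six steps above (my rendering; collapsing runs of identical digits stands in for step 4):

```python
def soundex(word):
    codes = {**dict.fromkeys("AEIOUHWY", "0"),
             **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    first = word[0]                                   # step 1: retain first letter
    digits = [codes[c] for c in word if c in codes]   # steps 2-3
    # Step 4: collapse runs of the same digit to a single occurrence.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # Step 5: drop zeros from the coded remainder (the first letter stays literal).
    rest = [d for d in collapsed[1:] if d != "0"]
    # Step 6: pad with trailing zeros and return the first four positions.
    return (first + "".join(rest) + "000")[:4]

print(soundex("Herman"))   # H655
print(soundex("hermann"))  # H655 -- yes, the same code
```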

Page 41: IR2

Soundex

Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)
How useful is soundex?
Not very – for information retrieval
Okay for “high recall” tasks (e.g., Interpol), though biased to names of certain nationalities
Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR

Sec. 3.4

Page 42: IR2

What queries can we process?

We have:
Positional inverted index with skip pointers
Wild-card index
Spell-correction
Soundex

Queries such as
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

Page 43: IR2

INDEX GENERATION

Page 44: IR2

Index construction

How do we construct an index?
What strategies can we use with limited main memory?

Ch. 4

Page 45: IR2

Hardware basics

Many design decisions in information retrieval are based on the characteristics of hardware
We begin by reviewing hardware basics

Sec. 4.1

Page 46: IR2

Hardware basics

Access to data in memory is much faster than access to data on disk.
Disk seeks: no data is transferred from disk while the disk head is being positioned.
Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
Block sizes: 8 KB to 256 KB.

Sec. 4.1

Page 47: IR2

Hardware basics

Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
Available disk space is several (2–3) orders of magnitude larger.
Fault tolerance is very expensive: it’s much cheaper to use many regular machines than one fault-tolerant machine.
Google is particularly famous for combining standard hardware … in shipping containers.

Sec. 4.1

Page 48: IR2

Hardware assumptions

symbol  statistic                      value
s       average seek time              5 ms = 5 × 10^-3 s
b       transfer time per byte         0.02 μs = 2 × 10^-8 s
        processor’s clock rate         10^9 s^-1
p       low-level operation            0.01 μs = 10^-8 s
        (e.g., compare & swap a word)
        size of main memory            several GB
        size of disk space             1 TB or more

Sec. 4.1

Page 49: IR2

RCV1: our collection for this lecture

Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.
The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.
As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
This is one year of Reuters newswire (part of 1995 and 1996)

Sec. 4.2

Page 50: IR2

A Reuters RCV1 document

[Figure: a sample RCV1 newswire article.]

Sec. 4.2

Page 51: IR2

Reuters RCV1 statistics

symbol  statistic                                      value
N       documents                                      800,000
L       avg. # tokens per doc                          200
M       terms (= word types)                           400,000
        avg. # bytes per token (incl. spaces/punct.)   6
        avg. # bytes per token (without spaces/punct.) 4.5
        avg. # bytes per term                          7.5
        non-positional postings                        100,000,000

Sec. 4.2

Page 52: IR2

Recall IIR 1 index construction

Documents are parsed to extract words and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Resulting (term, doc #) pairs, in parsing order: I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sec. 4.2

Page 53: IR2

Key step

After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step. We have 100M items to sort.

Before sorting (parsing order): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting by term: ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

Sec. 4.2

Page 54: IR2

Scaling index construction

In-memory index construction does not scale.
How can we construct an index for very large collections?
Taking into account the hardware constraints we just learned about . . .
Memory, disk, speed, etc.

Sec. 4.2

Page 55: IR2

Sort-based index construction

As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex)
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
T = 100,000,000 in the case of RCV1
So … we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire
Thus: we need to store intermediate results on disk.

Sec. 4.2

Page 56: IR2

Use the same algorithm for disk?

Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
No: sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
We need an external sorting algorithm.

Sec. 4.2

Page 57: IR2

Bottleneck

Parse and build postings entries one doc at a time
Now sort postings entries by term (then by doc within each term)
Doing this with random disk seeks would be too slow – must sort T = 100M records

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?

Sec. 4.2
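A rough back-of-the-envelope for the exercise above (my arithmetic, using the seek time from the hardware-assumptions slide): N log2 N ≈ 10^8 × 26.6 ≈ 2.7 × 10^9 comparisons; at 2 seeks × 5 ms = 10 ms per comparison, that is about 2.7 × 10^7 seconds, i.e., on the order of 300 days. Hence random disk seeks must be avoided.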

Page 58: IR2

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)

12-byte (4+4+4) records (term, doc, freq).
These are generated as we parse docs.
Must now sort 100M such 12-byte records by term.
Define a Block ≈ 10M such records
Can easily fit a couple into memory.
Will have 10 such blocks to start with.
Basic idea of algorithm:
Accumulate postings for each block, sort, write to disk.
Then merge the blocks into one long sorted order.

Sec. 4.2

Page 59: IR2

Sec. 4.2

Page 60: IR2

Sorting 10 blocks of 10M records

First, read each block and sort within:
Quicksort takes 2N ln N expected steps
In our case 2 × (10M ln 10M) steps

Exercise: estimate the total time to read each block from disk and quicksort it.

10 times this estimate gives us 10 sorted runs of 10M records each.
Done straightforwardly, we need 2 copies of the data on disk
But we can optimize this

Sec. 4.2

Page 61: IR2

Sec. 4.2


Page 62: IR2

How to merge the sorted runs?

Can do binary merges, with a merge tree of ⌈log2 10⌉ = 4 layers.
During each layer, read runs into memory in blocks of 10M, merge, write back.

[Figure: two sorted runs on disk are read in chunks, merged in memory, and the merged run is written back to disk.]

Sec. 4.2

Page 63: IR2

How to merge the sorted runs?

But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously
Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you’re not killed by disk seeks

Sec. 4.2
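A minimal BSBI-style sketch of the two steps above (sort blocks, then n-way merge); this is my simplification, not the slides' pseudocode: a real system would stream each run from disk rather than load it whole.

```python
import heapq, os, pickle, tempfile

def write_sorted_block(records, directory):
    """Sort one block of (term, docID) records in memory; write it as a run."""
    path = os.path.join(directory, "run_%d.pkl" % len(os.listdir(directory)))
    with open(path, "wb") as f:
        pickle.dump(sorted(records), f)
    return path

def merge_runs(paths):
    """n-way merge of the sorted runs; heapq.merge yields them in order."""
    runs = []
    for p in paths:
        with open(p, "rb") as f:
            runs.append(pickle.load(f))   # simplified: load instead of stream
    return list(heapq.merge(*runs))

tmp = tempfile.mkdtemp()
r1 = write_sorted_block([("caesar", 2), ("brutus", 1), ("caesar", 1)], tmp)
r2 = write_sorted_block([("ambitious", 2), ("brutus", 2)], tmp)
print(merge_runs([r1, r2]))
# [('ambitious', 2), ('brutus', 1), ('brutus', 2), ('caesar', 1), ('caesar', 2)]
```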

Page 64: IR2

Remaining problem with sort-based algorithm

Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
Actually, we could work with (term, docID) postings instead of (termID, docID) postings . . .
. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

Sec. 4.3

Page 65: IR2

SPIMI: Single-pass in-memory indexing

Key idea 1: generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: don’t sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.

Sec. 4.3

Page 66: IR2

SPIMI-Invert

[Figure: SPIMI-Invert pseudocode.]

Merging of blocks is analogous to BSBI.

Sec. 4.3
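A sketch of SPIMI-Invert under the two key ideas above (my rendering, not the slide's pseudocode): a per-block dictionary, postings accumulated as they occur, and only the terms sorted when the block is written out.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """token_stream yields (term, docID) pairs for one block."""
    dictionary = defaultdict(list)      # term -> postings list for this block
    for term, doc_id in token_stream:
        postings = dictionary[term]     # term is added on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)     # postings need no sorting: docIDs
                                        # arrive in increasing order per block
    return {t: dictionary[t] for t in sorted(dictionary)}

block = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 2)]
print(spimi_invert(block))              # {'brutus': [1, 2], 'caesar': [1, 2]}
```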

Page 67: IR2

SPIMI: Compression

Compression makes SPIMI even more efficient.
Compression of terms
Compression of postings

Sec. 4.3

Page 68: IR2

Distributed indexing

For web-scale indexing (don’t try this at home!): must use a distributed computing cluster
Individual machines are fault-prone
Can unpredictably slow down or fail
How do we exploit such a pool of machines?

Sec. 4.4

Page 69: IR2

Google data centers

Google data centers mainly contain commodity machines.
Data centers are distributed around the world.
Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
Estimate: Google installs 100,000 servers each quarter.
Based on expenditures of 200–250 million dollars per year
This would be 10% of the computing capacity of the world!?!

Sec. 4.4

Page 70: IR2

Google data centers

If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system (all nodes up)?
Answer: 0.999^1000 ≈ 37% – i.e., roughly 63% of the time at least one node is down.
Calculate the number of servers failing per minute for an installation of 1 million servers.

Sec. 4.4

Page 71: IR2

Distributed indexing

Maintain a master machine directing the indexing job – considered “safe”.
Break up indexing into sets of (parallel) tasks.
The master machine assigns each task to an idle machine from a pool.

Sec. 4.4

Page 72: IR2

Parallel tasks

We will use two sets of parallel tasks:
Parsers
Inverters
Break the input document collection into splits
Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)

Sec. 4.4

Page 73: IR2

Parsers

Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term, doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j = 3.
Now to complete the index inversion

Sec. 4.4

Page 74: IR2

Inverters

An inverter collects all (term, doc) pairs (= postings) for one term-partition.
Sorts and writes to postings lists

Sec. 4.4

Page 75: IR2

Data flow

[Figure: the master assigns splits to parsers (map phase); each parser writes segment files partitioned by term range (a-f, g-p, q-z); one inverter per term range (reduce phase) reads the matching segment files and writes the postings.]

Sec. 4.4

Page 76: IR2

MapReduce

The index construction algorithm we just described is an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …
… without having to write code for the distribution part.
They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.

Sec. 4.4

Page 77: IR2

MapReduce

Index construction was just one phase.
Another phase: transforming a term-partitioned index into a document-partitioned index.
Term-partitioned: one machine handles a subrange of terms
Document-partitioned: one machine handles a subrange of documents
As we discuss later in the course, most search engines use a document-partitioned index … better load balancing, etc.

Sec. 4.4

Page 78: IR2

Schema for index construction in MapReduce

Schema of map and reduce functions:
map: input → list(k, v)
reduce: (k, list(v)) → output

Instantiation of the schema for index construction:
map: web collection → list(termID, docID)
reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)

Example for index construction:
map: d2: C died. d1: C came, C c’ed. → (<C, d2>, <died, d2>, <C, d1>, <came, d1>, <C, d1>, <c’ed, d1>)
reduce: (<C, (d2, d1, d1)>, <died, (d2)>, <came, (d1)>, <c’ed, (d1)>) → (<C, (d1:2, d2:1)>, <died, (d2:1)>, <came, (d1:1)>, <c’ed, (d1:1)>)

Sec. 4.4
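A toy, single-process rendering of the schema above (mine; the shuffle/group-by-key step that a MapReduce framework does is the middle loop):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_fn(term, doc_ids):
    """reduce: (term, list of docIDs) -> posting list with frequencies."""
    counts = defaultdict(int)
    for d in doc_ids:
        counts[d] += 1
    return term, sorted(counts.items())

pairs = map_fn("d1", "C came C c'ed") + map_fn("d2", "C died")
groups = defaultdict(list)                 # the shuffle/group-by-key step
for term, doc in pairs:
    groups[term].append(doc)
print([reduce_fn(t, ds) for t, ds in sorted(groups.items())])
# [('c', [('d1', 2), ('d2', 1)]), ("c'ed", [('d1', 1)]), ...]
```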

Page 79: IR2

Dynamic indexing

Up to now, we have assumed that collections are static.
They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have to be modified:
Postings updates for terms already in the dictionary
New terms added to the dictionary

Sec. 4.5

Page 80: IR2

Simplest approach

Maintain “big” main index
New docs go into “small” auxiliary index
Search across both, merge results
Deletions:
Invalidation bit-vector for deleted docs
Filter docs output on a search result by this invalidation bit-vector
Periodically, re-index into one main index

Sec. 4.5

Page 81: IR2

Issues with main and auxiliary indexes

Problem of frequent merges – you touch stuff a lot
Poor performance during merge
Actually:
Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
Merge is the same as a simple append.
But then we would need a lot of files – inefficient for the O/S.
Assumption for the rest of the lecture: the index is one big file.
In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)

Sec. 4.5

Page 82: IR2

Logarithmic merge

Maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory
Larger ones (I0, I1, …) on disk
If Z0 gets too big (> n), write it to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)
Or merge with I1 to form Z2
etc.

Sec. 4.5
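A sketch of the cascade above (mine): indexes here are just sorted posting lists, and "merge" is concatenate-and-sort, for brevity.

```python
def log_merge(z0, disk, n):
    """z0: in-memory index Z0; disk[i]: index I_i or None; n: capacity of Z0."""
    if len(z0) <= n:
        return z0, disk                 # Z0 still fits in memory
    z = sorted(z0)
    for i in range(len(disk) + 1):
        if i == len(disk):
            disk.append(z)              # open a new level and write Z as I_i
            break
        if disk[i] is None:
            disk[i] = z                 # write Z_i to disk as I_i
            break
        z = sorted(z + disk[i])         # merge with existing I_i to form Z_{i+1}
        disk[i] = None
    return [], disk

mem, disk = log_merge([3, 1, 2], [], n=2)
print(mem, disk)                        # [] [[1, 2, 3]] -- Z0 spilled as I0
```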

Page 83: IR2

Sec. 4.5


Page 84: IR2

Logarithmic merge

Auxiliary and main index: index construction time is O(T^2) as each posting is touched in each merge.
Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T)
So logarithmic merge is much more efficient for index construction
But query processing now requires the merging of O(log T) indexes
Whereas it is O(1) if you just have a main and auxiliary index

Sec. 4.5

Page 85: IR2

Further issues with multiple indexes

Collection-wide statistics are hard to maintain
E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user?
We said, pick the one with the most hits
How do we maintain the top ones with multiple indexes and invalidation bit vectors?
One possibility: ignore everything but the main index for such ordering
Will see more such statistics used in results ranking

Sec. 4.5

Page 86: IR2

Dynamic indexing at search engines

All the large search engines now do dynamic indexing
Their indices have frequent incremental changes
News items, blogs, new topical web pages
Sarah Palin, …
But (sometimes/typically) they also periodically reconstruct the index from scratch
Query processing is then switched to the new index, and the old index is then deleted

Sec. 4.5

Page 87: IR2

Sec. 4.5

Page 88: IR2

Other sorts of indexes

Positional indexes
Same sort of sorting problem … just larger

Building character n-gram indexes:
As text is parsed, enumerate n-grams.
For each n-gram, need pointers to all dictionary terms containing it – the “postings”.
Note that the same “postings entry” will arise repeatedly in parsing the docs – need efficient hashing to keep track of this.
E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous
Only need to process each term once
Why?

Sec. 4.5

Page 89: IR2

INDEX COMPRESSION

Page 90: IR2

Compressing Indexes

Collection statistics in more detail (with RCV1)
How big will the dictionary and postings be?
Dictionary compression
Postings compression

Ch. 5

Page 91: IR2

Why compression (in general)?

Use less disk space
Saves a little money
Keep more stuff in memory
Increases speed
Increase speed of data transfer from disk to memory
[read compressed data | decompress] is faster than [read uncompressed data]
Premise: decompression algorithms are fast
True of the decompression algorithms we use

Ch. 5

Page 92: IR2

Why compression for inverted indexes?

Dictionary
Make it small enough to keep in main memory
Make it so small that you can keep some postings lists in main memory too
Postings file(s)
Reduce disk space needed
Decrease time needed to read postings lists from disk
Large search engines keep a significant part of the postings in memory. Compression lets you keep more in memory
We will devise various IR-specific compression schemes

Ch. 5

Page 93: IR2

Recall Reuters RCV1

symbol  statistic                                      value
N       documents                                      800,000
L       avg. # tokens per doc                          200
M       terms (= word types)                           ~400,000
        avg. # bytes per token (incl. spaces/punct.)   6
        avg. # bytes per token (without spaces/punct.) 4.5
        avg. # bytes per term                          7.5
        non-positional postings                        100,000,000

Sec. 5.1

Page 94: IR2

Index parameters vs. what we index (details IIR Table 5.1, p.80)

                 dictionary            non-positional index    positional index
                 (word types, K)       (postings, K)           (postings, K)
                 Size    ∆%   cumul%   Size     ∆%   cumul%    Size     ∆%   cumul%
Unfiltered       484                   109,971                 197,879
No numbers       474     -2   -2       100,680  -8   -8        179,158  -9   -9
Case folding     392     -17  -19      96,969   -3   -12       179,158  0    -9
30 stopwords     391     -0   -19      83,390   -14  -24       121,858  -31  -38
150 stopwords    391     -0   -19      67,002   -30  -39       94,517   -47  -52
stemming         322     -17  -33      63,812   -4   -42       94,517   0    -52

Exercise: give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?

Sec. 5.1

Page 95: IR2

Lossless vs. lossy compression

Lossless compression: all information is preserved.
What we mostly do in IR.
Lossy compression: discard some information
Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
Chap/Lecture 7: prune postings entries that are unlikely to turn up in the top k list for any query.
Almost no loss of quality for the top k list.

Sec. 5.1

Page 96: IR2

Vocabulary vs. collection size

How big is the term vocabulary?
That is, how many distinct words are there?
Can we assume an upper bound?
Not really: at least 70^20 ≈ 10^37 different words of length 20
In practice, the vocabulary will keep growing with the collection size
Especially with Unicode

Sec. 5.1

Page 97: IR2

Vocabulary vs. collection size

Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number of tokens in the collection
Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½
It is the simplest possible relationship between the two in log-log space
An empirical finding (“empirical law”)

Sec. 5.1

Page 98: IR2

Heaps’ Law

For RCV1, the dashed line
log10 M = 0.49 log10 T + 1.64
is the best least squares fit.
Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms

Fig 5.1 p81

Sec. 5.1

Page 99: IR2

Exercises

What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps’ law?
Compute the vocabulary size M for this scenario:
Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average
What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?

Sec. 5.1
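A worked sketch of the second exercise (my computation): fit k and b from the two observations, then extrapolate with M = kT^b.

```python
import math

T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000
b = math.log(M2 / M1) / math.log(T2 / T1)   # = 0.5
k = M1 / T1 ** b                            # = 30.0

T = 20_000_000_000 * 200                    # 2e10 pages x 200 tokens = 4e12
print(b, k, k * T ** b)                     # 0.5 30.0 60000000.0 (~6e7 terms)
```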

Page 100: IR2

Zipf’s law

Heaps’ law gives the vocabulary size in collections.
We also study the relative frequencies of terms.
In natural language, there are a few very frequent terms and very many very rare terms.
Zipf’s law: the ith most frequent term has frequency proportional to 1/i.
cf_i ∝ 1/i, i.e., cf_i = K/i where K is a normalizing constant
cf_i is collection frequency: the number of occurrences of the term t_i in the collection.

Sec. 5.1

Page 101: IR2

Zipf consequences

If the most frequent term (the) occurs cf_1 times
then the second most frequent term (of) occurs cf_1/2 times
the third most frequent term (and) occurs cf_1/3 times
Equivalent: cf_i = K/i where K is a normalizing factor, so
log cf_i = log K - log i
Linear relationship between log cf_i and log i
Another power law relationship

Sec. 5.1

Page 102: IR2

Zipf’s law for Reuters RCV1

[Figure: log-log plot of collection frequency vs. rank for RCV1.]

Sec. 5.1

Page 103: IR2

Compression

Now, we will consider compressing the space for the dictionary and postings
Basic Boolean index only
No study of positional indexes, etc.
We will consider compression schemes

Ch. 5

Page 104: IR2

DICTIONARY COMPRESSION

Sec. 5.2


Page 105: IR2

Why compress the dictionary?

Search begins with the dictionary
We want to keep it in memory
Memory footprint competition with other applications
Embedded/mobile devices may have very little memory
Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time
So, compressing the dictionary is important

Sec. 5.2

Page 106: IR2

Dictionary storage – first cut

Array of fixed-width entries
~400,000 terms; 28 bytes/term = 11.2 MB.

Term (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
a                 656,265           →
aachen            65                →
….                ….                ….
zulu              221               →

A dictionary search structure sits on top of this array.

Sec. 5.2

Page 107: IR2

Fixed-width terms are wasteful

Most of the bytes in the Term column are wasted – we allot 20 bytes even for 1-letter terms.
And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
Written English averages ~4.5 characters/word.
Exercise: why is/isn’t this the number to use for estimating the dictionary size?
Avg. dictionary word in English: ~8 characters
(Short words dominate token counts but not the type average.)
How do we use ~8 characters per dictionary term?

Sec. 5.2

Page 108: IR2

Compressing the term list: Dictionary-as-a-String

Store the dictionary as a (long) string of characters:
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
A pointer to the next word shows the end of the current word.
Hope to save up to 60% of dictionary space.

[Figure: table of (Freq., Postings ptr., Term ptr.) entries – freqs 33, 29, 44, 126, … – whose term pointers index into the string.]

Total string length = 400K × 8B = 3.2 MB
Pointers resolve 3.2M positions: log2 3.2M = 22 bits = 3 bytes

Sec. 5.2

Page 109: IR2

Space for dictionary as a string

4 bytes per term for Freq.
4 bytes per term for pointer to Postings.
3 bytes per term pointer
Avg. 8 bytes per term in the term string
400K terms × 19 bytes ⇒ 7.6 MB (against 11.2 MB for fixed width)
Now avg. 11 bytes/term, not 20.

Sec. 5.2

Page 110: IR2

Blocking

Store pointers to every kth term string.
Example below: k = 4.
Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

[Figure: the (Freq., Postings ptr., Term ptr.) table, now with only one term pointer per block of 4.]

Save 9 bytes on 3 pointers.
Lose 4 bytes on term lengths.

Sec. 5.2

Page 111: IR2

Net

Example for block size k = 4
Where we used 3 bytes/pointer without blocking (3 × 4 = 12 bytes),
now we use 3 + 4 = 7 bytes.
Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
We can save more with larger k.

Why not go with larger k?

Sec. 5.2

Page 112: IR2

Exercise

Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.

Sec. 5.2

Page 113: IR2

Dictionary search without blocking

Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6

Exercise: what if the frequencies of query terms were non-uniform but known – how would you structure the dictionary search tree?

Sec. 5.2

Page 114: IR2

Dictionary search with blocking

Binary search down to the 4-term block;
then linear search through the terms in the block.
Blocks of 4 (binary tree), avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares

Sec. 5.2

Page 115: IR2

Exercise

Estimate the impact on search performance (and slowdown compared to k = 1) with blocking, for block sizes of k = 4, 8 and 16.

Sec. 5.2

Page 116: IR2

Front coding

Front-coding: sorted words commonly have a long common prefix – store differences only (for the last k-1 terms in a block of k)

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

(“automat” is encoded once; each entry then stores only the extra length beyond “automat”.)

Begins to resemble general string compression.

Sec. 5.2
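A sketch of front coding for one block of the sorted term list (mine; this common variant codes each term against its predecessor rather than against a single block-wide prefix as in the slide's notation):

```python
import os

def front_code(block):
    """First term in full, then (shared-prefix length, remaining suffix)."""
    out = [(len(block[0]), block[0])]
    for prev, term in zip(block, block[1:]):
        p = len(os.path.commonprefix([prev, term]))
        out.append((p, term[p:]))
    return out

print(front_code(["automata", "automate", "automatic", "automation"]))
# [(8, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
```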

Page 117: IR2

RCV1 dictionary compression summary

Technique                                         Size in MB
Fixed width                                       11.2
Dictionary-as-String with pointers to every term  7.6
Also, blocking k = 4                              7.1
Also, blocking + front coding                     5.9

Sec. 5.2

Page 118: IR2

POSTINGS COMPRESSION

Sec. 5.3


Page 119: IR2

Postings compression

The postings file is much larger than the dictionary, by a factor of at least 10.
Key desideratum: store each posting compactly.
A posting for our purposes is a docID.
For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
Alternatively, we can use log2 800,000 ≈ 20 bits per docID.
Our goal: use a lot less than 20 bits per docID.

Sec. 5.3

Page 120: IR2

Postings: two conflicting forces

A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ≈ 20 bits.
A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.
Prefer a 0/1 bitmap vector in this case

Sec. 5.3

Page 121: IR2

Postings file entry

We store the list of docs containing a term in increasing order of docID.
computer: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps.
33, 14, 107, 5, 43 …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

Sec. 5.3

Page 122: IR2

Three postings entries

[Figure: docIDs and gaps for three terms of very different frequency.]

Sec. 5.3

Page 123: IR2

Variable length encoding

Aim:
For arachnocentric, we will use ~20 bits/gap entry.
For the, we will use ~1 bit/gap entry.
If the average gap for a term is G, we want to use ~log2 G bits/gap entry.
Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
This requires a variable length encoding
Variable length codes achieve this by using short codes for small numbers

Sec. 5.3

Page 124: IR2

Variable Byte (VB) codes

For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits
Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c
If G ≤ 127, binary-encode it in the 7 available bits and set c = 1
Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher-order bits using the same algorithm
At the end set the continuation bit of the last byte to 1 (c = 1) – and for the other bytes c = 0.

Sec. 5.3

Page 125: IR2

Example

docIDs:   824       829      215406
gaps:               5        214577
VB code:  00000110 10111000 | 10000101 | 00001101 00001100 10110001

Postings stored as the byte concatenation:
000001101011100010000101000011010000110010110001

Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.

Sec. 5.3
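A sketch of VB encoding/decoding exactly as specified above (7 payload bits per byte, continuation bit set only on the last byte of each gap); function names are mine.

```python
def vb_encode_number(g):
    bytes_ = []
    while True:
        bytes_.insert(0, g % 128)   # lower-order 7 bits first, prepended
        if g < 128:
            break
        g //= 128
    bytes_[-1] += 128               # set the continuation bit on the last byte
    return bytes_

def vb_decode(byte_stream):
    numbers, g = [], 0
    for b in byte_stream:
        if b < 128:
            g = 128 * g + b
        else:                       # continuation bit set: last byte of this gap
            numbers.append(128 * g + (b - 128))
            g = 0
    return numbers

encoded = sum((vb_encode_number(g) for g in [824, 5, 214577]), [])
print(["{:08b}".format(b) for b in encoded])   # matches the table above
print(vb_decode(encoded))                      # [824, 5, 214577]
```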

Page 126: IR2

Other variable unit codes

Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.
Variable byte codes:
Used by many commercial/research systems
Good low-tech blend of variable-length coding and sensitivity to computer memory alignment (vs. bit-level codes, which we look at next).
There is also recent work on word-aligned codes that pack a variable number of gaps into one word

Sec. 5.3

Page 127: IR2

Unary code

Represent n as n 1s with a final 0.
Unary code for 3 is 1110.
Unary code for 40 is 11111111111111111111111111111111111111110.
Unary code for 80 is:
111111111111111111111111111111111111111111111111111111111111111111111111111111110

This doesn’t look promising, but….

Page 128: IR2

Gamma codes

We can compress better with bit-level codes
The Gamma code is the best known of these.
Represent a gap G as a pair (length, offset)
offset is G in binary, with the leading bit cut off
For example 13 → 1101 → 101
length is the length of offset
For 13 (offset 101), this is 3.
We encode length with unary code: 1110.
The Gamma code of 13 is the concatenation of length and offset: 1110101

Sec. 3.2 is wrong here – Sec. 5.3

Page 129: IR2

Gamma code examples

number  length       offset      γ-code
0       none
1       0                        0
2       10           0           10,0
3       10           1           10,1
4       110          00          110,00
9       1110         001         1110,001
13      1110         101         1110,101
24      11110        1000        11110,1000
511     111111110    11111111    111111110,11111111
1025    11111111110  0000000001  11111111110,0000000001

Sec. 5.3
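A sketch of the gamma code as defined above, for gaps G ≥ 1 (my helper names):

```python
def gamma_encode(g):
    offset = bin(g)[3:]                  # binary of G with the leading 1 cut off
    length = "1" * len(offset) + "0"     # unary code for len(offset)
    return length + offset

def gamma_decode_one(bits):
    n = bits.index("0")                  # read the unary length
    offset = bits[n + 1:n + 1 + n]
    return int("1" + offset, 2), bits[n + 1 + n:]   # re-attach the leading 1

print(gamma_encode(13))                  # 1110101
print(gamma_encode(1))                   # 0
print(gamma_decode_one("1110101"))       # (13, '') -- uniquely prefix-decodable
```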

Page 130: IR2

Gamma code properties

G is encoded using 2⌊log2 G⌋ + 1 bits
Length of offset is ⌊log2 G⌋ bits
Length of length is ⌊log2 G⌋ + 1 bits
All gamma codes have an odd number of bits
Almost within a factor of 2 of the best possible, log2 G
Gamma code is uniquely prefix-decodable, like VB
Gamma code can be used for any distribution
Gamma code is parameter-free

Sec. 5.3

Page 131: IR2

Gamma seldom used in practice

Machines have word boundaries – 8, 16, 32, 64 bits
Operations that cross word boundaries are slower
Compressing and manipulating at the granularity of bits can be slow
Variable byte encoding is aligned and thus potentially more efficient
Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

Sec. 5.3

Page 132: IR2

RCV1 compression

Data structure                           Size in MB
dictionary, fixed-width                  11.2
dictionary, term pointers into string    7.6
  with blocking, k = 4                   7.1
  with blocking & front coding           5.9
collection (text, xml markup etc)        3,600.0
collection (text)                        960.0
term-doc incidence matrix                40,000.0
postings, uncompressed (32-bit words)    400.0
postings, uncompressed (20 bits)         250.0
postings, variable byte encoded          116.0
postings, γ-encoded                      101.0

Sec. 5.3

Page 133: IR2

Index compression summary

We can now create an index for highly efficient Boolean retrieval that is very space efficient
Only 4% of the total size of the collection
Only 10-15% of the total size of the text in the collection
However, we’ve ignored positional information
Hence, space savings are less for indexes used in practice
But the techniques are substantially the same.

Sec. 5.3

Page 134: IR2

SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL

Ch. 6

Page 135: IR2

Ranked retrieval

Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with a precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are, but they think it’s too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.

Ch. 6

Page 136: IR2

Problem with Boolean search: feast or famine

Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: “standard user dlink 650” → 200,000 hits
Query 2: “standard user dlink 650 no card found”: 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
AND gives too few; OR gives too many

Ch. 6

Page 137: IR2

Ranked retrieval models

Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
Free text queries: rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
In principle, these are two separate choices, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa

Page 138: IR2

Feast or famine: not a problem in ranked retrieval

When a system produces a ranked result set, large result sets are not an issue
Indeed, the size of the result set is not an issue
We just show the top k (≈ 10) results
We don’t overwhelm the user
Premise: the ranking algorithm works

Ch. 6

Page 139: IR2

Scoring as the basis of ranked retrieval

We wish to return in order the documents most likely to be useful to the searcher
How can we rank-order the documents in the collection with respect to a query?
Assign a score – say in [0, 1] – to each document
This score measures how well document and query “match”.

Ch. 6

Page 140: IR2

Query-document matching scores

We need a way of assigning a score to a query/document pair
Let’s start with a one-term query
If the query term does not occur in the document: the score should be 0
The more frequent the query term in the document, the higher the score (should be)
We will look at a number of alternatives for this.

Ch. 6

Page 141: IR2

Take 1: Jaccard coefficient

Recall from Lecture 3: a commonly used measure of overlap of two sets A and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = 0
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.

Ch. 6

Page 142: IR2

Jaccard coefficient: scoring example

What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
Query: ides of march
Document 1: caesar died in march
Document 2: the long march

Ch. 6
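For reference, a worked answer (my computation, not on the slide): with query {ides, of, march}, Document 1 scores |{march}| / |{ides, of, march, caesar, died, in}| = 1/6 ≈ 0.17 and Document 2 scores 1/5 = 0.2, so Document 2 would rank higher.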

Page 143: IR2

Issues with Jaccard for scoring

It doesn’t consider term frequency (how many times a term occurs in a document)
Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information
We need a more sophisticated way of normalizing for length
Later in this lecture, we’ll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Ch. 6

Page 144: IR2

Recall (Lecture 1): binary term-document incidence matrix

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|

Sec. 6.2

Page 145: IR2

Term-document count matrices

Consider the number of occurrences of a term in a document:
Each document is a count vector in ℕ^|V|: a column below

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     157                   73             0            0       0        0
Brutus     4                     157            0            1       0        0
Caesar     232                   227            0            2       1        1
Calpurnia  0                     10             0            0       0        0
Cleopatra  57                    0              0            0       0        0
mercy      2                     0              3            5       5        1
worser     2                     0              1            1       1        0

Sec. 6.2

Page 146: IR2

Bag of words model

The vector representation doesn’t consider the ordering of words in a document
John is quicker than Mary and Mary is quicker than John have the same vectors
This is called the bag of words model.
In a sense, this is a step back: the positional index was able to distinguish these two documents.
We will look at “recovering” positional information later in this course.
For now: bag of words model

Page 147: IR2

Term frequency tf

The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR

Page 148: IR2

Log-frequency weighting

The log frequency weight of term t in d is:
w_t,d = 1 + log10(tf_t,d) if tf_t,d > 0, and 0 otherwise

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair: sum over terms t in both q and d:
score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

The score is 0 if none of the query terms is present in the document.

Sec. 6.2

Page 149: IR2

Advanced Topics in Information Systems: Information Retrieval

Document frequency

Rare terms are more informative than frequent terms. Recall stop words.

Consider a term in the query that is rare in the collection (e.g., arachnocentric)

A document containing this term is very likely to be relevant to the query arachnocentric

→ We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1

Page 150: IR2

Document frequency, continued

Frequent terms are less informative than rare terms.

Consider a query term that is frequent in the collection (e.g., high, increase, line)

A document containing such a term is more likely to be relevant than a document that doesn’t

But it’s not a sure indicator of relevance.

→ For frequent terms, we want high positive weights for words like high, increase, and line

But lower weights than for rare terms.

We will use document frequency (df) to capture this.

Sec. 6.2.1

Page 151: IR2

idf weight

df_t is the document frequency of t: the number of documents that contain t.
df_t is an inverse measure of the informativeness of t.
df_t ≤ N.

We define the idf (inverse document frequency) of t by

idf_t = log10(N / df_t)

We use log10(N/df_t) instead of N/df_t to “dampen” the effect of idf.
It will turn out that the base of the log is immaterial.

Sec. 6.2.1

Page 152: IR2

idf example, suppose N = 1 million

term           df_t         idf_t = log10(N/df_t)
calpurnia      1            6
animal         100          4
sunday         1,000        3
fly            10,000       2
under          100,000      1
the            1,000,000    0

There is one idf value for each term t in a collection.

Sec. 6.2.1
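The idf column follows directly from the formula; a one-line check per term (plain Python, names mine):

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, dft in df.items():
    # idf_t = log10(N / df_t)
    print(f"{term:>10}: idf = {math.log10(N / dft):g}")
```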

Page 153: IR2

Effect of idf on ranking

Does idf have an effect on ranking for one-term queries, like iPhone?

idf has no effect on the ranking for one-term queries. idf affects the ranking of documents for queries with at least two terms.
For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Page 154: IR2

Collection vs. Document frequency

The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

Example:

Which word is a better search term (and should get a higher weight)?

Word         Collection frequency    Document frequency
insurance    10440                   3997
try          10422                   8760

(insurance: the two words have nearly identical collection frequency, but insurance is concentrated in far fewer documents, so it is the more discriminating term. This is why we use document frequency rather than collection frequency.)

Sec. 6.2.1

Page 155: IR2

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in information retrieval.
Note: the “-” in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.

Sec. 6.2.2

Page 156: IR2

Final ranking of documents for a query

Score(q,d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}
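Combining the two formulas, a minimal sketch of tf-idf scoring over a toy collection (plain Python; the three-document corpus and all names are my assumptions):

```python
import math
from collections import Counter

docs = {
    "d1": "caesar died in march".split(),
    "d2": "the long march".split(),
    "d3": "caesar and brutus".split(),
}
N = len(docs)
# df_t: number of documents containing t
df = Counter(t for terms in docs.values() for t in set(terms))

def tf_idf(t, doc_terms):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t), or 0 if t is absent."""
    tf = doc_terms.count(t)
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[t])

def score(query_terms, doc_terms):
    """Score(q,d): sum of tf-idf weights of query terms present in d."""
    return sum(tf_idf(t, doc_terms) for t in set(query_terms))

for name, terms in docs.items():
    print(name, round(score("caesar march".split(), terms), 3))
# d1 scores highest: it is the only document containing both query terms.
```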

Sec. 6.2.2

Page 157: IR2

Binary → count → weight matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35

Brutus 1.21 6.1 0 1 0 0

Caesar 8.59 2.54 0 1.51 0.25 0

Calpurnia 0 1.54 0 0 0 0

Cleopatra 2.85 0 0 0 0 0

mercy 1.51 0 1.9 0.12 5.25 0.88

worser 1.37 0 0.11 4.15 0.25 1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|.

Sec. 6.3

Page 158: IR2

Documents as vectors

So we have a |V|-dimensional vector space. Terms are axes of the space; documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors: most entries are zero.

Sec. 6.3

Page 159: IR2

Queries as vectors

Key idea 1: Do the same for queries: represent them as vectors in the space.
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors; proximity ≈ inverse of distance.
Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than less relevant documents.

Sec. 6.3

Page 160: IR2

Formalizing vector space proximity

First cut: distance between two points (= distance between the end points of the two vectors).

Euclidean distance? Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

Sec. 6.3

Page 161: IR2

Why distance is a bad idea

[Figure: The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.]

Sec. 6.3

Page 162: IR2

Use angle instead of distance

Thought experiment: take a document d and append it to itself. Call this document d′.
“Semantically” d and d′ have the same content.
The Euclidean distance between the two documents can be quite large, but the angle between the two documents is 0, corresponding to maximal similarity.

Key idea: Rank documents according to angle with query.

Sec. 6.3

Page 163: IR2

From angles to cosines

The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document.
Rank documents in increasing order of cosine(query, document).

Cosine is a monotonically decreasing function on the interval [0°, 180°].

Sec. 6.3

Page 164: IR2

From angles to cosines

But how – and why – should we be computing cosines?

Sec. 6.3

Page 165: IR2

Length normalization

A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

‖x‖₂ = √( Σ_i x_i² )

Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization. Long and short documents now have comparable weights.
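A minimal sketch of L2 normalization, reproducing the d vs. d′ observation (plain Python; names mine):

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm (its length)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

d = [3.0, 4.0]
d_doubled = [6.0, 8.0]          # d appended to itself
print(l2_normalize(d))          # [0.6, 0.8]
print(l2_normalize(d_doubled))  # [0.6, 0.8], identical after normalization
```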

Sec. 6.3

Page 166: IR2

cosine(query, document)

cos(q,d) = (q · d) / (‖q‖ ‖d‖) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) √(Σ_{i=1}^{|V|} d_i²) )

The numerator is the dot product of q and d; the denominator divides by the vectors’ lengths, i.e., turns q and d into unit vectors.
q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.
cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
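The formula translates directly into code; a minimal sketch (plain Python, name mine):

```python
import math

def cosine(q, d):
    """cos(q,d): dot product of q and d divided by the product of their lengths."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine([1, 1, 0], [1, 1, 1]))  # ≈ 0.816
```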

Sec. 6.3

Page 167: IR2

Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q,d) = q · d = Σ_{i=1}^{|V|} q_i d_i    for q, d length-normalized.

Page 168: IR2

Cosine similarity illustrated

Page 169: IR2

Cosine similarity amongst 3 documents

term SaS PaP WH

affection 115 58 20

jealous 10 7 11

gossip 2 0 6

wuthering 0 0 38

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)? The table above gives term frequencies (counts).

Sec. 6.3

Note: To simplify this example, we don’t do idf weighting.

Page 170: IR2

3 documents example contd.

Log frequency weighting

term SaS PaP WH

affection 3.06 2.76 2.30

jealous 2.00 1.85 2.04

gossip 1.30 0 1.78

wuthering 0 0 2.58

After length normalization

term SaS PaP WH

affection 0.789 0.832 0.524

jealous 0.515 0.555 0.465

gossip 0.335 0 0.405

wuthering 0 0 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
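The numbers above can be reproduced in a few lines (plain Python; variable names mine):

```python
import math

counts = {  # term frequencies from the first table
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf_normalized(doc):
    """Log-frequency weighting followed by L2 length normalization (no idf)."""
    w = {t: 1 + math.log10(tf) if tf else 0.0 for t, tf in doc.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

vecs = {name: log_tf_normalized(doc) for name, doc in counts.items()}

def cos(a, b):
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```

As for the question: SaS and PaP distribute their weight over affection and jealous in very similar proportions, while WH puts substantial weight on gossip and wuthering, which the other two barely share.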

Sec. 6.3

Page 171: IR2

Computing cosine scores
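This slide presents the textbook’s term-at-a-time CosineScore pseudocode (IIR, Fig. 6.14). Below is a sketch of that algorithm in Python (the data-structure layout and names are my assumptions): walk each query term’s postings list accumulating partial dot products, length-normalize, and return the top K.

```python
import heapq

def cosine_score(query_weights, postings, doc_length, K=10):
    """Term-at-a-time cosine scoring.

    query_weights: {term: w_{t,q}}
    postings:      {term: [(doc_id, w_{t,d}), ...]}  (inverted index)
    doc_length:    {doc_id: L2 norm of the document's weight vector}
    """
    scores = {}
    for term, wq in query_weights.items():
        # Accumulate the partial dot product contributed by this term
        for doc_id, wd in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + wq * wd
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]  # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda item: item[1])
```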

Sec. 6.3

Page 172: IR2

tf-idf weighting has many variants

[Table: SMART notation for tf-idf variants (IIR, Fig. 6.15). Term frequency: n (natural), l (logarithm), a (augmented), b (boolean). Document frequency: n (no idf), t (idf), p (prob idf). Normalization: n (none), c (cosine), among others. Column entries are one-letter acronyms for weight schemes.]

Why is the base of the log in idf immaterial? Changing the base rescales every idf value, and hence every score, by the same positive constant, so the ranking of documents is unchanged.

Sec. 6.4

Page 173: IR2

Weighting may differ in queries vs documents

Many search engines allow for different weightings for queries vs. documents.
SMART notation denotes the combination in use in an engine as ddd.qqq, using the acronyms from the previous table.

A very standard weighting scheme is lnc.ltc:
Document: logarithmic tf (l as first character), no idf (n), and cosine normalization (c).
Query: logarithmic tf (l in leftmost column), idf (t in second column), and cosine normalization (c, as the worked example on the next slide shows).
… Is no idf for documents a bad idea?

Sec. 6.4

Page 174: IR2

tf-idf example: lnc.ltc

Query: best car insurance
Document: car insurance auto insurance

Query weights (ltc):
Term        tf-raw   tf-wt   df       idf    wt     n’lize
auto        0        0       5000     2.3    0      0
best        1        1       50000    1.3    1.3    0.34
car         1        1       10000    2.0    2.0    0.52
insurance   1        1       1000     3.0    3.0    0.78

Document weights (lnc) and products:
Term        tf-raw   tf-wt   wt     n’lize   Prod (query × doc)
auto        1        1       1      0.52     0
best        0        0       0      0        0
car         1        1       1      0.52     0.27
insurance   2        1.3     1.3    0.68     0.53

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92

Score = 0 + 0 + 0.27 + 0.53 = 0.8

Exercise: what is N, the number of docs? (From idf_t = log10(N/df_t): car has df = 10,000 and idf = 2.0, so N = 10⁶.)
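A sketch verifying this worked example end to end (plain Python; N = 10⁶ is inferred from the idf column, and all names are mine):

```python
import math

N = 1_000_000  # consistent with the idf column, e.g. log10(N / 10000) = 2.0 for car
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query = "best car insurance".split()
doc = "car insurance auto insurance".split()

def log_tf(tf):
    return 1 + math.log10(tf) if tf else 0.0

def normalize(w):
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

# Query, ltc: log tf times idf, then cosine normalization
q = normalize({t: log_tf(query.count(t)) * math.log10(N / df[t]) for t in df})
# Document, lnc: log tf, no idf, cosine normalization
d = normalize({t: log_tf(doc.count(t)) for t in df})

print(round(sum(q[t] * d[t] for t in df), 2))  # 0.8
```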

Sec. 6.4

Page 175: IR2

Summary – vector space ranking

Represent the query as a weighted tf-idf vector.
Represent each document as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.
