Gertjan van Noord 2014 Zoekmachines Lecture 2: vocabulary, posting lists

Jan 02, 2016

Mitchell Pitts
Transcript
Page 1:

Gertjan van Noord 2014

Zoekmachines

Lecture 2: vocabulary, posting lists

Page 2:

Agenda for today

• Questions Chapter 1
• Chapter 2: Term vocabulary & posting lists
• Chapter 2: Posting lists with positions
• Homework/lab assignment

Page 3:

Chapter 2 Overview

Preprocessing of documents
• choose the unit of indexing (granularity)
• tokenization (removing punctuation, splitting into words)
• stop list?
• normalization: case folding, stemming versus lemmatizing, ...
• extensions to postings lists

Page 4:

Tokens, types and terms

• token: each separate word in the text
• type: same words belong to one type
• (index) term: finally included in the index; an index term is an equivalence class of tokens and/or types

Page 5:

Tokens, types and terms

The Lord of the Rings

• Number of tokens? 5
• Number of types? 4
• Number of terms? 4? 2? 1?
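The counts for this title can be reproduced with a minimal sketch. It assumes that "The" and "the" count as the same type (which is what the answer 4 implies) and uses a hypothetical two-word stop list; the function name and the stop list are illustrative, not part of the lecture.

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)

title = "The Lord of the Rings"

tokens = tokenize(title)                  # every separate word occurrence
types = {t.lower() for t in tokens}       # assumption: 'The' and 'the' are one type
STOP_WORDS = {"the", "of"}                # hypothetical stop list
terms = types - STOP_WORDS                # what would end up in the index

print(len(tokens))    # 5
print(len(types))     # 4
print(sorted(terms))  # ['lord', 'rings']
```

Without a stop list all 4 types become terms; with this stop list only 2 remain, which is one way to read the "4? 2?" on the slide.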

Page 6:

Equivalence classes

• Case folding
• Diacritics
• Stemming/lemmatisation
• Decompounding
• Synonym lists
• Variant spellings

Page 7:

Equivalence classes

• Implicit: mapping rules
• Relational: query expansion
• Relational: double indexing
• Mapping should be done during:
  – Indexing
  – Querying
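As an illustration of implicit mapping rules, the sketch below maps tokens to an equivalence class by case folding and removing diacritics, and applies the same mapping at indexing time and at query time. The function names and the sample documents are assumptions made for this example.

```python
import unicodedata

def normalize(token):
    """Map a token to its equivalence class: case-fold and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(docs):
    """Build term -> set of doc IDs, normalizing every token while indexing."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index

def search(index, query_term):
    """Apply the same mapping to the query term before the lookup."""
    return index.get(normalize(query_term), set())

docs = {1: "Café Résumé", 2: "cafe menu"}   # hypothetical documents
index = build_index(docs)
print(search(index, "CAFÉ"))  # {1, 2}
```

If the mapping were applied only while indexing, the query "CAFÉ" would not match the stored term "cafe", which is why both sides need it.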

Page 8:

Words and word forms

• Inflection (D: verbuiging/vervoeging)
  - changing a word to express person, case, aspect, ...
  - for determiners, nouns, pronouns, adjectives: declension (D: verbuiging)
  - for verbs: conjugation (D: vervoeging)
• Derivation (D: afleiding)
  - formation of a new word from another word (e.g. by adding an affix (prefix or suffix) or changing the grammatical category)

Page 9:

Inflection examples

Determiners
E: the
D: de, het
G: der, des, dem, den, die, das

Adjectives
E: young
D: jong, jonge
G: junger, junge, junges, jungen

Nouns
E: man, men
D: man, mannen
G: mann, mannes,

Verbs
E: write / writes / wrote / written
D: schrijf / schrijft / schrijven / schreef / schreven / geschreven
G: schreibe / schreibst / schreibt / schreiben / schrieben / geschrieben

Page 10:

Derivation examples

to browse -> a browser
red -> to redden, reddish
Google -> to google
arm(s) -> to arm, to disarm -> disarmament, disarming

Page 11:

Stemming and lemmatizing

verb forms: inform, informs, informed, informing
derivations: information, informative, informal??
stem: inform
lemma: inform, information, informative, informal

verb forms: sing: sings, sang, sung, singing
derivations: singer, singers, song, songs
stem: sing, sang, sung, song
lemma: sing, singer, song
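The stem rows on this slide can be imitated with a toy suffix stripper. This is only a sketch: the suffix list is a hypothetical choice made to fit these two examples, and it is not the Porter stemmer or any other published algorithm.

```python
# Hypothetical suffix list, ordered longest first so "ers" wins over "s".
SUFFIXES = ("ation", "ative", "ing", "ers", "al", "ed", "er", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

inform_words = ["inform", "informs", "informed", "informing",
                "information", "informative", "informal"]
sing_words = ["sing", "sings", "sang", "sung", "singing",
              "singer", "singers", "song", "songs"]

print(sorted({toy_stem(w) for w in inform_words}))  # ['inform']
print(sorted({toy_stem(w) for w in sing_words}))    # ['sang', 'sing', 'song', 'sung']
```

The stripper blindly conflates "informal" with "inform" and cannot relate "sang" or "sung" to "sing". A lemmatizer would need a vocabulary and morphological knowledge to handle such cases.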

Page 12:

Discussion

Why is stemming used when lemmatizing is much more precise?

Lemmatizing is a more complex process. It needs:
- a vocabulary (problem: new words)
- morphological analysis (knowledge of inflection rules)
- syntactic analysis, parsing (noun or verb?)

Page 13:

Compound splitting

Marketingjargon -> marketing AND jargon

• Increased retrieval (higher recall)
• Decreased precision
• Must be applied to both query and index!
• But what to do with the query marketing jargon?
• And with spreekwoord appel boom ("proverb apple tree")?
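A minimal sketch of dictionary-based compound splitting, assuming a small hypothetical lexicon and splitting into at most two parts. Real decompounders also handle linking elements and multi-part compounds, which this example ignores.

```python
def split_compound(word, lexicon, min_len=3):
    """Return [head, tail] if the word splits into two lexicon entries, else [word]."""
    word = word.lower()
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [word]

LEXICON = {"market", "marketing", "jargon", "appel", "boom"}  # hypothetical

print(split_compound("Marketingjargon", LEXICON))  # ['marketing', 'jargon']
print(split_compound("appelboom", LEXICON))        # ['appel', 'boom']
print(split_compound("jargon", LEXICON))           # ['jargon']
```

Because the same function would run on documents and on queries, the compound and its parts end up as the same index terms. That also shows the precision problem: the query "appel boom" and the word "appelboom" become indistinguishable.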

Page 14:

Chapter 2 Overview

Preprocessing of documents
• choose the unit of indexing (granularity)
• tokenization (removing punctuation, splitting into words)
• stop list?
• normalization: case folding, stemming versus lemmatizing, ...
• extensions to postings lists

Page 15:

Efficient merging of postings

For X AND Y, we have to intersect 2 lists
Most documents will contain only one of the two terms

Page 16:

Recall basic intersection algorithm
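The figure for this slide did not survive in the transcript. As a reminder, the standard linear merge of two sorted postings lists can be sketched as follows; the function name and the sample docID lists are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists of docIDs, both sorted in ascending order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```

Each step advances at least one pointer, so the merge takes at most len(p1) + len(p2) steps.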

Page 17:

Skip pointers

Page 18:

Skip pointers

• Makes intersection of 2 lists more efficient
  – think of millions of list items
• How many skip pointers and where?
• Trade-off:
  – More pointers: often useful, but small skips.
  – Fewer pointers …
• Heuristic: distance √n, evenly distributed
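A minimal sketch of intersection with skip pointers, using the √n heuristic from this slide. It assumes the postings are plain sorted Python lists and that a skip pointer exists implicitly at every position that is a multiple of ⌊√n⌋. A real index would store explicit pointers in the list structure; the names here are illustrative.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers where possible."""
    skip1 = max(1, math.isqrt(len(p1)))   # skip distance ~ sqrt(n), evenly spaced
    skip2 = max(1, math.isqrt(len(p2)))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))  # [2, 8]
```

A skip is safe because the lists are sorted: every entry jumped over is smaller than the skip target, which is itself no larger than the current docID in the other list.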

Page 19:

Skip pointers: useful?

Yes, certainly in the past
With very fast CPUs they are less important

Especially useful in a rather static index
If a list keeps changing they are less effective

Page 20:

Extensions of the simple term index

To support phrase queries
• “information retrieval”
• “retrieval of information”

Different approaches
• biword indexes
• phrase indexes
• positional indexes
• combinations

Page 21:

Biword and phrase indexes

• Holding terms together in the index
• Simple biword index:
  – retrieval of, of information
• Sophisticated: POS tagger selects nouns
  – N x* N: retrieval of this information
• Phrase index: includes variable lengths of word sequences
  – terms of 1 and 2 words both included
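The simple biword index can be sketched as a dictionary keyed on pairs of consecutive tokens. The function names and sample documents below are assumptions made for illustration; the POS-tagger variant is not covered.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Map every pair of consecutive tokens to the set of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return index

def phrase_query(index, phrase):
    """AND together the postings of all biwords in the phrase."""
    tokens = phrase.lower().split()
    biwords = list(zip(tokens, tokens[1:]))
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result

docs = {                                   # hypothetical documents
    1: "retrieval of information",
    2: "information retrieval",
    3: "retrieval of this information",
}
index = build_biword_index(docs)
print(phrase_query(index, "retrieval of information"))  # {1}
print(phrase_query(index, "information retrieval"))     # {2}
```

For phrases longer than two words this AND of biwords can return false positives, since the biwords may occur in different places in a document. That is one reason to consider positional indexes.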

Page 22:

Positional index

Add to the postings lists, for each doc, the list of positions of the term
• for phrase queries
• for proximity search

Example
[information, 4] : [1:<4, 22, 35>, 2:<5, 17, 30>, …]
[retrieval, 2] : [1:<5, 20>, 2:<18, 31>]
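Using only the postings for docs 1 and 2 shown in the example (the elided entries are left out), a phrase or proximity match can be sketched as below. The function name and the max_gap parameter are assumptions. The nested loops are kept for readability; a linear merge over the sorted position lists would be used in practice.

```python
def positional_match(first, second, max_gap=1):
    """Find (pos_first, pos_second) pairs with 0 < pos_second - pos_first <= max_gap.

    first and second map doc ID -> sorted list of positions.
    max_gap=1 is an exact two-word phrase; larger values give proximity search.
    """
    matches = {}
    for doc_id in sorted(first.keys() & second.keys()):
        pairs = [(p, q)
                 for p in first[doc_id]
                 for q in second[doc_id]
                 if 0 < q - p <= max_gap]
        if pairs:
            matches[doc_id] = pairs
    return matches

information = {1: [4, 22, 35], 2: [5, 17, 30]}   # postings from the slide (docs 1 and 2)
retrieval = {1: [5, 20], 2: [18, 31]}

print(positional_match(information, retrieval))
# {1: [(4, 5)], 2: [(17, 18), (30, 31)]}
```

Swapping the arguments and raising max_gap to 2 finds "retrieval" followed within two positions by "information", for example "retrieval of information": in this data that gives position pair (20, 22) in doc 1.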

Page 23:

Combination schemes

Often queried combinations: phrase index
• names of persons and organizations
• esp. combinations of common terms (!)
• find out from query log

For other phrases: a positional index
Williams et al.: next word index added

Page 24:

H.E. Williams, J. Zobel, and D. Bahle (2004) Fast Phrase Querying With Combined Indexes (ACM Digital Library):

Phrase querying with a combination of three approaches (next word index, phrase index and inverted file)
... is more than 60% faster on average than using an inverted index alone
... requires structures that total only 20% of the size of the collection.

Page 25:

A nextword index (Williams et al.)

docfreq,(<doc,freq,[pos, pos,..]>,<doc, freq, [..]

• docfreq: No of matching docs
• doc: Doc ID
• freq: No of occurrences in doc
• pos: position
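A rough sketch of how such entries could be built, assuming that the index keeps one entry of the shape above for every (first word, next word) pair, and that positions count from 1. The nesting, the function names and the sample documents are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """first word -> next word -> {doc ID: [positions of the first word]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, (first, nxt) in enumerate(zip(tokens, tokens[1:]), start=1):
            index[first][nxt].setdefault(doc_id, []).append(pos)
    return index

def format_entry(postings):
    """Render postings as (docfreq, [(doc, freq, [pos, ...]), ...])."""
    docfreq = len(postings)
    return docfreq, [(doc, len(positions), positions)
                     for doc, positions in sorted(postings.items())]

docs = {   # hypothetical documents
    1: "information retrieval and information retrieval systems",
    2: "retrieval of information",
}
index = build_nextword_index(docs)
print(format_entry(index["information"]["retrieval"]))
# (1, [(1, 2, [1, 4])])
```

A two-word phrase query then becomes a single lookup of index[first][next] instead of an intersection of two long postings lists.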