Page 1:

Under The Hood [Part I] Web-Based Information Architectures

MSEC 20-760 – Mini II

28-October-2003

Jaime Carbonell

Page 2:

Topics Covered

• The Vector Space Model for IR (VSM)

• Evaluation Metrics for IR

• Query Expansion (the Rocchio Method)

• Inverted Indexing for Efficiency

• A Glimpse into Harder Problems

Page 3:

The Vector Space Model

• Definitions of document and query vectors, where wj = the jth word in the vocabulary and c(wj,di) = count of occurrences of wj in document di:

Vocabulary = {w1, w2, ..., wn}

di = [c(w1,di), c(w2,di), ..., c(wn,di)]

q = [c(w1,q), c(w2,q), ..., c(wn,q)]
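
A minimal Python sketch of these definitions (the vocabulary, tokenizer, and example strings are illustrative assumptions, not from the slides):

    from collections import Counter

    def count_vector(text, vocabulary):
        """Return [c(w1,text), c(w2,text), ..., c(wn,text)] in vocabulary order."""
        counts = Counter(text.lower().split())
        return [counts[w] for w in vocabulary]

    vocabulary = ["heart", "attack", "medicine", "nitroglycerine", "explosive"]
    q  = count_vector("heart attack medicine", vocabulary)
    d1 = count_vector("nitroglycerine for heart attack and cardiac arrest", vocabulary)
    print(q)    # [1, 1, 1, 0, 0]
    print(d1)   # [1, 1, 0, 1, 0]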

Page 4:

Computing the Similarity

• Dot-product similarity:

• Cosine similarity:

sim(q,di) = q · di

sim_cos(q,di) = (q · di) / (||q|| ||di||)

Page 5:

Computing Norms and Products

• Dot product:

• Euclidean vector norm (aka “2-norm”):

q · di = Σ j=1..n c(wj,q) · c(wj,di)

||d||2 = sqrt( Σ j=1..n c(wj,d)^2 )
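
These formulas transcribed directly as a sketch, operating on the count-vector lists from the earlier sketch:

    import math

    def dot(q, d):
        # q · d = Σj c(wj,q) · c(wj,d)
        return sum(qj * dj for qj, dj in zip(q, d))

    def norm2(d):
        # ||d||2 = sqrt(Σj c(wj,d)^2)
        return math.sqrt(sum(dj * dj for dj in d))

    def cosine_sim(q, d):
        denom = norm2(q) * norm2(d)
        return dot(q, d) / denom if denom else 0.0

    print(cosine_sim([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))   # ≈ 0.67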

Page 6:

Similarity in Retrieval

• Similarity ranking:

If sim(q,di) > sim(q,dj), di ranks higher

• Retrieving top k documents:

Search(q, C, k) = Argmax^k over dj ∈ C of [ sim_cos(q,dj) ]
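
An illustrative sketch of this top-k search, reusing cosine_sim() from the previous sketch (the list-of-vectors collection format is an assumption):

    def search(q, collection, k):
        """Return the k document vectors in collection with the highest sim_cos to q."""
        ranked = sorted(collection, key=lambda d: cosine_sim(q, d), reverse=True)
        return ranked[:k]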

Page 7:

Refinements to VSM (1)

Word Normalization

• Words in morphological root form:
  countries => country
  interesting => interest

• Stemming as a fast approximation (a toy sketch follows this list):
  countries, country => countr
  moped => mop

• Reduces vocabulary (always good)

• Generalizes matching (usually good)

• More useful for non-English IR
  (Arabic has > 100 variants per verb)
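
A toy suffix-stripping sketch of the stemming idea (this is not the Porter stemmer or any algorithm from the lecture; the suffix list and length check are made up for illustration):

    def crude_stem(word):
        # Strip a few common suffixes; real stemmers (e.g. Porter's) are more careful.
        for suffix in ("ies", "ing", "ed", "es", "s", "y"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    print(crude_stem("countries"), crude_stem("country"))    # countr countr
    print(crude_stem("interesting"), crude_stem("moped"))    # interest mop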

Page 8:

Refinements to VSM (2)

Stop-Word Elimination

• Discard articles, auxiliaries, prepositions, ... typically 100-300 most frequent small words

• Reduce document “length” by 30-40%

• Retrieval accuracy improves slightly (5-10%)
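
An illustrative sketch of stop-word elimination; the stop-word set here is a tiny placeholder for the 100-300 word lists used in practice:

    STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "are", "how"}

    def remove_stopwords(tokens):
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stopwords("how to avoid heart disease".split()))
    # ['avoid', 'heart', 'disease']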

Page 9:

Refinements to VSM (3)

Proximity Phrases

• E.g.: "air force" => airforce

• Found by high mutual information (a sketch follows this list):

p(w1 w2) >> p(w1)p(w2)

p(w1 & w2 in k-window) >>

p(w1 in k-window) p(w2 in same k-window)

• Retrieval accuracy improves slightly (5-10%)

• Too many phrases => inefficiency
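
A sketch of flagging candidate phrases from adjacent-word counts; the single-corpus input and the fixed ">>" threshold are simplifying assumptions:

    from collections import Counter

    def candidate_phrases(tokens, threshold=10.0):
        # Flag word pairs whose observed probability p(w1 w2) greatly exceeds
        # the independence estimate p(w1)·p(w2).
        n = len(tokens)
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        phrases = []
        for (w1, w2), c12 in bigrams.items():
            p12 = c12 / (n - 1)
            p1, p2 = unigrams[w1] / n, unigrams[w2] / n
            if p12 > threshold * p1 * p2:
                phrases.append(w1 + " " + w2)
        return phrases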

Page 10:

Refinements to VSM (4)

Words => Terms

• term = word | stemmed word | phrase

• Use exactly the same VSM method on terms (vs words)

Page 11:

Evaluating Information Retrieval (1)

Contingency table:

              relevant    not-relevant
retrieved        a              b
not retrieved    c              d

Recall = a/(a+c) = fraction of relevant retrieved

Precision = a/(a+b) = fraction of retrieved that is relevant

Page 12:

Evaluating Information Retrieval (2)

P = a/(a+b) R = a/(a+c)

Accuracy = (a+d)/(a+b+c+d)

F1 = 2PR/(P+R)

Miss = c/(a+c) = 1 - R   (false negatives)

F/A = b/(a+b+c+d)   (false positives)
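
These definitions as a sketch; the counts in the example call are made up:

    def ir_metrics(a, b, c, d):
        """a, b, c, d as in the contingency table two slides back."""
        precision = a / (a + b)
        recall = a / (a + c)
        accuracy = (a + d) / (a + b + c + d)
        f1 = 2 * precision * recall / (precision + recall)
        miss = c / (a + c)                      # = 1 - recall
        false_alarm = b / (a + b + c + d)
        return precision, recall, accuracy, f1, miss, false_alarm

    print(ir_metrics(a=30, b=10, c=20, d=940))
    # (0.75, 0.6, 0.97, 0.666..., 0.4, 0.01)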

Page 13:

Evaluating Information Retrieval (3)

11-point precision curves

• IR system generates total ranking

• Plot precision at recall levels 0%, 10%, 20%, ..., 100% (11 points)
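
A sketch of computing the 11 points from a ranked list of relevance judgments; taking the maximum precision at or above each recall level (interpolation) is the usual convention and is assumed here:

    def eleven_point_precision(ranked_relevance, total_relevant):
        """ranked_relevance: booleans, best-ranked document first."""
        precisions, recalls, hits = [], [], 0
        for i, rel in enumerate(ranked_relevance, start=1):
            hits += rel
            precisions.append(hits / i)
            recalls.append(hits / total_relevant)
        curve = []
        for level in (i / 10 for i in range(11)):      # 0.0, 0.1, ..., 1.0
            eligible = [p for p, r in zip(precisions, recalls) if r >= level]
            curve.append(max(eligible) if eligible else 0.0)
        return curve

    print(eleven_point_precision([True, False, True, True, False], total_relevant=3))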

Page 14:

Query Expansion (1)

Observations:

• Longer queries often yield better results

• User’s vocabulary may differ from the document vocabulary:
  Q: how to avoid heart disease
  D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"

• Longer queries give more chances to match the documents’ vocabulary, which helps recall.

Page 15:

Query Expansion (2)

Bridging the Gap

• Human query expansion (user or expert)

• Thesaurus-based expansion
  – Seldom works in practice (unfocused)

• Relevance feedback
  – Widens a thin bridge over the vocabulary gap
  – Adds words from document space to the query

• Pseudo-relevance feedback

• Local Context Analysis

Page 16:

Relevance Feedback: Rocchio’s Method

• Idea: update the query via user feedback

• Exact method (vector sums):

qnew = f(qold, {dretrieved}, user feedback)

qnew = α·qold + β·Σ drelevant − γ·Σ dirrelevant
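
A minimal sketch of this update on the count-vector lists used earlier; the default α, β, γ follow the worked example on the next two slides:

    def rocchio(q_old, relevant_docs, irrelevant_docs, alpha=1.0, beta=2.0, gamma=0.5):
        n = len(q_old)
        rel = [sum(d[j] for d in relevant_docs) for j in range(n)]
        irr = [sum(d[j] for d in irrelevant_docs) for j in range(n)]
        return [alpha * q_old[j] + beta * rel[j] - gamma * irr[j] for j in range(n)]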

Page 17:

Relevance Feedback (2)

For example, if:

Q = (heart attack medicine)
W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1

Drel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
W(nitroglycerine,Drel) = 2, W(medicine,Drel) = 1

Dirr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
W(attack,Dirr) = 1, W(nitroglycerine,Dirr) = 2, W(explosive,Dirr) = 1

and α = 1, β = 2, γ = 0.5

Page 18:

Relevance Feedback (3)

Then:

W(attack,Q’) = 1*1 - 0.5*1 = 0.5

W(nitroglycerine, Q’) =

W(medicine, Q’) =

w(explosive, Q’) =
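
The same numbers plugged into the rocchio() sketch above; the vocabulary order and the zero entries for weights not stated on the slide are assumptions:

    vocab = ["heart", "attack", "medicine", "nitroglycerine", "explosive"]
    q     = [1, 1, 1, 0, 0]
    d_rel = [0, 0, 1, 2, 0]    # only the weights listed for Drel; the rest set to 0
    d_irr = [0, 1, 0, 2, 1]    # only the weights listed for Dirr; the rest set to 0
    print(rocchio(q, [d_rel], [d_irr]))
    # [1.0, 0.5, 3.0, 3.0, -0.5]  (heart, attack, medicine, nitroglycerine, explosive)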

Page 19:

Term Weighting Methods (1)

Salton’s Tf*IDf

Tf = term frequency in a document

Df = document frequency of term
   = # documents in collection with this term

IDf = Df^(-1)

Page 20:

Term Weighting Methods (2)

Salton’s Tf*IDf

TfIDf = f1(Tf) * f2(IDf)

E.g. f1(Tf) = Tf * ave(|Dj|) / |D|

E.g. f2(IDf) = log2(IDf)

f1 and f2 can differ for Q and D
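
A sketch of the f1*f2 form above, with one noted deviation: the slide writes IDf = Df^(-1), while implementations commonly scale by the collection size N (IDf = N/Df) so that the log stays non-negative; the N/Df variant is used here.

    import math

    def tfidf_weight(tf, df, n_docs, doc_len, avg_doc_len):
        f1 = tf * avg_doc_len / doc_len      # f1(Tf) = Tf * ave(|Dj|) / |D|
        f2 = math.log2(n_docs / df)          # f2 on the N/Df inverse document frequency
        return f1 * f2

    print(tfidf_weight(tf=3, df=10, n_docs=1000, doc_len=120, avg_doc_len=100))
    # ≈ 16.6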

Page 21:

Efficient Implementations of VSM (1)

Exploit sparseness

• Only compute non-zero multiplies in dot-products

• Do not even look at zero elements (how?)

• => Use non-stop terms to index documents
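
An illustrative sparse dot product over term-to-count dicts: only terms present in both vectors are touched, so zero entries are never examined (the dict representation itself is an assumption, not from the slides):

    def sparse_dot(q, d):
        """q, d: dicts mapping term -> count (non-zero entries only)."""
        if len(q) > len(d):
            q, d = d, q                      # iterate over the shorter vector
        return sum(c * d.get(term, 0) for term, c in q.items())

    print(sparse_dot({"heart": 1, "attack": 1, "medicine": 1},
                     {"medicine": 1, "nitroglycerine": 2}))    # 1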

Page 22:

Efficient Implementations of VSM (2)

Inverted Indexing

• Find all unique [stemmed] terms in the document collection

• Remove stopwords from the word list

• If the collection is large (over 100,000 documents), [optionally] remove singletons
  (usually spelling errors or obscure names)

• Alphabetize or use a hash table to store the list

• For each term, create a data structure like:

Page 23:

Efficient Implementations of VSM (3)

[termi, IDF(termi),
  <doci, freq(termi,doci),
   docj, freq(termi,docj), ...> ]

or:

[termi, IDF(termi),
  <doci, freq(termi,doci), [pos1,i, pos2,i, ...],
   docj, freq(termi,docj), [pos1,j, pos2,j, ...], ...> ]

pos1,j indicates the first position of termi in docj, and so on.
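
A minimal sketch of building the positional variant of this structure; tokenization is simplified, and freq(term, doc) is just the length of the position list:

    from collections import defaultdict

    def build_inverted_index(docs):
        """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term][doc_id].append(pos)
        return index

    idx = build_inverted_index({"d1": "heart attack medicine",
                                "d2": "cardiac arrest prevention medicine"})
    print(dict(idx["medicine"]))    # {'d1': [2], 'd2': [3]}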

Page 24:

Open Research Problems in IR (1)

Beyond VSM

• Vectors in different Spaces:

Generalized VSM, Latent Semantic Indexing...

• Probabilistic IR (Language Modeling):

P(D|Q) = P(Q|D)P(D)/P(Q)
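
A toy sketch of the query-likelihood term P(Q|D) under a unigram document model; the add-one smoothing and log-space scoring are assumptions, not from the slide:

    import math
    from collections import Counter

    def log_p_query_given_doc(query_tokens, doc_tokens, vocab_size):
        counts = Counter(doc_tokens)
        total = len(doc_tokens)
        # Add-one smoothed unigram probability for each query term
        return sum(math.log((counts[t] + 1) / (total + vocab_size))
                   for t in query_tokens)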

Page 25:

Open Research Problems in IR (2)

Beyond Relevance

• Appropriateness of doc to user comprehension level, etc.

• Novelty of information in doc to user (anti-redundancy as an approximation to novelty)

Page 26:

Open Research Problems in IR (3)

Beyond one Language

• Translingual IR

• Transmedia IR

Page 27:

Open Research Problems in IR (4)

Beyond Content Queries

• "What’s new today?"

• "What sort of things to you know about"

• "Build me a Yahoo-style index for X"

• "Track the event in this news-story"