CS490W: Web Information Search & Management
Information Retrieval: Retrieval Models
Luo Si, Department of Computer Science
Purdue University
Retrieval Models
[Figure: an information need and indexed objects are each given a representation (a query and indexed objects); the retrieval model matches them to produce retrieved objects, with evaluation/feedback closing the loop]
Overview of Retrieval Models
Retrieval Models
- Boolean
- Vector space
  - Basic vector space: SMART, Lucene
  - Extended Boolean
- Probabilistic models
  - Statistical language models: Lemur
  - Two-Poisson model: Okapi
  - Bayesian inference networks: InQuery
- Citation/Link analysis models
  - PageRank: Google
  - Hubs & authorities: Clever
Retrieval Models: Outline
Retrieval Models
- Exact-match retrieval methods
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval methods
  - Vector space retrieval method
  - Latent semantic indexing
Retrieval Models: Unranked Boolean
Unranked Boolean: exact-match method
- Selection model: retrieve a document iff it matches the precise query
- Often returns unranked documents (or documents in chronological order)
- Operators
  - Logical operators: AND, OR, NOT
  - Proximity operators: #1(white house) (i.e., within one word distance, a phrase); #sen(Iraq weapon) (i.e., within a sentence)
  - String matching operators: wildcard (e.g., ind* for india and indonesia)
  - Field operators: title(information and retrieval), ...
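The logical operators above can be sketched over an inverted index that maps each term to the set of IDs of the documents containing it. The index contents and document IDs below are made up for illustration:

```python
# Hypothetical inverted index: term -> set of document IDs.
index = {
    "white": {1, 2, 5},
    "house": {2, 3, 5},
    "iraq":  {3, 4},
}
all_docs = {1, 2, 3, 4, 5}

def AND(a, b):
    return a & b          # documents containing both terms

def OR(a, b):
    return a | b          # documents containing either term

def NOT(a):
    return all_docs - a   # documents not containing the term

# white AND NOT iraq -> {1, 2, 5} intersected with {1, 2, 5}
result = AND(index["white"], NOT(index["iraq"]))
# result == {1, 2, 5}
```

Exact-match retrieval reduces to set operations, which is why unranked Boolean is so efficient.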
Retrieval Models: Unranked Boolean
Unranked Boolean: exact-match method. A query example: (#2(distributed information retrieval) OR #1(federated search)) AND author(#1(Jamie Callan) AND NOT Steve)
Retrieval Models: Unranked Boolean
WestLaw system: commercial legal/health/finance information retrieval system
- Logical operators
- Proximity operators: phrase, word proximity, same sentence/paragraph
- String matching operator: wildcard (e.g., ind*)
- Field operators: title(#1("legal retrieval")), date(2000)
- Citations: cite(Salton)
Retrieval Models: Unranked Boolean
Advantages:
- Works well if the user knows exactly what to retrieve
- Predictable; easy to explain
- Very efficient
Disadvantages:
- It is difficult to design the query: a loose query gives high recall but low precision, a strict query gives low recall but high precision
- Results are unordered; it is hard to find the useful ones
- Users may be too optimistic about strict queries: a few very relevant documents are returned, but many more are missing
Retrieval Models: Ranked Boolean
Ranked Boolean: exact match, similar to unranked Boolean, but documents are ordered by some criterion that reflects the importance of the document's words
Query: (Thailand AND stock AND market)
Retrieve docs from Wall Street Journal Collection
Which word is more important?
- Term Frequency (TF): number of occurrences of a term in the query/document; a larger number means the term is more important
- Inverse Document Frequency (IDF): larger means more important
    idf = log(total number of docs / number of docs containing the term)
- There are many variants of TF and IDF: e.g., variants that account for document length
- Example: many documents contain "stock" and "market", but fewer contain "Thailand"; the rarer term may be more indicative
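The IDF intuition can be sketched numerically. The collection size N and document frequencies df below are made-up numbers, not real Wall Street Journal statistics:

```python
import math

# Illustrative IDF computation: idf(term) = log(N / df(term)).
# N and the df values are hypothetical.
N = 100_000
df = {"thailand": 500, "stock": 25_000, "market": 30_000}

idf = {term: math.log(N / n) for term, n in df.items()}
# "thailand" appears in the fewest documents, so it gets the largest
# IDF and is the most discriminative of the three query terms.
```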
Retrieval Models: Ranked Boolean
Ranked Boolean: calculate a document score
- Term evidence: evidence that term i occurs in doc j, e.g., (tf_ij) or (tf_ij * idf_i)
- AND weight: minimum of the argument weights
- OR weight: maximum of the argument weights
Example with term evidence 0.2 (Thailand), 0.6 (stock), 0.4 (market):
- Query (Thailand AND stock AND market): min(0.2, 0.6, 0.4) = 0.2
- Query (Thailand OR stock OR market): max(0.2, 0.6, 0.4) = 0.6
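Min/max scoring over term evidence is tiny in code. The evidence weights below are illustrative:

```python
# Sketch of ranked Boolean scoring with illustrative term evidence:
# an AND node scores the minimum of its argument weights,
# an OR node the maximum.
evidence = {"thailand": 0.2, "stock": 0.6, "market": 0.4}

# Query: (Thailand AND stock AND market)
and_score = min(evidence.values())   # 0.2
# Query: (Thailand OR stock OR market)
or_score = max(evidence.values())    # 0.6
```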
Retrieval Models: Ranked Boolean
Advantages:
- All advantages of the unranked Boolean algorithm: works well when the query is precise; predictable; efficient
- Results in a ranked list (not an unordered full list); easier to browse and find the most relevant documents than with unranked Boolean
- The ranking criterion is flexible: e.g., different variants of term evidence
Disadvantages:
- Still an exact-match (document selection) model: recall and precision are inversely correlated for strict vs. loose queries
- Predictability makes users overestimate retrieval quality
Retrieval Models: Vector Space Model
Vector space model
- Any text object can be represented by a term vector: documents, queries, passages, sentences
- A query can be seen as a short document
- Similarity is determined by distance in the vector space; example: the cosine of the angle between two vectors
The SMART system
- Developed at Cornell University, 1960-1999
- Still quite popular
The Lucene system
- Open-source information retrieval library (based on Java)
- Works with Hadoop (MapReduce) in large-scale applications (e.g., Amazon Book)
Retrieval Models: Vector Space Model
Vector space model vs. Boolean model
- Boolean model
  - Query: a Boolean expression that a document must satisfy
  - Retrieval: deductive inference
- Vector space model
  - Query: viewed as a short document in a vector space
  - Retrieval: find similar vectors/objects
Retrieval Models: Vector Space Model
Vector representation
[Figure: documents D1, D2, D3 and a query plotted as vectors in a three-dimensional term space with axes Java, Sun, Starbucks]
Retrieval Models: Vector Space Model
Given a query vector and a document vector
  q = (q_1, q_2, ..., q_n)
  d_j = (d_{j,1}, d_{j,2}, ..., d_{j,n})
calculate their similarity as the cosine of the angle between them:
  sim(q, d_j) = cos(theta(q, d_j))
              = (q . d_j) / (|q| |d_j|)
              = (q_1 d_{j,1} + q_2 d_{j,2} + ... + q_n d_{j,n})
                / ( sqrt(q_1^2 + ... + q_n^2) * sqrt(d_{j,1}^2 + ... + d_{j,n}^2) )
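Cosine similarity is a few lines of code. A minimal sketch following the formula above, with illustrative vectors:

```python
import math

def cosine(q, d):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(qk * dk for qk, dk in zip(q, d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    return dot / (norm_q * norm_d)

# Parallel vectors have cosine 1; orthogonal vectors have cosine 0.
sim_parallel = cosine([1, 1, 0], [2, 2, 0])    # 1.0
sim_orthogonal = cosine([1, 0, 0], [0, 1, 0])  # 0.0
```

Note that the cosine depends only on vector direction, not length, which is why it doubles as a length normalization.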
Retrieval Models: Vector Space Model
Vector coefficients
- The coefficients (vector elements) represent term evidence / term importance
- They are derived from several elements:
  - Document term weight: evidence for the term in the document/query
  - Collection term weight: importance of the term from observation of the collection
  - Length normalization: reduces document length bias
Naming convention for coefficients: q_k, d_{j,k} = DCL.DCL, where each triple lists the Document term weight, Collection term weight, and Length normalization; the first triple describes the query term weight and the second the document term weight
Retrieval Models: Vector Space Model
Common vector weight components: lnc.ltc, a widely used term weight
- "l": log(tf) + 1
- "n": no weight/normalization
- "t": log(N/df)
- "c": cosine normalization

  sim(q, d_j) = sum_k [ (1 + log tf_{q,k}) * (1 + log tf_{j,k}) * log(N/df_k) ]
                / ( sqrt( sum_k (1 + log tf_{q,k})^2 ) * sqrt( sum_k [ (1 + log tf_{j,k}) * log(N/df_k) ]^2 ) )

where tf_{q,k} and tf_{j,k} are the frequencies of term k in the query and in document j, N is the total number of documents, and df_k is the number of documents containing term k.
Retrieval Models: Vector Space Model
Common vector weight components: dnn.dtb, which handles varied document lengths
- "d": 1 + ln(1 + ln(tf))
- "t": log(N/df)
- "b": 1 / (0.8 + 0.2 * doclen/avg_doclen)
Retrieval Models: Vector Space Model
Standard vector space
- Represent queries/documents in a vector space
- Each dimension corresponds to a term in the vocabulary
- Use a combination of components to represent the term evidence in both query and document
- Use a similarity function to estimate the relationship between query and documents (e.g., cosine similarity)
Retrieval Models: Vector Space Model
Advantages:
- Best-match method; it does not need a precise query
- Generates ranked lists; easy to explore the results
- Simplicity: easy to implement
- Effectiveness: often works well
- Flexibility: can utilize different types of term weighting methods
- Used in a wide range of IR tasks: retrieval, classification, summarization, content-based filtering, ...
Retrieval Models: Vector Space Model
Disadvantages:
- Hard to choose the dimensions of the vector ("basic concepts"); terms may not be the best choice
- Assumes terms are independent of one another
- Vector operations are chosen heuristically: the choice of term weights, the choice of similarity function
- Assumes a query and a document can be treated in the same way
Retrieval Models: Vector Space Model
What makes a good vector representation:
- Orthogonal: the dimensions are linearly independent ("no overlap")
- No ambiguity (e.g., the word "Java")
- Wide coverage and good granularity
- Good interpretability (e.g., representation of semantic meaning)
- Many possibilities: words, stemmed words, "latent concepts", ...
Retrieval Models: Latent Semantic Indexing
Dual space of terms and documents
Retrieval Models: Latent Semantic Indexing
Latent Semantic Indexing (LSI): explore the correlation between terms and documents
- Two terms are correlated (may share similar semantic concepts) if they often co-occur
- Two documents are correlated (share similar topics) if they have many common words
LSI: associate each term and document with a small number of semantic concepts/topics
Retrieval Models: Latent Semantic Indexing
Use the singular value decomposition (SVD) to find a small set of concepts/topics:

  X = U S V^T, with U^T U = I_m and V^T V = I_m

where m is the number of concepts/topics:
- U: representation of the terms in the concept space (equivalently, of the concepts in term space); its columns are orthonormal
- V: representation of the documents in the concept space (equivalently, of the concepts in document space); its columns are orthonormal
- S: diagonal matrix of singular values (the concept space)
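The decomposition and its orthonormality properties can be checked with NumPy. The small term-by-document matrix below is illustrative:

```python
import numpy as np

# Illustrative term-by-document matrix X (rows = terms, columns = documents).
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)

# X = U S V^T, with orthonormal columns: U^T U = I and V^T V = I.
reconstructed = U @ S @ Vt
ortho_U = np.allclose(U.T @ U, np.eye(len(s)))
ortho_V = np.allclose(Vt @ Vt.T, np.eye(len(s)))
```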
Retrieval Models: Latent Semantic Indexing
Properties of Latent Semantic Indexing
- The diagonal elements of S (the singular values S_k) are in descending order; the larger, the more important
- X'_k = sum_{i <= k} u_i S_i v_i^T is the rank-k matrix that best approximates X, where u_i and v_i are the column vectors of U and V
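The best-approximation property can be checked numerically: truncating to k concepts leaves a Frobenius-norm error equal to the square root of the sum of the squared discarded singular values. The matrix below is illustrative:

```python
import numpy as np

# Illustrative term-by-document matrix.
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
# Rank-k approximation X'_k = U_k S_k V_k^T.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# ||X - X'_k||_F equals sqrt of the sum of the squared discarded
# singular values; here only s[2] is discarded.
err = np.linalg.norm(X - X_k, "fro")
```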
Retrieval Models: Latent Semantic Indexing
Other properties of Latent Semantic Indexing
- The columns of U are eigenvectors of XX^T
- The columns of V are eigenvectors of X^TX
- The singular values on the diagonal of S are the positive square roots of the nonzero eigenvalues of both XX^T and X^TX
Retrieval Models: Latent Semantic Indexing
[Figures: successive rank-k approximations of the term-document matrix X]
Importance of concepts: the size of S_k reflects the importance of concept k, and it reflects the error of approximating X with a small number of concepts
Retrieval Models: Latent Semantic Indexing
SVD representation
- Reduces the high-dimensional representation of a document or query to a low-dimensional concept space
- SVD tries to preserve the Euclidean distances of the document/term vectors
[Figure: vectors plotted against axes Concept 1 and Concept 2]
Retrieval Models: Latent Semantic Indexing
SVD representation
[Figure: representation of the documents in the two-dimensional concept space]
[Figure: representation of the terms in the two-dimensional concept space]
Retrieval Models: Latent Semantic Indexing
Retrieval with respect to a query
- Map (fold in) the query into the representation of the concept space: q' = q^T U_k Inv(S_k)
- Use the new representation of the query to calculate the similarity between the query and all documents, e.g., with cosine similarity
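The fold-in step can be sketched with NumPy: the query is mapped into the k-dimensional concept space and compared against the document rows of V_k by cosine similarity. The matrix and query below are illustrative:

```python
import numpy as np

# Illustrative term-by-document matrix (rows = terms, columns = documents).
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
q = np.array([1., 1., 0., 0.])              # query in term space
q_k = q @ U[:, :k] @ np.diag(1.0 / s[:k])   # fold-in: q' = q^T U_k Inv(S_k)

docs_k = Vt[:k, :].T                        # documents in concept space
cos = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
# The query equals document 1's term vector, so document 1 ranks first.
best = int(np.argmax(cos))
```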
Retrieval Models: Latent Semantic Indexing
Qry: Machine Learning Protein
Representation of the query in the term vector space:
  q = [0 0 1 1 0 1 0 0 0]^T
Representation of the query in the latent semantic space (2 concepts), via q' = q^T U_k Inv(S_k):
  q' = [-0.3571 0.1635]^T
[Figure: the query plotted with the terms in the two-dimensional concept space]
Retrieval Models: Latent Semantic Indexing
Comparison of Retrieval Results in term space and concept space
Qry: Machine Learning Protein
Retrieval Models: Latent Semantic Indexing
Problems with latent semantic indexing
- Difficult to decide the number of concepts
- There is no probabilistic interpretation of the results
- Obtaining the LSI model from the SVD is computationally costly
Language Models: Motivation
Vector space model for information retrieval
- Documents and queries are vectors in the term space
- Relevance is measured by the similarity between document vectors and the query vector
Problems with the vector space model
- Ad-hoc term weighting schemes
- Ad-hoc similarity measurement
- No justification of the relationship between relevance and similarity
We need more principled retrieval models...
Introduction to Language Models:
A language model can be created for any language sample
- A document
- A collection of documents
- A sentence, paragraph, chapter, query, ...
The size of the language sample affects the quality of the language model
- Long documents have more accurate models
- Short documents have less accurate models
- Models for a sentence, paragraph, or query may not be reliable
Introduction to Language Models:
A document language model defines a probability distribution over indexed terms
- E.g., the probability of generating a term; the probabilities sum to 1
- A query can be seen as observed data from an unknown model
- A query also defines a language model
How might the models be used for IR? Rank documents by Pr(q | d_i)
Multinomial/Unigram Language Models
A language model built from a multinomial distribution over the single terms (i.e., unigrams) in the vocabulary
Example: with five words in the vocabulary (sport, basketball, ticket, finance, stock), the language model of a document d_i is:
  {P_i("sport"), P_i("basketball"), P_i("ticket"), P_i("finance"), P_i("stock")}
Formally, the language model is {P_i(w) for any word w in vocabulary V}, with
  sum_k P_i(w_k) = 1 and 0 <= P_i(w_k) <= 1
Language Model for IR: Example
Estimating a language model for each document:
  d1: sport, basketball, ticket, sport
  d2: basketball, ticket, finance, ticket, sport
  d3: stock, finance, finance, stock
Build a language model for each of d1, d2, d3; then estimate the generation probability Pr(q | d_i) for the query q = "sport, basketball" and generate the retrieval results.

With maximum likelihood estimation (MLE), the language model of d2 (basketball, ticket, finance, ticket, sport) is:
  (p_sport, p_basketball, p_ticket, p_finance, p_stock) = (0.2, 0.2, 0.4, 0.2, 0)
For the query "basketball ticket": Pr(q | d2) = 0.2 * 0.4 = 0.08
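The whole example can be put together as a short sketch: estimate an MLE unigram model per document, then rank by the query likelihood Pr(q | d):

```python
from collections import Counter

def unigram_lm(words):
    """Maximum-likelihood unigram model of a document."""
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
query = ["basketball", "ticket"]

scores = {}
for name, words in docs.items():
    lm = unigram_lm(words)
    p = 1.0
    for w in query:
        p *= lm.get(w, 0.0)   # an unseen query term gives probability 0
    scores[name] = p
# d1: 0.25 * 0.25 = 0.0625; d2: 0.2 * 0.4 = 0.08; d3: 0.0
ranking = sorted(scores, key=scores.get, reverse=True)  # d2, d1, d3
```

Note how d3, which contains neither query term, scores exactly zero; smoothing techniques (covered later in most treatments of language-model IR) address this.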
Retrieval Models: Outline
Retrieval Models
- Exact-match retrieval methods
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval methods
  - Vector space retrieval method
  - Latent semantic indexing
  - Language modeling approach