CS490W: Web Information Search & Management
Information Retrieval: Retrieval Models
Luo Si, Department of Computer Science
Purdue University
Retrieval Models
[Figure: an information need and indexed objects are each given a representation (a query and indexed objects); the retrieval model matches them to produce retrieved objects, with evaluation/feedback closing the loop]
Overview of Retrieval Models
Retrieval Models
- Boolean
- Vector space
  - Basic vector space: SMART, Lucene
  - Extended Boolean
- Probabilistic models
  - Statistical language models: Lemur
  - Two-Poisson model: Okapi
  - Bayesian inference networks: InQuery
- Citation/Link analysis models
  - PageRank: Google
  - Hubs & authorities: Clever
Retrieval Models: Outline
Retrieval Models
- Exact-match retrieval methods
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval methods
  - Vector space retrieval method
  - Latent semantic indexing
Retrieval Models: Unranked Boolean
Unranked Boolean: exact-match method
- Selection model: retrieve a document iff it matches the precise query
- Often returns unranked documents (or documents in chronological order)
- Operators
  - Logical operators: AND, OR, NOT
  - Proximity operators: #1(white house) (i.e., within one word distance, a phrase); #sen(Iraq weapon) (i.e., within a sentence)
  - String matching operators: wildcard (e.g., ind* for india and indonesia)
  - Field operators: title(information and retrieval), ...
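The logical operators above can be sketched over an inverted index that maps each term to the set of IDs of the documents containing it. The index contents and document IDs below are made up for illustration:

```python
# Hypothetical inverted index: term -> set of document IDs.
index = {
    "white": {1, 2, 5},
    "house": {2, 3, 5},
    "iraq":  {3, 4},
}
all_docs = {1, 2, 3, 4, 5}

def AND(a, b):
    return a & b          # documents containing both terms

def OR(a, b):
    return a | b          # documents containing either term

def NOT(a):
    return all_docs - a   # documents not containing the term

# white AND NOT iraq -> {1, 2, 5} intersected with {1, 2, 5}
result = AND(index["white"], NOT(index["iraq"]))
# result == {1, 2, 5}
```

Exact-match retrieval reduces to set operations, which is why unranked Boolean is so efficient.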
Retrieval Models: Unranked Boolean
Unranked Boolean: exact-match method. A query example: (#2(distributed information retrieval) OR #1(federated search)) AND author(#1(Jamie Callan) AND NOT Steve)
Retrieval Models: Unranked Boolean
WestLaw system: commercial legal/health/finance information retrieval system
- Logical operators
- Proximity operators: phrase, word proximity, same sentence/paragraph
- String matching operator: wildcard (e.g., ind*)
- Field operators: title(#1("legal retrieval")), date(2000)
- Citations: cite(Salton)
Retrieval Models: Unranked Boolean
Advantages:
- Works well if the user knows exactly what to retrieve
- Predictable; easy to explain
- Very efficient
Disadvantages:
- It is difficult to design the query: a loose query gives high recall but low precision, a strict query gives low recall but high precision
- Results are unordered; it is hard to find the useful ones
- Users may be too optimistic about strict queries: a few very relevant documents are returned, but many more are missing
Retrieval Models: Ranked Boolean
Ranked Boolean: exact match, similar to unranked Boolean, but documents are ordered by some criterion that reflects the importance of the document's words
Query: (Thailand AND stock AND market)
Retrieve docs from Wall Street Journal Collection
Which word is more important?
- Term Frequency (TF): number of occurrences of a term in the query/document; a larger number means the term is more important
- Inverse Document Frequency (IDF): larger means more important
    idf = log(total number of docs / number of docs containing the term)
- There are many variants of TF and IDF: e.g., variants that account for document length
- Example: many documents contain "stock" and "market", but fewer contain "Thailand"; the rarer term may be more indicative
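The IDF intuition can be sketched numerically. The collection size N and document frequencies df below are made-up numbers, not real Wall Street Journal statistics:

```python
import math

# Illustrative IDF computation: idf(term) = log(N / df(term)).
# N and the df values are hypothetical.
N = 100_000
df = {"thailand": 500, "stock": 25_000, "market": 30_000}

idf = {term: math.log(N / n) for term, n in df.items()}
# "thailand" appears in the fewest documents, so it gets the largest
# IDF and is the most discriminative of the three query terms.
```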
Retrieval Models: Ranked Boolean
Ranked Boolean: calculate a document score
- Term evidence: evidence that term i occurs in doc j, e.g., (tf_ij) or (tf_ij * idf_i)
- AND weight: minimum of the argument weights
- OR weight: maximum of the argument weights
Example with term evidence 0.2 (Thailand), 0.6 (stock), 0.4 (market):
- Query (Thailand AND stock AND market): min(0.2, 0.6, 0.4) = 0.2
- Query (Thailand OR stock OR market): max(0.2, 0.6, 0.4) = 0.6
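Min/max scoring over term evidence is tiny in code. The evidence weights below are illustrative:

```python
# Sketch of ranked Boolean scoring with illustrative term evidence:
# an AND node scores the minimum of its argument weights,
# an OR node the maximum.
evidence = {"thailand": 0.2, "stock": 0.6, "market": 0.4}

# Query: (Thailand AND stock AND market)
and_score = min(evidence.values())   # 0.2
# Query: (Thailand OR stock OR market)
or_score = max(evidence.values())    # 0.6
```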
Retrieval Models: Ranked Boolean
Advantages:
- All advantages of the unranked Boolean algorithm: works well when the query is precise; predictable; efficient
- Results in a ranked list (not an unordered full list); easier to browse and find the most relevant documents than with unranked Boolean
- The ranking criterion is flexible: e.g., different variants of term evidence
Disadvantages:
- Still an exact-match (document selection) model: recall and precision are inversely correlated for strict vs. loose queries
- Predictability makes users overestimate retrieval quality
Retrieval Models: Vector Space Model
Vector space model
- Any text object can be represented by a term vector: documents, queries, passages, sentences
- A query can be seen as a short document
- Similarity is determined by distance in the vector space; example: the cosine of the angle between two vectors
The SMART system
- Developed at Cornell University, 1960-1999
- Still quite popular
The Lucene system
- Open-source information retrieval library (based on Java)
- Works with Hadoop (MapReduce) in large-scale applications (e.g., Amazon Book)
Retrieval Models: Vector Space Model
Vector space model vs. Boolean model
- Boolean model
  - Query: a Boolean expression that a document must satisfy
  - Retrieval: deductive inference
- Vector space model
  - Query: viewed as a short document in a vector space
  - Retrieval: find similar vectors/objects
Retrieval Models: Vector Space Model
Vector representation
[Figure: documents D1, D2, D3 and a query plotted as vectors in a three-dimensional term space with axes Java, Sun, Starbucks]
Retrieval Models: Vector Space Model
Given a query vector and a document vector
  q = (q_1, q_2, ..., q_n)
  d_j = (d_{j,1}, d_{j,2}, ..., d_{j,n})
calculate their similarity as the cosine of the angle between them:
  sim(q, d_j) = cos(theta(q, d_j))
              = (q . d_j) / (|q| |d_j|)
              = (q_1 d_{j,1} + q_2 d_{j,2} + ... + q_n d_{j,n})
                / ( sqrt(q_1^2 + ... + q_n^2) * sqrt(d_{j,1}^2 + ... + d_{j,n}^2) )
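Cosine similarity is a few lines of code. A minimal sketch following the formula above, with illustrative vectors:

```python
import math

def cosine(q, d):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(qk * dk for qk, dk in zip(q, d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    return dot / (norm_q * norm_d)

# Parallel vectors have cosine 1; orthogonal vectors have cosine 0.
sim_parallel = cosine([1, 1, 0], [2, 2, 0])    # 1.0
sim_orthogonal = cosine([1, 0, 0], [0, 1, 0])  # 0.0
```

Note that the cosine depends only on vector direction, not length, which is why it doubles as a length normalization.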
Retrieval Models: Vector Space Model
Vector coefficients
- The coefficients (vector elements) represent term evidence / term importance
- They are derived from several elements:
  - Document term weight: evidence for the term in the document/query
  - Collection term weight: importance of the term from observation of the collection
  - Length normalization: reduces document length bias
Naming convention for coefficients: q_k, d_{j,k} = DCL.DCL, where each triple lists the Document term weight, Collection term weight, and Length normalization; the first triple describes the query term weight and the second the document term weight
Retrieval Models: Vector Space Model
Common vector weight components: lnc.ltc, a widely used term weight
- "l": log(tf) + 1
- "n": no weight/normalization
- "t": log(N/df)
- "c": cosine normalization

  sim(q, d_j) = sum_k [ (1 + log tf_{q,k}) * (1 + log tf_{j,k}) * log(N/df_k) ]
                / ( sqrt( sum_k (1 + log tf_{q,k})^2 ) * sqrt( sum_k [ (1 + log tf_{j,k}) * log(N/df_k) ]^2 ) )

where tf_{q,k} and tf_{j,k} are the frequencies of term k in the query and in document j, N is the total number of documents, and df_k is the number of documents containing term k.
Retrieval Models: Vector Space Model
Common vector weight components: dnn.dtb, which handles varied document lengths
- "d": 1 + ln(1 + ln(tf))
- "t": log(N/df)
- "b": 1 / (0.8 + 0.2 * doclen/avg_doclen)
Retrieval Models: Vector Space Model
Standard vector space
- Represent queries/documents in a vector space
- Each dimension corresponds to a term in the vocabulary
- Use a combination of components to represent the term evidence in both query and document
- Use a similarity function to estimate the relationship between query and documents (e.g., cosine similarity)
Retrieval Models: Vector Space Model
Advantages:
- Best-match method; it does not need a precise query
- Generates ranked lists; easy to explore the results
- Simplicity: easy to implement
- Effectiveness: often works well
- Flexibility: can utilize different types of term weighting methods
- Used in a wide range of IR tasks: retrieval, classification, summarization, content-based filtering, ...
Retrieval Models: Vector Space Model
Disadvantages:
- Hard to choose the dimensions of the vector ("basic concepts"); terms may not be the best choice
- Assumes terms are independent of one another
- Vector operations are chosen heuristically: the choice of term weights, the choice of similarity function
- Assumes a query and a document can be treated in the same way
Retrieval Models: Vector Space Model
What makes a good vector representation:
- Orthogonal: the dimensions are linearly independent ("no overlap")
- No ambiguity (e.g., the word "Java")
- Wide coverage and good granularity
- Good interpretability (e.g., representation of semantic meaning)
- Many possibilities: words, stemmed words, "latent concepts", ...
Retrieval Models: Latent Semantic Indexing
Dual space of terms and documents
Retrieval Models: Latent Semantic Indexing
Latent Semantic Indexing (LSI): explore the correlation between terms and documents
- Two terms are correlated (may share similar semantic concepts) if they often co-occur
- Two documents are correlated (share similar topics) if they have many common words
LSI: associate each term and document with a small number of semantic concepts/topics
Retrieval Models: Latent Semantic Indexing
Use the singular value decomposition (SVD) to find a small set of concepts/topics:

  X = U S V^T, with U^T U = I_m and V^T V = I_m

where m is the number of concepts/topics:
- U: representation of the terms in the concept space (equivalently, of the concepts in term space); its columns are orthonormal
- V: representation of the documents in the concept space (equivalently, of the concepts in document space); its columns are orthonormal
- S: diagonal matrix of singular values (the concept space)
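The decomposition and its orthonormality properties can be checked with NumPy. The small term-by-document matrix below is illustrative:

```python
import numpy as np

# Illustrative term-by-document matrix X (rows = terms, columns = documents).
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)

# X = U S V^T, with orthonormal columns: U^T U = I and V^T V = I.
reconstructed = U @ S @ Vt
ortho_U = np.allclose(U.T @ U, np.eye(len(s)))
ortho_V = np.allclose(Vt @ Vt.T, np.eye(len(s)))
```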
Retrieval Models: Latent Semantic Indexing
Properties of Latent Semantic Indexing
- The diagonal elements of S (the singular values S_k) are in descending order; the larger, the more important
- X'_k = sum_{i <= k} u_i S_i v_i^T is the rank-k matrix that best approximates X, where u_i and v_i are the column vectors of U and V
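The best-approximation property can be checked numerically: truncating to k concepts leaves a Frobenius-norm error equal to the square root of the sum of the squared discarded singular values. The matrix below is illustrative:

```python
import numpy as np

# Illustrative term-by-document matrix.
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
# Rank-k approximation X'_k = U_k S_k V_k^T.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# ||X - X'_k||_F equals sqrt of the sum of the squared discarded
# singular values; here only s[2] is discarded.
err = np.linalg.norm(X - X_k, "fro")
```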
Retrieval Models: Latent Semantic Indexing
Other properties of Latent Semantic Indexing
- The columns of U are eigenvectors of XX^T
- The columns of V are eigenvectors of X^TX
- The singular values on the diagonal of S are the positive square roots of the nonzero eigenvalues of both XX^T and X^TX
Retrieval Models: Latent Semantic Indexing
[Figures: successive rank-k approximations of the term-document matrix X]
Importance of concepts: the size of S_k reflects the importance of concept k, and it reflects the error of approximating X with a small number of concepts
Retrieval Models: Latent Semantic Indexing
SVD representation
- Reduces the high-dimensional representation of a document or query to a low-dimensional concept space
- SVD tries to preserve the Euclidean distances of the document/term vectors
[Figure: vectors plotted against axes Concept 1 and Concept 2]
Retrieval Models: Latent Semantic Indexing
SVD representation
[Figure: representation of the documents in the two-dimensional concept space]
[Figure: representation of the terms in the two-dimensional concept space]
Retrieval Models: Latent Semantic Indexing
Retrieval with respect to a query
- Map (fold in) the query into the representation of the concept space: q' = q^T U_k Inv(S_k)
- Use the new representation of the query to calculate the similarity between the query and all documents, e.g., with cosine similarity
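The fold-in step can be sketched with NumPy: the query is mapped into the k-dimensional concept space and compared against the document rows of V_k by cosine similarity. The matrix and query below are illustrative:

```python
import numpy as np

# Illustrative term-by-document matrix (rows = terms, columns = documents).
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
q = np.array([1., 1., 0., 0.])              # query in term space
q_k = q @ U[:, :k] @ np.diag(1.0 / s[:k])   # fold-in: q' = q^T U_k Inv(S_k)

docs_k = Vt[:k, :].T                        # documents in concept space
cos = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
# The query equals document 1's term vector, so document 1 ranks first.
best = int(np.argmax(cos))
```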
Retrieval Models: Latent Semantic Indexing
Qry: Machine Learning Protein
Representation of the query in the term vector space:
  q = [0 0 1 1 0 1 0 0 0]^T
Representation of the query in the latent semantic space (2 concepts), via q' = q^T U_k Inv(S_k):
  q' = [-0.3571 0.1635]^T
[Figure: the query plotted with the terms in the two-dimensional concept space]
Retrieval Models: Latent Semantic Indexing
Comparison of Retrieval Results in term space and concept space
Qry: Machine Learning Protein
Retrieval Models: Latent Semantic Indexing
Problems with latent semantic indexing
- Difficult to decide the number of concepts
- There is no probabilistic interpretation of the results
- Obtaining the LSI model from the SVD is computationally costly
Language Models: Motivation
Vector space model for information retrieval
- Documents and queries are vectors in the term space
- Relevance is measured by the similarity between document vectors and the query vector
Problems with the vector space model
- Ad-hoc term weighting schemes
- Ad-hoc similarity measurement
- No justification of the relationship between relevance and similarity
We need more principled retrieval models...
Introduction to Language Models:
A language model can be created for any language sample
- A document
- A collection of documents
- A sentence, paragraph, chapter, query, ...
The size of the language sample affects the quality of the language model
- Long documents have more accurate models
- Short documents have less accurate models
- Models for a sentence, paragraph, or query may not be reliable
Introduction to Language Models:
A document language model defines a probability distribution over indexed terms
- E.g., the probability of generating a term; the probabilities sum to 1
- A query can be seen as observed data from an unknown model
- A query also defines a language model
How might the models be used for IR? Rank documents by Pr(q | d_i)
Multinomial/Unigram Language Models
A language model built from a multinomial distribution over the single terms (i.e., unigrams) in the vocabulary
Example: with five words in the vocabulary (sport, basketball, ticket, finance, stock), the language model of a document d_i is:
  {P_i("sport"), P_i("basketball"), P_i("ticket"), P_i("finance"), P_i("stock")}
Formally, the language model is {P_i(w) for any word w in vocabulary V}, with
  sum_k P_i(w_k) = 1 and 0 <= P_i(w_k) <= 1
Language Model for IR: Example
Estimating a language model for each document:
  d1: sport, basketball, ticket, sport
  d2: basketball, ticket, finance, ticket, sport
  d3: stock, finance, finance, stock
Build a language model for each of d1, d2, d3; then estimate the generation probability Pr(q | d_i) for the query q = "sport, basketball" and generate the retrieval results.

With maximum likelihood estimation (MLE), the language model of d2 (basketball, ticket, finance, ticket, sport) is:
  (p_sport, p_basketball, p_ticket, p_finance, p_stock) = (0.2, 0.2, 0.4, 0.2, 0)
For the query "basketball ticket": Pr(q | d2) = 0.2 * 0.4 = 0.08
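The whole example can be put together as a short sketch: estimate an MLE unigram model per document, then rank by the query likelihood Pr(q | d):

```python
from collections import Counter

def unigram_lm(words):
    """Maximum-likelihood unigram model of a document."""
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
query = ["basketball", "ticket"]

scores = {}
for name, words in docs.items():
    lm = unigram_lm(words)
    p = 1.0
    for w in query:
        p *= lm.get(w, 0.0)   # an unseen query term gives probability 0
    scores[name] = p
# d1: 0.25 * 0.25 = 0.0625; d2: 0.2 * 0.4 = 0.08; d3: 0.0
ranking = sorted(scores, key=scores.get, reverse=True)  # d2, d1, d3
```

Note how d3, which contains neither query term, scores exactly zero; smoothing techniques (covered later in most treatments of language-model IR) address this.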
Retrieval Models: Outline
Retrieval Models
- Exact-match retrieval methods
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval methods
  - Vector space retrieval method
  - Latent semantic indexing
  - Language modeling approach