E.G.M. Petrakis Information Retrieval Models 1
Classic IR Models
Boolean model
  - simple model based on set theory
  - queries as Boolean expressions
  - adopted by many commercial systems
Vector space model
  - queries and documents as vectors in an M-dimensional space
  - M is the number of terms
  - find documents most similar to the query in the M-dimensional space
Probabilistic model
  - a probabilistic approach
  - assume an ideal answer set for each query
  - iteratively refine the properties of the ideal answer set
Document Index Terms
Each document is represented by a set of representative index terms or keywords
  - requires text pre-processing (off-line)
  - these terms summarize document contents
  - adjectives, adverbs, connectives are less useful
  - the index terms are mainly nouns (lexicon look-up)
Not all terms are equally useful
  - very frequent terms are not useful
  - very infrequent terms are not useful either
  - terms have varying relevance (weights) when used to describe documents
Text Preprocessing
Extract terms from documents and queries
  - document - query profile
Processing stages
  - word separation
  - sentence splitting
  - change terms to a standard form (e.g., lowercase)
  - eliminate stop-words (e.g., and, is, the, …)
  - reduce terms to their base form (e.g., eliminate prefixes, suffixes)
  - construct term indices (usually inverted files)
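The processing stages above can be sketched roughly as follows. This is a minimal illustration, not the method used in the slides: the stop-word list, the suffix list, and the function names (tokenize, strip_suffix, preprocess) are assumptions made for the example, sentence splitting is omitted, and a real system would use a proper stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word and suffix lists (assumptions).
STOP_WORDS = {"and", "is", "the", "a", "an", "of", "in", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Word separation plus normalization to lowercase."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_suffix(term):
    """Crude base-form reduction: drop the first matching suffix,
    provided at least three characters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text):
    """Return the list of index terms extracted from a document or query."""
    terms = tokenize(text)
    terms = [t for t in terms if t not in STOP_WORDS]  # eliminate stop-words
    return [strip_suffix(t) for t in terms]            # reduce to base form


print(preprocess("The cats and the dog is walking"))
```

The same function would be applied to documents and to queries, so that both end up in the same term space.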
Text Preprocessing Chart
[Figure: text preprocessing chart, from Baeza-Yates & Ribeiro-Neto, 1999]
Inverted Index
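[Figure: an inverted index with three parts. The index lists the terms in alphabetical (Greek) order: statue (άγαλμα), love (αγάπη), …, work (δουλειά), …, morning (πρωί), …, ocean (ωκεανός). Each term points to a posting list of pairs, e.g. (1,2)(3,4), (4,3)(7,5), (10,3). The posting lists refer to the documents 1, 2, …, 11, …]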
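A small sketch of how such an index could be built is given below. It assumes that each posting is a (document id, term frequency) pair, which the figure does not state explicitly; the function name build_inverted_index and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: dict mapping document id -> list of index terms.

    Returns a dict mapping each term to its posting list, a list of
    (doc_id, term_frequency) pairs in increasing doc_id order.
    The pair layout is an assumption made for this example.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id]).items():
            index[term].append((doc_id, tf))
    return dict(index)


# Toy corpus (hypothetical data).
docs = {
    1: ["statue", "love", "statue"],
    2: ["work", "morning"],
    3: ["love", "ocean", "love"],
}
index = build_inverted_index(docs)
print(index["love"])
```

Looking up a term is then a single dictionary access, and only the documents that actually contain the term are touched.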
Basic Notation
Document: usually text
  - D: document collection (corpus)
  - d: an instance of D
Query: same representation as documents
  - Q: set of all possible queries
  - q: an instance of Q
Relevance: R(d,q)
  - binary relation R: D × Q → {0,1}
  - d is “relevant” to q iff R(d,q) = 1, or
  - degree of relevance: R(d,q) ∈ [0,1], or
  - probability of relevance: R(d,q) = Prob(R|d,q)
Term Weights
T = {t1, t2, …, tM}: the terms in the corpus
N: number of documents in the corpus
dj: a document
  - dj is represented by (w1j, w2j, …, wMj), where
  - wij > 0 if ti appears in dj
  - wij = 0 otherwise
q is represented by (q1, q2, …, qM)
R(d,q) > 0 if q and d have common terms
Term Weighting
terms \ docs   d1    d2    …    dN
t1             w11   w12   …    w1N
t2             …     w2i   …    …
…
tM             wM1   …     …    wMN
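As an illustration of the term-by-document weight matrix, the sketch below builds one from a toy corpus. It assumes raw term counts as the weights wij, which is only one possible choice; the function names and the sample data are hypothetical.

```python
from collections import Counter


def term_document_matrix(docs):
    """docs: list of term lists, one per document.

    Returns (terms, W) where W[i][j] is the weight w_ij of terms[i]
    in document j. Raw counts are used as weights (an assumption).
    """
    terms = sorted({t for d in docs for t in d})
    counts = [Counter(d) for d in docs]
    W = [[c[t] for c in counts] for t in terms]
    return terms, W


def query_vector(query_terms, terms):
    """Represent a query as (q1, ..., qM) over the same term list."""
    qc = Counter(query_terms)
    return [qc[t] for t in terms]


def has_common_terms(W, q, j):
    """True if query q and document j share at least one term."""
    return any(W[i][j] > 0 and q[i] > 0 for i in range(len(q)))


docs = [["john", "mary", "john"], ["ann", "mary"], ["ann"]]
terms, W = term_document_matrix(docs)
q = query_vector(["john"], terms)
print(terms, W, q)
```

Column j of W is the vector (w1j, …, wMj) that represents document dj, and a zero entry means the term does not appear in that document.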
Document Space (corpus)
[Figure: the document space D with a query q. Legend: query, relevant document, non-relevant document]
Boolean Model
Based on set theory and Boolean algebra
  - Boolean queries: “John” and “Mary” not “Ann”
  - terms linked by “and”, “or”, “not”
  - term weights are 0 or 1 (wij = 0 or 1)
  - query terms are present or absent in a document
  - a document is relevant if the query condition is satisfied
Pros: simple, used in many commercial systems
Cons: no ranking, not easy for complex queries
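The example query “John” and “Mary” not “Ann” can be evaluated with plain set operations, as in the sketch below. The toy corpus and the helper names (build_term_sets, docs_with) are assumptions for illustration only.

```python
def build_term_sets(docs):
    """docs: dict mapping doc id -> list of terms.
    Returns a dict mapping each term to the set of doc ids containing it."""
    term_sets = {}
    for doc_id, terms in docs.items():
        for term in terms:
            term_sets.setdefault(term, set()).add(doc_id)
    return term_sets


# Toy corpus (hypothetical data).
docs = {
    1: ["john", "mary"],
    2: ["john", "mary", "ann"],
    3: ["john"],
    4: ["mary", "ann"],
}
term_sets = build_term_sets(docs)


def docs_with(term):
    """Set of documents in which the term is present (weight 1)."""
    return term_sets.get(term, set())


# "and" is set intersection, "not" is set difference, "or" would be union.
result = (docs_with("john") & docs_with("mary")) - docs_with("ann")
print(result)
```

The result is an unordered set, which reflects the lack of ranking noted above: every document either satisfies the condition or does not.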
Query Processing
For each term ti in query q = {t1, t2, …, tM}
  1) use the index to retrieve all dj with wij > 0
  2) sort them by decreasing order (e.g., by term frequency)
Return documents satisfying the query condition
Slow for many terms: involves set intersections
  - Keep only the top K documents for each term at step 2, or
  - Do not process all query terms
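The procedure above might look like the following sketch. It assumes a conjunctive query (all terms must be present) and postings stored as (doc_id, term_frequency) pairs; the function name process_query, its parameter k, and the sample index are hypothetical.

```python
def process_query(index, query_terms, k=None):
    """index: dict mapping term -> list of (doc_id, term_frequency).

    For each query term: retrieve its postings (step 1), sort them by
    decreasing term frequency (step 2), optionally keep only the top k,
    then intersect the per-term document sets (AND query, an assumption).
    """
    result = None
    for term in query_terms:
        postings = sorted(index.get(term, []), key=lambda p: p[1], reverse=True)
        if k is not None:
            postings = postings[:k]  # keep only the top K documents per term
        doc_ids = {doc_id for doc_id, _ in postings}
        result = doc_ids if result is None else result & doc_ids
    return result if result is not None else set()


# Toy index (hypothetical data).
index = {
    "john": [(1, 5), (2, 1), (3, 3)],
    "mary": [(1, 2), (2, 4), (4, 1)],
}
print(process_query(index, ["john", "mary"]))
print(process_query(index, ["john", "mary"], k=2))
```

Truncating to the top K makes the intersections cheaper but approximate: with k=2 in this toy index, document 2 is dropped even though it contains both terms.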
Vector Space Model
Documents and queries are M-dimensional term vectors
  - non-binary weights to index terms
  - a query is similar to a document if their vectors are similar
  - retrieved documents are sorted by decreasing order
  - a document may match a query only