Statistical Models for Information Retrieval and Text Mining
ChengXiang Zhai (翟成祥)
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Department of Statistics, University of Illinois at Urbana-Champaign
Goal of the Course
• Overview of techniques for information retrieval (IR)
• Detailed explanation of a few statistical models for IR and text mining
– Probabilistic retrieval models (for search)
– Probabilistic topic models (for text mining)
• Potential benefit for you:
– Some ideas that work well for text retrieval may also work for computer vision
– Techniques from computer vision may be applicable to IR
– IR and text mining raise new challenges as well as opportunities
• Early ideas (1940's & 1950's)
– 1945: V. Bush's article "As We May Think"
– 1957: H. P. Luhn's idea of word counting and matching
• Indexing & Evaluation Methodology (1960's)
– SMART system (G. Salton's group)
– Cranfield test collection (C. Cleverdon's group)
– Indexing: automatic can be as good as manual (controlled vocabulary)
• TR Models (1970's & 1980's) …
• Large-scale Evaluation & Applications (1990's-Present)
– TREC (D. Harman & E. Voorhees, NIST)
– Web search, PubMed, …
– Boundaries with related areas are disappearing
Short vs. Long Term Info Need
• Short-term information need (ad hoc retrieval)
– "Temporary need", e.g., info about used cars
– Information source is relatively static
– User "pulls" information
– Application examples: library search, Web search
• Long-term information need (filtering)
– "Stable need", e.g., new data mining algorithms
– Information source is dynamic
– System "pushes" information to the user
– Application example: news filtering
Ranking is often preferred
• Relevance is a matter of degree
• A user can stop browsing anywhere, so the cutoff is controlled by the user
– High-recall users would view more items
– High-precision users would view only a few
• Theoretical justification: Probability Ranking Principle [Robertson 77]
• Assumptions: independent relevance and sequential browsing (not necessarily all hold in reality)
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
Summarize a Ranking: MAP
• Given that n docs are retrieved
– Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs
– E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2
– If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero
• Compute the average over all the relevant documents
– Average precision = (p(1)+…+p(k))/k
• This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document
• Mean Average Precision (MAP)
– MAP = arithmetic mean of average precision over a set of topics
– gMAP = geometric mean of average precision over a set of topics (more sensitive to improvements on poorly performing topics)
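To make the computation concrete, here is a minimal Python sketch of (non-interpolated) average precision and MAP; the function names and the example document IDs are illustrative, not from the lecture.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.
    Relevant documents that are never retrieved contribute precision 0."""
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)  # precision at the rank of this rel. doc
    # divide by the total number of relevant docs, not just those retrieved
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Example from the slide: first relevant doc at rank 2 -> p(1) = 1/2
print(average_precision(["d3", "d7", "d1"], {"d7", "d1"}))  # (1/2 + 2/3) / 2
```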
Summarize a Ranking: NDCG
• What if relevance judgments are on a scale of [1,r], r>2?
• Cumulative Gain (CG) at rank n
– Let the ratings of the n documents be r1, r2, …, rn (in ranked order)
– CG = r1 + r2 + … + rn
• Discounted Cumulative Gain (DCG) at rank n
– DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
– We may use any base b for the logarithm; positions up to rank b are not discounted
• Normalized Discounted Cumulative Gain (NDCG) at rank n
– Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
– The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc.
• NDCG is now quite popular in evaluating Web search
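A small sketch of DCG/NDCG under the conventions above (logarithm of base `base`, no discount up to that rank); the example ratings are made up for illustration.

```python
import math

def dcg(gains, base=2):
    """Discounted cumulative gain of graded relevance values (in ranked order);
    ranks up to `base` are not discounted."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        discount = math.log(i, base) if i > base else 1.0
        total += g / discount
    return total

def ndcg(gains, all_judged_gains, base=2):
    """Normalize by the DCG of the ideal ranking (judgments sorted descending)."""
    ideal = sorted(all_judged_gains, reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal, base)
    return dcg(gains, base) / ideal_dcg if ideal_dcg > 0 else 0.0

# ratings of the top-4 retrieved docs vs. all judged ratings for this query
print(ndcg([3, 2, 3, 0], [3, 3, 2, 2, 1, 0]))
```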
The Pooling Strategy
• When the test collection is very large, it's impossible to judge all the documents completely
• TREC's strategy: pooling
– Appropriate for relative comparison of different systems
– Given N systems, take the top-K results from each and combine them to form a "pool"
– Assessors judge all the documents in the pool; unjudged documents are assumed to be non-relevant
• Advantage: less human effort
• Potential problems:
– Bias due to incomplete judgments (okay for relative comparison)
– Favors systems that contributed to the pool; when the collection is reused, a new system that retrieves unjudged relevant documents may be underestimated
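A minimal sketch of pool construction under these assumptions; the system names, cutoff K, and labeling convention are illustrative.

```python
def build_pool(system_rankings, k=100):
    """Form a judging pool from the top-k results of each system.
    system_rankings: {system_name: [doc_id, ...]} in ranked order."""
    pool = set()
    for ranking in system_rankings.values():
        pool.update(ranking[:k])
    return pool

def pooled_labels(ranking, judged_relevant, pool):
    """Relevance labels under pooling: documents outside the pool are
    unjudged and therefore treated as non-relevant."""
    return [1 if (doc in pool and doc in judged_relevant) else 0 for doc in ranking]
```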
User Studies
• Limitations of the Cranfield evaluation strategy:
– How do we evaluate a technique for improving the interface of a search engine?
– How do we evaluate the overall utility of a system?
• User studies are needed
• General user study procedure:
– Experimental systems are developed
– Subjects are recruited as users
– Variation can be in the system or the users
– Users use the system and user behavior is logged
– User information is collected (before: background; after: experience with the system)
• Clickthrough-based real-time user studies:
– Assume clicked documents to be relevant
– Mix results from multiple methods and compare their clickthroughs
What is a good indexing term?
• Specific (phrases) or general (single words)?
• Luhn found that words with middle frequency are most useful
– Not too specific (low utility, but still useful!)
– Not too general (lack of discrimination, e.g., stop words)
– Stop word removal is common, but rare words are kept
• All words or a (controlled) subset? When term weighting is used, this becomes a matter of weighting rather than selecting indexing terms
Tokenization
• Word segmentation is needed for some languages
– Is it really needed?
• Normalize lexical units: words with similar meanings should be mapped to the same indexing term
– Stemming: mapping all inflectional forms of a word to the same root form, e.g., "computes", "computing", and "computed" map to the root "compute"
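A toy illustration of tokenization plus crude suffix stripping; real systems use a proper stemmer such as Porter (e.g., nltk.stem.PorterStemmer), so treat this only as a sketch.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def crude_stem(token):
    """Toy suffix-stripping stemmer for illustration only."""
    for suffix in ("ational", "ation", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print([crude_stem(t) for t in tokenize("Computing computers computed the computation")])
```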
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase– Each term defines one dimension– N terms define a high-dimensional space– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the query vector and document vector in the vector space
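As one possible instantiation, here is a sketch that scores a document against a query by cosine similarity over sparse term-weight vectors; raw term frequencies stand in for the "importance" weights and the example text is made up.

```python
import math
from collections import Counter

def cosine(q_vec, d_vec):
    """Cosine similarity between sparse term-weight vectors (dicts: term -> weight)."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

# raw term-frequency vectors as a simple stand-in for term weights
q = Counter("statistical retrieval models".split())
d = Counter("statistical language models for text retrieval".split())
print(cosine(q, d))
```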
TF Weighting
• Idea: a term is more important if it occurs more frequently in a document
• Some formulas: let f(t,d) be the frequency count of term t in doc d
– Raw TF: TF(t,d) = f(t,d)
– Log TF: TF(t,d) = log f(t,d)
– Maximum frequency normalization: TF(t,d) = 0.5 + 0.5*f(t,d)/MaxFreq(d)
– "Okapi/BM25 TF": TF(t,d) = k*f(t,d)/(f(t,d) + k)
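The formulas above written out as a small sketch; k=1.2 is only a common illustrative value for the Okapi/BM25 TF parameter, not a value prescribed by the lecture.

```python
import math

def tf_raw(f):            # raw count
    return f

def tf_log(f):            # dampen high counts; define log of 0 as 0
    return math.log(f) if f > 0 else 0.0

def tf_maxnorm(f, max_f): # normalize by the most frequent term in the doc
    return 0.5 + 0.5 * f / max_f

def tf_bm25(f, k=1.2):    # saturating TF: bounded above by k as f grows
    return k * f / (f + k)

for f in (1, 2, 5, 20):
    print(f, tf_raw(f), round(tf_log(f), 2), round(tf_bm25(f), 2))
```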
Rocchio in Practice
• Negative (non-relevant) examples are not very important (why?)
• Often truncate the vector to a lower dimension (i.e., consider only a small number of words that have high weights in the centroid vector) (efficiency concern)
• Avoid overfitting by keeping a relatively high weight on the original query terms (why?)
• Can be used for relevance feedback and pseudo feedback
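A sketch of Rocchio feedback reflecting these practical points; the alpha/beta/gamma values and the truncation size are illustrative defaults, not values prescribed by the lecture.

```python
def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15, keep=50):
    """Rocchio feedback over sparse term-weight dicts.
    High alpha keeps the original query dominant (avoids overfitting);
    low gamma reflects that negative examples matter less;
    `keep` truncates the expanded query to the highest-weighted terms."""
    new_q = {t: alpha * w for t, w in query_vec.items()}
    for docs, sign, coeff in ((rel_docs, +1, beta), (nonrel_docs, -1, gamma)):
        if not docs:
            continue
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + sign * coeff * w / len(docs)
    # keep only positively weighted terms, truncated to the top `keep`
    top = sorted(((t, w) for t, w in new_q.items() if w > 0),
                 key=lambda x: x[1], reverse=True)[:keep]
    return dict(top)
```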
• Basic idea: relevance depends on how well a query matches a document
– Define features on Q x D, e.g., #matched terms, the highest IDF of a matched term, document length, …
– P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D); θ)
– Use training data (known relevance judgments) to estimate the parameters θ
– Apply the model to rank new documents
• Early work (e.g., logistic regression [Cooper 92, Gey 94])
– Attempted to compete with other models
• Recent work (e.g., Ranking SVM [Joachims 02], RankNet [Burges et al. 05])
– Attempted to leverage other models
– More features (notably PageRank, anchor text)
– More sophisticated learning (Ranking SVM, RankNet, …)
• Advantages
– May combine multiple features (helps improve accuracy and combat web spam)
– May re-use all the past relevance judgments (self-improving)
• Problems
– Doesn't learn the semantic associations between query words and document words
– Not much guidance on feature generation (relies on traditional retrieval models)
• All current Web search engines use some kind of learning algorithm to combine many features such as PageRank and many different representations of a page
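As an illustration of the feature-based view, here is a sketch of a pointwise ranker in the spirit of the early logistic-regression work; the features and training examples are invented for the example, and the more recent systems mentioned above (Ranking SVM, RankNet) use pairwise objectives instead.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pointwise(examples, epochs=200, lr=0.1):
    """Fit P(R=1|Q,D) = g(f1,...,fn; w) with logistic regression by SGD.
    examples: list of (feature_vector, relevance_label) pairs."""
    examples = list(examples)
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        random.shuffle(examples)
        for feats, y in examples:
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b)
            err = y - p
            w = [wi + lr * err * fi for wi, fi in zip(w, feats)]
            b += lr * err
    return w, b

# features: [#matched terms, max IDF of a matched term, log doc length]
train = [([3, 4.2, 5.1], 1), ([0, 0.0, 6.0], 0), ([2, 3.1, 4.8], 1), ([1, 0.7, 7.2], 0)]
w, b = train_pointwise(train)
score = lambda feats: sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b)
print(sorted(train, key=lambda ex: score(ex[0]), reverse=True))
```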
Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)
Two parameters for each term Ai:
pi = P(Ai=1|Q,R=1): prob. that term Ai occurs in a relevant doc
qi = P(Ai=1|Q,R=0): prob. that term Ai occurs in a non-relevant doc
log O(R=1|Q,D)  =(rank)  Σ_{i=1..k, qi=di=1}  log [ pi(1 − qi) / (qi(1 − pi)) ]   (RSJ model)
i.e., sum over the k query terms that occur in the document.
How to estimate parameters? Suppose we have relevance judgments:
p̂i = (#(rel. docs with Ai=1) + 0.5) / (#(rel. docs) + 1)
q̂i = (#(non-rel. docs with Ai=1) + 0.5) / (#(non-rel. docs) + 1)
“+0.5” and “+1” can be justified by Bayesian estimation
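A sketch of RSJ term weighting and scoring using the smoothed estimates above; the data structures and names are illustrative, not from the lecture.

```python
import math

def rsj_weights(term_doc_presence, relevance):
    """Estimate the RSJ term weight log[p(1-q)/(q(1-p))] for each query term.
    term_doc_presence: {term: set of doc ids containing the term}
    relevance: {doc_id: 1 or 0} judged relevance for the query."""
    rel = {d for d, r in relevance.items() if r == 1}
    nonrel = {d for d, r in relevance.items() if r == 0}
    weights = {}
    for term, docs in term_doc_presence.items():
        p = (len(docs & rel) + 0.5) / (len(rel) + 1)        # P(Ai=1 | Q, R=1)
        q = (len(docs & nonrel) + 0.5) / (len(nonrel) + 1)  # P(Ai=1 | Q, R=0)
        weights[term] = math.log(p * (1 - q) / (q * (1 - p)))
    return weights

def rsj_score(doc_terms, query_terms, weights):
    """Sum the weights of query terms that occur in the document."""
    return sum(weights[t] for t in query_terms if t in doc_terms and t in weights)
```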
Lecture 1: Key Points
• Vector Space Model is a family of models, not a single model
• Many variants of TF-IDF weighting exist, and some are more effective than others
• State-of-the-art retrieval performance is achieved through
– Bag-of-words representation
– TF-IDF weighting (BM25) + length normalization
– Pseudo relevance feedback (mostly for recall)
– For web search: add PageRank, anchor text, …, plus learning to rank
• Principled approaches didn’t lead to good performance directly (before the “language modeling approach” was proposed); heuristic modification has been necessary