Page 1
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Ben Carterette and Praveen Chandar
Dept. of Computer and Information Science, University of Delaware
Newark, DE ( CIKM ’09 )
Date: 2010/05/03
Speaker: Lin, Yi-Jhen
Advisor: Dr. Koh, Jia-Ling
Page 2
Agenda
Introduction
- Motivation, Goal
Faceted Topic Retrieval
- Task, Evaluation
Faceted Topic Retrieval Models
- 4 kinds of models
Experiment & Results
Conclusion
Page 3
Introduction - Motivation
Modeling documents as independently
relevant does not necessarily provide the optimal user experience.
Page 4
Introduction - Motivation
A traditional evaluation measure would reward System1 since it has higher recall.
Actually, we prefer System2, since it provides more information.
System2 is better!
Page 5
Introduction
Novelty and diversity become the new basis for defining relevance and evaluation measures.
They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic.
We call this faceted topic retrieval!
Page 6
Introduction - Goal
The faceted topic retrieval system must be able to find a small set of documents that covers all of the facets.
E.g., 3 documents that cover 10 facets are preferable to 5 documents that cover the same 10 facets.
Page 7
Faceted Topic Retrieval - Task
Define the task in terms of:
Information need:
- A faceted topic retrieval information need is one that has a set of answers - facets - that are clearly delineated.
How that need is best satisfied:
- Each answer is fully contained within at least one document.
Page 8
Faceted Topic Retrieval - Task
Information need: invest in renewable energy sources
Facets (a set of answers):
- invest in next generation technologies
- increase use of renewable energy sources
- double ethanol in gas supply
- shift to biodiesel
- shift to coal
Page 9
Faceted Topic Retrieval
Input - a query: a short list of keywords
Output - a ranked list of documents D1, D2, ..., Dn that contain as many unique facets as possible.
Page 10
Faceted Topic Retrieval - Evaluation
S-recall
S-precision
Redundancy
Page 11
Evaluation - an example for S-recall and S-precision
Total: 10 facets (assume all facets in documents are non-overlapping)
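The worked example itself did not survive extraction. As a sketch of the measure: S-recall at rank k is the fraction of the topic's facets covered by the union of the top-k documents (the standard subtopic-recall definition). The facet sets below are illustrative, not the slide's actual example:

```python
def s_recall(ranked_facets, total_facets, k):
    """Subtopic recall at rank k: fraction of all facets covered by the top-k docs.

    ranked_facets: list of sets; ranked_facets[i] = facets in the document at rank i+1
    total_facets:  total number of distinct facets for the topic
    """
    covered = set().union(*ranked_facets[:k]) if k > 0 else set()
    return len(covered) / total_facets

# Toy ranking over 10 facets, non-overlapping as the slide assumes
ranked = [{1, 2, 3, 4}, {5, 6, 7}, {8, 9, 10}]
s_recall(ranked, 10, 2)  # 0.7: top 2 documents cover 7 of 10 facets
```

S-precision is defined analogously from the minimum rank an optimal system would need to reach the same recall level.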
Page 12
Evaluation - an example for Redundancy
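The example was lost in extraction, and the slide does not reproduce the exact formula, so the following is only an assumed formulation: the fraction of facet occurrences in the top-k documents that repeat a facet already seen at a higher rank.

```python
def redundancy_at_k(ranked_facets, k):
    """Assumed redundancy measure (the slide's formula was lost):
    share of facet occurrences in the top-k docs that are repeats."""
    seen, total, repeats = set(), 0, 0
    for facets in ranked_facets[:k]:
        for f in facets:
            total += 1
            if f in seen:
                repeats += 1
            seen.add(f)
    return repeats / total if total else 0.0
```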
Page 13
Faceted Topic Retrieval Models
4 kinds of models:
- MMR (Maximal Marginal Relevance)
- Probabilistic Interpretation of MMR
- Greedy Result Set Pruning
- A Probabilistic Set-Based Approach
Page 14
1. MMR
2. Probabilistic Interpretation of MMR
Let c1=0, c3=c4
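The MMR and probabilistic-MMR formulas on this slide were lost in extraction. In the standard MMR formulation (Carbonell & Goldstein), a candidate's score trades off query similarity against its maximum similarity to already-selected documents: score(D) = λ·sim1(D, Q) - (1-λ)·max over selected Dj of sim2(D, Dj). A greedy reranker sketch; the relevance scores and similarity pairs below are illustrative, not from the paper:

```python
def mmr_rerank(docs, rel, sim, lam=0.7):
    """Greedy MMR reranking.

    docs: document ids to rank
    rel:  dict doc -> query-relevance score sim1(D, Q)
    sim:  function (d1, d2) -> inter-document similarity sim2
    lam:  trade-off between relevance and novelty
    """
    selected, remaining = [], list(docs)
    while remaining:
        best = max(
            remaining,
            key=lambda d: lam * rel[d]
            - (1 - lam) * max((sim(d, s) for s in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ = 0.5, a near-duplicate of the top document gets pushed below a less relevant but novel one.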
Page 15
3. Greedy Result Set Pruning
First, rank without considering novelty (i.e., in order of relevance).
Second, step down the list of documents, pruning documents with similarity greater than some threshold ϴ.
I.e., at rank i, remove any document Dj, j > i, with sim(Dj, Di) > ϴ.
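The pruning step can be sketched directly; sim and theta stand for whatever similarity measure and threshold the system uses (cosine similarity in the paper's experiments):

```python
def prune_ranking(ranked, sim, theta):
    """Walk down a relevance ranking and drop any document whose
    similarity to a higher-ranked kept document exceeds theta."""
    kept = []
    for d in ranked:
        if all(sim(d, k) <= theta for k in kept):
            kept.append(d)
    return kept
```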
Page 16
4. A Probabilistic Set-Based Approach
P(Fj ∈ D): the probability that a facet Fj occurs in at least one document in a set D:
P(Fj ∈ D) = 1 - ∏_{Di ∈ D} (1 - P(Fj ∈ Di))
The probability that all of the facets in a set F are captured by the documents D:
P(F ⊆ D) = ∏_j P(Fj ∈ D)
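Under independence assumptions, the two probabilities just described take a noisy-OR form: a facet is in the set unless every document misses it, and full coverage is the product over facets. A minimal sketch:

```python
import math

def facet_in_set(p_fd):
    """P(Fj in D) = 1 - prod_i (1 - P(Fj in Di)): noisy-OR over documents.

    p_fd: list of per-document probabilities P(Fj in Di)."""
    miss = 1.0
    for p in p_fd:
        miss *= 1.0 - p
    return 1.0 - miss

def all_facets_covered(p_matrix):
    """P(F subset of D): product over facets of P(Fj in D).

    p_matrix[j] = list of P(facet j in document i) for each document in D."""
    return math.prod(facet_in_set(row) for row in p_matrix)
```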
Page 17
4. A Probabilistic Set-Based Approach
4.1 Hypothesizing Facets
4.2 Estimating Document-Facet Probabilities
4.3 Maximizing Likelihood
Page 18
4.1 Hypothesizing Facets
Two unsupervised probabilistic methods:
- Relevance modeling
- Topic modeling with LDA
Instead of extracting facets directly from any particular word or phrase, we build a "facet model" P(w|F).
Page 19
4.1 Hypothesizing Facets
Since we do not know the facet
terms or the set of documents relevant to the facet, we will estimate them from the retrieved documents
Obtain m models from the top m retrieved documents by taking each document along with its k nearest neighbors as the basis for a facet model
Page 20
Relevance modeling
Estimate m "facet models" P(w|Fj) from a set of retrieved documents using the so-called RM2 approach:
DFj: the set of documents relevant to facet Fj
fk: facet terms
Page 21
Topic modeling with LDA
Probabilities P(w|Fj) and P(Fj) can be found through expectation maximization.
Page 22
4.2 Estimating Document-Facet Probabilities
Both the facet relevance model and the LDA model produce generative probabilities P(Di|Fj).
P(Di|Fj): the probability that sampling terms from the facet model Fj will produce document Di.
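Under a unigram facet model, this generation probability is typically scored in log space. A sketch; it assumes term smoothing has been applied upstream, and skipping terms absent from the facet model is a simplification for brevity:

```python
import math

def log_doc_given_facet(doc_term_counts, facet_model):
    """log P(Di | Fj) under a unigram facet model:
    sum over terms of count(w, Di) * log P(w | Fj).

    doc_term_counts: dict term -> count in document Di
    facet_model:     dict term -> P(w | Fj), assumed pre-smoothed
    """
    return sum(
        c * math.log(facet_model[w])
        for w, c in doc_term_counts.items()
        if w in facet_model  # simplification: unseen terms skipped
    )
```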
Page 23
4.3 Maximizing Likelihood
Define the likelihood function L(y) over document-selection indicators.
Constraint: K, the hypothesized minimum number of documents required to cover the facets.
Maximizing L(y) is an NP-hard problem.
Approximate solution: for each facet Fj, take the document Di with the maximum probability of containing Fj.
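A sketch of the approximate solution, assuming the document-facet probabilities from Section 4.2 are available per facet; the tie-breaking and output ordering here are assumptions, not the paper's exact procedure:

```python
def greedy_facet_cover(p, K):
    """For each facet j, pick the document with the highest
    document-facet probability; stop once K documents are chosen.

    p: list indexed by facet; p[j] is a dict doc_id -> probability
    K: cap on the number of selected documents
    """
    chosen = []
    for facet_probs in p:
        best = max(facet_probs, key=facet_probs.get)
        if best not in chosen:
            chosen.append(best)
        if len(chosen) == K:
            break
    return chosen
```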
Page 24
Experiment - Data
A query: a short list of keywords
Top 130 documents D1, ..., D130 retrieved by a query-likelihood language model
Page 25
Experiment - Data
For 60 queries, 2 assessors judged the top 130 retrieved documents D1, ..., D130:
- 44.7 relevant documents per query
- Each document contains 4.3 facets
- 39.2 unique facets on average (roughly one unique facet per relevant document)
- Agreement: 72% of all relevant documents were judged relevant by both assessors
Page 26
Experiment - Data
TDT5 sample topic definition (query and judgments)
Page 27
Experiment - Retrieval Engines
Using the Lemur toolkit:
- LM baseline: a query-likelihood language model
- RM baseline: pseudo-relevance feedback with a relevance model
- MMR: query-similarity scores from the LM baseline and cosine similarity for novelty
- AvgMix (Prob. MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score
- Pruning: removing documents from the LM baseline ranking based on cosine similarity
- FM: the set-based facet model
Page 28
Experiment - Retrieval Engines
FM: the set-based facet model
- FM-RM: each of the top m documents and its k nearest neighbors becomes a "facet model" P(w|Fj); then compute the probability P(Di|Fj)
- FM-LDA: use LDA to discover subtopics zj and obtain P(zj|D); we extract 50 subtopics
Page 29
Experiments - Evaluation
Use five-fold cross-validation to train and test systems:
- 48 queries in four folds to train model parameters
- Parameters are used to obtain ranked results on the remaining 12 queries
At the minimum optimal rank, we report S-recall, redundancy, and MAP.
Page 32
Conclusion
We defined a type of novelty retrieval task called faceted topic retrieval: retrieve the facets of an information need in a small set of documents.
We presented two novel models: one that prunes a retrieval ranking and one that is a formally motivated probabilistic set-based model.
Both models are competitive with MMR and outperform another probabilistic model.