A Few Examples Go A Long Way
Krisztian Balog, Wouter Weerkamp, Maarten de Rijke
Constructing Query Models from Elaborate Query Formulations
ISLA, University of Amsterdam
http://ilps.science.uva.nl
31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Singapore, July 20-24, 2008
Motivation
• Task: create an overview page for a given topic: find documents that discuss the topic in detail
[Slide graphic: the user's input to the search engine is just the query "cancer risk"]
How realistic is this?!
• Enterprise setting
• Users are willing to provide their information need in a more elaborate form
• a few keywords
• a few sample documents
• Sample documents can be obtained from click-through data along the way
Research Questions
• Can we make use of these sample documents in an effective and theoretically transparent manner?
• What is the effect of lifting the conditional dependence between the original query and expansion terms?
• Can we improve “aspect recall”?
Outline
• TREC Enterprise Track 2007
• Query model from sample documents
• Comparison with relevance models
• Results
• Conclusions and further work
TREC Enterprise Track 2007
• Document collection: web crawl of CSIRO (~370,000 docs, 4.2 GB)
• 50 topics
• Topic description is enriched with sample documents (on average 3 examples/topic)
• Relevance judgments on a 3-point scale (not relevant, possibly relevant, highly relevant)
Example Topic
<top>
  <num>CE-012</num>
  <query>cancer risk</query>
  <narr>Focus on genome damage and therefore cancer risk in humans.</narr>
  <page>CSIRO145-10349105</page>
  <page>CSIRO140-15970492</page>
  <page>CSIRO139-07037024</page>
  <page>CSIRO138-00801380</page>
</top>
Outline
• TREC Enterprise Track 2007
• Query model from sample documents
• Comparison with relevance models
• Results
• Conclusions and further work
Retrieval Model
• Standard Language Modeling
• Ranking documents by their likelihood of being relevant given the query Q:
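The equation on this slide was lost in transcription; what follows is a reconstruction of the standard query-likelihood formula in the usual notation, not copied from the slide:

```latex
P(D \mid Q) \;\propto\; P(D)\, P(Q \mid \theta_D) \;=\; P(D) \prod_{t \in Q} P(t \mid \theta_D)^{\,n(t,Q)}
```

where $\theta_D$ is the document language model and $n(t,Q)$ is the number of times term $t$ occurs in the query $Q$.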
Retrieval Model (2)
[Equation graphic with labels: document model, query model]
• Assuming uniform document priors, this provides the same ranking as minimizing the KL-divergence:
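The KL-divergence ranking referred to here, in standard notation (a reconstruction; $\theta_Q$ is the query model and $\theta_D$ the document model):

```latex
\mathrm{Score}(Q, D) \;=\; -\,KL(\theta_Q \,\|\, \theta_D) \;=\; -\sum_{t} P(t \mid \theta_Q)\, \log \frac{P(t \mid \theta_Q)}{P(t \mid \theta_D)}
```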
Query Modeling
• Baseline QM assigns probability mass uniformly across query terms
• Potential issues
• Not all query terms are equally important
• The query model is extremely sparse
• Solution: query expansion
Original query
Expanded query
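Expanded query models of this kind are usually an interpolation of the original query model and the expansion model; the form below is the standard one, with the mixture weight $\lambda$ an assumption rather than a value read from the slide:

```latex
P(t \mid \theta_Q) \;=\; \lambda\, P(t \mid \tilde{\theta}_Q) \;+\; (1-\lambda)\, P(t \mid \hat{\theta}_Q)
```

where $\tilde{\theta}_Q$ is the original (maximum-likelihood) query model and $\hat{\theta}_Q$ is the expansion model estimated from the sample documents.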
A Query Model from Sample Documents
[Diagram: sample documents → sampling distribution → top K terms → expanded query]
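The diagram above can be sketched in code: aggregate term probabilities across the sample documents as P(t|S) = Σ_D P(t|D)·P(D|S), then keep the top-K terms. This is a minimal sketch under simple assumptions (maximum-likelihood P(t|D), documents as token lists); the function name and inputs are hypothetical, not from the paper.

```python
from collections import Counter

def expansion_model(sample_docs, doc_weights, K=10):
    """Estimate P(t|S) = sum_D P(t|D) * P(D|S) over sample documents,
    keep the top-K terms, and renormalize. Hypothetical helper:
    P(t|D) is the maximum-likelihood estimate n(t,D)/|D|."""
    p_t_s = Counter()
    for doc, w in zip(sample_docs, doc_weights):
        total = len(doc)  # doc is a list of terms
        for t, n in Counter(doc).items():
            p_t_s[t] += w * n / total  # P(t|D) * P(D|S)
    top = p_t_s.most_common(K)
    z = sum(p for _, p in top)  # renormalize over the kept terms
    return {t: p / z for t, p in top}

# Toy usage with two equally weighted sample documents
docs = [["cancer", "risk", "genome"], ["cancer", "dna", "damage"]]
model = expansion_model(docs, [0.5, 0.5], K=3)
```

Truncating to the top K terms before renormalizing keeps the expanded query cheap to score while retaining most of the probability mass.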
Importance of a Sample Document
1. Uniform
• All sample documents are equally important
2. Query-biased
• A sample document’s importance is proportional to its relevance to the query
3. Inverse query-biased
• We reward documents that bring in new aspects
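The three options above can be illustrated with a small sketch. The paper defines the exact estimators; this only shows the idea via simple proportionality, and the inverse variant here (reciprocal of the query score) is one plausible instantiation, not the paper's formula.

```python
def doc_importance(query_scores, mode="uniform"):
    """Three ways to set P(D|S) over sample documents, as named on the
    slide. query_scores[i] is document i's retrieval score for the
    original query (any positive relevance score; hypothetical input).
    - uniform: every sample document counts equally
    - query-biased: proportional to relevance to the query
    - inverse: reward documents less like the query (new aspects)"""
    n = len(query_scores)
    if mode == "uniform":
        w = [1.0] * n
    elif mode == "query-biased":
        w = list(query_scores)
    elif mode == "inverse":
        w = [1.0 / s for s in query_scores]
    else:
        raise ValueError(mode)
    z = sum(w)
    return [x / z for x in w]  # normalize to a distribution P(D|S)

u = doc_importance([1.0, 3.0], "uniform")
qb = doc_importance([1.0, 3.0], "query-biased")
inv = doc_importance([1.0, 3.0], "inverse")
```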
A Query Model from Sample Documents
[Diagram: sample documents → sampling distribution → top K terms → expanded query]
Estimating Term Importance
1. Maximum likelihood estimate
2. Smoothed estimate
3. Ranking function by Ponte (2000)
J. Ponte, “Language models for relevance feedback”, in Advances in Information Retrieval, ed. W.B. Croft, 73-96, 2000.
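The first two estimators can be sketched as follows (Ponte's ranking function is omitted; see the cited paper). The smoothing here is Jelinek-Mercer style interpolation with the collection model, and the weight `beta` is an assumed parameter, not a value from the slides.

```python
from collections import Counter

def term_probs(doc_terms, collection_probs, beta=0.5):
    """Two term-importance estimates for a document:
    1. maximum likelihood: P_ml(t|D) = n(t,D) / |D|
    2. smoothed: (1-beta) * P_ml(t|D) + beta * P(t|C),
       mixing in the collection model (Jelinek-Mercer style).
    beta is a hypothetical smoothing weight."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    mle = {t: n / total for t, n in counts.items()}
    smoothed = {t: (1 - beta) * p + beta * collection_probs.get(t, 0.0)
                for t, p in mle.items()}
    return mle, smoothed

# Toy usage: a three-term document and a tiny collection model
mle, sm = term_probs(["a", "a", "b"], {"a": 0.1, "b": 0.2}, beta=0.5)
```

Smoothing matters here because expansion terms drawn from only a handful of sample documents otherwise get very noisy maximum-likelihood estimates.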
Research Questions
• Can we make use of these sample documents in an effective and theoretically transparent manner?
• What is the effect of lifting the conditional dependence between the original query and expansion terms?
• Can we improve “aspect recall”?
Results: Term Importance

Method                     MAP     MRR
Baseline (no expansion)    0.3576  0.7134
(ML) Maximum Likelihood    0.4449  0.8533
(SM) Smoothed              0.4406  0.8771
(EXP) Ponte Q.Exp.         0.4016  0.8148
• Improvement can be up to 24% in MAP and 23% in MRR
★ Results reported on relevance level 1 (“possibly relevant”); see Table 3 in the paper for results on both relevance levels. P(D|S) is uniform.
Results: Document Importance
• Biasing sampling on the original query hurts MAP, but improves early precision
• SM and EXP estimates of P(t|D) display similar behavior
<num>CE-036</num><query>termites</query><narr>Resources describing termites or ‘white ants’ as well as food identification through vibrations will all contain useful information. Current CSIRO research in termite pest management looks at deterring termites through non-chemical means using the vibrations of wood (termite food) to manipulate their feeding habits.</narr>
<num>CE-035</num><query>nanohouse</query><narr>CSIRO have developed a model house that shows how new materials, products and processes that are emerging from nanotechnology research and development might be applied to our living environment. ... Resources describing molecular and nanoscale components, industrial physics, biomimetics, nanoparticle films, biosensors and molecular electronics would all be relevant to this topic.</narr>
Method               AP
Baseline             0.0451
RM2, blind feedback  0.1290
RM2, sample docs     0.1457
QM, sample docs      0.3810
Top terms for topic CE-035 (nanohouse), three expansions:
• nanohouse, nanotechnology, csiro, cameron, dr, research, technology, gene, control, fiona
• nanohouse, nanotechnology, csiro, technology, research, conference, australia, molecules, chemistry, information
• nanohouse, physics, csiro, nanoscale, nanotechnology, materials, devices, structures, molecular, building
Wrap up
• Method for sampling expansion terms in a query-independent way
• Various expansions based on term and document importance weighting
• Outperforms a high performing baseline as well as query-dependent expansion methods
• Helps to address the “aspect recall” problem
Further Work
• Other ways of exploiting sample documents
• Layout, link structure, document structure, etc.
• Combining terms extracted from blind feedback documents with terms from sample documents
Further Work (2)
• Use expanded query models for expert finding [Balog and de Rijke, CIKM 2008]