USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
Oct. 28, 2010
Feb 01, 2016
2
Revisions of “Topology” on Wikipedia
1st revision:
250th revision:
Current revision:
3
Observable Document Generation Process
In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.
In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms
95th revision (#i-1)    96th revision (#i)
4
How Revision History Analysis Could Help Retrieval
Revision History Analysis
5
Selected Prior Work
• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010.
• M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010.
• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In CIKM, New York, NY, USA, 2009.
6
Revision History Analysis (RHA)
RHA redefines term frequency (TF):
- TF is a key indicator of document relevance
- TF can be naturally integrated into ranking models
BM25:
$$S(Q,D)=\sum_{t\in Q} IDF(t)\cdot\frac{TF(t,D)\cdot(k_1+1)}{TF(t,D)+k_1\left(1-b+b\cdot\frac{|D|}{avgdl}\right)}$$
Language Model:
$$S(Q,D)=D(\ldots)$$
7
Model 1: Steady growth
Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example…..basic examples include compactness and connectedness
Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.
First revision
Current version
8
Model 1 (continued)
9
RHA Global Model: definition
Define the term frequency over the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions.
$$TF_{global}(t,d)=\sum_{j=1}^{n}\frac{c(t,v_j)}{j^{\alpha}}$$
$c(t,v_j)$: frequency of term $t$ in revision $v_j$
$\alpha$: decay factor
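A minimal sketch of the global term frequency, assuming each revision is available as a list of tokens ordered from oldest to newest. The function name `tf_global` and the default `alpha` are illustrative assumptions; the slides do not give the tuned value.

```python
def tf_global(term, revisions, alpha=1.1):
    """Sum c(t, v_j) / j**alpha over revisions v_1..v_n (oldest first).

    `revisions` is a list of token lists; `alpha` is the decay factor
    (the default here is an assumed placeholder, not a tuned value).
    """
    total = 0.0
    for j, tokens in enumerate(revisions, start=1):
        total += tokens.count(term) / (j ** alpha)
    return total


# Toy revision history: "topology" appears from the first revision on,
# "map" only appears in the third revision.
revisions = [
    ["topology", "space"],
    ["topology", "topology", "space"],
    ["topology", "space", "map"],
]
early = tf_global("topology", revisions, alpha=1.0)  # 1/1 + 2/2 + 1/3
late = tf_global("map", revisions, alpha=1.0)        # 0 + 0 + 1/3
```

With the decay applied, a term present since the early revisions accumulates more weight than one introduced late, which is the behaviour the global model is meant to capture.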
10
But… Some pages are different: “Avatar (2009 film)”
1st revision:
500th revision:
Current revision:
11
Model 2: Bursty Growth
Burst of Document (Length) & Change of Term Frequency
Time        Term Frequency “Pandora”   Term Frequency “James Cameron”   Document Length
Nov. 2009   9                          23                               2576
Dec. 2009   25                         50                               6306

Burst of Edit Activity & Associated Events
Month (2009)    Jul.   Aug.   Sep.   Oct.   Nov.   Dec.
Edit Activity   89     224    67     154    232    1892
Events: first photo & trailer released; movie released

Global Model might be insufficient
12
RHA Burst Model: Definition
• A burst resets the decay clock for a term.
• The weight will decrease after a burst.
$$TF_{burst}(t,d)=\sum_{j=1}^{m}\sum_{k=b_j}^{n}\frac{c(t,v_k)}{(k-b_j+1)^{\beta}}$$
$c(t,v_k)$: frequency of term $t$ in revision $v_k$
$(k-b_j+1)^{\beta}$: decay factor for the $j$th burst
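The burst model can be sketched as follows, assuming revisions are token lists ordered oldest first and that `burst_starts` holds the 1-based revision indices $b_j$ at which each detected burst begins. The function name, the reading of $b_j$ as a burst start index, and the default `beta` are assumptions made for illustration.

```python
def tf_burst(term, revisions, burst_starts, beta=1.1):
    """Sum over bursts j and revisions k >= b_j of
    c(t, v_k) / (k - b_j + 1)**beta.

    `burst_starts` lists the assumed 1-based start index b_j of each burst;
    `beta` is the burst decay factor (placeholder default).
    """
    n = len(revisions)
    total = 0.0
    for b_j in burst_starts:
        for k in range(b_j, n + 1):
            count = revisions[k - 1].count(term)
            total += count / ((k - b_j + 1) ** beta)
    return total


# Toy history: "pandora" jumps in frequency at revision 3 (the burst).
revisions = [
    ["film"],
    ["pandora", "film"],
    ["pandora", "pandora", "pandora"],
    ["pandora", "pandora", "pandora", "film"],
]
weight = tf_burst("pandora", revisions, burst_starts=[3], beta=1.0)  # 3/1 + 3/2
```

Because the denominator restarts at 1 for the first revision of every burst, each burst resets the decay clock, and the contribution of later revisions then shrinks again.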
13
Burst Detection (1): Content-based
Relative content change → potential burst
Content-based Burst for “Avatar”
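The slide states only the idea of content-based detection, not a formula. One hedged way to sketch it is to flag a revision when its length changes by more than some fraction relative to the previous revision. Using document length as the content measure, the `threshold` value, and the function name are all assumptions.

```python
def content_bursts(lengths, threshold=0.5):
    """Return 1-based indices of revisions whose relative length change
    versus the previous revision exceeds `threshold` (assumed criterion)."""
    bursts = []
    for j in range(1, len(lengths)):
        prev = lengths[j - 1]
        if prev > 0 and abs(lengths[j] - prev) / prev > threshold:
            bursts.append(j + 1)
    return bursts


# Illustrative revision lengths (made up): a large jump at the 4th revision.
lengths = [1000, 1020, 1050, 2400, 2450]
detected = content_bursts(lengths)
```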
14
Burst Detection (2): Activity Based
Intensive edit activity → potential bursts
Activity-based Burst for “Avatar”
Average revision counts
Deviation
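A minimal sketch of activity-based detection, assuming a period counts as a burst when its edit count exceeds the average count by more than `k` standard deviations. The exact rule, the value of `k`, and the function name are assumptions; the slide names only the average revision count and the deviation as ingredients.

```python
from statistics import mean, pstdev


def activity_bursts(edit_counts, k=1.0):
    """Return 0-based indices of periods whose edit count exceeds
    mean + k * standard deviation (assumed rule)."""
    mu = mean(edit_counts)
    sigma = pstdev(edit_counts)
    return [i for i, c in enumerate(edit_counts) if c > mu + k * sigma]


# Monthly edit activity for "Avatar" (Jul. to Dec. 2009) from the earlier slide.
edits = [89, 224, 67, 154, 232, 1892]
burst_months = activity_bursts(edits)  # only December (index 5) stands out
```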
15
Burst Detection (3): Combined Model
16
Putting it All Together: RHA Term Frequency
Combining global model and burst model
RHA Term Frequency:
$$TF_{rha}(t,D)=\lambda_1\cdot TF_{g}(t,D)+\lambda_2\cdot TF_{b}(t,D)+\lambda_3\cdot TF(t,D)$$
$\lambda_1$, $\lambda_2$, and $\lambda_3$ indicate the weights of the RHA global model, the burst model, and the original term frequency (probability), with
$$\lambda_1+\lambda_2+\lambda_3=1$$
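The linear combination can be sketched directly. The function name and the example weights are illustrative assumptions; the three inputs are the global, burst, and original term frequencies for one term and document.

```python
def tf_rha(tf_g, tf_b, tf, lam1, lam2, lam3):
    """Combine global, burst, and original TF with weights summing to 1."""
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("lambda weights must sum to 1")
    return lam1 * tf_g + lam2 * tf_b + lam3 * tf


# Example weights are placeholders, not tuned values.
combined = tf_rha(tf_g=2.0, tf_b=4.0, tf=3.0, lam1=0.25, lam2=0.25, lam3=0.5)
```

Setting `lam3=1` recovers the original term frequency, so the baseline model is a special case of the combination.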
17
Integrating RHA into Retrieval Models
BM25 + RHA: replace $TF(t,D)$ with $TF_{rha}(t,D)$ in
$$S(Q,D)=\sum_{t\in Q} IDF(t)\cdot\frac{TF(t,D)\cdot(k_1+1)}{TF(t,D)+k_1\left(1-b+b\cdot\frac{|D|}{avgdl}\right)}$$
Statistical Language Models + RHA: replace $P(t,D)$ with $P_{rha}(t,D)$ in
$$S(Q,D)=D(\ldots)$$
RHA Term Probability:
$$P_{rha}(t,D)=\lambda_1\cdot P_{g}(t,D)+\lambda_2\cdot P_{b}(t,D)+\lambda_3\cdot P(t,D)$$
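To illustrate the BM25 integration, the sketch below scores a document from a term-frequency mapping, so that either the raw TF or the RHA term frequency can be supplied. The function name, the dictionary inputs, and the defaults `k1=1.2` and `b=0.75` are assumptions (commonly used values, not the parameters tuned in this work).

```python
def bm25_score(query_terms, tf, idf, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 score of one document.

    `tf` maps term -> term frequency (raw TF or an RHA-adjusted TF),
    `idf` maps term -> inverse document frequency.
    """
    norm = k1 * (1 - b + b * doc_len / avgdl)
    score = 0.0
    for t in query_terms:
        f = tf.get(t, 0.0)
        score += idf.get(t, 0.0) * f * (k1 + 1) / (f + norm)
    return score


idf = {"pandora": 2.0}
baseline = bm25_score(["pandora"], {"pandora": 3.0}, idf, doc_len=100, avgdl=100)
# Passing an RHA term frequency instead of the raw count changes only `tf`.
with_rha = bm25_score(["pandora"], {"pandora": 4.5}, idf, doc_len=100, avgdl=100)
```

Keeping the term frequency as an input leaves the rest of the ranking function untouched, which matches the idea that RHA plugs into existing retrieval models through TF alone.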
18
Experimental Setup
19
Datasets
INEX: well-established forum for structured retrieval tasks (based on Wikipedia collection)
TREC: performance comparison on a different set of queries and general applicability
INEX: 65 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for INEX
TREC: 68 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for TREC
WikiDump
20
Results
21
INEX Results
Model bpref MAP R-precision
BM25 0.354 0.354 0.314
BM25+RHA 0.375 (+5.93%) 0.360 (+1.69%) 0.337 (+7.32%)
LM 0.357 0.370 0.348
LM+RHA 0.372 (+4.20%) 0.378 (+2.16%) 0.359 (+3.16%)
Parameters tuned on INEX query set
BM25: , LM: ,
22
TREC Results
Model bpref MAP NDCG
BM25 0.524 0.548 0.634
BM25+RHA 0.547** (+4.39%) 0.568** (+3.65%) 0.656** (+3.47%)
LM 0.527 0.556 0.645
LM+RHA 0.532 (+0.95%) 0.567 (+1.98%) 0.653 (+1.24%)
Parameters tuned on INEX query set. ** indicates statistically significant differences at the 0.01 significance level with a two-tailed paired t-test.
Lab members manually labeled top 20 results for each topic
BM25: , LM: ,
23
Performance Analysis
Performance improvements on bpref for BM25+RHA over baseline (BM25)
INEX: significant improvement on 40% of queries
TREC: significant improvement on 37% of queries
Ex: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)
INEX TREC
24
Summary
o RHA captures importance signal from document authoring process.
o Introduced RHA term weighting approach.
o Natural integration with state-of-the-art retrieval models.
o Consistent improvement over baseline retrieval models.
25
Thank you!
Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
Research partially supported by:
26
Query Sets and Evaluation Metrics
• Queries and Labels:
– INEX: provided
– TREC: subset of ad-hoc track
• Metrics:
– Bpref (robust to missing judgments)
– MAP: mean average precision
– R-prec: precision at position R
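For reference, two of the metrics above can be sketched with their standard definitions, assuming a ranked list of document ids and a set of relevant ids per query. The function names are illustrative; MAP is the mean of `average_precision` over all queries, and R is the number of relevant documents for the query.

```python
def average_precision(ranking, relevant):
    """Mean of precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)


def r_precision(ranking, relevant):
    """Precision at position R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranking, relevant)  # (1/1 + 2/3) / 2
rp = r_precision(ranking, relevant)        # 1 relevant in the top 2
```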
27
RHA in Statistical Language Models
o (Global Model)
o (Burst Model)
28
Cross validation on INEX
Model bpref MAP R-precision
BM25 0.307 0.281 0.324
BM25+RHA 0.312 (+1.63%) 0.291 (+3.56%) 0.320 (-1.23%)
LM 0.311 0.284 0.348
LM+RHA 0.338 (+8.68%) 0.298 (+4.93%) 0.359 (+0.61%)
5-fold cross validation on INEX 2008 query set
Model bpref MAP R-precision
BM25 0.354 0.354 0.314
BM25+RHA 0.363 (+2.54%) 0.348 (-1.70%) 0.333 (+6.05%)
LM 0.357 0.370 0.348
LM+RHA 0.366 (+2.52%) 0.375 (+1.35%) 0.352 (+1.15%)
5-fold cross validation on INEX 2009 query set