USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
Oct. 28, 2010
Page 1:

USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS

Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Oct. 28, 2010

Page 2:

Revisions of “Topology” on Wikipedia

1st revision:

250th revision:

Current revision:

Page 3:

Observable Document Generation Process

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms

95th revision (#i-1) and 96th revision (#i)

Page 4:

How Revision History Analysis Could Help Retrieval

Revision History Analysis

Page 5:

Selected Prior Work

• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010.

• M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010.

• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In CIKM, New York, NY, USA, 2009.

Page 6:

Revision History Analysis (RHA)

RHA redefines term frequency (TF):
- TF is a key indicator of document relevance
- TF can be naturally integrated into ranking models

BM25:

$$S(Q,D)=\sum_{t\in Q} IDF(t)\cdot\frac{TF(t,D)\cdot(k_1+1)}{TF(t,D)+k_1\left(1-b+b\cdot\frac{|D|}{avgdl}\right)}$$

Language Model:

S(Q, D) = D(…)
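The BM25 formula on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bm25_score`, the dictionary-based inputs, and the default values `k1=1.2` and `b=0.75` are not taken from the slides, and IDF values are assumed to be computed elsewhere and passed in.

```python
def bm25_score(query_terms, tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Score one document D for a query Q with the BM25 formula.

    tf  : dict term -> TF(t, D) (any non-negative number, so an RHA-style
          term frequency could be passed in place of the raw count)
    idf : dict term -> IDF(t), assumed to be precomputed
    k1, b defaults are common conventions, not values from the slides.
    """
    # Length normalisation term: k1 * (1 - b + b * |D| / avgdl)
    norm = k1 * (1.0 - b + b * doc_len / avgdl)
    score = 0.0
    for t in query_terms:
        f = tf.get(t, 0.0)
        if f == 0.0:
            continue  # a term absent from the document contributes nothing
        score += idf.get(t, 0.0) * f * (k1 + 1.0) / (f + norm)
    return score
```

Because TF enters the score only through `tf`, swapping in a different term-frequency definition leaves the rest of the ranking function unchanged.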

Page 7:

Model 1: Steady growth

Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example…..basic examples include compactness and connectedness

Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.

First revision

Current version

Page 8:

Model 1 (continued)

Page 9:

RHA Global Model: definition

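Define the term frequency over the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions.

$$TF_{global}(t,d)=\sum_{j=1}^{n}\frac{c(t,v_j)}{j^{\alpha}}$$

where $c(t,v_j)$ is the frequency of term $t$ in revision $v_j$, $j^{\alpha}$ is the decay factor, and the sum runs over the $n$ revisions of the document.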
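As a rough illustration of the global model, the sum over revisions might be computed as below. The function name `tf_global`, the representation of each revision as a list of tokens (oldest first), and the default `alpha` are assumptions made for this sketch, not details from the slides.

```python
def tf_global(term, revisions, alpha=1.1):
    """TF_global(t, d) = sum over j = 1..n of c(t, v_j) / j**alpha.

    revisions : list of token lists, ordered from the first revision v_1
                to the latest revision v_n (a hypothetical input format).
    alpha     : decay exponent; the default here is arbitrary.
    """
    total = 0.0
    for j, tokens in enumerate(revisions, start=1):
        # c(t, v_j): raw count of the term in revision j,
        # discounted by the decay factor j**alpha
        total += tokens.count(term) / (j ** alpha)
    return total
```

With `alpha = 0` every revision counts equally; larger values of `alpha` concentrate the weight on occurrences in early revisions.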

Page 10:

But… Some pages are different: “Avatar (2009 film)”

1st revision:

500th revision:

Current revision:

Page 11:

Model 2: Bursty Growth

Burst of Document (Length) & Change of Term Frequency

Time | Term Frequency: “Pandora” | Term Frequency: “James Cameron” | Document Length
Nov. 2009 | 9 | 23 | 2576
Dec. 2009 | 25 | 50 | 6306

Burst of Edit Activity & Associated Events

Month (2009) | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
Edit Activity | 89 | 224 | 67 | 154 | 232 | 1892

Associated events: first photo & trailer released; movie released

Global Model might be insufficient

Page 12:

RHA Burst Model: Definition

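• A burst resets the decay clock for a term.
• The weight will decrease after a burst.

$$TF_{burst}(t,d)=\sum_{j=1}^{m}\sum_{k=b_j}^{n}\frac{c(t,v_k)}{(k-b_j+1)^{\beta}}$$

where $c(t,v_k)$ is the frequency of term $t$ in revision $v_k$, $b_j$ is the revision at which the $j$-th of $m$ bursts starts, and $(k-b_j+1)^{\beta}$ is the decay factor for the $j$-th burst.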
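The double sum of the burst model can be sketched as follows. The name `tf_burst`, the token-list representation of revisions, the 1-based `burst_starts` list, and the default `beta` are assumptions introduced for illustration only.

```python
def tf_burst(term, revisions, burst_starts, beta=1.1):
    """TF_burst(t, d) = sum over bursts j of
       sum over k = b_j..n of c(t, v_k) / (k - b_j + 1)**beta.

    revisions    : list of token lists, oldest first (v_1 ... v_n)
    burst_starts : 1-based revision indices b_1 ... b_m at which bursts
                   begin (assumed to come from a separate detection step)
    beta         : decay exponent; the default here is arbitrary.
    """
    n = len(revisions)
    total = 0.0
    for b_j in burst_starts:
        for k in range(b_j, n + 1):
            # The decay clock restarts at each burst: the burst revision
            # itself gets weight 1, later revisions are discounted.
            total += revisions[k - 1].count(term) / ((k - b_j + 1) ** beta)
    return total
```

For example, with a single burst starting at revision 2, a term's count in revision 2 is taken at full weight and its count in revision 4 is divided by `3**beta`.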

Page 13:

Burst Detection (1): Content-based

Relative content change → potential burst

Content-based Burst for “Avatar”
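The detection formula itself did not survive in this transcript, so the following is only one plausible reading of "relative content change": flag a revision when its length differs from the previous revision's length by more than some fraction. The function name `content_bursts`, the use of document length as the content measure, and the `threshold` value are all assumptions, not the authors' definition.

```python
def content_bursts(revision_lengths, threshold=0.1):
    """Return 1-based indices of revisions whose relative length change
    with respect to the previous revision exceeds `threshold`.

    revision_lengths : document length per revision, oldest first
    threshold        : hypothetical cut-off for |len_j - len_{j-1}| / len_{j-1}
    """
    bursts = []
    for j in range(1, len(revision_lengths)):
        prev, cur = revision_lengths[j - 1], revision_lengths[j]
        # Skip empty predecessors to avoid division by zero
        if prev > 0 and abs(cur - prev) / prev > threshold:
            bursts.append(j + 1)  # convert 0-based position to 1-based index
    return bursts
```

The returned indices could serve as the burst start positions of a burst-weighted term frequency.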

Page 14:

Burst Detection (2): Activity Based

Intensive edit activity → potential bursts

Activity-based Burst for “Avatar”

Average revision counts

Deviation
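The slide's formula is missing from the transcript; the surviving labels ("average revision counts", "deviation") suggest a rule that compares the edit count of a period against the average plus some multiple of the deviation. The sketch below follows that reading. The name `activity_bursts`, the multiplier `k`, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def activity_bursts(edit_counts, k=1.0):
    """Return 0-based positions of periods whose edit count exceeds
    average + k * deviation.

    edit_counts : number of revisions per period (e.g. per month)
    k           : hypothetical multiplier on the standard deviation
    """
    if not edit_counts:
        return []
    avg = mean(edit_counts)
    dev = pstdev(edit_counts)  # population standard deviation
    return [i for i, c in enumerate(edit_counts) if c > avg + k * dev]

# Monthly edit activity for "Avatar (2009 film)", Jul. to Dec. 2009,
# as listed on the "Model 2: Bursty Growth" slide.
avatar_edits = [89, 224, 67, 154, 232, 1892]
```

On the Avatar edit counts, `activity_bursts(avatar_edits, k=1.0)` flags only the last position (December), the month with 1892 edits.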

Page 15:

Burst Detection (3): Combined Model

Page 16:

Putting it All Together: RHA Term Frequency
Combining global model and burst model

RHA Term Frequency:

$$TF_{rha}(t,D)=\lambda_1\cdot TF_{g}(t,D)+\lambda_2\cdot TF_{b}(t,D)+\lambda_3\cdot TF(t,D)$$

$\lambda_1$, $\lambda_2$, and $\lambda_3$ indicate the weights of the RHA global model, the burst model, and the original term frequency (probability), with

$$\lambda_1+\lambda_2+\lambda_3=1$$
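The linear combination can be written directly in code. The function name `tf_rha` and the explicit check that the weights sum to one are choices made for this sketch; the slides do not prescribe an implementation.

```python
def tf_rha(tf_g, tf_b, tf, lam1, lam2, lam3):
    """Combine global, burst, and original term frequency:
       TF_rha = lam1 * TF_g + lam2 * TF_b + lam3 * TF,
       with lam1 + lam2 + lam3 = 1."""
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return lam1 * tf_g + lam2 * tf_b + lam3 * tf
```

Setting `lam1 = lam2 = 0` and `lam3 = 1` recovers the original term frequency, so the baseline model is a special case of the combination.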

Page 17:

Integrating RHA into Retrieval Models

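BM25:

$$S(Q,D)=\sum_{t\in Q} IDF(t)\cdot\frac{TF(t,D)\cdot(k_1+1)}{TF(t,D)+k_1\left(1-b+b\cdot\frac{|D|}{avgdl}\right)}$$

+ RHA: $TF_{rha}(t,D)$ takes the place of both occurrences of $TF(t,D)$.

Statistical Language Models:

S(Q, D) = D(…)

+ RHA: $P_{rha}(t,D)$ takes the place of the term probability.

RHA Term Probability:

$$P_{rha}(t,D)=\lambda_1\cdot P_{g}(t,D)+\lambda_2\cdot P_{b}(t,D)+\lambda_3\cdot P(t,D)$$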

Page 18:

Experimental Setup

Page 19:

Datasets

INEX: well-established forum for structured retrieval tasks (based on the Wikipedia collection)
TREC: performance comparison on a different set of queries and general applicability

INEX: 65 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for INEX
TREC: 68 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for TREC

WikiDump

Page 20:

Results

Page 21:

INEX Results

Model | bpref | MAP | R-precision
BM25 | 0.354 | 0.354 | 0.314
BM25+RHA | 0.375 (+5.93%) | 0.360 (+1.69%) | 0.337 (+7.32%)
LM | 0.357 | 0.370 | 0.348
LM+RHA | 0.372 (+4.20%) | 0.378 (+2.16%) | 0.359 (+3.16%)

Parameters tuned on INEX query set

BM25: , LM: ,
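The percentages in parentheses are relative improvements over the corresponding baseline. A small helper (the name `relative_improvement` is introduced here for illustration) reproduces them from the table values:

```python
def relative_improvement(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent,
    rounded to two decimals."""
    return round((improved - baseline) / baseline * 100.0, 2)

# bpref on INEX: BM25 = 0.354, BM25+RHA = 0.375
inex_bm25_bpref_gain = relative_improvement(0.354, 0.375)
```

For the BM25 bpref row this gives (0.375 - 0.354) / 0.354, about 5.93%, matching the table.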

Page 22:

TREC Results

Model | bpref | MAP | NDCG
BM25 | 0.524 | 0.548 | 0.634
BM25+RHA | 0.547** (+4.39%) | 0.568** (+3.65%) | 0.656** (+3.47%)
LM | 0.527 | 0.556 | 0.645
LM+RHA | 0.532 (+0.95%) | 0.567 (+1.98%) | 0.653 (+1.24%)

Parameters tuned on INEX query set. ** indicates a statistically significant difference at the 0.01 significance level (two-tailed paired t-test).

Lab members manually labeled top 20 results for each topic

BM25: , LM: ,

Page 23:

Performance Analysis

Performance improvements on bpref for BM25+RHA over baseline (BM25)

INEX: significant improvement on 40% of queries
TREC: significant improvement on 37% of queries
Ex: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)

[Figure panels: INEX, TREC]

Page 24:

Summary

o RHA captures importance signal from document authoring process.

o Introduced RHA term weighting approach

o Natural integration with state-of-the-art retrieval models.

o Consistent improvement over baseline retrieval models

Page 25:

Thank you!

Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis

Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Research partially supported by:

Page 26:

Query Sets and Evaluation Metrics

• Queries and Labels:
– INEX: provided
– TREC: subset of ad-hoc track

• Metrics:
– Bpref (robust to missing judgments)
– MAP: mean average precision
– R-prec: precision at position R
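Two of the metrics listed above have short standard definitions that can be sketched directly; bpref is left out of this sketch. The function names and the representation of a ranking as a list of document ids with a set of relevant ids are assumptions made for illustration.

```python
def average_precision(ranking, relevant):
    """Average precision for one query: mean of precision@rank taken at
    each rank where a relevant document appears, divided by the total
    number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of average precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def r_precision(ranking, relevant):
    """R-prec: precision at position R, where R = number of relevant docs."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r
```

For a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, average precision is (1/1 + 2/3) / 2 and R-precision is 1/2.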

Page 27:

RHA in Statistical Language Models

o (Global Model)

o (Burst Model)

Page 28:

Cross validation on INEX

Model | bpref | MAP | R-precision
BM25 | 0.307 | 0.281 | 0.324
BM25+RHA | 0.312 (+1.63%) | 0.291 (+3.56%) | 0.320 (-1.23%)
LM | 0.311 | 0.284 | 0.348
LM+RHA | 0.338 (+8.68%) | 0.298 (+4.93%) | 0.359 (+0.61%)

5-fold cross validation on INEX 2008 query set

Model | bpref | MAP | R-precision
BM25 | 0.354 | 0.354 | 0.314
BM25+RHA | 0.363 (+2.54%) | 0.348 (-1.70%) | 0.333 (+6.05%)
LM | 0.357 | 0.370 | 0.348
LM+RHA | 0.366 (+2.52%) | 0.375 (+1.35%) | 0.352 (+1.15%)

5-fold cross validation on INEX 2009 query set