Top Banner
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification Tu Ngoc Nguyen and Nattiya Kanhabua 1
32
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Leveraging Dynamic Query Subtopics for

Time-aware Search Result Diversification

Tu Ngoc Nguyen and Nattiya Kanhabua

1

Page 2: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Motivation

• Query underlying aspects change over time

2

Page 3: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

I. Subtopic Mining

A. From Query Logs

B. From Document Collection

II. Time-aware Diversifying Models

III. Experiment

3

Outline

Page 4: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

4

temporally

ambiguous, multi-

faceted queries

subtopic mining

query log

document

collection

time-aware

diversification

d1

d2

.

.

dn

dynamic

subtopics

t

System Pipeline

1.a

Page 5: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

co-click bipartite

q1

q2

.

.

qn

related queries

querying

time t

clustering

subtopics

querying

time t + 1

5

Subtopic Mining Approach: Query Log

Page 6: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

march madness

began

14/03/2006

ncaa women

tournament began

18/03/2006 01/04/2006

final four began

query: ncaa

6

Subtopic Dynamics: Query Log

Page 7: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

7

temporally

ambiguous, multi-

faceted queries

subtopic mining

query log

document

collection

time-aware

diversification

d1

d2

.

.

dn

dynamic

subtopics

t

System Pipeline

1.b

Page 8: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

– Probabilistic subtopic modeling: Latent Dirichlet Allocation (LDA)

8

Subtopic Mining: Document Collection

query: apple

Page 9: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

9

Subtopic Dynamics: Document Collection

subtopic dynamics : Blogs08 query volume: GoogleTrend

Page 10: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

10

temporally

ambiguous, multi-

faceted queries

subtopic mining

query log

document

collection

time-aware

diversification

d1

d2

.

.

dn

dynamic

subtopics

t

System Pipeline

2.

Page 11: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

- Probabilistic model:

- Pr(c|q): weight of certain subtopic in a query

- Pr(q|d), Pr(d|q): relation between document and query

- Pr(c|d), Pr(d|c): relation between document and subtopic

- IA-Select objective function:

11

IA-Select Model [Vallet and Castells. 2012]

document

relevance novelty

Page 12: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• MMR [Carbonell and Goldstein. 98]

- diversify based on similarity of document contents

• IA-Select [Agrawal et al. 2009]

- diversify based on the taxonomy of subtopic categories

• xQuaD [Santos et al. 2010]

- general form of IA-Select

- define objective function as a mixture of relevance and diversity

probabilities

• Topic richness [Dou et al. 2011]

- general form of xQuaD and IA-Select models

- accepts topics from multiple sources

12

Search Result Diversification Models

Page 13: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

- Probabilistic model:

- Pr(c|q): weight of certain subtopic in a query

- Pr(q|d), Pr(d|q): relation between document and query

- Pr(c|d), Pr(d|c): relation between document and subtopic

- xQuaD objective function:

13

xQuaD Model [Vallet and Castells. 2012]

noveltydocument-topic

relevance

document-query

relevance

Page 14: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• Temp-IASelect

- IA-Select objective function:

- Temp-IASelect objective function:

14

Temporal Diversifying Models

Page 15: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• Temp-xQuaD

- xQuaD objective function:

- Temp-xQuaD objective function:

• Temp-topic richness

- generalization of temp-xQuaD and temp-IASelect15

Temporal Diversifying Models

Page 16: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• TREC Blogs08 Collection

- crawled from Jan 2008 to Feb 2009

- clean HTML tags using HtmlCleaner and Boilerpipe libraries

- index using Lucene Core

- document’s publication date extracted from:

- Blog content

- URL

- Retrieval date

16

Experiment Settings

Page 17: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• Retrieval baseline:

- Okapi BM25

• Relevance assessments:

- human assessment

- binary relevance judgment, follows TREC Diversity Track 2009 and 2011

- 2 dimensional assessment: relevance and time

- exclude topics mined from query log (time gap between AOL and

Blogs08)

- top 10 most probability words represents a topic.

• Querying-time points:

- how popular is the query at particular time t

- how different to the previous time slice t-1

17

Experiment Settings

Page 18: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

18

Relevance assessments

Document Publication

Date

Subtopic Hitting time Relevance

Title: It’s The Most Wonderful Time Of The

Year

Content: The greatest sports week of the

year is upon us, and Chris’s Sports Blog is

ready. Check back daily for coverage of the

ACC and NCAA Tournament, tips on how to

fill out your bracket …

2008-03-17 ncaa basketball

tournament

2008-03 1

Title: Is there a bigger joke than the NCAA?

Content: Each year the NCAA discovers a

new way to make a bigger ass of itself than

the previous one. This years specialty is to

bar from NCAA playoffs in every sport any

school who persists in using "unacceptable"

team names and school mascots, by their

exclusive definition…

2005-08-22 ncaa basketball

tournament

2008-03 0

Title: Apple Quince Jam

Content: The apple quince is a fruit that

ripens in the period from October to

November, it is an apple with a strange

shape: it looks like a pear and apple, and it is

lumpy…

2007-11-15 apple jam 2008-03 1

Page 19: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• α-NDCG

- adding diversity and novelty to nDCG

• Intent-Aware Precision (Precision-IA)

- intent-aware version of Precision

- treat subtopic as distinct interpretation of query

• Intent-Aware Expected Reciprocal Rank (ERR-IA)

- based on cascade model of search

19

Evaluation Metrics

Page 20: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

20

Experimental Results

* baselines with dynamic subtopic mining

Page 21: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

21

Experimental Results

△ (p < 0.05), △△ (p < 0.01)

Page 22: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Conclusion• studied temporally ambiguous and multi-faceted queries

- subtopic temporal variability

- subtopic mining from two different sources (query logs, document

collection)

• propose time-aware search results diversification frameworks

• Model and predict the subtopic change

• Combine diversifying by subtopics and time in a unified framework

22

Future work

Page 23: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

THANK YOU.

23

Page 24: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Settings

• Estimate natural number of subtopics

– Suresh et al. 2010 view LDA as matrix factorization mechanism

– Cd×w = M1d×t × M2t×w

• d: number of document in the corpus

• w: size of vocabulary

• t: the number of topics

– optimum t is with minimal divergence value

• CM1 is the distribution of singular values of M1

• CM2 is obtained by normalize vector L · M2, L is 1 x D vector of lengths of

each document in C.

24

Page 25: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• New documents appear all the time

• Document content change over time

• Queries and query volumes change over time

• Example: [Kulkarni et al. 2011]

25

march madness

ncaa

Motivation

Page 26: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Query Dynamic Metrics

• Kendall τ Coefficient based:

• Jaccard Coefficient based:

26

Page 27: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Cluster Subtopic Candidates

• Clustering approach [Song et al. 2011]:

– step 1: Construct a similarity matrix of the related queries

– step 2: Cluster using Affinity Propagation algorithm

– step 3: Extract a set of exemplars as subtopics of the query

• Similarity metrics:

– lexical similarity:

• keywords and cosine similarity

– co-click similarity:

• based on fraction of common clicks

– semantic similarity:

• use WordNet as external KB

27

Page 28: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

• Vector-based:

– Cosine Similarity

• Bag of words-based:

– Jaccard Coefficient

• Ranked list of words-based:

– Kendall τ Coefficient -based:

• Multinomial distribution-based:

– Kullback-Leibler Divergence

– Jensen-Shannon Divergence

Topic Similarity Metrics [Kim and Oh. 2011]

28

Page 29: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Subtopic Mining Approach

• Dynamic queries: select 57 out of 61 queries from the AOL query log i.e.

yearly recurrent or time-independent.

• Settings

– partition collection into 14 one-month length time slices

– training data in time slice ti is top 2000 documents D with d ∈ D,

pubDate(d) ∈ t

– number of subtopic is preset in the range from 5 to 20

• Subtopic weight:

– weight w(c) is the probability that a given query q implies subtopic c

29

Page 30: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Temporal Document Collection

• TREC Blogs08 Collection

- crawled from Jan 2008 to Feb 2009

- clean HTML tags using HtmlCleaner and Boilerpipe libraries

- index using Lucene Core

- document’s publication date extracted from:

- Blog content

- URL

- Retrieval date

30

Page 31: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Subtopic Evaluation

• 61 queries: 51 event-related queries, 10 standard ambiguous queries

– aspect removed e.g. march madness brackets → march madness

• Subtopic evaluation metrics [Radlinski et al. 2010]:

– coherence

– distinctness

– plausibility

– completeness

31

Page 32: Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Subtopic Evaluation

- Perplexity: a measure of the ability of a model to generalize documents

-

- use holdout validation with 90% data for training and 10% for testing

- randomly select 20 out of 57 queries at a random time slice

32