
Dirichlet Mixtures for Query Estimation in Information Retrieval

Jan 01, 2016

Transcript
Page 1: Dirichlet Mixtures for Query Estimation in Information Retrieval

NTNU Speech Lab

Dirichlet Mixtures for Query Estimation in Information Retrieval

Mark D. Smucker, David Kulp, James Allan

Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts, Amherst.

Presented by Yi-Ting

Page 2: Dirichlet Mixtures for Query Estimation in Information Retrieval


Outline

• Introduction
• Methods and Materials
• Experiments
• Conclusion

Page 3: Dirichlet Mixtures for Query Estimation in Information Retrieval


Introduction (1/2)

• Most generative models need to be smoothed to avoid zero probabilities

• Smoothing’s goal goes beyond avoiding zero probabilities: it aims to produce better probability estimates for all words

• With queries being short and documents being relatively long, larger performance gains are likely to be had from improving the query model than from improving the document model.

• This paper focuses on automatic methods to estimate query models without user interaction.

Page 4: Dirichlet Mixtures for Query Estimation in Information Retrieval


Introduction (2/2)

• While never called smoothing methods, both automatic query expansion and local feedback are effectively smoothing techniques

• These methods attempt to better estimate their model of the user’s query

• Sjölander et al. developed a sophisticated method, Dirichlet mixtures, for smoothing multinomial models in bioinformatics

• Given a sample, Dirichlet mixtures estimate the sample’s true model using prior knowledge

• This paper will show that relevance models can be seen as a special case of Dirichlet mixtures

Page 5: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Text Modeling and Retrieval
– A multinomial model of text specifies a probability for each word in the vocabulary V
– A standard approach to parameter estimation is maximum likelihood estimation (MLE):

$P(w|M_T) = \frac{T(w)}{|T|}$

where $T(w)$ is the number of times word $w$ occurs in the text $T$ and $|T|$ is the length of $T$

– The MLE model has zero probabilities for all words not in the sample of text
– This is a problem for document retrieval
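As a sketch, the MLE estimate above can be computed directly from raw term counts. The toy text and all names here are illustrative, not from the paper:

```python
from collections import Counter

def mle_model(text_tokens):
    """Maximum likelihood estimate: P(w | M_T) = T(w) / |T|."""
    counts = Counter(text_tokens)
    total = len(text_tokens)
    return {w: c / total for w, c in counts.items()}

sample = "the cat sat on the mat".split()
model = mle_model(sample)
print(model["the"])     # 2 occurrences out of 6 tokens
print("dog" in model)   # unseen words get no probability at all
```

The second print illustrates the zero-probability problem the slide names: any word absent from the sample is simply missing from the MLE model.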

Page 6: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Document Retrieval
– Documents are ranked by how similar they are to the query
– The cross entropy measures how well the document model encodes the query model:

$H(M_Q \,\|\, M_D) = -\sum_{w \in V} P(w|M_Q)\, \log P(w|M_D)$

– When the query model is the MLE model, cross entropy ranks equivalently to query likelihood:

$P(Q|M_D) = \prod_{w \in Q} P(w|M_D)^{Q(w)}$

– The zero probabilities must be eliminated from the document models
– The MLE model of a query is inherently a poor representation of the true model that generated it
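The rank equivalence can be checked numerically: with an MLE query model, lower cross entropy corresponds to higher log query likelihood. The document models below are hypothetical smoothed distributions over a toy vocabulary:

```python
import math
from collections import Counter

def cross_entropy(query_model, doc_model):
    """H(M_Q || M_D) = -sum_w P(w|M_Q) log P(w|M_D)."""
    return -sum(p * math.log(doc_model[w]) for w, p in query_model.items())

def log_query_likelihood(query_tokens, doc_model):
    """log P(Q|M_D) = sum over query tokens of log P(w|M_D)."""
    return sum(math.log(doc_model[w]) for w in query_tokens)

query = "cat mat".split()
q_mle = {w: c / len(query) for w, c in Counter(query).items()}

# two hypothetical (already smoothed, zero-free) document models
doc_a = {"cat": 0.4, "mat": 0.3, "dog": 0.3}
doc_b = {"cat": 0.1, "mat": 0.1, "dog": 0.8}

# the document with lower cross entropy has higher query likelihood
better_by_ce = cross_entropy(q_mle, doc_a) < cross_entropy(q_mle, doc_b)
better_by_ql = log_query_likelihood(query, doc_a) > log_query_likelihood(query, doc_b)
print(better_by_ce, better_by_ql)
```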

Page 7: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Dirichlet Prior Smoothing
– A natural fit as a prior for the multinomial is the Dirichlet density
– The Dirichlet density has the same number of parameters as the multinomials for which it is a prior
– Each multinomial is weighted by its probability given the observed text and the Dirichlet density
– This estimate is the mean posterior estimate:

$P(w|M_T) = \int_M P(w|M)\, P(M|T,\alpha)\, dM$

– Which reduces to:

$P(w|M_T) = \frac{T(w) + \alpha_w}{|T| + |\alpha|}$
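A minimal sketch of the reduced, pseudo-count form of the mean posterior estimate, with a made-up prior parameter vector alpha over a toy vocabulary:

```python
from collections import Counter

def dirichlet_prior_estimate(text_tokens, alpha):
    """Mean posterior estimate: P(w|M_T) = (T(w) + alpha_w) / (|T| + |alpha|)."""
    counts = Counter(text_tokens)
    alpha_total = sum(alpha.values())  # |alpha|
    total = len(text_tokens)           # |T|
    return {w: (counts[w] + a) / (total + alpha_total) for w, a in alpha.items()}

# hypothetical prior pseudo-counts over a tiny vocabulary
alpha = {"the": 2.0, "cat": 1.0, "dog": 1.0, "mat": 1.0, "sat": 0.5, "on": 0.5}
model = dirichlet_prior_estimate("the cat sat on the mat".split(), alpha)
print(model["dog"])  # unseen word, but nonzero thanks to its pseudo-count
```

Because the prior contributes `alpha_w` "pseudo" occurrences of every vocabulary word, no word with a positive prior parameter ends up with zero probability.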

Page 8: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Dirichlet Prior Smoothing
– The bioinformatics community’s common name for this equation is pseudo-counts
– The parameters of the Dirichlet density can be determined using maximum likelihood estimation
– The parameters of a Dirichlet density can be represented as a multinomial probability distribution M and weight $m = |\alpha|$. Thus, with $P(w|M) = \alpha_w / |\alpha|$:

$P(w|M_T) = \frac{T(w) + m\, P(w|M)}{|T| + m}$

– The machine learning community terms this formulation of Dirichlet prior smoothing the m-estimate
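The m-estimate is just a reparameterization of the pseudo-count form, which can be verified numerically. The alpha vector below is the same illustrative one as before, not from the paper:

```python
from collections import Counter

def m_estimate(text_tokens, prior_model, m):
    """m-estimate: P(w|M_T) = (T(w) + m*P(w|M)) / (|T| + m)."""
    counts = Counter(text_tokens)
    total = len(text_tokens)
    return {w: (counts[w] + m * p) / (total + m) for w, p in prior_model.items()}

alpha = {"the": 2.0, "cat": 1.0, "dog": 1.0, "mat": 1.0, "sat": 0.5, "on": 0.5}
m = sum(alpha.values())                       # equivalent sample size m = |alpha|
prior = {w: a / m for w, a in alpha.items()}  # multinomial M = alpha / |alpha|

sample = "the cat sat on the mat".split()
est = m_estimate(sample, prior, m)
# identical to adding the raw pseudo-counts alpha_w directly:
# (T("the") + alpha_the) / (|T| + |alpha|) = (2 + 2.0) / (6 + 6.0)
print(est["the"])
```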

Page 9: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Dirichlet Prior Smoothing
– Dirichlet prior smoothing is a form of linear interpolated smoothing:

$P(w|M_D) = (1-\lambda)\, P(w|D) + \lambda\, P(w|C)$

$1-\lambda = \frac{|T|}{|T| + m}$

– Thus Dirichlet prior smoothing can be seen as the mixing of two multinomial models
– The amount of mixing depends on the length of the text relative to the Dirichlet prior’s equivalent sample size m
– Common practice in IR is to use the collection model and empirically select m
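A sketch of the interpolation view, checking that it agrees with the pseudo-count form when the prior pseudo-counts are $m \cdot P(w|C)$. The collection model and document are toy values:

```python
from collections import Counter

def dirichlet_as_interpolation(doc_tokens, collection_model, m):
    """Dirichlet prior smoothing as linear interpolation:
    P(w|M_D) = (1-lam)*P(w|D) + lam*P(w|C), with 1-lam = |D|/(|D|+m)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    lam = m / (n + m)
    return {w: (1 - lam) * (counts[w] / n) + lam * pc
            for w, pc in collection_model.items()}

# hypothetical collection model over a tiny vocabulary
collection = {"the": 0.4, "cat": 0.2, "dog": 0.2, "mat": 0.2}
doc = "the cat the mat".split()
smoothed = dirichlet_as_interpolation(doc, collection, m=2.0)
# matches the pseudo-count form (T(w) + m*P(w|C)) / (|T| + m):
print(smoothed["cat"], (1 + 2.0 * 0.2) / (4 + 2.0))
```

Note how a longer document shrinks lambda, so more of the original sample is retained; this is exactly the length-dependent mixing the slide describes.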

Page 10: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Dirichlet Mixtures
– Rather than use only one Dirichlet density to provide prior information, a mixture of Dirichlet densities is used as the prior:

$\Theta = q_1 \rho_1 + \cdots + q_n \rho_n$

– Where the $\rho_i$ are the individual Dirichlet densities and the $q_i$ are known as the mixture coefficients
– Each $\rho_i$ has its own parameters $\alpha_i$
– A Dirichlet mixture allows different densities to exert different prior weight with varying equivalent sample sizes:

$P(w|M_T) = \sum_{i=1}^{n} \frac{T(w) + \alpha_{i,w}}{|T| + |\alpha_i|}\, P(\rho_i|T,\Theta)$
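A sketch of the mixture estimate above. Computing the component posteriors $P(\rho_i|T,\Theta)$ requires the Dirichlet-multinomial marginal likelihood, so here they are assumed precomputed and passed in; the two "aspect" alpha vectors are invented for illustration:

```python
from collections import Counter

def dirichlet_mixture_estimate(text_tokens, alphas, posteriors):
    """P(w|M_T) = sum_i (T(w) + alpha_{i,w}) / (|T| + |alpha_i|) * P(rho_i|T,Theta).
    `posteriors` are assumed precomputed P(rho_i|T,Theta) values summing to 1."""
    counts = Counter(text_tokens)
    n = len(text_tokens)
    vocab = set().union(*[set(a) for a in alphas])
    model = {w: 0.0 for w in vocab}
    for alpha, post in zip(alphas, posteriors):
        a_total = sum(alpha.values())  # |alpha_i|
        for w in vocab:
            model[w] += post * (counts[w] + alpha.get(w, 0.0)) / (n + a_total)
    return model

# two hypothetical prior densities ("aspects") over the same vocabulary
alphas = [{"cat": 3.0, "dog": 3.0, "mat": 1.0},   # animal-heavy aspect
          {"cat": 0.5, "dog": 0.5, "mat": 5.0}]   # mat-heavy aspect
model = dirichlet_mixture_estimate("cat mat".split(), alphas, posteriors=[0.7, 0.3])
print(sum(model.values()))  # a proper distribution over the vocabulary
```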

Page 11: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Dirichlet Mixtures
– Dirichlet mixtures, like the single density Dirichlet prior smoothing, determine the degree to which the original sample is retained based on its size
– The parameters of a mixture of Dirichlet densities are determined using expectation maximization (EM)
– Dirichlet mixtures can be rewritten as:

$P(w|M_T) = \sum_{i=1}^{n} P(w|T,\rho_i)\, P(\rho_i|T,\Theta) = \sum_{i=1}^{n} \big[(1-\lambda_i)\, P(w|T) + \lambda_i\, P(w|M_i)\big]\, P(\rho_i|T,\Theta)$

$\lambda_i = 1 - \frac{|T|}{|T| + |\alpha_i|}, \quad P(w|T) = \frac{T(w)}{|T|}, \quad m_i = |\alpha_i|, \quad P(w|M_i) = \frac{\alpha_{i,w}}{|\alpha_i|}$
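The per-component rewrite can be verified: interpolating the sample MLE with the component's multinomial using $\lambda_i$ gives exactly the pseudo-count estimate for that component. Toy alpha values, as before:

```python
from collections import Counter

def component_direct(counts, n, alpha):
    """One mixture component, pseudo-count form: (T(w)+alpha_w)/(|T|+|alpha|)."""
    a_tot = sum(alpha.values())
    return {w: (counts[w] + a) / (n + a_tot) for w, a in alpha.items()}

def component_interpolated(counts, n, alpha):
    """Same component as (1-lam)*P(w|T) + lam*P(w|M_i), lam = 1 - |T|/(|T|+|alpha|)."""
    a_tot = sum(alpha.values())
    lam = 1 - n / (n + a_tot)
    return {w: (1 - lam) * (counts[w] / n) + lam * (a / a_tot)
            for w, a in alpha.items()}

tokens = "cat mat cat".split()
counts, n = Counter(tokens), len(tokens)
alpha = {"cat": 1.0, "dog": 2.0, "mat": 1.0}
direct = component_direct(counts, n, alpha)
interp = component_interpolated(counts, n, alpha)
print(all(abs(direct[w] - interp[w]) < 1e-9 for w in alpha))
```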

Page 12: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Dirichlet Mixtures
– The spirit of Dirichlet mixtures is to smooth a text sample by finding a set of models and mixing them with the text sample in proportion to their similarity with the sample:

$P(w|M_T) = (1-\lambda)\, P(w|T) + \lambda \sum_{M_i \in \mathcal{M}} P(M_i|T)\, P(w|M_i)$

– An entire family of smoothing methods could be developed and studied by determining:

1. $\lambda$: How much to discount the original sample

2. $\mathcal{M}$: The set of models

3. $P(M_i|T)$: How to weight each model

Page 13: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Relevance Models
– As used for ad-hoc retrieval, relevance models is a local feedback technique:

$P(w|M_R) = \sum_{i=1}^{k} P(D_i|Q)\, P(w|D_i), \quad P(D_i|Q) = \frac{P(Q|D_i)}{\sum_{j=1}^{k} P(Q|D_j)}$

– The original query model is often mixed with the relevance model to help keep the query “focused”:

$P(w|M_Q) = (1-\lambda)\, P(w|Q) + \lambda\, P(w|M_R)$

– Which is a linear interpolated smoothing of the query with the relevance model (a special case of Dirichlet mixtures):

$P(w|M_Q) = \sum_{i=1}^{k} P(D_i|Q)\, \frac{Q(w) + m\, P(w|D_i)}{|Q| + m}$
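A sketch of the relevance model estimate, with hypothetical smoothed models of the top feedback documents (the document models and query are illustrative):

```python
import math

def relevance_model(query_tokens, doc_models):
    """P(w|M_R) = sum_i P(D_i|Q) P(w|D_i),
    with P(D_i|Q) = P(Q|D_i) / sum_j P(Q|D_j)."""
    likelihoods = [math.prod(dm[w] for w in query_tokens) for dm in doc_models]
    z = sum(likelihoods)
    posteriors = [lk / z for lk in likelihoods]
    vocab = set().union(*doc_models)
    return {w: sum(p * dm.get(w, 0.0) for p, dm in zip(posteriors, doc_models))
            for w in vocab}

# hypothetical smoothed models of the top-2 feedback documents
docs = [{"cat": 0.5, "mat": 0.3, "dog": 0.2},
        {"cat": 0.1, "mat": 0.1, "dog": 0.8}]
rm = relevance_model("cat mat".split(), docs)
print(rm["cat"])  # dominated by the first, more query-like document
```

The document that better explains the query receives nearly all the posterior weight, so the relevance model leans heavily toward it.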

Page 14: Dirichlet Mixtures for Query Estimation in Information Retrieval


Methods and Materials

• Relevance Models

$P(w|M_Q) = \sum_{i=1}^{k} P(D_i|Q)\, \big[(1-\lambda)\, P(w|Q) + \lambda\, P(w|D_i)\big]$

$1-\lambda = |Q| / (|Q| + m)$

$= (1-\lambda)\, P(w|Q) \sum_{i=1}^{k} P(D_i|Q) + \lambda \sum_{i=1}^{k} P(D_i|Q)\, P(w|D_i)$

$P(w|M_Q) = (1-\lambda)\, P(w|Q) + \lambda \sum_{i=1}^{k} P(D_i|Q)\, P(w|D_i)$

Page 15: Dirichlet Mixtures for Query Estimation in Information Retrieval


Experiments

• Topics and Collection
– The topics used for the experiments consist of TREC topics 351-450, which are the ad-hoc topics for TREC 7 and 8
– TREC topics consist of a short title, a sentence-length description, and a paragraph-sized narrative
– The experiments use the titles and descriptions separately
– The collection for the TREC 7 and 8 topics consists of TREC volumes 4 and 5 minus the CR subcollection
– This 1.85GB, heterogeneous collection contains 528,155 documents
– The collection and queries are preprocessed in the same manner

Page 16: Dirichlet Mixtures for Query Estimation in Information Retrieval


Experiments

• The first experiment was to use Dirichlet mixtures in a manner similar to relevance models for query estimation using local feedback

• The second experiment examined the effect of remixing the models produced by both methods with the original model

• Query likelihood with Dirichlet prior smoothing formed the baseline retrieval

• A compromise setting of m=1500 was used

• The baseline was used to determine the top K documents for blind feedback used by both RM and DM

Page 17: Dirichlet Mixtures for Query Estimation in Information Retrieval


Experiments

Page 18: Dirichlet Mixtures for Query Estimation in Information Retrieval


Experiments

Page 19: Dirichlet Mixtures for Query Estimation in Information Retrieval


Conclusion

• The use of the Dirichlet mixtures smoothing technique is investigated in the text domain by applying the technique to the problem of query estimation

• On some queries, Dirichlet mixtures perform very well, which shows that there may be value in utilizing aspect-based prior information.