Smoothing Methods for LM in IR
Alejandro Figueroa
Dec 18, 2015
1
2
Outline
• The linguistic phenomena behind the retrieval of documents.
• Language Modeling Approach.
• Smoothing methods:
  – Overview.
  – Methods.
  – Parameter setting.
• Interpolation vs. Back-off.
• Comparison of methods.
• Combination of methods.
• Personal outlook and conclusions.
3
The Linguistic Phenomena behind IR
• "Reducing Information Variation on Texts" (Agata Savary and Christian Jacquemin).
• Work in our QA Group – DFKI.
4
Information Variation
• The problem: simple keyword matching is not enough to retrieve the best documents for a query. For example, "When was Albert Einstein born?":
  – The Nobel prize of physics Albert Einstein was born in 1879 in Ulm, Germany.
  – Born: 14 March 1879 in Ulm, Württemberg, Germany.
  – Physics Nobel prize Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879.
  – Died 18 Apr 1955 (born 14 Mar 1879) German-American physicist.
• The same information can be expressed in several ways.
5
Information Variation
• Kinds of variation:
  – Graphic: "14 March 1879" and "14 Mar 1879".
  – Morphological: "Physics nobel prize".
  – Syntactic: "German-American physicist".
  – Semantic: "Albert Einstein was born at Ulm" and "German-American physicist".
• Appropriateness:
  – Precision.
  – Economy.
6
Language Modeling Approach
• "A Study of Smoothing Methods for Language Models Applied to Information Retrieval" (Chengxiang Zhai and John Lafferty).
7
Language Modeling
• Rank documents by the probability that a query Q was generated by a probabilistic model based on a document d:

  Q = q1 q2 ... qn,  d = d1 d2 ... dm

  p(d|q) ∝ p(q|d) · p(d)

• Uni-gram model:

  P(q|d) = ∏_{i=1..n} P(qi|d)

• Problem: for any query term unseen in d, the MLE gives P(qi|d) = 0, so the whole product P(q|d) collapses to 0.
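To make the ranking rule concrete, here is a minimal Python sketch (not from the slides) of uni-gram query-likelihood scoring. The function name and the mixing weight lam=0.5 are illustrative; the interpolation with the collection model is one of the smoothing methods discussed later, and it is what keeps an unseen query term from zeroing out the product:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Log p(q|d) under a uni-gram model, smoothed by linear
    interpolation with the collection model so unseen query terms
    do not make the score -infinity."""
    doc_counts = Counter(doc_terms)
    col_counts = Counter(collection_terms)
    score = 0.0
    for q in query_terms:
        p_ml = doc_counts[q] / len(doc_terms)           # MLE from the document
        p_col = col_counts[q] / len(collection_terms)   # collection model
        score += math.log((1 - lam) * p_ml + lam * p_col)
    return score
```

A document that actually contains the query terms receives a higher log-likelihood than one that matches only through the collection model.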
8
Language Modeling
• Smoothing methods make use of two probabilities in the model: Ps(w|d) for the words seen in d, and Pu(w|d) for the unseen words, with

  Pu(qi|d) = αd · P(qi|C)

• The query log-likelihood then decomposes as:

  log p(q|d) = Σ_{i=1..n} log p(qi|d)
             = Σ_{i: c(qi;d)>0} log [ Ps(qi|d) / Pu(qi|d) ] + Σ_{i=1..n} log Pu(qi|d)
9
Language Modeling
• Substituting Pu(qi|d) = αd · P(qi|C):

  log p(q|d) = Σ_{i: c(qi;d)>0} log [ Ps(qi|d) / (αd · P(qi|C)) ] + n · log αd + Σ_{i=1..n} log P(qi|C)

• The first sum is carried out over the matched terms only; the last sum does not depend on the document.
• αd acts as a length normalizer: longer documents => less smoothing, but longer documents => greater penalty!
10
Smoothing Methods
11
Overview
• The problem: adjust the MLE to compensate for data sparseness.
• The role of smoothing:
  – Make the LM more accurate.
  – Explain the non-informative words in the query.
• Goals of the work:
  – How sensitive is retrieval performance to the smoothing of a document LM?
  – How should the model and its parameters be chosen?
12
Overview
• The unsmoothed model is the MLE:

  P_ml(w|d) = c(w;d) / Σ_{w'∈V} c(w';d)

• The general form of a smoothed model:

  P(w|d) = Ps(w|d)        if word w is seen in d
           αd · P(w|C)    otherwise

  with αd chosen so that the probabilities sum to one:

  αd = ( 1 − Σ_{w: c(w;d)>0} Ps(w|d) ) / ( 1 − Σ_{w: c(w;d)>0} P(w|C) )
13
Overview
• Smoothing: tackles the effect of statistical variability in small training sets.
• Discounting: the relative frequencies of seen events are discounted; the gained probability mass is then distributed over the unseen words.
14
Smoothing Methods
• Based on the Good-Turing idea: estimate the probability of new events by taking the count of singleton events and dividing it by the total number of events.
15
Good-Turing Idea
• The adjusted count of a term with frequency tf:

  tf* = (tf + 1) · E(N_{tf+1}) / E(N_tf)

• The probability of a term with frequency tf is then:

  P_GT(t|d) = tf* / Nd = (tf + 1) · S(N_{tf+1}) / ( S(N_tf) · Nd )

  N_tf = number of terms with frequency tf in a document.
  E(N_tf) = expected value of N_tf; S(N_tf) is a smoothed estimate of it.
  Nd = total number of terms occurring in d.
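A minimal Python sketch of the Good-Turing estimate above. It uses the raw frequency-of-frequency counts N_tf directly in place of E(N_tf); real implementations smooth these counts (e.g. by curve fitting) first, and the fallback for missing counts is an assumption of this sketch:

```python
from collections import Counter

def good_turing_prob(term, doc_terms):
    """Good-Turing probability of `term` in a document.
    tf* = (tf + 1) * N_{tf+1} / N_tf, then P = tf* / Nd."""
    counts = Counter(doc_terms)
    n_d = len(doc_terms)                     # Nd: total terms in d
    freq_of_freq = Counter(counts.values())  # N_tf
    tf = counts[term]
    if tf == 0:
        # total mass reserved for unseen events: N_1 / Nd
        return freq_of_freq.get(1, 0) / n_d
    n_tf = freq_of_freq.get(tf, 0)
    n_tf1 = freq_of_freq.get(tf + 1, 0)
    if n_tf == 0 or n_tf1 == 0:
        return tf / n_d                      # fall back to the MLE
    tf_star = (tf + 1) * n_tf1 / n_tf        # adjusted count tf*
    return tf_star / n_d
```

For the toy document ["a", "a", "b", "c"], the singleton "b" gets its count adjusted from 1 to (1+1)·N_2/N_1 = 1, i.e. probability 1/4.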
16
Smoothing Methods
• Jelinek-Mercer method: a linear interpolation of the ML model with the collection model, controlled by λ:

  P_λ(w|d) = (1 − λ) · P_ml(w|d) + λ · P(w|C)
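The interpolation is a one-liner; here is a sketch (function name illustrative, default λ = 0.1 mirrors the value the paper found good for title queries):

```python
from collections import Counter

def p_jm(w, doc_terms, collection_terms, lam=0.1):
    """Jelinek-Mercer smoothing: (1 - lam) * P_ml(w|d) + lam * P(w|C)."""
    p_ml = Counter(doc_terms)[w] / len(doc_terms)
    p_c = Counter(collection_terms)[w] / len(collection_terms)
    return (1 - lam) * p_ml + lam * p_c
```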
17
Smoothing Methods
• Absolute discounting: decrease the probability of seen words by subtracting a constant δ from their counts:

  Ps(w|d) = max(c(w;d) − δ, 0) / Σ_{w'∈V} c(w';d) + σ · P(w|C)

  where σ = δ |d|_u / |d|, |d|_u is the number of unique terms in d, and |d| its total length.
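A sketch of absolute discounting, assuming the normalization σ = δ|d|_u/|d| so the distribution sums to one over the vocabulary (function name illustrative; δ = 0.7 mirrors the near-optimal value reported later in the talk):

```python
from collections import Counter

def p_abs_discount(w, doc_terms, collection_terms, delta=0.7):
    """Absolute discounting:
    max(c(w;d) - delta, 0)/|d| + sigma * P(w|C),
    with sigma = delta * |d|_u / |d|."""
    counts = Counter(doc_terms)
    d_len = len(doc_terms)
    sigma = delta * len(counts) / d_len      # delta * |d|_u / |d|
    p_c = Counter(collection_terms)[w] / len(collection_terms)
    return max(counts[w] - delta, 0) / d_len + sigma * p_c
```

Summing the smoothed probabilities over the whole vocabulary yields 1, which is exactly what the choice of σ guarantees.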
18
Smoothing Methods
• Bayesian smoothing using Dirichlet priors: the document model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution:

  P_μ(w|d) = ( c(w;d) + μ · P(w|C) ) / ( Σ_{w'∈V} c(w';d) + μ )

• The idea is to adjust the probabilities according to the query.
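A sketch of Dirichlet-prior smoothing (function name illustrative; the pseudo-count μ is the only parameter, and its optimal value varies per collection):

```python
from collections import Counter

def p_dirichlet(w, doc_terms, collection_terms, mu=2000):
    """Dirichlet-prior smoothing:
    (c(w;d) + mu * P(w|C)) / (|d| + mu)."""
    c_wd = Counter(doc_terms)[w]
    p_c = Counter(collection_terms)[w] / len(collection_terms)
    return (c_wd + mu * p_c) / (len(doc_terms) + mu)
```

Because μ·P(w|C) acts as a pseudo-count for every vocabulary word, the smoothed probabilities again sum to one.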
19
Summary: Smoothing Methods

  Method                 Ps(w|d)                                          αd               Parameter
  Jelinek-Mercer         (1 − λ) P_ml(w|d) + λ P(w|C)                     λ                λ
  Dirichlet              ( c(w;d) + μ P(w|C) ) / ( |d| + μ )              μ / (|d| + μ)    μ
  Absolute discounting   max(c(w;d) − δ, 0)/|d| + δ|d|_u/|d| · P(w|C)     δ |d|_u / |d|    δ
20
Parameters Setting
• 5 databases from TREC:
  – Financial Times on disk 4.
  – FBIS on disk 5.
  – Los Angeles Times on disk 5.
  – Disks 4 and 5 minus the Congressional Record.
  – The TREC8 web data.
• Queries:
  – Topics 351-400 (TREC 7 ad-hoc task).
  – Topics 401-450 (TREC 8 ad-hoc web task).
21
Parameters Setting
<num> Number: 384
<title> space station moon
<desc> Description:
Identify documents that discuss the building of
a space station with the intent of colonizing the
moon.
<narr> Narrative:
A relevant document will discuss the purpose of a
space station, initiatives towards colonizing the
moon, impediments which thus far have thwarted such a
project, plans currently underway or in the planning
stages for such a venture; cost, countries prepared
to make a commitment of men, resources, facilities
and money to accomplish such a feat.
</top>
TREC7
22
Parameters Setting
<num> Number: 414
<title> Cuba, sugar, exports
<desc> Description:
How much sugar does Cuba export and which
countries import it?
<narr> Narrative:
A relevant document will provide information
regarding Cuba's sugar trade. Sugar production
statistics are not relevant unless exports
are mentioned explicitly.
</top>
TREC8
23
Parameters Setting
• Interaction of query length/type:
  – Two different versions of each set of queries:
    • Title only (2 or 3 words).
    • A long version (title + description + narrative).
• Optimize the performance of each method by means of non-interpolated average precision.
24
Parameters Setting
• Jelinek-Mercer smoothing:
  – Weight of a matched term:

    log( 1 + (1 − λ)/λ · P_ml(qi|d) / P(qi|C) )

  – As λ → 1, the argument of the logarithm is close to 1 and log(1 + x) ≈ x, so ranking is essentially by

    Σ_i P_ml(qi|d) / P(qi|C)
25
Parameters Setting
• Dirichlet priors:
  – Term weight:

    log( 1 + c(qi;d) / ( μ · P(qi|C) ) ) = log( 1 + |d| · P_ml(qi|d) / ( μ · P(qi|C) ) )

  – αd = μ / (|d| + μ) is a document-dependent length-normalization factor that penalizes long documents.
26
Parameters Setting
• Absolute discounting: αd is document-dependent:
  – Larger for a document with a flatter distribution of words.
  – Weight of a matched term:

    log( 1 + c(qi;d) / ( δ |d|_u · P(qi|C) ) )
27
Parameters Setting
• Conclusions for Jelinek-Mercer:
  – Precision is much more sensitive to λ for long queries than for title queries.
    • Long queries need more smoothing, that is, less emphasis on the relative weighting of terms.
  – In the web collection, performance was sensitive to smoothing for title queries too.
  – For title queries, retrieval performance tends to be optimized around λ = 0.1.
28
Parameters Setting
• Conclusions for Dirichlet priors:
  – Precision is more sensitive to μ for long queries than for title queries, especially when μ is small.
  – When μ is large, all long queries performed better than short queries; the opposite holds when μ is small.
  – The optimal value of μ tends to be larger for long queries than for title queries.
  – The optimal μ tends to vary from collection to collection.
29
Parameters Setting
• Conclusions for absolute discounting:
  – Precision is more sensitive to δ for long queries than for title queries.
  – The optimal value δ ≈ 0.7 does not seem to differ much between title queries and long queries.
  – Smoothing plays a more important role for long, verbose queries than for concise queries.
30
Interpolation vs. Back-off
31
Interpolation vs. Back-off
• Interpolation-based methods: the counts of the seen words are discounted, and the extra probability mass is shared by both the seen and the unseen words.
• Back-off: trust the MLE for the high-count words; discount and redistribute mass only for the less common terms.
32
Interpolation vs. Back-off
• Interpolation:

  Ps(w) = P_dml(w) + αd · P(w|C)
  Pu(w) = αd · P(w|C)

  where P_dml is the discounted ML estimate.
33
Interpolation vs. Back-off
• Back-off:

  Ps(w) = P_dml(w)
  Pu(w) = αd · P(w|C) / ( 1 − Σ_{w'∈V: c(w';d)>0} P(w'|C) )
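The back-off construction can be sketched in Python. This version uses absolute discounting to free probability mass from the seen words and redistributes it only over the unseen words, renormalized against the collection model; the function name and δ = 0.7 are illustrative:

```python
from collections import Counter

def backoff_model(doc_terms, collection_terms, delta=0.7):
    """Back-off with absolute discounting: seen words keep the
    discounted MLE; the freed mass goes only to unseen words."""
    counts = Counter(doc_terms)
    d_len = len(doc_terms)
    col = Counter(collection_terms)
    col_len = len(collection_terms)
    # probability mass freed by discounting the seen words
    freed = delta * len(counts) / d_len
    # collection mass on unseen words, used for renormalization
    unseen_col_mass = 1.0 - sum(col[w] / col_len for w in counts)
    def p(w):
        if counts[w] > 0:
            return (counts[w] - delta) / d_len  # seen: discounted MLE only
        return freed * (col[w] / col_len) / unseen_col_mass
    return p
```

Unlike interpolation, a seen word's probability here depends only on its discounted document count, not on the collection model; the resulting distribution still sums to one.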
34
Interpolation vs. Back-off
• Results:
  – The performance of the back-off strategy is more sensitive to the smoothing parameters.
    • Especially for Jelinek-Mercer and Dirichlet priors.
  – This sensitivity is smaller for the absolute discounting method, due to the lower upper bound of αd = δ |d|_u / |d|.
35
Comparison of Methods
36
Comparison of Methods
• For title queries:
  – Dirichlet prior is better than absolute discounting, which is better than Jelinek-Mercer.
  – Dirichlet prior performed extremely well on the web collection and is insensitive to the value of μ.
  – Many non-optimal Dirichlet runs were still better than the other two methods.
37
Comparison of Methods
• For long queries:
  – Jelinek-Mercer is better than Dirichlet, which is better than absolute discounting.
  – All three methods perform better on long queries than on short queries.
  – Jelinek-Mercer is much more effective for long and verbose queries.
38
Comparison of Methods
• General remarks:
  – The strong correlation between the effect of smoothing and the type of query is unexpected.
  – Smoothing was only supposed to improve the accuracy of estimating the unigram language model of a document.
  – Is there an effect of verbose queries?
39
Query Length/Verbosity
• Four types of query:
  – Short keyword: only the title of the topic description.
  – Short verbose: only the description field.
  – Long keyword: the concept field (28 keywords on average).
  – Long verbose: title, description, and narrative fields (more than 50 words on average).
• Generated for TREC topics 1-150.
• The two keyword query types behaved in a similar way, and so did the verbose types.
• Retrieval performance is much less sensitive to smoothing for the keyword queries than for the verbose queries.
40
Combining Methods
• "A General Language Model for Information Retrieval" (Fei Song and W. Bruce Croft).
41
A General LM for IR
• They propose an extensible model based on:
  – The Good-Turing estimate.
  – Curve-fitting functions.
  – Model combinations.
• The idea of using n-grams is to take the local context into account; uni-gram models assume term independence.
42
A General LM for IR
• The new model:
  1. Smooth each document with the Good-Turing estimate.
  2. Expand each document model with the corpus.
  3. Consider term pairs and expand the uni-gram model to a bi-gram model.
43
Step 1: Good-Turing Idea, Revisited
• The adjusted count of a term with frequency tf:

  tf* = (tf + 1) · E(N_{tf+1}) / E(N_tf)

• The probability of a term with frequency tf is then:

  P_GT(t|d) = tf* / Nd = (tf + 1) · S(N_{tf+1}) / ( S(N_tf) · Nd )

  N_tf = number of terms with frequency tf in a document.
  E(N_tf) = expected value of N_tf; S(N_tf) is a smoothed estimate of it.
  Nd = total number of terms occurring in d.
44
Step 2
• Expanding a document model with the corpus, either as a weighted sum or a weighted product:

  P_sum(t|d) = ω · P(t|d) + (1 − ω) · P_corpus(t)

  P_weighted(t|d) = P(t|d)^ω · P_corpus(t)^(1−ω)
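The weighted-sum expansion can be sketched as follows; the function name and ω = 0.8 are illustrative. The point is that after expansion every corpus term has non-zero probability in the document model:

```python
def expand_with_corpus(p_doc, p_corpus, omega=0.8):
    """Weighted-sum expansion of a document model with the corpus
    model: omega * P(t|d) + (1 - omega) * P_corpus(t) over the
    union of both vocabularies."""
    vocab = set(p_doc) | set(p_corpus)
    return {t: omega * p_doc.get(t, 0.0) + (1 - omega) * p_corpus.get(t, 0.0)
            for t in vocab}
```

Since both inputs are probability distributions and the weights sum to one, the expanded model is again a probability distribution.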
45
Step 3
• Modeling a query as a sequence of terms, or as a set of terms:

  P_seq(Q|d) = ∏_{i=1..m} P(ti|d)

  P_set(Q|d) = ∏_{t∈Q} P(t|d) · ∏_{t∉Q} ( 1.0 − P(t|d) )
46
Step 4
• Combining uni-grams and bi-grams:

  P(ti−1, ti | d) = λ1 · P(ti|d) + λ2 · P(ti | ti−1, d),  with λ1 + λ2 = 1
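The uni-/bi-gram combination above can be sketched as a query scorer; the function name, the dictionary inputs, and the weights λ1 = 0.3, λ2 = 0.7 are illustrative assumptions:

```python
import math

def p_query_bigram(query, p_uni, p_bi, lam1=0.3, lam2=0.7):
    """Interpolated uni-/bi-gram query log-likelihood:
    log P(t1|d) + sum_i log(lam1*P(ti|d) + lam2*P(ti|ti-1,d)).
    `p_uni` maps term -> prob; `p_bi` maps (prev, term) -> prob."""
    score = math.log(p_uni[query[0]])
    for prev, t in zip(query, query[1:]):
        score += math.log(lam1 * p_uni[t] + lam2 * p_bi.get((prev, t), 0.0))
    return score
```

The uni-gram term acts as a fall-back, so a term pair never seen in the document still contributes a non-zero probability.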
47
Results
• Two collections:
  – The Wall Street Journal (WSJ): 250 MB, 74,520 docs.
  – TREC 4: 2 GB, 567,529 docs.
• Phrases of word pairs can be useful in improving retrieval performance.
• The strategy can be easily extended.
48
Personal Outlook / Conclusions
49
Personal Outlook / Conclusions
• Stop-list.
• Porter stemmer.
• N-grams cannot capture large-span relationships in the language.
• The performance of the n-gram model has reached a plateau.
• P(d).
50
Principal Component Analysis
• A low-dimensional representation of the data.
• Relations between features.
• PCA tries to find a low-rank approximation, where the quality of the approximation depends on how close the data is to lying in a subspace of the given dimensionality.
51
Latent Semantic Analysis
• Semantic information is extracted by means of the Singular Value Decomposition (SVD):

  D = U Σ Vᵀ

• LSI uses a reduction to the first k columns of U; a document vector is projected into the latent space as:

  d̂ = (U_k)ᵀ d
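A minimal NumPy sketch of the LSI projection, with a toy term-document matrix (rows = terms, columns = documents) as illustrative data:

```python
import numpy as np

# Toy term-document matrix D (4 terms x 3 documents).
D = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
Uk = U[:, :k]                # first k columns of U: the latent "concepts"

def project(vec):
    """Fold a term-space document or query vector into the
    k-dimensional latent space: d_hat = Uk^T d."""
    return Uk.T @ vec

q = np.array([1., 1., 0., 0.])  # a query containing the first two terms
q_hat = project(q)              # 2-dimensional latent representation
```

Queries and documents projected this way can be compared by cosine similarity in the latent space, which is what lets co-occurring terms match even without literal overlap.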
52
Latent Semantic Analysis
• The eigenvectors for a set of documents can be viewed as concepts, described by a linear combination of terms chosen in such a way that documents are described as accurately as possible using only k such concepts.
• Terms that co-occur frequently will tend to align in the same eigenvectors.
53
Latent Semantic Analysis
• SVD is expensive to compute.
• Cristianini developed an approximation strategy based on the Gram-Schmidt decomposition.
• Multilinguality:
  – The semantic space proposed here provides an ideal representation for performing multilingual information retrieval.
54
Personal Outlook / Conclusions
• What happens if we use LSA to improve smoothing?
  – We could smooth terms by assigning probability mass according to their semantic distance to the terms in the collection/query.
  – Problem:
    • Scalability of the model: if a term is not in the set W from which the SVD decomposition was built, we have to approximate.
55
Personal Outlook / Conclusions
• What happens if we use LSA to improve smoothing?
  – Problem:
    • If the documents belong to diverse topics, the classification in the new space becomes too heterogeneous.
    • If the documents belong to diverse topics, the classification of the words in the new space becomes ambiguous.
56
Personal Outlook / Conclusions
• Conclusions:
  – Smoothing methods are simple and efficient.
  – They provide an elegant way to deal with the data-sparseness problem.
  – They can be chosen according to the taste of the consumer.
  – But they do not model the linguistic phenomena behind the scenes... at least for the moment.
  – Even though the techniques do not require language knowledge, the Markov assumption leads to some sort of language dependency.
57
Questions?
• English only?
• Query expansion?
• How would smoothing help the Question Answering task?
• Which method would help a QA system in a more appropriate way? Why?