Smoothing Methods for LM in IR
Alejandro Figueroa
Dec 18, 2015
1
2
Outline
• The linguistic phenomena behind the retrieval of documents.
• Language Modeling Approach.
• Smoothing methods:
  – Overview.
  – Methods.
  – Parameter setting.
• Interpolation vs. Back-off.
• Comparison of methods.
• Combination of methods.
• Personal outlook and conclusions.
3
The Linguistic Phenomena behind IR
• "Reducing Information Variation on Texts" (Agata Savary and Christian Jacquemin).
• Work in our QA Group – DFKI.
4
Information Variation
• The problem: simple keyword matching is not enough to retrieve the best documents for a query. For example, "When was Albert Einstein born?":
  – The Nobel prize of physics Albert Einstein was born in 1879 in Ulm, Germany.
  – Born: 14 March 1879 in Ulm, Württemberg, Germany.
  – Physics Nobel prize Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879.
  – Died 18 Apr 1955 (born 14 Mar 1879) German-American physicist.
• The same information can be expressed in several ways.
5
Information Variation
• Kinds of variation:
  – Graphic: "14 March 1879" and "14 Mar 1879".
  – Morphological: "Physics nobel prize".
  – Syntactic: "German-American physicist".
  – Semantic: "Albert Einstein was born at Ulm" and "German-American physicist".
• Appropriateness:
  – Precision.
  – Economy.
6
Language Modeling Approach
• "A Study of Smoothing Methods for Language Models Applied to Information Retrieval" (Chengxiang Zhai and John Lafferty).
7
Language Modeling
• Rank documents by the probability that a query Q was generated by a probabilistic model based on a document d:

  Q = q1 q2 ... qn,  d = d1 d2 ... dm

  p(d|q) ∝ p(q|d) · p(d)

• Uni-gram model:

  P(q|d) = ∏_{i=1..n} P(qi|d)

• Problem: for any query term unseen in d, the MLE gives P(qi|d) = 0, so the whole product P(q|d) collapses to 0.
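To make the ranking rule concrete, here is a minimal Python sketch (not from the slides) of uni-gram query-likelihood scoring. The function name and the mixing weight lam=0.5 are illustrative; the interpolation with the collection model is one of the smoothing methods discussed later, and it is what keeps an unseen query term from zeroing out the product:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Log p(q|d) under a uni-gram model, smoothed by linear
    interpolation with the collection model so unseen query terms
    do not make the score -infinity."""
    doc_counts = Counter(doc_terms)
    col_counts = Counter(collection_terms)
    score = 0.0
    for q in query_terms:
        p_ml = doc_counts[q] / len(doc_terms)           # MLE from the document
        p_col = col_counts[q] / len(collection_terms)   # collection model
        score += math.log((1 - lam) * p_ml + lam * p_col)
    return score
```

A document that actually contains the query terms receives a higher log-likelihood than one that matches only through the collection model.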
8
Language Modeling
• Smoothing methods make use of two probabilities in the model: Ps(w|d) for the words seen in d, and Pu(w|d) for the unseen words, with

  Pu(qi|d) = αd · P(qi|C)

• The query log-likelihood then decomposes as:

  log p(q|d) = Σ_{i=1..n} log p(qi|d)
             = Σ_{i: c(qi;d)>0} log [ Ps(qi|d) / Pu(qi|d) ] + Σ_{i=1..n} log Pu(qi|d)
9
Language Modeling
• Substituting Pu(qi|d) = αd · P(qi|C):

  log p(q|d) = Σ_{i: c(qi;d)>0} log [ Ps(qi|d) / (αd · P(qi|C)) ] + n · log αd + Σ_{i=1..n} log P(qi|C)

• The first sum is carried out over the matched terms only; the last sum does not depend on the document.
• αd acts as a length normalizer: longer documents => less smoothing, but longer documents => greater penalty!
10
Smoothing Methods
11
Overview
• The problem: adjust the MLE to compensate for data sparseness.
• The role of smoothing:
  – Make the LM more accurate.
  – Explain the non-informative words in the query.
• Goals of the work:
  – How sensitive is retrieval performance to the smoothing of a document LM?
  – How should the model and its parameters be chosen?
12
Overview
• The unsmoothed model is the MLE:

  P_ml(w|d) = c(w;d) / Σ_{w'∈V} c(w';d)

• The general form of a smoothed model:

  P(w|d) = Ps(w|d)        if word w is seen in d
           αd · P(w|C)    otherwise

  with αd chosen so that the probabilities sum to one:

  αd = ( 1 − Σ_{w: c(w;d)>0} Ps(w|d) ) / ( 1 − Σ_{w: c(w;d)>0} P(w|C) )
13
Overview
• Smoothing: tackles the effect of statistical variability in small training sets.
• Discounting: the relative frequencies of seen events are discounted; the gained probability mass is then distributed over the unseen words.
14
Smoothing Methods
• Based on the Good-Turing idea: estimate the probability of new events by taking the count of singleton events and dividing it by the total number of events.
15
Good-Turing Idea
• The adjusted count of a term with frequency tf:

  tf* = (tf + 1) · E(N_{tf+1}) / E(N_tf)

• The probability of a term with frequency tf is then:

  P_GT(t|d) = tf* / Nd = (tf + 1) · S(N_{tf+1}) / ( S(N_tf) · Nd )

  N_tf = number of terms with frequency tf in a document.
  E(N_tf) = expected value of N_tf; S(N_tf) is a smoothed estimate of it.
  Nd = total number of terms occurring in d.
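A minimal Python sketch of the Good-Turing estimate above. It uses the raw frequency-of-frequency counts N_tf directly in place of E(N_tf); real implementations smooth these counts (e.g. by curve fitting) first, and the fallback for missing counts is an assumption of this sketch:

```python
from collections import Counter

def good_turing_prob(term, doc_terms):
    """Good-Turing probability of `term` in a document.
    tf* = (tf + 1) * N_{tf+1} / N_tf, then P = tf* / Nd."""
    counts = Counter(doc_terms)
    n_d = len(doc_terms)                     # Nd: total terms in d
    freq_of_freq = Counter(counts.values())  # N_tf
    tf = counts[term]
    if tf == 0:
        # total mass reserved for unseen events: N_1 / Nd
        return freq_of_freq.get(1, 0) / n_d
    n_tf = freq_of_freq.get(tf, 0)
    n_tf1 = freq_of_freq.get(tf + 1, 0)
    if n_tf == 0 or n_tf1 == 0:
        return tf / n_d                      # fall back to the MLE
    tf_star = (tf + 1) * n_tf1 / n_tf        # adjusted count tf*
    return tf_star / n_d
```

For the toy document ["a", "a", "b", "c"], the singleton "b" gets its count adjusted from 1 to (1+1)·N_2/N_1 = 1, i.e. probability 1/4.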
16
Smoothing Methods
• Jelinek-Mercer method: a linear interpolation of the ML model with the collection model, controlled by λ:

  P_λ(w|d) = (1 − λ) · P_ml(w|d) + λ · P(w|C)
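The interpolation is a one-liner; here is a sketch (function name illustrative, default λ = 0.1 mirrors the value the paper found good for title queries):

```python
from collections import Counter

def p_jm(w, doc_terms, collection_terms, lam=0.1):
    """Jelinek-Mercer smoothing: (1 - lam) * P_ml(w|d) + lam * P(w|C)."""
    p_ml = Counter(doc_terms)[w] / len(doc_terms)
    p_c = Counter(collection_terms)[w] / len(collection_terms)
    return (1 - lam) * p_ml + lam * p_c
```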
17
Smoothing Methods
• Absolute discounting: decrease the probability of seen words by subtracting a constant δ from their counts:

  Ps(w|d) = max(c(w;d) − δ, 0) / Σ_{w'∈V} c(w';d) + σ · P(w|C)

  where σ = δ |d|_u / |d|, |d|_u is the number of unique terms in d, and |d| its total length.
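A sketch of absolute discounting, assuming the normalization σ = δ|d|_u/|d| so the distribution sums to one over the vocabulary (function name illustrative; δ = 0.7 mirrors the near-optimal value reported later in the talk):

```python
from collections import Counter

def p_abs_discount(w, doc_terms, collection_terms, delta=0.7):
    """Absolute discounting:
    max(c(w;d) - delta, 0)/|d| + sigma * P(w|C),
    with sigma = delta * |d|_u / |d|."""
    counts = Counter(doc_terms)
    d_len = len(doc_terms)
    sigma = delta * len(counts) / d_len      # delta * |d|_u / |d|
    p_c = Counter(collection_terms)[w] / len(collection_terms)
    return max(counts[w] - delta, 0) / d_len + sigma * p_c
```

Summing the smoothed probabilities over the whole vocabulary yields 1, which is exactly what the choice of σ guarantees.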
18
Smoothing Methods
• Bayesian smoothing using Dirichlet priors: the document model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution:

  P_μ(w|d) = ( c(w;d) + μ · P(w|C) ) / ( Σ_{w'∈V} c(w';d) + μ )

• The idea is to adjust the probabilities according to the query.
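A sketch of Dirichlet-prior smoothing (function name illustrative; the pseudo-count μ is the only parameter, and its optimal value varies per collection):

```python
from collections import Counter

def p_dirichlet(w, doc_terms, collection_terms, mu=2000):
    """Dirichlet-prior smoothing:
    (c(w;d) + mu * P(w|C)) / (|d| + mu)."""
    c_wd = Counter(doc_terms)[w]
    p_c = Counter(collection_terms)[w] / len(collection_terms)
    return (c_wd + mu * p_c) / (len(doc_terms) + mu)
```

Because μ·P(w|C) acts as a pseudo-count for every vocabulary word, the smoothed probabilities again sum to one.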
19
Summary: Smoothing Methods

  Method                 Ps(w|d)                                          αd               Parameter
  Jelinek-Mercer         (1 − λ) P_ml(w|d) + λ P(w|C)                     λ                λ
  Dirichlet              ( c(w;d) + μ P(w|C) ) / ( |d| + μ )              μ / (|d| + μ)    μ
  Absolute discounting   max(c(w;d) − δ, 0)/|d| + δ|d|_u/|d| · P(w|C)     δ |d|_u / |d|    δ
20
Parameters Setting
• 5 databases from TREC:
  – Financial Times on disk 4.
  – FBIS on disk 5.
  – Los Angeles Times on disk 5.
  – Disks 4 and 5 minus the Congressional Record.
  – The TREC8 web data.
• Queries:
  – Topics 351-400 (TREC 7 ad-hoc task).
  – Topics 401-450 (TREC 8 ad-hoc web task).
21
Parameters Setting
<num> Number: 384
<title> space station moon
<desc> Description:
Identify documents that discuss the building of
a space station with the intent of colonizing the
moon.
<narr> Narrative:
A relevant document will discuss the purpose of a
space station, initiatives towards colonizing the
moon, impediments which thus far have thwarted such a
project, plans currently underway or in the planning
stages for such a venture; cost, countries prepared
to make a commitment of men, resources, facilities
and money to accomplish such a feat.
</top>
TREC7
22
Parameters Setting
<num> Number: 414
<title> Cuba, sugar, exports
<desc> Description:
How much sugar does Cuba export and which
countries import it?
<narr> Narrative:
A relevant document will provide information
regarding Cuba's sugar trade. Sugar production
statistics are not relevant unless exports
are mentioned explicitly.
</top>
TREC8
23
Parameters Setting
• Interaction of query length/type:
  – Two different versions of each set of queries:
    • Title only (2 or 3 words).
    • A long version (title + description + narrative).
• Optimize the performance of each method by means of non-interpolated average precision.
24
Parameters Setting
• Jelinek-Mercer smoothing:
  – Weight of a matched term:

    log( 1 + (1 − λ)/λ · P_ml(qi|d) / P(qi|C) )

  – As λ → 1, the argument of the logarithm is close to 1 and log(1 + x) ≈ x, so ranking is essentially by

    Σ_i P_ml(qi|d) / P(qi|C)
25
Parameters Setting
• Dirichlet priors:
  – Term weight:

    log( 1 + c(qi;d) / ( μ · P(qi|C) ) ) = log( 1 + |d| · P_ml(qi|d) / ( μ · P(qi|C) ) )

  – αd = μ / (|d| + μ) is a document-dependent length-normalization factor that penalizes long documents.
26
Parameters Setting
• Absolute discounting: αd is document-dependent:
  – Larger for a document with a flatter distribution of words.
  – Weight of a matched term:

    log( 1 + c(qi;d) / ( δ |d|_u · P(qi|C) ) )
27
Parameters Setting
• Conclusions for Jelinek-Mercer:
  – Precision is much more sensitive to λ for long queries than for title queries.
    • Long queries need more smoothing, that is, less emphasis on the relative weighting of terms.
  – In the web collection, performance was sensitive to smoothing for title queries too.
  – For title queries, retrieval performance tends to be optimized around λ = 0.1.
28
Parameters Setting
• Conclusions for Dirichlet priors:
  – Precision is more sensitive to μ for long queries than for title queries, especially when μ is small.
  – When μ is large, all long queries performed better than short queries; the opposite holds when μ is small.
  – The optimal value of μ tends to be larger for long queries than for title queries.
  – The optimal μ tends to vary from collection to collection.
29
Parameters Setting
• Conclusions for absolute discounting:
  – Precision is more sensitive to δ for long queries than for title queries.
  – The optimal value δ ≈ 0.7 does not seem to differ much between title queries and long queries.
  – Smoothing plays a more important role for long, verbose queries than for concise queries.
30
Interpolation vs. Back-off
31
Interpolation vs. Back-off
• Interpolation-based methods: the counts of the seen words are discounted, and the extra probability mass is shared by both the seen and the unseen words.
• Back-off: trust the MLE for the high-count words; discount and redistribute mass only for the less common terms.
32
Interpolation vs. Back-off
• Interpolation:

  Ps(w) = P_dml(w) + αd · P(w|C)
  Pu(w) = αd · P(w|C)

  where P_dml is the discounted ML estimate.
33
Interpolation vs. Back-off
• Back-off:

  Ps(w) = P_dml(w)
  Pu(w) = αd · P(w|C) / ( 1 − Σ_{w'∈V: c(w';d)>0} P(w'|C) )
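The back-off construction can be sketched in Python. This version uses absolute discounting to free probability mass from the seen words and redistributes it only over the unseen words, renormalized against the collection model; the function name and δ = 0.7 are illustrative:

```python
from collections import Counter

def backoff_model(doc_terms, collection_terms, delta=0.7):
    """Back-off with absolute discounting: seen words keep the
    discounted MLE; the freed mass goes only to unseen words."""
    counts = Counter(doc_terms)
    d_len = len(doc_terms)
    col = Counter(collection_terms)
    col_len = len(collection_terms)
    # probability mass freed by discounting the seen words
    freed = delta * len(counts) / d_len
    # collection mass on unseen words, used for renormalization
    unseen_col_mass = 1.0 - sum(col[w] / col_len for w in counts)
    def p(w):
        if counts[w] > 0:
            return (counts[w] - delta) / d_len  # seen: discounted MLE only
        return freed * (col[w] / col_len) / unseen_col_mass
    return p
```

Unlike interpolation, a seen word's probability here depends only on its discounted document count, not on the collection model; the resulting distribution still sums to one.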
34
Interpolation vs. Back-off
• Results:
  – The performance of the back-off strategy is more sensitive to the smoothing parameters.
    • Especially for Jelinek-Mercer and Dirichlet priors.
  – This sensitivity is smaller for the absolute discounting method, due to the lower upper bound of αd = δ |d|_u / |d|.
35
Comparison of Methods
36
Comparison of Methods
• For title queries:
  – Dirichlet prior is better than absolute discounting, which is better than Jelinek-Mercer.
  – Dirichlet prior performed extremely well on the web collection and is insensitive to the value of μ.
  – Many non-optimal Dirichlet runs were still better than the other two methods.
37
Comparison of Methods
• For long queries:
  – Jelinek-Mercer is better than Dirichlet, which is better than absolute discounting.
  – All three methods perform better on long queries than on short queries.
  – Jelinek-Mercer is much more effective for long and verbose queries.
38
Comparison of Methods
• General remarks:
  – The strong correlation between the effect of smoothing and the type of query is unexpected.
  – Smoothing was only supposed to improve the accuracy of estimating the unigram language model of a document.
  – Is there an effect of verbose queries?
39
Query Length/Verbosity
• Four types of query:
  – Short keyword: only the title of the topic description.
  – Short verbose: only the description field.
  – Long keyword: the concept field (28 keywords on average).
  – Long verbose: title, description, and narrative fields (more than 50 words on average).
• Generated for TREC topics 1-150.
• The two keyword query types behaved in a similar way, and so did the verbose types.
• Retrieval performance is much less sensitive to smoothing for the keyword queries than for the verbose queries.
40
Combining Methods
• "A General Language Model for Information Retrieval" (Fei Song and W. Bruce Croft).
41
A General LM for IR
• They propose an extensible model based on:
  – The Good-Turing estimate.
  – Curve-fitting functions.
  – Model combinations.
• The idea of using n-grams is to take the local context into account; uni-gram models assume term independence.
42
A General LM for IR
• The new model:
  1. Smooth each document with the Good-Turing estimate.
  2. Expand each document model with the corpus.
  3. Consider term pairs and expand the uni-gram model to a bi-gram model.
43
Step 1: Good-Turing Idea, Revisited
• The adjusted count of a term with frequency tf:

  tf* = (tf + 1) · E(N_{tf+1}) / E(N_tf)

• The probability of a term with frequency tf is then:

  P_GT(t|d) = tf* / Nd = (tf + 1) · S(N_{tf+1}) / ( S(N_tf) · Nd )

  N_tf = number of terms with frequency tf in a document.
  E(N_tf) = expected value of N_tf; S(N_tf) is a smoothed estimate of it.
  Nd = total number of terms occurring in d.
44
Step 2
• Expanding a document model with the corpus, either as a weighted sum or a weighted product:

  P_sum(t|d) = ω · P(t|d) + (1 − ω) · P_corpus(t)

  P_weighted(t|d) = P(t|d)^ω · P_corpus(t)^(1−ω)
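The weighted-sum expansion can be sketched as follows; the function name and ω = 0.8 are illustrative. The point is that after expansion every corpus term has non-zero probability in the document model:

```python
def expand_with_corpus(p_doc, p_corpus, omega=0.8):
    """Weighted-sum expansion of a document model with the corpus
    model: omega * P(t|d) + (1 - omega) * P_corpus(t) over the
    union of both vocabularies."""
    vocab = set(p_doc) | set(p_corpus)
    return {t: omega * p_doc.get(t, 0.0) + (1 - omega) * p_corpus.get(t, 0.0)
            for t in vocab}
```

Since both inputs are probability distributions and the weights sum to one, the expanded model is again a probability distribution.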
45
Step 3
• Modeling a query as a sequence of terms, or as a set of terms:

  P_seq(Q|d) = ∏_{i=1..m} P(ti|d)

  P_set(Q|d) = ∏_{t∈Q} P(t|d) · ∏_{t∉Q} ( 1.0 − P(t|d) )
46
Step 4
• Combining uni-grams and bi-grams:

  P(ti−1, ti | d) = λ1 · P(ti|d) + λ2 · P(ti | ti−1, d),  with λ1 + λ2 = 1
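The uni-/bi-gram combination above can be sketched as a query scorer; the function name, the dictionary inputs, and the weights λ1 = 0.3, λ2 = 0.7 are illustrative assumptions:

```python
import math

def p_query_bigram(query, p_uni, p_bi, lam1=0.3, lam2=0.7):
    """Interpolated uni-/bi-gram query log-likelihood:
    log P(t1|d) + sum_i log(lam1*P(ti|d) + lam2*P(ti|ti-1,d)).
    `p_uni` maps term -> prob; `p_bi` maps (prev, term) -> prob."""
    score = math.log(p_uni[query[0]])
    for prev, t in zip(query, query[1:]):
        score += math.log(lam1 * p_uni[t] + lam2 * p_bi.get((prev, t), 0.0))
    return score
```

The uni-gram term acts as a fall-back, so a term pair never seen in the document still contributes a non-zero probability.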
47
Results
• Two collections:
  – The Wall Street Journal (WSJ): 250 MB, 74,520 docs.
  – TREC 4: 2 GB, 567,529 docs.
• Phrases of word pairs can be useful in improving retrieval performance.
• The strategy can be easily extended.
48
Personal Outlook / Conclusions
49
Personal Outlook / Conclusions
• Stop-list.
• Porter stemmer.
• N-grams cannot capture large-span relationships in the language.
• The performance of the n-gram model has reached a plateau.
• P(d).
50
Principal Component Analysis
• A low-dimensional representation of the data.
• Relations between features.
• PCA tries to find a low-rank approximation, where the quality of the approximation depends on how close the data is to lying in a subspace of the given dimensionality.
51
Latent Semantic Analysis
• Semantic information is extracted by means of the Singular Value Decomposition (SVD):

  D = U Σ Vᵀ

• LSI uses a reduction to the first k columns of U; a document vector is projected into the latent space as:

  d̂ = (U_k)ᵀ d
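A minimal NumPy sketch of the LSI projection, with a toy term-document matrix (rows = terms, columns = documents) as illustrative data:

```python
import numpy as np

# Toy term-document matrix D (4 terms x 3 documents).
D = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
Uk = U[:, :k]                # first k columns of U: the latent "concepts"

def project(vec):
    """Fold a term-space document or query vector into the
    k-dimensional latent space: d_hat = Uk^T d."""
    return Uk.T @ vec

q = np.array([1., 1., 0., 0.])  # a query containing the first two terms
q_hat = project(q)              # 2-dimensional latent representation
```

Queries and documents projected this way can be compared by cosine similarity in the latent space, which is what lets co-occurring terms match even without literal overlap.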
52
Latent Semantic Analysis
• The eigenvectors for a set of documents can be viewed as concepts, described by a linear combination of terms chosen in such a way that documents are described as accurately as possible using only k such concepts.
• Terms that co-occur frequently will tend to align in the same eigenvectors.
53
Latent Semantic Analysis
• SVD is expensive to compute.
• Cristianini developed an approximation strategy based on the Gram-Schmidt decomposition.
• Multilinguality:
  – The semantic space proposed here provides an ideal representation for performing multilingual information retrieval.
54
Personal Outlook / Conclusions
• What happens if we use LSA to improve smoothing?
  – We could smooth terms by assigning probability mass according to their semantic distance to the terms in the collection/query.
  – Problem:
    • Scalability of the model: if a term is not in the set W from which the SVD decomposition was built, we have to approximate.
55
Personal Outlook / Conclusions
• What happens if we use LSA to improve smoothing?
  – Problem:
    • If the documents belong to diverse topics, the classification in the new space becomes too heterogeneous.
    • If the documents belong to diverse topics, the classification of the words in the new space becomes ambiguous.
56
Personal Outlook / Conclusions
• Conclusions:
  – Smoothing methods are simple and efficient.
  – They provide an elegant way to deal with the data-sparseness problem.
  – They can be chosen according to the taste of the consumer.
  – But they do not model the linguistic phenomena behind the scenes... at least for the moment.
  – Even though the techniques do not require language knowledge, the Markov assumption leads to some sort of language dependency.
57
Questions?
• English only?
• Query expansion?
• How would smoothing help the Question Answering task?
• Which method would help a QA system in a more appropriate way? Why?