Jan 12, 2016
Integrating Word Relationships into Language Models
Guihong Cao, Jian-Yun Nie, Jing Bai
Département d'Informatique et de Recherche Opérationnelle, Université de Montréal
Presenter: Chia-Hao Lee
Outline
• Introduction
• Previous Work
• A Dependency Model to Combine WordNet and Co-occurrence
• Parameter estimation
– Estimating conditional probabilities
– Estimating mixture weights
• Experiments
• Conclusion and future work
Introduction
• In recent years, language models for information retrieval (IR) have increased in popularity.
• The basic idea is to compute the conditional probability $P(Q|D)$.
• In most approaches, the computation is conceptually decomposed into two distinct steps:
– (1) Estimating the document model
– (2) Computing the query likelihood using the estimated document model
Introduction (cont.)
• When estimating the document model, the words in the document are assumed to be independent of one another, leading to the so-called "bag-of-words" model.
• However, from our own knowledge of natural language, we know that the assumption of term independence is a matter of mathematical convenience rather than a reality.
• For example, the words "computer" and "program" are not independent: a query asking for "computer" might be well satisfied by a document about "program".
Introduction (cont.)
• Some studies have been carried out to relax the independence assumption.
• The first direction is data-driven: it tries to capture dependencies among terms through statistical information derived directly from the corpus.
• Another direction is to exploit hand-crafted thesauri, such as WordNet.
Previous Work
• In the classical language modeling approach to IR, a multinomial model $P(w|d)$ over terms is estimated for each document $d$ in the collection $C$ to be indexed and searched.
• In most cases, each query term is assumed to be independent of the others, so the query likelihood is estimated by:
$$P(q|d) = \prod_{i=1}^{n} P(q_i|d)$$
• After the specification of a document prior $P(d)$, the posterior probability of a document is given by:
$$P(d|q) \propto P(q|d)\,P(d) \quad (1)$$
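As an illustration, here is a minimal Python sketch of this unigram query-likelihood scoring. The toy documents, the whitespace tokenization, and the use of Dirichlet smoothing (rather than the absolute discounting introduced later in Equation 12) are assumptions for illustration, not the paper's exact setup.

```python
import math
from collections import Counter

def score_unigram(query, doc_tokens, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood log P(q|d) for Eq. 1, using a
    Dirichlet-smoothed unigram estimate of P(q_i|d)."""
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    logp = 0.0
    for term in query.split():
        p_coll = collection_tf.get(term, 0) / collection_len  # P_MLE(term|C)
        p = (tf.get(term, 0) + mu * p_coll) / (dlen + mu)     # smoothed P(term|d)
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        logp += math.log(p)
    return logp

# Toy usage: rank two documents for a query, with a uniform prior P(d)
docs = {"d1": "the computer runs a program".split(),
        "d2": "the cat sat on the mat".split()}
collection_tf = Counter(w for toks in docs.values() for w in toks)
collection_len = sum(collection_tf.values())
for name, toks in docs.items():
    print(name, score_unigram("computer program", toks, collection_tf, collection_len))
```

With a uniform prior, ranking by this log-likelihood is equivalent to ranking by the posterior of Equation 1.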
Previous Work (cont.)
• However, the classical language modeling approach to IR does not address the problem of dependence between words.
• The term "dependence" may mean two different things:
– Dependence between words within a query or within a document
– Dependence between query words and document words
• Under the first meaning, one may try to recognize the relationships between words in a sentence.
• Under the second meaning, dependence means any relationship that can be exploited during query evaluation.
Previous Work (cont.)
• To incorporate term relationships into the document language model, a translation model $t(q_i|w)$ was proposed.
• With the translation model, the document-to-query model becomes:
$$P(q|d) = \prod_{i=1}^{n}\sum_{w\in d} t(q_i|w)\,P(w|d) \quad (2)$$
• Even though this model is more general than other language models, it is difficult to determine the translation probability $t(q_i|w)$ in practice.
• To solve this problem, an artificial collection of "synthetic" data is generated for training by assuming that a sentence is parallel to the paragraph that contains it.
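A minimal Python sketch of Equation 2. The translation table `t` here is a hand-made toy dictionary, not a table trained on the synthetic sentence-paragraph data described above; treating a missing self-pair as probability 1.0 is likewise a simplifying assumption.

```python
import math
from collections import Counter

def score_translation(query_terms, doc_tokens, t):
    """Log of Eq. 2: P(q|d) = prod_i sum_{w in d} t(q_i|w) P(w|d).
    `t` maps (query_term, doc_term) pairs to translation probabilities."""
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    logp = 0.0
    for q in query_terms:
        # Missing pairs default to 1.0 for self-translation, else 0.0 (toy choice)
        s = sum(t.get((q, w), 1.0 if q == w else 0.0) * (c / dlen)
                for w, c in tf.items())
        if s == 0.0:
            return float("-inf")
        logp += math.log(s)
    return logp

# Toy table: the document term "program" can "translate to" the query term "computer"
t = {("computer", "computer"): 0.9, ("computer", "program"): 0.3}
doc = "a program is compiled and run".split()
print(score_translation(["computer"], doc, t))
```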
A Dependency Model to Combine WordNet and Co-occurrence
• Given a query q and a document d, they can be related directly, or indirectly through some word relationships.
• An example of the first case is that the document and the query contain the same words.
• In the second case, a document can contain a different word that is synonymous with or related to the one in the query.
• In order to take both cases into account in our modeling, we assume that there are two sources generating a term from a document: one from a dependency model and another from a non-dependency model.
$D$: the parameter of the dependency model
$\bar{D}$: the parameter of the non-dependency model
A Dependency Model to Combine WordNet and Co-occurrence (cont.)
$$P(q|d) = \prod_{i=1}^{n} P(q_i|d) = \prod_{i=1}^{n}\left[P(q_i, D|d) + P(q_i, \bar{D}|d)\right] = \prod_{i=1}^{n}\left[P(q_i|D,d)\,P(D|d) + P(q_i|\bar{D},d)\,P(\bar{D}|d)\right] \quad (3)$$
A Dependency Model to Combine WordNet and Co-occurrence (cont.)
• The non-dependency model tries to capture the direct generation of the query by the document; we can model it by the unigram document model:
$$P(q_i|\bar{D},d)\,P(\bar{D}|d) = P_U(q_i|d)\,P(U|d) \quad (4)$$
• For the dependency model, we first select a term $w$ in the document at random; then a query term is generated based on the observed term. Therefore we have:
$$P(q_i|D,d) = \sum_{w\in d} P(q_i|w,D)\,P(w|d)$$
$P_U(q_i|d)$: the probability of the unigram model
A Dependency Model to Combine WordNet and Co-occurrence (cont.)
• As with the translation model, we have the problem of estimating the dependency between two terms, i.e. $P(q_i|w)$.
• To address the problem, we assume that some word relationships have been manually identified and stored in a linguistic resource, and some other relationships have to be found automatically according to co-occurrences.
A Dependency Model to Combine WordNet and Co-occurrence (cont.)
• This combination can be achieved by linear interpolation smoothing. Thus:
$$P(q_i|w) = \lambda\,P(q_i|w,L) + (1-\lambda)\,P(q_i|w,\bar{L}) \quad (5)$$
$P(q_i|w,L)$: the conditional probability of $q_i$ given $w$ according to WordNet
$P(q_i|w,\bar{L})$: the probability that the link between $q_i$ and $w$ is established by other means
$\lambda$: the interpolation factor, so Equation 5 can be considered a two-component mixture model
• In our study, we only consider co-occurrence information besides WordNet. So $P(q_i|w,\bar{L})$ is just the co-occurrence model.
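A small sketch of Equation 5 in Python. The probability tables `p_link` and `p_cooc` and the value of `lam` are toy assumptions; in the paper they would come from the estimation procedures of Equations 13 and 14.

```python
def p_dep(q_term, w_term, p_link, p_cooc, lam=0.5):
    """Eq. 5: interpolate the WordNet link model and the co-occurrence
    model. `p_link` and `p_cooc` map (q_term, w_term) pairs to
    probabilities; `lam` is the interpolation factor lambda."""
    return (lam * p_link.get((q_term, w_term), 0.0)
            + (1.0 - lam) * p_cooc.get((q_term, w_term), 0.0))

# Toy values: "computer"/"program" are related both in WordNet and by co-occurrence
p_link = {("computer", "program"): 0.10}
p_cooc = {("computer", "program"): 0.25}
print(p_dep("computer", "program", p_link, p_cooc, lam=0.4))  # 0.4*0.10 + 0.6*0.25
```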
A Dependency Model to Combine WordNet and Co-occurrence (cont.)
• For simplicity of expression, we denote the probability of the link model as $P_L(q_i|w) = P(q_i|w,L)$, and the co-occurrence model as $P_{CO}(q_i|w) = P(q_i|w,\bar{L})$.
• Substituting Equations 4 and 5 into Equation 3, we obtain Equation 6:
$$P(q|d) = \prod_{i=1}^{n}\left[P(q_i|D,d)\,P(D|d) + P_U(q_i|d)\,P(U|d)\right]$$
$$= \prod_{i=1}^{n}\left[P(D|d)\sum_{w\in d}P(q_i|w,D)\,P(w|d) + P_U(q_i|d)\,P(U|d)\right]$$
$$= \prod_{i=1}^{n}\left[\lambda\,P(D|d)\sum_{w\in d}P_L(q_i|w)\,P(w|d) + (1-\lambda)\,P(D|d)\sum_{w\in d}P_{CO}(q_i|w)\,P(w|d) + P_U(q_i|d)\,P(U|d)\right] \quad (6)$$
using
$$\sum_{w\in d}P(q_i|w,D)\,P(w|d) = \sum_{w\in d}\left[\lambda\,P_L(q_i|w) + (1-\lambda)\,P_{CO}(q_i|w)\right]P(w|d)$$
A Dependency Model to Combine WordNet and Co-occurrence (cont.)
• The idea becomes more obvious if we make some simplifications in the formula.
• So we can get:
$$P_L(q_i|d) = \sum_{w\in d}P_L(q_i|w)\,P(w|d) \quad (7)$$
$$P_{CO}(q_i|d) = \sum_{w\in d}P_{CO}(q_i|w)\,P(w|d) \quad (8)$$
$$P(q|d) = \prod_{i=1}^{n}\left[\lambda\,P(D|d)\,P_L(q_i|d) + (1-\lambda)\,P(D|d)\,P_{CO}(q_i|d) + P(U|d)\,P_U(q_i|d)\right] \quad (9)$$
• Equation 9 is a mixture model consisting of the link model, the co-occurrence model, and the unigram model.
A Dependency Model to Combine WordNet and Co-occurrence (cont.)
• Let $\lambda_L = \lambda\,P(D|d)$, $\lambda_{CO} = (1-\lambda)\,P(D|d)$, and $\lambda_U = P(U|d)$ denote the respective weights of the link model, co-occurrence model, and unigram model.
• Then Equation 9 can be rewritten as:
$$P(q|d) = \prod_{i=1}^{n}\left[\lambda_L\,P_L(q_i|d) + \lambda_{CO}\,P_{CO}(q_i|d) + \lambda_U\,P_U(q_i|d)\right] \quad (10) \qquad \text{(NSLM)}$$
• For information retrieval, the most important terms are nouns. So we concentrate on three relations related to nouns: synonymy, hypernymy, and hyponymy. Splitting the link model by relation gives:
$$P(q|d) = \prod_{i=1}^{n}\left[\lambda_1\,P_{SYN}(q_i|d) + \lambda_2\,P_{HYPE}(q_i|d) + \lambda_3\,P_{HYPO}(q_i|d) + \lambda_4\,P_{CO}(q_i|d) + \lambda_5\,P_U(q_i|d)\right] \quad (11) \qquad \text{(SLM)}$$
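A sketch of scoring with the three-component mixture of Equation 10. The component models are passed in as callables and are replaced here by constant toy stand-ins; in practice $P_L$, $P_{CO}$, and $P_U$ come from Equations 7, 8, and 12, and the weights from the EM procedure described later.

```python
import math

def score_nslm(query_terms, doc_id, p_L, p_CO, p_U, lam_L, lam_CO, lam_U):
    """Eq. 10 (NSLM): P(q|d) = prod_i [ lam_L*P_L(q_i|d)
    + lam_CO*P_CO(q_i|d) + lam_U*P_U(q_i|d) ], computed in log space.
    The component models are callables (term, doc_id) -> probability."""
    assert abs(lam_L + lam_CO + lam_U - 1.0) < 1e-9  # weights form a mixture
    logp = 0.0
    for term in query_terms:
        p = (lam_L * p_L(term, doc_id)
             + lam_CO * p_CO(term, doc_id)
             + lam_U * p_U(term, doc_id))
        if p == 0.0:
            return float("-inf")
        logp += math.log(p)
    return logp

# Constant toy component models, for illustration only
print(score_nslm(["computer"], "d1",
                 lambda t, d: 0.02,   # link model P_L
                 lambda t, d: 0.05,   # co-occurrence model P_CO
                 lambda t, d: 0.01,   # unigram model P_U
                 lam_L=0.2, lam_CO=0.3, lam_U=0.5))
```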
Parameter estimation
• 1. Estimating conditional probabilities
– For the unigram model $P_U(w_i|d)$, we use the MLE estimate, smoothed by interpolated absolute discounting, that is:
$$P_U(w_i|d) = \frac{\max(c(w_i;d)-\delta,\,0)}{|d|} + \frac{\delta\,|d|_u}{|d|}\,P_{MLE}(w_i|C) \quad (12)$$
$\delta$: the discount factor
$|d|$: the length of the document $d$
$|d|_u$: the count of unique terms in the document $d$
$P_{MLE}(w_i|C)$: the maximum likelihood probability of the word in the collection
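A sketch of Equation 12 in Python. The discount value `delta = 0.7` and the toy collection probabilities are assumptions for illustration, not the paper's tuned settings.

```python
from collections import Counter

def p_u_absolute_discount(w, doc_tokens, p_mle_coll, delta=0.7):
    """Eq. 12: interpolated absolute discounting for the unigram model.
    P_U(w|d) = max(c(w;d) - delta, 0)/|d| + delta*|d|_u/|d| * P_MLE(w|C).
    `p_mle_coll` maps a word to its ML probability in the collection."""
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)          # |d|
    uniq = len(tf)                  # |d|_u, count of unique terms in d
    discounted = max(tf.get(w, 0) - delta, 0.0) / dlen
    backoff = delta * uniq / dlen * p_mle_coll.get(w, 0.0)
    return discounted + backoff

doc = "the computer runs a computer program".split()
p_coll = {"computer": 0.01, "banana": 0.001}
print(p_u_absolute_discount("computer", doc, p_coll))
print(p_u_absolute_discount("banana", doc, p_coll))  # unseen in d: backoff mass only
```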
Parameter estimation (cont.)
• For $P(w|d,D)$, it can be approximated by the maximum likelihood probability $P_{MLE}(w|d)$.
• This approximation is motivated by the fact that the word $w$ is primarily generated from $d$ in a way quite independent of the model $D$.
• We now turn to the estimation of $P_L(w_i|w)$, the probability of a link between two words according to WordNet.
Parameter estimation (cont.)
• Equation 13 defines our estimation of $P_L(w_i|w)$ by interpolated absolute discounting:
$$P_L(w_i|w) = \frac{\max(c(w_i,w,L,W)-\delta,\,0)}{\sum_{j=1}^{v} c(w_j,w,L,W)} + \frac{\delta\,|c(*,w,L,W)|}{\sum_{j=1}^{v} c(w_j,w,L,W)}\,P_{add\text{-}one}(w_i|w,L,W) \quad (13)$$
where
$$P_{add\text{-}one}(w_i|w,L,W) = \frac{c(w_i,w,L,W)+1}{\sum_{j=1}^{v} c(w_j,w,L,W)+v}$$
$w$ and $w_i$ are assumed to have a relationship in WordNet
$c(w_i,w,L,W)$: the count of co-occurrences of $w_i$ with $w$ within the predefined window $W$
$|c(*,w,L,W)|$: the number of unique terms which have a relationship with $w$ in WordNet and co-occur with it in $W$
$v$: the vocabulary size
Parameter estimation (cont.)
• The estimation of the co-occurrence model $P_{CO}(w_i|w)$ is similar to that of the link model $P_L(w_i|w)$, except that when counting the co-occurrence frequency, the requirement of having a link in WordNet is removed:
$$P_{CO}(w_i|w) = \frac{\max(c(w_i,w,W)-\delta,\,0)}{\sum_{j=1}^{v} c(w_j,w,W)} + \frac{\delta\,|c(*,w,W)|}{\sum_{j=1}^{v} c(w_j,w,W)}\,P_{add\text{-}one}(w_i|w,W) \quad (14)$$
where
$$P_{add\text{-}one}(w_i|w,W) = \frac{c(w_i,w,W)+1}{\sum_{j=1}^{v} c(w_j,w,W)+v}$$
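A sketch of the counts behind Equations 13 and 14, using NLTK's WordNet interface (an assumption; the paper does not specify a toolkit, and `nltk.download('wordnet')` must be run first). With `require_link=True` the count also requires a WordNet link, as in Equation 13; without it, it is the plain windowed co-occurrence count of Equation 14. These counts would then be plugged into the absolute-discounting formulas above.

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet') has been run

def wordnet_related(word):
    """Nouns linked to `word` in WordNet via the three relations the
    paper uses: synonyms, hypernyms, and hyponyms."""
    related = set()
    for syn in wn.synsets(word, pos=wn.NOUN):
        related.update(lemma.name() for lemma in syn.lemmas())      # synonyms
        for rel in syn.hypernyms() + syn.hyponyms():
            related.update(lemma.name() for lemma in rel.lemmas())  # hyper-/hyponyms
    related.discard(word)
    return related

def cooccurrence_counts(tokens, w, window=5, require_link=False):
    """Windowed co-occurrence counts: c(w_i, w, W) as in Eq. 14, or
    c(w_i, w, L, W) as in Eq. 13 when require_link=True (the pair must
    also be linked in WordNet). The window size is an assumed value."""
    linked = wordnet_related(w) if require_link else None
    counts = Counter()
    for pos, tok in enumerate(tokens):
        if tok != w:
            continue
        lo, hi = max(0, pos - window), pos + window + 1
        for other in tokens[lo:pos] + tokens[pos + 1:hi]:
            if linked is None or other in linked:
                counts[other] += 1
    return counts

tokens = "the computer is a machine and the program runs on the computer".split()
print(cooccurrence_counts(tokens, "computer"))                      # Eq. 14 counts
print(cooccurrence_counts(tokens, "computer", require_link=True))   # Eq. 13 counts
```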
Parameter estimation (cont.)
• 2. Estimating mixture weights
– We introduce an EM algorithm to estimate the mixture weights in NSLM.
– Because NSLM is a three-component mixture model, the optimal weights should maximize the likelihood of the queries.
– Let $\Lambda_q = (\lambda_L, \lambda_{CO}, \lambda_U)$ be the mixture weights; we then have:
$$\Lambda_q^* = \arg\max_{\Lambda_q}\sum_{i=1}^{N}\alpha_i\,\log\prod_{j=1}^{m}\left[\lambda_U\,P_U(q_j|d_i) + \lambda_L\,P_L(q_j|d_i) + \lambda_{CO}\,P_{CO}(q_j|d_i)\right] \quad (15)$$
$N$: the number of documents in the dataset
$m$: the length of the query $q$
$\{\alpha_i\}_{i=1}^{N}$: the prior probability with which to choose document $d_i$ to generate the query
Parameter estimation (cont.)
• However, some documents having high weights are not truly relevant to the query; they contain noise.
• To account for the noise, we further assume that there are two distinct sources generating the query.
• One is the relevant documents; the other is a noisy source, which is approximated by the collection $C$ (a form of smoothing):
$$\Lambda_q^* = \arg\max_{\Lambda_q}\sum_{i=1}^{N}\alpha_i\,\log\prod_{j=1}^{m}\Big[(1-\beta)\big(\lambda_U\,P_U(q_j|d_i) + \lambda_L\,P_L(q_j|d_i) + \lambda_{CO}\,P_{CO}(q_j|d_i)\big) + \beta\big(\lambda_U\,P_U(q_j|C) + \lambda_L\,P_L(q_j|C) + \lambda_{CO}\,P_{CO}(q_j|C)\big)\Big] \quad (16)$$
$P_U(q_j|C)$, $P_L(q_j|C)$, $P_{CO}(q_j|C)$: respectively the unigram, link, and co-occurrence models built from the collection
$\beta$: the weight of the noise
Parameter estimation (cont.)
• With this setting, the hidden $\{\alpha_i\}_{i=1}^{N}$ and $\Lambda_q$ can be estimated using the EM algorithm.
• The update formulas (Equation 17) are as follows. Writing $P'_x(q_j|d_i) = (1-\beta)\,P_x(q_j|d_i) + \beta\,P_x(q_j|C)$ for $x \in \{U, L, CO\}$, and $P^{(r)}(q_j|d_i) = \lambda_U^{(r)}\,P'_U(q_j|d_i) + \lambda_L^{(r)}\,P'_L(q_j|d_i) + \lambda_{CO}^{(r)}\,P'_{CO}(q_j|d_i)$:
$$\alpha_i^{(r+1)} = \frac{\alpha_i^{(r)}\prod_{j=1}^{m} P^{(r)}(q_j|d_i)}{\sum_{i'=1}^{N}\alpha_{i'}^{(r)}\prod_{j=1}^{m} P^{(r)}(q_j|d_{i'})}$$
$$\lambda_L^{(r+1)} = \frac{1}{m}\sum_{i=1}^{N}\alpha_i^{(r)}\sum_{j=1}^{m}\frac{\lambda_L^{(r)}\,P'_L(q_j|d_i)}{P^{(r)}(q_j|d_i)} \quad (17)$$
$$\lambda_{CO}^{(r+1)} = \frac{1}{m}\sum_{i=1}^{N}\alpha_i^{(r)}\sum_{j=1}^{m}\frac{\lambda_{CO}^{(r)}\,P'_{CO}(q_j|d_i)}{P^{(r)}(q_j|d_i)}$$
$$\lambda_U^{(r+1)} = \frac{1}{m}\sum_{i=1}^{N}\alpha_i^{(r)}\sum_{j=1}^{m}\frac{\lambda_U^{(r)}\,P'_U(q_j|d_i)}{P^{(r)}(q_j|d_i)}$$
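A compact EM sketch following a simplified reading of Equations 15 to 17. The probability tables, the value of `beta`, and the iteration count are toy assumptions, not the paper's settings.

```python
import math

def em_mixture_weights(P_d, P_C, beta=0.2, iters=50):
    """EM sketch for Eqs. 15-17. P_d[i][j] maps a component name
    ('U', 'L', 'CO') to P_x(q_j|d_i); P_C[j] does the same for the
    collection model. Returns (alpha, lam) after `iters` rounds."""
    N, m = len(P_d), len(P_d[0])
    comps = ("U", "L", "CO")
    alpha = [1.0 / N] * N               # document priors alpha_i
    lam = {x: 1.0 / 3 for x in comps}   # mixture weights lambda_x

    for _ in range(iters):
        def smoothed(i, j, x):          # (1-beta) P_x(q_j|d_i) + beta P_x(q_j|C)
            return (1 - beta) * P_d[i][j][x] + beta * P_C[j][x]

        def mix(i, j):                  # full smoothed mixture P^(r)(q_j|d_i)
            return sum(lam[x] * smoothed(i, j, x) for x in comps)

        # Update alpha: each document's responsibility for the query
        like = [alpha[i] * math.exp(sum(math.log(mix(i, j)) for j in range(m)))
                for i in range(N)]
        Z = sum(like)
        alpha = [l / Z for l in like]

        # Update lambda: expected share of each component, normalized by m
        lam = {x: sum(alpha[i] * lam[x] * smoothed(i, j, x) / mix(i, j)
                      for i in range(N) for j in range(m)) / m
               for x in comps}
    return alpha, lam

# Toy run: 2 documents, a query of length 2
P_d = [[{"U": .02, "L": .01, "CO": .03}, {"U": .01, "L": .02, "CO": .01}],
       [{"U": .001, "L": .001, "CO": .002}, {"U": .002, "L": .001, "CO": .001}]]
P_C = [{"U": .005, "L": .004, "CO": .006}, {"U": .004, "L": .005, "CO": .003}]
print(em_mixture_weights(P_d, P_C))
```

The lambda update keeps the weights summing to one, since the per-term component shares sum to 1 for each query position.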
Experiments
• We evaluated the model described in the previous sections using three different TREC collections: WSJ, AP, and SJM.
Experiments (cont.)
Conclusion and future work
• In this paper, we integrate word relationships into the language modeling framework.
• We used an EM algorithm to train the parameters. This method worked well in our experiments.