A Markov Random Field Model for Term Dependencies
Donald Metzler, W. Bruce Croft
Presented by Chia-Hao Lee
Dec 30, 2015
Outline
• Introduction
• Model
– Overview
– Variants
– Potential Functions
– Training
• Experimental Results
• Conclusions
Introduction
• There is a rich history of statistical models for information retrieval, including the binary independence model (BIM), language modeling, the inference network model, and so on.
• It is well known that dependencies exist between terms in a collection of text.
• For example, within a SIGIR proceedings, occurrences of certain pairs of terms are correlated, such as information and retrieval.
Introduction
• Unfortunately, estimating statistical models for general term dependencies is infeasible, due to data sparsity.
• For this reason, most retrieval models assume some form of independence exists between terms.
• Most work on modeling term dependencies in the past has focused on phrases/proximity or term co-occurrences. Most of these models only consider dependencies between pairs of terms.
• Several recent studies have examined term dependence models for the language modeling framework.
Model
• Markov random fields (MRF), also called undirected graphical models, are commonly used in the statistical machine learning domain to succinctly model joint distributions.
• We use MRFs to model the joint distribution $P_\Lambda(Q, D)$ over queries Q and documents D, parameterized by $\Lambda$.
Model
• A Markov random field is constructed from a graph G.
• The nodes in the graph represent random variables, and the edges define the independence semantics between the random variables.
• In this model, we assume G consists of query nodes $Q = q_1, \ldots, q_n$ and a document node D, such as the graphs in the figure.

$$P_\Lambda(Q, D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c; \Lambda)$$

$C(G)$: the set of cliques in G
$\psi(\cdot\,; \Lambda)$: a non-negative potential function over clique configurations, parameterized by $\Lambda$
$Z_\Lambda = \sum_{Q,D} \prod_{c \in C(G)} \psi(c; \Lambda)$: normalizes the distribution
Model
• For ranking purposes we compute the posterior:

$$P_\Lambda(D \mid Q) = \frac{P_\Lambda(Q, D)}{P_\Lambda(Q)} \stackrel{rank}{=} \log P_\Lambda(Q, D) - \log P_\Lambda(Q) \stackrel{rank}{=} \sum_{c \in C(G)} \log \psi(c; \Lambda)$$

• As noted above, all potential functions must be non-negative, and are most commonly parameterized as:

$$\psi(c; \Lambda) = \exp[\lambda_c f(c)]$$

$f(c)$: a real-valued feature function over clique values
$\lambda_c$: the weight given to that particular feature function
Model
• Substituting this back into the ranking function, we end up with the following ranking function:

$$P_\Lambda(D \mid Q) \stackrel{rank}{=} \sum_{c \in C(G)} \lambda_c f(c) \qquad (1)$$

• To utilize the model, the following steps must be taken for each query Q:
– Construct a graph representing the query term dependencies to model
– Define a set of potential functions over the cliques of this graph
– Rank documents in descending order of $P_\Lambda(D \mid Q)$
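The three steps above reduce ranking to a weighted feature sum over cliques. A minimal sketch of Equation 1 in Python, assuming features have already been extracted per clique; the weights and feature values here are illustrative, not from the paper:

```python
# Sketch of ranking by Equation 1: score(D) = sum over cliques c of lambda_c * f(c).
# Feature extraction is stubbed out; each document maps to (clique type, f(c)) pairs.

def mrf_score(clique_features, weights):
    """Sum lambda_c * f(c) over a document's cliques, keyed by clique type."""
    return sum(weights[kind] * value for kind, value in clique_features)

def rank(docs, weights):
    """Return document ids in descending order of MRF score."""
    return sorted(docs, key=lambda d: mrf_score(docs[d], weights), reverse=True)

weights = {"T": 0.8, "O": 0.1, "U": 0.1}  # lambda_T, lambda_O, lambda_U (made up)
docs = {
    "d1": [("T", -1.2), ("T", -0.7), ("O", -2.0)],
    "d2": [("T", -1.5), ("T", -1.5), ("O", -1.0)],
}
print(rank(docs, weights))  # d1 scores -1.72, d2 scores -2.50
```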
Model
• We now describe and analyze three variants of the MRF model, each with different underlying dependence assumptions.– Full independence (FI)– Sequential dependence (SD)– Full dependence (FD)
Model
• The full independence variant makes the assumption that query terms are independent given some document D.
• The likelihood of a query term $q_i$ occurring is not affected by the occurrence of any other query term, or more succinctly, $P(q_i \mid D, q_{j \ne i}) = P(q_i \mid D)$.
• The sequential dependence variant assumes a dependence between neighboring query terms.
• Formally, this assumption states that $P(q_i \mid D, q_j) = P(q_i \mid D)$ only for nodes $q_j$ that are not adjacent to $q_i$.
Model
• In the full dependence variant, all query terms are in some way dependent on each other.
• Graphically, a query of length n translates into the complete graph $K_{n+1}$, which includes edges from each query node to the document node D.
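The three variants differ only in which term sets (joined with the document node D) form cliques. A simplified sketch that enumerates just the term sets; under full dependence the query graph is complete, so every subset of two or more terms qualifies:

```python
from itertools import combinations

def query_cliques(terms, variant):
    """Term sets that, together with the document node D, form cliques
    under each dependence assumption ("FI", "SD", or "FD")."""
    cliques = [(t,) for t in terms]  # single-term cliques appear in all variants
    if variant == "SD":
        # sequential dependence: edges only between neighboring query terms
        cliques += [tuple(terms[i:i + 2]) for i in range(len(terms) - 1)]
    elif variant == "FD":
        # full dependence: complete graph, so every subset of 2+ terms
        for k in range(2, len(terms) + 1):
            cliques += combinations(terms, k)
    return cliques

q = ["train", "station", "security"]
print(len(query_cliques(q, "FI")), len(query_cliques(q, "SD")),
      len(query_cliques(q, "FD")))  # 3 5 7
```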
Model
• The potential functions φ play a very important role in how accurate our approximation of the true joint distribution is.
• For example: consider a document D on the topic of information retrieval. Using the sequential dependence variant, we would expect $\psi(\textit{information}, \textit{retrieval}, D) > \psi(\textit{information}, \textit{assurance}, D)$, as the terms information and retrieval are much more "compatible" with the topicality of document D than the terms information and assurance.
Model
• Since documents are ranked by Equation 1, it is also important that the potential functions can be computed efficiently.
• Based on these criteria and previous research on phrases and term dependence, we focus on three types of potential functions.
• These potential functions attempt to abstract the idea of term co-occurrence.
Model
• Since potentials are over cliques in the graph, we now proceed to enumerate all of the possible ways graph cliques are formed in our model and how potential functions are defined for each.
• The simplest type of clique that can appear in our graph is a 2-clique consisting of an edge between a query term $q_i$ and the document D.
Model
• In keeping with simple-to-compute measures, we define this potential as:

$$\psi_T(c) = \lambda_T \log P(q_i \mid D) = \lambda_T \log\left[(1 - \alpha_D)\frac{tf_{q_i,D}}{|D|} + \alpha_D \frac{cf_{q_i}}{|C|}\right]$$

$P(q_i \mid D)$: a smoothed language modeling estimate
$tf_{w,D}$: the number of times term w occurs in document D
$cf_w$: the number of times term w occurs in the entire collection
$|D|$: the total number of terms in document D
$|C|$: the length of the collection
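A sketch of $\psi_T$ with a fixed mixing weight $\alpha$; the slide leaves the smoothing to the language modeling estimate, so treating $\alpha_D$ as a constant (Jelinek-Mercer-style interpolation) is an assumption here:

```python
import math

def term_potential(tf_qd, doc_len, cf_q, coll_len, lam_t=1.0, alpha=0.5):
    """psi_T(c) = lambda_T * log[(1 - alpha_D) * tf/|D| + alpha_D * cf/|C|].
    alpha is a fixed smoothing weight here, an illustrative simplification."""
    p_q_given_d = (1 - alpha) * tf_qd / doc_len + alpha * cf_q / coll_len
    return lam_t * math.log(p_q_given_d)

# A term occurring twice in a 100-term document, 50 times in a 10,000-term collection:
print(term_potential(2, 100, 50, 10_000))
```

Note the collection-frequency term keeps the potential finite (and non-zero inside the log) even when the term never occurs in the document.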
Model
• Next, we consider cliques that contain two or more query terms.
• For example: in the query train station security measures, if any of the sub-phrases train station, train station security, station security measures, or security measures appear in a document, then there is strong evidence in favor of relevance.
Model
• Therefore, for every clique that contains a contiguous set of two or more terms $q_i, \ldots, q_{i+k}$ and the document node D, we apply the following "ordered" potential function:

$$\psi_O(c) = \lambda_O \log P(\#1(q_i, \ldots, q_{i+k}) \mid D) = \lambda_O \log\left[(1 - \alpha_D)\frac{tf_{\#1(q_i, \ldots, q_{i+k}),D}}{|D|} + \alpha_D \frac{cf_{\#1(q_i, \ldots, q_{i+k})}}{|C|}\right]$$

$tf_{\#1(q_i, \ldots, q_{i+k}),D}$: the number of times the exact phrase $q_i, \ldots, q_{i+k}$ occurs in document D
$cf_{\#1(q_i, \ldots, q_{i+k})}$: the number of times the exact phrase occurs in the entire collection
$|D|$: the total number of terms in document D
$|C|$: the length of the collection
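The ordered potential needs the exact-phrase count $tf_{\#1(\cdot),D}$. A straightforward sketch over a tokenized document:

```python
def tf_exact_phrase(doc_terms, phrase):
    """Number of positions where the phrase occurs contiguously and in order."""
    n = len(phrase)
    return sum(doc_terms[i:i + n] == phrase
               for i in range(len(doc_terms) - n + 1))

doc = "train station security at the train station".split()
print(tf_exact_phrase(doc, ["train", "station"]))  # 2
```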
Model
• Although the occurrence of contiguous sets of query terms provides strong evidence of relevance, it is also the case that the occurrence of non-contiguous sets of query terms can provide valuable evidence.
• In the previous example, documents containing the terms train and security within some short proximity of one another also provide additional evidence towards relevance.
Model
• For our purposes, we construct an "unordered" potential function over cliques that consist of sets of two or more query terms $q_i, \ldots, q_j$ and the document node D. Such potential functions have the following form:

$$\psi_U(c) = \lambda_U \log P(\#uwN(q_i, \ldots, q_j) \mid D) = \lambda_U \log\left[(1 - \alpha_D)\frac{tf_{\#uwN(q_i, \ldots, q_j),D}}{|D|} + \alpha_D \frac{cf_{\#uwN(q_i, \ldots, q_j)}}{|C|}\right]$$

$tf_{\#uwN(q_i, \ldots, q_j),D}$: the number of times the terms $q_i, \ldots, q_j$ appear, ordered or unordered, within a window of N terms in document D
$cf_{\#uwN(q_i, \ldots, q_j)}$: the number of times the terms appear within such a window in the entire collection
$|D|$: the total number of terms in document D
$|C|$: the length of the collection
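For the unordered potential, $tf_{\#uwN(\cdot),D}$ counts co-occurrences within a window of N terms. A simplified sketch that counts every window position containing all the terms; real systems such as Indri use a slightly different matching convention for `#uwN`, so this is illustrative only:

```python
def tf_unordered_window(doc_terms, terms, window):
    """Number of length-`window` spans containing all query terms, in any order."""
    need = set(terms)
    return sum(need <= set(doc_terms[i:i + window])
               for i in range(len(doc_terms) - window + 1))

doc = "security checks at the train station".split()
print(tf_unordered_window(doc, ["train", "security"], 6))  # 1
print(tf_unordered_window(doc, ["train", "security"], 2))  # 0
```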
Model
• Using these potential functions, we derive the following specific ranking function:

$$P_\Lambda(D \mid Q) \stackrel{rank}{=} \sum_{c \in T} \lambda_T f_T(c) + \sum_{c \in O} \lambda_O f_O(c) + \sum_{c \in O \cup U} \lambda_U f_U(c)$$

where T is the set of 2-cliques (a query term and D), O the set of cliques of contiguous query terms with D, and U the set of cliques of non-contiguous query terms with D.
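Concretely, the specific ranking function is a linear combination of three feature sums, one per clique type. A sketch with made-up log-probability feature values; the model ties one weight to each clique type rather than learning a weight per clique:

```python
def full_score(f_t, f_o, f_u, lam_t, lam_o, lam_u):
    """sum_T lambda_T f_T(c) + sum_O lambda_O f_O(c) + sum_{O u U} lambda_U f_U(c).
    Each list holds the log-probability feature values for one clique type."""
    return lam_t * sum(f_t) + lam_o * sum(f_o) + lam_u * sum(f_u)

# Illustrative feature values from psi_T, psi_O, psi_U for one document:
print(full_score([-1.0, -2.0], [-3.0], [-4.0], 0.8, 0.1, 0.1))  # about -3.1
```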
Experimental Results
• We make use of the Associated Press and Wall Street Journal sub-collections of TREC, which are small homogeneous collections, and two web collections, WT10g and GOV2, which are considerably larger and less homogeneous.
Experimental Results
• Full independence
Experimental Results
• Sequential dependence
Experimental Results
• Full dependence
Conclusions
• In this paper, we develop a general term dependence model that can make use of arbitrary text features.
• Three variants of the model are described, each of which captures different dependencies between query terms.
Markov Random Fields
• Let $X_1, \ldots, X_n$ be random variables taking values in some finite set S, and let $G = (N, E)$ be a finite graph such that $N = \{1, \ldots, n\}$, whose elements will sometimes be called sites.
• For a set $A \subset N$, let $\partial A$ define its neighbor (or boundary) set: all elements in $N \setminus A$ that have a neighbor in A. For $i \in N$, let $\partial i = \partial\{i\}$.
• The random variables are said to define a Markov random field if, for any vector $x \in S^N$:

$$\Pr(X_i = x_i \mid X_j = x_j,\; j \in N \setminus \{i\}) = \Pr(X_i = x_i \mid X_j = x_j,\; j \in \partial i)$$
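The defining property can be checked numerically on a tiny example. A sketch for a 3-site chain 1–2–3 with pairwise agreement potentials (all choices here are illustrative): the conditional of $X_1$ given everything else should depend only on its neighbor $X_2$:

```python
import math
from itertools import product

# Joint over a chain 1-2-3: P(x) proportional to exp(-V(x1,x2) - V(x2,x3)),
# with a pairwise potential that favors agreement between neighbors.
V = lambda a, b: 0.0 if a == b else 1.0
states = [0, 1]
joint = {x: math.exp(-(V(x[0], x[1]) + V(x[1], x[2])))
         for x in product(states, repeat=3)}
Z = sum(joint.values())
joint = {x: p / Z for x, p in joint.items()}

def cond_x1(x1, x2, x3=None):
    """Pr(X1 = x1 | X2 = x2), or Pr(X1 = x1 | X2 = x2, X3 = x3) if x3 is given."""
    num = sum(p for x, p in joint.items()
              if x[0] == x1 and x[1] == x2 and (x3 is None or x[2] == x3))
    den = sum(p for x, p in joint.items()
              if x[1] == x2 and (x3 is None or x[2] == x3))
    return num / den

# Markov property: conditioning on the non-neighbor X3 changes nothing.
print(math.isclose(cond_x1(0, 0, x3=0), cond_x1(0, 0)))  # True
print(math.isclose(cond_x1(0, 0, x3=1), cond_x1(0, 0)))  # True
```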
Potentials
• A potential is a function indexed by subsets of N on the space $S^N$. We will write potentials as $V_A(w)$ for $A \subset N$, $w \in S^N$.
• Given a full set of potentials, the energy of a configuration w will be defined as:

$$U(w) = \sum_{A \subset N} V_A(w)$$

• Using the energy, we can define a probability measure P from a set of potentials by:

$$P(w) = \frac{\exp(-U(w))}{Z}, \qquad Z = \sum_{w \in S^N} \exp(-U(w))$$
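The energy-to-probability construction can be sketched by brute force over all configurations, which is feasible only when $S^N$ is tiny; the potentials below are illustrative:

```python
import math
from itertools import product

def gibbs_measure(potentials, n_sites, states):
    """P(w) = exp(-U(w)) / Z with U(w) = sum_A V_A(w), enumerating all of S^N.
    Each potential is a function of the full configuration tuple w."""
    configs = list(product(states, repeat=n_sites))
    energy = {w: sum(V(w) for V in potentials) for w in configs}
    Z = sum(math.exp(-u) for u in energy.values())
    return {w: math.exp(-energy[w]) / Z for w in configs}

# Two binary sites with one pairwise potential that favors agreement:
P = gibbs_measure([lambda w: 0.0 if w[0] == w[1] else 1.0], 2, [0, 1])
print(P[(0, 0)] > P[(0, 1)])  # True: agreeing configurations get lower energy
```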