The Cluster Hypothesis: Ranking Document Clusters Oren Kurland Faculty of Industrial Engineering and Management Technion * Based on joint work with Fiana Raiber * This work has been supported by, and carried out, at the Technion-Microsoft Electronic Commerce Research Center
Transcript
Query = “oren kurland dblp”
Search #1 Search #2
Is search a solved problem?
The ad hoc retrieval task
Rank documents in a corpus by their relevance to the information need expressed by a query
o Term weighting scheme: TF.IDF
o TF: the number of occurrences of a term in the document
o IDF: the inverse of the document frequency of the term
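The TF.IDF scheme above can be sketched in a few lines; the toy corpus and whitespace tokenization here are illustrative assumptions, not part of the talk:

```python
# Minimal TF.IDF sketch over a toy corpus (illustrative data).
import math
from collections import Counter

corpus = [
    "the cluster hypothesis in information retrieval",
    "ranking document clusters for ad hoc retrieval",
    "the query is matched against each document",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]           # occurrences of the term in the document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(N / df) if df else 0.0    # inverse document frequency
    return tf * idf

print(tf_idf("cluster", docs[0]))
```

A term that appears in every document gets IDF log(1) = 0, so ubiquitous terms contribute nothing to the weight.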
Web search engines
o Use a variety of relevance signals
o The textual similarity between the page and the query (query-dependent)
o The textual similarity between the query and the anchor text of pages that point to the page (query-dependent)
o The PageRank score of the page (query-independent)
o The clickthrough rate for the page (query-independent)
o …
o Learning-to-rank (Liu ’09)
The document-query similarity
o Relevance is determined based on whether the document content satisfies the information need expressed by the query
o The document-query similarity is among the most important features for ranking pages in Web search engines (Liu ’09)
Back to classical information retrieval?
The cluster hypothesis
“Closely associated documents tend to be relevant to the same requests”
(Jardine&van Rijsbergen ’71,van Rijsbergen ’79)
Operational consequence:
Relevant documents should be more similar to each other than to non-relevant documents
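This operational consequence can be checked directly: under the hypothesis, the mean pairwise similarity among relevant documents should exceed their mean similarity to non-relevant ones. A toy sketch with made-up term vectors (all data here is illustrative):

```python
# Toy check of the cluster hypothesis' operational consequence:
# intra-relevant similarity vs. relevant/non-relevant similarity.
import math
from itertools import combinations, product

relevant = [(1.0, 0.9, 0.0), (0.9, 1.0, 0.1)]       # made-up term vectors
non_relevant = [(0.0, 0.1, 1.0), (0.1, 0.0, 0.9)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

intra = [cosine(u, v) for u, v in combinations(relevant, 2)]
cross = [cosine(u, v) for u, v in product(relevant, non_relevant)]
mean = lambda xs: sum(xs) / len(xs)
print(mean(intra) > mean(cross))
```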
Cluster-based results interface
o Clustering the pages
o Automatically labeling the clusters
o Finding the highest quality clusters
Cluster-based document retrieval

Query → initial list of documents (produced by a document ranking method) → set of clusters (produced by a clustering method) → ranking of clusters (produced by a cluster ranking method) → ranking of documents (each cluster is replaced with its documents)
The optimal cluster problem (Kurland&Domshlak '08)

[Bar chart: p@5 of doc-query similarity ranking, query expansion, and an oracle experiment]
The cluster ranking task
o Estimate the probability that cluster C is relevant to query Q:

  p(C|Q) = p(C,Q) / p(Q), which is rank-equivalent to p(C,Q)

o Estimate p(C,Q) using Markov Random Fields
Markov Random Fields
o Define a graph G:
  o Nodes ‒ random variables representing Q and C's documents
  o Edges ‒ dependencies between the variables

  p(C,Q) = (1/Z) ∏_{l ∈ L(G)} ψ_l(l)

o L(G) ‒ the set of cliques in G
o l ‒ a clique
o ψ_l ‒ a potential defined over l
o Z ‒ a normalization factor
o A common instantiation of potential functions: ψ_l(l) ≝ exp(λ_l f_l(l))
  o f_l(l) ‒ a feature function defined over l
  o λ_l ‒ the weight associated with f_l(l)
ClustMRF
o A linear (in feature functions) cluster ranking function that depends on the graph G:

  p(C,Q) is rank-equivalent to ∑_{l ∈ L(G)} λ_l f_l(l)

o Next:
  o Determine the clique set L(G)
  o Associate feature functions with cliques
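Taking the log of the MRF joint and dropping rank-independent terms leaves a weighted sum of clique feature values; a minimal sketch, where the feature names, weights, and values are illustrative rather than the learned ClustMRF parameters:

```python
# Sketch of a linear-in-features cluster score: sum_{l in L(G)} lambda_l * f_l(l).
def clustmrf_score(feature_values, weights):
    """Score a cluster as a weighted sum of its clique feature values."""
    return sum(weights[name] * value for name, value in feature_values.items())

# Illustrative weights (lambda_l) and feature values (f_l(l)) for one cluster.
weights = {"geo-qsim": 0.5, "max-qsim": 0.3, "min-pagerank": 0.2}
features = {"geo-qsim": -1.2, "max-qsim": -0.8, "min-pagerank": -2.0}
print(clustmrf_score(features, weights))
```

Clusters are then ranked by this score; the weights would in practice be learned with a learning-to-rank method.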
The l_QD clique
o Contains the query Q and a single document d in cluster C
o Considers query-similarity values of C's documents independently

[Diagram: Q connected to each of d1, d2, d3 by a separate l_QD clique]

  f_geo-qsim(l_QD) ≝ (1/|C|) log sim(Q,d)   (cf., Liu&Croft '08)

o sim(·,·) ‒ an inter-text similarity measure
o |C| ‒ the number of documents in C
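Summing this per-document feature over the cluster recovers the log of the geometric mean of the query-similarity values, which is the point of the (1/|C|) factor. A sketch with made-up similarity values:

```python
# Sketch of the l_QD feature: f_geo-qsim = (1/|C|) * log sim(Q, d).
import math

def f_geo_qsim(sim_q_d, cluster_size):
    return math.log(sim_q_d) / cluster_size

sims = [0.5, 0.25, 0.125]   # sim(Q, d) for d1, d2, d3 (illustrative values)
total = sum(f_geo_qsim(s, len(sims)) for s in sims)
geo_mean = math.exp(total)  # exponentiating the sum gives the geometric mean
print(geo_mean)
```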
The l_QC clique
o Contains the query Q and all of C's documents
o Induces information from relations between query-similarity values of C's documents

[Diagram: Q connected to d1, d2, d3 jointly by the l_QC clique]

  f_A-qsim(l_QC) ≝ A({log sim(Q,d)}_{d ∈ C}),  A ∈ {min, max, stdv}
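These aggregates are straightforward to compute; a sketch with made-up similarity values (the choice of population vs. sample standard deviation is an assumption here):

```python
# Sketch of the l_QC features: aggregate the cluster's log query-similarity
# values with A in {min, max, stdv}.
import math
import statistics

sims = [0.5, 0.25, 0.125]                  # sim(Q, d) for each d in C (toy values)
log_sims = [math.log(s) for s in sims]

f_min_qsim = min(log_sims)
f_max_qsim = max(log_sims)
f_stdv_qsim = statistics.pstdev(log_sims)  # spread of query similarities in C
print(f_min_qsim, f_max_qsim, f_stdv_qsim)
```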
The l_C clique
o Contains only C's documents
o Induces information based on query-independent properties of C's documents (e.g., PageRank score, ratio of stopwords to non-stopwords)

[Diagram: d1, d2, d3 connected by the l_C clique]

  f_A-P(l_C) ≝ A({log P(d)}_{d ∈ C}),  A ∈ {min, max, geometric mean}

o P ‒ a query-independent document (quality) measure
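The same aggregation pattern applies to any query-independent measure P; a sketch with made-up PageRank-like values (note that averaging in log space is the log of the geometric mean):

```python
# Sketch of the l_C features: aggregate log values of a query-independent
# document quality measure P with A in {min, max, geometric mean}.
import math

pagerank = [0.01, 0.04, 0.02]      # P(d) for each d in C (illustrative values)
log_p = [math.log(p) for p in pagerank]

f_min = min(log_p)
f_max = max(log_p)
f_geo = sum(log_p) / len(log_p)    # mean of logs = log of the geometric mean
print(f_min, f_max, f_geo)
```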
Empirical evaluation
Query → initial list of 50 documents (Markov Random Field (MRF); Metzler&Croft '05) → set of clusters (nearest-neighbor clustering; Griffiths et al. '86) → ranking of clusters (ClustMRF) → ranking of documents (each cluster is replaced with its documents)
Comparison with the initial ranking
■ Init: Markov Random Field (Metzler&Croft '05)
■ ClustMRF (our algorithm)
♦ Statistically significant differences with ClustMRF

[Bar charts: MAP on ClueWebB and GOV2]
Comparison with other cluster ranking methods
■ AMean: Arithmetic mean of query similarity values (Liu&Croft ’08)
■ GMean: Geometric mean of query-similarity values (Liu&Croft '08, Seo&Croft '10)
■ ClustRanker: Uses measures of document and cluster biases (Kurland ’08)
■ ClustMRF (our algorithm)
♦ Statistically significant differences with ClustMRF

[Bar charts: MAP on ClueWebB and GOV2]
Comparison with automatic query expansion
■ RM3 (Abdul-Jaleel et al. '04)
■ ClustMRF (our algorithm)
♦ Statistically significant differences with ClustMRF

[Bar charts: MAP on GOV2 and ClueWebB]
Diversifying search results
o MMR (Carbonell&Goldstein '98) and xQuAD (Santos et al. '10) iteratively re-rank the initial list
o In each iteration a document is scored by:
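For MMR, the per-iteration score trades off query similarity against redundancy with already-selected documents; a sketch of that criterion, with made-up similarity values and an illustrative lambda setting:

```python
# Sketch of MMR's per-iteration scoring (Carbonell & Goldstein '98):
# lam * sim(d, Q) - (1 - lam) * max_{d' in selected} sim(d, d').
def mmr_score(doc, query_sim, selected, doc_sim, lam=0.5):
    redundancy = max((doc_sim(doc, s) for s in selected), default=0.0)
    return lam * query_sim(doc) - (1 - lam) * redundancy

# Toy similarities: d0 and d1 are near-duplicates, d2 is diverse.
qsim = {"d0": 0.9, "d1": 0.85, "d2": 0.6}
dsim = {("d1", "d0"): 0.95, ("d2", "d0"): 0.1}
score = lambda d, S: mmr_score(d, qsim.get, S, lambda a, b: dsim.get((a, b), 0.0))

selected = ["d0"]  # first pick: the highest query similarity
# Second iteration: the diverse d2 now outscores the redundant d1.
print(score("d1", selected), score("d2", selected))
```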