Automatic Suggestion of Query-Rewrite Rules for Enterprise Search
Benny Kimelfeld (IBM Research – Almaden), Zhuowei Bao (University of Pennsylvania), Yunyao Li (IBM Research – Almaden)
SIGIR 2012, Portland, Oregon, USA
• strategy & change internal practice
• welcome to strategy
• welcome strategy
• strategy change & internal
• to strategy change internal practice
• internal management index pages
• Internal practice scip
⋮
We often get ≈ 100 candidates, sometimes ≈ 1000
Algorithm
Input: Query q, desired match (doc/URL) d
• Candidates for s: n-grams of q
• Candidates for t: n-grams of high-quality fields of d
• Candidates for s → t: (candidates for s) × (candidates for t)
• Effectiveness filter
• Classifier: natural/unnatural rules
Output: Suggested rewrite rules s → t
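The candidate-generation step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenizer, the `max_n` bound, and the order in which the filter and the classifier are applied are all assumptions. The effectiveness filter and the classifier are passed in as hypothetical predicates.

```python
from itertools import product

def ngrams(text, max_n=3):
    """Return the distinct contiguous token n-grams of `text`, for 1 <= n <= max_n."""
    tokens = text.lower().split()  # whitespace tokenization is an assumption
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram not in grams:
                grams.append(gram)
    return grams

def candidate_rules(query, doc_fields, max_n=3):
    """Pair every n-gram s of the query with every n-gram t of the document fields."""
    lhs = ngrams(query, max_n)
    rhs = []
    for field in doc_fields:
        for gram in ngrams(field, max_n):
            if gram not in rhs:
                rhs.append(gram)
    # Cross product of the two candidate sets; identity rules are dropped.
    return [(s, t) for s, t in product(lhs, rhs) if s != t]

def suggest(query, doc_fields, is_effective, is_natural, max_n=3):
    """Keep candidates accepted by both (hypothetical) predicates."""
    return [rule for rule in candidate_rules(query, doc_fields, max_n)
            if is_effective(rule) and is_natural(rule)]
```

Because the candidates are a cross product, their number grows with the product of the two n-gram counts, which fits the observation that a single query/document pair often yields about 100 candidates and sometimes about 1000.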
Classification Features
Rule: s → t
• Syntactic features
  – Whether s (resp., t) begins with a stop word
  – Whether s (resp., t) ends with a stop word
  – Number of tokens in s (resp., t)
• Corpus statistics
  – Logarithm of the frequency of s (resp., t)
  – Logarithm of the co-occurrence frequency of s and t
  – Logarithm of the frequency of s (resp., t) in titles
• Query-log statistics
  – Logarithm of the s-to-t reformulation frequency
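As a rough illustration of how the features above could be assembled for one rule s → t, here is a small sketch. The stop-word list, the feature names, the dictionary-based count tables, and the use of log(1 + count) to keep zero counts defined are all assumptions made for the example; the slides do not specify them.

```python
import math

# Illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def log_count(count):
    """Logarithm of a count, as log(1 + count) so that zero counts are defined."""
    return math.log(1 + count)

def rule_features(s, t, corpus_freq, cooccur_freq, title_freq, reformulation_freq):
    """Build a feature dictionary for the rule s -> t.

    corpus_freq and title_freq map a phrase to a count; cooccur_freq and
    reformulation_freq map an (s, t) pair to a count. Missing keys count as 0.
    Both s and t are assumed to be non-empty phrases.
    """
    features = {}
    for side, phrase in (("s", s), ("t", t)):
        tokens = phrase.lower().split()
        # Syntactic features
        features[side + "_begins_with_stop"] = float(tokens[0] in STOP_WORDS)
        features[side + "_ends_with_stop"] = float(tokens[-1] in STOP_WORDS)
        features[side + "_num_tokens"] = float(len(tokens))
        # Corpus statistics for each side
        features[side + "_log_freq"] = log_count(corpus_freq.get(phrase, 0))
        features[side + "_log_title_freq"] = log_count(title_freq.get(phrase, 0))
    # Corpus and query-log statistics for the pair
    features["log_cooccurrence"] = log_count(cooccur_freq.get((s, t), 0))
    features["log_reformulation"] = log_count(reformulation_freq.get((s, t), 0))
    return features
```

The resulting dictionary has twelve entries (five per side plus two pair-level statistics) and could be fed to whichever classifier is used to separate natural from unnatural rules.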
Classification Models
• We take an approach similar to Kraft & Zien [2004] that explored a problem of a similar flavor
• SVM: a linear classifier
• rDTLC: Decision Tree with Linear-Combination splits [Loh & Shih, 1988]
  – Bound the tree depth (3 in our implementation)
  – Use univariate splits on non-leaf nodes
• A rule can negatively affect performance on desired matches
• A rule can interfere with other rules
• Idea: Optimize rule selection
[Slide example: "excel spreadsheet", "excel symphony"]
Formal Optimization Problem
[Figure: Queries (q1, q2, q3, ..., qn) are linked through Rewrite Rules (edge labels r1, r2, r3, r4, r6, r8, r9) to Rewritten queries (p1, p2, p3, ..., pn), which are linked to Documents with Scores (s1, ..., s6).]
$\mathrm{desired}(q_i)$: desired doc. matches for $q_i$
$\mathrm{top}_k(q_i)$: the $k$ docs. reachable w/ highest score
Goal: Find a subset of the rewrite rules that maximizes
$$\sum_{i=1}^{n} \mathrm{quality}\bigl(\mathrm{desired}(q_i),\ \mathrm{top}_k(q_i)\bigr)$$
where $\mathrm{quality}(\cdot,\cdot)$ is the quality measure per $q_i$.
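To make the objective concrete, the sketch below evaluates it for a given subset of rules and then searches all subsets exhaustively. It is only an illustration under stated assumptions: each query is a hypothetical dictionary with a `desired` set and a `matches` table that lists, per rule (or `None` for the unrewritten query), the documents reachable and their scores; precision@k stands in for the quality measure; and ties in score are broken by document name.

```python
from itertools import combinations

def top_k(matches_by_rule, selected, k):
    """The k documents reachable with the highest score under the selected rules."""
    best = {}
    for rule, matches in matches_by_rule.items():
        if rule is None or rule in selected:  # None = the original, unrewritten query
            for doc, score in matches:
                if score > best.get(doc, float("-inf")):
                    best[doc] = score
    ranked = sorted(best, key=lambda d: (-best[d], d))
    return ranked[:k]

def precision_at_k(desired, ranked, k):
    """Fraction of the top-k slots filled by desired documents."""
    return sum(1 for d in ranked if d in desired) / k

def objective(queries, selected, k):
    """Sum of the per-query quality measure over all queries."""
    return sum(precision_at_k(q["desired"], top_k(q["matches"], selected, k), k)
               for q in queries)

def best_subset(queries, rules, k):
    """Exhaustive search over all subsets of rules (exponential in len(rules))."""
    best_rules = frozenset()
    best_value = objective(queries, best_rules, k)
    for size in range(1, len(rules) + 1):
        for subset in combinations(rules, size):
            value = objective(queries, frozenset(subset), k)
            if value > best_value:  # strict: the first, smallest optimum is kept
                best_rules, best_value = frozenset(subset), value
    return best_rules, best_value
```

The search visits every one of the 2^m subsets of m rules, so it is practical only for very small rule sets. It also shows how rules interfere: adding a rule that brings a high-scoring undesired document into the top k for one query can lower the total even though the rule helps another query.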
Hardness & Heuristics
Theorem:
• Finding an optimal set of rules is NP-hard
• So is finding any constant-factor approx.
• Holds already for k=1
• Holds for every quality measure (e.g., DCG, precision@k, etc.), assuming a very basic well-behavior property
• Reduction from maximum independent set
We propose 2 simple heuristic algorithms: