Optimized Interleaving for Online Retrieval Evaluation (Best paper in WSDM'13). Authors: Filip Radlinski, Nick Craswell. Slides by: Han Jiang
Optimized Interleaving for Online Retrieval Evaluation
(Best paper in WSDM'13)
Authors: Filip Radlinski, Nick Craswell
Slides by: Han Jiang
Agenda
Basic concepts
Previous algorithms
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Basic concepts
What is interleaving?
Merge results from different retrieval algorithms.
Only a combined list is shown to the user.
The quality of the algorithms can be inferred with the help of clickthrough data.
[Diagram: a query goes to Search Engines A and B, producing Source Lists A and B; the Interleaving Algorithm merges them into the interleaved list shown to the user; the user's clicks, together with the assignment and a credit function, produce the evaluation result.]
Basic concepts +
OK, then toss a coin instead, and…
Credit function: if d_i is clicked and ranked higher in ranker A, prefer A.
Intuitively*, a good interleaving algorithm should:
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality)
Not substantially alter the search experience
Lead to clicks that reflect the user's preference
Agenda
Basic concepts √
Previous algorithms
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Previous Algorithms
Balanced Interleaving: toss a coin once, then pick the best items by turns.
Team Draft Interleaving: toss a coin every two picks; the winner picks its best item first.
Probabilistic Interleaving: toss a coin every time; sample an item from the winner.
A weight function ensures that a document at a higher rank has a higher probability of being picked.
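Team Draft interleaving can be sketched in a few lines. This is my own illustrative implementation (function and variable names are not from the paper): each round a coin toss decides pick order, and each ranker contributes its best not-yet-shown document, recording which "team" owns it for later click credit.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=random):
    """Team Draft sketch: per round, a coin toss decides which ranker
    picks first; each picks its best document not yet shown."""
    merged, team = [], {}
    ia = ib = 0
    while len(merged) < length and (ia < len(ranking_a) or ib < len(ranking_b)):
        order = ('A', 'B') if rng.random() < 0.5 else ('B', 'A')
        for side in order:
            if len(merged) >= length:
                break
            src, i = (ranking_a, ia) if side == 'A' else (ranking_b, ib)
            while i < len(src) and src[i] in merged:
                i += 1                       # skip documents already shown
            if i < len(src):
                merged.append(src[i])
                team[src[i]] = side          # clicks on this doc credit `side`
                i += 1
            if side == 'A':
                ia = i
            else:
                ib = i
    return merged, team

merged, team = team_draft_interleave(['d1', 'd2', 'd3', 'd4'],
                                     ['d4', 'd1', 'd2', 'd3'], length=4)
```

Credit is then simple: each click counts for the team that contributed the clicked document.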
Previous Algorithms +
About credit functions: only documents that are clicked by the user are counted.
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1, d3
Both clicked documents (d1 and d3) are ranked higher in A than in B,
so A wins with p = 100%.
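The bias in this example can be checked by enumeration. A minimal sketch (my own code, assuming the simple credit rule stated earlier: each click credits the input ranker that ranked the clicked document higher):

```python
def balanced_interleave(a, b, a_leads):
    """Balanced interleaving sketch: one coin toss picks the leading
    ranker; the ranker that is behind contributes its next unseen doc."""
    merged, ka, kb = [], 0, 0
    while len(merged) < len(a):
        if ka < kb or (ka == kb and a_leads):
            while a[ka] in merged:
                ka += 1
            merged.append(a[ka]); ka += 1
        else:
            while b[kb] in merged:
                kb += 1
            merged.append(b[kb]); kb += 1
    return merged

A = ['d1', 'd2', 'd3', 'd4']
B = ['d4', 'd1', 'd2', 'd3']

# Enumerate both coin outcomes and a user clicking one shown document
# uniformly at random; credit the ranker that ranked the click higher.
wins = {'A': 0, 'B': 0}
for a_leads in (True, False):
    for clicked in balanced_interleave(A, B, a_leads):
        wins['A' if A.index(clicked) < B.index(clicked) else 'B'] += 1
# wins == {'A': 6, 'B': 2}: a purely random clicker "prefers" A 75% of the time
```

Whichever way the coin lands, only d4 can ever credit B, so a random clicker systematically favors A: that is the bias.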
Agenda
Basic concepts √
Previous algorithms √
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Invert the problem
Why previous algorithms are not good enough:
Balanced interleaving & Team Draft interleaving: biased.
Probabilistic interleaving: degrades the user experience.
Even a random click on a document produces a winner.
e.g. A=(d1, d2), B=(d1, d2), but M = (d2, d1)
Therefore, the problem of interleaving should be more constrained.
A good way is to start from the principles…
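The degradation is easy to quantify. Probabilistic interleaving has the toss winner sample from a rank-based distribution (the paper uses weights proportional to 1/rank^τ with τ = 3), so even when the rankers agree that A = B = (d1, d2), the flipped list M = (d2, d1) is shown with nonzero probability. A sketch of that calculation (my own helper names):

```python
TAU = 3  # rank-decay parameter of probabilistic interleaving (p ∝ 1/rank**TAU)

def pick_probability(rank, n_remaining):
    """Probability that the toss winner samples the document at the given
    1-based rank among its n_remaining candidates."""
    weights = [1.0 / r ** TAU for r in range(1, n_remaining + 1)]
    return (1.0 / rank ** TAU) / sum(weights)

# Both rankers return (d1, d2).  Whoever wins the toss, position 1 is
# sampled, so the flipped list M = (d2, d1) appears with probability:
p_flip = pick_probability(rank=2, n_remaining=2)   # (1/8) / (1 + 1/8) = 1/9
```

Roughly one query in nine shows the user the reversed list even though both rankers agree on the order.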
Again, what is a good interleaving algorithm?
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality)
Not substantially alter the search experience
Lead to clicks that reflect the user's preference
Refine the problem
Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two)
Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn't create a preference for either ranker
Be sensitive to input data (the fewest user queries show a significant preference)
Again, what is a good interleaving algorithm?
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality)
Refine the problem +
Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two)
Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn't create a preference for either ranker
Be sensitive to input data (the fewest user queries show a significant preference)
Refine the problem ++
Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two)
Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit
A=(d1, d2), B=(d1, d2), M = (d1, d2)
A randomly clicking user doesn't create a preference for either ranker:
∀ k ∈ {1, …, len}:  Σ_L p_L Σ_{i=1}^{k} δ(L_i) = 0
where k is the number of clicks (a random user clicks on a top-k prefix), δ is the score function (when > 0 it assigns credit to A, otherwise to B), len is the length of the list, and L ranges over the possible interleaved lists under the previous constraints.
Be sensitive to input data (the fewest user queries show a significant preference)
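The "randomly clicking user" requirement is directly checkable: for every prefix length k, a user who clicks uniformly on the top-k prefix must accumulate zero expected credit. A sketch (my own code) using the inverse-rank credit δ(d) = 1/rank_A(d) − 1/rank_B(d), one of the credit functions considered in the paper:

```python
def delta(doc, A, B):
    """Inverse-rank credit: positive values favour ranker A."""
    return 1.0 / (A.index(doc) + 1) - 1.0 / (B.index(doc) + 1)

def is_unbiased(lists_with_probs, A, B, tol=1e-9):
    """For every prefix length k, a user clicking uniformly on the top-k
    prefix must accumulate zero expected credit over the shown lists."""
    for k in range(1, len(A) + 1):
        expected = sum(p * sum(delta(L[i], A, B) for i in range(k))
                       for L, p in lists_with_probs)
        if abs(expected) > tol:
            return False
    return True

A, B = ['d1', 'd2'], ['d2', 'd1']
ok = is_unbiased([(['d1', 'd2'], 0.5), (['d2', 'd1'], 0.5)], A, B)   # True
bad = is_unbiased([(['d1', 'd2'], 1.0)], A, B)   # False: always showing A is biased
```

Showing each input ranking half the time passes the check; always showing A fails it at k = 1.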
Refine the problem +++
Refine the problem ++++
So the constraint is: ∀ k: Σ_L p_L Σ_{i=1}^{k} δ(L_i) = 0, together with Σ_L p_L = 1 and p_L ≥ 0.
And the target is: maximize Σ_L p_L · sensitivity(L).
With variable: the definition of the credit function δ (the probabilities p_L then come out of the optimization).
Since it is an optimization problem, the existence of a solution should be guaranteed theoretically, while in the paper it is only guaranteed empirically.
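Putting the pieces together, the framework is a small linear program: the variables are the probabilities p_L over the allowed lists, the constraints are the zero-expected-credit conditions plus Σ p_L = 1 and p_L ≥ 0, and the objective is expected sensitivity. A toy sketch of that program, with two candidate lists, hypothetical sensitivity values, the inverse-rank credit, and a grid search standing in for a real LP solver:

```python
def delta(doc, A, B):
    """Inverse-rank credit: positive values favour ranker A."""
    return 1.0 / (A.index(doc) + 1) - 1.0 / (B.index(doc) + 1)

A, B = ['d1', 'd2'], ['d2', 'd1']
lists = [['d1', 'd2'], ['d2', 'd1']]   # allowed "in between" rankings
sensitivity = [1.0, 1.0]               # hypothetical per-list sensitivity values

best_p, best_obj = None, float('-inf')
steps = 1000
for i in range(steps + 1):
    p = [i / steps, 1 - i / steps]     # candidate distribution (sums to 1, >= 0)
    # unbiasedness: zero expected credit for every prefix length k
    feasible = all(
        abs(sum(pj * sum(delta(L[t], A, B) for t in range(k))
                for pj, L in zip(p, lists))) < 1e-9
        for k in range(1, len(A) + 1))
    obj = sum(pj * s for pj, s in zip(p, sensitivity))
    if feasible and obj > best_obj:
        best_p, best_obj = p, obj
# best_p == [0.5, 0.5]: the only unbiased distribution over these two lists
```

In this two-list toy the unbiasedness constraint pins the solution down to p = (0.5, 0.5), which is why a realistic instance needs more candidate lists than constraints for the sensitivity objective to matter.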
Theoretical Benefits
PROPERTY 1: Balanced interleaving ⊆ this framework
PROPERTY 2: Team Draft interleaving ⊆ this framework
PROPERTY 3: This framework ⊆ Probabilistic interleaving
PROPERTY 4: The merged list is something "in between" the two
Theoretical Benefits +
PROPERTY 5: The breaking case in Balanced interleaving is avoided.
PROPERTY 6: The insensitivity in Team Draft interleaving is improved.
PROPERTY 7: Probabilistic interleaving degrades the user experience more.
Illustration
L1 is unbiased towards a random user: 3·25% + (−1)·(35% + 40%) = 0
Note: the number of constraints is 5, but the number of unknowns is 6?
(It is a maximization problem, and the goal is to maximize Σ_i p_i · sensitivity(L_i).)
An option to pursue is sensitivity.
Agenda
Basic concepts √
Previous algorithms √
Framework: Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √
Evaluation
Discussion
Evaluation: summary
Construct a dataset to simulate interleaving and user interaction.
Evaluate the Pearson correlation between each pair of algorithms.
Analyze cases where the algorithms disagree.
Evaluate result quality with experts.
Analyze bias and sensitivity among the algorithms.
Evaluation +: construction of the dataset
Collect all queries as well as the top-4 results from a search engine. Since the web and the algorithm keep changing, there are many distinct rankings for the same query.
For each query, make sure that there are at least 4 distinct rankings, each shown to users at least 10 times, with at least 1 click.
The most frequent ranking is regarded as A; the most dissimilar one is regarded as B.
Further filter the log so that the results produced by both Balanced interleaving and Team Draft interleaving are frequent.
Evaluation ++
Evaluation +++
Evaluation ++++
Bias comparison among the different algorithms
Evaluation +++++
Sensitivity comparison among the different algorithms
Agenda
Basic concepts √
Previous algorithms √
Framework: Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √
Evaluation √
Discussion
Discussion
Contributions of this paper:
Invert the question of obtaining interleaving algorithms into a constrained optimization problem. The solution is very intuitive and general.
Many interesting examples illustrate the breaking cases of previous approaches.
Note: the evaluation is simulated on logs from only one search engine. For interleaving, we would expect an evaluation based on different search engines?
And that is why the human evaluation result is not good across all algorithms.
Discussion +
"A and B are not shown to users as they have low sensitivity."
This is intuitive; however, it contradicts the result shown in Table 1: (a,b,c,d) has sensitivity 0.83, …