Optimized Interleaving for Online Retrieval Evaluation (Best paper in WSDM'13). Authors: Filip Radlinski, Nick Craswell. Slides by: Han Jiang
Optimized Interleaving for Online Retrieval Evaluation
(Best paper in WSDM'13)
Authors: Filip Radlinski, Nick Craswell
Slides by: Han Jiang
Agenda
Basic concepts
Previous algorithms
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Basic concepts
What is interleaving?
Merge results from different retrieval algorithms.
Only a combined list is shown to the user.
The quality of the algorithms can be inferred with the help of clickthrough data.
[Diagram: a query goes to Search Engines A and B, producing Source Lists A and B; the Interleaving Algorithm merges them into the interleaved list shown to the user; the user's clicks, together with the assignment and a credit function, produce the evaluation result.]
Basic concepts +
OK, then toss a coin instead, and…
Credit function: if d_i is clicked and ranked higher in ranker A, prefer A.
Intuitively*, a good interleaving algorithm should:
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality)
Not substantially alter the search experience
Lead to clicks that reflect the user's preference
Agenda
Basic concepts √
Previous algorithms
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Previous Algorithms
Balanced Interleaving: toss a coin once, then pick the best items by turns.
Team Draft Interleaving: toss a coin every two picks; the winner picks its best item first.
Probabilistic Interleaving: toss a coin every time; sample an item from the winner.
A weight function ensures that a document at a higher rank has a higher probability of being picked.
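Team Draft interleaving can be sketched in a few lines. This is my own illustrative implementation (function and variable names are not from the paper): each round a coin toss decides pick order, and each ranker contributes its best not-yet-shown document, recording which "team" owns it for later click credit.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=random):
    """Team Draft sketch: per round, a coin toss decides which ranker
    picks first; each picks its best document not yet shown."""
    merged, team = [], {}
    ia = ib = 0
    while len(merged) < length and (ia < len(ranking_a) or ib < len(ranking_b)):
        order = ('A', 'B') if rng.random() < 0.5 else ('B', 'A')
        for side in order:
            if len(merged) >= length:
                break
            src, i = (ranking_a, ia) if side == 'A' else (ranking_b, ib)
            while i < len(src) and src[i] in merged:
                i += 1                       # skip documents already shown
            if i < len(src):
                merged.append(src[i])
                team[src[i]] = side          # clicks on this doc credit `side`
                i += 1
            if side == 'A':
                ia = i
            else:
                ib = i
    return merged, team

merged, team = team_draft_interleave(['d1', 'd2', 'd3', 'd4'],
                                     ['d4', 'd1', 'd2', 'd3'], length=4)
```

Credit is then simple: each click counts for the team that contributed the clicked document.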
Previous Algorithms +
About credit functions: only documents that are clicked by the user are counted.
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1, d3
Both clicked documents (d1 and d3) are ranked higher in A than in B,
so A wins with p = 100%.
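The bias in this example can be checked by enumeration. A minimal sketch (my own code, assuming the simple credit rule stated earlier: each click credits the input ranker that ranked the clicked document higher):

```python
def balanced_interleave(a, b, a_leads):
    """Balanced interleaving sketch: one coin toss picks the leading
    ranker; the ranker that is behind contributes its next unseen doc."""
    merged, ka, kb = [], 0, 0
    while len(merged) < len(a):
        if ka < kb or (ka == kb and a_leads):
            while a[ka] in merged:
                ka += 1
            merged.append(a[ka]); ka += 1
        else:
            while b[kb] in merged:
                kb += 1
            merged.append(b[kb]); kb += 1
    return merged

A = ['d1', 'd2', 'd3', 'd4']
B = ['d4', 'd1', 'd2', 'd3']

# Enumerate both coin outcomes and a user clicking one shown document
# uniformly at random; credit the ranker that ranked the click higher.
wins = {'A': 0, 'B': 0}
for a_leads in (True, False):
    for clicked in balanced_interleave(A, B, a_leads):
        wins['A' if A.index(clicked) < B.index(clicked) else 'B'] += 1
# wins == {'A': 6, 'B': 2}: a purely random clicker "prefers" A 75% of the time
```

Whichever way the coin lands, only d4 can ever credit B, so a random clicker systematically favors A: that is the bias.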
Agenda
Basic concepts √
Previous algorithms √
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Invert the problem
Why previous algorithms are not good enough:
Balanced interleaving & Team Draft interleaving: biased.
Probabilistic interleaving: degrades the user experience.
Even a random click on a document produces a winner.
e.g. A=(d1, d2), B=(d1, d2), but M = (d2, d1)
Therefore, the problem of interleaving should be more constrained.
A good way is to start from the principles…
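The degradation is easy to quantify. Probabilistic interleaving has the toss winner sample from a rank-based distribution (the paper uses weights proportional to 1/rank^τ with τ = 3), so even when the rankers agree that A = B = (d1, d2), the flipped list M = (d2, d1) is shown with nonzero probability. A sketch of that calculation (my own helper names):

```python
TAU = 3  # rank-decay parameter of probabilistic interleaving (p ∝ 1/rank**TAU)

def pick_probability(rank, n_remaining):
    """Probability that the toss winner samples the document at the given
    1-based rank among its n_remaining candidates."""
    weights = [1.0 / r ** TAU for r in range(1, n_remaining + 1)]
    return (1.0 / rank ** TAU) / sum(weights)

# Both rankers return (d1, d2).  Whoever wins the toss, position 1 is
# sampled, so the flipped list M = (d2, d1) appears with probability:
p_flip = pick_probability(rank=2, n_remaining=2)   # (1/8) / (1 + 1/8) = 1/9
```

Roughly one query in nine shows the user the reversed list even though both rankers agree on the order.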
Again, what is a good interleaving algorithm?
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality)
Not substantially alter the search experience
Lead to clicks that reflect the user's preference
Refine the problem
Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two)
Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn't create a preference for either ranker
Be sensitive to input data (the fewest user queries show a significant preference)
Again, what is a good interleaving algorithm?
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality)
Refine the problem +
Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two)
Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn't create a preference for either ranker
Be sensitive to input data (the fewest user queries show a significant preference)
Refine the problem ++
Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two)
Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit
A=(d1, d2), B=(d1, d2), M = (d1, d2)
A randomly clicking user doesn't create a preference for either ranker:
∀ k ∈ {1, …, len}:  Σ_L p_L Σ_{i=1}^{k} δ(L_i) = 0
where k is the number of clicks (a random user clicks on a top-k prefix), δ is the score function (when > 0 it assigns credit to A, otherwise to B), len is the length of the list, and L ranges over the possible interleaved lists under the previous constraints.
Be sensitive to input data (the fewest user queries show a significant preference)
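The "randomly clicking user" requirement is directly checkable: for every prefix length k, a user who clicks uniformly on the top-k prefix must accumulate zero expected credit. A sketch (my own code) using the inverse-rank credit δ(d) = 1/rank_A(d) − 1/rank_B(d), one of the credit functions considered in the paper:

```python
def delta(doc, A, B):
    """Inverse-rank credit: positive values favour ranker A."""
    return 1.0 / (A.index(doc) + 1) - 1.0 / (B.index(doc) + 1)

def is_unbiased(lists_with_probs, A, B, tol=1e-9):
    """For every prefix length k, a user clicking uniformly on the top-k
    prefix must accumulate zero expected credit over the shown lists."""
    for k in range(1, len(A) + 1):
        expected = sum(p * sum(delta(L[i], A, B) for i in range(k))
                       for L, p in lists_with_probs)
        if abs(expected) > tol:
            return False
    return True

A, B = ['d1', 'd2'], ['d2', 'd1']
ok = is_unbiased([(['d1', 'd2'], 0.5), (['d2', 'd1'], 0.5)], A, B)   # True
bad = is_unbiased([(['d1', 'd2'], 1.0)], A, B)   # False: always showing A is biased
```

Showing each input ranking half the time passes the check; always showing A fails it at k = 1.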
Refine the problem +++
Refine the problem ++++
So the constraint is: ∀ k: Σ_L p_L Σ_{i=1}^{k} δ(L_i) = 0, together with Σ_L p_L = 1 and p_L ≥ 0.
And the target is: maximize Σ_L p_L · sensitivity(L).
With variable: the definition of the credit function δ (the probabilities p_L then come out of the optimization).
Since it is an optimization problem, the existence of a solution should be guaranteed theoretically, while in the paper it is only guaranteed empirically.
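Putting the pieces together, the framework is a small linear program: the variables are the probabilities p_L over the allowed lists, the constraints are the zero-expected-credit conditions plus Σ p_L = 1 and p_L ≥ 0, and the objective is expected sensitivity. A toy sketch of that program, with two candidate lists, hypothetical sensitivity values, the inverse-rank credit, and a grid search standing in for a real LP solver:

```python
def delta(doc, A, B):
    """Inverse-rank credit: positive values favour ranker A."""
    return 1.0 / (A.index(doc) + 1) - 1.0 / (B.index(doc) + 1)

A, B = ['d1', 'd2'], ['d2', 'd1']
lists = [['d1', 'd2'], ['d2', 'd1']]   # allowed "in between" rankings
sensitivity = [1.0, 1.0]               # hypothetical per-list sensitivity values

best_p, best_obj = None, float('-inf')
steps = 1000
for i in range(steps + 1):
    p = [i / steps, 1 - i / steps]     # candidate distribution (sums to 1, >= 0)
    # unbiasedness: zero expected credit for every prefix length k
    feasible = all(
        abs(sum(pj * sum(delta(L[t], A, B) for t in range(k))
                for pj, L in zip(p, lists))) < 1e-9
        for k in range(1, len(A) + 1))
    obj = sum(pj * s for pj, s in zip(p, sensitivity))
    if feasible and obj > best_obj:
        best_p, best_obj = p, obj
# best_p == [0.5, 0.5]: the only unbiased distribution over these two lists
```

In this two-list toy the unbiasedness constraint pins the solution down to p = (0.5, 0.5), which is why a realistic instance needs more candidate lists than constraints for the sensitivity objective to matter.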
Theoretical Benefits
PROPERTY 1: Balanced interleaving ⊆ this framework
PROPERTY 2: Team Draft interleaving ⊆ this framework
PROPERTY 3: This framework ⊆ Probabilistic interleaving
PROPERTY 4: The merged list is something "in between" the two
Theoretical Benefits +
PROPERTY 5: The breaking case in Balanced interleaving is avoided.
PROPERTY 6: The insensitivity in Team Draft interleaving is improved.
PROPERTY 7: Probabilistic interleaving degrades the user experience more.
Illustration
L1 is unbiased towards a random user: 3·25% + (−1)·(35% + 40%) = 0
Note: the number of constraints is 5, but the number of unknowns is 6?
(It is a maximization problem, and the goal is to maximize Σ_i p_i · sensitivity(L_i).)
An option to pursue is sensitivity.
Agenda
Basic concepts √
Previous algorithms √
Framework: Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √
Evaluation
Discussion
Evaluation: summary
Construct a dataset to simulate interleaving and user interaction.
Evaluate the Pearson correlation between each pair of algorithms.
Analyze cases where the algorithms disagree.
Evaluate result quality with experts.
Analyze bias and sensitivity among the algorithms.
Evaluation +: construction of the dataset
Collect all queries as well as the top-4 results from a search engine. Since the web and the algorithm keep changing, there are many distinct rankings for the same query.
For each query, make sure that there are at least 4 distinct rankings, each shown to users at least 10 times, with at least 1 click.
The most frequent ranking is regarded as A; the most dissimilar one is regarded as B.
Further filter the log so that the results produced by both Balanced interleaving and Team Draft interleaving are frequent.
Evaluation ++
Evaluation +++
Evaluation ++++
Bias comparison among the different algorithms
Evaluation +++++
Sensitivity comparison among the different algorithms
Agenda
Basic concepts √
Previous algorithms √
Framework: Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √
Evaluation √
Discussion
Discussion
Contributions of this paper:
Invert the question of obtaining interleaving algorithms into a constrained optimization problem. The solution is very intuitive and general.
Many interesting examples illustrate the breaking cases of previous approaches.
Note: the evaluation is simulated on logs from only one search engine. For interleaving, we would expect an evaluation based on different search engines?
And that is why the human evaluation result is not good across all algorithms.
Discussion +
"A and B are not shown to users as they have low sensitivity."
This is intuitive; however, it contradicts the result shown in Table 1: (a,b,c,d) has sensitivity 0.83, …