Online Search Evaluation with Interleaving
Filip Radlinski, Microsoft
Dec 23, 2015
Acknowledgments
• This talk involves joint work with:
  – Olivier Chapelle
  – Nick Craswell
  – Katja Hofmann
  – Thorsten Joachims
  – Madhu Kurup
  – Anne Schuth
  – Yisong Yue
Motivation
• Baseline Ranking Algorithm vs. Proposed Ranking Algorithm
• Which is better?
Retrieval evaluation
Two types of retrieval evaluation:
• Offline evaluation: Ask experts or users to explicitly evaluate your retrieval system. This approach dominates evaluation research today.
• Online evaluation: Observe how ordinary users interact with your retrieval system during normal use.
  – Most well-known type: A/B tests
A/B testing
• Each user is assigned to one of two conditions
  – They might see the left or the right ranking
• Measure user interaction with their ranking (e.g. clicks)
• Look for differences between the two populations
[Figure: two example result lists, Ranking A and Ranking B]
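The A/B split above can be sketched in a few lines of Python. The hash-based assignment and the log-record layout are illustrative assumptions, not details from the talk:

```python
import hashlib

def assign_condition(user_id: str) -> str:
    """Deterministic 50/50 split: the same user always sees the same ranking."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 2
    return "A" if bucket == 0 else "B"

def click_through_rate(logs, condition):
    """Fraction of impressions in one condition that received a click.

    `logs` is a list of dicts like {"condition": "A", "clicked": 1}
    (a hypothetical log format, for illustration only).
    """
    rows = [r for r in logs if r["condition"] == condition]
    return sum(r["clicked"] for r in rows) / len(rows) if rows else 0.0
```

Hashing the user ID (rather than assigning conditions at random per request) keeps each user's experience consistent across queries, which is the standard reason A/B assignments are sticky.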
Online evaluation with interleaving
• A within-user online ranker comparison
  – Presents results from both rankings to every user
• The ranking that gets more of the clicks wins
  – Designed to be unbiased, and much more sensitive than A/B testing
[Figure: Ranking A and Ranking B merged into the ranking shown to users (randomized)]
Ranking A
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa Valley Wineries – Plan your wine...
   www.napavalley.com/wineries
3. Napa Valley College
   www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley
   www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine
   www.napavintners.com
6. Napa County, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley

Ranking B
1. Napa County, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging...
   www.napavalley.com
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
5. NapaValley.org
   www.napavalley.org
6. The Napa Valley Marathon
   www.napavalleymarathon.org

Presented Ranking
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa County, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine...
   www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
6. Napa Valley College
   www.napavalley.edu/homex.asp
7. NapaValley.org
   www.napavalley.org

[Radlinski et al. 2008]
Team draft interleaving
Team draft interleaving
Ranking A
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa Valley Wineries – Plan your wine...
   www.napavalley.com/wineries
3. Napa Valley College
   www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley
   www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine
   www.napavintners.com
6. Napa County, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley

Ranking B
1. Napa County, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging...
   www.napavalley.com
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
5. NapaValley.org
   www.napavalley.org
6. The Napa Valley Marathon
   www.napavalleymarathon.org

Presented Ranking
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa County, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine...
   www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
6. Napa Valley College
   www.napavalley.edu/homex.asp
7. NapaValley.org
   www.napavalley.org

Two results are clicked, one contributed by each team: Tie!

[Radlinski et al. 2008]
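The team-draft procedure shown above can be sketched as follows. This is a minimal sketch of the algorithm as described on the slide (rankers alternate picks, with a coin flip per round), not the authors' exact implementation:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Team draft interleaving: the two rankers alternately contribute
    their highest-ranked result not yet in the interleaved list; a coin
    flip decides who picks first in each round. Each placed result
    remembers which "team" contributed it."""
    n_docs = len(set(ranking_a) | set(ranking_b))
    interleaved, teams = [], {}
    while len(interleaved) < n_docs:
        pair = [("A", ranking_a), ("B", ranking_b)]
        rng.shuffle(pair)  # random first pick per round
        for team, ranking in pair:
            for doc in ranking:
                if doc not in teams:  # first doc this team hasn't placed yet
                    interleaved.append(doc)
                    teams[doc] = team
                    break
    return interleaved, teams

def score(clicked_docs, teams):
    """Credit each click to the team that contributed the document."""
    a = sum(1 for d in clicked_docs if teams.get(d) == "A")
    b = sum(1 for d in clicked_docs if teams.get(d) == "B")
    return "A" if a > b else "B" if b > a else "tie"
```

Counting clicks per team, rather than per ranking, is what makes the comparison fair when the two rankings share documents.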
Why might mixing rankings help?
• Suppose results are worth money. For some query:
  – Ranker A: [result images omitted], and the user clicks
  – Ranker B: [result images omitted], and the user also clicks
• Users of A may not know what they're missing
  – The difference in behaviour is small
• But if we can mix up results from A & B: strong preference for B
Comparison with A/B metrics
• Experiments with real Yahoo! rankers (very small differences in relevance)
[Figure: disagreement probability vs. query set size, for Yahoo! Pair 1 and Yahoo! Pair 2]
[Chapelle et al. 2012]
The interleaving click model
• Click == Good
• Interleaving corrects for position bias
• Yet there are other sources of bias, such as bolding
[Figure: the same result shown with vs. without query-term bolding]
[Yue et al. 2010a]

The interleaving click model
• Bars should be equal if there were no effect of bolding
[Figure: click frequency on the bottom result, by rank of results]
[Yue et al. 2010a]
Sometimes clicks aren't even good
• Satisfaction of a click can be estimated
  – Time spent on URLs is informative
  – More sophisticated models also consider the query and document (some documents require more effort)
• Time before clicking is another efficiency metric
[Example: a user clicks twice, but neither click satisfies the need]
[Kim et al. WSDM 2014]
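A minimal sketch of dwell-time-based satisfaction filtering. The fixed 30-second threshold is a common heuristic used here only for illustration; as the slide notes, Kim et al. instead model the query and document, since some documents legitimately require more reading effort:

```python
def satisfied_clicks(session_clicks, dwell_threshold=30.0):
    """Keep only clicks whose landing-page dwell time suggests satisfaction.

    Each click is a (doc_id, dwell_seconds) pair. The 30s cutoff is an
    assumption for this sketch, not the model from Kim et al.
    """
    return [doc for doc, dwell in session_clicks if dwell >= dwell_threshold]
```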
Newer A/B metrics
• Newer A/B metrics can incorporate these signals:
  – Time before clicking
  – Time spent on result documents
  – Estimated user satisfaction
  – Bias in the click signal, e.g. position
  – Anything else the domain expert cares about
• Suppose I've picked an A/B metric and assume it to be my target
  – I just want to measure it more quickly
  – Can I use interleaving?
An A/B metric as a gold standard
• Does interleaving agree with these A/B metrics?

  A/B Metric                        Team Draft Agreement
  Is Page Clicked?                  63%
  Clicked @ 1?                      71%
  Satisfied Clicked?                71%
  Satisfied Clicked @ 1?            76%
  Time-to-click                     53%
  Time-to-click @ 1                 45%
  Time-to-satisfied-click           47%
  Time-to-satisfied-click @ 1       42%

[Schuth et al. SIGIR 2015]
An A/B metric as a gold standard
• Suppose we parameterize the clicks:
  – Optimize to maximize agreement with our A/B metric
• In particular:
  – Only include clicks where the predicted probability of satisfaction is above a threshold t
  – Score clicks based on the time to satisfied click
  – Learn a linear weighted combination of these
[Schuth et al. SIGIR 2015]
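The parameterization above can be sketched as follows. The click-record fields and the exact linear form are illustrative assumptions based on the slide's description, not the precise formulation from Schuth et al.:

```python
def parameterized_score(clicks, t=0.5, w_const=1.0, w_time=0.0):
    """Signed interleaving score for one impression.

    - Drop clicks whose predicted satisfaction probability is below t.
    - Credit each remaining click with a linear combination of a constant
      term and a (negated) time-to-click term, so faster satisfied clicks
      can earn more credit.
    - Positive totals favour ranker A, negative totals favour ranker B.

    Each click is a dict like
    {"team": "A", "p_sat": 0.9, "time_to_click": 4.0} (hypothetical layout).
    """
    total = 0.0
    for c in clicks:
        if c["p_sat"] < t:
            continue  # predicted-unsatisfied clicks are ignored
        credit = w_const - w_time * c["time_to_click"]
        total += credit if c["team"] == "A" else -credit
    return total
```

The threshold t and the weights would then be tuned to maximize agreement with the chosen A/B metric, as the slide describes.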
An A/B metric as a gold standard

  A/B Metric                    Team Draft Agreement   Learned            A/B Self-Agreement on
                                (1/80th size)          (to each metric)   Subset (1/80th size)
  Is Page Clicked?              63%                    84% +              63%
  Clicked @ 1?                  71% *                  75% +              62%
  Satisfied Clicked?            71% *                  85% +              61%
  Satisfied Clicked @ 1?        76% *                  82% +              60%
  Time-to-click                 53%                    68% +              58%
  Time-to-click @ 1             45%                    56% +              59%
  Time-to-satisfied-click       47%                    63% +              59%
  Time-to-satisfied-click @ 1   42%                    50% +              60%
The right parameters

  A/B Metric            Team Draft   Learned     Learned          Learned
                        Agreement    Combined    (P(Sat) only)    (Time to click * P(Sat))
  Satisfied Clicked?    71%          85% +       84% +            48% –

  Filtering thresholds shown on the slide: P(Sat) > 0.5, P(Sat) > 0.76, P(Sat) > 0.26

• The optimal filtering parameter need not match the metric definition
• But having the right feature is essential
Does this cost sensitivity?
[Figure: statistical power of Team Draft vs. the "Is Sat Clicked" A/B metric]
What if you instead know how you value user actions?
• Suppose we don't have an A/B metric in mind
• Instead, suppose we know how to value users' behavior on changed documents:
  – If a user clicks on a document that moved up k positions, how much is it worth?
  – If a user spends time t before clicking, how much is it worth?
  – If a user spends time t' on a document, how much is it worth?
[Radlinski & Craswell, WSDM 2013]
Example credit function
• The value of a click is proportional to how far the document moved between A and B
• Example (with documents d1, d2, d3):
  – A: d1, d2, d3
  – B: d3, d1, d2
  – Any click on d3 gives credit +2
  – Any click on d1 gives credit -1
  – Any click on d2 gives credit -1
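With labels d1..d3 standing in for the slide's result icons (an assumption, since the original uses images), this credit function is a one-liner:

```python
def rank_delta_credit(doc, ranking_a, ranking_b):
    """Credit for a click on `doc`: the number of positions it moved up
    going from ranking A to ranking B (negative if it moved down)."""
    return ranking_a.index(doc) - ranking_b.index(doc)
```

For A = [d1, d2, d3] and B = [d3, d1, d2], this reproduces the slide's credits: d3 moved up two positions (+2), while d1 and d2 each moved down one (-1).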
Interleaving (making the rankings)
• We generate a set of rankings that are similar to those returned by A and B in an A/B test
• Team Draft corresponds to showing two of these allowed rankings, each 50% of the time
[Figure: Ranker A, Ranker B, and the candidate interleaved rankings]
We have an optimization problem!
• We have a set of allowed rankings
• We specified how clicks translate to credit
• We solve for the probabilities of showing each ranking:
  – The probabilities of showing the rankings add up to 1
  – The expected credit given random clicking is zero
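The two constraints above are easy to state in code. A small feasibility checker, where the encoding of rankings as rows of per-position credits is an illustrative assumption:

```python
from fractions import Fraction

def satisfies_constraints(probs, credit_rows):
    """Check the two constraints from the slide.

    `probs[i]` is the probability of showing allowed ranking i;
    `credit_rows[i][j]` is the credit earned by a click at position j of
    ranking i. "Random clicking" is modelled as a uniform click over the
    positions of whichever ranking is shown.
    """
    probs = [Fraction(p) for p in probs]  # exact arithmetic
    total = sum(probs)
    expected_credit = sum(
        p * Fraction(sum(row), len(row))  # mean credit of a uniform click
        for p, row in zip(probs, credit_rows)
    )
    return total == 1 and expected_credit == 0
```

The actual optimization then searches over all probability vectors satisfying these constraints, which is why the problem is typically under-constrained.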
Sensitivity
• The optimization problem so far is usually under-constrained (lots of possible rankings)
• What else do we want? Sensitivity!
• Intuition:
  – When we show a particular ranking (i.e. something combining results from A and B), it is always biased (interleaving says that we should be unbiased on average)
  – The more biased, the less informative the outcome
  – We want to show the individual rankings that are least biased
• I'll skip the maths here...
Allowed interleaved rankings for different interleaving algorithms
[Table: candidate interleaved rankings with bias scores (0.87, 0.73, 0.74, 0.60, 0.50) and the probabilities with which different interleaving algorithms show them]
Illustrative optimized solution
[Figure: rankings A and B with the optimized showing probabilities]
Summary
• Interleaving is a sensitive online metric for evaluating rankings
  – Very high agreement when reliable offline relevance metrics are available
  – Agreement of simple interleaving algorithms with A/B metrics can be poor when relevance differences are small or ambiguous
• Solutions:
  – De-bias user behaviour (e.g. presentation effects)
  – Optimize to a known A/B metric (if one is trusted)
  – Optimize to a known user model
Thanks!
Questions?