Offline evaluation of recommender systems: all pain and no gain?

Sep 08, 2014

Mark Levy

Keynote for the workshop on Reproducibility and Replication in Recommender Systems at ACM RecSys, Hong Kong, 12 October 2013.
Transcript
Page 1: Offline evaluation of recommender systems: all pain and no gain?

Offline Evaluation of Recommender Systems

All pain and no gain?

Mark Levy, Mendeley

Page 2: Offline evaluation of recommender systems: all pain and no gain?

About me

Page 3: Offline evaluation of recommender systems: all pain and no gain?

About me

Page 4: Offline evaluation of recommender systems: all pain and no gain?

Some things I built

Page 5: Offline evaluation of recommender systems: all pain and no gain?

Something I'm building

Page 6: Offline evaluation of recommender systems: all pain and no gain?

What is a good recommendation?

Page 7: Offline evaluation of recommender systems: all pain and no gain?

What is a good recommendation?

One that increases the usefulness of your product in the long run¹

1. WARNING: hard to measure directly

Page 8: Offline evaluation of recommender systems: all pain and no gain?

What is a good recommendation?

● One that increased your bottom line:

– User bought item after it was recommended

– User clicked ad after it was shown

– User didn't skip track when it was played

– User added document to library...

– User connected with contact...

Page 9: Offline evaluation of recommender systems: all pain and no gain?

Why was it good?

Page 10: Offline evaluation of recommender systems: all pain and no gain?

Why was it good?

● Maybe it was

– Relevant

– Novel

– Familiar

– Serendipitous

– Well explained

● Note: some of these are mutually incompatible

Page 11: Offline evaluation of recommender systems: all pain and no gain?

What is a bad recommendation?

Page 12: Offline evaluation of recommender systems: all pain and no gain?

What is a bad recommendation?

(you know one when you see one)

Page 13: Offline evaluation of recommender systems: all pain and no gain?

What is a bad recommendation?

Page 14: Offline evaluation of recommender systems: all pain and no gain?

What is a bad recommendation?

Page 15: Offline evaluation of recommender systems: all pain and no gain?

What is a bad recommendation?

Page 16: Offline evaluation of recommender systems: all pain and no gain?

What is a bad recommendation?

● Maybe it was

– Not relevant

– Too obscure

– Too familiar

– I already have it

– I already know that I don't like it

– Badly explained

Page 17: Offline evaluation of recommender systems: all pain and no gain?

What's the cost of getting it wrong?

● Depends on your product and your users

– Lost revenue

– Less engaged user

– Angry user

– Amused user

– Confused user

– User defects to a rival product

Page 18: Offline evaluation of recommender systems: all pain and no gain?

Hypotheses

Good offline metrics express product goals

Most (really) bad recommendations can be caught by business logic

Page 19: Offline evaluation of recommender systems: all pain and no gain?

Issues

● Real business goals concern long-term user behaviour e.g. Netflix

“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”

● Usually have to settle for short-term surrogate

● Only some user behaviour is visible

● Same constraints when collecting training data

Page 20: Offline evaluation of recommender systems: all pain and no gain?

Least bad solution?

● “Back to the future” aka historical log analysis

● Decide which logged event(s) indicate success

● Be honest about “success”

● Usually care most about precision @ small k

● Recall will discriminate once this plateaus

● Expect to have to do online testing too
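
By way of illustration, a minimal Python sketch of precision@k and recall@k computed against a held-out set of success events for one user; the function and argument names are illustrative, not from the talk.

def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of held-out item ids."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: one user with three held-out items.
p, r = precision_recall_at_k(["a", "b", "c", "d", "e"], {"b", "e", "z"}, k=5)
print(p, r)  # 0.4, 0.666...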

Page 21: Offline evaluation of recommender systems: all pain and no gain?

Making metrics meaningful

● Building a test framework + data is hard

● Be sure to get best value from your work

● Don't use straw man baselines

● Be realistic – leave the ivory tower

● Make test setups and baselines reproducible

Page 22: Offline evaluation of recommender systems: all pain and no gain?

Making metrics meaningful

● Old skool k-NN systems are better than you think

– Input numbers from mining logs

– Temporal “modelling” (e.g. fake users)

– Data pruning (scalability, popularity bias, quality)

– Preprocessing (tf-idf, log/sqrt, …)

– Hand crafted similarity metric

– Hand crafted aggregation formula

– Postprocessing (popularity matching)

– Diversification

– Attention profile
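
For concreteness, a rough Python sketch of one such "old skool" item-based k-NN pipeline: log-scaled counts (tf-idf or sqrt are other common choices), cosine item-item similarity, top-k neighbour pruning and simple score aggregation. The names and choices here are illustrative, not the system described in the talk, and the dense similarity matrix is for clarity rather than scalability.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def item_knn_scores(user_item_counts, k=50):
    # Dampen popularity bias with a log transform on the raw counts.
    X = csr_matrix(user_item_counts, dtype=np.float64)
    X.data = np.log1p(X.data)
    # Cosine similarity between items.
    item_norm = normalize(X.T, norm="l2", axis=1)        # items x users
    sims = (item_norm @ item_norm.T).toarray()           # items x items
    np.fill_diagonal(sims, 0.0)
    # Keep only the top-k neighbours per item.
    for i in range(sims.shape[0]):
        keep = np.argsort(sims[i])[-k:]
        mask = np.ones(sims.shape[1], dtype=bool)
        mask[keep] = False
        sims[i, mask] = 0.0
    # Score = sum of similarities to the items the user already has.
    return X @ sims                                       # users x items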

Page 23: Offline evaluation of recommender systems: all pain and no gain?

Making metrics meaningful

● Measure preference honestly

● Predicted items may not be “correct” just because they were consumed once

● Try to capture value

– Earlier recommendation may be better

– Don't need a recommender to suggest items by same artist/author

● Don't neglect side data

– At least use it for evaluation / sanity checking
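
As one possible illustration, a small Python sketch that uses side data only at evaluation time, separating "trivial" hits (items by an artist the user already has) from the rest so that "success" stays honest; all names and data structures here are hypothetical.

def non_trivial_hits(recommended, relevant, known_artists, artist_of, k=10):
    """artist_of: dict item -> artist; known_artists: artists already in the user's library."""
    hits, trivial = [], []
    for item in recommended[:k]:
        if item in relevant:
            (trivial if artist_of.get(item) in known_artists else hits).append(item)
    return hits, trivial  # report both, rather than counting trivial hits as wins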

Page 24: Offline evaluation of recommender systems: all pain and no gain?

Making metrics meaningful

● Public data isn't enough for reproducibility or fair comparison

● Need to document preprocessing

● Better:

Release your preparation/evaluation code too

Page 25: Offline evaluation of recommender systems: all pain and no gain?

What's the cost of poor evaluation?

Page 26: Offline evaluation of recommender systems: all pain and no gain?

What's the cost of poor evaluation?

Poor offline evaluation can lead to years of misdirected research

Page 27: Offline evaluation of recommender systems: all pain and no gain?

Ex 1: Reduce playlist skips

● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”

● Use audio similarity measure to compute transition distance, then travelling salesman

● Metric: sum of transition distances (lower is better)

● 6 months work to develop solution
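
A hedged Python sketch of what such an offline setup might look like: the summed transition-distance metric, plus a greedy nearest-neighbour stand-in for the travelling-salesman reordering. Here dist is assumed to be some audio-similarity distance function; everything is illustrative.

def total_transition_distance(playlist, dist):
    # The offline metric: sum of pairwise transition distances (lower was "better").
    return sum(dist(a, b) for a, b in zip(playlist, playlist[1:]))

def greedy_reorder(tracks, dist):
    # Greedy nearest-neighbour heuristic as a simple TSP stand-in.
    if not tracks:
        return []
    remaining = list(tracks)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda t: dist(order[-1], t))
        remaining.remove(nxt)
        order.append(nxt)
    return order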

Page 28: Offline evaluation of recommender systems: all pain and no gain?

Ex 1: Reduce playlist skips

● Result: users skipped more often

● Why?

Page 29: Offline evaluation of recommender systems: all pain and no gain?

Ex 1: Reduce playlist skips

● Result: users skipped more often

● When a user skipped a track they didn't like they were played something else just like it

● Better metric: average position of skipped tracks (based on logs; skips further down the playlist are better)
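
A minimal Python sketch of that log-based metric, assuming each session is logged as (position, was_skipped) pairs; the data format is illustrative.

def mean_skip_position(sessions):
    """sessions: iterable of lists of (position, was_skipped) tuples from playback logs."""
    positions = [pos for session in sessions
                 for pos, skipped in session if skipped]
    # Higher average position = skips happen later in the playlist = better.
    return sum(positions) / len(positions) if positions else float("nan")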

Page 30: Offline evaluation of recommender systems: all pain and no gain?

Ex 2: Recommend movies

● Use a corpus of star ratings to improve movie recommendations

● Learn to predict ratings for un-rated movies

● Metric: average RMSE of predictions for a hidden test set (lower is better)

● 2+ years work to develop new algorithms
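
The metric itself is simple; a minimal Python sketch of RMSE over a hidden test set.

import math

def rmse(predicted, actual):
    # Root mean squared error of predicted vs. held-out ratings (lower is better).
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))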

Page 31: Offline evaluation of recommender systems: all pain and no gain?

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● Why?

Page 32: Offline evaluation of recommender systems: all pain and no gain?

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● User behaviour correlates with rank not RMSE

● Side datasets an order of magnitude more valuable than algorithm improvements

● Explicit ratings are the exception not the rule

● RMSE still haunts research labs

Page 33: Offline evaluation of recommender systems: all pain and no gain?

Can contests help?

● Good:

– Great for consistent evaluation

● Not so good:

– Privacy concerns mean obfuscated data

– No guarantee that metrics are meaningful

– No guarantee that train/test framework is valid

– Small datasets can become overexposed

Page 34: Offline evaluation of recommender systems: all pain and no gain?

Ex 3: Yahoo! Music KDD Cup

● Largest music rating dataset ever released

● Realistic “loved songs” classification task

● Data fully obfuscated due to recent lawsuits

Page 35: Offline evaluation of recommender systems: all pain and no gain?

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Why?

Page 36: Offline evaluation of recommender systems: all pain and no gain?

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Research frontier focussed on audio content and metadata, which couldn't be joined to the obfuscated ratings

Page 37: Offline evaluation of recommender systems: all pain and no gain?

Ex 4: Million Song Challenge

● Large music dataset with rich metadata

● Anonymized listening histories

● Simple item recommendation task

● Reasonable MAP@500 metric

● Aimed to solve shortcomings of KDD Cup

● Only obfuscation was removal of timestamps
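
For reference, a minimal Python sketch of mean average precision at k as typically defined for this kind of challenge (the contest used k=500); the names are illustrative.

def average_precision_at_k(recommended, relevant, k=500):
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k=500):
    # Mean over users of the per-user average precision at k.
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)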

Page 38: Offline evaluation of recommender systems: all pain and no gain?

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● Why?

Page 39: Offline evaluation of recommender systems: all pain and no gain?

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● No timestamps so test tracks chosen at random

● So “people who listen to A also listen to B”

● Traditional item similarity solves this well

● More honesty about “success” might have shown that contest data was flawed

Page 40: Offline evaluation of recommender systems: all pain and no gain?

Ex 5: Yelp RecSys Challenge

● Small business review dataset with side data

● Realistic mix of input data types

● Rating prediction task

● Informal procedure to create train/test sets

Page 41: Offline evaluation of recommender systems: all pain and no gain?

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up leaderboard

● Why?

Page 42: Offline evaluation of recommender systems: all pain and no gain?

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up leaderboard

● Train/test split was corrupt

● Competition organisers moved fast to fix this

● But left only one week before deadline
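
A basic sanity check of this kind can be written in a few lines; a Python sketch, assuming interactions are represented as (user, item) pairs.

def check_split(train_pairs, test_pairs):
    # Train and test should not share (user, item) pairs; leakage inflates every baseline.
    leaked = set(train_pairs) & set(test_pairs)
    if leaked:
        raise ValueError("%d (user, item) pairs appear in both train and test" % len(leaked))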

Page 43: Offline evaluation of recommender systems: all pain and no gain?

Ex 6: MIREX Audio Chord Estimation

● Small dataset of audio tracks

● Task to label with predicted chord symbols

● Human labelled data hard to come by

● Contest hosted by premier forum in field

● Evaluate frame-level prediction accuracy

● Historical glass ceiling around 80%

Page 44: Offline evaluation of recommender systems: all pain and no gain?

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Why?

Page 45: Offline evaluation of recommender systems: all pain and no gain?

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Spoof entry relying on known test set

● Protest against inadequate test data

● Other research showed weak generalisation of winning algorithms from same contest

● The next year, results dropped significantly

Page 46: Offline evaluation of recommender systems: all pain and no gain?

So why evaluate offline at all?

● Building test framework ensures clear goals

● Avoid wishful thinking if your data is too thin

● Be efficient with precious online testing

– Cut down huge parameter space

– Don't alienate users

● Need to publish

● Pursuing science as well as profit

Page 47: Offline evaluation of recommender systems: all pain and no gain?

Online evaluation is tricky too

● No off-the-shelf solution for services

● Many statistical gotchas

● Same mismatch between short-term and long-term success criteria

● Results open to interpretation by management

● Can make incremental improvements look good when radical innovation is needed
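
As a small illustration of the statistics involved, a Python sketch of a two-proportion z-test on conversion rates between variants A and B; it says nothing about peeking, novelty effects or the long-term criteria above.

import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """conv_*: number of converting users; n_*: users exposed to each variant."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under the null
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se  # compare against e.g. 1.96 for a 5% two-sided test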

Page 48: Offline evaluation of recommender systems: all pain and no gain?

Ex 7: Article Recommendations

● Recommender for related research articles

● Massive download logs available

● Framework developed based on co-downloads

● Aim to improve on existing search solution

● Management “keen for it to work”

● Several weeks of live A/B testing available

● No offline evaluation

Page 49: Offline evaluation of recommender systems: all pain and no gain?

Ex 7: Article Recommendations

● Result: worse than similar title search

● Why?

Page 50: Offline evaluation of recommender systems: all pain and no gain?

Ex 7: Article Recommendations

● Result: worse than similar title search

● Inadequate business rules, e.g. often suggesting other articles from the same publication

● Users identified only by organisational IP range, so the value of “big data” was very limited

● Establishing an offline evaluation protocol would have shown these in advance

Page 51: Offline evaluation of recommender systems: all pain and no gain?

Isn't there software for that?

Rules of the game:

– Model fit metrics (e.g. validation loss) don't count

– Need a transparent “audit trail” of data to support genuine reproducibility

– Just using public datasets doesn't ensure this

Page 52: Offline evaluation of recommender systems: all pain and no gain?

Isn't there software for that?

Wish list for reproducible evaluation:

– Integrate with recommender implementations

– Handle data formats and preprocessing

– Handle splitting, cross-validation, side datasets

– Save everything to file

– Work from file inputs so not tied to one framework

– Generate meaningful metrics

– Well documented and easy to use
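
A hedged Python skeleton of what a run meeting parts of this wish list might look like, writing every artefact (splits, metrics) to file so someone else can rerun or audit it. The helper names and file layout are illustrative, not any particular framework's API.

import json, os

def save_json(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f)

def run_experiment(interactions, split_fn, train_fn, evaluate_fn, outdir):
    os.makedirs(outdir, exist_ok=True)
    train, test = split_fn(interactions)
    save_json(train, os.path.join(outdir, "train.json"))   # audit trail of the split
    save_json(test, os.path.join(outdir, "test.json"))
    model = train_fn(train)
    metrics = evaluate_fn(model, train, test)               # e.g. precision@k, MAP@k
    save_json(metrics, os.path.join(outdir, "metrics.json"))
    return metrics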

Page 53: Offline evaluation of recommender systems: all pain and no gain?

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

● Mahout

● LensKit

● MyMediaLite

Page 54: Offline evaluation of recommender systems: all pain and no gain?

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

– Model validation loss, doesn't count

● Mahout

– Only rating prediction accuracy, doesn't count

● LensKit

– Too hard to understand, won't use

Page 55: Offline evaluation of recommender systems: all pain and no gain?

Isn't there software for that?

Current offerings:

● MyMediaLite

– Reports meaningful metrics

– Handles cross-validation

– Data splitting not transparent

– No support for pre-processing

– No built in support for standalone evaluation

– API is capable but current utils don't meet wishlist

Page 56: Offline evaluation of recommender systems: all pain and no gain?

Eating your own dog food

● Built a small framework around new algorithm

● https://github.com/mendeley/mrec

– Reports meaningful metrics

– Handles cross-validation

– Supports simple pre-processing

– Writes everything to file for reproducibility

– Provides API and utility scripts

– Runs standalone evaluations

– Readable Python code

Page 57: Offline evaluation of recommender systems: all pain and no gain?

Eating your own dog food

● Some lessons learned

– Usable frameworks are hard to write

– Tradeoff between clarity and scalability

– Should generate explicit validation sets

● Please contribute!

● Or use as inspiration to improve existing tools

Page 58: Offline evaluation of recommender systems: all pain and no gain?

Where next?

● Shift evaluation online:

– Contests based around online evaluation

– Realistic but not reproducible

– Could some run continuously?

● Recommender Systems as a commodity:

– Software and services reaching maturity now

– Business users can tune/evaluate themselves

– Is there a way to report results?

Page 59: Offline evaluation of recommender systems: all pain and no gain?

Where next?

● Support alternative query paradigms:

– More like this, less like that

– Metrics for dynamic/online recommenders

● Support recommendation with side data:

– LibFM, GenSGD, WARP research @google, …

– Open datasets?

Page 60: Offline evaluation of recommender systems: all pain and no gain?

Thanks for listening

[email protected]

@gamboviol

https://github.com/gamboviol

https://github.com/mendeley/mrec