Collaborative Filtering Recommendation Techniques
Yehuda Koren
Recommendation Types
• Editorial
• Simple aggregates: Top 10, Most Popular, Recent Uploads
• Tailored to individual users: books, CDs, other products at amazon.com; movies by Netflix, MovieLens; TV shows by TiVo; …
Recommendation Process
• Collect "known" user-item ratings
• Extrapolate unknown ratings from the known ratings
• Estimate ratings for the items a user has not yet seen
• Recommend the items with the highest estimated ratings to the user (a minimal sketch follows)
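That last step can be sketched in a few lines of Python (the `estimate` function stands in for whatever rating predictor is used; all names here are illustrative):

```python
def recommend(user, unseen_items, estimate, n=10):
    """Return the n items with the highest estimated ratings for this user."""
    ranked = sorted(unseen_items, key=lambda item: estimate(user, item), reverse=True)
    return ranked[:n]
```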
Collaborative Filtering
• Recommend items based on past transactions of many users
• Analyze relations between users and/or items
• Specific data characteristics are irrelevant
– Domain-free: user/item attributes are not necessary
– Can identify elusive aspects
“We’re quite curious, really. To the tune of one million dollars.” – Netflix Prize rules
• Goal: improve on Netflix's existing movie recommendation technology, Cinematch
• Criterion: reduction in root mean squared error (RMSE); see the sketch below
• Oct '06: contest began
• Oct '07: $50K progress prize for 8.43% improvement
• Oct '08: $50K progress prize for 9.44% improvement
• Sept '09: $1 million grand prize for 10.06% improvement
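For reference, the RMSE criterion transcribes directly into Python (a straightforward sketch, not contest code):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse([3.5, 4.0, 2.0], [4, 4, 3]))  # ~0.645
```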
Movie rating data
[Figure: training data shown as (user, movie, score) triples; test data as (user, movie, ?) pairs with the scores withheld]
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005
• Test data
– last few ratings of each user (2.8 million)
• Dates of ratings are given
Test Data Split into Three Pieces
• Probe
– ratings released
– allows participants to assess methods directly
• Daily submissions allowed for combined Quiz/Test data
– RMSE released for Quiz
– prizes based on Test RMSE
– identity of Quiz cases withheld
– Test RMSE withheld
[Figure: all data (~103M user-item pairs) splits into training data and a hold-out set (last 9 ratings for each user: 4.2M pairs); the hold-out set is randomly split three ways into Probe (labels provided) and Quiz/Test (labels retained by Netflix for scoring)]
#ratings per user
• Avg #ratings/user: 208
Most Active Users
User ID    # Ratings   Mean Rating
305344     17,651      1.90
387418     17,432      1.81
2439493    16,560      1.22
1664010    15,811      4.26
2118461    14,829      4.08
1461435     9,820      1.37
1639792     9,764      1.33
1314869     9,739      2.95
#ratings per movie
• Avg #ratings/movie: 5627
Movies Rated Most Often
Title                      # Ratings   Mean Rating
Miss Congeniality          227,715     3.36
Independence Day           216,233     3.72
The Patriot                200,490     3.78
The Day After Tomorrow     194,695     3.44
Pretty Woman               190,320     3.90
Pirates of the Caribbean   188,849     4.15
The Green Mile             180,883     4.31
Forrest Gump               180,736     4.30
Important RMSEs

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Cinematch: 0.9514 (baseline)
Prize '07 (BellKor): 0.8712
Prize '08 (BellKor+BigChaos): 0.8616
Grand Prize (BellKor's Pragmatic Chaos): 0.8554
Inherent noise: ????

[Figure: RMSE scale from erroneous to accurate; the personalization spectrum runs from random ranking and bestsellers lists down to the prize-winning models]
Problem Definition
[Figure: sparse ratings matrix, users #1 … #480,000 by items #1 … #17,770, with only a few observed entries such as 1, 4, 3, 2]
Users/items arranged in ratings space
Latent factorization algorithm
• Dim(Users) ≠ Dim(Items) (e.g., 17,770 vs. 480,000)
• Sparse data, with non-uniformly missing entries
Latent factor methods
Latent factorization algorithm
[Figure: User-1 … User-480K and Item-1 … Item-17,770 each mapped to a dense latent factor vector]
Users/items arranged in joint dense latent factors space
A 2-D factor space
[Figure: movies placed on two latent axes, "geared towards females" vs. "geared towards males" and "serious" vs. "escapist"; examples include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, Gus, Dave]
Basic matrix factorization model
[Figure: a users × items rating matrix approximated by the product of a users × 3 factor matrix and a 3 × items factor matrix]
A rank-3 SVD approximation
Estimate unknown ratings as inner-products of factors:
[Figure: the same rank-3 SVD approximation; a missing entry "?" in the rating matrix is estimated as the inner product of the corresponding user-factor and item-factor vectors, here 2.4]
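To make the inner-product estimate concrete, a minimal numeric sketch (the factor values are illustrative, chosen to reproduce the 2.4 in the figure):

```python
import numpy as np

p_u = np.array([1.2, 0.8, 1.0])  # a user's rank-3 factor vector (illustrative)
q_i = np.array([1.0, 0.5, 0.8])  # an item's rank-3 factor vector (illustrative)
r_hat = p_u @ q_i                # estimated rating = inner product of factors
print(round(r_hat, 2))           # 2.4
```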
Matrix factorization model
[Figure: the rating matrix factored into two lower-rank matrices]
Idea:
• Approximate the rating matrix as the product of two lower-rank matrices: R ≈ PQ
Properties:
• SVD isn't defined when entries are unknown ⇒ use specialized methods
• Very powerful model ⇒ can easily overfit; sensitive to regularization
A regularized model
• User factors: model a user $u$ as a vector $p_u \sim \mathcal{N}_k(\mu, \Sigma)$
• Movie factors: model a movie $i$ as a vector $q_i \sim \mathcal{N}_k(\gamma, \Lambda)$
• Ratings: measure "agreement" between $u$ and $i$: $r_{ui} \sim \mathcal{N}(p_u^T q_i, \varepsilon^2)$
• Simplifying assumptions: $\mu = \gamma = 0$, $\Sigma = \Lambda = \lambda I$
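These assumptions tie the probabilistic view to the optimization view: maximizing the posterior of the factors given the observed ratings (a standard MAP derivation, sketched here) is equivalent to the regularized least-squares objective on the next slide, with regularization constant $\varepsilon^2/\lambda$:

$$-2\log p(P, Q \mid R) = \frac{1}{\varepsilon^2} \sum_{\text{known } r_{ui}} \left(r_{ui} - p_u^T q_i\right)^2 + \frac{1}{\lambda} \left( \sum_u \|p_u\|^2 + \sum_i \|q_i\|^2 \right) + \text{const}$$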
Matrix factorization as a cost function

Rating prediction: $\hat{r}_{ui} = p_u^T q_i$, where $r_{ui}$ is the rating by user $u$ for item $i$, $p_u$ is the user factor of $u$, and $q_i$ is the item factor of $i$.

$$\min_{p_*,\,q_*} \sum_{\text{known } r_{ui}} \left(r_{ui} - p_u^T q_i\right)^2 + \lambda\left(\|p_u\|^2 + \|q_i\|^2\right)$$

The first term is the squared prediction error; the $\lambda$ term is regularization.

• Optimize by either stochastic gradient descent or alternating least squares
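As a concrete check, the objective transcribes directly into Python (a sketch; P and Q are arrays whose rows hold the user and item factor vectors, and the names are illustrative):

```python
import numpy as np

def cost(ratings, P, Q, lam):
    """Regularized squared error over the known ratings."""
    total = 0.0
    for u, i, r in ratings:               # only known (u, i, r) triples
        err = r - P[u] @ Q[i]             # prediction error
        total += err ** 2 + lam * (P[u] @ P[u] + Q[i] @ Q[i])
    return total
```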
Stochastic gradient descent optimization

Perform until convergence. For each training example $r_{ui}$ (the rating by user $u$ for item $i$):
– Compute the prediction error: $e_{ui} = r_{ui} - p_u^T q_i$
– Update the item factor: $q_i \leftarrow q_i + \gamma\,(e_{ui}\, p_u - \lambda\, q_i)$
– Update the user factor: $p_u \leftarrow p_u + \gamma\,(e_{ui}\, q_i - \lambda\, p_u)$

• Two constants to tune: $\gamma$ (step size) and $\lambda$ (regularization)
• Cross-validation: find the values that minimize error on a held-out test set
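Taken together, the updates yield the following minimal Python sketch (the dimension k, step size, regularization, and epoch count are illustrative choices, not the contest settings):

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=20, gamma=0.005, lam=0.02, n_epochs=20):
    """Fit R ~ P Q^T by stochastic gradient descent.

    ratings: list of known (u, i, r) triples.
    Returns P (n_users x k) and Q (n_items x k).
    """
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            e = r - P[u] @ Q[i]                    # prediction error e_ui
            pu, qi = P[u].copy(), Q[i].copy()      # use the pre-update values
            P[u] += gamma * (e * qi - lam * pu)    # user-factor update
            Q[i] += gamma * (e * pu - lam * qi)    # item-factor update
    return P, Q

# Prediction is the inner product of the learned factors:
# P, Q = factorize(train, n_users=480_000, n_items=17_770)
# r_hat = P[u] @ Q[i]
```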
[Figure: step-by-step SGD factorization on a small example, by Domonkos Tikk and the Gravity Team; a sparse rating matrix R is approximated by factor matrices P and Q, and their product fills in the missing entries]
Data normalization
• Most variability in the observed data is driven by user-specific and item-specific effects, regardless of the user-item interaction
• Examples:
– Some movies are systematically rated higher
– Some movies were rated by users who tend to rate low
– Ratings change over time
• The data must be adjusted to account for these main effects
• This stage requires the most insight into the nature of the data
• It can make a big difference…
Components of a rating predictor

$$\hat{r}_{ui} = b_u + b_i + p_u^T q_i$$
(user bias + item bias + user-item interaction)

Biases
• Separate users and movies
• Often overlooked
• Benefit from insights into users' behavior

User-item interaction
• Characterizes the match between users and items
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
A bias estimator
• We have expectations about the rating by user $u$ for item $i$, even without estimating $u$'s attitude towards items like $i$:
– the rating scale of user $u$
– the values of other ratings the user gave recently
– the (recent) popularity of item $i$
– selection bias
Biases: an example
• Mean rating: 3.7 stars
• The Sixth Sense is 0.5 stars above average
• Joe rates 0.2 stars below average
Baseline estimation: Joe will rate The Sixth Sense 4 stars (3.7 + 0.5 − 0.2 = 4.0)
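The same baseline estimator as a minimal Python sketch (bias values are taken from the example above; regularized estimation of the biases from data is omitted):

```python
def baseline(mu, item_bias, user_bias):
    """Baseline rating estimate: global mean plus item and user biases."""
    return mu + item_bias + user_bias

# Mean 3.7; The Sixth Sense is +0.5 above average; Joe is -0.2 below average:
print(round(baseline(3.7, 0.5, -0.2), 1))  # 4.0
```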
Sources of Variance in Netflix data

Total variance 1.276 = 0.732 (unexplained, 57%) + 0.415 (biases, 33%) + 0.129 (personalization, 10%)

Biases matter!
Exploring Temporal Effects
[Figure: Netflix ratings by date; a marked shift appears around early 2004]
Something Happened in Early 2004…
Are movies getting better with time?
Multiple sources of temporal dynamics
• Item-side effects:
– product perception and popularity are constantly changing
– seasonal patterns influence items' popularity
• User-side effects:
– customers redefine their taste
– transient, short-term bias; anchoring
– drifting rating scale
– change of rater within a household
Introducing temporal dynamics into biases
• Biases tend to capture the most pronounced aspects of temporal dynamics
• We observe changes in:
1. the rating scale of individual users (user bias)
2. the popularity of individual items (item bias)

Static model: $\hat{r}_{ui} = b_u + b_i + p_u^T q_i$

Add temporal dynamics: $\hat{r}_{ui}(t) = b_u(t) + b_i(t) + p_u^T q_i$
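One simple way to realize a time-dependent bias such as $b_i(t)$, sketched here under the assumption of discrete time bins (the bin count and all names are illustrative, not the talk's actual parameterization):

```python
import numpy as np

N_BINS = 30  # illustrative number of time bins over the data's date range

def time_bin(t, t_min, t_max, n_bins=N_BINS):
    """Map a rating date t (e.g., days since the data start) to a bin index."""
    frac = (t - t_min) / (t_max - t_min + 1e-9)
    return min(int(frac * n_bins), n_bins - 1)

b_i_static = 0.1            # static part of an item's bias (example value)
b_i_bin = np.zeros(N_BINS)  # learned per-bin offsets (left at zero here)

def item_bias(t, t_min, t_max):
    """Time-dependent item bias: b_i(t) = static part + per-bin offset."""
    return b_i_static + b_i_bin[time_bin(t, t_min, t_max)]
```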
General Lessons and Experience
Some users are more predictable…
• 1.4M predictions are split into 10 equal bins based on #ratings per user
[Figure: RMSE vs. #ratings per user; bin averages range from 12 to 918.5 ratings, RMSE axis from 0.7 to 1.0]
Factor models: Error vs. #parameters
[Figure: RMSE (0.875-0.91) vs. millions of parameters (10 to 100,000) for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, and SVD v.4; each curve is annotated with its number of factors (40 up to 1500); reference lines at Netflix (Cinematch) 0.9514 and the prize target 0.8563]
Ratings are not given at random!
Marlin, Zemel, Roweis, Slaney, "Collaborative Filtering and the Missing at Random Assumption", UAI 2007
[Figure: distribution of ratings in Yahoo! survey answers, Yahoo! music ratings, and Netflix ratings]
Which movies do users rate?
• A powerful source of information: characterize users by which movies they rated, rather than how they rated them
• A dense binary representation of the data: $R = \{r_{ui}\}_{u,i} \rightarrow B = \{b_{ui}\}_{u,i}$
[Figure: the sparse users × movies rating matrix R alongside its binary counterpart B, with 1 wherever a rating exists and 0 elsewhere]
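A minimal sketch of this binarization (assuming the ratings arrive as (user, item, rating) triples; names are illustrative):

```python
import numpy as np

def binarize(ratings, n_users, n_items):
    """Build the binary matrix B: 1 where a rating exists, 0 elsewhere."""
    B = np.zeros((n_users, n_items), dtype=np.int8)
    for u, i, _r in ratings:  # the rating value itself is deliberately ignored
        B[u, i] = 1
    return B
```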
The Wisdom of Crowds (of Models)
• "All models are wrong; some are useful" – G. Box
– some miss strong "local" relationships, e.g., among sequels
– others miss the cumulative effect of many small signals
– each complements the others
• Our best entry during Year 1 was a linear combination of 107 sets of predictions
• Our final solution was a linear blend of over 700 prediction sets (a sketch of such a blend follows)
– many variations of model structure and parameter settings
• Mega-blends are not needed in practice
– a handful of simple models achieves 90% of the improvement of the full blend
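A hedged sketch of such a linear blend: fit weights over the component models' predictions on a held-out (probe) set, here by plain least squares (the use of unconstrained least squares and all names are illustrative assumptions):

```python
import numpy as np

def fit_blend(component_preds, targets):
    """component_preds: (n_examples, n_models) predictions on a held-out set.
    targets: (n_examples,) true ratings. Returns least-squares blend weights."""
    w, *_ = np.linalg.lstsq(component_preds, targets, rcond=None)
    return w

def blend(component_preds, w):
    """Blended prediction: weighted sum of the component predictions."""
    return component_preds @ w
```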
Effect of ensemble size
[Figure: error (RMSE, 0.865-0.881) vs. #predictors in the blend (1 to 57)]
Yehuda Koren, Yahoo! Research
[email protected]
Homepage: www.research.att.com/~volinsky/netflix/