A more efficient Collaborative Filtering method
Tam Ming Wai
Dr. Nikos Mamoulis
Outline
Introduction to Collaborative Filtering
Special nature of CF
Inverted File Search Algorithm
Item-based
Slope-one
Hybrid method
No random access
Experiment
Collaborative Filtering
Looking for opinions from friends with similar taste
The active user collaborates with other users
Trust those with similar taste more
Searching for similar users
Which user is the best one to trust in order to predict “?”?
Everyone: only i2 is relevant
      i1  i2  i3  i4  ia
ua    -   2   -   -   ?
u1    -   2   -   -   3
u2    1   2   -   -   1
u3    -   2   2   -   4
u4    2   2   -   3   2
u5    1   2   2   1   4
Similarity
The similarity is not based on all attributes (the items)
Only the items which the active user rated are relevant
Some authors (Breese et al.) suggest considering more items through default voting, but this is not popular.
Searching for similar users
Which user is the best one to trust in order to predict “?”?
Everyone except u5
      i1  i2  i3  i4  ia
ua    1   2   3   5   ?
u1    1   -   -   -   3
u2    -   2   -   -   1
u3    -   -   3   -   4
u4    -   -   -   5   2
u5    -   -   -   -   4
Similarity
The similarity is not based on all attributes (the items)
Only the items which both the active user and the user under consideration rated are relevant
A Notice
ua is similar to u1, u2, u3 and u4
BUT
u1, u2, u3 and u4 are not relevant to each other at all
Searching for similar users
Which user is the best one to trust in order to predict “?”?
u3 is the one: only u3 is relevant
      i1  i2  i3  i4  ia
ua    1   2   3   4   ?
u1    1   2   3   5   -
u2    2   3   1   4   -
u3    4   3   2   1   4
u4    2   1   1   3   -
u5    1   4   2   1   -
Top-k most similar users
It is not the top-k among all users
It is the top-k among the users who rated ia
Summary on the nature
The matrix is incomplete
Similarity
The set of items could be different for every pair of users (the intersection)
The set of users (the candidates) could be different for each query (those who rated ia)
No triangle inequality (in the extreme, ua is similar to u1 and u2, but u1 and u2 can be irrelevant to each other)
Popular Similarity measure
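Very often, Pearson Correlation is used:
w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \, \sum_j (v_{i,j} - \bar{v}_i)^2}}
j iterates through the items that were rated by both user i and user a
v_{a,j}: vote (rating) on item j by user a
\bar{v}_a: average vote (rating) of user a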
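As a minimal sketch of this measure, the Python function below computes the correlation over co-rated items only. The dict-based profile layout and the choice to take each average over the user's whole profile are assumptions made for illustration; the slides do not specify either.

```python
from math import sqrt

def pearson(profile_a, profile_i):
    """Pearson correlation between two users over the items both rated.

    Each profile maps item id -> rating. Averages are taken over each
    user's whole profile (an assumption, not stated in the slides).
    """
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return 0.0
    mean_a = sum(profile_a.values()) / len(profile_a)
    mean_i = sum(profile_i.values()) / len(profile_i)
    num = sum((profile_a[j] - mean_a) * (profile_i[j] - mean_i) for j in common)
    den_a = sum((profile_a[j] - mean_a) ** 2 for j in common)
    den_i = sum((profile_i[j] - mean_i) ** 2 for j in common)
    if den_a == 0 or den_i == 0:
        return 0.0  # no variance on the co-rated items
    return num / sqrt(den_a * den_i)

# Profiles taken from the earlier example table (ua and u3).
ua = {"i1": 1, "i2": 2, "i3": 3, "i4": 4}
u3 = {"i1": 4, "i2": 3, "i3": 2, "i4": 1, "ia": 4}
print(pearson(ua, u3))  # negative: u3 rates in the opposite direction to ua
```

A strongly negative value can still be informative for prediction, which is one reason the candidate set is restricted by who rated ia rather than by sign.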
Brute Force Searching
Given an active user and an active movie:
Relevant movies are known from the active user profile
Candidates are known from the active movie profile
Find sim(ua, ui) for all ui in the candidate set
The top-k are used as advisors
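The brute-force search can be sketched as follows. The function names and the stand-in similarity (negative mean absolute difference on co-rated items) are hypothetical; any similarity measure could be passed in instead.

```python
import heapq

def neg_mean_abs_diff(profile_a, profile_i):
    """Hypothetical stand-in similarity: higher means closer ratings."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return float("-inf")
    return -sum(abs(profile_a[j] - profile_i[j]) for j in common) / len(common)

def brute_force_top_k(ratings, active_user, active_item, k, sim=neg_mean_abs_diff):
    """Score every candidate (user who rated active_item) and keep the best k."""
    profile_a = ratings[active_user]
    candidates = [u for u, p in ratings.items()
                  if u != active_user and active_item in p]
    scored = [(sim(profile_a, ratings[u]), u) for u in candidates]
    return heapq.nlargest(k, scored)

# Illustrative rating matrix; "-" entries are simply absent.
ratings = {
    "ua": {"i1": 1, "i2": 2, "i4": 4},
    "u1": {"i2": 2, "i3": 3, "ia": 4},
    "u2": {"i1": 2, "i2": 3, "i3": 1},
    "u3": {"i1": 4, "i3": 2, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 1, "i4": 3, "ia": 3},
    "u5": {"i1": 1, "i3": 2, "i4": 1},
}
advisors = brute_force_top_k(ratings, "ua", "ia", k=2)
print([user for _, user in advisors])
```

Only u1, u3 and u4 are scored here, because u2 and u5 did not rate ia. The cost grows with the number of candidates, since every candidate profile is read in full.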
Useful Information
What information is useful?
      i1  i2  i3  i4  ia
ua    1   2   -   4   ?
u1    -   2   3   -   4
u2    2   3   1   -   -
u3    4   -   2   1   4
u4    2   1   -   3   3
u5    1   -   2   1   -
The entries highlighted in green are the useful ones
All user profiles, or all movie profiles, contain the useful information
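Inverted file
User-item rating matrix:
          Item
          1   2   3   4   5   6
User 1    -   1   -   3   4   5
User 2    1   3   4   5   -   5
User 3    -   3   -   4   1   -
Inverted lists, one per item, of (user, rating) entries:
Item 1: (2, 1)
Item 2: (1, 1) (2, 3) (3, 3)
Item 3: (2, 4)
Item 4: (1, 3) (2, 5) (3, 4)
Item 5: (1, 4) (3, 1)
Item 6: (1, 5) (2, 5)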
Coster & Svensson 2002
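A small sketch of how such an inverted file could be built from user profiles is shown below. The dict-of-dicts layout and the function name are illustrative assumptions, not the data structure used by Coster & Svensson.

```python
from collections import defaultdict

def build_inverted_file(ratings):
    """Turn {user: {item: rating}} into {item: [(user, rating), ...]}."""
    inverted = defaultdict(list)
    for user in sorted(ratings):
        for item, rating in ratings[user].items():
            inverted[item].append((user, rating))
    return dict(inverted)

# The 3-user, 6-item matrix from the slide; missing ratings are omitted.
ratings = {
    1: {2: 1, 4: 3, 5: 4, 6: 5},
    2: {1: 1, 2: 3, 3: 4, 4: 5, 6: 5},
    3: {2: 3, 4: 4, 5: 1},
}
inverted = build_inverted_file(ratings)
for item in sorted(inverted):
    print(item, inverted[item])
```

A query then reads only the lists of the items the active user rated, instead of every candidate's full profile.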
Pearson Correlation
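The active user is fixed in a single query
For each user i, there are 3 summations
Instead of calculating w(a,i) for each user i, calculate SAI[i], SAA[i] and SII[i] for all users (with the help of the inverted lists)
With v_{a,j} the rating of user a on item j, \bar{v}_a the average rating of user a, and j iterating over the items rated by both users:
SAA[i] = \sum_j (v_{a,j} - \bar{v}_a)^2
SAI[i] = \sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)
SII[i] = \sum_j (v_{i,j} - \bar{v}_i)^2
w(a,i) = SAI[i] / \sqrt{SAA[i] \cdot SII[i]}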
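The accumulation over inverted lists might look like the sketch below. It assumes each of the three arrays holds one of the sums that appear in the Pearson correlation, restricted to co-rated items, and that the active user is not stored in the index; the function name and argument layout are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def pearson_via_inverted_file(inverted, means, active_profile, active_mean):
    """Return {user: correlation with the active user}.

    inverted: {item: [(user, rating), ...]}, means: {user: average rating}.
    Only the lists of items the active user rated are scanned.
    """
    sai = defaultdict(float)  # sum of products of deviations
    saa = defaultdict(float)  # sum of squared deviations of the active user
    sii = defaultdict(float)  # sum of squared deviations of user i
    for item, v_a in active_profile.items():
        dev_a = v_a - active_mean
        for user, v_i in inverted.get(item, []):
            dev_i = v_i - means[user]
            sai[user] += dev_a * dev_i
            saa[user] += dev_a * dev_a
            sii[user] += dev_i * dev_i
    weights = {}
    for user in sai:
        den = sqrt(saa[user] * sii[user])
        weights[user] = sai[user] / den if den > 0 else 0.0
    return weights

# Inverted file of the 3-user example; the active profile is made up.
inverted = {
    1: [(2, 1)],
    2: [(1, 1), (2, 3), (3, 3)],
    3: [(2, 4)],
    4: [(1, 3), (2, 5), (3, 4)],
    5: [(1, 4), (3, 1)],
    6: [(1, 5), (2, 5)],
}
means = {1: 13 / 4, 2: 18 / 5, 3: 8 / 3}
active = {2: 1, 4: 3, 6: 5}
weights = pearson_via_inverted_file(inverted, means, active, active_mean=3.0)
print(weights)
```

Each entry of a scanned list updates its user's three accumulators once, so the work is proportional to the total length of the lists read.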
Early Termination
Self-Indexing Inverted Files for Fast Text Retrieval, Alistair Moffat and Justin Zobel, 1994
Quit: stop when the number of users reaches a threshold
Continue: stop considering new users when the number of users reaches a threshold
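The two strategies can be contrasted with a short sketch. The accumulator here is a plain product of ratings, used only as a hypothetical stand-in score, and the function name and `strategy` parameter are illustrative.

```python
def accumulate(inverted, active_profile, max_users, strategy="continue"):
    """Scan inverted lists with a cap on the number of accumulated users.

    "quit": stop all processing once max_users users have accumulators.
    "continue": keep updating known users but admit no new ones.
    """
    acc = {}
    for item, v_a in active_profile.items():
        for user, v_i in inverted.get(item, []):
            if user not in acc:
                if len(acc) >= max_users:
                    continue  # threshold reached: ignore unseen users
                acc[user] = 0.0
            acc[user] += v_a * v_i  # stand-in score, not a real similarity
            if strategy == "quit" and len(acc) >= max_users:
                return acc
    return acc

inverted = {
    2: [(1, 1), (2, 3), (3, 3)],
    4: [(1, 3), (2, 5), (3, 4)],
}
active = {2: 1, 4: 3}
print(accumulate(inverted, active, max_users=2, strategy="quit"))
print(accumulate(inverted, active, max_users=2, strategy="continue"))
```

In this sketch "quit" returns partial scores for the first two users, while "continue" finishes both lists for those two users and never admits user 3.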
Item-based
The matrix is symmetric
Exchange the roles of rows (user profiles) and columns (movie profiles)
Look for movies that are similar to the active movie
If the users act similarly on both movies, the active user may act similarly too.
Item-based example
The users act exactly the same on i2 and ia
Perhaps i2 and ia are very similar
“?” may be 1, as ua gave i2 the rating 1
      i1  i2  i3  i4  ia
ua    1   1   3   4   ?
u1    1   1   3   5   1
u2    2   2   1   4   2
u3    4   4   2   1   4
u4    2   4   1   3   4
u5    1   5   2   1   5
Sarwar et al. 2001: pre-find the top-k similar items
Amazon.com: personal promotion on the top-k similar items
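The item-based example can be reproduced with the sketch below. Cosine similarity between item columns is an assumed choice for illustration (Sarwar et al. discuss several measures), and the function names are hypothetical.

```python
from math import sqrt

def item_cosine(ratings, item_x, item_y):
    """Cosine similarity of two item columns over users who rated both."""
    pairs = [(p[item_x], p[item_y]) for p in ratings.values()
             if item_x in p and item_y in p]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm = sqrt(sum(x * x for x, _ in pairs) * sum(y * y for _, y in pairs))
    return dot / norm if norm else 0.0

def most_similar_item(ratings, active_profile, target):
    """Among the items the active user rated, pick the one closest to target."""
    return max(active_profile, key=lambda item: item_cosine(ratings, item, target))

# The example table, without the active user's row.
ratings = {
    "u1": {"i1": 1, "i2": 1, "i3": 3, "i4": 5, "ia": 1},
    "u2": {"i1": 2, "i2": 2, "i3": 1, "i4": 4, "ia": 2},
    "u3": {"i1": 4, "i2": 4, "i3": 2, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 4, "i3": 1, "i4": 3, "ia": 4},
    "u5": {"i1": 1, "i2": 5, "i3": 2, "i4": 1, "ia": 5},
}
ua = {"i1": 1, "i2": 1, "i3": 3, "i4": 4}
best = most_similar_item(ratings, ua, "ia")
print(best, ua[best])
```

Because item-to-item similarities do not depend on the active user, they can be computed ahead of time, which is the point of pre-finding the top-k similar items.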
Slope-one
For each item pair j and i
For all users who rated both items
Find the average difference in rating
Slope-one example
All users gave ia a rating higher than i3 by 1
By considering ia and i3, ua may rate “?” as 4
      i1  i2  i3  i4  ia
ua    1   4   3   4   ?
u1    1   2   1   5   2
u2    2   2   1   4   2
u3    4   4   3   1   4
u4    2   4   3   3   4
u5    1   5   4   1   5
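The same arithmetic can be sketched in Python. The function names are hypothetical, and only the single-pair prediction from the slide is shown.

```python
def deviation(ratings, item_j, item_i):
    """Average of (rating on item_j - rating on item_i) over users with both."""
    diffs = [p[item_j] - p[item_i] for p in ratings.values()
             if item_j in p and item_i in p]
    return sum(diffs) / len(diffs) if diffs else None

def predict_from_pair(ratings, active_profile, target, source):
    """Predict the target rating from one rated item plus the deviation."""
    dev = deviation(ratings, target, source)
    return None if dev is None else active_profile[source] + dev

# The example table, without the active user's row.
ratings = {
    "u1": {"i1": 1, "i2": 2, "i3": 1, "i4": 5, "ia": 2},
    "u2": {"i1": 2, "i2": 2, "i3": 1, "i4": 4, "ia": 2},
    "u3": {"i1": 4, "i2": 4, "i3": 3, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 4, "i3": 3, "i4": 3, "ia": 4},
    "u5": {"i1": 1, "i2": 5, "i3": 4, "i4": 1, "ia": 5},
}
ua = {"i1": 1, "i2": 4, "i3": 3, "i4": 4}
print(deviation(ratings, "ia", "i3"))              # average difference
print(predict_from_pair(ratings, ua, "ia", "i3"))  # ua's i3 rating plus it
```

A full Slope One predictor would combine such per-item predictions over every item the active user rated; this sketch stops at one pair to mirror the example.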
Summary
A common argument: there are fewer items than users
Pre-computation: similarity in item-based, dev_{j,i} in slope-one
Hybrid method
Finding top-k similar users
Brute force: inefficient when the number of candidates is large
Inverted file: inefficient when the number of relevant items is large
Mixing the 2
Segmented inverted file example
The inverted list of I1 is split into segments by rating:
All users here gave I1 rating 5
All users here gave I1 rating 4
All users here gave I1 rating 3
All users here gave I1 rating 2
All users here gave I1 rating 1
Accessing Segmented inverted file
First access the segments that are closer to the active user’s rating
Access example (inverted list of I1)
Access order 1, d=0 (ua is here)
Access order 2, d=1
Access order 3, d=1
Access order 4, d=2
Access order 5, d=3
Accessing Segmented inverted file
The inverted file is a list ranked on d (distance to ua’s rating)
The best bound on similarity can be found
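The segment layout and access order can be sketched as follows. The 1 to 5 rating scale, the tie-breaking rule, the user ratings on I1, and the function names are all assumptions made for illustration.

```python
def build_segmented_list(item_ratings, scale=(1, 2, 3, 4, 5)):
    """Group the users who rated one item into one segment per rating value."""
    segments = {r: [] for r in scale}
    for user, rating in item_ratings.items():
        segments[rating].append(user)
    return segments

def access_order(active_rating, scale=(1, 2, 3, 4, 5)):
    """Rating segments sorted by distance d to the active user's rating."""
    return sorted(scale, key=lambda r: (abs(r - active_rating), r))

# Hypothetical ratings on item I1; the active user is assumed to rate it 4.
i1_ratings = {"u1": 5, "u2": 4, "u3": 4, "u4": 2, "u5": 1}
segments = build_segmented_list(i1_ratings)
order = access_order(4)
for rank, r in enumerate(order, start=1):
    print(f"access order {rank}: rating {r}, d={abs(r - 4)}, users={segments[r]}")
```

Since segments arrive in non-decreasing d, any user not yet seen in this list must differ from ua on I1 by at least the current d, which is what makes a bound on the final similarity possible.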
Algorithm
Phase 1
Access all inverted lists, such that all d=0 segments are loaded
Starting from the most frequently seen candidates, find the actual similarity (k candidates are needed in total)
The similarity of the k-th candidate whose actual similarity is known will be the initial filter
Algorithm
Phase 2: keep loading from the inverted lists
The best bound of the similarity decreases
A candidate whose similarity bound is worse than the filter is pruned
The partial information becomes more complete
Update the filter after some number of segments are loaded
Stop when the number of remaining candidates is small
Algorithm – phase 2
In the implementation, the items that ua rated extremely (close to 1 or 5) are loaded first
The candidates’ best bounds drop faster
Similarity measure
Additive L1
Segmental Manhattan Distance (SMD) = Manhattan Distance / # of relevant items
Sim = 1 - SMD / (maximum distance)
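A minimal sketch of this measure follows. It assumes that "relevant items" means the items rated by both users and that the maximum distance is 4 (a 1 to 5 scale); neither is stated explicitly on the slide.

```python
def smd_similarity(profile_a, profile_i, max_distance=4.0):
    """1 - (Manhattan distance / number of co-rated items) / max_distance."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return None  # no co-rated items, similarity undefined
    manhattan = sum(abs(profile_a[j] - profile_i[j]) for j in common)
    smd = manhattan / len(common)
    return 1.0 - smd / max_distance

ua = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u2 = {"i1": 1, "ia": 1}
print(smd_similarity(ua, u1), smd_similarity(ua, u2))
```

Both calls give the same similarity even though u2 shares only one item with ua, because the distance is averaged per co-rated item. Being additive, the distance can be accumulated one inverted-list entry at a time.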
Horting
      i1  i2  i3  i4  i5  ia
ua    1   2   3   4   5   ?
u1    1   2   3   4   5   5
u2    1   -   -   -   -   1
Sim(ua, u1) = Sim(ua, u2)
u2 is less reliable
Best bound
We have ‘user num of appearance’
‘max num of more appearance’ = min(ua_profile.len, ui_profile.len) - ‘user num of appearance’
if the user was never seen in any segment
    best distance = 1
else if (partial distance > 1)
    the user appears in the unseen items, and d=1
else if (‘max num of more appearance’ < horting_factor)
    the user appears just enough times
else
    the user does not appear anymore, the partial distance is the best
No random access
The inverted file is a list ranked on d (distance to ua’s rating)
Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006
Phase 1: do not find any actual similarity until the best bound of an unseen user is worse than the k-th best worst bound
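The phase 1 stopping test can be sketched as below. Bounds are expressed as similarities (higher is better), and the function and argument names are hypothetical.

```python
import heapq

def phase1_can_stop(unseen_best_bound, worst_bounds, k):
    """True once no unseen user can still enter the top-k.

    unseen_best_bound: best similarity a not-yet-seen user could reach.
    worst_bounds: {user: worst-case similarity} for the users seen so far.
    """
    if len(worst_bounds) < k:
        return False  # fewer than k users seen, cannot decide yet
    kth_best_worst = heapq.nlargest(k, worst_bounds.values())[-1]
    return unseen_best_bound < kth_best_worst

seen = {"u1": 0.9, "u2": 0.7, "u3": 0.4}
print(phase1_can_stop(0.5, seen, k=2))
print(phase1_can_stop(0.5, seen, k=3))
```

With k=2 the second-best worst bound is 0.7, so an unseen user capped at 0.5 cannot qualify; with k=3 the threshold drops to 0.4 and the scan has to go on.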
Worst Bound
While a user’s partial distance is smaller than the maximum possible distance, include the distance
Disk I/O statistics (hybrid)
% of actual similarity                       7.60%
% of entries loaded from inverted file      68.52%
% of entries which are loaded and relevant  49.77%
References
Breese et al. Empirical Analysis of Predictive Algorithms for Collaborative Filtering
Coster & Svensson 2002. Inverted File Search Algorithms for Collaborative Filtering
Lemire & Maclachlan 2005. Slope One Predictors for Online Rating-Based Collaborative Filtering
Sarwar et al. 2001. Item-Based Collaborative Filtering Recommendation Algorithms
Aggarwal et al. Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering
Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006. Efficient Aggregation of Ranked Inputs