An Introduction to Collaborative Filtering with Apache Mahout
Sebastian Schelter
Recommender Systems Challenge at ACM RecSys 2012, 13.09.2012
Database Systems and Information Management Group (DIMA), Technische Universität Berlin
http://www.dima.tu-berlin.de/
Overview
■ Apache Mahout: Apache-licensed library that aims to provide highly scalable data mining and machine learning
■ its collaborative filtering module is based on the Taste framework by Sean Owen
■ mostly aimed at production scenarios, with a focus on
□ processing efficiency
□ ease of integration with different datastores, web applications, Amazon EC2
□ scalability: recommendations, item similarities and matrix decompositions can be computed via MapReduce on Apache Hadoop
■ not widely used in recommender challenges
□ not enough different algorithms implemented?
□ not enough tooling for evaluation?
→ it's open source, so it's up to you to change that!
Preference & DataModel
■ Preference encapsulates a user-item interaction as a (user, item, value) triple
□ only numeric userIDs and itemIDs are allowed, for memory efficiency
□ PreferenceArray encapsulates a set of preferences
■ DataModel encapsulates a dataset
□ lots of convenient accessor methods like getNumUsers(), getPreferencesForItem(itemID), ...
□ allows temporal information to be added to preferences
□ lots of options to store the data (in-memory, file, database, key-value store)
□ drawback: for a lot of use cases, all the data has to fit into memory to allow efficient recommendation
DataModel dataModel = new FileDataModel(new File("movielens.csv"));
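To make the (user, item, value) idea concrete, here is a minimal plain-Java sketch of an in-memory store built from comma-separated lines. It is not Mahout's implementation: the class TinyDataModel, its nested Pref class and the method fromCsvLines are hypothetical names chosen for illustration, and only the two accessor names are borrowed from the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative stand-in, NOT Mahout code: keeps (userID, itemID, value)
// triples in memory and offers two accessors named after the slide.
final class TinyDataModel {

    /** One user-item interaction with numeric IDs. */
    static final class Pref {
        final long userID;
        final long itemID;
        final float value;

        Pref(long userID, long itemID, float value) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
        }
    }

    private final Map<Long, List<Pref>> prefsByItem = new HashMap<>();
    private final Set<Long> userIDs = new HashSet<>();

    /** Parses lines of the assumed form "userID,itemID,value". */
    static TinyDataModel fromCsvLines(List<String> lines) {
        TinyDataModel model = new TinyDataModel();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split(",");
            Pref pref = new Pref(
                Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim()),
                Float.parseFloat(parts[2].trim()));
            model.userIDs.add(pref.userID);
            model.prefsByItem
                .computeIfAbsent(pref.itemID, id -> new ArrayList<>())
                .add(pref);
        }
        return model;
    }

    /** Number of distinct users seen in the input. */
    int getNumUsers() {
        return userIDs.size();
    }

    /** All preferences recorded for one item (empty list if unknown). */
    List<Pref> getPreferencesForItem(long itemID) {
        return prefsByItem.getOrDefault(itemID, new ArrayList<>());
    }
}
```

Because everything lives in hash maps on the heap, the sketch also shows the drawback named above: the whole dataset has to fit into memory.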
■ in the Million Song Dataset Challenge, the winning solution used a novel item similarity measure
■ it would be great to see this measure featured in Mahout as well
■ Task
□ implement the novel item similarity measure as an implementation of Mahout's ItemSimilarity interface
■ Future Work
□ this novel similarity measure is asymmetric; ensure that it is applied correctly in all scenarios
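The slides do not give the formula of the measure, so the following is only a sketch of what an asymmetric item similarity can look like. It assumes an asymmetric cosine over the sets of users who interacted with each item, sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)). The class name, the method signature and the parameter alpha are assumptions for illustration and are not part of Mahout.

```java
import java.util.Set;

// Hypothetical sketch of an asymmetric item similarity, NOT Mahout code.
// The formula (asymmetric cosine) is an assumption; the slides do not define it.
final class AsymmetricCosineSketch {

    /**
     * sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)).
     * For alpha = 0.5 this reduces to the symmetric cosine on binary data;
     * for any other alpha, sim(i, j) and sim(j, i) generally differ.
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0; // no co-occurrence information available
        }
        // Iterate over the smaller set when counting common users.
        Set<Long> smaller = usersOfI.size() <= usersOfJ.size() ? usersOfI : usersOfJ;
        Set<Long> larger = smaller == usersOfI ? usersOfJ : usersOfI;
        int common = 0;
        for (Long user : smaller) {
            if (larger.contains(user)) {
                common++;
            }
        }
        double denominator =
            Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha);
        return common / denominator;
    }
}
```

With such a measure the argument order matters, which is why any caller that caches or reuses a value for the swapped pair of items would need to be checked.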
Project: temporal split evaluator
■ currently, Mahout's standard RecommenderEvaluator randomly splits the data into a training and a test set
■ for datasets with timestamps, it would be much more interesting to use this temporal information to split the data into training and test set
■ Task
□ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator
■ Future Work
□ factor out the logic for splitting datasets into training and test set
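The core of a temporal split can be sketched in plain Java, independent of Mahout's evaluator classes. The sketch assumes a single global cut: preferences are sorted by timestamp, the oldest fraction goes to training and the rest to testing. A per-user cut would be an equally plausible design. All names here (TemporalSplitSketch, TimedPref, Split, trainingFraction) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT Mahout code: split timestamped preferences by time.
final class TemporalSplitSketch {

    /** A (user, item, value) triple with a timestamp. */
    static final class TimedPref {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        TimedPref(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Result holder: older interactions for training, newer ones for testing. */
    static final class Split {
        final List<TimedPref> training;
        final List<TimedPref> test;

        Split(List<TimedPref> training, List<TimedPref> test) {
            this.training = training;
            this.test = test;
        }
    }

    /** Puts the oldest trainingFraction of all preferences into the training set. */
    static Split split(List<TimedPref> prefs, double trainingFraction) {
        if (trainingFraction < 0.0 || trainingFraction > 1.0) {
            throw new IllegalArgumentException("trainingFraction must be in [0, 1]");
        }
        // Sort a copy so the caller's list is left untouched.
        List<TimedPref> sorted = new ArrayList<>(prefs);
        sorted.sort((a, b) -> Long.compare(a.timestamp, b.timestamp));
        int cut = (int) Math.floor(sorted.size() * trainingFraction);
        return new Split(
            new ArrayList<>(sorted.subList(0, cut)),
            new ArrayList<>(sorted.subList(cut, sorted.size())));
    }
}
```

Keeping the split in its own method, separate from the error computation, is also one way to approach the factoring-out mentioned under Future Work.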
Project: baseline method for rating prediction
■ port MyMediaLite's UserItemBaseline to Mahout (preliminary port already available)
■ user-item-baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)
■ Task
□ polish the code
□ make it work with Mahout's DataModel
■ Future Work
□ create an ItemBasedRecommender that makes use of the estimated biases
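The baseline idea can be sketched as prediction = global mean + user bias + item bias. The following plain-Java version is neither the MyMediaLite code nor the preliminary port. It assumes a simple one-pass estimate in which item biases are computed first and user biases are computed from what remains, each damped by a regularization constant. The class name and the parameters regItem and regUser are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a user-item baseline, NOT the MyMediaLite or Mahout code.
final class UserItemBaselineSketch {

    private final double globalMean;
    private final Map<Long, Double> itemBias = new HashMap<>();
    private final Map<Long, Double> userBias = new HashMap<>();

    /** Entry k of the three arrays describes one rating (users[k], items[k], ratings[k]). */
    UserItemBaselineSketch(long[] users, long[] items, double[] ratings,
                           double regItem, double regUser) {
        double sum = 0.0;
        for (double r : ratings) {
            sum += r;
        }
        globalMean = ratings.length == 0 ? 0.0 : sum / ratings.length;

        // Item bias: regularized average deviation from the global mean.
        Map<Long, double[]> acc = new HashMap<>(); // value = {sum, count}
        for (int k = 0; k < ratings.length; k++) {
            double[] a = acc.computeIfAbsent(items[k], id -> new double[2]);
            a[0] += ratings[k] - globalMean;
            a[1] += 1.0;
        }
        for (Map.Entry<Long, double[]> e : acc.entrySet()) {
            itemBias.put(e.getKey(), e.getValue()[0] / (regItem + e.getValue()[1]));
        }

        // User bias: regularized average of what mean and item bias leave unexplained.
        acc.clear();
        for (int k = 0; k < ratings.length; k++) {
            double[] a = acc.computeIfAbsent(users[k], id -> new double[2]);
            a[0] += ratings[k] - globalMean - itemBias.get(items[k]);
            a[1] += 1.0;
        }
        for (Map.Entry<Long, double[]> e : acc.entrySet()) {
            userBias.put(e.getKey(), e.getValue()[0] / (regUser + e.getValue()[1]));
        }
    }

    /** Unknown users or items contribute a bias of zero. */
    double predict(long userID, long itemID) {
        return globalMean
            + userBias.getOrDefault(userID, 0.0)
            + itemBias.getOrDefault(itemID, 0.0);
    }
}
```

An ItemBasedRecommender that uses the biases, as suggested under Future Work, could subtract such a baseline from the ratings before weighting them by similarity. That combination is not shown here.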
Thank you.
Questions?
Sebastian Schelter
Database Systems and Information Management Group (DIMA)
Technische Universität Berlin