Mahout part1
Post on 15-Jan-2015
Mahout in Action, Part 1
Yasmine M. Gaber, 28 February 2013
Agenda
Meet Apache Mahout
Part 1: Recommendation
Part 2: Clustering
Part 3: Classification
Meet Apache Mahout
It is an open source machine learning library from Apache
It is scalable
It is a Java library
It can be used with Hadoop to process large-scale data
Famous Engines
Recommender engines: Amazon.com, Netflix, dating sites like Líbímseti, social networking sites like Facebook
Clustering engines: Google News, search engines like Clusty
Classification engines: spam emails, Google's Picasa, optical character recognition software, Apple's Genius feature in iTunes
Recommendations
Recommender Input
A preference consists of a user ID, an item ID, and a value expressing the user's preference for the item
The input is a .csv file with one preference per line: userID,itemID,value
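A minimal sketch of reading such input in plain Java, assuming each line has the form userID,itemID,value. The class name PreferenceCsv, the method parse, and the sample rows are made up for illustration; in Mahout itself, file-based input is handled by FileDataModel.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): parses "userID,itemID,value" lines
// into a nested map of user -> (item -> preference value).
final class PreferenceCsv {
    static Map<Long, Map<Long, Float>> parse(List<String> lines) {
        Map<Long, Map<Long, Float>> prefs = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.trim().split(",");
            long user = Long.parseLong(fields[0].trim());
            long item = Long.parseLong(fields[1].trim());
            float value = Float.parseFloat(fields[2].trim());
            prefs.computeIfAbsent(user, k -> new LinkedHashMap<>()).put(item, value);
        }
        return prefs;
    }

    public static void main(String[] args) {
        // Made-up sample rows.
        List<String> sample = Arrays.asList("1,101,5.0", "1,102,3.0", "2,101,2.0");
        System.out.println(parse(sample));
    }
}
```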
Create Recommender
Recommender Evaluation
Average absolute difference vs. root-mean-square
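The two scores can be compared with a small plain-Java sketch. The class EvalMetrics and its method names are assumptions for illustration, not Mahout classes. Both take estimated and actual preference values in matching order; lower is better for both, and the root-mean-square variant penalizes large misses more heavily.

```java
// Hypothetical helper (not part of Mahout) comparing two ways of scoring
// estimated preference values against actual ones.
final class EvalMetrics {
    // Mean of |estimate - actual| over all test preferences.
    static double averageAbsoluteDifference(double[] estimated, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            sum += Math.abs(estimated[i] - actual[i]);
        }
        return sum / estimated.length;
    }

    // Square root of the mean of squared differences.
    static double rootMeanSquare(double[] estimated, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            double diff = estimated[i] - actual[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / estimated.length);
    }

    public static void main(String[] args) {
        double[] actual = {3.0, 5.0, 4.0};     // made-up values
        double[] estimated = {3.5, 2.0, 5.0};  // made-up values
        System.out.println(averageAbsoluteDifference(estimated, actual));
        System.out.println(rootMeanSquare(estimated, actual));
    }
}
```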
Mahout RecommenderEvaluator
Precision and Recall
RecommenderIRStatsEvaluator
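A plain-Java sketch of the two information-retrieval measures, assuming the "relevant" items for a user are already known. The class IrStats and its methods are hypothetical names for illustration; this is not how RecommenderIRStatsEvaluator is implemented.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Mahout): precision and recall of a
// recommendation list against a set of known relevant items.
final class IrStats {
    private static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                count++;
            }
        }
        return count;
    }

    // Fraction of the recommended items that are relevant.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        if (recommended.isEmpty()) {
            return 0.0; // assumption: define precision as 0 for an empty list
        }
        return (double) hits(recommended, relevant) / recommended.size();
    }

    // Fraction of the relevant items that were recommended.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // assumption: define recall as 0 when nothing is relevant
        }
        return (double) hits(recommended, relevant) / relevant.size();
    }
}
```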
Representing Recommender Data
Preference object: new GenericPreference(123, 456, 3.0f) (user ID 123, item ID 456, preference value 3.0)
Preference Array
FastByIDMap and FastIDSet
In-memory DataModels
GenericDataModel
File-based data
Refreshable components
Database-based data
Coping without preference values
User-based Recommender
The algorithm
for every item i that u has no preference for yet
  for every other user v that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average
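The loop above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Mahout's GenericUserBasedRecommender: the class UserBasedSketch is hypothetical, preferences are held in a nested map, the similarity function is passed in by the caller, non-positive similarities are ignored, and no neighborhood is applied (every other user contributes).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch (not Mahout code) of user-based recommendation.
final class UserBasedSketch {
    static List<Long> recommend(long u,
                                Map<Long, Map<Long, Double>> prefs,
                                ToDoubleBiFunction<Long, Long> similarity,
                                int howMany) {
        Map<Long, Double> uPrefs = prefs.getOrDefault(u, Collections.emptyMap());

        // Candidate items: everything some user rated that u has not.
        Set<Long> candidates = new TreeSet<>();
        for (Map<Long, Double> p : prefs.values()) {
            candidates.addAll(p.keySet());
        }
        candidates.removeAll(uPrefs.keySet());

        Map<Long, Double> estimates = new HashMap<>();
        for (Long i : candidates) {
            double weightedSum = 0.0;
            double totalWeight = 0.0;
            for (Map.Entry<Long, Map<Long, Double>> e : prefs.entrySet()) {
                long v = e.getKey();
                Double vPref = e.getValue().get(i);
                if (v == u || vPref == null) {
                    continue; // only other users with a preference for i
                }
                double s = similarity.applyAsDouble(u, v);
                if (s <= 0.0) {
                    continue; // assumption: ignore non-positive similarities
                }
                weightedSum += s * vPref;
                totalWeight += s;
            }
            if (totalWeight > 0.0) {
                estimates.put(i, weightedSum / totalWeight); // weighted average
            }
        }

        // Rank candidates by estimated preference, highest first.
        List<Long> ranked = new ArrayList<>(estimates.keySet());
        ranked.sort((a, b) -> Double.compare(estimates.get(b), estimates.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }
}
```

Passing the similarity as a function keeps the loop independent of the metric, which mirrors how the metric is a separate component from the recommender.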
Recommender Components
Data model, implemented via DataModel
User-user similarity metric, implemented via UserSimilarity
User neighborhood definition, implemented via UserNeighborhood
Recommender engine, implemented via a Recommender (here, GenericUserBasedRecommender)
GenericUserBasedRecommender
User Neighborhoods
Fixed-size neighborhoods
Threshold-based neighborhood
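The difference between the two neighborhood definitions can be sketched in plain Java, assuming the similarities between the target user and every other user have already been computed. The class NeighborhoodSketch and its methods are hypothetical names for illustration; Mahout provides these ideas as NearestNUserNeighborhood and ThresholdUserNeighborhood.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch (not Mahout code): two ways to pick a user's neighbors
// from a map of otherUser -> similarity to the target user.
final class NeighborhoodSketch {
    // Fixed-size: the n most similar users, regardless of how similar they are.
    static List<Long> nearestN(Map<Long, Double> simToTarget, int n) {
        List<Long> users = new ArrayList<>(simToTarget.keySet());
        users.sort((a, b) -> Double.compare(simToTarget.get(b), simToTarget.get(a)));
        return new ArrayList<>(users.subList(0, Math.min(n, users.size())));
    }

    // Threshold-based: every user at least this similar, however many there are.
    static Set<Long> atLeast(Map<Long, Double> simToTarget, double threshold) {
        Set<Long> neighbors = new TreeSet<>();
        for (Map.Entry<Long, Double> e : simToTarget.entrySet()) {
            if (e.getValue() >= threshold) {
                neighbors.add(e.getKey());
            }
        }
        return neighbors;
    }
}
```

A fixed size always returns neighbors but may include poorly matched users; a threshold guarantees quality but may return none.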
Similarity metrics
Pearson correlation-based similarity
It is a number between -1 and 1 that measures the tendency of two series of numbers, paired up one-to-one, to move together
Problems:
It doesn't take into account the number of items on which two users' preferences overlap, which is probably a weakness in the context of recommender engines
If two users overlap on only one item, no correlation can be computed because of how the computation is defined
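A plain-Java sketch of the Pearson correlation, assuming the two arrays hold the preference values of two users for the items both have rated, in the same item order. The class PearsonSketch is a hypothetical name; returning NaN for the undefined cases is a choice made here for illustration. It also shows why a single overlapping item gives no result: with one pair, both variances are zero.

```java
// Hypothetical sketch (not Mahout code) of the Pearson correlation between
// two paired series of preference values.
final class PearsonSketch {
    // Returns a value in [-1, 1], or NaN when the correlation is undefined
    // (fewer than two pairs, or one series has no variance).
    static double correlation(double[] x, double[] y) {
        int n = x.length;
        if (n < 2) {
            return Double.NaN;
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sumXY = 0.0;
        double sumXX = 0.0;
        double sumYY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumXX += dx * dx;
            sumYY += dy * dy;
        }
        double denominator = Math.sqrt(sumXX * sumYY);
        return denominator == 0.0 ? Double.NaN : sumXY / denominator;
    }
}
```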
Euclidean distance similarity: 1 / (1 + d), where d is the Euclidean distance between the two users' preference values
Cosine measure similarity: a value between -1 and 1
Tanimoto coefficient similarity: the ratio of the size of the intersection to the size of the union of the two users' preferred items
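These three measures can be sketched in plain Java. The class OtherSimilarities and its method names are assumptions for illustration and the code follows the formulas as stated on the slide, which may differ in detail from Mahout's own implementations. The array-based methods expect the two users' values for their common items in matching order; the Tanimoto method ignores preference values and looks only at which items each user has.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch (not Mahout code) of three similarity measures.
final class OtherSimilarities {
    // 1 / (1 + d), where d is the Euclidean distance; result is in (0, 1].
    static double euclidean(double[] x, double[] y) {
        double sumSquares = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sumSquares += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sumSquares));
    }

    // Cosine of the angle between the two vectors; NaN if either has zero length.
    static double cosine(double[] x, double[] y) {
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        double denominator = Math.sqrt(normX) * Math.sqrt(normY);
        return denominator == 0.0 ? Double.NaN : dot / denominator;
    }

    // |intersection| / |union| of the two users' item sets.
    static double tanimoto(Set<Long> itemsA, Set<Long> itemsB) {
        Set<Long> intersection = new HashSet<>(itemsA);
        intersection.retainAll(itemsB);
        Set<Long> union = new HashSet<>(itemsA);
        union.addAll(itemsB);
        if (union.isEmpty()) {
            return 0.0; // assumption: define similarity as 0 when both sets are empty
        }
        return (double) intersection.size() / union.size();
    }
}
```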
Item-based recommendation
The algorithm
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
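A plain-Java sketch of the item-based loop, for comparison with the user-based one. It is an illustration, not Mahout's GenericItemBasedRecommender: the class ItemBasedSketch is hypothetical, the item-item similarity is supplied by the caller, and non-positive similarities are ignored by assumption. Note that only the target user's own preferences are consulted inside the loop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch (not Mahout code) of item-based recommendation.
final class ItemBasedSketch {
    static List<Long> recommend(Map<Long, Double> uPrefs,
                                Set<Long> allItems,
                                ToDoubleBiFunction<Long, Long> itemSimilarity,
                                int howMany) {
        Map<Long, Double> estimates = new HashMap<>();
        for (Long i : allItems) {
            if (uPrefs.containsKey(i)) {
                continue; // u already has a preference for i
            }
            double weightedSum = 0.0;
            double totalWeight = 0.0;
            for (Map.Entry<Long, Double> e : uPrefs.entrySet()) {
                double s = itemSimilarity.applyAsDouble(i, e.getKey());
                if (s <= 0.0) {
                    continue; // assumption: ignore non-positive similarities
                }
                weightedSum += s * e.getValue();
                totalWeight += s;
            }
            if (totalWeight > 0.0) {
                estimates.put(i, weightedSum / totalWeight); // weighted average
            }
        }

        // Rank candidates by estimated preference, highest first.
        List<Long> ranked = new ArrayList<>(estimates.keySet());
        ranked.sort((a, b) -> Double.compare(estimates.get(b), estimates.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }
}
```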
GenericItemBasedRecommender
Slope-one recommender
The algorithm
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages
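The estimation step of slope-one can be sketched in plain Java. The class SlopeOneSketch and its methods are hypothetical names; this unweighted version recomputes the average differences on every call for clarity, whereas a practical implementation would precompute them. The final ranking step is omitted: the sketch estimates one user's preference for one item.

```java
import java.util.Map;

// Hypothetical sketch (not Mahout code) of unweighted slope-one estimation.
final class SlopeOneSketch {
    // Average of (preference for i - preference for j) over all users who
    // rated both items; NaN if no user rated both.
    static double averageDiff(long i, long j, Map<Long, Map<Long, Double>> prefs) {
        double sum = 0.0;
        int count = 0;
        for (Map<Long, Double> p : prefs.values()) {
            Double prefI = p.get(i);
            Double prefJ = p.get(j);
            if (prefI != null && prefJ != null) {
                sum += prefI - prefJ;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Estimated preference of user u for item i; NaN if it cannot be estimated.
    static double estimate(long u, long i, Map<Long, Map<Long, Double>> prefs) {
        double sum = 0.0;
        int count = 0;
        for (Map.Entry<Long, Double> e : prefs.get(u).entrySet()) {
            double diff = averageDiff(i, e.getKey(), prefs);
            if (Double.isNaN(diff)) {
                continue; // no user rated both i and j
            }
            sum += e.getValue() + diff; // u's preference for j, shifted by the diff
            count++;
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}
```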
Taking Recommender to Production
User-based recommenders
Thank You
Contact at:
Email: Yasmine.Gaber@espace.com.eg
Twitter: Twitter.com/yasmine_mohamed