Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012

Sep 03, 2014


Introduction about the MapReduce distributed version of SlopeOne in Mahout
Page 1: Slope one recommender on hadoop

Slope One Recommender on Hadoop

YONG ZHENG

Center for Web Intelligence, DePaul University

Nov 15, 2012

Page 2: Slope one recommender on hadoop

Overview

• Introduction

• Recommender Systems & Slope One Recommender

• Distributed Slope One on Mahout and Hadoop

• Experimental Setup and Analyses

• Drive Mahout on Hadoop

• Interesting Communities

Center for Web Intelligence, DePaul University, USA

Page 3: Slope one recommender on hadoop

Introduction

• About Me: a recommendation guy

• My Research: data mining and recommender systems

• Typical Experimental Research

1) Design or improve an algorithm;
2) Run the algorithm and the baseline algorithms on datasets;
3) Compare experimental results;
4) Try different parameters, find reasons, and even re-design and improve the algorithm itself;
5) Run the algorithm and the baseline algorithms on datasets;
6) Compare experimental results;
7) Try different parameters, find reasons, and even re-design and improve the algorithm itself;
8) And so on… until it approaches the expected results.

Page 4: Slope one recommender on hadoop

Introduction

• Sometimes the data is large-scale.

e.g. one run of an algorithm may take days to complete. What if the experimental results are not as expected? Then we improve the algorithm and run it for days again, and again.

How did we handle this previously? (for tasks that are not too complicated)
1). Parallelize by hand, with complicated synchronization and limited resources such as CPU, memory, etc;
2). Take advantage of PC labs: let’s do it with 10 PCs

• Nearly all research will ultimately face large-scale problems, especially in the domain of data mining.

• But we have Map-Reduce NOW!

Page 5: Slope one recommender on hadoop

Introduction

• No need to distribute data and tasks manually. Instead, we simply write configurations.

• No need to care about lower-level details, e.g. how data is distributed, which machine a specific task will run on and when, or how tasks are carried out one by one.

• Instead, we pre-define the workflow and take advantage of the functional contributions of mappers and reducers.

• More benefits: replication, load balancing, robustness, etc.

Page 6: Slope one recommender on hadoop

Recommender Systems

• Collaborative Filtering

• Slope One and Simple Weighted Slope One

• Slope One in Mahout

• Distributed Slope One in Mahout

• Mappers and Reducers


Page 7: Slope one recommender on hadoop

Recommender Systems

Page 8: Slope one recommender on hadoop

Collaborative Filtering (CF)

One of the most popular families of recommendation algorithms.

User-based: User-CF

Item-based: Item-CF, Slope One

[Figure] Example: User-based Collaborative Filtering

Page 9: Slope one recommender on hadoop

Slope One Recommender

Reference: Daniel Lemire, Anna Maclachlan, Slope One Predictors for Online Rating-Based Collaborative Filtering, in SIAM Data Mining (SDM'05), April 21-23, 2005. http://lemire.me/fr/abstracts/SDM2005.html

User   Batman   Spiderman
U1     3        4
U2     2        4
U3     2        ?

1). How differently were the two movies rated?
U1 rated Spiderman higher by (4-3) = 1
U2 rated Spiderman higher by (4-2) = 2
On average, Spiderman is rated (1+2)/2 = 1.5 higher

2). The rating difference can be used for prediction
If we know U3 gave Batman 2 stars, he will probably rate Spiderman (2+1.5) = 3.5 stars
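The two steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on this slide, not Mahout code; the names `average_diff` and `predict` are made up for the example.

```python
# Ratings from the slide; U3 has not rated Spiderman yet.
ratings = {
    "U1": {"Batman": 3, "Spiderman": 4},
    "U2": {"Batman": 2, "Spiderman": 4},
    "U3": {"Batman": 2},
}


def average_diff(ratings, target, reference):
    """Mean of (target - reference) over users who rated both items."""
    diffs = [
        user_ratings[target] - user_ratings[reference]
        for user_ratings in ratings.values()
        if target in user_ratings and reference in user_ratings
    ]
    return sum(diffs) / len(diffs)


def predict(ratings, user, target, reference):
    """Basic Slope One using a single reference item."""
    return ratings[user][reference] + average_diff(ratings, target, reference)


diff = average_diff(ratings, "Spiderman", "Batman")
prediction = predict(ratings, "U3", "Spiderman", "Batman")
print(diff, prediction)  # 1.5 3.5
```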

Page 10: Slope one recommender on hadoop

Simple Weighted Slope One

Usually a user has rated multiple items

User   HarryPotter   Batman   Spiderman
U1     5             3        4
U2     ?             2        4
U3     4             2        ?

1). How differently were the two movies rated?
Diff(Batman, Spiderman) = [(4-3)+(4-2)]/2 = 1.5
Diff(HarryPotter, Spiderman) = (4-5)/1 = -1
The denominators “2” and “1” are what we call the “count”.

2). The weighted rating differences can be used for prediction
We use a simple weighted approach
Referring to Batman only, rating = 2+1.5 = 3.5
Referring to HarryPotter only, rating = 4-1 = 3
Considering them all, predicted rating = (3.5*2 + 3*1) / (2+1) = 3.33
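As a rough sketch of the weighted scheme above (again plain Python for illustration, with hypothetical function names rather than Mahout's API), each reference item contributes its own prediction, weighted by its count:

```python
# Ratings from the slide; missing entries are simply absent.
ratings = {
    "U1": {"HarryPotter": 5, "Batman": 3, "Spiderman": 4},
    "U2": {"Batman": 2, "Spiderman": 4},
    "U3": {"HarryPotter": 4, "Batman": 2},
}


def diff_and_count(ratings, target, reference):
    """Average of (target - reference) and the number of co-rating users."""
    diffs = [
        r[target] - r[reference]
        for r in ratings.values()
        if target in r and reference in r
    ]
    if not diffs:
        return None, 0
    return sum(diffs) / len(diffs), len(diffs)


def weighted_slope_one(ratings, user, target):
    """Weight each single-item prediction by its co-rating count."""
    numerator, denominator = 0.0, 0
    for reference, rating in ratings[user].items():
        if reference == target:
            continue
        avg, count = diff_and_count(ratings, target, reference)
        if count == 0:
            continue  # nobody co-rated this pair, so it carries no weight
        numerator += (rating + avg) * count
        denominator += count
    return numerator / denominator if denominator else None


prediction = weighted_slope_one(ratings, "U3", "Spiderman")
print(round(prediction, 2))  # 3.33
```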

Page 11: Slope one recommender on hadoop

Simple Weighted Slope One

User   HarryPotter   Batman   Spiderman
u1     5             3        4
u2     ?             2        4
u3     4             2        ?

To calculate the predicted ratings, we need 2 matrices:

1). Difference Matrix

         Movie1   Movie2   Movie3   Movie4
Movie1
Movie2   -1.5
Movie3   2        1
Movie4   -1       0.5      -2

2). Count Matrix
Just the number of users who co-rated the two items

Question: Online or Offline?

Page 12: Slope one recommender on hadoop

Slope One in Mahout

Mahout, an open-source machine learning library.

1). Recommendation algorithms
User-based CF, Item-based CF, Slope One, etc

2). Clustering
KMeans, Fuzzy KMeans, etc

3). Classification
Decision Trees, Naive Bayes, SVM, etc

4). Latent Factor Models
LDA, SVD, Matrix Factorization, etc

Page 13: Slope one recommender on hadoop

Slope One in Mahout

org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender

Pre-Processing Stage: (class MemoryDiffStorage with Map)

for every item i
  for every other item j
    for every user u expressing preference for both i and j
      add the difference in u’s preference for i and j to an average

Recommendation Stage:

for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u’s preference value for j
    add this to a running average
return the top items, ranked by these averages

Simple weighting: as introduced previously
StdDev weighting: item-item rating diffs with a lower standard deviation should be weighted more highly

Page 14: Slope one recommender on hadoop

Distributed Slope One in Mahout

Similar to our previous practice, e.g. the matrix factorization process, what we need is the Difference Matrix.

Suppose M users have rated N items; the difference matrix then requires N(N-1)/2 cells. Density is another aspect, i.e. how many items each user has rated. If there are many items and the rating matrix is dense, the computational cost increases accordingly.
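To give a feel for the size, here is the cell count worked out for N = 3,900, the number of movies this deck quotes for MovieLens-1M in the experiment setup. This is an upper bound on the number of item pairs; pairs that no user co-rated need not be stored.

```latex
\[
\frac{N(N-1)}{2}\,\bigg|_{N=3900} \;=\; \frac{3900 \times 3899}{2} \;=\; 7{,}603{,}050
\]
```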

Question again: Online or Offline?

Depends on tasks & data.

Large-scale data. Let’s do it offline!

Page 15: Slope one recommender on hadoop

Distributed Slope One in Mahout

package org.apache.mahout.cf.taste.hadoop.slopeone;
class SlopeOneAverageDiffsJob
class SlopeOnePrefsToDiffsReducer
class SlopeOneDiffsToAveragesReducer

package org.apache.mahout.cf.taste.hadoop;
class ToItemPrefsMapper

org.apache.hadoop.mapreduce.Mapper

Two Mapper-Reducer Stages:
1). Create DiffMatrix for each user
2). Collect AvgDiff info, counts, StdDev

Let’s see how it works…

Page 16: Slope one recommender on hadoop

Mapper and Reducer - 1

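Mapper1 (ToItemPrefsMapper): <UserID, Pair<ItemID, Rating>>
Reducer1 (PrefsToDiffsReducer): <Pair<Item1,Item2>, Diff> (for all three users)

User   HarryPotter   Batman   Spiderman
U1     5             3        4
U2     ?             2        4
U3     4             2        ?

Each cell below is (rating of the row item) - (rating of the column item); NULL means the user did not rate both items.

<U1>     Potter   Bat    Spider
Potter
Bat      -2
Spider   -1       1

<U2>     Potter   Bat    Spider
Potter
Bat      NULL
Spider   NULL     2

<U3>     Potter   Bat    Spider
Potter
Bat      -2
Spider   NULL     NULL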
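The first stage can be simulated locally in plain Python to see what flows between mapper and reducer. This is a sketch of the idea only, not Mahout's implementation: the function names, the in-memory `shuffle` helper and the fixed `ITEM_ORDER` are assumptions made for the illustration. It includes U3's ratings, so the (Potter, Bat) pair is emitted twice.

```python
from collections import defaultdict
from itertools import combinations

# Assumed fixed item order so every user emits pairs in the same orientation.
ITEM_ORDER = ["Potter", "Bat", "Spider"]

# Input records in the "UserID, ItemID, Rating" format.
records = [
    ("U1", "Potter", 5), ("U1", "Bat", 3), ("U1", "Spider", 4),
    ("U2", "Bat", 2), ("U2", "Spider", 4),
    ("U3", "Potter", 4), ("U3", "Bat", 2),
]


def map_to_user_prefs(records):
    """Mapper 1: emit <user, (item, rating)> for every input record."""
    for user, item, rating in records:
        yield user, (item, rating)


def shuffle(pairs):
    """Group values by key, standing in for Hadoop's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_prefs_to_diffs(prefs):
    """Reducer 1: emit <(item_a, item_b), rating_b - rating_a> per item pair."""
    ordered = sorted(prefs, key=lambda pref: ITEM_ORDER.index(pref[0]))
    for (item_a, rating_a), (item_b, rating_b) in combinations(ordered, 2):
        yield (item_a, item_b), rating_b - rating_a


stage1_output = []
for user, prefs in shuffle(map_to_user_prefs(records)).items():
    stage1_output.extend(reduce_prefs_to_diffs(prefs))

for pair, diff in stage1_output:
    print(pair, diff)
```

Users who rated only one of the two items never produce that pair, which corresponds to the NULL cells in the per-user tables.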

Page 17: Slope one recommender on hadoop

Mapper and Reducer - 2

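Mapper2 (org.apache.hadoop.mapreduce.Mapper, i.e. the default identity mapper)
Reducer2 (DiffsToAveragesReducer): Average Diffs, Count, StdDev

<U1>     Potter   Bat    Spider
Potter
Bat      -2
Spider   -1       1

<U2>     Potter   Bat    Spider
Potter
Bat      NULL
Spider   NULL     2

<U3>     Potter   Bat    Spider
Potter
Bat      -2
Spider   NULL     NULL

<Aggregate>   Potter   Bat      Spider
Potter
Bat           -2, 2
Spider        -1, 1    1.5, 2

Simply, each <a,b> pair denotes a = average diff, b = count. The (Bat, Potter) cell has count 2 because both U1 and U3 rated both movies.
Notice: in practice we should use three matrices (average diff, count and StdDev); here I show only 2.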
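A local sketch of the second stage, under the same caveats: this is illustrative Python rather than Mahout code, and the use of the population standard deviation is an assumption, since the deck does not say which variant the reducer computes. The input is the list of per-user diffs for all three users, so the (Potter, Bat) pair appears twice (once from U1, once from U3).

```python
import math
from collections import defaultdict

# Stage-1 output: <(item_a, item_b), rating_b - rating_a> for U1, U2 and U3.
stage1_output = [
    (("Potter", "Bat"), -2),
    (("Potter", "Spider"), -1),
    (("Bat", "Spider"), 1),
    (("Bat", "Spider"), 2),
    (("Potter", "Bat"), -2),
]


def reduce_diffs_to_averages(diffs):
    """Reducer 2: collapse all diffs of one item pair to (mean, count, stddev)."""
    count = len(diffs)
    mean = sum(diffs) / count
    # Population standard deviation (assumed for this illustration).
    variance = sum((d - mean) ** 2 for d in diffs) / count
    return mean, count, math.sqrt(variance)


# Mapper 2 is the identity, so the shuffle only groups diffs by item pair.
grouped = defaultdict(list)
for pair, diff in stage1_output:
    grouped[pair].append(diff)

aggregate = {
    pair: reduce_diffs_to_averages(diffs) for pair, diffs in grouped.items()
}

for pair, stats in aggregate.items():
    print(pair, stats)
```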

Page 18: Slope one recommender on hadoop

Predictions

<Aggregate>   Potter   Bat      Spider
Potter
Bat           -2, 2
Spider        -1, 1    1.5, 2

Simply, each <a,b> pair denotes a = average diff, b = count. The (Bat, Potter) cell has count 2 because both U1 and U3 rated both movies.
Notice: in practice we should use three matrices (average diff, count and StdDev); here I show only 2.

User   HarryPotter   Batman   Spiderman
U1     5             3        4
U2     ?             2        4
U3     4             2        ?

Prediction(U3, Spiderman) = [(4-1)*1 + (2+1.5)*2] / (1+2) = 3.33
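The prediction step can be sketched directly on top of an aggregate table. In this illustrative Python (hypothetical names, not Mahout's API) only one orientation of each item pair is stored, as in the lower-triangular table, and the sign is flipped when the pair is needed the other way round. The counts used are those obtained when all three users' ratings are aggregated.

```python
# (item_a, item_b) -> (average of rating_b - rating_a, count)
aggregate = {
    ("Potter", "Bat"): (-2.0, 2),
    ("Potter", "Spider"): (-1.0, 1),
    ("Bat", "Spider"): (1.5, 2),
}


def lookup(aggregate, reference, target):
    """Return (average of target - reference, count), or None if unknown."""
    if (reference, target) in aggregate:
        return aggregate[(reference, target)]
    if (target, reference) in aggregate:
        avg, count = aggregate[(target, reference)]
        return -avg, count  # stored the other way round, so flip the sign
    return None


def predict(aggregate, user_ratings, target):
    """Simple weighted Slope One prediction from the aggregated diffs."""
    numerator, denominator = 0.0, 0
    for reference, rating in user_ratings.items():
        entry = lookup(aggregate, reference, target)
        if entry is None:
            continue
        avg, count = entry
        numerator += (rating + avg) * count
        denominator += count
    return numerator / denominator if denominator else None


u3 = {"Potter": 4, "Bat": 2}
prediction = predict(aggregate, u3, "Spider")
print(round(prediction, 2))  # 3.33
```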

Page 19: Slope one recommender on hadoop

Experiments

• Data

• Hadoop Setup

• Running Performances


Page 20: Slope one recommender on hadoop

Experiment Setup

Data: MovieLens-1M ratings

# of users: 6,040
# of movies: 3,900
# of ratings: 1,000,209

Density of the ratings: each user has at least 20 ratings; obviously, some users have many more

Rating format: UserID, ItemID, Rating (scale 1-5)

Data Split: 80% training, 20% testing
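The deck does not say how the 80/20 split was produced. One common way, shown here purely as an assumption, is a seeded random shuffle over the rating records; `split_ratings` and its parameters are hypothetical names.

```python
import random


def split_ratings(records, train_fraction=0.8, seed=42):
    """Shuffle rating records reproducibly and cut them into train/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy records in the "UserID, ItemID, Rating" format.
records = [(user, item, 3) for user in range(1, 6) for item in range(1, 3)]
train, test = split_ratings(records)
print(len(train), len(test))  # 8 2
```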

Page 21: Slope one recommender on hadoop

Experiment Setup

Hadoop Cluster Setup

IBM SmartCloud
1 master node, 7 slave nodes
Each node runs SUSE Linux Enterprise Server v11 SP1
Server Configuration: 64 bit (vCPU: 2, RAM: 4 GiB, Disk: 60 GiB)
Hadoop v.0.20.205.0
Mahout distribution-0.6

The environment setup follows the typical workflow described at: http://irecsys.blogspot.com/2012/11/configurate-map-reduce-environment-on.html

Thanks Scott Young, neat writeup!!

Page 22: Slope one recommender on hadoop

Experimental Analyses

Stage-1: SlopeOneAverageDiffsJob by Map-Reduce

Goal: Build DiffStorage
Output: DiffStorage txt file, 1.45GB
Running Time: real 13m 34.228s, user 0m 5.136s, sys 0m 1.028s

Example DiffStorage record:

Item1   Item2   Diff    Count   StdDev
221     223     -1.02   197     0.5

Stage-2: Java evaluator to measure MAE on testing set

Running Time:
Load Testing Set (21K records), 299ms
Load Training Set (79K records), 1,771ms
Load DiffStorage, 176,352ms = 2.9m
Prediction (21K records), 18,182ms = 0.3m

MAE = 0.71330756
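For reference, MAE (mean absolute error) is the average absolute gap between predicted and actual ratings on the testing set. A minimal Python sketch follows; the evaluator used in the experiment was written in Java and is not shown in the deck, so this only illustrates the metric, and the sample numbers are invented.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - true rating| over all test records."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up predictions and true ratings, only to exercise the function.
mae = mean_absolute_error([3.5, 4.0, 2.0], [4, 4, 3])
print(mae)  # 0.5
```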

Page 23: Slope one recommender on hadoop

Experimental Experiences

1. Why not MovieLens 10M data?
Map-Reduce on 10M data may take several hours;
Running time depends on the cluster and its configuration;
Also, the DiffStorage file will be too large.

2. Java Evaluator
Loading the full DiffStorage file is time-consuming.
It also incurs Java heap space and "GC overhead limit exceeded" errors;
Those errors cannot be fixed by -Xmx or other options;
Two solutions:
1). Just use simple weighting and discard StdDev weighting.
2). Write a simple Mapper and Reducer and run the evaluation on the cluster.

For MovieLens 1M, this is not that efficient compared with the live SlopeOne recommender; 10M data may be a better fit, and I will try MovieLens-10M later. Slope One is simple but memory-expensive.

Page 24: Slope one recommender on hadoop

More …

• Drive Mahout on Hadoop

• Interesting Communities


Page 25: Slope one recommender on hadoop

Mahout + Hadoop

How to put more Mahout algorithms on Hadoop?

1. Pre-set Commands in Mahout
Run bin/mahout --help; it provides a list of available programs such as svd, fkmeans, etc.

Some are basic functions, such as splitDataset
Some can be executed as Hadoop tasks

e.g. Run and evaluate Matrix Factorization on a rating dataset

bin/mahout parallelALS --input inputSource --output outputSource --tempDir tmpFolder --numFeatures 20 --numIterations 10

bin/mahout evaluateFactorization --input inputSource --output outputSource --userFeatures als/out/U/ --itemFeatures als/out/M/ --tempDir tmpFolder

Page 26: Slope one recommender on hadoop

Mahout + Hadoop

2. More Algorithms on Hadoop

Mahout provides a way to run more of its algorithms on Hadoop. Simply,

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<version>.jar <Job Class> --recommenderClassName Class <OPTIONS>

Which kinds of jobs does it support? Mahout implements several.

Some popular ones:
1). org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob --recommenderClassName ClassName
2). org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
3). org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
4). org.apache.mahout.cf.taste.hadoop.slopeone.SlopeOneAverageDiffsJob

Page 27: Slope one recommender on hadoop

Interesting Communities

Beyond Hadoop and Mahout official sites

1. Data Mining
KDnuggets, http://www.kdnuggets.com
Popular community for Data Mining & Analytics. Lots of useful information, such as news, materials, datasets, jobs, etc.

2. Big Data
SmartData Collective, http://smartdatacollective.com/
Smarter Computing, http://www.smartercomputingblog.com/
Big Data Meetup, http://big-data.meetup.com/

3. Recommender Systems
ACM Official Site, http://recsys.acm.org/
RecSys Wiki, http://recsyswiki.com/

Page 28: Slope one recommender on hadoop

Thank You!
