Touch the Mahout!

Makoto Uehara

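Self-introduction

[Name]

- Makoto Uehara (@pioho07)

[Career]

- Until February 2012: worked on infrastructure at a systems integrator (SIer)
- March 2012: joined CyberAgent
- Built the Ameba smartphone platform
- In charge of the infrastructure and middleware for the integrated log analysis platform and the online databases
- Hadoop, HBase, Flume

Facebook friend requests welcome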

What's Mahout?

A scalable machine learning and data mining library

CyberZ's Hadoop Cluster

- Cloudera Manager v4.7.3 (we use a management tool)
- CDH 4.4.0
- Components: HDFS, MR, Hive, Hue
- Master Server x 3
- Slave Server x 5
- Hadoop Cluster totals: HDD 100 TB, CPU 120 cores, memory 480 GB

Mahout’s Algorithm

[hdfs@svr001 ~]$ mahout

An example program must be given as the first argument.

Valid program names are:

arff.vector: : Generate Vectors from an ARFF file or directory

baumwelch: : Baum-Welch algorithm for unsupervised HMM training

canopy: : Canopy clustering

cat: : Print a file or resource as the logistic regression models would see it

cleansvd: : Cleanup and verification of SVD output

clusterdump: : Dump cluster output to text

clusterpp: : Groups Clustering Output In Clusters

cmdump: : Dump confusion matrix in HTML or text formats

cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.

dirichlet: : Dirichlet Clustering

recommendfactorized: : Compute recommendations using the factorization of a rating matrix

recommenditembased: : Compute recommendations using item-based collaborative filtering

fkmeans: : Fuzzy K-means clustering

fpg: : Frequent Pattern Growth

(remaining entries omitted)

The help output lists a lot of algorithms.

Recommendation

In the `mahout` help output shown earlier, two entries are about recommendation:

recommendfactorized: : Compute recommendations using the factorization of a rating matrix

recommenditembased: : Compute recommendations using item-based collaborative filtering

Recommendation is what I want to try.

Recommendation takes information such as "User A rated this item 4 points" and "User B rated this item 3.5 points",

and produces results such as "this item is recommended for User A".
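To make the idea concrete, here is a minimal sketch of the textbook item-based prediction: a weighted average of the user's own ratings, weighted by how similar each rated item is to the candidate item. The function name `predict_score` and the similarity values are made up for illustration, and this is not claimed to be exactly what Mahout computes internally.

```python
def predict_score(user_ratings, similarities):
    """Weighted average of a user's ratings, weighted by item similarity.

    user_ratings: {item_id: score} for items the user already rated.
    similarities: {item_id: similarity between that item and the candidate}.
    Returns None when no rated item has a usable similarity.
    """
    numerator = 0.0
    denominator = 0.0
    for item_id, score in user_ratings.items():
        sim = similarities.get(item_id)
        if sim is None:
            continue
        numerator += sim * score
        denominator += abs(sim)
    if denominator == 0.0:
        return None
    return numerator / denominator


# User A's existing ratings (item ID -> score).
ratings_a = {101: 5.0, 102: 3.0}
# Hypothetical similarities between the candidate item and items 101/102.
sims_to_candidate = {101: 0.8, 102: 0.2}

print(predict_score(ratings_a, sims_to_candidate))  # roughly 4.6
```

The more similar a rated item is to the candidate, the more its score pulls the prediction, which is why the choice of similarity measure matters so much.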

First, create data that records who rated which item, and how highly.

Mahout needs users and items to be passed as numeric IDs,

so the input file ends up being nothing but numbers.
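If the original data uses names rather than numbers, something like the following could produce the numeric file. This is only a sketch: the user and item names, the helper `build_id_map`, and the starting IDs (1 and 101, chosen to resemble the sample data) are all assumptions for illustration.

```python
import csv
import io


def build_id_map(names, start):
    """Assign consecutive integer IDs to names in first-seen order."""
    id_map = {}
    for name in names:
        if name not in id_map:
            id_map[name] = start + len(id_map)
    return id_map


# Hypothetical raw ratings: (user name, item name, score).
raw_ratings = [
    ("alice", "coffee", 5.0),
    ("alice", "tea", 3.0),
    ("bob", "coffee", 2.0),
]

user_ids = build_id_map((u for u, _, _ in raw_ratings), start=1)
item_ids = build_id_map((i for _, i, _ in raw_ratings), start=101)

# Write userID,itemID,score rows, with no header and no blank lines.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for user, item, score in raw_ratings:
    writer.writerow([user_ids[user], item_ids[item], score])

csv_text = buffer.getvalue()
print(csv_text, end="")
```

Keeping `user_ids` and `item_ids` around makes it possible to translate the recommended item IDs back into names afterwards.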

Recommendation

●Input File 1

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,102,4.0
3,105,4.5
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,1.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

UserID: 1-5
ItemID: 101-107
Score: 0.0-5.0

Create the input file in this format, one `UserID,ItemID,Score` row per rating.

[Figure: User2 and User3, each labeled Score=4.5]
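A quick sanity check of the file against the ranges stated above could look like this. It is a sketch only: `validate_rows` and its default ranges are assumptions taken from the slide, not something Mahout provides, and the sample rows are picked to show one passing case and two failing ones.

```python
def validate_rows(lines, user_range=(1, 5), item_range=(101, 107), score_range=(0.0, 5.0)):
    """Return a list of (line_number, problem) for rows outside the stated ranges."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.strip().split(",")
        if len(parts) != 3:
            problems.append((number, "expected 3 comma-separated fields"))
            continue
        try:
            user, item, score = int(parts[0]), int(parts[1]), float(parts[2])
        except ValueError:
            problems.append((number, "non-numeric field"))
            continue
        if not user_range[0] <= user <= user_range[1]:
            problems.append((number, "user ID out of range"))
        if not item_range[0] <= item <= item_range[1]:
            problems.append((number, "item ID out of range"))
        if not score_range[0] <= score <= score_range[1]:
            problems.append((number, "score out of range"))
    return problems


# Two valid rows, one score above 5.0, and one empty line.
sample = ["1,101,5.0", "2,104,2.0", "4,106,6.0", ""]
for number, problem in validate_rows(sample):
    print(number, problem)
```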

Mahout Command Line

mahout \
  recommenditembased \
  --input /mahout/recommend_sample1.csv \
  --output /mahout/recome1 \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION

- mahout: Mahout
- recommenditembased: Algorithm
- --input: Input File
- --output: Output Dir
- --similarityClassname: SIMILARITY
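When the same command is run repeatedly with different similarity measures, assembling it from its parts can help avoid typos. The sketch below only builds and prints the command line and uses just the options shown on this slide; the helper name `build_command` is an assumption, and nothing is executed.

```python
import shlex


def build_command(input_path, output_dir, similarity):
    """Build the argument list for `mahout recommenditembased`."""
    return [
        "mahout",
        "recommenditembased",
        "--input", input_path,
        "--output", output_dir,
        "--similarityClassname", similarity,
    ]


args = build_command(
    "/mahout/recommend_sample1.csv",
    "/mahout/recome1",
    "SIMILARITY_PEARSON_CORRELATION",
)
# Print a copy-pasteable single-line command; nothing is executed here.
print(shlex.join(args))
```

On a machine where `mahout` is installed, the same list could be handed to `subprocess.run(args, check=True)`.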

[hdfs@svr001 ~]$ mahout recommenditembased --input /mahout/recommend_sample1.csv --output /mahout/recome1 --similarityClassname

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

Running on hadoop, using /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf

MAHOUT-JOB: /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/mahout/mahout-examples-0.7-cdh4.4.0-job.jar

13/12/12 11:26:23 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/mahout/recommend_sample4.csv], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}

13/12/12 11:26:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

13/12/12 11:26:25 INFO input.FileInputFormat: Total input paths to process : 1

13/12/12 11:26:25 INFO mapred.JobClient: Running job: job_201312042139_0225

13/12/12 11:26:26 INFO mapred.JobClient: map 0% reduce 0%



Command Run

●Processing took quite a while: 2 to 3 minutes for a 21-line input file.

●mahout version 0.7

●Where I got stuck: a stray line break (an empty line) in the input file caused an error.
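One way to guard against the empty-line problem is to clean the file before uploading it. The sketch below assumes the trouble came from empty or whitespace-only lines; the helper name `drop_blank_lines` is made up for illustration, and whether a single trailing newline matters to Mahout is not stated here.

```python
def drop_blank_lines(text):
    """Remove empty or whitespace-only lines and end with a single newline."""
    kept = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + "\n" if kept else ""


# A file with an empty line, a whitespace-only line, and a trailing empty line.
raw = "1,101,5.0\n\n1,102,3.0\n   \n2,101,2.0\n\n"
cleaned = drop_blank_lines(raw)
print(cleaned, end="")
```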

SIMILARITY in recommenditembased

SIMILARITY_COOCCURRENCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_EUCLIDEAN_DISTANCE

SIMILARITY (correlation)

To be honest, I don't really understand them ><

This time, let's try two of them: PEARSON_CORRELATION and TANIMOTO_COEFFICIENT.

●Similarity

PEARSON_CORRELATION (Pearson correlation: how strongly two sets of ratings rise and fall together)

●Run

mahout recommenditembased --input /mahout/recommend_sample1.csv --output /mahout/recome1 --similarityClassname SIMILARITY_PEARSON_CORRELATION

●Result

[hdfs@svr001 ~]$ hadoop fs -cat /mahout/recome/part-r-0000*

3 [103:4.279442]

1 [105:2.868604]

2 [105:3.1569808]

Run Result
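For intuition about what a Pearson correlation between two items looks like, here is a minimal sketch of the textbook formula, computed over the users who rated both items. The ratings are those of items 101 and 102 in Input File 1. This is an illustration under the assumption that only co-rating users are compared; Mahout's own implementation may differ in its details.

```python
import math

# Input File 1 ratings for two items, keyed by user ID.
item_101 = {1: 5.0, 2: 2.0, 3: 2.5, 4: 5.0, 5: 4.0}
item_102 = {1: 3.0, 2: 2.5, 3: 4.0, 5: 1.0}


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the users who rated both items."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return None
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return None  # undefined when one side has no variation
    return cov / math.sqrt(var_x * var_y)


print(pearson(item_101, item_102))
```

The value is negative for this pair: users who scored item 101 high tended to score item 102 low, so the two items count as dissimilar under this measure.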

●Similarity

TANIMOTO_COEFFICIENT (Tanimoto coefficient)

●Run


/mahout/recome2 --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT

●Result

[hdfs@svr001 ~]$ hadoop fs -cat /mahout/recome2/part-r-0000*

3 [106:3.5357144,104:3.3799999,103:3.3125]

1 [105:3.6363637,106:3.5,104:3.4714286]

4 [105:4.2746477,102:4.2]

2 [106:2.9056604,105:2.6296296]

Run Result
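The textbook Tanimoto coefficient only looks at who rated an item, not at the scores: it is the number of users who rated both items divided by the number who rated either. The sketch below applies that definition to items 101 and 102 from Input File 1; it illustrates the general formula and is not a claim about Mahout's exact implementation.

```python
# Users who rated each item in Input File 1 (scores are ignored).
raters_101 = {1, 2, 3, 4, 5}
raters_102 = {1, 2, 3, 5}


def tanimoto(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


print(tanimoto(raters_101, raters_102))  # 0.8
```

Because only the sets of raters enter this formula, changing score values alone would leave the coefficient unchanged.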

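SIMILARITY (correlation)

With PEARSON_CORRELATION,

Item 103 is recommended to User3 with Score=4.2794.

With TANIMOTO_COEFFICIENT,

Item 106 is recommended to User3 with Score=3.5357.

So the recommended item and its score both change depending on the algorithm.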
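To compare runs without reading the output by eye, the result lines can be parsed into a dictionary. The sketch below assumes each line has the form `userID [itemID:score,...]` as in the two results shown earlier, and that the first entry in each list is the top recommendation; `parse_recommendations` is a hypothetical helper, not part of Mahout.

```python
def parse_recommendations(text):
    """Parse lines like '3 [106:3.5357144,104:3.3799999]' into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        user, items = line.split(None, 1)
        pairs = []
        for entry in items.strip("[]").split(","):
            if not entry:
                continue
            item_id, score = entry.split(":")
            pairs.append((int(item_id), float(score)))
        result[int(user)] = pairs
    return result


pearson_out = """3 [103:4.279442]
1 [105:2.868604]
2 [105:3.1569808]"""

tanimoto_out = """3 [106:3.5357144,104:3.3799999,103:3.3125]
1 [105:3.6363637,106:3.5,104:3.4714286]
4 [105:4.2746477,102:4.2]
2 [106:2.9056604,105:2.6296296]"""

pearson_recs = parse_recommendations(pearson_out)
tanimoto_recs = parse_recommendations(tanimoto_out)

# Compare the first (top) item per user across the two runs.
for user in sorted(set(pearson_recs) & set(tanimoto_recs)):
    print(user, pearson_recs[user][0], tanimoto_recs[user][0])
```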

SIMILARITY (correlation)

Let's tweak the input file

and force Mahout to recommend item 102 to User3,

whatever it takes!

1,101,5.0
1,102,5.0
1,103,5.0
1,108,1.5
2,101,5.0
2,102,4.0
2,103,5.0
2,104,2.0
2,108,5.0
3,101,5.0
3,103,5.0
3,105,4.5
3,107,1.0
3,108,4.5
4,101,1.5
4,103,1.0
4,104,2.5
4,106,6.0
5,101,1.0
5,102,1.5
5,103,2.0
5,104,3.0
5,105,3.5
5,106,5.0

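[Step 1] Arrange the input data so that users 1, 2, and 3 have similar purchasing tendencies.

[Step 2] Users 1, 2, and 3 all bought items 101 and 103 and rated them 5.0.

Users 1 and 2 also bought item 102 and rated it highly.

So ↓↓↓

[Result!]

Item 102 should get recommended to user 3, right!?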
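The reasoning in the two steps can be checked mechanically against the tweaked input. The sketch below hard-codes the rows for users 1 to 3 from that file; the helper `fans_of` is hypothetical, and the check only confirms the setup, not what Mahout will actually output.

```python
# Rows for users 1-3 from the tweaked input file, as {user: {item: score}}.
ratings = {
    1: {101: 5.0, 102: 5.0, 103: 5.0, 108: 1.5},
    2: {101: 5.0, 102: 4.0, 103: 5.0, 104: 2.0, 108: 5.0},
    3: {101: 5.0, 103: 5.0, 105: 4.5, 107: 1.0, 108: 4.5},
}


def fans_of(ratings, items, score):
    """Users who gave every item in `items` exactly `score`."""
    return sorted(
        user for user, rated in ratings.items()
        if all(rated.get(item) == score for item in items)
    )


similar_users = fans_of(ratings, items=(101, 103), score=5.0)
have_102 = [u for u in similar_users if 102 in ratings[u]]
missing_102 = [u for u in similar_users if 102 not in ratings[u]]

print(similar_users, have_102, missing_102)  # [1, 2, 3] [1, 2] [3]
```

User 3 is the only one of the three look-alike users without a rating for item 102, which makes it the natural target for that recommendation.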


/mahout/recome1 --similarityClassname SIMILARITY_PEARSON_CORRELATION


3 [102:4.8554997,106:4.5731964]

1 [106:5.0,105:4.615747]

4 [105:2.6432545,102:2.6432545,108:1.2666667]

2 [105:4.7233295,106:4.133889]

5 [108:3.5]

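Did we manage to force the recommendation? And the result is...?

It worked! ｷﾀ━(ﾟ∀ﾟ)━!

However,

it worked with PEARSON_CORRELATION

but not with TANIMOTO_COEFFICIENT.

Presumably that comes down to the difference between the algorithms.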

Did we manage to force the recommendation? Result

mahout recommenditembased --input /mahout/recommend_sample2.csv --output /mahout/recome22 --similarityClassname



3 [104:4.8636365,106:4.852941,102:4.8076925]

1 [104:4.0,105:3.9807692,107:3.9545455,106:3.857143]

4 [107:4.25,108:4.0,102:3.6410255,105:3.6325302]

2 [107:5.0,105:4.354839,106:3.6893203]

5 [107:2.6111112,108:1.872093]

Here item 102 is not user 3's top recommendation; it sits lower in the list (third, behind items 104 and 106).

Thank you for your attention!
