
Mahout Introduction BarCampDC

Jan 15, 2015

Drew Farris

An introduction to Apache Mahout presented at Apache BarCamp DC, May 19, 2012

A brief introduction to the examples and links to more resources for further exploration.
Transcript
Page 1: Mahout Introduction BarCampDC

Learning with Mahout

Page 2: Mahout Introduction BarCampDC

About me

Drew Farris
▪ Committer to Apache Mahout since 2/2010
▪ ..not as active in the past year
▪ Author: Taming Text
▪ My Company: (and BarCamp DC Sponsor)

Page 3: Mahout Introduction BarCampDC

What is Mahout?

Mahout (as in hoot) or Mahout (as in trout)?

A scalable machine learning library

Page 4: Mahout Introduction BarCampDC

What is Mahout?

A scalable machine learning library
▪ ‘large’ data sets
▪ Often Hadoop
▪ ..but sometimes not

Page 10: Mahout Introduction BarCampDC

What is Mahout?

A scalable machine learning library
▪ Recommendation Mining
▪ Clustering
▪ Classification
▪ Association Mining
A reasonable linear algebra library
A reasonable library of collections
Other Stuff

Page 11: Mahout Introduction BarCampDC

Mahout

Getting Started
Check out & build the code
▪ git clone git://git.apache.org/mahout.git
▪ mvn install -DskipTests=true
▪ The tests take a looong time to run, not needed for initial build
Or use the Cloudera Virtual Machine (http://bit.ly/MyBnFi)

Page 14: Mahout Introduction BarCampDC

Mahout

Getting Started
Check out & build the code
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Presentations
▪ Grant’s IBM developerWorks Article
▪ http://ibm.co/LUbptg (Nov 2011)
▪ Others @ http://bit.ly/IZ6PqE (wiki)

Page 15: Mahout Introduction BarCampDC

Mahout

Getting Started
Check out & build the code
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Publications (http://bit.ly/IZ6PqE)
Mailing Lists
▪ [email protected] (http://bit.ly/L1GSHB)
▪ [email protected] (http://bit.ly/JPeNoE)

Page 16: Mahout Introduction BarCampDC

Mahout

Getting Started
Check out & build the code
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Presentations
Mailing Lists
Books!
▪ Mahout in Action: http://bit.ly/IWMvaz
▪ Taming Text: http://bit.ly/KkODZV

Page 17: Mahout Introduction BarCampDC

Mahout Examples

Kicking the Tires in examples/bin
▪ classify-20newsgroups.sh
▪ cluster-reuters.sh
▪ cluster-syntheticcontrol.sh
▪ asf-email-examples.sh

Page 18: Mahout Introduction BarCampDC

Mahout Examples

Kicking the Tires in examples/bin
classify-20newsgroups.sh
▪ Premise: Classify News Stories
▪ Algorithm: sgd
▪ Data: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Page 19: Mahout Introduction BarCampDC

Mahout Examples

Kicking the Tires in examples/bin
cluster-reuters.sh
▪ Premise: Group Related News Stories
▪ Data: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

Page 20: Mahout Introduction BarCampDC

Mahout Examples

Kicking the Tires in examples/bin
cluster-syntheticcontrol.sh
▪ Premise: Cluster time series data
▪ normal, cyclic, increasing, decreasing, upward, downward shift
▪ Algorithms:
▪ canopy, kmeans, fuzzykmeans, dirichlet, meanshift

See: https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html
Data: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html
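
Canopy is the first algorithm named for this example, and it is commonly used as a cheap first pass before k-means. The sketch below is the textbook in-memory version with a loose threshold t1 and a tight threshold t2, assuming t1 > t2 and Euclidean distance. It is not Mahout's implementation; the function name, thresholds, and sample points are illustrative.

```python
import math

def canopies(points, t1, t2):
    """Group points into possibly overlapping canopies (assumes t1 > t2)."""
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)          # next unclaimed point seeds a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # loosely bound: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays available
        remaining = still_remaining
        result.append((center, members))
    return result

points = [(0, 0), (1, 0), (4, 0), (10, 0), (11, 0)]
for center, members in canopies(points, t1=5, t2=2):
    print(center, members)
```

A point can land in more than one canopy, which is why the centers are usually treated as starting points for a second clustering pass rather than as final clusters.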

Page 21: Mahout Introduction BarCampDC

Mahout Examples

Kicking the Tires in examples/bin
asf-email-examples.sh
▪ Recommendation (user based)
▪ Clustering (kmeans, dirichlet, minhash)
▪ Classification (naïve bayes, sgd)

Page 27: Mahout Introduction BarCampDC

Learning Outline

General Outline:
Data Transformation
▪ From Native format to…
▪ ..Sequence Files; Typed Key, Value pairs
▪ ..Labeled Vectors
Model Training
Model Evaluation
Lather, Rinse, Repeat
Production
Lather, Rinse, Repeat

Page 28: Mahout Introduction BarCampDC

Text to Sparse Vectors

mahout seq2sparse
▪ Tokenize Documents
▪ Count Words
▪ Make Partial/Merge Vectors
▪ TFIDF
▪ Make Partial/Merge TFIDF Vectors
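
To make the tokenize, count, and TFIDF steps concrete, here is a small in-memory sketch. It uses the textbook weighting tf * log(N / df) and whitespace tokenization, both of which are assumptions; Mahout's own analyzer and weighting formula may differ, and the function names are illustrative.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = [
    "mahout scales machine learning",
    "hadoop scales storage",
    "mahout runs on hadoop",
]
for vector in tfidf_vectors(docs):
    print(vector)
```

Only terms that occur in a document get an entry, which is what makes the vectors sparse. A term that occurs in every document gets weight zero.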

Page 29: Mahout Introduction BarCampDC

Tips

View Sequence Files with:
▪ mahout seqdumper -i /path/to/sequence/file

Check out shortcuts in:
▪ src/conf/driver.classes.props

Run classes with:
▪ mahout org.apache.mahout.SomeCoolNewFeature …

Standalone vs. Distributed
▪ Standalone mode is default
▪ Set HADOOP_CONF_DIR to use Hadoop
▪ MAHOUT_LOCAL will force standalone
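
The three standalone vs. distributed rules can be read as a small decision function. This sketch only restates the rules on the slide. The real launcher script may check more than these two variables, and treating any non-empty value as "set" is an assumption.

```python
import os

def execution_mode(env):
    """Pick a run mode from an environment mapping, following the three rules above."""
    if env.get("MAHOUT_LOCAL"):
        return "standalone"    # MAHOUT_LOCAL forces standalone
    if env.get("HADOOP_CONF_DIR"):
        return "distributed"   # a Hadoop configuration directory is set
    return "standalone"        # standalone is the default

print(execution_mode(os.environ))
```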

Page 30: Mahout Introduction BarCampDC

Example: Recommendation

asf-email-examples.sh (recommendation)

Premise: Recommend Interesting Threads
▪ User based recommendation
▪ Boolean preferences based on thread contribution
▪ Implies boolean similarity measure: tanimoto, log-likelihood

See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
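
With boolean preferences each user is just a set of threads, so similarity reduces to set overlap. A minimal sketch of the Tanimoto coefficient, with made-up user and thread names:

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: define similarity as zero
    return len(a & b) / len(union)

alice = {"thread-1", "thread-2", "thread-3"}
bob = {"thread-2", "thread-3", "thread-4"}
print(tanimoto(alice, bob))  # 2 shared threads out of 4 distinct = 0.5
```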

Page 31: Mahout Introduction BarCampDC

Recommendation Process

Recommendation Steps
▪ Convert Mail to Sequence Files
▪ Convert Sequence Files to Preferences
▪ Prepare Preference Matrix
▪ Row Similarity Job
▪ Recommender Job

See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
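
The last two steps, similarity then recommendation, can be sketched in memory for a handful of users. This is a simplified user-based scheme that scores each unseen thread by the summed similarity of the users who contributed to it. It is not how the Hadoop jobs are implemented, and all names and data are illustrative.

```python
from collections import defaultdict

def tanimoto(a, b):
    """Set-overlap similarity for boolean preferences."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(prefs, user, top_n=3):
    """Rank threads the user has not contributed to, best first."""
    seen = prefs[user]
    scores = defaultdict(float)
    for other, items in prefs.items():
        if other == user:
            continue
        sim = tanimoto(seen, items)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] += sim
    # Sort by descending score, breaking ties by thread name.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

prefs = {
    "alice": {"t1", "t2", "t3"},
    "bob": {"t2", "t3", "t4"},
    "carol": {"t1", "t5"},
    "dave": {"t6"},
}
print(recommend(prefs, "alice"))  # ['t4', 't5']
```

Here bob is the closer neighbour (similarity 0.5 against carol's 0.25), so his thread t4 ranks first; dave shares nothing with alice and is ignored.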

Page 32: Mahout Introduction BarCampDC

Example: Classification

asf-email-examples.sh (classification)

Premise: Predict project mailing lists for incoming messages

▪ Data labeled based on the mailing list it arrived on
▪ Hold back a random 20% of data for testing, the rest for training.
▪ Algorithms: Naïve Bayes (Standard, Complementary), SGD

See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
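
Holding back a random 20% is a shuffle followed by a cut. A minimal sketch, assuming the labeled examples fit in memory; the function name, seed, and sample data are illustrative.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold back test_fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# (message id, mailing list label) pairs standing in for real mail
messages = [(f"message-{i}", "list-a" if i % 2 else "list-b") for i in range(10)]
train, test = split_train_test(messages)
print(len(train), len(test))  # 8 2
```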

Page 33: Mahout Introduction BarCampDC

Classification Process

Classification Steps
▪ Convert Mail to Sequence Files
▪ Sequence Files to Sparse Vectors
▪ Modify Sequence File Labels
▪ Split into Training and Test Sets
▪ Train the Model
▪ Test the Model

See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
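
The train and test steps can be illustrated with a tiny multinomial Naïve Bayes using add-one smoothing. This is the plain textbook version, not Mahout's implementation and not the complementary variant; the mailing list labels and messages are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: iterable of (tokens, label). Returns the counts needed to classify."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, term_counts, vocab

def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, term_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            # Add-one smoothing so unseen terms do not zero out the score.
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    ("patch build maven".split(), "dev"),
    ("commit patch review".split(), "dev"),
    ("how install mahout".split(), "user"),
    ("error running example".split(), "user"),
]
test = [("review patch".split(), "dev"), ("install error".split(), "user")]

model = train_naive_bayes(train)
accuracy = sum(classify(model, tokens) == label for tokens, label in test) / len(test)
print(accuracy)
```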

Page 34: Mahout Introduction BarCampDC

Example: Clustering

asf-email-examples.sh (clustering)

Premise: Grouping Messages by Subject
▪ Same Prep as Classification
▪ Different Algorithms: (kmeans, dirichlet, minhash)

12/05/16 05:16:02 INFO driver.MahoutDriver: Program took 20577398 ms (Minutes: 342.95663333333334

See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/

Page 35: Mahout Introduction BarCampDC

Clustering Process

Clustering Steps
▪ Convert Mail to Sequence Files
▪ Sequence Files to Sparse Vectors
▪ Run Clustering (iterate)
▪ Dump Results
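
The "iterate" in the clustering step is easiest to see with k-means: assign each point to its nearest centroid, recompute the centroids, and repeat until nothing moves. A minimal in-memory sketch with dense tuples and hand-picked starting centroids, both of which are assumptions; it is not Mahout's distributed implementation.

```python
def kmeans(points, centroids, max_iterations=10):
    """Lloyd's algorithm on tuples of floats; stops once assignments stop changing."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = list(centroids)  # work on a copy of the starting centroids
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, [points[0], points[3]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```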

Page 36: Mahout Introduction BarCampDC

Where to now?

Insert Bar Camp Style Discussion Here

Page 37: Mahout Introduction BarCampDC

Resources

Mahout in Action
▪ Owen, Anil, Dunning and Friedman
▪ http://bit.ly/IWMvaz

Taming Text
▪ Ingersoll, Morton and Farris
▪ http://bit.ly/KkODZV