Intro to Apache Mahout
Grant Ingersoll
Lucid Imagination
http://www.lucidimagination.com
May 06, 2015
Anyone Here Use Machine Learning?
• Any users of:
  – Google?
    • Search?
    • Priority Inbox?
– Facebook?
– Twitter?
– LinkedIn?
Topics
• Background and Use Cases
• What can you do in Mahout?
• Where’s the community at?
• Resources
• K-Means in Hadoop (time permitting)
Definition
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  – Introduction to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
• Lots of related fields:
  – Information Retrieval
  – Stats
  – Biology
  – Linear algebra
  – Many more
Common Use Cases
• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content
• Find associations/patterns in actions/behaviors
• Identify key topics/summarize text
  – Documents and corpora
• Detect anomalies/fraud
• Rank search results
• Others?
Apache Mahout
• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
• Why Mahout?
  – Many open source ML libraries either:
    • Lack community
    • Lack documentation and examples
    • Lack scalability
    • Lack the Apache License
    • Or are research-oriented
Definition: http://dictionary.reference.com/browse/mahout
What does scalable mean to us?
• Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
  – Some algorithms won’t scale to massive machine clusters
  – Others fit logically on a MapReduce framework like Apache Hadoop
  – Still others will need different distributed programming models
  – Others are already fast (SGD)
• Be pragmatic
Sampling of Who uses Mahout?
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
What Can I do with Mahout Right Now?
3C + FPM + O = Mahout
Collaborative Filtering
• Extensive framework for collaborative filtering (recommenders)
• Recommenders
  – User based
  – Item based
• Online and offline support
  – Offline can utilize Hadoop
• Many different similarity measures
  – Cosine, LLR, Tanimoto, Pearson, others
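Two of the similarity measures named above can be sketched in a few lines of plain Java. This is purely illustrative, not Mahout's own implementation (Mahout ships these as pluggable similarity classes):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative similarity measures for recommenders; not Mahout's code.
public class Similarity {

    // Cosine similarity between two preference vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Tanimoto coefficient over item sets: |A ∩ B| / |A ∪ B|.
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }

    public static void main(String[] args) {
        // Users who rated items [1, 0, 1] and [1, 1, 1]: cosine ≈ 0.816.
        System.out.println(cosine(new double[]{1, 0, 1}, new double[]{1, 1, 1}));
        // Item sets {a,b,c} and {b,c,d}: 2 shared of 4 total -> 0.5.
        System.out.println(tanimoto(Set.of("a", "b", "c"), Set.of("b", "c", "d")));
    }
}
```

Both measures return values in a fixed range (cosine in [-1, 1], Tanimoto in [0, 1]), which makes them easy to compare across user or item pairs.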
Clustering
• Document level
  – Group documents based on a notion of similarity
  – K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
  – All Map/Reduce
  – Distance measures
    • Manhattan, Euclidean, others
• Topic modeling
  – Cluster words across documents to identify topics
  – Latent Dirichlet Allocation (M/R)
Categorization
• Place new items into predefined categories:
  – Sports, politics, entertainment
  – Recommenders
• Implementations
  – Naïve Bayes (M/R)
  – Complementary Naïve Bayes (M/R)
  – Decision Forests (M/R)
  – Logistic Regression (SGD; sequential, but fast!)
• See Chapter 17 of Mahout in Action for the Shop It To Me use case:
  – http://awe.sm/5FyNe
Freq. Pattern Mining
• Identify frequently co-occurring items
• Useful for:
  – Query recommendations
    • Apple -> iPhone, orange, OS X
  – Related product placement
    • Basket analysis
• Map/Reduce
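The core co-occurrence idea can be shown with a toy, in-memory pair counter. Mahout's Parallel FP-Growth does this at scale on Hadoop; the sketch below is illustrative only:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count how often item pairs co-occur across baskets (toy, in-memory version;
// Mahout's Parallel FP-Growth handles this at scale).
public class CoOccurrence {

    static Map<String, Integer> pairCounts(List<List<String>> baskets) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> basket : baskets) {
            for (int i = 0; i < basket.size(); i++) {
                for (int j = i + 1; j < basket.size(); j++) {
                    String a = basket.get(i), b = basket.get(j);
                    // Canonical key so (a,b) and (b,a) count together.
                    String key = a.compareTo(b) < 0 ? a + "," + b : b + "," + a;
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = List.of(
            List.of("iphone", "case"),
            List.of("iphone", "case", "charger"),
            List.of("case", "charger"));
        System.out.println(pairCounts(baskets).get("case,iphone")); // 2
    }
}
```

Pairs with high counts relative to their individual item counts are the "frequent patterns" that drive query suggestions and product placement.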
Other
• Primitive collections!
• Collocations (M/R)
• Math library
  – Vectors, matrices, etc.
• Noise reduction via Singular Value Decomposition (M/R)
Prepare Data from Raw content
• Data sources:
  – Lucene integration
    • bin/mahout lucene.vector …
  – Document vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  – Programmatically
    • See the Utils module in Mahout and the Iterator<Vector> classes
  – Database
  – File system
How to: Command Line
• Most algorithms have a Driver program
  – $MAHOUT_HOME/bin/mahout.sh helps with most tasks
• Prepare the data
  – Different algorithms require different setup
• Run the algorithm
  – Single node
  – Hadoop
• Print out the results or incorporate them into your application
  – Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.
What’s Happening Now?
• Unified Framework for Clustering and Classification
• 0.5 release on the horizon (May?)
• Working towards a 1.0 release by focusing on:
  – Tests, examples, documentation
  – API cleanup and consistency
• Gearing up for Google Summer of Code
  – New M/R work for Hidden Markov Models
Summary
• Machine learning is all over the web today
• Mahout is about scalable machine learning
• Mahout has functionality for many of today’s common machine learning tasks
• Many Mahout implementations use Hadoop
Resources
• http://mahout.apache.org
• http://cwiki.apache.org/MAHOUT
• {user|dev}@mahout.apache.org
• http://svn.apache.org/repos/asf/mahout/trunk
• http://hadoop.apache.org
Resources
• “Mahout in Action” by Owen, Anil, Dunning and Friedman
  – http://awe.sm/5FyNe
• “Introducing Apache Mahout”
  – http://www.ibm.com/developerworks/java/library/j-mahout/
• “Taming Text” by Ingersoll, Morton, Farris
• “Programming Collective Intelligence” by Toby Segaran
• “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
• “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
K-Means
• Clustering algorithm
  – Nicely parallelizable!
http://en.wikipedia.org/wiki/K-means_clustering
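As a reminder of what the algorithm itself does, here is a minimal single-node k-means sketch in plain Java on 1-D points. This is illustrative only; Mahout's implementation is the Map/Reduce version described next:

```java
import java.util.Arrays;

// Minimal sequential k-means on 1-D points; illustrative only.
public class KMeansSketch {

    static double[] kmeans(double[] points, double[] centroids, int iterations) {
        for (int iter = 0; iter < iterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            // Assignment step: each point goes to its nearest centroid.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                }
                sums[best] += p;
                counts[best]++;
            }
            // Update step: each centroid moves to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) centroids[c] = sums[c] / counts[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] result = kmeans(new double[]{1, 2, 10, 11}, new double[]{0, 5}, 10);
        System.out.println(Arrays.toString(result)); // [1.5, 10.5]
    }
}
```

The assignment step is embarrassingly parallel (each point is handled independently), and the update step is just sums and counts, which is exactly why the algorithm maps so cleanly onto Map/Reduce.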
K-Means in Map-Reduce
• Input:
  – Mahout Vectors representing the original content
  – Either:
    • A predefined set of initial centroids (can be from Canopy)
    • --k – the number of clusters to produce
• Iterate
  – Do the centroid calculation (more in a moment)
• Clustering step (optional)
• Output
  – Centroids (as Mahout Vectors)
  – Points for each centroid (if the clustering step was taken)
Map-Reduce Iteration
• Each iteration calculates the Centroids using:
  – KMeansMapper
  – KMeansCombiner
  – KMeansReducer
• Clustering step
  – Calculate the points for each Centroid using:
    • KMeansClusterMapper
KMeansMapper
• During setup:
  – Load the initial Centroids (or the Centroids from the last iteration)
• Map phase
  – For each input
    • Calculate its distance from each Centroid and output the closest one
• Distance measures are pluggable
  – Manhattan, Euclidean, Squared Euclidean, Cosine, others
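The map phase above boils down to "find the nearest centroid and emit its id as the key." A hedged sketch in plain Java (not Mahout's actual KMeansMapper; the distance function is passed in to mirror the pluggable-measure idea):

```java
import java.util.function.BiFunction;

// Sketch of the k-means map phase: for each point, find the closest
// centroid; its index is the key that would be emitted to the reducer.
public class MapPhaseSketch {

    static int closestCentroid(double[] point, double[][] centroids,
                               BiFunction<double[], double[], Double> distance) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance.apply(point, centroids[c]) < distance.apply(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // One pluggable measure: squared Euclidean distance.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        // Point (9, 8) is far closer to centroid 1 at (10, 10).
        System.out.println(closestCentroid(new double[]{9, 8}, centroids,
                MapPhaseSketch::squaredEuclidean)); // 1
    }
}
```

Swapping in Manhattan or cosine distance only changes the function passed in; the mapper logic itself stays the same.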
KMeansReducer
• Setup:
  – Load up clusters
  – Convergence information
  – Partial sums from KMeansCombiner (more in a moment)
• Reduce phase
  – Sum all the vectors in the cluster to produce a new Centroid
  – Check for convergence
• Output cluster
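The reduce phase above can be sketched as follows (plain Java, not Mahout's actual KMeansReducer; the convergence threshold is an illustrative parameter):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the k-means reduce phase: average a cluster's vectors into
// a new centroid, then check whether the centroid has stopped moving.
public class ReducePhaseSketch {

    static double[] newCentroid(List<double[]> vectors) {
        double[] sum = new double[vectors.get(0).length];
        for (double[] v : vectors) {
            for (int i = 0; i < sum.length; i++) sum[i] += v[i];
        }
        for (int i = 0; i < sum.length; i++) sum[i] /= vectors.size();
        return sum;
    }

    // Converged when the centroid moved less than the threshold.
    static boolean converged(double[] oldC, double[] newC, double threshold) {
        double dist = 0;
        for (int i = 0; i < oldC.length; i++) dist += (oldC[i] - newC[i]) * (oldC[i] - newC[i]);
        return Math.sqrt(dist) < threshold;
    }

    public static void main(String[] args) {
        double[] c = newCentroid(List.of(new double[]{1, 2}, new double[]{3, 4}));
        System.out.println(Arrays.toString(c)); // [2.0, 3.0]
        System.out.println(converged(new double[]{2, 3}, c, 0.001)); // true
    }
}
```

Because the reducer only needs sums and counts, a combiner can pre-aggregate the vectors on each mapper node and ship partial sums instead of raw points, which is exactly what KMeansCombiner does.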
KMeansCombiner
• Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper
KMeansClusterMapper
• Some applications only care about what the Centroids are, so this step is optional
• Setup:
  – Load up the clusters and the DistanceMeasure used
• Map phase
  – Calculate which cluster the point belongs to
  – Output <ClusterId, Vector>