Apache Mahout Large Scale Machine Learning Speaker: Isabel Drost
Apache Mahout - isabel-drost.de · Functional languages support map/reduce. 2004 - MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. 2004

Jun 25, 2020

Apache Mahout: Large Scale Machine Learning

Speaker: Isabel Drost

Agenda

● Motivation.

● What is machine learning?

● Introduction to Mahout.

January 3, 2006 by Matt Callow

http://www.flickr.com/photos/blackcustard/81680010

News aggregation

Today: Read newspapers, blogs, Twitter, RSS feeds.

Wish: Aggregate sources and track emerging topics.

September 10, 2008 by Alex Barth

http://www.flickr.com/photos/a-barth/2846621384

March 7, 2008 by extranoise

http://www.flickr.com/photos/extranoise/2317950586/

Go to cinema

Today: IMDb, zitty, movie review pages, Twitter, blogs, ask friends.

Wish: Reviews, sentiment detection, recommendations.

March 22, 2008 by Crystian Cruz

http://www.flickr.com/photos/crystiancruz/2353895708

Machine learning – what's that?

Archimedes taking a Warm Bath

Image by John Leech, from: The Comic History of Rome by Gilbert Abbott A Beckett. Bradbury, Evans & Co, London, 1850s

Archimedes' model of nature

June 25, 2008 by chase-me

http://www.flickr.com/photos/sasy/2609508999

An SVM's model of nature

From data to model.

Gather data

January 8, 2008 by Pink Sherbet Photography

http://www.flickr.com/photos/pinksherbet/2177961471/

Gather data → Extract signals → Algorithm choice → Parameters → Train model → Apply model → Use results

December 31, 2005, birdfarm

http://www.flickr.com/photos/birdfarm/80052248/

If we looked at two words only: E-Bay and Password.

Reality

● There are a few more words in mails.

● Use all relevant features/signals available.

– Words.

– Header fields.

– Characteristics of attachments.

– …

● Usually a pipeline of feature extractors.

● UIMA: Apache project focussed on that task.
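A pipeline of feature extractors as described above could look roughly like the following sketch. It uses only the Python standard library and is not the UIMA API; the extractor names (`word_features`, `header_features`, `attachment_features`) and the sample mail are made up for illustration.

```python
from collections import Counter
from email import message_from_string


def word_features(msg):
    """Count lower-cased words in a plain-text body (hypothetical extractor)."""
    body = msg.get_payload()
    text = body if isinstance(body, str) else ""
    return {f"word={w}": c for w, c in Counter(text.lower().split()).items()}


def header_features(msg):
    """Flag which header fields are present (hypothetical extractor)."""
    return {f"header={name.lower()}": 1 for name in msg.keys()}


def attachment_features(msg):
    """Count MIME parts that declare a filename (hypothetical extractor)."""
    n = sum(1 for part in msg.walk() if part.get_filename())
    return {"attachments": n}


# The pipeline is simply an ordered list of extractors.
EXTRACTORS = [word_features, header_features, attachment_features]


def extract(raw_mail):
    """Run every extractor and merge the results into one feature dict."""
    msg = message_from_string(raw_mail)
    features = {}
    for extractor in EXTRACTORS:
        features.update(extractor(msg))
    return features


# Made-up sample mail, not real data.
mail = "From: a@example.org\nSubject: Your password\n\nPlease confirm your eBay password"
features = extract(mail)
```

Adding a new signal then means appending one more function to `EXTRACTORS`; the rest of the pipeline stays unchanged.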

Gather data → Extract signals → Algorithm choice → Parameters → Train model → Apply model → Use results

Training the model

● No single best algorithm for all tasks.

● No single best parameter setting per algorithm.

● Evaluate constantly.
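One simple way to "evaluate constantly" is to score every candidate parameter setting on held-out examples and keep the best one. The sketch below assumes a toy threshold classifier and made-up scores; the names `accuracy` and `make_threshold_classifier` are hypothetical and not part of Mahout.

```python
def accuracy(predict, examples):
    """Fraction of (x, label) pairs that predict() labels correctly."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)


def make_threshold_classifier(threshold):
    """Toy model: call a mail spam if its score exceeds the threshold."""
    return lambda score: "spam" if score > threshold else "ham"


# Held-out examples (made-up scores, not real data).
held_out = [(0.1, "ham"), (0.4, "ham"), (0.6, "spam"), (0.9, "spam"), (0.3, "ham")]

# Try several parameter settings and compare them on the same held-out set.
results = {t: accuracy(make_threshold_classifier(t), held_out) for t in (0.2, 0.5, 0.8)}
best_threshold = max(results, key=results.get)
```

The same loop works for comparing different algorithms: anything that exposes a `predict`-style function can be passed to `accuracy`.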

Gather data → Extract signals → Algorithm choice → Parameters → Train model → Apply model → Use results

Problem: “Nature changes”

Challenges.

Challenges

● Amount of data grows exponentially.

– User generated content on the web.

– Sensor data.

– Customer logs.

● Index and search the data.

● Build models and generalize from raw data.

● How do non-Googlers deal with that?

Once upon a time

● A Java library.

● Index with easy to use API.

Once upon a time

● A library alone is not enough.

Once upon a time

● Lucene: Umbrella project for search at Apache.

Once upon a time

● Nutch needed a way to scale to the web.

● Functional languages support map/reduce.

● 2004 - MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat.

● 2004 - Initial versions of DFS and Map-Reduce by Doug Cutting & Mike Cafarella

● December 2005 - Nutch ported to framework, 20 nodes.

● January 2006 - Doug Cutting joins Yahoo!

● February 2006 - Apache Hadoop project hived off.

● March 2006 - Formation of the Yahoo! Hadoop team

● April 2007 - Research clusters - 2 clusters of 1000 nodes
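The map/reduce idea at the start of this timeline can be sketched on a single machine as follows. This is only an illustration of the programming model under simplifying assumptions (everything in memory, no distribution); it is not the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Mahout runs on Hadoop", "Hadoop runs map and reduce"])
```

Because each `map_phase` call looks at one document only and each `reduce_phase` call looks at one key only, both phases can be spread over many nodes, which is what makes the model attractive for large clusters.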

Once upon a time

And many more inside and outside Apache.

Where does Mahout fit in?

● Amount of data to process is growing.

● Idea: Scale and go parallel.

What does Mahout have to offer?

Discover groups of items

● Group items by similarity.

● Examples:

– Group news articles by topic.

– Find developers with similar interests.

– Discovery of groups of related search results.

Discover groups of similar items

● Canopy.

● k-Means.

● Fuzzy k-Means.

● Dirichlet based.

● Others upcoming.
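To give an idea of what k-Means from the list above does, here is a minimal single-machine sketch of the standard algorithm (Lloyd's iteration). It is not Mahout's map/reduce implementation; the data points and starting centroids are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move every centroid
    to the mean of its assigned points; repeat a fixed number of times."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The assignment step is independent per point, which is the part that parallelizes naturally; the result still depends on the starting centroids, one reason seeding strategies such as Canopy are listed alongside k-Means.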

Discover groups of similar items

● Example: Synthetic Control

– http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

– Example Job: <MAHOUT_HOME>/examples

– Outputs clusters

● Download the distribution.

● Run the example.

● Have a closer look at the examples.

Assign items to defined categories.

● Given pre-defined categories, assign items to them.

● Examples:

– Spam mail classification.

– Discovery of images depicting humans.

Assign items to defined categories.

● Naïve Bayes.

● Complementary Naïve Bayes.

● Winnow/Perceptron.

● Others upcoming.
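As an illustration of the first classifier in the list, the following is a small single-machine sketch of multinomial Naïve Bayes with add-one smoothing. It is not Mahout's implementation; the training mails are made up and the function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (list_of_words, label). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(words, model):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Log prior of the label.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Smoothed log likelihood of each word given the label.
            count = word_counts[label][w] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training mails, not real data.
training = [
    ("confirm your ebay password".split(), "spam"),
    ("password reset ebay account".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train(training)
```

Training only needs per-label word counts, so it reduces to counting, which is why this family of classifiers fits the map/reduce style well.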

Assign items to defined categories.

● Examples based on “standard” datasets:

● 20 Newsgroups

– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups

● Wikipedia

– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample

Evolutionary algorithms

● Traveling Salesman

– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman

● Classification rule discovery

– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery

Collaborative filtering

● Recommend items to users.

● Examples:

– Find movies I might want to watch.

– Find books related to the book I am buying.
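The "books related to the book I am buying" case can be sketched with simple item co-occurrence counts. This is one basic approach to collaborative filtering, shown under the assumption of made-up purchase data; it is not the Taste API, and `cooccurrence` and `recommend` are hypothetical names.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(baskets):
    """Count how often each ordered pair of items shares a user."""
    counts = {}
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(user, baskets, top_n=2):
    """Score unseen items by their co-occurrence with the user's items."""
    counts = cooccurrence(baskets)
    seen = set(baskets[user])
    scores = Counter()
    for item in seen:
        for other, n in counts.get(item, {}).items():
            if other not in seen:
                scores[other] += n
    return [item for item, _ in scores.most_common(top_n)]


# Made-up purchase histories, not real data.
baskets = {
    "alice": ["book_a", "book_b"],
    "bob": ["book_a", "book_b", "book_c"],
    "carol": ["book_b", "book_c"],
    "dave": ["book_a", "book_d"],
}
suggestions = recommend("alice", baskets)
```

Here `book_c` ranks first for alice because it was bought together with both of her books, while `book_d` co-occurs with only one of them.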

Recommendation mining.

● Mahout with more Taste.

● Mature Java library.

● Java-based, web service / HTTP bindings.

● Batch mode based on EC2 and Hadoop.

What next?

● More algorithms.

● More examples.

What next?

● 2nd Summer of Code.

● Four mentors.

● Three students.

● Two returning students.

Robin Anil: Online Classification and Frequent Pattern Mining using Map-Reduce.

David Hall: Distributed Latent Dirichlet Allocation.

AbdelHakim: Implement parallel Random/ Regression Forest.

Why go for Apache Mahout?

Jumpstart your project with proven code.

January 8, 2008 by dreizehn28

http://www.flickr.com/photos/1328/2176949559

Discuss ideas and problems online.

November 16, 2005 [phil h]

http://www.flickr.com/photos/hi-phi/64055296

Become part of the community.

Release: 0.1

Big thanks to those who made this possible!

October 22, 2008 by e_calamar

http://www.flickr.com/photos/e_calamar/2964991182/

[email protected]

[email protected]

Interest in machine learning.

Interesting problems.

Hadoop proficiency.

Bug reports, patches, features.

Documentation, code, examples.

July 9, 2006 by trackrecord

http://www.flickr.com/photos/trackrecord/185514449

Ahem – I do not own a big cluster...

● Mahout runs on top of Amazon EMR.

● Run Mahout on your Hadoop cluster on EC2.

● Committers do get free credits for EC2 ;)

● Set up your own Hadoop cluster.

June 25th, 2009: Hadoop* Get Together in Berlin

● Torsten Curdt: “Data Legacy - the challenges of an evolving data warehouse.”

● Christoph M. Friedrich: “SCAIView - Lucene for Life Science Knowledge Discovery”

● Uri Boness, Bram Smeets: “Solr in production.”

newthinking store

Tucholskystr. 48

September 29th, 2009: Hadoop* Get Together in Berlin featuring a talk on UIMA by Thilo Götz.

* UIMA, HBase, Lucene, Solr, Katta, Mahout, CouchDB, Pig, Hive, Cassandra, Cascading, JAQL, ... talks welcome as well.
