Apache Mahout
Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.

Jan 01, 2016

Amanda Copeland
Transcript
Page 1

Apache Mahout

Page 2

• Mahout Introduction
• Machine Learning
• Clustering
• K-means
• Canopy Clustering
• Fuzzy K-Means
• Conclusion

Page 3

What is Mahout?

• Distributed machine learning libraries
  – “scalable to reasonably large data sets”
  – Runs on Hadoop

Page 4

What?

• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault-tolerance
• Mahout brings:
  – Library of machine learning algorithms
  – Examples

Page 5

Why Mahout?

• Many Open Source ML libraries either:
  – Lack Community
  – Lack Documentation and Examples
  – Lack Scalability
  – Lack the Apache License ;-)
  – Or are research-oriented

Page 6

Clustering

• Unsupervised
• Find Natural Groupings
  – Documents
  – Search Results
  – People
  – Genetic traits in groups
  – Many, many more uses

Page 7

Types

• Supervised
  – Using labeled training data, create a function that predicts the output of unseen inputs
• Unsupervised
  – Using unlabeled data, create a function that predicts output
• Semi-Supervised
  – Uses labeled and unlabeled data

Page 8

Example: Clustering

Google News

Page 9

K-means Algorithm

1) Pick a number (k) of cluster centers
2) Assign every element to its nearest cluster center
3) Move each cluster center to the mean of its assigned elements
4) Repeat 2-3 until convergence
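The four steps above can be sketched in a few lines of plain Python. This is only an illustrative single-machine sketch, not Mahout's distributed implementation: the function name `kmeans`, the `max_iter` cap, the Euclidean distance, and the choice to let the caller supply the initial centers (step 1) are all assumptions made for the example.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Toy k-means over equal-length tuples; returns (centers, labels)."""
    # 1) The caller picks the k initial cluster centers.
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # 2) Assign every element to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # 4) Convergence: stop when no assignment changed.
        if new_labels == labels:
            break
        labels = new_labels
        # 3) Move each center to the mean of its assigned elements.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Here convergence is detected by unchanged assignments; a threshold on how far the centers move would be an equally reasonable stopping rule.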

Page 10

Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses.

K-means Example

Page 11

K-means Example

Invocation using the command line takes the form:

Page 12

Canopy Clustering

• Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters.

Define two thresholds
  Tight: T1
  Loose: T2
Put all records into a set S
While S is not empty
  Remove any record r from S and create a canopy centered at r
  For each other record ri, compute cheap distance d from r to ri
    If d < T2, place ri in r’s canopy
    If d < T1, remove ri from S
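The pseudocode above translates fairly directly into Python. The sketch below keeps the slide's naming (T1 tight, T2 loose; some other descriptions of the algorithm use the opposite naming) and makes a few assumptions of its own: the function name `canopy`, Euclidean distance standing in for the "cheap distance", and always removing the first remaining record as "any record r".

```python
import math

def canopy(records, t1, t2, distance=math.dist):
    """Return [(center_index, member_indices), ...]; t1 is tight, t2 is loose."""
    if t1 > t2:
        raise ValueError("expected tight threshold t1 <= loose threshold t2")
    in_s = [True] * len(records)          # membership flags for the set S
    canopies = []
    for r, center in enumerate(records):  # same as: while S is not empty
        if not in_s[r]:
            continue
        in_s[r] = False                   # remove r from S, start its canopy
        members = [r]
        for i, other in enumerate(records):
            if i == r:
                continue
            d = distance(center, other)   # stand-in for the cheap distance
            if d < t2:
                members.append(i)         # within the loose threshold
            if d < t1:
                in_s[i] = False           # within the tight threshold
        canopies.append((r, members))
    return canopies

records = [(0, 0), (1, 0), (3, 0), (10, 0)]
result = canopy(records, t1=2, t2=4)
```

Because only records inside the tight threshold leave S, a record can appear in more than one canopy, which is what makes the canopies overlap.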

Page 13

Canopy Clustering

SequenceFile (WritableComparable, VectorWritable)

Invocation using the command line takes the form:

Page 14

Fuzzy K-Means

Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. Where K-Means assigns each point to exactly one cluster, Fuzzy K-Means lets a point belong to several clusters, each with some probability.

Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space and for which a distance measure is defined. The algorithm is similar to K-Means.

Initialize k clusters
Until converged
  Compute the probability of a point belonging to a cluster, for every (point, cluster) pair
  Re-compute the cluster centers using the above probability membership values of points to clusters
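A minimal Python sketch of this loop follows. The slide does not give the membership formula, so the code assumes the conventional Fuzzy C-Means update with a fuzziness exponent `m` (here defaulting to 2) and Euclidean distance; the names `memberships` and `fuzzy_kmeans`, the tolerance `tol`, and the caller-supplied initial centers are likewise assumptions for illustration.

```python
import math

def memberships(point, centers, m=2.0):
    """Membership of one point in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:  # point sits exactly on a center: full membership there
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exp for dk in dists) for dj in dists]

def fuzzy_kmeans(points, initial_centers, m=2.0, tol=1e-6, max_iter=100):
    centers = [tuple(c) for c in initial_centers]   # initialize k clusters
    for _ in range(max_iter):
        # Membership of every (point, cluster) pair.
        u = [memberships(p, centers, m) for p in points]
        # Re-compute each center as a membership-weighted mean of all points.
        new_centers = []
        for j, old in enumerate(centers):
            w = [row[j] ** m for row in u]
            total = sum(w)
            if total == 0.0:      # no point has any weight here: keep center
                new_centers.append(old)
                continue
            new_centers.append(tuple(
                sum(wi * x for wi, x in zip(w, coords)) / total
                for coords in zip(*points)
            ))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:           # converged: centers stopped moving
            break
    return centers, [memberships(p, centers, m) for p in points]

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, u = fuzzy_kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each row of `u` holds one point's memberships across the clusters; taking the largest entry per row recovers a hard, K-Means-style assignment.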

Page 15

Fuzzy K-Means

Invocation using the command line takes the form:

Page 16

Conclusion

• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable
• For performance and efficiency, it is better to
  – Understand the data set
  – Understand data mining
  – Understand the methodology

Page 17

Thank you!