Mahout part2

May 11, 2015



Part two of a presentation about the Mahout system. It is based on

1. Mahout in Action, Part 2
  • Yasmine M. Gaber
  • 4 April 2013

2. Agenda
  • Part 2: Clustering
  • Part 3: Classification

3. Clustering
  • An algorithm
  • A notion of both similarity and dissimilarity
  • A stopping condition

4. Measuring the similarity of items
  • Euclidean distance

5. Creating the input
  • Preprocess the data
  • Use that data to create vectors
  • Save the vectors in SequenceFile format as input for the algorithm

6. Using Mahout clustering
  • The SequenceFile containing the input vectors.
  • The SequenceFile containing the initial cluster centers.
  • The similarity measure to be used.
  • The convergenceThreshold.
  • The number of iterations to be done.
  • The Vector implementation used in the input files.

7. Using Mahout clustering

8. Distance measures
  • Euclidean distance measure
  • Squared Euclidean distance measure
  • Manhattan distance measure

9. Distance measures
  • Cosine distance measure
  • Tanimoto distance measure

10. Playing Around

11. Representing data

12. Representing text documents as vectors
  • Vector Space Model (VSM)
  • TF-IDF
  • N-gram collocations

13. Generating vectors from documents
  $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
  $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow

14. Improving quality of vectors using normalization
  • P-norm
  $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2

15. Clustering Categories
  • Exclusive clustering
  • Overlapping clustering
  • Hierarchical clustering
  • Probabilistic clustering

16. Clustering Approaches
  • Fixed number of centers
  • Bottom-up approach
  • Top-down approach

17. Clustering algorithms
  • K-means clustering
  • Fuzzy k-means clustering
  • Dirichlet clustering

18. k-means clustering algorithm

19. Running k-means clustering

20. Running k-means clustering
  $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl
  $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl
  $ bin/mahout clusterdump -dt sequencefile -d

21. Fuzzy k-means clustering
  • Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set.
  • Also known as the fuzzy c-means algorithm.

22. Running fuzzy k-means clustering

23. Running fuzzy k-means clustering
  $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
  • Fuzziness factor

24. Dirichlet clustering
  • Model-based clustering algorithm

25. Running Dirichlet clustering
  $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector

26. Evaluating and improving clustering quality
  • Inspecting clustering output
  • Evaluating the quality of clustering
  • Improving clustering quality

27. Inspecting clustering output
  $ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10
  Top Terms:
    said => 11.60126582278481
    bank => 5.943037974683544
    dollar =>

28. Analyzing clustering output
  • Distance measure and feature selection
  • Inter-cluster and intra-cluster distances
  • Mixed and overlapping clusters

29. Improving clustering quality
  • Improving document vector generation
  • Writing a custom distance measure

30. Real-world applications of clustering
  • Clustering like-minded people on Twitter
  • Suggesting tags for an artist on using clustering
  • Creating a related-posts feature for a website

31. Classification
  • Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses.
  • Applications of classification, e.g. spam filtering

32. Why use Mahout for classification?

33. How classification works

34. Classification
  • Training versus test versus production
  • Predictor variables versus target variable
  • Records, fields, and values

35. Types of values for predictor variables
  • Continuous
  • Categorical
  • Word-like
  • Text-like

36. Classification Work flow
  • Training the model
  • Evaluating the model
  • Using the model in production

37. Classification Work flow
  • Stage 1: training the classification model
  • Stage 2: evaluating the classification model
  • Stage 3: using the model in production

38. Stage 1: training the classification model
  • Define Categories for the Target Variable
  • Collect Historical Data
  • Define Predictor Variables
  • Select a Learning Algorithm to Train the Model
  • Use Learning Algorithm to Train the Model

39. Extracting features to build a Mahout classifier

40. Preprocessing raw data into classifiable data

41. Converting classifiable data into vectors
  • Use one Vector cell per word, category, or continuous value
  • Represent Vectors implicitly as bags of words
  • Use feature hashing

42. Classifying the 20 newsgroups data set

43. Choosing an algorithm

44. The classifier evaluation API
  • Percent correct
  • Confusion matrix
  • Entropy matrix
  • AUC
  • Log likelihood

45. When classifiers go bad
  • Target leaks
  • Broken feature extraction

46. Tuning the problem
  • Remove Fluff Variables
  • Add New Variables, Interactions, and Derived Values

47. Tuning the classifier
  • Try Alternative Algorithms
  • Tune the Learning Algorithm

48. Thank You
  • Contact at: Email: [email protected]
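
The clustering slides name three ingredients (an algorithm, a notion of similarity, a stopping condition) and five distance measures. The sketch below shows how they could fit together in plain Python. It is an illustration under stated assumptions, not Mahout's implementation: the function names, the tuple-based vectors, and the `threshold` and `max_iterations` parameters (loosely mirroring the convergence threshold and iteration count on the slides) are all hypothetical, and the cosine and Tanimoto functions use the common "1 minus similarity" definitions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle); undefined (division by zero) for all-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def tanimoto_distance(a, b):
    # 1 - dot / (|a|^2 + |b|^2 - dot); undefined for two all-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

def kmeans(points, centers, distance, threshold, max_iterations):
    """Hypothetical in-memory k-means: assign, recompute, test convergence."""
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members;
        # a center with no members stays where it was.
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
        # Stopping condition: no center moved more than the threshold.
        moved = max(distance(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= threshold:
            break
    return centers, clusters
```

For example, `kmeans([(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)], [(0.0, 0.0), (10.0, 10.0)], euclidean, 0.01, 20)` separates the two low points from the two high points. Swapping `euclidean` for `cosine_distance` changes only the notion of similarity, which is the role the `-dm` option plays in the commands on the slides.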
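
The slides on representing text documents mention TF-IDF weighting and p-norm normalization without showing what they compute. A minimal sketch follows, assuming the textbook weight `tf * log(N / df)` and whitespace tokenization; the weighting applied by `seq2sparse` may differ in detail, and the names `tfidf_vectors` and `p_normalize` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

def p_normalize(vector, p=2):
    """Divide every weight by the vector's p-norm (p=2 gives unit length)."""
    norm = sum(abs(w) ** p for w in vector.values()) ** (1.0 / p)
    if norm == 0:
        return dict(vector)
    return {term: w / norm for term, w in vector.items()}
```

A term that appears in every document gets weight zero, which is why very common words contribute little to the distance between documents. Normalizing with `p=2` corresponds in spirit to the `-n 2` option in the command on the normalization slide.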
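
Two of the evaluation measures listed on the classifier evaluation slide, percent correct and the confusion matrix, can be computed directly from paired actual and predicted labels. The sketch below is a hypothetical plain-Python illustration, not the Mahout evaluation API; the function names and the spam/ham labels are assumptions chosen to match the spam-filtering example on the slides.

```python
from collections import Counter

def percent_correct(actual, predicted):
    """Percentage of records whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

def confusion_matrix(actual, predicted):
    """Count each (actual label, predicted label) pair."""
    return Counter(zip(actual, predicted))

# Hypothetical held-out test data for a spam filter.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:5s} predicted={p:5s} count={count}")
print(f"percent correct: {percent_correct(actual, predicted):.1f}")
```

The off-diagonal entries show which kinds of mistakes the model makes, which a single percent-correct figure hides.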
