Mahout part2

May 11, 2015

Yasmine Gaber

Part two of a presentation about the Mahout system, based on http://my.safaribooksonline.com/9781935182689/
Transcript
Page 1: Mahout part2

Mahout in Action, Part 2

Yasmine M. Gaber

4 April 2013

Page 2: Mahout part2

Agenda

Part 2: Clustering

Part 3: Classification

Page 3: Mahout part2

Clustering

An algorithm

A notion of both similarity and dissimilarity

A stopping condition

Page 4: Mahout part2

Measuring the similarity of items

Euclidean Distance
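
The slide names Euclidean distance as the similarity measure. As a rough plain-Java sketch (not Mahout's DistanceMeasure API), the distance between two item vectors a and b is the square root of the sum of squared component differences; smaller distances mean more similar items:

public class EuclideanDistanceExample {
    // d(a, b) = sqrt(sum_i (a_i - b_i)^2)
    static double euclideanDistance(double[] a, double[] b) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        double[] item1 = {1.0, 2.0, 3.0};
        double[] item2 = {4.0, 6.0, 3.0};
        System.out.println(euclideanDistance(item1, item2)); // prints 5.0
    }
}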

Page 5: Mahout part2

Creating the input

Preprocess the data

Use that data to create vectors

Save the vectors in SequenceFile format as input for the algorithm
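
A hedged sketch of that last step, assuming the Hadoop SequenceFile and Mahout VectorWritable APIs used in the Mahout 0.x era; the path and vector values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectorsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("input/vectors.seq");   // illustrative output path
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
        try {
            Vector v = new RandomAccessSparseVector(3);   // a tiny 3-dimensional vector
            v.set(0, 1.0); v.set(1, 2.0); v.set(2, 3.0);
            writer.append(new Text("item-0"), new VectorWritable(v));
        } finally {
            writer.close();
        }
    }
}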

Page 6: Mahout part2

Using Mahout clustering

The SequenceFile containing the input vectors.

The SequenceFile containing the initial cluster centers.

The similarity measure to be used.

The convergenceThreshold.

The number of iterations to be done.

The Vector implementation used in the input files.
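
These inputs map onto the k-means driver's arguments. A minimal sketch, assuming the Mahout 0.5-era KMeansDriver.run signature that Mahout in Action is based on (paths are illustrative, and later Mahout versions changed this signature):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class KMeansDriverExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        KMeansDriver.run(conf,
            new Path("input/vectors"),        // SequenceFile of input vectors
            new Path("input/clusters"),       // SequenceFile of initial cluster centers
            new Path("output"),               // where clusters and clustered points go
            new EuclideanDistanceMeasure(),   // the similarity (distance) measure
            0.001,                            // convergenceThreshold
            10,                               // maximum number of iterations
            true,                             // also assign points to the final clusters
            false);                           // run as MapReduce rather than sequentially
    }
}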

Page 7: Mahout part2

Using Mahout clustering

Page 8: Mahout part2

Distance measures

Euclidean distance measure

Squared Euclidean distance measure

Manhattan distance measure

Page 9: Mahout part2

Distance measures

Cosine distance measure

Tanimoto distance measure
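
As a rough plain-Java sketch of these two measures (not Mahout's DistanceMeasure classes): cosine distance is 1 minus the cosine of the angle between the vectors, and Tanimoto distance is 1 minus the Tanimoto (extended Jaccard) coefficient:

public class AngleBasedDistances {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Cosine distance: 1 - (a . b) / (|a| * |b|)
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Tanimoto distance: 1 - (a . b) / (|a|^2 + |b|^2 - a . b)
    static double tanimotoDistance(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }
}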

Page 10: Mahout part2

Playing Around

Page 11: Mahout part2

Representing data

Page 12: Mahout part2

Representing text documents as vectors

Vector Space Model (VSM)

TF-IDF

N-gram collocations
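
A hedged sketch of the TF-IDF weighting behind the vector space model, using a common variant (Mahout's exact smoothing may differ): the weight of a term in a document grows with its frequency there and shrinks with the number of documents containing it:

public class TfIdfExample {
    // weight(t, d) = tf(t, d) * log(N / df(t))
    static double tfIdf(int termFrequency, int documentFrequency, int numDocuments) {
        return termFrequency * Math.log((double) numDocuments / documentFrequency);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document and present in 20 of 1,000 documents:
        System.out.println(tfIdf(3, 20, 1000)); // about 11.7
    }
}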

Page 13: Mahout part2

Generating vectors from documents

$ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles

$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow

Page 14: Mahout part2

Improving quality of vectors using normalization

P-norm

$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
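
The -n 2 flag above requests 2-norm normalization. A rough plain-Java sketch of p-norm normalization: each vector is divided by its p-norm so that document length stops dominating the distance calculation:

public class PNormExample {
    // Divide each component by (sum_i |x_i|^p)^(1/p); p = 2 gives the Euclidean norm.
    static double[] normalize(double[] v, double p) {
        double sum = 0.0;
        for (double x : v) sum += Math.pow(Math.abs(x), p);
        double norm = Math.pow(sum, 1.0 / p);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }
}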

Page 15: Mahout part2

Clustering Categories

Exclusive clustering

Overlapping clustering

Hierarchical clustering

Probabilistic clustering

Page 16: Mahout part2

Clustering Approaches

Fixed number of centers

Bottom-up approach

Top-down approach

Page 17: Mahout part2

Clustering algorithms

K-means clustering

Fuzzy k-means clustering

Dirichlet clustering

Page 18: Mahout part2

k-means clustering algorithm
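
A plain-Java sketch of one k-means iteration (not Mahout's MapReduce implementation): assign each point to its nearest center, then recompute each center as the mean of its assigned points; the algorithm stops when centers move less than the convergence threshold or the iteration limit is reached. One-dimensional points are used only to keep the sketch short:

public class KMeansStep {
    // One assignment-plus-update pass of k-means over 1-dimensional points.
    static double[] step(double[] points, double[] centers) {
        double[] sum = new double[centers.length];
        int[] count = new int[centers.length];
        for (double p : points) {
            int nearest = 0;
            for (int c = 1; c < centers.length; c++) {
                if (Math.abs(p - centers[c]) < Math.abs(p - centers[nearest])) nearest = c;
            }
            sum[nearest] += p;
            count[nearest]++;
        }
        double[] updated = centers.clone();
        for (int c = 0; c < centers.length; c++) {
            if (count[c] > 0) updated[c] = sum[c] / count[c];   // new center = mean of its points
        }
        return updated;
    }
}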

Page 19: Mahout part2

Running k-means clustering

Page 20: Mahout part2

Running k-means clustering

$ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl

$ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl

$ bin/mahout clusterdump -dt sequencefile -d reuters-vectors/dictionary.file-* -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10

Page 21: Mahout part2

Fuzzy k-means clustering

Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set.

Also known as the fuzzy c-means algorithm.

Page 22: Mahout part2

Running fuzzy k-means clustering

Page 23: Mahout part2

Running fuzzy k-means clustering

$ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

Fuzziness factor
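
A hedged sketch of what the fuzziness factor m (the -m 2 flag above) controls: instead of a hard assignment, each point gets a membership weight in every cluster, and larger m makes the memberships more uniform (fuzzier). A common formulation of the membership weight is shown below; the helper names are illustrative:

public class FuzzyMembership {
    // u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)), where d_ij is the distance
    // from point i to center j and m > 1 is the fuzziness factor.
    static double membership(double[] distancesToCenters, int j, double m) {
        double sum = 0.0;
        for (double dik : distancesToCenters) {
            sum += Math.pow(distancesToCenters[j] / dik, 2.0 / (m - 1.0));
        }
        return 1.0 / sum;   // assumes no distance is exactly zero
    }
}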

Page 24: Mahout part2

Dirichlet clustering

A model-based clustering algorithm.

Page 25: Mahout part2

Running Dirichlet clustering

$ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector

Page 26: Mahout part2

Evaluating and improving clustering quality

Inspecting clustering output

Evaluating the quality of clustering

Improving clustering quality

Page 27: Mahout part2

Inspecting clustering output

$ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10

Top Terms:

said => 11.60126582278481

bank => 5.943037974683544

dollar => 4.89873417721519

market => 4.405063291139241

us => 4.2594936708860756

banks => 3.3164556962025316

pct => 3.069620253164557

he => 2.740506329113924

rates => 2.7151898734177213

rate => 2.7025316455696204

Page 28: Mahout part2

Analyzing clustering output

Distance measure and feature selection

Inter-cluster and intra-cluster distances

Mixed and overlapping clusters

Page 29: Mahout part2

Improving clustering quality

Improving document vector generation

Writing a custom distance measure

Page 30: Mahout part2

Real-world applications of clustering

Clustering like-minded people on Twitter

Suggesting tags for an artist on Last.fm using clustering

Creating a related-posts feature for a website

Page 31: Mahout part2

Classification

Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses.

Applications of classification, e.g. spam filtering

Page 32: Mahout part2

Why use Mahout for classification?

Page 33: Mahout part2

How classification works

Page 34: Mahout part2

Classification

Training versus test versus production

Predictor variables versus target variable

Records, fields, and values

Page 35: Mahout part2

Types of values for predictor variables

Continuous

Categorical

Word-like

Text-like

Page 36: Mahout part2

Classification Work flow

Training the model

Evaluating the model

Using the model in production

Page 37: Mahout part2

Stage 1: training the classification model

Stage 2: evaluating the classification model

Stage 3: using the model in production

Page 38: Mahout part2

Stage 1: training the classification model

Define Categories for the Target Variable

Collect Historical Data

Define Predictor Variables

Select a Learning Algorithm to Train the Model

Use Learning Algorithm to Train the Model
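
A hedged sketch of the last two steps, assuming the SGD-based OnlineLogisticRegression API described in Mahout in Action; the category count, feature count, prior, and encoded vector are all illustrative choices:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TrainingSketch {
    public static void main(String[] args) {
        // 2 target categories, 100 features, L1 prior (illustrative settings).
        OnlineLogisticRegression model = new OnlineLogisticRegression(2, 100, new L1());

        Vector v = new RandomAccessSparseVector(100);   // one encoded training example
        v.set(7, 1.0);                                  // toy feature value
        int actualCategory = 1;                         // its known target category

        model.train(actualCategory, v);                 // repeat over all training records
        System.out.println(model.classifyScalar(v));    // estimated probability of category 1
    }
}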

Page 39: Mahout part2

Extracting features to build a Mahout classifier

Page 40: Mahout part2

Preprocessing raw data into classifiable data

Page 41: Mahout part2

Converting classifiable data into vectors

Use one Vector cell per word, category, or continuous value

Represent Vectors implicitly as bags of words

Use feature hashing
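
A rough plain-Java sketch of the feature-hashing idea (Mahout's vector encoders do this more carefully, for example with multiple hash probes): each feature is hashed into a cell of a fixed-size vector, so no dictionary has to be maintained:

public class FeatureHashingSketch {
    // Hash each word into one of numFeatures cells and accumulate its weight there.
    static double[] encode(String[] words, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String word : words) {
            int index = Math.floorMod(word.hashCode(), numFeatures);
            vector[index] += 1.0;   // hash collisions simply add up
        }
        return vector;
    }
}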

Page 42: Mahout part2

Classifying the 20 newsgroups data set

Page 43: Mahout part2

Choosing an algorithm

Page 44: Mahout part2

The classifier evaluation API

Percent correct

Confusion matrix

Entropy matrix

AUC

Log likelihood
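
A plain-Java sketch of the first two of these metrics, computed from held-out test labels; the entropy matrix, AUC, and log likelihood come from Mahout's evaluation classes and are not sketched here:

public class EvaluationSketch {
    // Rows are actual categories, columns are predicted categories.
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numCategories) {
        int[][] matrix = new int[numCategories][numCategories];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    static double percentCorrect(int[] actual, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) correct++;
        }
        return 100.0 * correct / actual.length;
    }
}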

Page 45: Mahout part2

When classifiers go bad

Target leaks

Broken feature extraction

Page 46: Mahout part2

Tuning the problem

Remove Fluff Variables

Add New Variables, Interactions, and Derived Values

Page 47: Mahout part2

Tuning the classifier

Try Alternative Algorithms

Tune the Learning Algorithm

Page 48: Mahout part2

Thank You

Contact at:

Email: [email protected]

Twitter: Twitter.com/yasmine_mohamed