Hands-on Classification

Oscon data-2011-ted-dunning

May 31, 2015

Ted Dunning

These are the slides for my half of the OSCON Mahout tutorial.
Transcript
Page 1

Hands-on Classification

Page 2

Preliminaries

• Code is available from github:
  – git@github.com:tdunning/Chapter-16.git

• EC2 instances available
• Thumb drives also available
• Email to [email protected]
• Twitter @ted_dunning

Page 3

A Quick Review

• What is classification?
  – goes-ins: predictors
  – goes-outs: target variable

• What is classifiable data?
  – continuous, categorical, word-like, text-like
  – uniform schema

• How do we convert from classifiable data to feature vector?

Page 4

Data Flow

Not quite so simple

Page 5

Classifiable Data

• Continuous
  – A number that represents a quantity, not an id
  – Blood pressure, stock price, latitude, mass

• Categorical
  – One of a known, small set (color, shape)

• Word-like
  – One of a possibly unknown, possibly large set

• Text-like
  – Many word-like things, usually unordered

Page 6

But that isn’t quite there

• Learning algorithms need feature vectors
  – Have to convert from data to vector

• Can assign one location per feature
  – or category
  – or word

• Can assign one or more locations with hashing
  – scary
  – but safe on average (see the sketch below)
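
The deck's encoding diagrams are not preserved in this transcript. Below is a minimal plain-Java sketch of the hashing trick the bullets describe; it is not Mahout's FeatureVectorEncoder API, just the same idea: hash each feature into a fixed-size vector, using a couple of probe locations so a single collision only costs part of a feature's weight. All sizes and feature names are made up for illustration.

```java
import java.util.Arrays;

/**
 * Minimal sketch of hashed feature encoding (the "hashing trick").
 * Not Mahout's encoder API; Mahout's encoder classes do essentially
 * this, with multiple probes to soften collisions.
 */
public class HashedEncoder {
  private final double[] vector;
  private final int probes;

  public HashedEncoder(int size, int probes) {
    this.vector = new double[size];
    this.probes = probes;
  }

  /** Hash a named feature into a few slots and spread its weight. */
  public void add(String feature, double weight) {
    for (int probe = 0; probe < probes; probe++) {
      int hash = (feature + ":" + probe).hashCode();
      int slot = Math.floorMod(hash, vector.length);
      vector[slot] += weight / probes;
    }
  }

  public double[] vector() {
    return vector;
  }

  public static void main(String[] args) {
    HashedEncoder enc = new HashedEncoder(20, 2);
    enc.add("color=red", 1.0);   // categorical feature
    enc.add("word:mahout", 1.0); // word-like feature
    enc.add("mass", 3.7);        // continuous feature, weight = value
    System.out.println(Arrays.toString(enc.vector()));
  }
}
```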

Page 7

Data Flow

Page 8
Page 9

Classifiable Data Vectors

Page 10
Page 11
Page 12

Hashed Encoding

Page 13

What about collisions?
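
The collision figure from this slide is not preserved in the transcript. As a rough, hedged illustration of why hashing is "safe on average" (per the earlier slide), the sketch below estimates the chance that a feature shares a slot with some other feature; the vector size and feature counts are assumed numbers, not from the deck.

```java
/**
 * Back-of-envelope collision estimate for hashed encoding (illustration
 * only). With m features hashed uniformly into n slots, a given feature
 * collides with probability 1 - (1 - 1/n)^(m-1).
 */
public class CollisionEstimate {
  public static void main(String[] args) {
    int n = 1 << 20; // assumed vector size
    for (int m : new int[] {1000, 10000, 100000}) {
      double p = 1 - Math.pow(1 - 1.0 / n, m - 1);
      System.out.printf("m=%6d features, n=%d slots: P(collision) = %.4f%n",
          m, n, p);
    }
  }
}
```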

Page 14

Let’s write some code

(cue relaxing background music)

Page 15

Generating new features

• Sometimes the existing features are difficult to use

• Restating the geometry using new reference points may help

• Automatic reference points using k-means can be better than manual references (see the sketch below)
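
A minimal sketch of the idea: encode each point by its distances to the k-means centroids, so the learner sees the data relative to automatically chosen reference points. The centroid values here are made up; in the tutorial they would come from a k-means run.

```java
import java.util.Arrays;

/** Sketch: new features = distances to k-means centroids (assumed values). */
public class KMeansFeatures {

  /** Euclidean distance between two points of equal dimension. */
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Re-express a point as its distance to each reference centroid. */
  static double[] distanceFeatures(double[] point, double[][] centroids) {
    double[] features = new double[centroids.length];
    for (int i = 0; i < centroids.length; i++) {
      features[i] = distance(point, centroids[i]);
    }
    return features;
  }

  public static void main(String[] args) {
    double[][] centroids = {{0, 0}, {5, 5}, {0, 5}}; // assumed reference points
    double[] point = {1, 4};
    System.out.println(Arrays.toString(distanceFeatures(point, centroids)));
  }
}
```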

Page 16

K-means using target

Page 17

K-means features

Page 18

More code!

(cue relaxing background music)

Page 19

Integration Issues

• Feature extraction is ideal for map-reduce
  – Side data adds some complexity

• Clustering works great with map-reduce
  – Cluster centroids to HDFS

• Model training works better sequentially
  – Need centroids in normal files (see the sketch below)

• Model deployment shouldn’t depend on HDFS
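
One way to satisfy the "centroids in normal files" point is to copy the clustering output from HDFS to the local filesystem before sequential training. A minimal sketch using Hadoop's FileSystem API; both paths are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: copy k-means centroids out of HDFS into a normal local file so
 * sequential training (and deployment) need not depend on HDFS.
 */
public class FetchCentroids {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path onHdfs = new Path("/user/tdunning/clusters/part-r-00000"); // assumed
    Path local = new Path("file:///tmp/centroids.seq");             // assumed
    fs.copyToLocalFile(onHdfs, local);
  }
}
```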

Page 20

Parallel Stochastic Gradient Descent

[Diagram: the input is split across parallel workers, each trains a sub-model, and the sub-models are averaged into the final model.]
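
A minimal sketch of the averaging step in the diagram: each worker trains a sub-model on its share of the input, and the sub-models' weight vectors are averaged into one model. The weights below are made up for illustration.

```java
import java.util.Arrays;

/** Sketch: average sub-model weight vectors from parallel SGD workers. */
public class AverageModels {

  /** Element-wise mean of equal-length weight vectors. */
  static double[] average(double[][] subModels) {
    double[] avg = new double[subModels[0].length];
    for (double[] w : subModels) {
      for (int i = 0; i < w.length; i++) {
        avg[i] += w[i] / subModels.length;
      }
    }
    return avg;
  }

  public static void main(String[] args) {
    double[][] subModels = {{0.9, -1.2, 0.3}, {1.1, -0.8, 0.5}}; // made-up weights
    System.out.println(Arrays.toString(average(subModels)));
  }
}
```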

Page 21

Variational Dirichlet Assignment

[Diagram: the input feeds a gather-sufficient-statistics step, which drives an update-model step that produces the model.]

Page 22

Old tricks, new dogs

• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)

• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)

• Output to HDFS (see the sketch below)

[Diagram callouts: the centroids are written to HDFS by map-reduce, copied from HDFS to local disk by the distributed cache, and read from local disk via the distributed cache.]
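
A hedged sketch of the mapper and combiner/reducer the bullets describe, using the plain Hadoop API with points encoded as comma-separated text. This illustrates the pattern, not Mahout's implementation; the hard-coded centroids stand in for the ones the distributed cache would deliver.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * One k-means step as map-reduce: the mapper emits (cluster id, (1, point));
 * the combiner/reducer sums counts and count-weighted points and emits
 * (cluster id, (n, sum/n)). Weighting by count makes the same class safe
 * to use as both combiner and reducer.
 */
public class KMeansStep {

  static double[] parse(String csv) {
    String[] parts = csv.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < v.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }

  static String format(double[] v) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < v.length; i++) {
      sb.append(i == 0 ? "" : ",").append(v[i]);
    }
    return sb.toString();
  }

  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context ctx) {
      // In the tutorial flow these come from local disk via the distributed
      // cache; hard-coded here so the sketch stays self-contained.
      centroids = new double[][] {{0, 0}, {5, 5}};
    }

    @Override
    protected void map(Object key, Text line, Context ctx)
        throws IOException, InterruptedException {
      double[] point = parse(line.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = 0;
        for (int i = 0; i < point.length; i++) {
          double diff = point[i] - centroids[c][i];
          d += diff * diff;
        }
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      // emit cluster id, (1, point)
      ctx.write(new IntWritable(best), new Text("1," + line));
    }
  }

  /** Sums counts and count-weighted points; emits (n, sum/n). */
  public static class MeanReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable cluster, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long n = 0;
      double[] sum = null;
      for (Text value : values) {
        String s = value.toString();
        int comma = s.indexOf(',');
        long count = Long.parseLong(s.substring(0, comma));
        double[] point = parse(s.substring(comma + 1));
        if (sum == null) {
          sum = new double[point.length];
        }
        for (int i = 0; i < point.length; i++) {
          sum[i] += count * point[i];
        }
        n += count;
      }
      for (int i = 0; i < sum.length; i++) {
        sum[i] /= n;
      }
      ctx.write(cluster, new Text(n + "," + format(sum)));
    }
  }
}
```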

Page 23

Old tricks, new dogs

• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)

• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)

• Output to MapR FS (instead of HDFS)

[Diagram callouts: the centroids are written by map-reduce and read directly from NFS, with no distributed-cache copy step.]

Page 24

Modeling architecture

[Diagram: the input and side-data are joined during feature extraction and down-sampling, which runs as map-reduce; its output feeds sequential SGD learning. The side-data hand-off now goes via NFS.]