OSCON Data 2011 – Ted Dunning

Posted on 31-May-2015


DESCRIPTION

These are the slides for my half of the OSCON Mahout tutorial.

Transcript

Hands-on Classification

Preliminaries

• Code is available from github:
– git@github.com:tdunning/Chapter-16.git

• EC2 instances available
• Thumb drives also available
• Email to ted.dunning@gmail.com
• Twitter @ted_dunning

A Quick Review

• What is classification?
– goes-ins: predictors
– goes-outs: target variable

• What is classifiable data?
– continuous, categorical, word-like, text-like
– uniform schema

• How do we convert from classifiable data to feature vector?

Data Flow

Not quite so simple

Classifiable Data

• Continuous
– A number that represents a quantity, not an id
– Blood pressure, stock price, latitude, mass

• Categorical
– One of a known, small set (color, shape)

• Word-like
– One of a possibly unknown, possibly large set

• Text-like
– Many word-like things, usually unordered

But that isn’t quite there

• Learning algorithms need feature vectors
– Have to convert from data to vector

• Can assign one location per feature – or category – or word

• Can assign one or more locations with hashing
– scary
– but safe on average

Data Flow

[Diagram: classifiable data → vectors]

Hashed Encoding

What about collisions?
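The hashed encoding above can be sketched as follows. This is a minimal illustration under assumed names (the `HashedEncoder` class and `encode` method are hypothetical, not Mahout's actual FeatureVectorEncoder API): each feature is hashed to a small number of probe locations in a fixed-size vector, so an unbounded vocabulary fits in bounded memory, and any one collision only perturbs a fraction of a feature's weight — which is why hashing is "scary but safe on average".

```java
import java.util.Arrays;

// Hypothetical sketch of hashed feature encoding, not Mahout's API.
public class HashedEncoder {
    private final int size;    // length of the feature vector
    private final int probes;  // locations per feature; >1 softens collisions

    public HashedEncoder(int size, int probes) {
        this.size = size;
        this.probes = probes;
    }

    // Add one word-like feature to the vector at `probes` hashed locations.
    public void encode(String feature, double weight, double[] vector) {
        for (int p = 0; p < probes; p++) {
            // Salt the hash with the probe number to get distinct locations.
            int h = (feature + "#" + p).hashCode();
            int index = Math.floorMod(h, size);
            vector[index] += weight / probes;
        }
    }

    public static void main(String[] args) {
        HashedEncoder enc = new HashedEncoder(20, 2);
        double[] v = new double[20];
        enc.encode("color=red", 1.0, v);
        enc.encode("shape=round", 1.0, v);
        System.out.println(Arrays.toString(v));
    }
}
```

With two probes, a collision on one location still leaves the other half of the feature's weight intact, so the damage averages out across many features.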

Let’s write some code

(cue relaxing background music)

Generating new features

• Sometimes the existing features are difficult to use

• Restating the geometry using new reference points may help

• Automatic reference points using k-means can be better than manual references

K-means using target

K-means features
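The idea of restating the geometry with k-means reference points can be sketched roughly like this (hypothetical names, not the tutorial's actual code): each point becomes a vector of its distances to the learned cluster centroids, which serve as automatic reference points.

```java
// Hypothetical sketch: derive new features from k-means centroids.
public class KMeansFeatures {
    // Restate a point as its Euclidean distance to each centroid.
    public static double[] features(double[] point, double[][] centroids) {
        double[] out = new double[centroids.length];
        for (int i = 0; i < centroids.length; i++) {
            double d2 = 0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centroids[i][j];
                d2 += diff * diff;
            }
            out[i] = Math.sqrt(d2);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 0}};
        double[] p = {1, 0};
        System.out.println(java.util.Arrays.toString(features(p, centroids)));
        // prints [1.0, 9.0]
    }
}
```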

More code!

(cue relaxing background music)

Integration Issues

• Feature extraction is ideal for map-reduce
– Side data adds some complexity

• Clustering works great with map-reduce
– Cluster centroids to HDFS

• Model training works better sequentially
– Need centroids in normal files

• Model deployment shouldn’t depend on HDFS

Parallel Stochastic Gradient Descent

[Diagram: each split of the input trains its own sub-model; the sub-models are averaged, and the average is used to update the model]
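The averaging step in the parallel SGD picture can be sketched as below. This is a hypothetical illustration (class and method names are assumptions, not Mahout's trainer): each input split trains a sub-model, and their weight vectors are averaged component-wise into one model.

```java
// Hypothetical sketch of the model-averaging step in parallel SGD.
public class ModelAverager {
    // Average the weight vectors of several sub-models into one model.
    public static double[] average(double[][] subModels) {
        double[] avg = new double[subModels[0].length];
        for (double[] w : subModels) {
            for (int j = 0; j < avg.length; j++) {
                avg[j] += w[j] / subModels.length;
            }
        }
        return avg;
    }

    public static void main(String[] args) {
        // Two sub-models trained on different input splits.
        double[][] subs = {{0.5, 1.0}, {1.5, 3.0}};
        System.out.println(java.util.Arrays.toString(average(subs)));
        // prints [1.0, 2.0]
    }
}
```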

Variational Dirichlet Assignment

[Diagram: the input is used to gather sufficient statistics, which update the model]

Old tricks, new dogs

• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)

• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)

• Output to HDFS

[Diagram: centroids are written by map-reduce to HDFS, read from HDFS to local disk by the distributed cache, and then read from local disk by the sequential trainer]
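The combiner/reducer arithmetic above — merging partial (count, mean) pairs for one cluster into a single weighted mean — can be sketched like this (hypothetical class, not the actual Chapter-16 code):

```java
// Hypothetical sketch of the combiner's weighted-mean merge for one cluster.
public class CentroidCombiner {
    // Merge two partial results (n1, mean1) and (n2, mean2) into one mean,
    // so the reducer can emit cluster id, (n, sum/n).
    public static double[] merge(int n1, double[] mean1, int n2, double[] mean2) {
        double[] mean = new double[mean1.length];
        for (int j = 0; j < mean.length; j++) {
            mean[j] = (n1 * mean1[j] + n2 * mean2[j]) / (n1 + n2);
        }
        return mean;
    }

    public static void main(String[] args) {
        // Two partial results for the same cluster id.
        double[] m = merge(2, new double[]{1.0, 3.0}, 2, new double[]{3.0, 5.0});
        System.out.println(java.util.Arrays.toString(m)); // prints [2.0, 4.0]
    }
}
```

Emitting (n, sum/n) instead of raw sums keeps the combiner associative: partial merges can happen in any order and still yield the same final centroid.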

Old tricks, new dogs

• Mapper
– Assign point to cluster
– Emit cluster id, 1, point

• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, n, sum/n

• Output to HDFS → MapR FS

[Diagram: centroids are written by map-reduce to MapR FS and read back directly over NFS]

Modeling architecture

[Diagram: input and side-data pass through feature extraction and down-sampling and a data join (map-reduce), feeding sequential SGD learning; the side-data now reaches the sequential step via NFS]
