Hands-on Classification
May 31, 2015
Preliminaries
• Code is available from github:
– [email protected]:tdunning/Chapter-16.git
• EC2 instances available
• Thumb drives also available
• Email to [email protected]
• Twitter @ted_dunning
A Quick Review
• What is classification?
– goes-ins: predictors
– goes-outs: target variable
• What is classifiable data?
– continuous, categorical, word-like, text-like
– uniform schema
• How do we convert from classifiable data to feature vector?
Data Flow
Not quite so simple
Classifiable Data
• Continuous
– A number that represents a quantity, not an id
– Blood pressure, stock price, latitude, mass
• Categorical
– One of a known, small set (color, shape)
• Word-like
– One of a possibly unknown, possibly large set
• Text-like
– Many word-like things, usually unordered
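To make the four kinds concrete, here is one hypothetical record (all field names and values are invented for illustration):

```python
# A hypothetical record showing the four kinds of classifiable data.
# All field names and values are invented for illustration.
record = {
    "blood_pressure": 120.0,           # continuous: a quantity, not an id
    "color": "red",                    # categorical: one of a known, small set
    "user_id": "u_48213",              # word-like: from a possibly unknown, large set
    "comment": "fast shipping, great", # text-like: many word-like things, unordered
}
```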
But that isn’t quite there
• Learning algorithms need feature vectors
– Have to convert from data to vector
• Can assign one location per feature
– or category
– or word
• Can assign one or more locations with hashing
– scary
– but safe on average
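A minimal sketch of hashed encoding (illustrative only, not the workshop code; the dimension, probe count, and use of MD5 are arbitrary choices): continuous features add their value at locations hashed from the feature name, while categorical and word-like features add a count at locations hashed from name=value. More than one probe spreads each feature across several slots.

```python
import hashlib

def slot(key, probe, dim):
    """Map a string key plus a probe number to one of dim vector slots."""
    h = hashlib.md5(f"{key}#{probe}".encode()).hexdigest()
    return int(h, 16) % dim

def hashed_encode(features, dim=1024, probes=2):
    """Hash a dict of features into a fixed-width feature vector."""
    vec = [0.0] * dim
    for name, value in features.items():
        for p in range(probes):
            if isinstance(value, (int, float)):
                # continuous: hash the feature name, add the value itself
                vec[slot(name, p, dim)] += value / probes
            else:
                # categorical / word-like: hash name=value, add a count
                vec[slot(f"{name}={value}", p, dim)] += 1.0 / probes
    return vec
```

A collision just adds two features' contributions into the same slot; nothing breaks, which is why the scheme is scary but safe on average.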
Data Flow
Classifiable Data → Vectors
Hashed Encoding
What about collisions?
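Collisions are rare as long as the vector is wide relative to the number of active features. A quick empirical check (illustrative only; slot counts and trial counts are arbitrary):

```python
import random

def avg_collisions(n_features, dim, trials=200, seed=0):
    """Empirically estimate how many of n_features land in an already-occupied slot."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        slots = [rng.randrange(dim) for _ in range(n_features)]
        total += n_features - len(set(slots))  # features that collided
    return total / trials
```

With 100 features hashed into 10,000 slots, roughly n(n−1)/2d ≈ 0.5 collisions are expected per vector; squeeze the same 100 features into 100 slots and collisions dominate.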
Let’s write some code
(cue relaxing background music)
Generating new features
• Sometimes the existing features are difficult to use
• Restating the geometry using new reference points may help
• Automatic reference points using k-means can be better than manual references
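A sketch of the idea (plain Lloyd's algorithm; the data, k, and iteration count are illustrative, not the workshop code): run k-means over the training points, then describe each point by its distances to the learned centroids.

```python
import math, random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(xs) / len(pts) for xs in zip(*pts)]

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: returns k centroids to use as reference points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # keep the old centroid if a cluster goes empty
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def centroid_features(p, centroids):
    """Restate a point's geometry as its distances to the k reference points."""
    return [dist(p, c) for c in centroids]
```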
K-means using target
K-means features
More code!
(cue relaxing background music)
Integration Issues
• Feature extraction is ideal for map-reduce
– Side data adds some complexity
• Clustering works great with map-reduce
– Cluster centroids to HDFS
• Model training works better sequentially
– Need centroids in normal files
• Model deployment shouldn’t depend on HDFS
Parallel Stochastic Gradient Descent
[Diagram: input is split across workers that each train a sub-model; the sub-models are averaged to update the shared model]
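The parallel SGD scheme can be sketched as follows (an illustrative least-squares toy in Python, not Mahout's implementation): each worker runs SGD over its own shard to produce a sub-model, and the sub-models' weights are averaged to update the shared model.

```python
def sgd_shard(shard, w, lr=0.1, epochs=50):
    """Train a linear least-squares model on one data shard with plain SGD."""
    w = list(w)
    for _ in range(epochs):
        for x, y in shard:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    """The 'average models' step: element-wise mean of the sub-model weights."""
    return [sum(ws) / len(models) for ws in zip(*models)]

# Two shards of the hypothetical relation y = 2x, trained independently.
shards = [[([1.0], 2.0)], [([2.0], 4.0)]]
sub_models = [sgd_shard(s, [0.0]) for s in shards]
model = average_models(sub_models)  # close to [2.0]
```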
Variational Dirichlet Assignment
[Diagram: input is scanned to gather sufficient statistics, which update the model]
Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Diagram notes: centroids written by map-reduce to HDFS, read to local disk by the distributed cache, then read from local disk]
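The mapper and combiner logic can be sketched like this (illustrative Python, not the Hadoop job itself; the combiner emits (n, mean) pairs and re-weights them on input so the same function can be applied again at the reducer):

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mapper(points, centroids):
    """Assign each point to its nearest centroid; emit (cluster id, (1, point))."""
    for p in points:
        cid = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield cid, (1, p)

def combine(pairs):
    """Sum counts and weighted point sums per cluster; emit (cluster id, (n, sum/n))."""
    acc = {}
    for cid, (n, mean_p) in pairs:
        wsum = [n * x for x in mean_p]  # recover the weighted sum from the mean
        if cid in acc:
            cnt, s = acc[cid]
            acc[cid] = (cnt + n, [a + b for a, b in zip(s, wsum)])
        else:
            acc[cid] = (n, wsum)
    return {cid: (n, [x / n for x in s]) for cid, (n, s) in acc.items()}
```

Because combine both accepts and emits (n, mean) pairs, it serves as combiner and reducer alike.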
Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to MapR FS (instead of HDFS)
[Diagram notes: centroids written by map-reduce, read directly via NFS]
Modeling architecture
[Diagram: input plus side-data feed feature extraction and down-sampling with a data join (run as map-reduce), followed by sequential SGD learning; side data is now read via NFS]