Machine Learning - Black Art
Charles Parker, Allston Trading
Machine Learning is Hard!
• By now, you know kind of a lot
• Different types of models
• Feature engineering
• Ways to evaluate
• But you’ll still fail!
• Out in the real world, there’s a whole bunch of things that will kill your project
• FYI - A lot of these talks are stolen
2
Join Me!
• On a journey into the Machine Learning House of Horrors!
• Mwa ha ha!
3
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space
• The Perils of The Poorly Picked Loss Function
• The Creeping Creature Called Cross Validation
• The Dread of the Drifting Domain
• The Repugnance of Reliance on Research Results
5
Choosing A Hypothesis Space
• By “hypothesis space” we mean the possible classifiers you could build with an algorithm given the data
• This is the choice you make when you pick a learning algorithm
• You have one job!
• Is there any way to make it easier?
6
Theory to The Rescue!
• Probably Approximately Correct (PAC) learning
• We’d like our model to have error less than ε
• We’d like that to happen with probability at least 1 − δ
• If the error bound is ε, the confidence parameter is δ, the number of training examples is m, and the hypothesis space size is d, a consistent learner needs m ≥ (1/ε)(ln d + ln(1/δ))
7
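The bound above (one standard form, for a finite hypothesis space and a consistent learner) is easy to turn into a calculator. The hypothesis space in the example — boolean conjunctions over 10 variables — is just an illustration, not from the slide:

```python
import math

def pac_sample_bound(epsilon, delta, hyp_space_size):
    """Examples sufficient for a consistent learner over a finite hypothesis
    space to reach error < epsilon with probability at least 1 - delta."""
    return math.ceil((math.log(hyp_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g., boolean conjunctions over 10 variables: |H| = 3**10 hypotheses
m = pac_sample_bound(epsilon=0.05, delta=0.05, hyp_space_size=3 ** 10)
print(m)  # -> 280: a few hundred examples suffice for this tiny space
```

Note how the hypothesis space size only enters through its logarithm — which is why the triple trade-off on the next slide is survivable at all.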
The Triple Trade-Off
• There is a triple trade-off between the error, the size of the hypothesis space, and the amount of training data you have
8
(Diagram: a triangle connecting Error, Hypothesis Space, and Training Data)
What About Huge Data?
• I’m clever, so I’ll use non-parametric methods (decision trees, k-NN, kernelized SVMs)
• As data scales, curious things tend to happen
• Simpler models become more desirable as they’re faster to fit.
• You can increase model complexity by adding features (maybe word counts)
• Big data often trumps modeling!
9
The Machine Learning House of Horrors! Next up: The Perils of The Poorly Picked Loss Function
10
A Dirty Little Secret About ML Algorithms
• They don’t care what you want
• Decision Trees: information gain (entropy reduction) at each split
• SVM: hinge loss, plus a margin penalty
• LR: log loss (negative log-likelihood)
• LDA: between-class vs. within-class scatter
11
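To make "they don't care what you want" concrete, here are two of those per-example losses side by side. Neither is your application's actual cost; the algorithm just minimizes whatever it was built to minimize:

```python
import math

def hinge_loss(margin):   # what an SVM minimizes, per example
    return max(0.0, 1.0 - margin)

def log_loss(margin):     # what logistic regression minimizes, per example
    return math.log(1.0 + math.exp(-margin))

# margin = y * f(x): positive means correct, magnitude means confidence
for m in [-2.0, 0.0, 0.5, 2.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  log={log_loss(m):.3f}")
```

A confidently correct example (margin 2.0) costs exactly zero under hinge loss but still costs something under log loss — the two models literally disagree about which mistakes matter, before you've even said what *you* care about.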
Real-world Losses
• Real losses are nothing like this
• False positive in disease diagnosis
• False positive in face detection
• False positive in thumbprint identification
• Some aren’t even instance-based
• Path dependencies
• Game playing
12
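For the simplest case — asymmetric false-positive/false-negative costs — there is a standard fix: move the decision threshold instead of retraining. The 50x cost ratio below is a made-up illustration, not a number from the talk:

```python
def decision_threshold(cost_fp, cost_fn):
    """Predicting positive minimizes expected cost once the model's
    probability exceeds cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical costs: missing a disease (FN) is 50x worse than a false alarm (FP)
t = decision_threshold(cost_fp=1.0, cost_fn=50.0)
print(t)  # ~0.0196: flag even low-probability cases for follow-up
```

This only covers instance-based losses, though — it doesn't help with the path-dependent and game-playing cases on this slide.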
Specializing Your Loss
• One solution is to let developers apply their own loss
• This is the approach of SVM light:
http://svmlight.joachims.org/
It’s been around for a while
• Losses other than Mutual Information can be plugged into the appropriate place in splitting code
• Models trained via gradient descent can obviously be customized (Python’s Theano is interesting for this)
• For multi-example loss functions, there’s SEARN in Vowpal Wabbit
https://github.com/JohnLangford/vowpal_wabbit
13
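The gradient-descent route is the most general of these. A minimal sketch, in plain Python on made-up data: an asymmetric squared loss where under-predictions cost `alpha` times more than over-predictions (useful when, say, under-forecasting demand is the expensive mistake):

```python
def fit_asymmetric(xs, ys, alpha=5.0, lr=0.01, steps=2000):
    """Gradient descent on an asymmetric squared loss: residuals below zero
    (under-predictions) are weighted `alpha` times more than over-predictions."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            r = (w * x + b) - y                 # prediction minus truth
            weight = 1.0 if r >= 0 else alpha   # under-prediction costs more
            gw += 2.0 * weight * r * x / n
            gb += 2.0 * weight * r / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Made-up data, roughly y = x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.2, 1.9, 3.1, 4.0]
w, b = fit_asymmetric(xs, ys)
```

The resulting fit tracks the data but leans toward over-predicting — exactly the bias the custom loss asked for. Autodiff frameworks do the same thing without hand-written gradients.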
Other Hackery
• Sometimes, the solution is just to hack around the actual prediction
• Have several levels (cascade) of classifiers in e.g., medical diagnosis, text recognition
• Apply logic to explicitly avoid high loss cases (e.g., when buying/selling equities)
• Changing the problem setting
• Will you be doing queries? Use ranking or metric learning
• If you’re thinking “I want to do crazy thing x with classifiers,” chances are it’s already been done and you can read about it
14
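The cascade idea can be sketched in a few lines. The two scorers below are assumed stand-ins for real models (a fast rough one and a slow accurate one); the point is only the control flow — reject obvious negatives cheaply, spend the expensive model on what survives:

```python
def cheap_score(x):       # stand-in for a fast, rough first-stage model
    return x["quick_feature"]

def expensive_score(x):   # stand-in for a slow, accurate second-stage model
    return 0.5 * x["quick_feature"] + 0.5 * x["slow_feature"]

def cascade_predict(x, stage1_threshold=0.1, stage2_threshold=0.5):
    if cheap_score(x) < stage1_threshold:   # cheap early reject, tuned for high recall
        return 0
    return 1 if expensive_score(x) >= stage2_threshold else 0

items = [
    {"quick_feature": 0.02, "slow_feature": 0.9},  # rejected cheaply
    {"quick_feature": 0.6,  "slow_feature": 0.7},  # accepted by stage 2
    {"quick_feature": 0.4,  "slow_feature": 0.2},  # rejected by stage 2
]
preds = [cascade_predict(x) for x in items]
print(preds)  # [0, 1, 0]
```

Stage-one thresholds are set loose on purpose: a false reject there can never be recovered downstream.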
The Machine Learning House of Horrors! Next up: The Creeping Creature Called Cross Validation
15
When Validation Attacks!
• Cross validation
• n-Fold - Hold out one fold for testing, train on n - 1 folds
• Great way to measure performance, right?
• It’s all about information leakage
• via instances
• via features
16
Case Study #1: Law of Averages
• Estimate sporting event outcomes
• Use previous games to estimate points scored for each team (via windowing transform)
• Choose winner based on predicted score
• What if you’re off by one on the window?
17
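Being "off by one on the window" means the feature for game *i* accidentally includes game *i* itself — the label leaks straight into the input, and cross-validation happily rewards it. A sketch with made-up scores:

```python
scores = [20, 27, 14, 31, 24, 17, 28]   # made-up points per game, in order

def trailing_avg(scores, i, window=3, leaky=False):
    """Average points over the `window` games before game i.
    With leaky=True the window is off by one and includes game i itself."""
    end = i + 1 if leaky else i
    past = scores[max(0, end - window):end]
    return sum(past) / len(past)

i = 5
honest = trailing_avg(scores, i)              # games 2-4: no leakage
leaky = trailing_avg(scores, i, leaky=True)   # games 3-5: includes the label!
print(honest, leaky)
```

The leaky version will validate beautifully and fail in production, where game *i*'s score doesn't exist yet when you predict it.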
Case Study #2: Photo Dating
• Take scanned photos from 30 different users (on average 200 per user) and create a model to assign a date taken (plus or minus five years)
• Perform 10-fold cross-validation
• Accuracy is 85%. Can you trust it?
18
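You probably can't trust it: photos from one user share film stock, fading, and scanner artifacts, so random folds let the model "date the user" rather than the photo. The fix is to fold by user, not by photo. A crude sketch of the idea (scikit-learn's `GroupKFold` does this properly; the modulo split below is just the simplest illustration):

```python
# 6 hypothetical users with 3 photos each
photos = [{"user": u, "photo": p} for u in range(6) for p in range(3)]

def user_folds(photos, n_folds=3):
    """Assign every photo from the same user to the same fold."""
    folds = [[] for _ in range(n_folds)]
    for item in photos:
        folds[item["user"] % n_folds].append(item)
    return folds

folds = user_folds(photos)
users_per_fold = [set(x["user"] for x in f) for f in folds]
# No user ever appears on both sides of a train/test split:
assert all(users_per_fold[i].isdisjoint(users_per_fold[j])
           for i in range(3) for j in range(i + 1, 3))
```

Accuracy measured this way is usually much lower — and much closer to what you'll see on a brand-new user.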
Case Study #3: Moments In Time
• You have a buy/sell opportunity every five seconds
• The signals you use to evaluate the opportunity are aggregates of market activity over the last five minutes
• How careful must you be with cross-validation?
19
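Very careful: with 5-minute trailing features and an opportunity every 5 seconds, adjacent instances share almost all their raw data, so a random split leaks relentlessly. One common remedy is a time-ordered split with an embargo gap at least as wide as the feature window — sketched here on synthetic timestamps:

```python
STEP_S = 5        # one trading opportunity every 5 seconds
WINDOW_S = 300    # features aggregate the last 5 minutes of market activity

timestamps = list(range(0, 3600, STEP_S))   # one hour of opportunities

def embargoed_split(timestamps, cut_s, embargo_s=WINDOW_S):
    """Train strictly before the cut; drop test points whose trailing
    feature window would overlap raw data already seen in training."""
    train = [t for t in timestamps if t < cut_s]
    test = [t for t in timestamps if t >= cut_s + embargo_s]
    return train, test

train, test = embargoed_split(timestamps, cut_s=1800)
```

You pay for this with discarded data (the 5-minute buffer), but the alternative is a validation score that measures leakage, not skill.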
The Machine Learning House of Horrors! Next up: The Dread of the Drifting Domain
20
Breaking Machine Learning
• You’ve got this great model! Congratulations!
• Suddenly it stops working. Why?
• You might be in a domain that tends to change over time (document classification, sales prediction)
• You might be experiencing adverse selection (market data predictions, spam)
21
Concept Drift
• This is called non-stationarity, in either the prior or the conditional distribution
• Could be a couple of different things
• If the prior p(input) is changing, it’s covariate shift
• If the conditional p(output | input) is changing, it’s concept drift
• No rule that it can’t be both
• http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
22
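The two cases are easy to confuse, so here is a toy simulation (the distributions are assumed for illustration). In one stream only p(input) moves; in the other the inputs look identical but the labeling rule — p(output | input) — has changed:

```python
import random
random.seed(0)

def sample(n, x_mean, label_rule):
    """Draw n (x, y) pairs: x ~ Normal(x_mean, 1), y decided by label_rule."""
    return [(x, 1 if label_rule(x) else 0)
            for x in (random.gauss(x_mean, 1.0) for _ in range(n))]

old             = sample(1000, x_mean=0.0, label_rule=lambda x: x > 0)
covariate_shift = sample(1000, x_mean=2.0, label_rule=lambda x: x > 0)  # p(x) moved
concept_drift   = sample(1000, x_mean=0.0, label_rule=lambda x: x > 1)  # p(y|x) moved
```

The distinction matters in practice: under pure covariate shift your old decision boundary may still be correct where data now lands; under concept drift no amount of reweighting saves it — you have to relearn the mapping.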
Take Action!
• First: Look for symptoms
• Getting a lot of errors
• The distribution of predicted values changes
• Drift detection algorithms (that I know about) have the same basic flavor:
• Buffer some data in memory
• If recent data is “different” from past data, retrain, update or give up
• Some resources - A nice survey paper and an open source package:
23
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
http://moa.cms.waikato.ac.nz/
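The buffer-and-compare flavor described above fits in a few lines. A deliberately crude sketch (real detectors like the ones in MOA are more principled — this just freezes a reference window of some monitored statistic, e.g. the model's predicted values, and flags when a recent window drifts many standard errors away):

```python
from collections import deque
from statistics import mean, stdev
import random

class SimpleDriftDetector:
    """Compare the mean of a recent window against a frozen reference
    window; flag drift when the gap is many standard errors wide."""
    def __init__(self, window=100, threshold=6.0):
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value):
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(value)   # still filling the reference buffer
            return False
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return False
        stderr = stdev(self.reference) / len(self.recent) ** 0.5
        return abs(mean(self.recent) - mean(self.reference)) > self.threshold * stderr

random.seed(1)
det = SimpleDriftDetector()
stream = ([random.gauss(0.0, 1.0) for _ in range(250)] +   # stationary regime
          [random.gauss(1.5, 1.0) for _ in range(150)])    # shifted regime
drift_at = next((i for i, v in enumerate(stream) if det.update(v)), None)
print(drift_at)  # fires some way into the shifted regime that starts at 250
```

On detection you face the same choice as any real system: retrain from scratch, update incrementally, or give up and page a human.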
The Benefits of Archeology
• Why might you train on old data, even if it’s not relevant?
• Verification of your research process
• If you’d done the same thing last year, would it have worked?
• Gives you a good idea of how much drift you should expect
24
The Machine Learning House of Horrors! Next up: The Repugnance of Reliance on Research Results
25
Publish or Perish
• Academic papers are a certain type of result
• Show incremental improvement in accuracy or generality
• Prove something about your algorithm
• The latter is hard to come by as results get more realistic
• Machine learning proofs assume data is i.i.d., but this is obviously false in practice
• Real world data sucks, and dealing with that significantly changes the dataset
26
Usefulness of Results
• Theoretical Results
• Most of the time bounds do not apply (error, sample complexity, convergence)
• Sometimes they don’t even make any sense
• Beware of putting too much faith in a single person or single person’s work
• Usefulness generally occurs only in the aggregate
• And sometimes not even then (researchers are people, too)
27
Machine Learning Isn’t About Machine Learning
• Why doesn’t it work like in the paper?
• Remember, the paper is carefully controlled in a way your application is not.
• Performance is rarely driven by machine learning
• It’s driven by cameras and microphones
• It’s driven by Mario Draghi
28
So, Don’t Bother With It?
• Of course not!
• What’s the alternative?
• “All our science, measured against reality, is primitive and childlike — and yet it is the most precious thing we have” - Albert Einstein
• Use academia as your starting point, but don’t think it will get you out of the work
29
Some Themes
• The major points of this talk:
• Machine learning is hard to get right
• The algorithms won’t do what you want
• Good results are probably spurious
• Even if they aren’t, it won’t last
• Reading the research won’t help
• Wait, no!
• Have an attitude of skeptical optimism (or optimal skepticism?)
30