Machine Learning - Black Art
Charles Parker, Allston Trading
Machine Learning is Hard!
• By now, you know kind of a lot
• Different types of models
• Feature engineering
• Ways to evaluate
• But you’ll still fail!
• Out in the real world, there’s a whole bunch of things that will kill your project
• FYI - A lot of these talks are stolen
2
Join Me!
• On a journey into the Machine Learning House of Horrors!
• Mwa ha ha!
3
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space
• The Perils of The Poorly Picked Loss Function
• The Creeping Creature Called Cross Validation
• The Dread of the Drifting Domain
• The Repugnance of Reliance on Research Results
5
Choosing A Hypothesis Space
• By “hypothesis space” we mean the possible classifiers you could build with an algorithm given the data
• This is the choice you make when you pick a learning algorithm
• You have one job!
• Is there any way to make it easier?
6
Theory to The Rescue!
• Probably Approximately Correct (PAC) learning
• We’d like our model to have error less than ε
• We’d like that to happen with probability at least 1 − δ
• If the error bound is ε, the confidence parameter is δ, the number of training examples is m, and the hypothesis space size is d, a consistent learner needs m ≥ (1/ε)(ln d + ln(1/δ))
7
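The bound above (one standard form, for a finite hypothesis space and a consistent learner) is easy to turn into a calculator. The hypothesis space in the example — boolean conjunctions over 10 variables — is just an illustration, not from the slide:

```python
import math

def pac_sample_bound(epsilon, delta, hyp_space_size):
    """Examples sufficient for a consistent learner over a finite hypothesis
    space to reach error < epsilon with probability at least 1 - delta."""
    return math.ceil((math.log(hyp_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g., boolean conjunctions over 10 variables: |H| = 3**10 hypotheses
m = pac_sample_bound(epsilon=0.05, delta=0.05, hyp_space_size=3 ** 10)
print(m)  # -> 280: a few hundred examples suffice for this tiny space
```

Note how the hypothesis space size only enters through its logarithm — which is why the triple trade-off on the next slide is survivable at all.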
The Triple Trade-Off
• There is a triple trade-off between the error, the size of the hypothesis space, and the amount of training data you have
8
(Diagram: a triangle connecting Error, Hypothesis Space, and Training Data)
What About Huge Data?
• I’m clever, so I’ll use non-parametric methods (decision trees, k-NN, kernelized SVMs)
• As data scales, curious things tend to happen
• Simpler models become more desirable as they’re faster to fit.
• You can increase model complexity by adding features (maybe word counts)
• Big data often trumps modeling!
9
The Machine Learning House of Horrors! Next up: The Perils of The Poorly Picked Loss Function
10
A Dirty Little Secret About ML Algorithms
• They don’t care what you want
• Decision Trees: information gain (entropy reduction) at each split
• SVM: hinge loss, plus a margin penalty
• LR: log loss (negative log-likelihood)
• LDA: between-class vs. within-class scatter
11
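To make "they don't care what you want" concrete, here are two of those per-example losses side by side. Neither is your application's actual cost; the algorithm just minimizes whatever it was built to minimize:

```python
import math

def hinge_loss(margin):   # what an SVM minimizes, per example
    return max(0.0, 1.0 - margin)

def log_loss(margin):     # what logistic regression minimizes, per example
    return math.log(1.0 + math.exp(-margin))

# margin = y * f(x): positive means correct, magnitude means confidence
for m in [-2.0, 0.0, 0.5, 2.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  log={log_loss(m):.3f}")
```

A confidently correct example (margin 2.0) costs exactly zero under hinge loss but still costs something under log loss — the two models literally disagree about which mistakes matter, before you've even said what *you* care about.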
Real-world Losses
• Real losses are nothing like this
• False positive in disease diagnosis
• False positive in face detection
• False positive in thumbprint identification
• Some aren’t even instance-based
• Path dependencies
• Game playing
12
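For the simplest case — asymmetric false-positive/false-negative costs — there is a standard fix: move the decision threshold instead of retraining. The 50x cost ratio below is a made-up illustration, not a number from the talk:

```python
def decision_threshold(cost_fp, cost_fn):
    """Predicting positive minimizes expected cost once the model's
    probability exceeds cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical costs: missing a disease (FN) is 50x worse than a false alarm (FP)
t = decision_threshold(cost_fp=1.0, cost_fn=50.0)
print(t)  # ~0.0196: flag even low-probability cases for follow-up
```

This only covers instance-based losses, though — it doesn't help with the path-dependent and game-playing cases on this slide.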
Specializing Your Loss
• One solution is to let developers apply their own loss
• This is the approach of SVM light:
http://svmlight.joachims.org/
It’s been around for a while
• Losses other than Mutual Information can be plugged into the appropriate place in splitting code
• Models trained via gradient descent can obviously be customized (Python’s Theano is interesting for this)
• For multi-example loss functions, there’s SEARN in Vowpal Wabbit
https://github.com/JohnLangford/vowpal_wabbit
13
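The gradient-descent route is the most general of these. A minimal sketch, in plain Python on made-up data: an asymmetric squared loss where under-predictions cost `alpha` times more than over-predictions (useful when, say, under-forecasting demand is the expensive mistake):

```python
def fit_asymmetric(xs, ys, alpha=5.0, lr=0.01, steps=2000):
    """Gradient descent on an asymmetric squared loss: residuals below zero
    (under-predictions) are weighted `alpha` times more than over-predictions."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            r = (w * x + b) - y                 # prediction minus truth
            weight = 1.0 if r >= 0 else alpha   # under-prediction costs more
            gw += 2.0 * weight * r * x / n
            gb += 2.0 * weight * r / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Made-up data, roughly y = x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.2, 1.9, 3.1, 4.0]
w, b = fit_asymmetric(xs, ys)
```

The resulting fit tracks the data but leans toward over-predicting — exactly the bias the custom loss asked for. Autodiff frameworks do the same thing without hand-written gradients.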
Other Hackery
• Sometimes, the solution is just to hack around the actual prediction
• Have several levels (cascade) of classifiers in e.g., medical diagnosis, text recognition
• Apply logic to explicitly avoid high loss cases (e.g., when buying/selling equities)
• Changing the problem setting
• Will you be doing queries? Use ranking or metric learning
• If you’re thinking “I want to do crazy thing x with classifiers,” chances are it’s already been done and you can read about it
14
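The cascade idea can be sketched in a few lines. The two scorers below are assumed stand-ins for real models (a fast rough one and a slow accurate one); the point is only the control flow — reject obvious negatives cheaply, spend the expensive model on what survives:

```python
def cheap_score(x):       # stand-in for a fast, rough first-stage model
    return x["quick_feature"]

def expensive_score(x):   # stand-in for a slow, accurate second-stage model
    return 0.5 * x["quick_feature"] + 0.5 * x["slow_feature"]

def cascade_predict(x, stage1_threshold=0.1, stage2_threshold=0.5):
    if cheap_score(x) < stage1_threshold:   # cheap early reject, tuned for high recall
        return 0
    return 1 if expensive_score(x) >= stage2_threshold else 0

items = [
    {"quick_feature": 0.02, "slow_feature": 0.9},  # rejected cheaply
    {"quick_feature": 0.6,  "slow_feature": 0.7},  # accepted by stage 2
    {"quick_feature": 0.4,  "slow_feature": 0.2},  # rejected by stage 2
]
preds = [cascade_predict(x) for x in items]
print(preds)  # [0, 1, 0]
```

Stage-one thresholds are set loose on purpose: a false reject there can never be recovered downstream.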
The Machine Learning House of Horrors! Next up: The Creeping Creature Called Cross Validation
15
When Validation Attacks!
• Cross validation
• n-Fold - Hold out one fold for testing, train on n - 1 folds
• Great way to measure performance, right?
• It’s all about information leakage
• via instances
• via features
16
Case Study #1: Law of Averages
• Estimate sporting event outcomes
• Use previous games to estimate points scored for each team (via windowing transform)
• Choose winner based on predicted score
• What if you’re off by one on the window?
17
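Being "off by one on the window" means the feature for game *i* accidentally includes game *i* itself — the label leaks straight into the input, and cross-validation happily rewards it. A sketch with made-up scores:

```python
scores = [20, 27, 14, 31, 24, 17, 28]   # made-up points per game, in order

def trailing_avg(scores, i, window=3, leaky=False):
    """Average points over the `window` games before game i.
    With leaky=True the window is off by one and includes game i itself."""
    end = i + 1 if leaky else i
    past = scores[max(0, end - window):end]
    return sum(past) / len(past)

i = 5
honest = trailing_avg(scores, i)              # games 2-4: no leakage
leaky = trailing_avg(scores, i, leaky=True)   # games 3-5: includes the label!
print(honest, leaky)
```

The leaky version will validate beautifully and fail in production, where game *i*'s score doesn't exist yet when you predict it.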
Case Study #2: Photo Dating
• Take scanned photos from 30 different users (on average 200 per user) and create a model to assign a date taken (plus or minus five years)
• Perform 10-fold cross-validation
• Accuracy is 85%. Can you trust it?
18
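You probably can't trust it: photos from one user share film stock, fading, and scanner artifacts, so random folds let the model "date the user" rather than the photo. The fix is to fold by user, not by photo. A crude sketch of the idea (scikit-learn's `GroupKFold` does this properly; the modulo split below is just the simplest illustration):

```python
# 6 hypothetical users with 3 photos each
photos = [{"user": u, "photo": p} for u in range(6) for p in range(3)]

def user_folds(photos, n_folds=3):
    """Assign every photo from the same user to the same fold."""
    folds = [[] for _ in range(n_folds)]
    for item in photos:
        folds[item["user"] % n_folds].append(item)
    return folds

folds = user_folds(photos)
users_per_fold = [set(x["user"] for x in f) for f in folds]
# No user ever appears on both sides of a train/test split:
assert all(users_per_fold[i].isdisjoint(users_per_fold[j])
           for i in range(3) for j in range(i + 1, 3))
```

Accuracy measured this way is usually much lower — and much closer to what you'll see on a brand-new user.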
Case Study #3: Moments In Time
• You have a buy/sell opportunity every five seconds
• The signals you use to evaluate the opportunity are aggregates of market activity over the last five minutes
• How careful must you be with cross-validation?
19
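Very careful: with 5-minute trailing features and an opportunity every 5 seconds, adjacent instances share almost all their raw data, so a random split leaks relentlessly. One common remedy is a time-ordered split with an embargo gap at least as wide as the feature window — sketched here on synthetic timestamps:

```python
STEP_S = 5        # one trading opportunity every 5 seconds
WINDOW_S = 300    # features aggregate the last 5 minutes of market activity

timestamps = list(range(0, 3600, STEP_S))   # one hour of opportunities

def embargoed_split(timestamps, cut_s, embargo_s=WINDOW_S):
    """Train strictly before the cut; drop test points whose trailing
    feature window would overlap raw data already seen in training."""
    train = [t for t in timestamps if t < cut_s]
    test = [t for t in timestamps if t >= cut_s + embargo_s]
    return train, test

train, test = embargoed_split(timestamps, cut_s=1800)
```

You pay for this with discarded data (the 5-minute buffer), but the alternative is a validation score that measures leakage, not skill.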
The Machine Learning House of Horrors! Next up: The Dread of the Drifting Domain
20
Breaking Machine Learning
• You’ve got this great model! Congratulations!
• Suddenly it stops working. Why?
• You might be in a domain that tends to change over time (document classification, sales prediction)
• You might be experiencing adverse selection (market data predictions, spam)
21
Concept Drift
• This is called non-stationarity, in either the prior or the conditional distribution
• Could be a couple of different things
• If the prior p(input) is changing, it’s covariate shift
• If the conditional p(output | input) is changing, it’s concept drift
• No rule that it can’t be both
• http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
22
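The two cases are easy to confuse, so here is a toy simulation (the distributions are assumed for illustration). In one stream only p(input) moves; in the other the inputs look identical but the labeling rule — p(output | input) — has changed:

```python
import random
random.seed(0)

def sample(n, x_mean, label_rule):
    """Draw n (x, y) pairs: x ~ Normal(x_mean, 1), y decided by label_rule."""
    return [(x, 1 if label_rule(x) else 0)
            for x in (random.gauss(x_mean, 1.0) for _ in range(n))]

old             = sample(1000, x_mean=0.0, label_rule=lambda x: x > 0)
covariate_shift = sample(1000, x_mean=2.0, label_rule=lambda x: x > 0)  # p(x) moved
concept_drift   = sample(1000, x_mean=0.0, label_rule=lambda x: x > 1)  # p(y|x) moved
```

The distinction matters in practice: under pure covariate shift your old decision boundary may still be correct where data now lands; under concept drift no amount of reweighting saves it — you have to relearn the mapping.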
Take Action!
• First: Look for symptoms
• Getting a lot of errors
• The distribution of predicted values changes
• Drift detection algorithms (that I know about) have the same basic flavor:
• Buffer some data in memory
• If recent data is “different” from past data, retrain, update or give up
• Some resources - A nice survey paper and an open source package:
23
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
http://moa.cms.waikato.ac.nz/
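The buffer-and-compare flavor described above fits in a few lines. A deliberately crude sketch (real detectors like the ones in MOA are more principled — this just freezes a reference window of some monitored statistic, e.g. the model's predicted values, and flags when a recent window drifts many standard errors away):

```python
from collections import deque
from statistics import mean, stdev
import random

class SimpleDriftDetector:
    """Compare the mean of a recent window against a frozen reference
    window; flag drift when the gap is many standard errors wide."""
    def __init__(self, window=100, threshold=6.0):
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value):
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(value)   # still filling the reference buffer
            return False
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return False
        stderr = stdev(self.reference) / len(self.recent) ** 0.5
        return abs(mean(self.recent) - mean(self.reference)) > self.threshold * stderr

random.seed(1)
det = SimpleDriftDetector()
stream = ([random.gauss(0.0, 1.0) for _ in range(250)] +   # stationary regime
          [random.gauss(1.5, 1.0) for _ in range(150)])    # shifted regime
drift_at = next((i for i, v in enumerate(stream) if det.update(v)), None)
print(drift_at)  # fires some way into the shifted regime that starts at 250
```

On detection you face the same choice as any real system: retrain from scratch, update incrementally, or give up and page a human.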
The Benefits of Archeology
• Why might you train on old data, even if it’s not relevant?
• Verification of your research process
• If you’d done the same thing last year, would it have worked?
• Gives you a good idea of how much drift you should expect
24
The Machine Learning House of Horrors! Next up: The Repugnance of Reliance on Research Results
25
Publish or Perish
• Academic papers are a certain type of result
• Show incremental improvement in accuracy or generality
• Prove something about your algorithm
• The latter is hard to come by as results get more realistic
• Machine learning proofs assume data is i.i.d., but this is obviously false in practice
• Real world data sucks, and dealing with that significantly changes the dataset
26
Usefulness of Results
• Theoretical Results
• Most of the time bounds do not apply (error, sample complexity, convergence)
• Sometimes they don’t even make any sense
• Beware of putting too much faith in a single person or single person’s work
• Usefulness generally occurs only in the aggregate
• And sometimes not even then (researchers are people, too)
27
Machine Learning Isn’t About Machine Learning
• Why doesn’t it work like in the paper?
• Remember, the paper is carefully controlled in a way your application is not.
• Performance is rarely driven by machine learning
• It’s driven by cameras and microphones
• It’s driven by Mario Draghi
28
So, Don’t Bother With It?
• Of course not!
• What’s the alternative?
• “All our science, measured against reality, is primitive and childlike — and yet it is the most precious thing we have” - Albert Einstein
• Use academia as your starting point, but don’t think it will get you out of the work
29
Some Themes
• The major points of this talk:
• Machine learning is hard to get right
• The algorithms won’t do what you want
• Good results are probably spurious
• Even if they aren’t, it won’t last
• Reading the research won’t help
• Wait, no!
• Have an attitude of skeptical optimism (or optimal skepticism?)
30