Mistakes I've Made - PyData Seattle 2015 - Cam Davidson-Pilon


Aug 15, 2015

Page 1: Mistakes I've Made- Cam Davidson-Pilon

Mistakes I've Made

PyData Seattle 2015
Cam Davidson-Pilon

Page 2: Mistakes I've Made- Cam Davidson-Pilon
Page 3: Mistakes I've Made- Cam Davidson-Pilon

Who am I?

Cam Davidson-Pilon

- Lead on the Data Team at Shopify

- Open source contributor

- Author of Bayesian Methods for Hackers (in print soon!)

Page 4: Mistakes I've Made- Cam Davidson-Pilon
Page 5: Mistakes I've Made- Cam Davidson-Pilon

Ottawa

Page 6: Mistakes I've Made- Cam Davidson-Pilon

Ottawa

Page 7: Mistakes I've Made- Cam Davidson-Pilon

Ottawa?

Page 8: Mistakes I've Made- Cam Davidson-Pilon

Case Study 1

Page 9: Mistakes I've Made- Cam Davidson-Pilon

We needed to predict mail return rates based on census data.

Sample Data (simplified):

Page 10: Mistakes I've Made- Cam Davidson-Pilon

Well, I'm predicting the rate, so I build that:

Page 11: Mistakes I've Made- Cam Davidson-Pilon

Don't need the margins of error...

Page 12: Mistakes I've Made- Cam Davidson-Pilon

...then do "data science"

Page 13: Mistakes I've Made- Cam Davidson-Pilon

Outcome: failure

What went wrong? At the time, ¯\_(ツ)_/¯

Page 14: Mistakes I've Made- Cam Davidson-Pilon

(highly, highly recommended!)

Page 15: Mistakes I've Made- Cam Davidson-Pilon

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
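
A minimal simulation of this formula, as a sketch rather than anything from the talk itself. The population (normal, standard deviation 2.0), the sample sizes and the helper name `sd_of_sample_mean` are all assumptions chosen for illustration. It draws many samples of size n, takes the mean of each, and compares the spread of those means with the population standard deviation divided by the square root of n.

```python
import math
import random
import statistics

rng = random.Random(42)
population_sd = 2.0  # hypothetical population standard deviation


def sd_of_sample_mean(n, trials=4000):
    """Empirical std. deviation of the mean of n draws, over many repeated samples."""
    means = [
        statistics.fmean(rng.gauss(0.0, population_sd) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


# Columns: sample size, simulated sd of the sample mean, sd / sqrt(n).
for n in (4, 25, 100):
    simulated = sd_of_sample_mean(n)
    predicted = population_sd / math.sqrt(n)
    print(n, round(simulated, 3), round(predicted, 3))
```

The two right-hand columns should track each other closely: a rate computed from a small sample is simply a much noisier number than the same rate computed from a large one.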

Page 16: Mistakes I've Made- Cam Davidson-Pilon

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

"The std. deviation of the sample mean is equal to the std. deviation of the population over square-root n"

Page 17: Mistakes I've Made- Cam Davidson-Pilon
Page 18: Mistakes I've Made- Cam Davidson-Pilon
Page 19: Mistakes I've Made- Cam Davidson-Pilon
Page 20: Mistakes I've Made- Cam Davidson-Pilon

What I learned

1. Sample sizes are so important when dealing with aggregate level data.
2. It was only an issue because the sample sizes were different, too.
3. Use the Margin of Error, don't ignore it - it's there for a reason.
4. I got burned so bad here, I became a Bayesian soon after.
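
To make the first three points concrete, here is a small sketch on made-up data, not the census data from the case study. Every hypothetical region is given the same true return rate (0.7); half are measured with 20 responses and half with 2000. The region counts, sizes and the helper `margin_of_error` (a normal-approximation 95% margin) are assumptions for illustration only.

```python
import math
import random

rng = random.Random(0)
true_rate = 0.7  # hypothetical: every region has the same true return rate

# Hypothetical regions: 300 with small samples, 300 with large samples.
sizes = [20] * 300 + [2000] * 300
observed = []
for n in sizes:
    returned = sum(rng.random() < true_rate for _ in range(n))
    observed.append((returned / n, n))


def margin_of_error(rate, n):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    return 1.96 * math.sqrt(rate * (1 - rate) / n)


# Look at the 20 lowest and 20 highest observed rates.
observed.sort()
extremes = observed[:20] + observed[-20:]
print("sample sizes among the most extreme rates:", sorted({n for _, n in extremes}))
print("margin of error at n=20:  ", round(margin_of_error(true_rate, 20), 3))
print("margin of error at n=2000:", round(margin_of_error(true_rate, 2000), 3))
```

The "best" and "worst" regions are just the small ones, even though nothing differs between regions. A model trained on the raw rates without the margins of error would be fitting that noise.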

Page 21: Mistakes I've Made- Cam Davidson-Pilon

Case Study 2

An intra-day time series of the S&P, Dow, Nasdaq and FTSE (UK index)

Page 22: Mistakes I've Made- Cam Davidson-Pilon

Suppose you are interested in doing some day trading. Your target: UK stocks.

Futures on the FTSE in particular.

Page 23: Mistakes I've Made- Cam Davidson-Pilon
Page 24: Mistakes I've Made- Cam Davidson-Pilon

Post Backtesting Results

Page 25: Mistakes I've Made- Cam Davidson-Pilon

Push to Production - investing real money

Page 26: Mistakes I've Made- Cam Davidson-Pilon

What happened?

Data Leakage happened

Page 27: Mistakes I've Made- Cam Davidson-Pilon
Page 28: Mistakes I've Made- Cam Davidson-Pilon

What I learned

1. Your backtesting / cross-validation results will always be equal to or more optimistic than live performance - plan for that.
2. Understand where your data comes from, from start to finish.
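
The transcript does not record how the leak happened in this case, so the following is only a generic illustration of look-ahead leakage, on made-up data. The returns are pure noise, and the `backtest` function and its `signal_lag` parameter are hypothetical names. With `signal_lag=0` the trading signal uses information that would not be available at the time of the trade.

```python
import random

rng = random.Random(1)

# Hypothetical index returns: pure noise, so no honest strategy should profit on average.
returns = [rng.gauss(0.0, 0.01) for _ in range(2000)]


def sign(x):
    return 1.0 if x > 0 else -1.0


def backtest(signal_lag):
    """Total return from going long or short each period on the sign of a return.

    signal_lag=1 uses the previous period's return, which is known when we trade.
    signal_lag=0 uses the current period's return, which is only known once the
    period is over: that is the leak.
    """
    total = 0.0
    for t in range(1, len(returns)):
        total += sign(returns[t - signal_lag]) * returns[t]
    return total


leaky = backtest(signal_lag=0)
honest = backtest(signal_lag=1)
print(f"leaky backtest:  {leaky:+.2f}")
print(f"honest backtest: {honest:+.2f}")
```

The leaky version looks like a money machine in backtesting and would do nothing of the sort in production, which is why tracing each input's timestamp from start to finish matters.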

Page 29: Mistakes I've Made- Cam Davidson-Pilon

Case Study 3

Page 30: Mistakes I've Made- Cam Davidson-Pilon

What I learned

1. When developing statistical software that already exists in the wild, write tests against the output of that software.
2. Be responsible for your software:
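
For the first lesson, a minimal sketch of what such a test could look like. The function `my_sample_variance` is a hypothetical stand-in for newly written statistical code, and Python's built-in `statistics.variance` stands in for the existing software; neither is taken from the talk.

```python
import random
import statistics


def my_sample_variance(values):
    """Hypothetical in-house implementation of the unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def test_matches_reference(trials=200, seed=0):
    """Check the new code against an existing implementation on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.uniform(-100, 100) for _ in range(rng.randint(2, 50))]
        expected = statistics.variance(data)  # the reference implementation
        actual = my_sample_variance(data)
        tolerance = 1e-9 * max(1.0, abs(expected))
        assert abs(actual - expected) <= tolerance, (actual, expected)


test_matches_reference()
print("my_sample_variance agrees with statistics.variance")
```

Randomized inputs are a deliberate choice here: hand-picked cases tend to miss exactly the inputs where two implementations disagree.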

Page 31: Mistakes I've Made- Cam Davidson-Pilon

Case Study 4

Page 32: Mistakes I've Made- Cam Davidson-Pilon

It was my first A/B test at Shopify...

Control group: 4%
Experiment group: 5%

Bayesian A/B testing told me there was a statistically significant difference between the groups...

Page 33: Mistakes I've Made- Cam Davidson-Pilon

Upper management wanted to know the relative increase...

(5% - 4%) / 4% = 25%

Page 34: Mistakes I've Made- Cam Davidson-Pilon

No. We forgot sample size again.

Page 35: Mistakes I've Made- Cam Davidson-Pilon
Page 36: Mistakes I've Made- Cam Davidson-Pilon
Page 37: Mistakes I've Made- Cam Davidson-Pilon
Page 38: Mistakes I've Made- Cam Davidson-Pilon
Page 39: Mistakes I've Made- Cam Davidson-Pilon

What I learned

1. Don't naively compute stats on top of stats - this only compounds the uncertainty.
2. Better to underestimate than overestimate.
3. Visualizing uncertainty is the role of a statistician.
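
One way to act on these points is to carry the uncertainty through to the relative increase instead of dividing two point estimates. The sketch below is an assumption-laden illustration: the slides give only the rates (4% and 5%), so the counts of 200/5000 and 250/5000 are hypothetical, as are the uniform Beta(1, 1) prior and the helper name `posterior_draw`.

```python
import random

rng = random.Random(7)

# Hypothetical counts: the slides give only the rates, not the sample sizes.
control_conversions, control_n = 200, 5000        # 4%
experiment_conversions, experiment_n = 250, 5000  # 5%


def posterior_draw(conversions, n):
    """One draw from the Beta posterior of a conversion rate, uniform Beta(1, 1) prior."""
    return rng.betavariate(1 + conversions, 1 + n - conversions)


# Sample both rates, then compute the relative increase for each joint draw.
lifts = []
for _ in range(20000):
    p_control = posterior_draw(control_conversions, control_n)
    p_experiment = posterior_draw(experiment_conversions, experiment_n)
    lifts.append((p_experiment - p_control) / p_control)

lifts.sort()
prob_experiment_better = sum(lift > 0 for lift in lifts) / len(lifts)
low, median, high = (lifts[int(q * len(lifts))] for q in (0.025, 0.5, 0.975))

print(f"P(experiment > control) = {prob_experiment_better:.3f}")
print(f"relative increase: median {median:.0%}, 95% interval [{low:.0%}, {high:.0%}]")
```

Even when the experiment group is very likely better, the interval on the relative increase is wide under these assumed counts. Reporting a low quantile of `lifts` rather than the 25% point estimate is one way to underestimate rather than overestimate.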

Page 40: Mistakes I've Made- Cam Davidson-Pilon

Machine Learning counter examples

Page 41: Mistakes I've Made- Cam Davidson-Pilon

Sparse-ing the solution naively

Page 42: Mistakes I've Made- Cam Davidson-Pilon

Coefficients after linear regression*:

*Assume data has been normalized too, i.e. mean 0 and standard deviation 1

Page 43: Mistakes I've Made- Cam Davidson-Pilon

Decide to drop a variable:

Page 44: Mistakes I've Made- Cam Davidson-Pilon
Page 45: Mistakes I've Made- Cam Davidson-Pilon

Suppose this is the true model...

Okay, our regression got the coefficients right, but...

Page 46: Mistakes I've Made- Cam Davidson-Pilon

So actually, together, these variables have very little contribution to Y!

Page 47: Mistakes I've Made- Cam Davidson-Pilon

Solution:

Any form of regularization will solve this. For example, using ridge regression with even the slightest penalizer gives:
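
The coefficient tables from the slides are not in this transcript, so the following is a hedged reconstruction of the idea on made-up data, not the original example. It assumes two nearly identical variables `x1` and `x2` with large opposite coefficients (100 and -100) plus an independent `x3` with coefficient 1. The helpers `solve` and `ridge` are hypothetical names for a plain closed-form fit.

```python
import random
import statistics

rng = random.Random(0)
n = 2000

# Hypothetical data: x1 and x2 are nearly identical, x3 is independent.
# All three are roughly mean 0, standard deviation 1.
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [a + rng.gauss(0.0, 0.001) for a in x1]
x3 = [rng.gauss(0.0, 1.0) for _ in range(n)]
rows = list(zip(x1, x2, x3))

# Hypothetical "true model": two huge coefficients that almost cancel.
true_coefs = [100.0, -100.0, 1.0]
y = [sum(c * v for c, v in zip(true_coefs, row)) for row in rows]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    size = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(size):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][size] / m[i][i] for i in range(size)]


def ridge(rows, y, penalty):
    """Closed-form ridge fit without intercept: (X'X + penalty * I)^-1 X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (penalty if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


ols = ridge(rows, y, penalty=0.0)                 # ordinary least squares
slightly_penalized = ridge(rows, y, penalty=1.0)  # a small penalty for n = 2000

pair = [100.0 * a - 100.0 * b for a, b in zip(x1, x2)]
print("OLS coefficients:  ", [round(c, 2) for c in ols])
print("ridge coefficients:", [round(c, 2) for c in slightly_penalized])
print("sd of the x1/x2 pair's joint contribution:", round(statistics.pstdev(pair), 3))
print("sd of x3's contribution:", round(statistics.pstdev(x3), 3))
```

Under these assumptions least squares recovers the large coefficients, which makes `x3` look like the obvious variable to drop, yet `x3` carries almost all of the signal. The small penalty shrinks the cancelling pair toward zero and leaves `x3` nearly untouched.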

Page 48: Mistakes I've Made- Cam Davidson-Pilon
Page 49: Mistakes I've Made- Cam Davidson-Pilon

PCA before Regression

Page 50: Mistakes I've Made- Cam Davidson-Pilon

PCA is great at many things, but it can actually significantly hurt regression if used as a preprocessing step. How?

Page 51: Mistakes I've Made- Cam Davidson-Pilon

Suppose we wish to regress Y onto X and W. The true model of Y is Y = X - W. We don't know this yet.

Suppose further there is a positive correlation between X and W, say 0.5.

Apply PCA to [X W], we get a new matrix:

$\left[\ \frac{1}{\sqrt{2}}X + \frac{1}{\sqrt{2}}W\ ,\ \ \frac{1}{\sqrt{2}}X - \frac{1}{\sqrt{2}}W\ \right]$

Page 52: Mistakes I've Made- Cam Davidson-Pilon

$\left[\ \frac{1}{\sqrt{2}}X + \frac{1}{\sqrt{2}}W\ ,\ \ \frac{1}{\sqrt{2}}X - \frac{1}{\sqrt{2}}W\ \right]$

Textbook analysis tells you to drop the second dimension from this new PCA.

Page 53: Mistakes I've Made- Cam Davidson-Pilon

So now we are regressing Y onto:

$\left[\ \frac{1}{\sqrt{2}}X + \frac{1}{\sqrt{2}}W\ \right]$

i.e., find values to fit the model:

$Y = \alpha + \beta(X + W)$

But there are no good values for these unknowns!
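
A small sketch of this argument on simulated data; it is not the IPython demo from the talk. It assumes X and W are standard normal with correlation 0.5, uses the two principal components stated on the slides directly instead of running a PCA routine, and the helper name `r_squared` is hypothetical.

```python
import math
import random
import statistics

rng = random.Random(3)
n = 5000

# X and W: unit variance, correlation 0.5 (W = 0.5 * X + sqrt(0.75) * noise).
X = [rng.gauss(0.0, 1.0) for _ in range(n)]
W = [0.5 * x + math.sqrt(0.75) * rng.gauss(0.0, 1.0) for x in X]
Y = [x - w for x, w in zip(X, W)]  # the true model from the slides

# For two equal-variance, positively correlated variables, the population
# principal components are (X + W)/sqrt(2) and (X - W)/sqrt(2).
pc1 = [(x + w) / math.sqrt(2) for x, w in zip(X, W)]
pc2 = [(x - w) / math.sqrt(2) for x, w in zip(X, W)]


def r_squared(feature, target):
    """R^2 of the one-variable least-squares fit target = alpha + beta * feature."""
    mf, mt = statistics.fmean(feature), statistics.fmean(target)
    sxy = sum((f - mf) * (t - mt) for f, t in zip(feature, target))
    sxx = sum((f - mf) ** 2 for f in feature)
    syy = sum((t - mt) ** 2 for t in target)
    return sxy * sxy / (sxx * syy)


print("variance of PC1:", round(statistics.variance(pc1), 2))
print("variance of PC2:", round(statistics.variance(pc2), 2))
print("R^2 regressing Y on PC1 only:", round(r_squared(pc1, Y), 4))
print("R^2 regressing Y on PC2 only:", round(r_squared(pc2, Y), 4))
```

PC1 has the larger variance, so it is the one a variance-based rule keeps, but it explains essentially none of Y. The discarded PC2 is Y up to a constant factor.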

Page 54: Mistakes I've Made- Cam Davidson-Pilon

Quick IPython Demo

Page 55: Mistakes I've Made- Cam Davidson-Pilon

Solution:

Don't use naive PCA before regression: you are losing information. Try something like supervised PCA, or just don't do it.

Page 56: Mistakes I've Made- Cam Davidson-Pilon

Thanks for listening :)

@cmrn_dp