Opticon 2017: Experimenting with Stats Engine
Pete Koomen, Co-founder & CTO, Optimizely
Jan 21, 2018
Transcript
Page 1

Experimenting with Stats Engine
Pete Koomen
Co-founder, CTO, Optimizely
@koomen
[email protected]

opticon2017

Page 2

Agenda

1. Why we built Stats Engine
2. How to make decisions with Stats Engine
3. How to scale your decision process

Page 3

Why we built Stats Engine

Pages 4-5: [image-only slides]

Page 6

The study followed 1,291 participants for 10 years.

No exercise: 438 participants, 128 deaths (29%)
Light exercise: 576 participants, 7 deaths (1%)
Moderate exercise: 262 participants, 8 deaths (3%)
Heavy exercise: 40 participants, 2 deaths (5%)

Page 7

“Thank goodness a third person didn't die, or public health authorities would be banning jogging.”

– Alex Hutchinson, Runner’s World

Pages 8-9: [image-only slides]

Page 10

“A/A” results

Page 11

The “T-test” (a.k.a. “NHST”, a.k.a. “Student’s t-test”)

The T-test in a nutshell:
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
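As a concrete illustration of steps 2 and 3 (a sketch, not Optimizely's implementation), here is a classical two-sample t-test in Python on simulated conversion data; the 10% and 11% conversion rates are made-up numbers:

```python
# Fixed-horizon t-test on two simulated conversion streams.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control   = rng.binomial(1, 0.10, size=5_000)  # 10% baseline conversion (assumed)
variation = rng.binomial(1, 0.11, size=5_000)  # 11% with the change (assumed)

# "What are the chances I'd have gotten these results in an A/A test?"
t_stat, p_value = stats.ttest_ind(variation, control)
print(f"p-value = {p_value:.3f}; significant at 5%: {p_value < 0.05}")
```

The discipline is in step 1: the sample size (5,000 per branch here) must be fixed before the test starts, and the p-value read only once.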

Page 12

1908: Data is expensive. Data is slow. Practitioners are trained.
2017: Data is cheap. Data is real-time. Practitioners are everyone.

The T-test was designed for the world of 1908.

Page 13

T-Test Pitfalls
1. Peeking
2. Multiple comparisons

Page 14

1. Peeking

Page 15

[Figure: a p-value trajectory over time. The experiment starts; successive peeks show p-value > 5% (inconclusive) until one peek, still before the minimum sample size is reached, shows p-value < 5% and is declared significant.]

Page 16

Why is this a problem?

There is a ~5% chance of seeing a false positive each time you peek.

Page 17

[Figure: the same p-value trajectory, peeked at four times before the minimum sample size is reached.]

4 peeks → ~18% chance of seeing a false positive
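The ~18% figure is the back-of-the-envelope result of treating each peek as an independent 5% false-positive opportunity (in reality successive peeks are correlated, so this is an approximation):

```python
# Chance of at least one false positive in k peeks, assuming each peek
# is an independent test at the 5% level (an approximation).
for k in (1, 2, 4, 10):
    print(f"{k} peeks -> {1 - 0.95 ** k:.1%}")
# 4 peeks -> 18.5%, the slide's ~18%
```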

Page 18

The “T-test” (a.k.a. “NHST”, a.k.a. “Student’s t-test”)

The T-test in a nutshell:
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.

Page 19

[Figure: results checked at 1:45, 2:45, 3:45, 4:45, and 5:45.]

Page 20

Solution: Stats Engine uses sequential testing to compute an “always-valid” p-value.
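The always-valid p-value comes from sequential testing; the KDD 2017 paper cited later in the deck describes the mixture sequential probability ratio test (mSPRT) behind Stats Engine. Here is a minimal sketch for a stream of normally distributed per-visitor differences, with an assumed known variance sigma2 and mixing variance tau2 (both simplifications of the real system):

```python
import numpy as np

def always_valid_p_values(diffs, sigma2=1.0, tau2=0.1):
    """Always-valid p-values via the mixture SPRT (mSPRT) under H0: mean diff = 0.

    diffs:  stream of per-visitor differences (variation minus control)
    sigma2: observation variance, assumed known here (a simplification)
    tau2:   variance of the normal mixing prior over the true lift (assumed)
    """
    p, total, out = 1.0, 0.0, []
    for n, x in enumerate(diffs, start=1):
        total += x
        xbar = total / n
        # Mixture likelihood ratio of the data against "no difference"
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            n ** 2 * tau2 * xbar ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
        )
        # Taking a running minimum keeps the p-value valid at every peek
        p = min(p, 1.0 / lam)
        out.append(p)
    return out
```

Because this p-value can only fall over time, you can look at it after every visitor and still keep the advertised false-positive guarantee.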

Page 21

2. Multiple Comparisons

Pages 22-23

[xkcd comic, © Randall Patrick Munroe, xkcd.com]

Page 24

[Figure: a results grid of metrics 1-5 against variations A, B, C, D, and Control: 4 variations × 5 metrics = 20 simultaneous comparisons.]

Page 25

False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”

False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
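A quick numeric illustration of why the two rates differ (the 10% base rate and 80% power below are assumptions for the sketch, not Optimizely numbers): even with a 5% false positive rate, a large share of declared winners can be flukes when real improvements are rare.

```python
# Bayes arithmetic: false discovery rate implied by a 5% false positive rate.
base_rate = 0.10  # fraction of variations with a real improvement (assumed)
power     = 0.80  # chance a real improvement reaches significance (assumed)
fpr       = 0.05  # false positive rate of the test

true_wins  = base_rate * power        # 0.080
false_wins = (1 - base_rate) * fpr    # 0.045
fdr = false_wins / (true_wins + false_wins)
print(f"False discovery rate = {fdr:.0%}")  # ~36% of "winners" are flukes
```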

Page 26

How to make decisions with Stats Engine

1. When should I stop an experiment?
2. Understanding resets
3. How do additional variations and metrics affect my experiment?
4. How do I trade off between risk and velocity?

Page 27

How to make decisions with Stats Engine

1. When should I stop an experiment?

Page 28

[Screenshot: results page showing a variation’s “visitors remaining” estimate.]

👍 Use “visitors remaining” to decide whether continuing your experiment is worth it.
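Stats Engine computes “visitors remaining” internally; as a rough fixed-horizon analogue (an assumption, not Stats Engine’s actual method), classical power analysis gives the idea. The baseline, lift, and traffic numbers below are hypothetical:

```python
# Rough "visitors remaining" estimate via fixed-horizon power analysis.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # control conversion rate (hypothetical)
mde      = 0.02   # minimum detectable absolute lift (hypothetical)
seen     = 3_000  # visitors observed so far in each branch (hypothetical)

effect = proportion_effectsize(baseline + mde, baseline)
needed = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{max(0, int(needed - seen)):,} visitors remaining per branch")
```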

Page 29

How to make decisions with Stats Engine

2. Understanding resets

Page 30

Page 31

“Peeking at A/B Tests: Why it matters, and what to do about it”, KDD 2017

👍 Statistical Significance rises whenever there is strong evidence of a difference between variation and control.

Page 32

“Peeking at A/B Tests: Why it matters, and what to do about it”, KDD 2017

Page 33

[Screenshot: results page for a variation.]

👍 Statistical Significance will “reset” when there is strong evidence of an underlying change.

Page 34

[Screenshot: a variation with a confidence interval from -19.3% to -2.58%.]

👍 If your point estimate is near the edge of its confidence interval, consider running the experiment longer.

Page 35

How to make decisions with Stats Engine

3. How do additional variations and metrics affect my experiment?

Page 36

False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”

False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.

Page 37

Stats Engine treats each metric as a “signal”.

High-signal metrics are directly affected by the experiment.
Low-signal metrics are indirectly affected by the experiment, or not affected at all.

Page 38

False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”

False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more low-signal metrics and variations are added to a test.
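Stats Engine’s correction is sequential (see the KDD 2017 paper); the classical Benjamini-Hochberg procedure below is a stand-in that shows the basic mechanism of growing more conservative as comparisons are added:

```python
# Benjamini-Hochberg step-up: control the FDR at level q across m p-values.
def benjamini_hochberg(p_values, q=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    return {order[r] for r in range(cutoff)}  # indices declared significant

# Adding more (null) metrics raises the bar for every comparison:
print(benjamini_hochberg([0.002, 0.02, 0.20, 0.5]))    # {0, 1}
print(benjamini_hochberg([0.002, 0.02] + [0.5] * 18))  # {0}
```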

Page 39

[Figure: a results grid of variations A-D against metrics 1-8, with the metrics grouped as Primary, Secondary, and Monitoring.]

Page 40

👍 For maximum velocity, use “high signal” primary and secondary metrics.

👍 Use monitoring metrics for “low signal” metrics.

Page 41

How to make decisions with Stats Engine

4. How do I trade off between risk and velocity?

Page 42

[Screenshot: the “Max False Discovery Rate” setting.]

👍 Use your Statistical Significance threshold to control risk vs. velocity.
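One way to see the trade-off (a fixed-horizon approximation; Stats Engine’s sequential math differs, but the direction is the same): required sample size scales with the square of the normal quantiles, so relaxing the significance threshold shrinks the wait roughly proportionally. The baseline and lift below are hypothetical:

```python
# Required visitors per branch vs. significance threshold
# (two-proportion, fixed-horizon normal approximation).
from scipy.stats import norm

def n_per_branch(alpha, power=0.8, baseline=0.10, lift=0.02):
    p1, p2 = baseline, baseline + lift
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(z ** 2 * var / lift ** 2)

for alpha in (0.05, 0.10, 0.20):
    print(f"{1 - alpha:.0%} significance: ~{n_per_branch(alpha):,} visitors/branch")
```

A lower threshold ships decisions faster at the cost of a higher share of false winners; the right setting depends on the risk class of the experiment, which the next section turns to.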

Page 43

How to scale your decision process

1. Risk vs. Velocity for Experimentation Programs
2. Getting organizational buy-in

Page 44

Risk vs. Velocity for Experimentation Programs

👍 Define “risk classes” for your team’s experiments.

👍 Keep low-risk experiments “low touch”.

👍 Save data science analysis resources for high-risk experiments.

👍 Run high-risk experiments for 1+ conversion cycles to control for seasonality.

👍 Rerun high-risk experiments.

Page 45

Getting organizational buy-in

👍 Decide how and when you’ll share experiment results with your organization.

👍 Write down your “decision process” and socialize it with the team.

Page 46

Q&A
Pete Koomen
@koomen
[email protected]