Statistics in 40 Minutes: A/B Testing Fundamentals
Leo Pekelis, Statistician, Optimizely
@lpekelis | [email protected] | #opticon2015
Aug 12, 2015
You have your own unique approach to A/B Testing
The goal of this talk is to break down A/B Testing to its fundamentals.
A/B Testing Platform:
1) Create an experiment
2) Read the results page
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

The answers:
1. A good hypothesis has a variation and a clearly defined goal, both crafted ahead of time.
2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low signal goals.
3. All three levers are inversely related. For example, running my tests longer can get me lower error rates, or let me detect smaller effects.
First, some vocabulary (yay!)
• Control and Variation: The control is the original, or baseline, version of content that you test against a variation.
• Goal: The metric used to measure the impact of the control and variation.
• Baseline conversion rate: The control group's expected conversion rate.
• Effect size: The improvement (positive or negative) of your variation over baseline.
• Sample size: The number of visitors in your test.
• Hypothesis test: A control and variation pair that you want to show improves a goal.
• Experiment: A collection of hypotheses (goal & variation pairs) that all share the same control.
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
What is a good hypothesis (test)?
Why is this not actionable? "I think changing the header image will make my site better."
• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase "the finals" clicks?
• Growing the header will increase engagement?
• Growing the header will increase "the finals" clicks?
• …
Bad hypothesis: test creep!
Bad hypothesis: "I think changing the header image will make my site better."
Good hypotheses: the organized and clear list above.
Hypotheses also give the cost of your experiment: the more relationships (hypotheses) you test, the longer (in visitors) it will take to achieve the same outcome (error rate). The sketch below makes that cost concrete.
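A minimal sketch of why, assuming a Bonferroni-style split of the error budget across hypotheses (an illustration only; Optimizely's Stats Engine handles multiplicity differently, via false discovery rates, covered later in this talk):

```python
# Keeping a 5% overall false positive rate across m hypotheses forces
# each individual test to clear a much stricter significance bar
# (Bonferroni split; an assumption for illustration).
for m in (1, 5, 20):
    print(f"{m:>2} hypotheses -> per-test significance threshold {0.05 / m:.4f}")
# Stricter thresholds take more visitors to reach, so runtime grows.
```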
Questions to check for a good hypothesis:
• What are you trying to show with your idea?
• What key metrics should it drive?
• Are all my goals and variations necessary given my testing limits?
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
Answer: A good hypothesis has a variation and a clearly defined goal, both crafted ahead of time.
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
[Example: an A/B test on http://www.nba.com/]
What are the possible outcomes?
"True" value of hypothesis (columns) vs. result of test (rows):

Result of test   | Improvement    | No effect
Winner / Loser   | True positive  | False positive
Inconclusive     | False negative | True negative

The four possible outcomes:
• (no effect, winner / loser): a false positive
• (+/- improvement, inconclusive): a false negative
• (+/- improvement, winner / loser): a true positive
• (no effect, inconclusive): a true negative
The 2x2 table will help us to:
1. Keep track of the different error rates we care about
2. Explore the consequences of controlling false positives vs. false discoveries
Error rate 1: False positive rate
• False positive rate (Type I error)
= "Chance of a false positive from a variation with no effect on a goal"
= #(False positives) / #(No effect)

• Thresholding the FPR
= "When I have a variation with no effect on a goal, I'll find an effect less than 10% of the time."
How can we ever compute a False Positive Rate if we don’t know whether a hypothesis is true or not?
Statistical tests (the fixed horizon t-test, Stats Engine) are designed to threshold an error rate. Example: "Calling winners & losers when a p-value is below .05 will guarantee a False Positive Rate below 5%."
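A quick simulation makes the guarantee concrete. This is a sketch, not the talk's method: it assumes a fixed-horizon t-test on simulated conversion data, with made-up sample sizes and rates.

```python
# Under a true null (the variation has no effect), calling winners and
# losers at p < .05 produces false positives about 5% of the time.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_tests, n_visitors, p_base = 2000, 5000, 0.10
false_positives = 0

for _ in range(n_tests):
    control = rng.binomial(1, p_base, n_visitors)
    variation = rng.binomial(1, p_base, n_visitors)  # truly no effect
    _, p_value = ttest_ind(control, variation)
    false_positives += p_value < 0.05

print(f"False positive rate: {false_positives / n_tests:.3f}")  # ~0.050
```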
False Positive Rates with multiple tests
https://xkcd.com/882/
What happened?
21 tests × 5% FPR ≈ 1 false positive on average.
False positive rates are only useful in the context of all hypotheses
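The jelly-bean arithmetic, spelled out (assuming independent tests for simplicity):

```python
# With 21 independent null tests at a 5% false positive rate each, you
# expect about one false positive, and the chance of seeing at least
# one is roughly two in three.
m, alpha = 21, 0.05
print("Expected false positives:", m * alpha)    # 1.05
print("P(at least one):", 1 - (1 - alpha) ** m)  # ~0.659
```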
Error rate 2: False discovery rate
• False discovery rate (FDR)
= "Chance of a false positive from a conclusive result"
= #(False positives) / #(Winners & losers)

• Thresholding the FDR
= "When you see a winning or losing goal on a variation, it's wrong less than 10% of the time."

1 winner or loser × 5% FDR = 0.05 false positives on average.
False discovery rates are useful despite the number of hypotheses
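For intuition, here is the classic Benjamini-Hochberg procedure for FDR control, as a minimal sketch (an assumption for illustration; Stats Engine's sequential method differs in the details):

```python
# Benjamini-Hochberg: sort p-values, compare the i-th smallest against
# q * i / m, and call everything up to the largest passing rank a discovery.
import numpy as np

def benjamini_hochberg(p_values, q=0.10):
    """Return a boolean mask of discoveries at FDR level q."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# 10 goals: two real effects (small p-values), eight noise goals.
p_values = [0.001, 0.008, 0.20, 0.35, 0.41, 0.55, 0.62, 0.74, 0.88, 0.95]
print(benjamini_hochberg(p_values, q=0.10))
# Only the first two are called winners/losers; the noisy goals stay inconclusive.
```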
What's the catch?
The more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusives.
Not quite: what matters is signal. The more low signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusives.
Recap
• False Positive Rate thresholding
-controls the chance of a false positive when you have a hypothesis with no effect
-misrepresents your error rate with multiple goals and variations
• False Discovery Rate thresholding
-controls the chance of a false positive when you have a winning or losing hypothesis
-is accurate regardless of how many hypotheses you run
-can take longer to reach significance with more low signal variations on goals
Tips & Tricks for running experiments with False Discovery Rates
• Ask: Which goal is most important to me?
-This should be my primary goal (not impacted by all other goals)
• Run large, or large multivariate tests without fear of finding spurious results, but be prepared for the cost of exploration
• A little human intuition and prior knowledge can go a long way towards reducing the runtime of your experiments
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
"3 Levers" of A/B Testing
1. Thresholding an error rate
• "I want no more than a 10% false discovery rate"
2. Detecting effect sizes (setting an MDE)
• "I'm OK with only detecting greater than 5% improvement"
3. Running tests longer
• "I can afford to run this test for 3 weeks, or 50,000 visitors"
Fundamental Tradeoff of A/B Testing

The three levers (error rates; runtime; effect size / baseline conversion rate) are all inversely related:

• At any number of visitors, the looser your error rate threshold, the smaller the effect sizes you can detect.
• At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.
• For any effect size, the lower the error rate you want, the longer you need to run your test.
What does this look like in practice?
Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

Significance threshold | Improvement (relative)
                       | 5%     | 10%    | 25%
95%                    | 62,400 | 13,500 | 1,800
90%                    | 59,100 | 12,800 | 1,700
80%                    | 52,600 | 11,400 | 1,500
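As a rough cross-check, the classical fixed-horizon sample size formula reproduces the table's orders of magnitude. This is a sketch under assumptions (a two-proportion z-test at 80% power); the table itself comes from Stats Engine's sequential test, so the numbers won't match exactly.

```python
# Classical fixed-horizon visitors-per-variation needed to detect a
# relative lift `mde` on baseline rate `p` at two-sided level `alpha`.
from scipy.stats import norm

def visitors(p, mde, alpha, power=0.80):
    p2 = p * (1 + mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p * (1 - p) + p2 * (1 - p2)
    return z**2 * variance / (p2 - p)**2

for threshold in (0.95, 0.90, 0.80):
    row = [round(visitors(0.10, mde, alpha=1 - threshold)) for mde in (0.05, 0.10, 0.25)]
    print(f"{threshold:.0%} threshold: {row}")
```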
All A/B Testing platforms address the fundamental tradeoff …
1. Choose a minimum detectable effect (MDE) and false positive rate threshold
2. Find the required minimum sample size with a sample size calculator
3. Wait until the minimum sample size is reached
4. Look at your results once and only once
… but Optimizely is the only platform that lets you pull the levers in real time.
Example (an A/B test on http://www.nba.com/):

In the beginning, we make an educated guess: a 5% error rate threshold, an expected +5% improvement on a 10% baseline conversion rate, and a runtime of about 52,600 visitors.

… but then the improvement turns out to be better (+13% on a 16% baseline), and the remaining runtime drops to about 1,600 visitors, instead of 52,600 - 7,200 = 45,400 …

… or a lot worse (+2% on an 8% baseline), and the remaining runtime grows to more than 100,000 visitors.
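The swing in remaining runtime is the sample size formula at work: required visitors are extremely sensitive to the true effect size and baseline. A sketch with the same classical formula as above (this is the underlying intuition, not Stats Engine's actual algorithm, so the numbers differ from the slide's):

```python
# Required visitors under three scenarios: the planned guess, a bigger
# effect than expected, and a much smaller one.
from scipy.stats import norm

def visitors(p, mde, alpha=0.05, power=0.80):
    p2 = p * (1 + mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p * (1 - p) + p2 * (1 - p2)
    return z**2 * variance / (p2 - p)**2

print(round(visitors(0.10, 0.05)))  # planned: +5% on 10%  -> ~58,000
print(round(visitors(0.16, 0.13)))  # better:  +13% on 16% -> ~5,100
print(round(visitors(0.08, 0.02)))  # worse:   +2% on 8%   -> ~455,000
```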
Recap
• The Fundamental Tradeoff of A/B Testing affects you no matter what testing platform you use.
-If you want to detect a 5% Improvement on a 10% baseline conversion rate, you should be prepared to wait for at least 50,000 visitors
• Optimizely’s Stats Engine is the only platform that allows you to adjust the trade-off in real time while still reporting valid error rates
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Answer: All three are inversely related. For example, running my tests longer can get me lower error rates, or let me detect smaller effects.
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Statistics in 40 Minutes: A/B Testing FundamentalsLeo PekelisStatistician, Optimizely
#opticon2015
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Definition:
A confidence interval is a range of values for your metric (revenue, conversion rate, etc.) that is 90% likely to contain the true difference between your variation and baseline.

Best case: 15.41 | Middle ground: 11.4 | Worst case: 7.29

This is true regardless of your significance.
We can't wait for significance; the confidence interval tells us what we need to know.
A confidence interval is the mirror image of statistical significance
Mathematical Definition:
The set of parameter values X such that a hypothesis test with null hypothesis H0: "Removing a distracting header will result in X more revenue per visitor" is not yet rejected.
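A minimal sketch of computing such an interval, assuming a normal approximation for the difference in revenue per visitor; the data here is simulated, so the numbers won't match the 7.29 / 11.4 / 15.41 on the slide.

```python
# 90% confidence interval for the difference in revenue per visitor
# (variation minus control), via the normal approximation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
control = rng.exponential(scale=50, size=8000)    # made-up revenue data
variation = rng.exponential(scale=61, size=8000)

diff = variation.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / len(control)
             + variation.var(ddof=1) / len(variation))
z = norm.ppf(0.95)  # two-sided 90% interval
print(f"Difference: {diff:.2f}, 90% CI: [{diff - z*se:.2f}, {diff + z*se:.2f}]")
```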
Error rate 3: False negative rate
• False negative rate (Type II error)
= "Rate of false negatives from all variations with an improvement on a goal"
= #(False negatives) / #(Improvements)

• Thresholding the FNR
= "When you have a goal on a variation with a real effect, you miss it less than 10% of the time."
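As with the false positive rate, a simulation sketch shows the idea; the sample size and effect here are made up, and at these settings the miss rate is high, which is the fundamental tradeoff again (more visitors would lower it).

```python
# When the variation has a real +10% relative lift, how often does a
# fixed-horizon test at p < .05 miss it (stay inconclusive)?
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
misses, n_tests, n, p_base = 0, 1000, 5000, 0.10

for _ in range(n_tests):
    control = rng.binomial(1, p_base, n)
    variation = rng.binomial(1, p_base * 1.10, n)  # a true improvement
    _, p_value = ttest_ind(control, variation)
    misses += p_value >= 0.05                      # inconclusive = miss

print(f"False negative rate: {misses / n_tests:.2f}")  # ~0.6 at these settings
```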