Statistics in 40 Minutes: A/B Testing Fundamentals
Leo Pekelis, Statistician, Optimizely
@lpekelis | [email protected] | #opticon2015
Aug 12, 2015
You have your own unique approach to A/B Testing
The goal of this talk is to break down A/B Testing to its fundamentals.
A/B Testing Platform:
1) Create an experiment
2) Read the results page
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?

The answers:
1. A good hypothesis has a variation and a clearly defined goal, both crafted ahead of time.
2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low signal goals.
3. All three levers are inversely related. For example, running my tests longer can get me lower error rates, or let me detect smaller effects.
First, some vocabulary (yay!)
• Control and Variation: The control is the original, or baseline, version of content that you test against a variation.
• Goal: The metric used to measure the impact of the control and variation.
• Baseline conversion rate: The control group's expected conversion rate.
• Effect size: The improvement (positive or negative) of your variation over baseline.
• Sample size: The number of visitors in your test.
• Hypothesis test: A control and variation pair that you want to show improves a goal.
• Experiment: A collection of hypotheses (goal & variation pairs) that all share the same control.
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
What is a good hypothesis (test)?
Why is this not actionable? "I think changing the header image will make my site better."
• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase "the finals" clicks?
• Growing the header will increase engagement?
• Growing the header will increase "the finals" clicks?
• …
Bad hypothesis: test creep!
Bad hypothesis: "I think changing the header image will make my site better."
Good hypotheses: the organized and clear list above.
Hypotheses also give the cost of your experiment: the more relationships (hypotheses) you test, the longer (in visitors) it will take to achieve the same outcome (error rate). The sketch below makes that cost concrete.
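A minimal sketch of why, assuming a Bonferroni-style split of the error budget across hypotheses (an illustration only; Optimizely's Stats Engine handles multiplicity differently, via false discovery rates, covered later in this talk):

```python
# Keeping a 5% overall false positive rate across m hypotheses forces
# each individual test to clear a much stricter significance bar
# (Bonferroni split; an assumption for illustration).
for m in (1, 5, 20):
    print(f"{m:>2} hypotheses -> per-test significance threshold {0.05 / m:.4f}")
# Stricter thresholds take more visitors to reach, so runtime grows.
```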
Questions to check for a good hypothesis:
• What are you trying to show with your idea?
• What key metrics should it drive?
• Are all my goals and variations necessary given my testing limits?
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
Answer: A good hypothesis has a variation and a clearly defined goal, both crafted ahead of time.
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
[Example: an A/B test on http://www.nba.com/]
What are the possible outcomes?
"True" value of hypothesis (columns) vs. result of test (rows):

Result of test   | Improvement    | No effect
Winner / Loser   | True positive  | False positive
Inconclusive     | False negative | True negative

The four possible outcomes:
• (no effect, winner / loser): a false positive
• (+/- improvement, inconclusive): a false negative
• (+/- improvement, winner / loser): a true positive
• (no effect, inconclusive): a true negative
The 2x2 table will help us to:
1. Keep track of the different error rates we care about
2. Explore the consequences of controlling false positives vs. false discoveries
Error rate 1: False positive rate
• False positive rate (Type I error)
= "Chance of a false positive from a variation with no effect on a goal"
= #(False positives) / #(No effect)

• Thresholding the FPR
= "When I have a variation with no effect on a goal, I'll find an effect less than 10% of the time."
How can we ever compute a False Positive Rate if we don’t know whether a hypothesis is true or not?
Statistical tests (the fixed horizon t-test, Stats Engine) are designed to threshold an error rate. Example: "Calling winners & losers when a p-value is below .05 will guarantee a False Positive Rate below 5%."
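A quick simulation makes the guarantee concrete. This is a sketch, not the talk's method: it assumes a fixed-horizon t-test on simulated conversion data, with made-up sample sizes and rates.

```python
# Under a true null (the variation has no effect), calling winners and
# losers at p < .05 produces false positives about 5% of the time.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_tests, n_visitors, p_base = 2000, 5000, 0.10
false_positives = 0

for _ in range(n_tests):
    control = rng.binomial(1, p_base, n_visitors)
    variation = rng.binomial(1, p_base, n_visitors)  # truly no effect
    _, p_value = ttest_ind(control, variation)
    false_positives += p_value < 0.05

print(f"False positive rate: {false_positives / n_tests:.3f}")  # ~0.050
```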
False Positive Rates with multiple tests
https://xkcd.com/882/
What happened?
21 tests × 5% FPR ≈ 1 false positive on average.
False positive rates are only useful in the context of all hypotheses
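The jelly-bean arithmetic, spelled out (assuming independent tests for simplicity):

```python
# With 21 independent null tests at a 5% false positive rate each, you
# expect about one false positive, and the chance of seeing at least
# one is roughly two in three.
m, alpha = 21, 0.05
print("Expected false positives:", m * alpha)    # 1.05
print("P(at least one):", 1 - (1 - alpha) ** m)  # ~0.659
```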
Error rate 2: False discovery rate
• False discovery rate (FDR)
= "Chance of a false positive from a conclusive result"
= #(False positives) / #(Winners & losers)

• Thresholding the FDR
= "When you see a winning or losing goal on a variation, it's wrong less than 10% of the time."

1 winner or loser × 5% FDR = 0.05 false positives on average.
False discovery rates are useful despite the number of hypotheses
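For intuition, here is the classic Benjamini-Hochberg procedure for FDR control, as a minimal sketch (an assumption for illustration; Stats Engine's sequential method differs in the details):

```python
# Benjamini-Hochberg: sort p-values, compare the i-th smallest against
# q * i / m, and call everything up to the largest passing rank a discovery.
import numpy as np

def benjamini_hochberg(p_values, q=0.10):
    """Return a boolean mask of discoveries at FDR level q."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# 10 goals: two real effects (small p-values), eight noise goals.
p_values = [0.001, 0.008, 0.20, 0.35, 0.41, 0.55, 0.62, 0.74, 0.88, 0.95]
print(benjamini_hochberg(p_values, q=0.10))
# Only the first two are called winners/losers; the noisy goals stay inconclusive.
```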
What's the catch?
The more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusives.
Not quite: what matters is signal. The more low signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusives.
Recap
• False Positive Rate thresholding
-controls the chance of a false positive when you have a hypothesis with no effect
-misrepresents your error rate with multiple goals and variations
• False Discovery Rate thresholding
-controls the chance of a false positive when you have a winning or losing hypothesis
-is accurate regardless of how many hypotheses you run
-can take longer to reach significance with more low signal variations on goals
Tips & Tricks for running experiments with False Discovery Rates
• Ask: Which goal is most important to me?
-This should be my primary goal (not impacted by all other goals)
• Run large, or large multivariate tests without fear of finding spurious results, but be prepared for the cost of exploration
• A little human intuition and prior knowledge can go a long way towards reducing the runtime of your experiments
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
"3 Levers" of A/B Testing
1. Thresholding an error rate
• "I want no more than a 10% false discovery rate"
2. Detecting effect sizes (setting an MDE)
• "I'm OK with only detecting greater than 5% improvement"
3. Running tests longer
• "I can afford to run this test for 3 weeks, or 50,000 visitors"
Fundamental Tradeoff of A/B Testing

The three levers (error rates; runtime; effect size / baseline conversion rate) are all inversely related:

• At any number of visitors, the looser your error rate threshold, the smaller the effect sizes you can detect.
• At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.
• For any effect size, the lower the error rate you want, the longer you need to run your test.
What does this look like in practice?
Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

Significance threshold | Improvement (relative)
                       | 5%     | 10%    | 25%
95%                    | 62,400 | 13,500 | 1,800
90%                    | 59,100 | 12,800 | 1,700
80%                    | 52,600 | 11,400 | 1,500
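As a rough cross-check, the classical fixed-horizon sample size formula reproduces the table's orders of magnitude. This is a sketch under assumptions (a two-proportion z-test at 80% power); the table itself comes from Stats Engine's sequential test, so the numbers won't match exactly.

```python
# Classical fixed-horizon visitors-per-variation needed to detect a
# relative lift `mde` on baseline rate `p` at two-sided level `alpha`.
from scipy.stats import norm

def visitors(p, mde, alpha, power=0.80):
    p2 = p * (1 + mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p * (1 - p) + p2 * (1 - p2)
    return z**2 * variance / (p2 - p)**2

for threshold in (0.95, 0.90, 0.80):
    row = [round(visitors(0.10, mde, alpha=1 - threshold)) for mde in (0.05, 0.10, 0.25)]
    print(f"{threshold:.0%} threshold: {row}")
```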
All A/B Testing platforms address the fundamental tradeoff …
1. Choose a minimum detectable effect (MDE) and false positive rate threshold
2. Find the required minimum sample size with a sample size calculator
3. Wait until the minimum sample size is reached
4. Look at your results once and only once
… but Optimizely is the only platform that lets you pull the levers in real time.
Example (an A/B test on http://www.nba.com/):

In the beginning, we make an educated guess: a 5% error rate threshold, an expected +5% improvement on a 10% baseline conversion rate, and a runtime of about 52,600 visitors.

… but then the improvement turns out to be better (+13% on a 16% baseline), and the remaining runtime drops to about 1,600 visitors, instead of 52,600 - 7,200 = 45,400 …

… or a lot worse (+2% on an 8% baseline), and the remaining runtime grows to more than 100,000 visitors.
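The swing in remaining runtime is the sample size formula at work: required visitors are extremely sensitive to the true effect size and baseline. A sketch with the same classical formula as above (this is the underlying intuition, not Stats Engine's actual algorithm, so the numbers differ from the slide's):

```python
# Required visitors under three scenarios: the planned guess, a bigger
# effect than expected, and a much smaller one.
from scipy.stats import norm

def visitors(p, mde, alpha=0.05, power=0.80):
    p2 = p * (1 + mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p * (1 - p) + p2 * (1 - p2)
    return z**2 * variance / (p2 - p)**2

print(round(visitors(0.10, 0.05)))  # planned: +5% on 10%  -> ~58,000
print(round(visitors(0.16, 0.13)))  # better:  +13% on 16% -> ~5,100
print(round(visitors(0.08, 0.02)))  # worse:   +2% on 8%   -> ~455,000
```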
Recap
• The Fundamental Tradeoff of A/B Testing affects you no matter what testing platform you use.
-If you want to detect a 5% Improvement on a 10% baseline conversion rate, you should be prepared to wait for at least 50,000 visitors
• Optimizely’s Stats Engine is the only platform that allows you to adjust the trade-off in real time while still reporting valid error rates
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Answer: All three are inversely related. For example, running my tests longer can get me lower error rates, or let me detect smaller effects.
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Statistics in 40 Minutes: A/B Testing FundamentalsLeo PekelisStatistician, Optimizely
#opticon2015
Outcomes & Error Rates
Fundamental Tradeoff
Confidence Intervals
Hypotheses
XX X
“A/B Testing Playbook”Opening
Mid-game
Mid-game
Closing
Definition:
A confidence interval is a range of values for your metric (revenue, conversion rate, etc.) that is 90% likely to contain the true difference between your variation and baseline.

Best case: 15.41 | Middle ground: 11.4 | Worst case: 7.29

This is true regardless of your significance.
We can't wait for significance; the confidence interval tells us what we need to know.
A confidence interval is the mirror image of statistical significance
Mathematical Definition:
The set of parameter values X such that a hypothesis test with null hypothesis H0: "Removing a distracting header will result in X more revenue per visitor" is not yet rejected.
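A minimal sketch of computing such an interval, assuming a normal approximation for the difference in revenue per visitor; the data here is simulated, so the numbers won't match the 7.29 / 11.4 / 15.41 on the slide.

```python
# 90% confidence interval for the difference in revenue per visitor
# (variation minus control), via the normal approximation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
control = rng.exponential(scale=50, size=8000)    # made-up revenue data
variation = rng.exponential(scale=61, size=8000)

diff = variation.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / len(control)
             + variation.var(ddof=1) / len(variation))
z = norm.ppf(0.95)  # two-sided 90% interval
print(f"Difference: {diff:.2f}, 90% CI: [{diff - z*se:.2f}, {diff + z*se:.2f}]")
```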
Error rate 3: False negative rate
• False negative rate (Type II error)
= "Rate of false negatives from all variations with an improvement on a goal"
= #(False negatives) / #(Improvements)

• Thresholding the FNR
= "When you have a goal on a variation with a real effect, you miss it less than 10% of the time."
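As with the false positive rate, a simulation sketch shows the idea; the sample size and effect here are made up, and at these settings the miss rate is high, which is the fundamental tradeoff again (more visitors would lower it).

```python
# When the variation has a real +10% relative lift, how often does a
# fixed-horizon test at p < .05 miss it (stay inconclusive)?
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
misses, n_tests, n, p_base = 0, 1000, 5000, 0.10

for _ in range(n_tests):
    control = rng.binomial(1, p_base, n)
    variation = rng.binomial(1, p_base * 1.10, n)  # a true improvement
    _, p_value = ttest_ind(control, variation)
    misses += p_value >= 0.05                      # inconclusive = miss

print(f"False negative rate: {misses / n_tests:.2f}")  # ~0.6 at these settings
```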