Tuning tie-breaker experiments 1
Tuning the tie-breaker design
Art B. Owen∗, Stanford University
and
Hal Varian, Google
∗Work (mostly) done for Google, not as part of my Stanford responsibilities.
Stanford statistics seminar
Overview
Lots of problems come up when you combine data of different types.
From 40,000 feet
• Bayes
• Likelihood
• Empirical Bayes
• Transportability
At ground level
Specifics are interesting.
First I sketch some recent examples.
Then the work with Hal Varian.
Big data & small data
With Aiyou Chen and Minghui Shi of Google.
Small and good data set
(xi, yi), i ∈ S, n obs
We want β for this small population.
Huge data set, possibly relevant
(xi, yi), i ∈ B, N ≫ n obs
Approach
Shrink β̂S towards β̂B
Stein or Bayes
Related GWAS
With Edgar Dobriban, Stuart Kim, Kristen Fortney
• Tiny underpowered GWAS on centenarians
• Seek optimal weighting of the SNP hypotheses (inverse weight the p-values)
• Using huge GWAS on age-related illness (e.g., diabetes, hypertension)
We got some new longevity-associated genes.
Partial conjunction tests
Concept from Benjamini & Heller
Test the same H0 on n data sets. Require at least r rejections.
Better reproducibility than meta-analysis.
2 papers led by Jingshu Wang
Paper 1
Conditions for admissible testing of a weirdly composite “sparsity null”.
Wang & O (2018) JASA
Paper 2
N genes in n studies
An N × n matrix of p-values
Filtering idea to do N PC tests at once.
Wang, Su, Sabatti, O (2018)
Propensity work
With Evan Rosenman, Michael Baiocchi and Hailey Banack (2018)
Does W ∈ {0, 1} cause y?
Huge data base (Wi, xi, yi) for i ∈ Obs.
Wi chosen in a way that could depend on xi
Small randomized experiment (Wi, xi, yi) for i ∈ Expt.
Wi chosen at random
First idea
Put experimental subjects into a propensity bucket.
The one they would have occupied in the observational data.
Women’s health initiative
Both kinds of data on hormone replacement vs coronary heart disease.
Hal Varian
Google chief economist
Customer loyalty plans
An airline can give an upgrade to n out of N customers. Who?
• The n most loyal customers?
• The n customers most likely to start flying / spending more?
Other examples
• Hotels & car rental companies
• E-commerce platforms, for their advertisers, reviewers, or content producers
Two goals
1) Get the most value from the offer
2) Measure the causal effect of the offer
Two acronyms
1) RDD = Regression Discontinuity Design
2) RCT = Randomized Controlled Trial
We will hybridize between these approaches.
The random variables
i customer id
zi treatment, YES = 1, NO = −1
yi outcome, e.g., revenue one year later (or profit, or ...)
xi assignment variable (the larger the better)
Assignment / running variable x
1) It could be past revenue, or
2) a machine learning prediction.
Some simplifications
Suppose at first that half of the zi are 1 and half are −1.
(undo later)
Rank transformation
Sort customers, $x_1 \le x_2 \le \cdots \le x_N$, then re-define
$$x_i \leftarrow \frac{2i - N - 1}{N}.$$
Now $-1 < x_i < 1$.
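The rank transformation can be sketched in a few lines of Python. This is only an illustration: the function name `rank_transform`, the sample values, and the choice to break ties by input order are assumptions, not part of the talk.

```python
def rank_transform(values):
    """Map raw running-variable values to x_i = (2i - N - 1) / N, with i the rank (1..N)."""
    n = len(values)
    # Customer indices sorted by raw value; ties keep their input order (an assumption).
    order = sorted(range(n), key=lambda k: values[k])
    x = [0.0] * n
    for rank, k in enumerate(order, start=1):
        x[k] = (2 * rank - n - 1) / n
    return x


past_revenue = [10.0, 3.0, 7.0, 5.0]     # hypothetical raw scores
print(rank_transform(past_revenue))      # [0.75, -0.75, 0.25, -0.25]
```

The transformed values are equally spaced, symmetric about 0, and strictly inside (−1, 1), which is what makes the later uniform-integral approximations reasonable.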
Two-line regression
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma^2)$$
Other models are interesting, but we need to pick one, so this is it.
Regression discontinuity
Treatment IFF x > 0
Thistlethwaite & Campbell (1960)
[Figure: scatter plot of response (about 8 to 13) against the running variable (−1.0 to 1.0).]
People just left of the discontinuity should be comparable to those just right of it.
Separate linear regressions
[Figure: "Regression discontinuity"; response (about 8 to 13) against the running variable (−1.0 to 1.0).]
Raises thorny extrapolation/linearity issues at large |xi|.
Regression discontinuity
Famous example:
x = test score
z = merit scholarship iff x > τ
y = went to grad school
then logistic regression.
RDD is the second most believable causal inference method.
Tie-breaker design
Pick cutoffs $A \le B$, then
$$z_i = \begin{cases} 1, & x_i \ge B \\ -1, & x_i \le A \\ \text{random}, & A < x_i < B \end{cases}$$
[Figure: the x axis from −1.0 to 1.0 with cutoffs A and B marked; NO to the left of A, 50 : 50 between A and B, YES to the right of B.]
Extreme cases
1) $x_1 < A = B < x_N \implies$ RDD
2) $A < x_1 \le \cdots \le x_N < B \implies$ RCT
Also called "cutoff designs" (Cappelleri & Trochim)
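The assignment rule is simple enough to write down directly. The sketch below assumes a fair coin inside (A, B), as in the 50 : 50 region of the diagram; the function name `tie_breaker_assign` and the example cutoffs are illustrative choices, not taken from the talk.

```python
import random


def tie_breaker_assign(x, a, b, rng=None):
    """Return z in {-1, +1} for each x: +1 if x >= b, -1 if x <= a, fair coin otherwise."""
    if a > b:
        raise ValueError("need a <= b")
    rng = rng or random.Random()
    z = []
    for xi in x:
        if xi >= b:
            z.append(1)
        elif xi <= a:
            z.append(-1)
        else:
            z.append(rng.choice((-1, 1)))    # 50:50 inside (a, b)
    return z


x = [(2 * i - 10 - 1) / 10 for i in range(1, 11)]           # rank-transformed scores, N = 10
print(tie_breaker_assign(x, 0.0, 0.0))                       # a == b: pure RDD, no randomness
print(tie_breaker_assign(x, -0.5, 0.5, random.Random(1)))    # hybrid: only the middle is randomized
```

Setting both cutoffs outside the range of x would randomize everyone, which is the RCT extreme.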
Examples
x                      z                              Ref
Reading ability        Remedial English class         Aiken et al. (1998)
Student ranking        Post secondary financial aid   Angrist et al. (2014)
Composite prognostic   Inpatient rehab                Havassy
Lanarkshire milk experiment
Student (1931)
Maybe a tie-breaker would have worked.
Tie-breakers
∆ = fraction in the RCT, between the blue dashed lines
[Figure: four scatter plots of outcome against x on (−1.0, 1.0), with panel titles "RCT: Delta = 1", "RDD: Delta = 0", "Delta = 1/3" and "Delta = 2/3".]
Two-line regression
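$$E(y) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz$$
$$X = \begin{pmatrix} 1 & x_1 & z_1 & x_1 z_1 \\ 1 & x_2 & z_2 & x_2 z_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_N & z_N & x_N z_N \end{pmatrix}$$
$$\mathrm{Var}(\hat\beta) = (X^T X)^{-1} \sigma^2$$
$$\Pr(z_i = 1) = \begin{cases} 0, & x_i \le -\Delta \\ 1/2, & |x_i| < \Delta \\ 1, & x_i \ge \Delta \end{cases}$$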
Integral approximation
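With rows and columns in the order 1, x, z, xz:
$$\frac{1}{N} X^T X \approx \begin{pmatrix} 1 & 0 & 0 & \phi(\Delta) \\ 0 & 1/3 & \phi(\Delta) & 0 \\ 0 & \phi(\Delta) & 1 & 0 \\ \phi(\Delta) & 0 & 0 & 1/3 \end{pmatrix}$$
where
$$\phi(\Delta) \equiv \frac12 \int_{-1}^{1} x\,E(z \mid x)\,dx = \frac12 \int_{-1}^{-\Delta} (-x)\,dx + \frac12 \int_{-\Delta}^{\Delta} 0\,dx + \frac12 \int_{\Delta}^{1} x\,dx = \frac{1-\Delta^2}{2}$$
The error above is $O_p(1/\sqrt{N})$.
Even less under stratification.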
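A quick simulation gives a feel for how close the approximation is. The sketch below builds one symmetric tie-breaker design on rank-transformed x, forms X^T X / N with columns ordered 1, x, z, xz, and compares it entrywise with the matrix above. The sample size, seed, and function names are arbitrary choices for illustration.

```python
import random


def phi(delta):
    """phi(Delta) = (1 - Delta^2) / 2 from the integral above."""
    return (1 - delta ** 2) / 2


def gram_matrix(n, delta, rng):
    """Simulate one symmetric tie-breaker design and return X^T X / N (4 x 4)."""
    sums = [[0.0] * 4 for _ in range(4)]
    for i in range(1, n + 1):
        x = (2 * i - n - 1) / n                  # rank-transformed running variable
        if x >= delta:
            z = 1
        elif x <= -delta:
            z = -1
        else:
            z = rng.choice((-1, 1))              # randomized middle
        row = (1.0, x, z, x * z)                 # columns: 1, x, z, xz
        for r in range(4):
            for c in range(4):
                sums[r][c] += row[r] * row[c]
    return [[s / n for s in line] for line in sums]


def approx_matrix(delta):
    """Integral approximation to X^T X / N, same column order."""
    p = phi(delta)
    return [[1, 0, 0, p], [0, 1 / 3, p, 0], [0, p, 1, 0], [p, 0, 0, 1 / 3]]


delta = 0.5
exact = gram_matrix(20000, delta, random.Random(0))
approx = approx_matrix(delta)
worst = max(abs(exact[r][c] - approx[r][c]) for r in range(4) for c in range(4))
print(f"largest entrywise gap: {worst:.4f}")
```

The largest gap typically comes from the entries that depend on the coin flips (for example the average of z), which is where the $1/\sqrt{N}$ rate shows up.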
Rearrange XTX/N
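With rows and columns in the order 1, zx, z, x (using · for 0):
$$\frac{1}{N} X^T X \approx \begin{pmatrix} 1 & \phi & \cdot & \cdot \\ \phi & 1/3 & \cdot & \cdot \\ \cdot & \cdot & 1 & \phi \\ \cdot & \cdot & \phi & 1/3 \end{pmatrix}$$
$$N \times \mathrm{Var}\begin{pmatrix} \hat\beta_0 \\ \hat\beta_3 \\ \hat\beta_2 \\ \hat\beta_1 \end{pmatrix} = \frac{1}{1/3 - \phi^2} \begin{pmatrix} 1/3 & -\phi & \cdot & \cdot \\ -\phi & 1 & \cdot & \cdot \\ \cdot & \cdot & 1/3 & -\phi \\ \cdot & \cdot & -\phi & 1 \end{pmatrix} \sigma^2$$
$$\phi = \phi(\Delta) = \frac{1-\Delta^2}{2}$$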
Normalization
The design choice is which ∆ to use.
That comes down to
$$\frac{\mathrm{Var}(c^T\hat\beta;\, \Delta_1)}{\mathrm{Var}(c^T\hat\beta;\, \Delta_0)}$$
for various vectors c.
Cancellation
σ² cancels in this ratio.
So we fix σ² = 1.
Impact
Changing z from −1 to +1 increases E(y) by
$$(\beta_0 + \beta_1 x + \beta_2 + \beta_3 x) - (\beta_0 + \beta_1 x - \beta_2 - \beta_3 x) = 2(\beta_2 + x\beta_3)$$
So β2 and β3 are important.
So is x.
(If we didn’t already know)
Variance
$$N \times \mathrm{Var}\bigl(2(\hat\beta_2 + x\hat\beta_3)\bigr) = \cdots = \frac{16(1 + 3x^2)}{1 + 3\Delta^2(2 - \Delta^2)}$$
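The steps hidden in the "· · ·" can be checked numerically. The sketch below, with σ² = 1 and everything scaled by N, assembles the variance from Var(β̂2) and Var(β̂3) in the two 2 × 2 blocks (their covariance is 0 because they sit in different blocks) and compares it with the closed form. Function names are illustrative.

```python
def effect_variance(x, delta):
    """N * Var(2 * (beta2_hat + x * beta3_hat)) with sigma^2 = 1, closed form."""
    return 16 * (1 + 3 * x ** 2) / (1 + 3 * delta ** 2 * (2 - delta ** 2))


def effect_variance_from_blocks(x, delta):
    """Same quantity built from N * Var(beta2_hat) and N * Var(beta3_hat)."""
    phi = (1 - delta ** 2) / 2
    denom = 1 / 3 - phi ** 2               # determinant of the block [[1, phi], [phi, 1/3]]
    var_b2, var_b3 = (1 / 3) / denom, 1 / denom
    return 4 * (var_b2 + x ** 2 * var_b3)


for delta in (0.0, 1 / 3, 2 / 3, 1.0):
    print(round(delta, 3), [round(effect_variance(x, delta), 2) for x in (0.0, 0.5, 1.0)])
```

At Δ = 0 the block variances are 4 and 12, and at Δ = 1 they are 1 and 3, so the RDD costs a factor of 4 in variance at every x under this model.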
Variance vs ∆
[Figure: "Variance vs Delta" (0 = regression discontinuity, 1 = experiment); N × variance (about 0 to 12) against Delta on (0, 1), with one curve for the coefficient of z and one for the coefficient of z*x.]
Var(β̂3) = 3 Var(β̂2) for all ∆.
RCT vs RDD
Method                     ∆    Var(β̂2)   Var(β̂3)
Regression discontinuity   0    4/N        12/N
Experiment                 1    1/N        3/N
An RDD with N observations is as good as an RCT with N/4 observations.
Section 6 of Jacob, Zhu, Somers & Bloom (2012) has this and more observations.
Var(2(β̂2 + xβ̂3))
[Figure: "Variance of treatment effect vs x, linear regression"; N × variance (ticks at 5, 10, 20, 50) against target location x on (0, 1). Top curve = regression discontinuity (Delta = 0), bottom curve = experiment (Delta = 1), step size 0.1.]
The worst RCT (at x = 1) is as good as the best RDD (at x = 0): N × Var = 16 in both cases.
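Gain from the sample
The expected payoff per customer in the data set is
$$\frac{1}{N} \sum_{i=1}^{N} \bigl(\beta_0 + \beta_1 x_i + \beta_2 E(z_i) + \beta_3 x_i E(z_i)\bigr)$$
$$E(z_i) = \begin{cases} -1, & x_i < -\Delta \\ 0, & |x_i| \le \Delta \\ 1, & x_i > \Delta \end{cases}$$
So plan with
$$g(\Delta) \equiv \frac12 \int_{-1}^{-\Delta} (\beta_0 + \beta_1 x - \beta_2 - \beta_3 x)\,dx + \frac12 \int_{-\Delta}^{\Delta} (\beta_0 + \beta_1 x)\,dx + \frac12 \int_{\Delta}^{1} (\beta_0 + \beta_1 x + \beta_2 + \beta_3 x)\,dx = \beta_0 + \beta_3 (1 - \Delta^2)/2.$$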
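As a sanity check on the integral, the average payoff over rank-transformed customers can be compared with the closed form for g(∆). The coefficients below are made-up numbers used only for illustration, and the function names are likewise hypothetical.

```python
def expected_z(x, delta):
    """E(z | x) for the symmetric tie-breaker: -1 below -Delta, 0 in the middle, +1 above."""
    if x > delta:
        return 1.0
    if x < -delta:
        return -1.0
    return 0.0


def gain_closed_form(beta, delta):
    """g(Delta) = beta0 + beta3 * (1 - Delta^2) / 2."""
    b0, b1, b2, b3 = beta
    return b0 + b3 * (1 - delta ** 2) / 2


def gain_by_averaging(beta, delta, n=100000):
    """Average expected payoff over x_i = (2i - n - 1) / n, i = 1..n."""
    b0, b1, b2, b3 = beta
    total = 0.0
    for i in range(1, n + 1):
        x = (2 * i - n - 1) / n
        ez = expected_z(x, delta)
        total += b0 + b1 * x + b2 * ez + b3 * x * ez
    return total / n


beta = (10.0, 1.0, 0.5, 0.8)     # hypothetical coefficients, for illustration only
for delta in (0.0, 0.5, 1.0):
    print(delta, gain_closed_form(beta, delta), round(gain_by_averaging(beta, delta), 4))
```

The β1 and β2 terms average out by symmetry, which is why only β0 and β3 survive in g(∆).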
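The tradeoff
Short term gain per customer
$$g(\Delta) = \beta_0 + \frac{\beta_3 (1 - \Delta^2)}{2}$$
Define the information gain per customer
$$\mathrm{info}(\Delta) \equiv \frac{1}{N\,\mathrm{Var}(\hat\beta_3)} = \frac13 - \frac{(1 - \Delta^2)^2}{4}$$
Balance
$$v(\Delta) \equiv g(\Delta) + \lambda \times \mathrm{info}(\Delta) = \beta_0 + \beta_3 \frac{1 - \Delta^2}{2} + \lambda \Bigl(\frac13 - \frac{(1 - \Delta^2)^2}{4}\Bigr)$$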
NB: β0 does not affect our choice of ∆.
Optimal ∆
β3 is the coefficient of xizi
λ is the value of information
$$\Delta^* = \begin{cases} 1, & \beta_3/\lambda \le 0 \\ \sqrt{1 - \beta_3/\lambda}, & 0 \le \beta_3/\lambda \le 1 \\ 0, & 1 \le \beta_3/\lambda. \end{cases}$$
[Figure: optimal Delta (0 to 1) against present / future value (0 to 1.2).]
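The closed form for ∆* follows from maximizing v over u = 1 − ∆² in [0, 1], where v is a concave quadratic in u with stationary point u = β3/λ. The sketch below, which assumes λ > 0, codes the closed form and checks it against a brute-force grid search; all function names and the example values of β3 are illustrative.

```python
import math


def value(delta, beta3, lam, beta0=0.0):
    """v(Delta) = g(Delta) + lambda * info(Delta)."""
    u = 1 - delta ** 2
    return beta0 + beta3 * u / 2 + lam * (1 / 3 - u ** 2 / 4)


def optimal_delta(beta3, lam):
    """Closed-form maximizer of v over 0 <= Delta <= 1, assuming lam > 0."""
    ratio = beta3 / lam
    if ratio <= 0:
        return 1.0
    if ratio >= 1:
        return 0.0
    return math.sqrt(1 - ratio)


def grid_argmax(beta3, lam, steps=10000):
    """Brute-force check: best Delta on an even grid over [0, 1]."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda d: value(d, beta3, lam))


for beta3 in (-0.5, 0.25, 0.75, 1.5):
    print(beta3, round(optimal_delta(beta3, 1.0), 4), grid_argmax(beta3, 1.0))
```

Only the ratio β3/λ matters, which matches the single horizontal axis of the plot.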
Value of future information
It is really hard to quantify the value of that information.
Maybe harder than eliciting a prior.
Simpler approach
Let ∆0 be the smallest ∆ with efficiency ρ vs the RCT (∆ = 1).
We know that 1/4 ≤ ρ ≤ 1.
$$\rho = \frac{\mathrm{Var}\bigl(2(\hat\beta_2 + x\hat\beta_3) \mid \Delta = 1\bigr)}{\mathrm{Var}\bigl(2(\hat\beta_2 + x\hat\beta_3) \mid \Delta = \Delta_0\bigr)} = \cdots = \frac{1 + 3\Delta_0^2 (2 - \Delta_0^2)}{1 + 3(2 - 1)}$$
Solve a quadratic equation for $\Delta_0^2$
$$3\Delta_0^4 - 6\Delta_0^2 + 4\rho - 1 = 0 \implies \Delta_0 = \sqrt{1 - \sqrt{1 - (4\rho - 1)/3}}$$
Minimal ∆ for efficiency ρ
[Figure: minimal Delta (0 to 1) against efficiency demanded (0 to 1).]
ρ ∆0
0.99 0.94
0.9 0.80
0.8 0.70
0.7 0.61
0.6 0.52
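The table can be regenerated from the formula for ∆0. The sketch below is a direct transcription, with a round trip through the efficiency expression as a check; the function names are illustrative.

```python
import math


def efficiency(delta):
    """Efficiency of a tie-breaker relative to the RCT (Delta = 1)."""
    return (1 + 3 * delta ** 2 * (2 - delta ** 2)) / 4


def minimal_delta(rho):
    """Smallest Delta whose efficiency is at least rho, for 1/4 <= rho <= 1."""
    if not 0.25 <= rho <= 1:
        raise ValueError("rho must lie in [1/4, 1]")
    inner = max(0.0, 1 - (4 * rho - 1) / 3)      # guard against tiny negative rounding
    return math.sqrt(1 - math.sqrt(inner))


for rho in (0.99, 0.9, 0.8, 0.7, 0.6):
    print(rho, round(minimal_delta(rho), 2))
```

At ρ = 1/4 the function returns 0 (a pure RDD already achieves that efficiency) and at ρ = 1 it returns 1 (only the full RCT does).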
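Gaussian running variable
For $x_i = \Phi^{-1}\bigl((i - 1/2)/N\bigr)$, experiment on the central $\Delta N$ observations, then, with rows and columns in the order 1, zx, z, x,
$$\frac{1}{N} X^T X \approx \begin{pmatrix} 1 & \phi_G & 0 & 0 \\ \phi_G & 1 & 0 & 0 \\ 0 & 0 & 1 & \phi_G \\ 0 & 0 & \phi_G & 1 \end{pmatrix}$$
$$\phi_G = \mathrm{avg}(x_i z(x_i)) = \cdots = 2\varphi\Bigl(\Phi^{-1}\Bigl(\frac{1 + \Delta}{2}\Bigr)\Bigr)$$
After some algebra, the RCT efficiency vs RDD is
$$\frac{\pi}{\pi - 2} \doteq 2.75$$
Goldberger (1972).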
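The "some algebra" can be checked with the standard library. Inverting the 2 × 2 block [[1, φG], [φG, 1]] gives N·Var(β̂3) = 1/(1 − φG²), and the ratio of the ∆ = 0 and ∆ = 1 values should be π/(π − 2). The sketch below uses `statistics.NormalDist` for Φ⁻¹ and ϕ; the function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()    # standard normal from the Python standard library


def phi_gauss(delta):
    """phi_G = 2 * pdf(Phi^{-1}((1 + Delta) / 2)) for 0 <= Delta <= 1."""
    if delta >= 1:
        return 0.0                       # everyone randomized, so avg(x z) has mean 0
    cut = STD_NORMAL.inv_cdf((1 + delta) / 2)
    return 2 * STD_NORMAL.pdf(cut)


def scaled_var_beta3(delta):
    """N * Var(beta3_hat) = 1 / (1 - phi_G^2) with sigma^2 = 1."""
    return 1 / (1 - phi_gauss(delta) ** 2)


ratio = scaled_var_beta3(0.0) / scaled_var_beta3(1.0)
print(round(ratio, 3), round(math.pi / (math.pi - 2), 3))
```

The ∆ = 1 case is handled separately because `inv_cdf` is only defined for probabilities strictly between 0 and 1.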
Carpentry
We don’t have to keep p(x) ≡ Pr(Z = 1 | x) ∈ {0, 1/2, 1}.
[Figure: P(Z = 1 | x) (0 to 1) against the running variable x (−1.0 to 1.0).]
Carpentry doesn’t really help
(or hurt).
Carpentry ctd
Under symmetry,
$$p(-x) = 1 - p(x),$$
the shape of the curve doesn’t matter, only
$$\overline{zx} \equiv \frac12 \int_{-1}^{1} x\,E(Z \mid x)\,dx = \frac12 \int_{-1}^{1} x\,(2p(x) - 1)\,dx > 0$$
short term gain is
$$\beta_0 + \beta_3\,\overline{zx}$$
information proportional to
$$\frac13 - \overline{zx}^{\,2}$$
Asymmetric p
Replace by a symmetric one. That reduces diag(Var(β̂)), keeping the gain the same.
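Two quadratics
$$E(Y) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz + \beta_4 x^2 + \beta_5 x^2 z$$
With rows and columns in the order 1, zx, x², z, x, zx² (using · for 0):
$$\frac{1}{N} X^T X \doteq \begin{pmatrix} 1 & \phi_1 & 1/3 & \cdot & \cdot & \cdot \\ \phi_1 & 1/3 & \phi_3 & \cdot & \cdot & \cdot \\ 1/3 & \phi_3 & 1/5 & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 & \phi_1 & 1/3 \\ \cdot & \cdot & \cdot & \phi_1 & 1/3 & \phi_3 \\ \cdot & \cdot & \cdot & 1/3 & \phi_3 & 1/5 \end{pmatrix}$$
$$\phi_1 = (1 - \Delta^2)/2 \qquad \phi_3 = (1 - \Delta^4)/4$$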
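Because the two 3 × 3 blocks are the same matrix, one inversion is enough to get the variance of the quadratic-model treatment effect 2(β2 + xβ3 + x²β5). The sketch below, with σ² = 1 and everything scaled by N, uses a small hand-written Gauss-Jordan inverse so that it needs no external libraries; the helper names are illustrative.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(r == c) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quadratic_effect_variance(x, delta):
    """N * Var(2 * (b2 + x * b3 + x^2 * b5)) with sigma^2 = 1."""
    phi1 = (1 - delta ** 2) / 2
    phi3 = (1 - delta ** 4) / 4
    block = [[1, phi1, 1 / 3], [phi1, 1 / 3, phi3], [1 / 3, phi3, 1 / 5]]
    inv = invert(block)
    # Block order (1, zx, x^2) carries b3 at index 1; block order (z, x, zx^2)
    # carries b2 at index 0 and b5 at index 2. Cross-block covariances are 0.
    var_b3 = inv[1][1]
    var_b2, var_b5, cov_b2_b5 = inv[0][0], inv[2][2], inv[0][2]
    return 4 * (var_b2 + x ** 2 * var_b3 + x ** 4 * var_b5 + 2 * x ** 2 * cov_b2_b5)


for delta in (0.0, 0.5, 1.0):
    print(delta, [round(quadratic_effect_variance(x, delta), 1) for x in (0.0, 0.5, 1.0)])
```

At ∆ = 0 the block is the 3 × 3 Hilbert matrix, whose inverse has large entries; that ill-conditioning is one way to see why the RDD variance grows so quickly with x in the quadratic model.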
Treatment effect
$$E(y \mid x, z = 1) - E(y \mid x, z = -1) = 2(\beta_2 + x\beta_3 + x^2\beta_5)$$
[Figure: "Variance of treatment effect vs x, quadratic regression"; N × variance (ticks at 10, 50, 200, 1000) against target location x on (0, 1). Top curve = regression discontinuity (Delta = 0), bottom curve = experiment (Delta = 1), step size 0.1.]
Note log scale.
Gelman & Imbens (2017) warn against polynomial RDD.
More elaborate models
For a feature vector $F = F(x) \in \mathbb{R}^d$ including intercept
$$E(y) = F^T\beta + zF^T\gamma$$
take
$$z_i = \begin{cases} 1, & \theta^T F_i \ge \Delta \\ \text{random}, & |\theta^T F_i| < \Delta \\ -1, & \theta^T F_i \le -\Delta. \end{cases}$$
Now
$$X^T X = \begin{pmatrix} A & B \\ B & A \end{pmatrix}, \qquad A = \sum_i F_i F_i^T, \qquad B = \sum_i w_i F_i F_i^T,$$
for
$$w_i = E(z_i \mid F_i) = \begin{cases} 1, & \theta^T F_i \ge \Delta, \\ 2p - 1, & |\theta^T F_i| < \Delta, \\ -1, & \theta^T F_i \le -\Delta. \end{cases}$$
Inverting block matrices
$$\mathrm{Var}(\hat\gamma) = \mathrm{Var}(\hat\beta) = (A - BA^{-1}B)^{-1}\sigma^2$$
$$\mathrm{Cov}(\hat\gamma, \hat\beta) = -A^{-1}B\,(A - BA^{-1}B)^{-1}\sigma^2$$
We could pick θ, F and p by brute force search with Monte Carlo as an inner loop.
Here matrix algebra can replace the inner Monte Carlo.
Big ∆ better
For large enough ∆ we get $B = 0$.
Smaller ∆ raises $BA^{-1}B$ and hence $\mathrm{Var}(\hat\beta)$.
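To make the matrix-algebra shortcut concrete, the sketch below evaluates the block formula for the simplest case F(x) = (1, x) and θᵀF = x, so that γ = (β2, β3) in the two-line model. A and B are divided by N, so the result is N·Var(γ̂) with σ² = 1. The helper names and the 2 × 2 specialization are assumptions made to keep the example short.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mul2(m, k):
    """Product of two 2 x 2 matrices."""
    return [[sum(m[r][j] * k[j][c] for j in range(2)) for c in range(2)] for r in range(2)]


def scaled_var_gamma(n, delta, p=0.5):
    """N * Var(gamma_hat) = (A - B A^{-1} B)^{-1}, with A and B averaged over n customers."""
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(1, n + 1):
        x = (2 * i - n - 1) / n
        f = (1.0, x)
        # w = E(z | F): +1 above Delta, -1 below -Delta, 2p - 1 in the randomized middle.
        w = 1.0 if x >= delta else (-1.0 if x <= -delta else 2 * p - 1)
        for r in range(2):
            for c in range(2):
                a[r][c] += f[r] * f[c] / n
                b[r][c] += w * f[r] * f[c] / n
    bab = mul2(mul2(b, inv2(a)), b)
    schur = [[a[r][c] - bab[r][c] for c in range(2)] for r in range(2)]
    return inv2(schur)


for delta in (0.0, 0.5, 1.0):
    v = scaled_var_gamma(2000, delta)
    print(delta, round(v[0][0], 3), round(v[1][1], 3))
```

No coin flips are simulated here: B is built from E(z | F), which is the sense in which matrix algebra replaces the inner Monte Carlo loop.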
Non-central regions
The airline won’t give upgrades to half of their passengers.
They are more likely to do:
$$z = \begin{cases} 1, & \text{top few} \\ \text{random}, & \text{next few} \\ -1, & \text{majority.} \end{cases}$$
Of the majority, only retain those where the linear model is ok.
Two lines
Experiment in range (A, B):
Method               A       B       Var(β̂3)
Full experiment      −1.00   1.00    3.00/N
RDD                  0.00    0.00    12.00/N
Expt on bottom 50%   −1.00   0.00    13.09/N
Expt on second 10%   0.60    0.80    137.56/N
Top 10% only         0.80    0.80    751.03/N
Top 15% only         0.70    0.70    223.44/N
Top 20% only         0.60    0.60    95.21/N
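The table entries appear to follow from the block formula with F = (1, x), taking x uniform on (−1, 1), keeping all N customers, and randomizing 50 : 50 on (A, B). Under those assumptions the needed moments of E(z | x) have closed forms, and the sketch below computes N·Var(β̂3) with σ² = 1. The function name is hypothetical.

```python
def scaled_var_beta3_cutoffs(a_cut, b_cut):
    """N * Var(beta3_hat) for x uniform on (-1, 1), z randomized 50:50 on (a_cut, b_cut)."""
    # Moments of E(z | x) = -1 below a_cut, 0 between the cutoffs, +1 above b_cut.
    ez = -(a_cut + b_cut) / 2
    exz = (2 - a_cut ** 2 - b_cut ** 2) / 4
    ex2z = -(a_cut ** 3 + b_cut ** 3) / 6
    # With F = (1, x): A = diag(1, 1/3) and B = [[ez, exz], [exz, ex2z]], both divided by N.
    # Entries of B A^{-1} B, using A^{-1} = diag(1, 3):
    m00 = ez * ez + 3 * exz * exz
    m01 = ez * exz + 3 * exz * ex2z
    m11 = exz * exz + 3 * ex2z * ex2z
    # Schur complement S = A - B A^{-1} B; Var(beta3_hat) is the (xz, xz) entry of S^{-1}.
    s00, s01, s11 = 1 - m00, -m01, 1 / 3 - m11
    return s00 / (s00 * s11 - s01 * s01)


rows = [
    ("Full experiment", -1.0, 1.0),
    ("RDD", 0.0, 0.0),
    ("Expt on bottom 50%", -1.0, 0.0),
    ("Expt on second 10%", 0.6, 0.8),
    ("Top 10% only", 0.8, 0.8),
    ("Top 15% only", 0.7, 0.7),
    ("Top 20% only", 0.6, 0.6),
]
for name, a_cut, b_cut in rows:
    print(f"{name:20s} {a_cut:6.2f} {b_cut:6.2f} {scaled_var_beta3_cutoffs(a_cut, b_cut):9.2f}")
```

The determinant in the last line gets very small when only a thin slice at the top is treated, which is why the variances for the "top few" designs blow up.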
Followup directions
• This x can be the output of a prediction algorithm based on many variables.
So how does the sampling plan help fit the next model?
I.e., how to handle concomitants?
• What about binary responses y?
Logistic regression efficiency actually depends on the underlying β.
Usual approaches are Bayesian.
Thanks
• Hal Varian, co-author
• Google, environment