Tuning tie-breaker experiments 1 Tuning the tie …statweb.stanford.edu/~owen/pubtalks/tiebreaker.pdfTuning tie-breaker experiments 1 Tuning the tie-breaker design Art B. Owen , Stanford


Tuning the tie-breaker design

Art B. Owen∗, Stanford University

and

Hal Varian, Google

∗Work (mostly) done for Google, not as part of my Stanford responsibilities.

Stanford statistics seminar


Overview

Lots of problems come up when you combine data of different types.

From 40,000 feet

• Bayes

• Likelihood

• Empirical Bayes

• Transportability

At ground level

Specifics are interesting.

First I sketch some recent examples.

Then the work with Hal Varian.



Big data & small data

With Aiyou Chen and Minghui Shi of Google.

Small and good data set

(xi, yi), i ∈ S, n obs

We want β for this small population.

Huge data set, possibly relevant

(xi, yi), i ∈ B, N ≫ n obs

Approach

Shrink $\hat\beta_S$ towards $\hat\beta_B$

Stein or Bayes



Related GWAS

With Edgar Dobriban, Stuart Kim, Kristen Fortney

• Tiny underpowered GWAS on centenarians

• Seek optimal weighting of the SNP hypotheses (inverse weight the p-values)

• Using huge GWAS on age-related illness (e.g., diabetes, hypertension)

We got some new longevity-associated genes.



Partial conjunction tests

Concept from Benjamini & Heller

Test the same H0 on n data sets. Require at least r rejections.

Better reproducibility than meta-analysis.

2 papers led by Jingshu Wang

Paper 1

Conditions for admissible testing of a weirdly composite “sparsity null”.

Wang & O (2018) JASA

Paper 2

N genes in n studies

An N × n matrix of p-values

Filtering idea to do N PC tests at once.

Wang, Su, Sabatti, O (2018)



Propensity work

With Evan Rosenman and Michael Baiocchi and Hailey Banack (2018)

Does W ∈ {0, 1} cause y?

Huge data base (Wi, xi, yi) for i ∈ Obs.

Wi chosen in a way that could depend on xi

Small randomized experiment (Wi, xi, yi) for i ∈ Expt.

Wi chosen at random

First idea

Put experimental subjects into a propensity bucket.

The one they would have occupied in the observational data.

Women’s health initiative

Both kinds of data on hormone replacement vs coronary heart disease.



Hal Varian

Google chief economist



Customer loyalty plans

An airline can give an upgrade to n out of N customers. Who?

• The n most loyal customers?

• The n customers most likely to start flying / spending more?

Other examples

• Hotels & car rental companies

• E-commerce platforms, for their advertisers, reviewers, or content producers



Two goals

1) Get the most value from the offer

2) Measure the causal effect of the offer



Two acronyms

1) RDD = Regression Discontinuity Design

2) RCT = Randomized Controlled Trial

We will hybridize between these approaches.



The random variables

i    customer id
z_i  treatment, YES = 1, NO = −1
y_i  outcome, e.g., revenue one year later (or profit, or ...)
x_i  assignment variable (the larger the better)

Assignment / running variable x

1) It could be past revenue, or

2) a machine learning prediction.



Some simplifications

Suppose at first that half of the z_i are 1 and half are −1.

(undo later)

Rank transformation

Sort customers, $x_1 \le x_2 \le \cdots \le x_N$, then

re-define $x_i \leftarrow (2i - N - 1)/N$

Now $-1 < x_i < 1$.

Two-line regression

$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma^2)$

Other models are interesting, but we need to pick one, so this is it.
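A minimal sketch of the rank transformation and of one row of the two-line design matrix. The function names are illustrative, not from the talk, and ties in the raw scores are broken by input order, which is an assumption.

```python
def rank_transform(raw_scores):
    """Map raw scores to x_i = (2i - N - 1)/N, where i is the rank (1..N).

    Assumption: ties are broken by input order (sorted() is stable).
    """
    n = len(raw_scores)
    order = sorted(range(n), key=lambda k: raw_scores[k])
    x = [0.0] * n
    for rank, k in enumerate(order, start=1):
        x[k] = (2 * rank - n - 1) / n
    return x


def design_row(x_i, z_i):
    """Row (1, x, z, x*z) of the two-line regression design matrix."""
    return [1.0, x_i, float(z_i), x_i * z_i]


x = rank_transform([10, 30, 20, 40])   # ranks 1, 3, 2, 4
row = design_row(x[3], 1)              # top-ranked customer, treated
```

All transformed values lie strictly inside (−1, 1) and are equally spaced, so integrals over a uniform x approximate averages over customers.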



Regression discontinuity

Treatment IFF x > 0. Thistlethwaite & Campbell (1960)

[Figure: scatterplot of response versus running variable on (−1, 1), with the treatment discontinuity at x = 0.]

People just left of the discontinuity should be comparable to those just right of it.


Separate linear regressions

[Figure: regression discontinuity. Response versus running variable on (−1, 1), with separate linear regressions fit on each side of x = 0.]

Raises thorny extrapolation/linearity issues at large |xi|.


Regression discontinuity

Famous example:

x = test score

z = merit scholarship iff x > τ

y = went to grad school

then logistic regression.

RDD is the second most believable causal inference method.



Tie-breaker design

Pick cutoffs $A \le B$, then

$$z_i = \begin{cases} 1, & x_i > B \\ -1, & x_i \le A \\ \text{random}, & A < x_i < B \end{cases}$$

[Diagram: the x axis from −1.0 to 1.0, split at A and B into NO (left of A), 50 : 50 (between A and B), and YES (right of B).]

Extreme cases

1) $x_1 < A = B < x_N \implies$ RDD

2) $A < x_1 \le \cdots \le x_N < B \implies$ RCT

Also called “cutoff designs” Cappelleri & Trochim
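The assignment rule can be sketched as follows. The function name and the `rng` argument are illustrative. Sending a point exactly at x = B (with A < B) to the coin flip is an assumption, since the rule as stated leaves that single point unassigned.

```python
import random


def assign_treatment(x, a, b, rng=random):
    """Tie-breaker assignment for one subject with running variable x.

    Returns +1 above the upper cutoff b, -1 at or below the lower cutoff a,
    and a fair coin flip in between.
    Assumption: a <= b, and x == b with a < b is randomized.
    """
    if a > b:
        raise ValueError("need a <= b")
    if x > b:
        return 1
    if x <= a:
        return -1
    return rng.choice((-1, 1))


xs = [-0.9, -0.3, 0.0, 0.3, 0.9]
rdd = [assign_treatment(x, 0.0, 0.0) for x in xs]        # A = B: pure RDD
coin = random.Random(0)                                  # seeded for repeatability
rct = [assign_treatment(x, -1.0, 1.0, coin) for x in xs] # everyone randomized
```

With A = B the rule is deterministic. With A below every x and B above every x, every subject gets a coin flip.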


Examples

x                     z                             Ref
Reading ability       Remedial English class        Aiken et al. (1998)
Student ranking       Post secondary financial aid  Angrist et al. (2014)
Composite prognostic  Inpatient rehab               Havassy

Lanarkshire milk experiment

Student (1931)

Maybe a tie-breaker would have worked.



Tie-breakers

∆ = fraction of subjects in the randomized region, between the blue dashed lines

[Figure: four panels of outcome versus x on (−1, 1): RCT (Delta = 1), RDD (Delta = 0), Delta = 1/3, and Delta = 2/3.]



Two-line regression

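$$E(y) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz$$

$$X = \begin{pmatrix} 1 & x_1 & z_1 & x_1 z_1 \\ 1 & x_2 & z_2 & x_2 z_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_N & z_N & x_N z_N \end{pmatrix}$$

$$\mathrm{Var}(\hat\beta) = (X^T X)^{-1} \sigma^2$$

$$\Pr(z_i = 1) = \begin{cases} 0, & x_i \le -\Delta \\ 1/2, & |x_i| < \Delta \\ 1, & x_i > \Delta \end{cases}$$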



Integral approximation

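$$\frac{1}{N} X^T X \approx \begin{array}{c|cccc} & 1 & x & z & xz \\ \hline 1 & 1 & 0 & 0 & \phi(\Delta) \\ x & 0 & 1/3 & \phi(\Delta) & 0 \\ z & 0 & \phi(\Delta) & 1 & 0 \\ xz & \phi(\Delta) & 0 & 0 & 1/3 \end{array}$$

where

$$\phi(\Delta) \equiv \frac12 \int_{-1}^{1} x\,E(z \mid x)\,dx = \frac12 \int_{-1}^{-\Delta} (-x)\,dx + \frac12 \int_{-\Delta}^{\Delta} 0\,dx + \frac12 \int_{\Delta}^{1} x\,dx = \frac{1-\Delta^2}{2}$$

The error above is $O_p(1/\sqrt{N})$.

Even less under stratification.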
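A small simulation, as a sketch, comparing the average of x_i z_i in one randomly drawn design with the closed form φ(∆) = (1 − ∆²)/2. The function names, the seed, and the choice N = 20,000 are illustrative assumptions. A point exactly at x = ∆ is treated here.

```python
import random


def phi(delta):
    """Closed form phi(Delta) = (1 - Delta^2) / 2 from the integral above."""
    return (1.0 - delta ** 2) / 2.0


def simulated_moments(n, delta, seed=0):
    """Return (average of z_i, average of x_i * z_i) for one simulated design."""
    rng = random.Random(seed)
    sum_z = 0.0
    sum_xz = 0.0
    for i in range(1, n + 1):
        x = (2 * i - n - 1) / n          # rank-transformed running variable
        if x <= -delta:
            z = -1                       # never treated
        elif x < delta:
            z = rng.choice((-1, 1))      # fair coin in the randomized zone
        else:
            z = 1                        # always treated
        sum_z += z
        sum_xz += x * z
    return sum_z / n, sum_xz / n


avg_z, avg_xz = simulated_moments(20_000, 0.5)   # compare avg_xz with phi(0.5)
rdd_z, rdd_xz = simulated_moments(20_000, 0.0)   # no randomness when Delta = 0
```

The average of z_i estimates the zero entries of the matrix, and the average of x_i z_i estimates the φ(∆) entries.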


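Rearrange $X^T X / N$

$$\begin{array}{c|cccc} & 1 & zx & z & x \\ \hline 1 & 1 & \phi & \cdot & \cdot \\ zx & \phi & 1/3 & \cdot & \cdot \\ z & \cdot & \cdot & 1 & \phi \\ x & \cdot & \cdot & \phi & 1/3 \end{array}$$

(using · for 0)

$$N \times \mathrm{Var}\begin{pmatrix} \hat\beta_0 \\ \hat\beta_3 \\ \hat\beta_2 \\ \hat\beta_1 \end{pmatrix} = \frac{1}{1/3 - \phi^2} \begin{pmatrix} 1/3 & -\phi & \cdot & \cdot \\ -\phi & 1 & \cdot & \cdot \\ \cdot & \cdot & 1/3 & -\phi \\ \cdot & \cdot & -\phi & 1 \end{pmatrix} \sigma^2$$

$$\phi = \phi(\Delta) = \frac{1-\Delta^2}{2}$$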



Normalization

The design choice is which ∆ to use.

That comes down to

$$\frac{\mathrm{Var}(c^T\hat\beta;\ \Delta_1)}{\mathrm{Var}(c^T\hat\beta;\ \Delta_0)}$$

for various vectors c.

Cancellation

$\sigma^2$ cancels in this ratio.

So we fix $\sigma^2 = 1$.



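Impact

Changing z from −1 to +1 increases E(y) by

$$(\beta_0 + \beta_1 x + \beta_2 + \beta_3 x) - (\beta_0 + \beta_1 x - \beta_2 - \beta_3 x) = 2(\beta_2 + x\beta_3)$$

So β2 and β3 are important.

So is x.

(If we didn’t already know)

Variance

$$N \times \mathrm{Var}(2(\hat\beta_2 + x\hat\beta_3)) = \cdots = \frac{16(1 + 3x^2)}{1 + 3\Delta^2(2 - \Delta^2)}$$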
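As a check on the elided algebra, here is a sketch that evaluates the closed form and rebuilds it from the two-line variances, N·Var(β̂2) = (1/3)/(1/3 − φ²) and N·Var(β̂3) = 1/(1/3 − φ²), with σ² = 1. In the block structure of X^T X / N the estimates β̂2 and β̂3 fall in different blocks, so their covariance is zero. Function names are illustrative.

```python
def n_var_effect(x, delta):
    """Closed form for N * Var(2(beta2_hat + x * beta3_hat)), sigma^2 = 1."""
    return 16.0 * (1.0 + 3.0 * x * x) / (1.0 + 3.0 * delta ** 2 * (2.0 - delta ** 2))


def n_var_effect_from_parts(x, delta):
    """Same quantity from the separate variances of beta2_hat and beta3_hat."""
    phi = (1.0 - delta ** 2) / 2.0
    det = 1.0 / 3.0 - phi * phi
    n_var_b2 = (1.0 / 3.0) / det      # N * Var(beta2_hat)
    n_var_b3 = 1.0 / det              # N * Var(beta3_hat)
    # beta2_hat and beta3_hat are uncorrelated, so the variances add.
    return 4.0 * (n_var_b2 + x * x * n_var_b3)


rct_best, rct_worst = n_var_effect(0.0, 1.0), n_var_effect(1.0, 1.0)
rdd_best, rdd_worst = n_var_effect(0.0, 0.0), n_var_effect(1.0, 0.0)
```

Under these formulas the RCT ranges from 4 to 16 and the RDD from 16 to 64 as x moves from 0 to 1.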



Variance vs ∆

[Figure: N × variance versus Delta (0 = regression discontinuity, 1 = experiment), for the coefficient of z and the coefficient of z*x.]

$\mathrm{Var}(\hat\beta_3) = 3\,\mathrm{Var}(\hat\beta_2)$ for all ∆.



RCT vs RDD

Method                    ∆   Var(β̂2)   Var(β̂3)
Regression discontinuity  0   4/N        12/N
Experiment                1   1/N        3/N

An RDD with N observations is as good as an RCT with N/4 observations.

Section 6 of Jacob, Zhu, Somers & Bloom (2012) has this and more observations.



Var(2(β̂2 + xβ̂3))

[Figure: variance of treatment effect vs x, linear regression. N × variance versus target location x. Top curve = regression discontinuity, Delta = 0; bottom curve = experiment, Delta = 1; step size 0.1.]

The worst RCT (at x = 1) is no worse than the best RDD (at x = 0): under the two-line model both have N × variance 16.



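Gain from the sample

The expected payoff per customer in the data set is

$$\frac{1}{N} \sum_{i=1}^{N} \bigl(\beta_0 + \beta_1 x_i + \beta_2 E(z_i) + \beta_3 x_i E(z_i)\bigr)$$

$$E(z_i) = \begin{cases} -1, & x_i < -\Delta \\ 0, & |x_i| \le \Delta \\ 1, & x_i > \Delta \end{cases}$$

So plan with

$$g(\Delta) \equiv \frac12 \int_{-1}^{-\Delta} (\beta_0 + \beta_1 x - \beta_2 - \beta_3 x)\,dx + \frac12 \int_{-\Delta}^{\Delta} (\beta_0 + \beta_1 x)\,dx + \frac12 \int_{\Delta}^{1} (\beta_0 + \beta_1 x + \beta_2 + \beta_3 x)\,dx = \beta_0 + \beta_3 (1 - \Delta^2)/2.$$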
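A sketch that checks the closed form for g(∆) against a midpoint-rule integral of the expected payoff over x uniform on (−1, 1). The coefficient values are made up for illustration only.

```python
def expected_payoff(beta, delta, steps=100_000):
    """Midpoint-rule average of b0 + b1*x + b2*E(z) + b3*x*E(z) over (-1, 1)."""
    b0, b1, b2, b3 = beta
    h = 2.0 / steps
    total = 0.0
    for k in range(steps):
        x = -1.0 + (k + 0.5) * h
        if x < -delta:
            ez = -1.0          # never treated
        elif x > delta:
            ez = 1.0           # always treated
        else:
            ez = 0.0           # randomized 50:50, so E(z) = 0
        total += (b0 + b1 * x + b2 * ez + b3 * x * ez) * h
    return total / 2.0         # density of uniform x on (-1, 1) is 1/2


def gain(beta, delta):
    """Closed form g(Delta) = beta0 + beta3 * (1 - Delta^2) / 2."""
    return beta[0] + beta[3] * (1.0 - delta ** 2) / 2.0


beta = (10.0, 2.0, 0.5, 1.5)   # hypothetical coefficients
numeric = expected_payoff(beta, 0.4)
closed = gain(beta, 0.4)
```

The β1 and β2 terms integrate to zero by symmetry, which is why only β0 and β3 survive in g(∆).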



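The tradeoff

Short term gain per customer

$$g(\Delta) = \beta_0 + \frac{\beta_3 (1 - \Delta^2)}{2}$$

Define the information gain per customer

$$\mathrm{info}(\Delta) \equiv \frac{1}{N\,\mathrm{Var}(\hat\beta_3)} = \frac13 - \frac{(1 - \Delta^2)^2}{4}$$

Balance

$$v(\Delta) \equiv g(\Delta) + \lambda \times \mathrm{info}(\Delta) = \beta_0 + \beta_3 \frac{1 - \Delta^2}{2} + \lambda \Bigl( \frac13 - \frac{(1 - \Delta^2)^2}{4} \Bigr)$$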

NB: β0 does not affect our choice of ∆.



Optimal ∆

β3 is the coefficient of $x_i z_i$

λ is value of information

$$\Delta^* = \begin{cases} 1, & \beta_3/\lambda \le 0 \\ \sqrt{1 - \beta_3/\lambda}, & 0 \le \beta_3/\lambda \le 1 \\ 0, & 1 \le \beta_3/\lambda. \end{cases}$$

[Figure: optimal Delta versus present / future value.]
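A sketch of the optimal ∆ rule, cross-checked against a grid search on v(∆) = g(∆) + λ·info(∆). Writing u = 1 − ∆², v is a concave quadratic in u with stationary point u = β3/λ, which gives the middle case. The code assumes λ > 0, and the function names and grid size are illustrative.

```python
import math


def value(delta, beta3, lam, beta0=0.0):
    """v(Delta) = beta0 + beta3 * u / 2 + lam * (1/3 - u^2 / 4), u = 1 - Delta^2."""
    u = 1.0 - delta ** 2
    return beta0 + beta3 * u / 2.0 + lam * (1.0 / 3.0 - u * u / 4.0)


def optimal_delta(beta3, lam):
    """Closed-form maximizer of v over 0 <= Delta <= 1 (assumes lam > 0)."""
    ratio = beta3 / lam
    if ratio <= 0.0:
        return 1.0                    # no short-term gain from targeting: full RCT
    if ratio >= 1.0:
        return 0.0                    # gain dominates information: pure RDD
    return math.sqrt(1.0 - ratio)


def grid_argmax(beta3, lam, steps=2000):
    """Brute-force maximizer of v on an evenly spaced grid of Delta values."""
    grid = (k / steps for k in range(steps + 1))
    return max(grid, key=lambda d: value(d, beta3, lam))


best = optimal_delta(0.5, 1.0)        # sqrt(1 - 0.5)
check = grid_argmax(0.5, 1.0)
```

β0 is a constant offset in v, so it defaults to zero here without changing the maximizer.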



Value of future information

It is really hard to quantify the value of that information.

Maybe harder than eliciting a prior.



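Simpler approach

Let ∆0 be the smallest ∆ with efficiency ρ vs the RCT (∆ = 1).

We know that $1/4 \le \rho \le 1$.

$$\rho = \frac{\mathrm{Var}(2(\hat\beta_2 + x\hat\beta_3) \mid \Delta = 1)}{\mathrm{Var}(2(\hat\beta_2 + x\hat\beta_3) \mid \Delta = \Delta_0)} = \cdots = \frac{1 + 3\Delta_0^2 (2 - \Delta_0^2)}{1 + 3(2 - 1)}$$

Solve a quadratic equation for $\Delta_0^2$

$$3\Delta_0^4 - 6\Delta_0^2 + 4\rho - 1 = 0 \implies \Delta_0 = \sqrt{1 - \sqrt{1 - (4\rho - 1)/3}}$$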



Minimal ∆ for efficiency ρ

[Figure: minimal Delta versus efficiency demanded.]

ρ     ∆0
0.99  0.94
0.9   0.80
0.8   0.70
0.7   0.61
0.6   0.52
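The table can be reproduced from the two-line formulas ∆0 = sqrt(1 − sqrt(1 − (4ρ − 1)/3)) and ρ = (1 + 3∆0²(2 − ∆0²))/4. A minimal sketch with illustrative function names:

```python
import math


def min_delta(rho):
    """Smallest Delta whose efficiency relative to the RCT is rho."""
    if not 0.25 <= rho <= 1.0:
        raise ValueError("efficiency must lie in [1/4, 1]")
    inner = max(0.0, 1.0 - (4.0 * rho - 1.0) / 3.0)   # guard against rounding
    return math.sqrt(1.0 - math.sqrt(inner))


def efficiency(delta):
    """rho = (1 + 3 * Delta^2 * (2 - Delta^2)) / 4 for the two-line model."""
    return (1.0 + 3.0 * delta ** 2 * (2.0 - delta ** 2)) / 4.0


table = {rho: round(min_delta(rho), 2) for rho in (0.99, 0.9, 0.8, 0.7, 0.6)}
```

The endpoints behave as expected: ρ = 1/4 gives ∆0 = 0 (pure RDD) and ρ = 1 gives ∆0 = 1 (full RCT).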



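Gaussian running variable

For $x_i = \Phi^{-1}\bigl((i - 1/2)/N\bigr)$,

experiment on central ∆N observations, then

$$\frac{1}{N} X^T X \approx \begin{array}{c|cccc} & 1 & zx & z & x \\ \hline 1 & 1 & \phi_G & 0 & 0 \\ zx & \phi_G & 1 & 0 & 0 \\ z & 0 & 0 & 1 & \phi_G \\ x & 0 & 0 & \phi_G & 1 \end{array}$$

$$\phi_G = \mathrm{avg}(x_i z(x_i)) = \cdots = 2\varphi\Bigl(\Phi^{-1}\Bigl(\frac{1 + \Delta}{2}\Bigr)\Bigr)$$

After some algebra, the RCT efficiency vs RDD is

$$\frac{\pi}{\pi - 2} \doteq 2.75$$

Goldberger (1972).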
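A sketch of the "some algebra", using Python's standard `statistics.NormalDist`. Each 2 × 2 block [[1, φ_G], [φ_G, 1]] has inverse diagonal entries 1/(1 − φ_G²), so N·Var(β̂3) = 1/(1 − φ_G²) when σ² = 1. The function names are illustrative, and ∆ = 1 is handled separately because the quantile is infinite there.

```python
import math
from statistics import NormalDist


def phi_g(delta):
    """phi_G = 2 * pdf(quantile((1 + Delta) / 2)) for a standard Gaussian x."""
    if delta >= 1.0:
        return 0.0                       # everyone randomized: E(x z) = 0
    std = NormalDist()
    return 2.0 * std.pdf(std.inv_cdf((1.0 + delta) / 2.0))


def n_var_beta3(delta):
    """N * Var(beta3_hat) from the block [[1, phi_G], [phi_G, 1]], sigma^2 = 1."""
    return 1.0 / (1.0 - phi_g(delta) ** 2)


ratio = n_var_beta3(0.0) / n_var_beta3(1.0)   # RDD variance over RCT variance
```

At ∆ = 0, φ_G² = 2/π, which gives the ratio π/(π − 2).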



Carpentry

We don’t have to keep $p(x) \equiv \Pr(Z = 1 \mid x) \in \{0, 1/2, 1\}$.

[Figure: P(Z = 1 | x) versus running variable x on (−1, 1).]

Carpentry doesn’t really help

(or hurt).



Carpentry ctd

Under symmetry,

$$p(-x) = 1 - p(x),$$

the shape of the curve doesn’t matter, only

$$\overline{zx} \equiv \frac12 \int_{-1}^{1} x\,E(Z \mid x)\,dx = \frac12 \int_{-1}^{1} x\,(2p(x) - 1)\,dx > 0$$

short term gain is

$$\beta_0 + \beta_3\,\overline{zx}$$

information proportional to

$$\frac13 - \overline{zx}^2$$

Asymmetric p

Replace by a symmetric one. That reduces

$$\mathrm{diag}(\mathrm{Var}(\hat\beta))$$

keeping gain the same.
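To illustrate that only the integral matters, the sketch below compares two symmetric choices of p(x): a linear ramp p(x) = (1 + x)/2 and a three-level tie-breaker with ∆ = sqrt(1/3). Both examples are assumptions chosen for illustration, and the integral is a plain midpoint rule.

```python
import math


def zx_bar(p, steps=200_000):
    """Midpoint-rule value of (1/2) * integral of x * (2 p(x) - 1) over (-1, 1)."""
    h = 2.0 / steps
    total = 0.0
    for k in range(steps):
        x = -1.0 + (k + 0.5) * h
        total += x * (2.0 * p(x) - 1.0) * h
    return total / 2.0


def ramp(x):
    """Linear treatment probability, symmetric: p(-x) = 1 - p(x)."""
    return (1.0 + x) / 2.0


def make_step(delta):
    """Three-level tie-breaker probability with randomized zone |x| < delta."""
    def p(x):
        if x <= -delta:
            return 0.0
        if x < delta:
            return 0.5
        return 1.0
    return p


zx_ramp = zx_bar(ramp)
zx_step = zx_bar(make_step(math.sqrt(1.0 / 3.0)))
```

Both curves give an integral of 1/3, so under the two-line model they have the same short term gain and the same information.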


Two quadratics

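$$E(Y) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz + \beta_4 x^2 + \beta_5 x^2 z$$

$$\frac{1}{N} X^T X \doteq \begin{array}{c|cccccc} & 1 & zx & x^2 & z & x & zx^2 \\ \hline 1 & 1 & \phi_1 & 1/3 & \cdot & \cdot & \cdot \\ zx & \phi_1 & 1/3 & \phi_3 & \cdot & \cdot & \cdot \\ x^2 & 1/3 & \phi_3 & 1/5 & \cdot & \cdot & \cdot \\ z & \cdot & \cdot & \cdot & 1 & \phi_1 & 1/3 \\ x & \cdot & \cdot & \cdot & \phi_1 & 1/3 & \phi_3 \\ zx^2 & \cdot & \cdot & \cdot & 1/3 & \phi_3 & 1/5 \end{array}$$

$$\phi_1 = (1 - \Delta^2)/2, \qquad \phi_3 = (1 - \Delta^4)/4$$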



Treatment effect

$$E(y \mid x, z = 1) - E(y \mid x, z = -1) = 2(\beta_2 + x\beta_3 + x^2\beta_5)$$

[Figure: variance of treatment effect vs x, quadratic regression. N × variance versus target location x. Top curve = regression discontinuity, Delta = 0; bottom curve = experiment, Delta = 1; step size 0.1.]

Note log scale.

Gelman & Imbens (2017) warn against polynomial RDD.



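More elaborate models

For a feature vector $F = F(x) \in \mathbb{R}^d$ including intercept

$$E(y) = F^T\beta + zF^T\gamma$$

take

$$z_i = \begin{cases} 1, & \theta^T F_i > \Delta \\ \text{random}, & |\theta^T F_i| < \Delta \\ -1, & \theta^T F_i \le -\Delta. \end{cases}$$

Now

$$X^T X = \begin{pmatrix} A & B \\ B & A \end{pmatrix}, \qquad A = \sum_i F_i F_i^T, \qquad B = \sum_i w_i F_i F_i^T,$$

for

$$w_i = E(z_i \mid F_i) = \begin{cases} 1, & \theta^T F_i > \Delta, \\ 2p - 1, & |\theta^T F_i| < \Delta, \\ -1, & \theta^T F_i \le -\Delta. \end{cases}$$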


Inverting block matrices

$$\mathrm{Var}(\hat\gamma) = \mathrm{Var}(\hat\beta) = (A - BA^{-1}B)^{-1}\sigma^2$$

$$\mathrm{Cov}(\hat\gamma, \hat\beta) = -A^{-1}B(A - BA^{-1}B)^{-1}\sigma^2$$

We could pick θ, F and p by brute force search with Monte Carlo as an inner loop.

Here matrix algebra can replace the inner Monte Carlo.

Big ∆ better

For large enough ∆ we get B = 0.

Smaller ∆ raises $BA^{-1}B$ and hence $\mathrm{Var}(\hat\beta)$.
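A sketch of the matrix-algebra route for the smallest case, F = (1, x) with x uniform on (−1, 1) and p = 1/2, where the blocks per observation are A = [[1, 0], [0, 1/3]] and B = [[0, φ], [φ, 0]] with φ = (1 − ∆²)/2. The 2 × 2 helpers are hand-written to keep the example dependency-free. Because A and B are averages rather than sums, the result is N × Var(γ̂) with σ² = 1.

```python
def mat_mul(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_inv(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def var_gamma(a, b):
    """(A - B A^{-1} B)^{-1}: variance of gamma-hat when sigma^2 = 1."""
    bab = mat_mul(mat_mul(b, mat_inv(a)), b)
    schur = [[a[i][j] - bab[i][j] for j in range(2)] for i in range(2)]
    return mat_inv(schur)


def two_line_blocks(delta):
    """Per-observation A and B for F = (1, x), uniform x, and p = 1/2."""
    phi = (1.0 - delta ** 2) / 2.0
    a = [[1.0, 0.0], [0.0, 1.0 / 3.0]]
    b = [[0.0, phi], [phi, 0.0]]
    return a, b


rdd = var_gamma(*two_line_blocks(0.0))   # diagonal: N*Var(beta2), N*Var(beta3)
rct = var_gamma(*two_line_blocks(1.0))   # B = 0, so this is just A^{-1}
```

Here γ = (β2, β3), and the diagonals recover the two-line values 4 and 12 for the RDD and 1 and 3 for the experiment.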



Non-central regions

The airline won’t give upgrades to half of their passengers.

They are more likely to do:

$$z = \begin{cases} 1, & \text{top few} \\ \text{random}, & \text{next few} \\ -1, & \text{majority.} \end{cases}$$

Of the majority, only retain those where the linear model is ok.



Two lines

Experiment in range (A, B):

Method               A      B      Var(β̂3)
Full experiment      −1.00  1.00   3.00/N
RDD                  0.00   0.00   12.00/N
Expt on bottom 50%   −1.00  0.00   13.09/N
Expt on second 10%   0.60   0.80   137.56/N
Top 10% only         0.80   0.80   751.03/N
Top 15% only         0.70   0.70   223.44/N
Top 20% only         0.60   0.60   95.21/N
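Several table entries can be recomputed with the sketch below. It assumes x uniform on (−1, 1), z = −1 for x ≤ A, a fair coin for A < x < B, z = +1 for x > B, and σ² = 1. In the order (1, x, z, xz), X^T X / N has diagonal blocks diag(1, 1/3) and off-diagonal block C = [[E z, E xz], [E xz, E x²z]]. N·Var(β̂3) is the last diagonal entry of the inverse, obtained from a 2 × 2 Schur complement. The moment formulas are derived here and are not quoted from the talk.

```python
def n_var_beta3(a, b):
    """N * Var(beta3_hat) when the experiment runs on a < x < b (sigma^2 = 1)."""
    ez = -(a + b) / 2.0                     # E(z)
    exz = (2.0 - a * a - b * b) / 4.0       # E(x z)
    ex2z = -(a ** 3 + b ** 3) / 6.0         # E(x^2 z)
    # Schur complement S = A0 - C * inv(A0) * C, with inv(A0) = diag(1, 3).
    s11 = 1.0 - (ez * ez + 3.0 * exz * exz)
    s12 = -(ez * exz + 3.0 * exz * ex2z)
    s22 = 1.0 / 3.0 - (exz * exz + 3.0 * ex2z * ex2z)
    det = s11 * s22 - s12 * s12
    return s11 / det                        # entry (2, 2) of inv(S)


rows = {
    "Full experiment": n_var_beta3(-1.0, 1.0),
    "RDD": n_var_beta3(0.0, 0.0),
    "Expt on bottom 50%": n_var_beta3(-1.0, 0.0),
    "Expt on second 10%": n_var_beta3(0.6, 0.8),
    "Top 20% only": n_var_beta3(0.6, 0.6),
}
```

The symmetric cases reduce to the earlier formula 1/(1/3 − φ²), since E(z) and E(x²z) vanish when A = −B.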



Followup directions

• This x can be the output of a prediction algorithm based on many variables.

So how does the sampling plan help fit the next model?

I.e., how to handle concomitants?

• What about binary responses y?

Logistic regression efficiency actually depends on the underlying β.

Usual approaches are Bayesian.



Thanks

• Hal Varian, co-author

• Google, environment

