Top Banner
In the world beyond p < .05: When & How to use p < .0499Yoav Benjamini Tel Aviv University, Israel replicability.tau.ac.il Supported by ERC PSARPS and HBP grants
35

Yoav Benjamini, "In the world beyond p

Jan 24, 2018

Download

Education

jemille6
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Yoav Benjamini, "In the world beyond p

In the world beyond p < .05:

When & How to use p < .0499…

Yoav Benjamini

Tel Aviv University, Israel

replicability.tau.ac.il

Supported by ERC PSARPS and HBP grants

Page 2: Yoav Benjamini, "In the world beyond p

We cannot assure the replicability of results from a single study:

Only to enhance it

The gold standard of science: A discovery should be replicable

Page 3: Yoav Benjamini, "In the world beyond p

When to use p-values? Almost Always

1. Statistical significance testing (via the p-value) is the

first-line defense against being fooled by randomness:

• Valid calculation of the p-value requires minimal

assumptions – less than any other statistical method.

• The “assumptions” need not be merely assumed, but

can be assured by a properly designed experiment

(randomization + non-parametric test)

Page 4: Yoav Benjamini, "In the world beyond p

When to use p-values? Almost Always

• Usable, even if the scale of measurement is

meaningless, and only directional decision is

needed: Significant difference gives sign

determination (the null need not be precisely true).

• In some emerging branches of science it’s the only

way to compare across conditions: GWAS, fMRI,

Brain Networks, Genomics pathways..

• The p-value offers a common concept and language

for addressing randomness across science

Page 5: Yoav Benjamini, "In the world beyond p

When to use with p<“some line”?

Often

2a. Selection according to some line is unavoidable

• When the analysis should lead to an action.

• A line is needed for power analysis in the design

stage, allowing justification of human and animal

sample sizes

• A minimal requirement for publication decision in

journals

• A bright line is needed when the action should be fair

Page 6: Yoav Benjamini, "In the world beyond p

When to use with p<“some line”?

Often

2b. Selection according to some line is unavoidable

in modern science

• A line is always used for selection when the analysis

results are too numerous to be included in total

• When highlighting results

Page 7: Yoav Benjamini, "In the world beyond p

• Giovannucci et al. (1995) look for relationships between more than a hundred types of food intakes and the risk of prostate cancer

• The abstract reports only three (marginal) 95% confidence intervals (CIs), apparently only for those relative risks whose CIs do not cover 1.

“Eat Ketchup and Pizza and avoid Prostate Cancer”

7

Epidemiology: a p-values free zone

Page 8: Yoav Benjamini, "In the world beyond p

Selection by a Table

SCIENCE, ‘07

Y Benjamini

Page 9: Yoav Benjamini, "In the world beyond p

The Main Table

11 selected by the table out of ~400,000

Y Benjamini

Page 10: Yoav Benjamini, "In the world beyond p

YB

32,0001 Voxels searched

1

448,000

SNPs

Selection by a Figure

number of tests ~ 13,000,000,000

10

Goal: Association between volume changes at voxels with genotype (Stein et al.’10)

Page 11: Yoav Benjamini, "In the world beyond p

When use it with <0.0499…?

3. Selection according to is p < 0.0499… is not too bad

• In a well designed FDA regulated clinical trial with a

single primary endpoint, the .05 line is used to define

success or failure. It managed to screen ~50% too

much.

• Moreover, FDA does not approve a drug based on a

single successful trial. At least twice, and often more.

• Indeed, the p<.05 threshold served us well until the

80’s.

A change in the way science is conducted.

Page 12: Yoav Benjamini, "In the world beyond p

The industrialization of the scientific process

1888 1999

1950 2010

Page 13: Yoav Benjamini, "In the world beyond p

In general scientists were slow to adjust

One paper per year from Genetics: #Gene-expressions reported

1

10

100

1000

10000

100000

1000000

Number of…

Micro-arrays

invented

Multiplicity addressed

for the first time

Page 14: Yoav Benjamini, "In the world beyond p

When use it with <0.0499…?

In large-scale problems lower levels as threshold for significance

are used in order to control some error-rate

With which we can maintain that

most of the selected findings will not be false !

The p < 5*10-8 used in whole genome scan was chosen

to control the prob. of making even one error (FWER) at 0.05

And it works well, without getting rid of the p-value, or of the .0499…

Page 15: Yoav Benjamini, "In the world beyond p

The Main Table

11 selected by the table out of ~400,000 FWER <.05

Y Benjamini

Page 16: Yoav Benjamini, "In the world beyond p

In depth analysis of 100 papers from NEJM 2002-2010. All

had multiple endpoints (Cohen and YB ‘16)

• # of endpoints in a paper 4-167 ; mean=27

• In 80% the issue of multiplicity was entirely ignored: p≤0.05

• All studies designated primary endpoints

• Conclusions were based on other endpoints when the

primary failed

The above reflects most of the published medical research

But not at the regulatory stage (Phase III trials).

The reasons that ~ 50% of phase III trials fail ?

YB

Elsewhere? in medical research?

Page 17: Yoav Benjamini, "In the world beyond p

In Experimental Psychology?

Our analysis of the 100 in the Psychology reproducibility

project:

# of inferences per study (4-700, average 72);

Only 11 (very very partially) addressed selection

YB

Page 18: Yoav Benjamini, "In the world beyond p

“With a clean conscience” (Schnall et al ’08, Psy.

Sc.)Presented with 6 moral dilemmas and asked “how wrong each action was”

Does priming for cleanliness affect the response:

One assessment of wrong-doing & 9 emotions rating per each dilemma;

Two methods of priming verbal & physical (separate experiments) (at least m=84)

Results: No significant difference on any of the emotions in any of the experiments;

only a contrast for disgust was significant

Each experiment barely significant on moral judgment over all dilemmas;

In 3 of the 6*2 particular dilemmas priming made a significant difference.

All tests at 0.05; No adjustment for selection.

Their Conclusion: The findings support the idea that moral judgment is affected by

priming for cleanliness.

The replication study could not replicate these results.

Ignoring selection in the reported results is a quite killer of

replicability in problems of medium complexity

Page 19: Yoav Benjamini, "In the world beyond p

How should p-values be used?

Like every other statistical tool

1. Address the effect of selection

2. Add other tools for inference: confidence intervals &

estimators if relevant and possible, Empirical/Bayes

(but follow 1 above)

3. Use the relevant variability

4. Replicate others’ work as a way of life

Page 20: Yoav Benjamini, "In the world beyond p

Address selective inference

Inference on a selected subset of the parameters that

turned out to be of interest after viewing the data!

How is selection manifested?

Out-of-study selection - not evident in the published work

File drawer problem / publication bias

The garden of forking paths, p-hacking,

Data dredging, Double dipping,

Inferactive Data Analysis

Page 21: Yoav Benjamini, "In the world beyond p

In-study selection - evident in the published work:

Selection by the Abstract, a Table, a Figure

Selection by highlighting those passing a threshold

Selection by modeling: AIC, Cp, BIC, FDR, LASSO,…

Page 22: Yoav Benjamini, "In the world beyond p

Address the effect of in-study selection

Report adjusted p-values by some method,

controlling

either The Familywise Error Rate (FWER)

or The rate Conditional on being selected

or The False Discovery Rate (FDR)

Alternatively, highlight/table/display only results

remaining statistically significant after adjustment

Page 23: Yoav Benjamini, "In the world beyond p

Address the effect of out-of-study

“bright lines” selection

Policy: Select to report the result only if the estimator is

significant at some level (e.g. publication bias / file drawer)

Y given |Y| ≥ z1-a/2,

Conditional Conf. Int. -> False Coverage Rate

Conditional density

Conditional max. likelihood estimator

23

Hedges ’84, Weinstein et al ’13, Taylor and others ‘14…

Page 24: Yoav Benjamini, "In the world beyond p

Conditional MLE Cond. MLE and CI for

correlation

Hedges ‘84, Zhong and Prentice ’08, Fithian, Sun, Taylor (16) YB and Meir (16+)

Use to assess ‘publication bias’ & other bright line rules

Page 25: Yoav Benjamini, "In the world beyond p

CIs extend more towards 0 and beyond as the original estimators are closer to

significance. 77% of replications fall inside. Meir Zeevi and YB (17+)

-

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

−0.50

−0.25

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00

Original Effect Size

Re

plic

atio

n E

ffe

ct S

ize

p−valueNot Significant

Significant

Replication Power0.6

0.7

0.8

0.9

CI and estimators in original study conditional on being significant at 5%.

Page 26: Yoav Benjamini, "In the world beyond p

Thresholding at

p-value ≤ 0.005

●●

●●

● ●

−0.50

−0.25

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00

Original Effect Size

Re

plic

atio

n E

ffe

ct S

ize

p−valueNot Significant

Significant

Replication Power0.80

0.85

0.90

0.95

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

−0.50

−0.25

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00

Original Effect Size

Re

plic

atio

n E

ffe

ct S

ize

p−valueNot Significant

Significant

Replication Power0.6

0.7

0.8

0.9Thresholding at

p-value ≤ 0.05

Page 27: Yoav Benjamini, "In the world beyond p

Marginal Confidence Intervals are too optimistic

2.Where possible add confidence intervals & estimators –adjusted for selection as well

Page 28: Yoav Benjamini, "In the world beyond p

20 parameters to be estimated with 90%

CIs

●● ●

● ●● ● ●

● ●

●●

● ● ●● ● ●

5 10 15 20

−4−2

02

4

Index

●● ●

● ●● ● ●

● ●

●●

● ● ●● ● ●

5 10 15 20

−4−2

02

4

Index

●● ●

● ●● ● ●

● ●

●●

● ● ●● ● ●

5 10 15 20

−4−2

02

4

Index

●● ●

● ●● ● ●

● ●

●●

● ● ●● ● ●

5 10 15 20

−4−2

02

4

Index

●● ●

● ●● ● ●

● ●

●●

● ● ●● ● ●

5 10 15 20

−4−2

02

4

Index

● ●

3/20 do not cover

3/4 CI do not cover

when selected

Need for False

Coverage-

statement Rate

controlling intervals

(FCR)

Selection of this form

harms Bayesian Intervals

as well

(Wang & Lagakos ‘07 EMR, Yekutieli 2012)

28

Page 29: Yoav Benjamini, "In the world beyond p

Mouse phenotyping example: opposite single lab results

Kafkafi et al (’17 Nature Methods)

3. Address the relevant variability

Page 30: Yoav Benjamini, "In the world beyond p

The above GxL interaction is “a fact of life”

Genotype-by-Lab effect for a genotype in a new lab is not

known; but when its variability s2GxLcan be estimated, use

Mean(MG1) – Mean(MG2)

(sWithin (1/n+1/n)+ s

GxL )1/2

Interaction size is the right “yardstick” against which genetic

differences should be compared,

when the concern is about replicability in other labs.

Good design, large sample size, transparency, avoiding p-

values - won’t solve it. So GxL adjust at each lab:

Page 31: Yoav Benjamini, "In the world beyond p

Single-lab analyses in all known replication

studies

Kafkafi et al (’17 Nature Methods)

Page 32: Yoav Benjamini, "In the world beyond p

From the example to generality

Choosing the relevant level of variability is critical in order

to increase replicability, for any inferential procedure: tests,

confidence intervals, and estimates.

Many small studies are better than single large one even if

underpowered!

Clinical research: multiple centers with Center by

Treatment interaction

Educational research: random effects for schools &

teachers (and interactions)

Functional MRI: Random effect for subjects

YB

Page 33: Yoav Benjamini, "In the world beyond p

Replicability is a minimal form of Generalizability

The ‘‘Many Labs’’ Replication Project ‘14

Page 34: Yoav Benjamini, "In the world beyond p

4. Replicate others’ work as a way of

life• Check consistency of effect’s directional decision (sign)

Significant (p≤.05) in both studies (Fisher’s definition); or

Both confidence intervals are entirely on same side.

If replicated, strong evidence against randomness, (p<0.0025),

but Scientifically much stronger:

combining different tests by different investigators

r2/2–value = max (p1,p2)

ru/m–value: the smallest significance level at which

effects in at least u out of the m studies

adjusted for selection are significant

Have been used to analyze generalizability of results

In the Psychological Reproducibility Project 36% replicated

Page 35: Yoav Benjamini, "In the world beyond p

Replicate others’ work as a way of life

Reproducibility projects are not sustainable.

Neither are publishing many papers with negative results only.

Instead

• Every research proposal and paper should have a replicability-

check component of a result, considered by the authors

important for their proposed research.

• Its result will be reported whatever the outcome is, in the

extended-abstract/main-body in 1-2 searchable sentences.

• The authors of a replicated study will receive special recognition

for having published a result considered important enough by

others to invest the effort toreplicate it.

• Researchers, Granting agencies, Publishers, Academic leaders