Non-replicating comments on
replication
Steven Goodman, MD, PhD
Johns Hopkins University
SAMSI Workshop
July 10-12, 2006
Things identified as cancer risks
(Simon and Altman, JNCI, 1994)
Electric Razors
Broken arms (in women)
Fluorescent lights
Allergies
Breeding reindeer
Being a waiter
Owning a pet bird
Being short
Being tall
Hot dogs
Having a refrigerator!!
Outline
Glaring examples
P-value/replication misconceptions
Ioannidis methods/conclusions
Evidence of selective reporting
Reproducible research
“We have no idea how or why the magnets work.”
“A real breakthrough…”
“…the [study] must be regarded as preliminary….” “But…the early results were clear and... the treatment ought to be put to use immediately.”
FDA Discussion, cont. (Fisher, Controlled Clin Trials, 20:16-39, 1999)
Dr. Lipicky: What are the p-values needed for the secondary endpoints? …Certainly we’re not talking 0.05 anymore. …You’re out of this 0.05 stuff and I would have liked to have seen what you thought was significant and at what level…
What p-value tells you that it’s there study after study?
Dr. Konstam: …what kind of statistical correction would you have to do that survival data given the fact that it’s not a specified endpoint? I have no idea how to do that from a mathematical viewpoint.
Replication probability, as a function of the p-value

P-value of initial experiment | Pr(p<0.05 in replicate) when the first observed difference is the true one | Pr(p<0.05 in replicate) when the effect has a uniform prior before the first experiment
0.10 | .37 | .41
0.05 | .50 | .50
0.03 | .58 | .56
0.01 | .73 | .66
0.005 | .80 | .71
0.001 | .91 | .78
Goodman SN, “A Comment on Replication, P-values and Evidence,” Stat Med, 11:875-879, 1992.
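A minimal numeric sketch of where these columns come from (assumptions mine: a normal test statistic, two-sided p-values, and a replicate with the same design and sample size). The fixed-effect column reproduces the table to within about 0.01; the uniform-prior column uses the normal predictive distribution, which tracks the table closely but drifts above it for very small p-values, where the details of the published calculation matter.

```python
from scipy.stats import norm

Z_CRIT = norm.ppf(1 - 0.05 / 2)  # 1.96, the two-sided alpha = 0.05 cutoff

def p_rep_fixed(p1):
    """Pr(p < 0.05 in the replicate) if the first study's observed
    difference equals the true effect: the replicate z is N(z1, 1)."""
    z1 = norm.ppf(1 - p1 / 2)
    return norm.cdf(z1 - Z_CRIT)

def p_rep_uniform(p1):
    """Same probability under a uniform prior on the effect before the
    first study: the predictive distribution of the replicate z is
    N(z1, sqrt(2))."""
    z1 = norm.ppf(1 - p1 / 2)
    return norm.cdf((z1 - Z_CRIT) / 2 ** 0.5)

for p1 in (0.10, 0.05, 0.03, 0.01, 0.005, 0.001):
    print(f"{p1:<6} {p_rep_fixed(p1):.2f} {p_rep_uniform(p1):.2f}")
# 0.05 -> .50 .50; 0.01 -> .73 .67; 0.001 -> .91 .83 (table: .78)
```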
What do we mean by replication?
Statistical significance?
Same results/conclusions from same original data?
Same results/conclusions from same analytic data?
Same R/C in ostensibly identical study?
Same R/C in similar but non-identical study?
Surrogate for whether underlying hypothesis is true?
Is combinability/heterogeneity a more profitable concept to explore?
Reasons for non-replication
Hypothesis not true. {Prior / Posterior probability}
Misrepresented evidence. {Improper/selective analysis, improper/selective reporting}
Different balance of unmeasured covariates across studies/designs {Quality of design, reliability of mechanistic knowledge}
Different handling/measurement of measured covariates across studies/designs. {Combinability / heterogeneity}
Fundamentally different question asked, i.e. new study is not a replicate of previous one. {Combinability / mechanistic knowledge}
Ioannidis, JAMA, 2005
Ioannidis findings
45 original articles claiming effectiveness, each with >1000 citations, in NEJM, Lancet, and JAMA, 1990-2003
7 (16%) subsequently contradicted
7 (16%) exaggerated effects
20 (44%) replicated
11 (24%) unchallenged
5/6 nonrandomized studies contradicted or exaggerated vs. 9/39 RCTs.
Unit of analysis?
Study | Condition | Agent
NHS | CAD prevention | Estrogen / progestin
NHS | CAD prevention | Vit. E (women)
HPFS | CAD prevention | Vit. E (men)
Zutphen | CAD prevention | Flavonoids
Case series | Leukemia | Trans-retinoic acid
Case series | Resp. distress | Nitric oxide
Effect of “bias” on Bayes factor
LR(H_1 \text{ vs. } H_0 \mid p \le \alpha;\ \beta, \text{bias})
  = \frac{\Pr(p \le \alpha \mid H_1)}{\Pr(p \le \alpha \mid H_0)}
  = \frac{(1-\beta) + \text{bias}\cdot\beta}{\alpha + \text{bias}\cdot(1-\alpha)}

As \alpha \to 0, \quad LR \to \frac{1 - (1-\text{bias})\,\beta}{\text{bias}}

At \alpha = 0.05: \quad LR(\text{bias}, \beta) = \frac{1 - (1-\text{bias})\,\beta}{1 - (1-\text{bias})(1-0.05)}

LR as a function of bias and power:

Bias | p≤0.05, power 80% | p≤0.05, power 90% | p-value=0, power 80% | p-value=0, power 90%
0.1 | 5.7 | 6.3 | 8.2 | 9.1
0.3 | 2.6 | 2.8 | 2.9 | 3.1
0.5 | 1.7 | 1.8 | 1.8 | 1.9
0.8 | 1.2 | 1.2 | 1.2 | 1.2
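A minimal sketch of this likelihood ratio in code (function names are mine; "bias" is the probability that a study with a null result is nonetheless analyzed or reported as positive). Run as written, it reproduces the table above.

```python
def lr_with_bias(alpha, beta, bias):
    """Likelihood ratio for H1 vs. H0 given a positive finding (p <= alpha),
    when a fraction `bias` of null results are converted to positives by
    selective analysis or reporting."""
    pr_pos_h1 = (1 - beta) + bias * beta      # true positives, plus biased misses
    pr_pos_h0 = alpha + bias * (1 - alpha)    # false positives, plus biased nulls
    return pr_pos_h1 / pr_pos_h0

def lr_at_p_zero(beta, bias):
    """Limit of the LR as alpha -> 0 (an exact p-value of 0)."""
    return (1 - (1 - bias) * beta) / bias

# Rows are bias; columns are power 80%/90% at p <= 0.05,
# then power 80%/90% at p-value = 0.
for bias in (0.1, 0.3, 0.5, 0.8):
    cells = [lr_with_bias(0.05, beta, bias) for beta in (0.2, 0.1)]
    cells += [lr_at_p_zero(beta, bias) for beta in (0.2, 0.1)]
    print(bias, " ".join(f"{x:.1f}" for x in cells))
# bias=0.1 -> 5.7 6.3 8.2 9.1, matching the first row of the table
```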
PPV of a positive finding, by power (1−β), pre-study odds (R), and bias (u)

1−β | R | u | Practical example | PPV p<5% | PPV 1%≤p≤5% (LR = 10.5) | PPV p≤1% (LR = 80) | PPV p≤0.05 | PPV p≤0.01 | LR for p≤1%
(first three PPV columns: without bias factor; last three columns: with bias factor)
0.8 | 1.00 | 0.10 | Adequately powered RCT with little bias and 1:1 pre-study odds | 0.97 | 0.91 | 0.99 | 0.87 | 0.89 | 7.85
0.95 | 2.00 | 0.30 | Confirmatory meta-analysis of good-quality RCTs | 0.99 | 0.95 | 0.99 | 0.86 | 0.86 | 3.18
0.8 | 0.33 | 0.40 | Meta-analysis of small inconclusive studies | 0.91 | 0.78 | 0.96 | 0.41 | 0.42 | 2.18
0.2 | 0.20 | 0.20 | Underpowered, but well-performed phase I/II | 0.62 | 0.68 | 0.94 | 0.25 | 0.26 | 1.76
0.2 | 0.20 | 0.80 | Underpowered, poorly performed phase I/II | 0.62 | 0.68 | 0.94 | 0.17 | 0.17 | 1.05
0.8 | 0.10 | 0.30 | Adequately powered exploratory epidemiological study | 0.76 | 0.51 | 0.89 | 0.21 | 0.22 | 2.83
0.2 | 0.10 | 0.30 | Underpowered exploratory epidemiological study | 0.44 | 0.51 | 0.89 | 0.12 | 0.13 | 1.45
0.2 | 0.001 | 0.80 | Discovery-oriented exploratory research with massive testing | 0.01 | 0.01 | 0.07 | 0.00 | 0.00 | 1.05
0.2 | 0.001 | 0.20 | As in previous example, but with more limited bias (more standardized) | 0.01 | 0.01 | 0.07 | 0.00 | 0.00 | 1.76
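A hedged sketch of how the entries appear to be generated. The PPV-from-odds step is standard (posterior odds = R × LR); the specific α values in the with-bias columns (0.025 and 0.005, i.e., one tail of the two-sided 0.05 and 0.01 thresholds) and the LR = power/0.025 reading of the first PPV column are my assumptions, chosen because they reproduce the printed values; 10.5 and 80 are the fixed interval LRs given in the header.

```python
def lr_with_bias(alpha, beta, u):
    """Same LR-with-bias formula as on the previous slide (u = bias)."""
    return ((1 - beta) + u * beta) / (alpha + u * (1 - alpha))

def ppv(R, lr):
    """Positive predictive value: pre-study odds R times the LR,
    converted back to a probability."""
    return R * lr / (1 + R * lr)

# First row of the table: power 0.8 (beta 0.2), R = 1.00, u = 0.10
power, R, u = 0.8, 1.00, 0.10
beta = 1 - power
print(ppv(R, power / 0.025))                 # ~0.97  PPV p<5%, without bias
print(ppv(R, 10.5), ppv(R, 80))              # ~0.91, ~0.99  fixed-LR columns
print(ppv(R, lr_with_bias(0.025, beta, u)))  # ~0.87  PPV p<=0.05, with bias
print(ppv(R, lr_with_bias(0.005, beta, u)))  # ~0.89  PPV p<=0.01, with bias
print(lr_with_bias(0.005, beta, u))          # ~7.85  LR for p<=1%, with bias
```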
Abstract conclusions (Chan et al., JAMA, 2004)

Design: Cohort study using protocols and published reports of randomized trials approved by the Scientific-Ethical Committees for Copenhagen and Frederiksberg, Denmark, in 1994-1995. The number and characteristics of reported and unreported trial outcomes were recorded from protocols, journal articles, and a survey of trialists….

Results: One hundred two trials with 122 published journal articles and 3736 outcomes were identified. Overall, 50% of efficacy and 65% of harm outcomes per trial were incompletely reported. Statistically significant outcomes had a higher odds of being fully reported compared with nonsignificant outcomes for both efficacy (pooled odds ratio, 2.4; 95% confidence interval [CI], 1.4-4.0) and harm (pooled odds ratio, 4.7; 95% CI, 1.8-12.0) data. In comparing published articles with protocols, 62% of trials had at least 1 primary outcome that was changed, introduced, or omitted. Eighty-six percent of survey responders (42/49) denied the existence of unreported outcomes despite clear evidence to the contrary.

Conclusions: The reporting of trial outcomes is not only frequently incomplete but also biased and inconsistent with protocols. Published articles, as well as reviews that incorporate them, may therefore be unreliable and overestimate the benefits of an intervention. To ensure transparency, planned trials should be registered and protocols should be made publicly available.
Reproducible Research
Roger Peng, F. Dominici, S. Zeger
AJE, 2006
A Research Pipeline
What is Reproducible Research?
Data: Analytic dataset is available
Methods: Computer code underlying figures, tables, and other principal results is available
Documentation: Adequate documentation of the code, software environment, and data is available
Distribution: Standard methods of distribution are employed for others to access materials
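To make the four criteria concrete, here is a hypothetical sketch (file name, checksum, and column name are illustrative, not from the Peng et al. paper): a single driver script, distributed alongside the analytic dataset, that pins the exact data, regenerates a principal result, and can be rerun end to end by any reader.

```python
import csv
import hashlib
from pathlib import Path

DATA = Path("analytic_dataset.csv")        # Data: the analytic dataset (hypothetical file)
EXPECTED_SHA256 = "<published checksum>"   # Documentation: pins the exact file used

def verify_data(path: Path) -> None:
    """Confirm the reader's copy matches the dataset behind the paper."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != EXPECTED_SHA256:
        raise SystemExit("Dataset differs from the one used in the paper")

def make_table1(path: Path) -> dict:
    """Methods: the code behind a principal result (here, a simple mean)."""
    with path.open() as f:
        rows = list(csv.DictReader(f))
    values = [float(r["outcome"]) for r in rows]  # 'outcome' column is illustrative
    return {"n": len(values), "mean": sum(values) / len(values)}

if __name__ == "__main__":
    verify_data(DATA)
    print(make_table1(DATA))  # Distribution: rerun end to end by anyone
```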
A Research Pipeline (reprise)
A Licensing Spectrum for Data
Full access: Data can be used for any purpose
Attribution: Data can be used for any purpose so long as a specific citation is used
Share-alike: Data can be used to produce new findings --- any modifications/linkages must be made available under the same terms
Reproduction: Data can only be used for reproducing results and commenting on those results via a letter to the editor
(No Data Available)
Issues to Consider
Making datasets available
What is code? Does it exist?
Separating content from presentation
Technical sophistication of authors, publishers, readers; requirements?
Protecting authors’ original ideas
Logistics – data storage, accessibility
RR options considered at a medical journal
Assign and “advertise” an RR “score” reflecting how much information the author makes available.
Do we ask everyone? Do we penalize those who don’t or can’t share data? How do we prioritize among components of the score? Do we treat sophisticated and unsophisticated analysts differently?
Disclose each author’s data-sharing policy, including code sharing, just as journals disclose manuscript roles and conflicts of interest.
What do we mean by replication? (reprise)
Statistical significance?
Same results/conclusions from same original data?
Same results/conclusions from same analytic data?
Same R/C in ostensibly identical study?
Same R/C in similar but non-identical study?
Surrogate for whether underlying hypothesis is true?
Is combinability/heterogeneity a more profitable concept to explore?
Reasons for non-replication (reprise)
Hypothesis not true. {Prior / Posterior probability}
Misrepresented evidence. {Improper/selective analysis, selective reporting}
Different balance of unmeasured covariates across studies/designs {Quality of design, reliability of mechanistic knowledge}
Different handling/measurement of measured covariates across studies/designs. {Combinability / heterogeneity}
Fundamentally different question asked, i.e. new study is not a replicate of previous one. {Combinability / mechanistic knowledge}