Controversy Over the Significance Test Controversy


Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises

Controversy Over the Significance Test Controversy


Philosophy of Science Association Biennial Meeting, November 4, 2016

Deborah G Mayo (Virginia Tech)

2

“Science is in Crisis!”

• Once high-profile failures of replication went beyond the social sciences to genomics and bioinformatics, people started to worry about scientific credibility

• Replication research, methodological activism, fraudbusting, statistical forensics

3

Methodological Reforms without philosophy of statistics are blind

Proposed methodological reforms are being adopted: many are welcome (e.g., preregistration), some quite radical

Without a better understanding of the philosophical, statistical, and historical issues, many are likely to fail

4

American Statistical Association (ASA): Statement on P-values

“The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as…to ban P-values” (ASA 2016)

5

2015: The ASA brought together members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values

6

I was a ‘philosophical observer’ at the ASA P-value “pow wow”

7

“Don’t throw out the error control baby with the bad statistics bathwater” (The American Statistician)

8

Error Statistics

Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes

The inference may be in error

It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities)

Significance tests (R.A. Fisher) are a small part of an error statistical methodology

9

“p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function T = t(y) of the data, to be called the test statistic, such that

• the larger the value of T the more inconsistent are the data with H0;

• the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…the p-value corresponding to any t_obs as p = p(t_obs) = Pr(T ≥ t_obs; H0)”

(Mayo and Cox 2006, p. 81)
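As a concrete illustration of this definition (my own sketch, not from the slides), here is a minimal computation of the p-value for a one-sided Normal test with known σ, where d(x) = (x̄ − μ0)/(σ/√n) plays the role of the test statistic T; the values μ0 = 12, σ = 2, n = 100, x̄ = 12.4 are invented for illustration.

```python
# Hypothetical sketch of the p-value definition above for a one-sided Normal
# test T+ (H0: mu <= mu0 vs. H1: mu > mu0, sigma known). Numbers are invented.
import math

def p_value_t_plus(xbar, mu0, sigma, n):
    """Return (t_obs, p) with p = Pr(T >= t_obs; H0) for the standardized mean."""
    t_obs = (xbar - mu0) / (sigma / math.sqrt(n))        # observed test statistic
    p = 1 - 0.5 * (1 + math.erf(t_obs / math.sqrt(2)))   # upper-tail Normal probability
    return t_obs, p

t_obs, p = p_value_t_plus(xbar=12.4, mu0=12, sigma=2, n=100)
print(f"t_obs = {t_obs:.2f}, p-value = {p:.3f}")         # t_obs = 2.00, p ≈ 0.023
```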

10

Testing Reasoning

• If even larger differences than t_obs occur fairly frequently under H0 (the P-value is not small), there’s scarcely evidence of incompatibility with H0

• A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than t_obs were H0 true.

• This indication isn’t evidence of a genuine statistical effect H, let alone a scientific conclusion H*

Stat-Sub fallacy: H ⇒ H*

11

Neyman-Pearson (N-P) tests: a null and an alternative hypothesis, H0 and H1, that are exhaustive

H0: μ ≤ 12 vs. H1: μ > 12

• So this fallacy of rejection (H ⇒ H*) is impossible

Rejecting the null only indicates statistical alternatives (how discrepant from the null)

12

I’m not keen to defend many long-lampooned uses of significance tests

I introduce a reformulation of tests in terms of discrepancies (effect sizes) that are and are not severely tested

The criticisms are often based on misunderstandings; consequently so are many “reforms”

13

A paradox for significance test critics

Critic: It’s much too easy to get small P-values.

You: Why do they find it so difficult to replicate the small P-values in published reports? 

Is it easy or is it hard?

14

Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration on replication in psychology

OSC: Reproducibility Project: Psychology, 2011-15 (Science 2015): Crowd-sourced effort to replicate 100 articles (led by Brian Nosek, U. VA)

15

R.A. Fisher: it’s easy to lie with statistics by selective reporting, “political principle”

Sufficient finagling—cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere—may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence (verification fallacy)

(biasing selection effects, need to adjust P-values)

Note: Rejecting a null taken as support for some non-null claim H

16

• You report: such results would be difficult to achieve under the assumption of H0

• When in fact such results are common under the assumption of H0

The ASA (p. 131) correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)

You say Pr(P-value ≤ p_obs; H0) = p_obs (small)

But in fact Pr(P-value ≤ p_obs; H0) = high*

*Note: P-values measure distance from H0 in reverse (the smaller the P-value, the greater the indicated incompatibility)
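A minimal simulation of this gap (my own sketch; the 20 “looks” and 10,000 repetitions are arbitrary choices): if you run many analyses under H0 and report only the smallest P-value, the probability of reporting P ≤ .05 is far higher than .05.

```python
# Hypothetical sketch: cherry-picking the best of k analyses under H0 makes
# "p <= .05" common even though H0 is true. All numbers are illustrative.
import math
import random

def one_sided_p(z):
    """Upper-tail p-value for a standard Normal test statistic."""
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

random.seed(1)
k, trials, hits = 20, 10_000, 0              # 20 independent "looks", all under H0
for _ in range(trials):
    p_best = min(one_sided_p(random.gauss(0, 1)) for _ in range(k))
    hits += p_best <= 0.05                   # report only the best-looking result
print(f"Pr(reported p <= .05; H0) ≈ {hits / trials:.2f}")   # ≈ 1 - 0.95**20 ≈ 0.64
```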

17

Minimal (Severity) Requirement for evidence

If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H

(“too cheap to be worth having” Popper)

Such a test fails a minimal requirement for a stringent or severe test

My account: severe testing based on error statistics (requires reinterpreting tests)

18

Alters the role of probability: typically just two roles are considered

Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0.

(e.g., Bayesian, likelihoodist)—with regard for inner coherency

Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)
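A minimal sketch of the performance reading (my own example, with invented μ, σ, n): a 95% confidence procedure for a Normal mean covers the true μ in roughly 95% of long-run repetitions, whatever the particular sample at hand.

```python
# Hypothetical sketch of "performance": long-run coverage of a 95% confidence
# interval for a Normal mean with known sigma. Parameter values are invented.
import math
import random

random.seed(0)
mu, sigma, n, reps, z = 12.0, 2.0, 100, 10_000, 1.96
se = sigma / math.sqrt(n)
covered = 0
for _ in range(reps):
    xbar = random.gauss(mu, se)              # draw a sample mean from its sampling distribution
    covered += (xbar - z * se <= mu <= xbar + z * se)
print(f"long-run coverage ≈ {covered / reps:.3f}")   # ≈ 0.95
```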

19

What happened to using probability to assess error probing capacity and severity?

Neither “probabilism” nor “performance” directly captures it

Good long-run performance is a necessary, not a sufficient, condition for severity

20

A claim H is not warranted _______

• Probabilism: unless H is true or probable (or gets a probability boost, is made comparatively firmer)

• Performance: unless it stems from a method with low long-run error

• Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about H

21

Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs—

It’s that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data

22

• If you assume probabilism, error probabilities are relevant for inference only by misinterpretation. False!

• They play a key role in appraising well-testedness

• It’s crucial to be able to say: H is believable or plausible, but this is a poor test of it

• With this in mind, consider a continuation of the paradox of replication

23

Critic: It’s too easy to satisfy standard significance thresholds

You: Why do replicationists find it so hard to achieve significance thresholds (with preregistration)?

Critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs)

You: So, the replication researchers want methods that pick up on, adjust, and block these biasing selection effects.

Critic: Actually “reforms” recommend methods where the need to alter P-values due to data dredging vanishes

24

Likelihood Principle (LP)

The vanishing act links to a pivotal disagreement in the philosophy of statistics battles

In probabilisms (Bayes factors, posteriors), the import of the data is via the ratios of likelihoods of hypotheses

P(x0;H1)/P(x0;H0) for x0 fixed

They condition on the actual data; error probabilities take into account other outcomes that could have occurred but did not (the sampling distribution)
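The classic binomial illustration (my own sketch of a standard example, not from the slides): 9 successes and 3 failures yield the same likelihood ratio whether n = 12 was fixed in advance or sampling stopped at the 3rd failure, yet the P-values differ because they depend on the sample space (the stopping rule).

```python
# Hypothetical sketch: the likelihood ratio ignores the sampling plan; the p-value does not.
from math import comb

s, f, n = 9, 3, 12                    # 9 successes, 3 failures
theta0, theta1 = 0.5, 0.75            # H0 and an illustrative alternative

# Likelihood ratio: the binomial / negative-binomial constants cancel, so it is
# the same under either sampling plan.
lr = (theta1**s * (1 - theta1)**f) / (theta0**s * (1 - theta0)**f)
print(f"likelihood ratio (either plan): {lr:.2f}")

# Error probabilities depend on the plan:
p_fixed_n = sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k) for k in range(s, n + 1))
p_stop_at_3rd_failure = sum(comb(k + f - 1, k) * theta0**k * (1 - theta0)**f
                            for k in range(s, 1000))
print(f"p-value, n = 12 fixed:        {p_fixed_n:.3f}")              # ≈ 0.073
print(f"p-value, stop at 3rd failure: {p_stop_at_3rd_failure:.3f}")  # ≈ 0.033
```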

25

All error probabilities violate the LP (even without selection effects):

“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436) 

“The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz, 1977, p. 122)

26

Today’s Meta-research is not free of philosophy of statistics

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…

But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010)

(To his credit, he’s open about this; heads the Meta-Research Innovation Center at Stanford)

27

Sum-up so far:

• Main source of hand-wringing behind the statistical crisis in science stems from cherry-picking, hunting for significance, P-hacking

• Picked up by concern for performance or severity (but violated in abuses of tests)

• Reforms based on “probabilisms” enable rather than check unreliable results due to biasing selection effects

“Bayes factors can be used in the complete absence of a sampling plan” (Bayarri, Benjamin, Berger, Sellke 2016)

• Probabilists may find other ways to block bad inferences: background beliefs (for the discussion)

28

A few remarks on interconnected issues that cry out for philosophical insight…

29

1. Replication research

• Aims to use significance tests correctly

Preregistered, avoid P-hacking, designed to have high power

Free of “perverse incentives” of usual research: guaranteed to be published

30

Repligate

• Replication research has pushback: some call it methodological terrorism (enforcing good science or bullying?)

• I’m (largely) on the pro-replication side, but they need to go further…

31

Non-replications construed as simply weaker effects

• One of the non-replications, cleanliness and morality: Does unscrambling soap words make you less judgmental?

“Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. …”

32

…Turns out, it did. Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Chronicle of Higher Education)

• Focusing on the P-values ignores larger questions of measurement in psych & the leap from the statistical to the substantive: H ⇒ H*

• Increasingly the basis for experimental philosophy; needs philosophical scrutiny (free will/cheating: another non-replication)

33

2. Philosophy and History of Statistics

• What actually happened: N-P tests aimed to put Fisherian tests on logical ground (a theory of generating tests)

• All was hunky-dory until Fisher and Neyman began fighting (from 1935), almost entirely due to professional and personality disputes

• What’s read into what happened: a huge philosophical difference is read into their in-fighting

 

34

Long-run error control (performance) vs inference

• (Neyman) N-P methods → performance only, irrelevant for inference

• (Fisher) P-values are inferential (in some sense)

Contemporary work begins here

• The only way for P-values to be inferential is to misinterpret them as posterior probabilities!

• All of error statistics is deemed problematic

35

It’s the method, stupid

• Even if it were true that Fisher and Neyman held rival philosophies (inferential vs. behavioral-performance), we should look at what the methods do

(Beyond the “inconsistent hybrid”, Gigerenzer 2004)

• Instead, P-value users rob themselves of features from N-P tests they need (an animal called NHST)

• Many say P-values must be reconciled with posteriors in some way

36

3. Diagnostic Screening Model of Tests: urn of nulls

(focus on science-wise error rate performance)

• If we imagine randomly selecting hypotheses from an urn of nulls, 90% of which are true

• Consider just 2 possibilities: H0: no effect vs. H1: meaningful effect, all else ignored

• Take the prevalence of 90% as Pr(the H0 you picked is true) = .9, Pr(H1) = .1

• Rejecting H0 with a single (just) .05 significant result, cherry-picking to boot

37

The unsurprising result is that most “findings” are false:

Pr(H0 | finding with a P-value of .05) > .5

Pr(H0 | finding with a P-value of .05) ≠ Pr(P-value of .05 | H0)

A major source of confusion… Neyman on steroids (not an N-P type 1 error probability)

(Berger and Sellke 1987, Ioannidis 2005, Colquhoun 2014)
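A minimal back-of-the-envelope version of the screening arithmetic (the 90% prevalence is from the slide; the power of .8 and the p-hacked type 1 error rate of .3 are my own assumed values): whether Pr(H0 | reject) exceeds .5 depends on the power and on how badly selection effects inflate the actual type 1 error rate.

```python
# Hypothetical sketch of the diagnostic-screening ("urn of nulls") arithmetic.
# Prevalence is from the slide; power and the inflated alpha are assumptions.
pr_H0, pr_H1 = 0.90, 0.10          # 90% of the nulls in the urn are true
power = 0.80                        # assumed Pr(reject H0; H1 true)
for alpha in (0.05, 0.30):          # nominal .05 vs. an alpha inflated by cherry-picking
    pr_reject = alpha * pr_H0 + power * pr_H1
    pr_H0_given_reject = alpha * pr_H0 / pr_reject
    print(f"alpha = {alpha:.2f}:  Pr(H0 | reject) = {pr_H0_given_reject:.2f}")
# alpha = 0.05:  Pr(H0 | reject) = 0.36
# alpha = 0.30:  Pr(H0 | reject) = 0.77   (most "findings" false; note this is not Pr(p <= .05 | H0))
```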

38

4. Shifts in Philosophy of Statistics

Decoupling of methods from traditional philosophies

• Some Bayesians reject probabilism…

• They’re interested in “using modern statistics to implement the Popperian criteria of severe tests.” (Gelman and Shalizi 2013, p. 10)

• “Bayesian methods have seen huge advances in the past few decades. It is time for Bayesian philosophy to catch up…” (ibid. p. 79)

40

The ASA’s Six Principles

• (1) P-values can indicate how incompatible the data are with a specified statistical model

• (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

• (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold

• (4) Proper inference requires full reporting and transparency

• (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result

• (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

41

The ASA’s Six Principles

• (1) P-values can indicate how incompatible the data are with a specified statistical model

• (2) P-values do NOT measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

• (3) Scientific conclusions and business or policy decisions should NOT be based only on whether a p-value passes a specific threshold

• (4) Proper inference requires full reporting and transparency

• (5) A p-value, or statistical significance, does NOT measure the size of an effect or the importance of a result

• (6) By itself, a p-value does NOT provide a good measure of evidence regarding a model or hypothesis

42

Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006)

FEV/SEV, insignificant result: A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.

FEV/SEV, significant result: A statistically significant d(x0) is evidence of a discrepancy δ from H0, if and only if there is a high probability the test would have yielded d(X) < d(x0) were a discrepancy as large as δ absent.

43

Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, σ known

(FEV/SEV): If d(x) is not statistically significant, then μ < M0 + kεσ/√n passes the test T+ with severity (1 − ε)

(FEV/SEV): If d(x) is statistically significant, then μ > M0 − kεσ/√n passes the test T+ with severity (1 − ε)

where P(d(X) > kε) = ε and M0 is the observed sample mean
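A minimal numerical sketch of the significant-result case (my own example, reading M0 as the observed sample mean; μ0 = 12, σ = 2, n = 100, M0 = 12.4 are invented): the claim μ > M0 − kεσ/√n comes out with severity 1 − ε.

```python
# Hypothetical sketch of severity for test T+ (H0: mu <= mu0 vs. H1: mu > mu0,
# sigma known), reading M0 as the observed sample mean. Numbers are invented.
import math

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, sigma, n = 12.0, 2.0, 100
se = sigma / math.sqrt(n)                 # sigma/sqrt(n) = 0.2
M0 = 12.4                                 # observed mean: d(x0) = (M0 - mu0)/se = 2.0, significant

# SEV(mu > mu1) = Pr(d(X) <= d(x0); mu = mu1) = Phi((M0 - mu1)/se);
# taking mu1 = M0 - k_eps*se gives severity 1 - eps.
for eps, k_eps in ((0.05, 1.645), (0.16, 1.0), (0.5, 0.0)):   # P(Z > k_eps) = eps
    mu1 = M0 - k_eps * se
    sev = Phi((M0 - mu1) / se)
    print(f"mu > {mu1:.2f} passes T+ with severity {sev:.2f}  (1 - eps = {1 - eps:.2f})")
```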

44

References

• Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses." Journal of Mathematical Psychology 72: 90-103. 

• Berger, J. O. 2003 'Could Fisher, Jeffreys and Neyman Have Agreed on Testing?' and 'Rejoinder,', Statistical Science 18(1): 1-12; 28-32.

• Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.

• Berger, J. O. and Sellke, T. 1987. 'Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion & Rejoinder)', Journal of the American Statistical Association 82(397): 112–22; 135-9.

• Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033.

• Efron, B. 2013. 'A 250-Year Argument: Belief, Behavior, and the Bootstrap', Bulletin of the American Mathematical Society 50(1): 126-46.

• Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T. and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and Robustness. New York: Academic Press.

• Colquhoun, D. 2014. 'An Investigation of the False Discovery Rate and the Misinterpretation of P-values', Royal Society Open Science, 1(3): 140216 (16 pages).

45

• Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.

• Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University Press.

• Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.

• Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.

• Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder’” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80.

• Gigerenzer, G. 2004. “Mindless Statistics,” Journal of Socio-Economics 33(5): 587-606.

• Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Kruger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.

• Gilbert, D. Twitter post: https://twitter.com/dantgilbert/status/470199929626193921

• Gill, comment: On the “Suspicion of Scientific Misconduct by Jens Forster by Neuroskeptic May 6, 2014 on Discover Magazine Blog: http://blogs.discovermagazine.com/neuroskeptic/2014/05/06/suspicion-misconduct-forster/#.Vynr3j-scQ0.

• Goldacre, B. 2008. Bad Science. HarperCollins Publishers.

• Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature 530(7588); 7; online 04Feb2016.

46

• Goodman SN. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor,” Annals of Internal Medicine 1999; 130:1005 –1013.

• Handwerk, B. 2015. “Scientists Replicated 100 Psychology Studies, and Fewer than Half Got the Same Results.” Smithsonian Magazine (August 27, 2015) http://www.smithsonianmag.com/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/?no-ist

• Hasselman, F. and Mayo, D. 2015, April 17. “seveRity” (R-program). Retrieved from osf.io/k6w3h

• Ioannidis, J. 2005.  'Why most published research findings are false', PLoS Med 2(8):0696-0701.

• Levelt Committee, Noort Committee, Drenth Committee. 2012. 'Flawed science: The fraudulent research practices of social psychologist Diederik Stapel', Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. (https://www.commissielevelt.nl/)

• Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

• Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.

• Mayo, D. G. 2016. 'Don't Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary', The American Statistician, online March 7, 2016. http://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108.

• Mayo, D. G. Error Statistics Philosophy Blog: errorstatistics.com

47

• Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

• Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.

• Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.

• Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–300.

• Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

• Open Science Collaboration (Nosek, B. A., et al.). 2015. “Estimating the Reproducibility of Psychological Science.” Science 349(6251).

• Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul. Acad. Pol.Sci. 73-96.

48

• Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

• Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

• Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

• Simonsohn, U. 2013. “Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone,” Psychological Science 24(10): 1875-1888.

• Smithsonian Magazine (see Handwerk)

• Trafimow, D. and Marks, M. 2015. “Editorial,” Basic and Applied Social Psychology 37(1): 1-2.

• Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician 70(2): 129-133.

• Link to ASA statement & Commentaries (under supplemental): http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108

 
