A few weeks ago, Nature published an article summarising the various measures and counter-measures suggested to improve statistical inferences and science as a whole (Chawla, 2017). It detailed the initial call to lower the significance threshold from 0.05 to 0.005 (Benjamin et al., 2017) and the paper published in response (Lakens et al., 2017). It was a well-written article, with one minor mistake: an incorrect definition of a p-value[note]Thank you J.P. de Ruiter for highlighting this.[/note]:
The two best sources for the correct definition of a p-value (along with its implications and examples of how a p-value can be misinterpreted) are Wasserstein & Lazar (2016)[note]This is the publication of the American Statistical Association’s statement on p-values.[/note] and its supplementary paper, Greenland et al. (2016). A p-value has been defined as: “a statistical summary of the compatibility between the observed data and what we would predict or expect to see if we knew the entire statistical model (all the assumptions used to compute the P value) were correct” (Greenland et al., 2016). To put it another way, it tells us the probability of finding the data you have, or more extreme data, assuming that the null hypothesis and all the other assumptions (about randomness in sampling, treatment, assignment, loss, and missingness, the study protocol, etc.) are true.[note]In formal notation, it’s P(D≥d; H0), where P means probability, D≥d means the data observed or more extreme data, ; means assuming, and H0 means the null hypothesis.[/note] The definition provided in the Chawla article is incorrect because it states “the smaller the p-value, the less likely it is that the results are due to chance”. This gets things backwards: the p-value is a probability calculated on the assumption that the null hypothesis (among other things) is true, so it cannot also tell you the probability that this assumption is true.[note]The incorrect definition is another way of stating that the p-value is the probability of the hypothesis given the data, or P(H|d).[/note] Joachim Vandekerckhove and Ken Rothman give further evidence as to why this definition is incorrect:
What’s the big deal?
I posted the above photo to the Facebook group Psychological Methods Discussion Group, where it prompted some discussion. What interested me most was one comment by J. Kivikangas, which I have captured in the screenshot below.
I responded with[note]I’ve paraphrased my responses as there were some irrelevant comments. You can see the full discussion on Facebook here if you’d like to check.[/note]: “I’m a non-methodologist as well… but my view is that the incorrect understanding will lead to far greater confidence in a result than is warranted. You may believe your experiment shows strong support for the alternative hypothesis, or that the result provides strong support for the theory being tested (https://twitter.com/chrisdc77/status/910614437995991042). Because of this, you may pursue lines of research that appear to be well validated but aren’t. You may believe an effect is strong enough that you try to generalise it to other situations (or, more likely, presume that it can generalise) because you believe the hypothesis has been so powerfully demonstrated. These can have long-term impacts. Of course, I may be catastrophising about the negative impact, and I’m open to being convinced otherwise.”
J. Kivikangas replied:
I then gave my last significant contribution to the discussion: “Using the traditional 0.05 cut-off point for significance, E.J. Wagenmakers & Quentin Gronau showed that p-values close to the cut-off point are only very weak evidence against the null https://www.bayesianspectacles.org/. So that’s a specific (yet prevalent) example of people rejecting the null, thinking they have found an effect when in fact there is only very weak evidence in favour of it. Gigerenzer et al. give an example that I think relates: using NHST without thinking about or understanding it led to the wrong conclusion being drawn http://library.mpib-berlin.mpg.de/ft/gg/gg_null_2004.pdf. Your analogy appears to be solid to me, but don’t take that as a meaningful seal of approval.”
I then asked on Twitter if anyone could provide a better answer to the question. Posted below are their responses.
Daniel Lakens also gave a link to his blog post[note]Which I highly recommend.[/note] where he explains how p-values traditionally classed as significant (e.g. 0.03 < p < 0.05) can be more likely under H0 than under H1.
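To see how a "significant" p-value can be more likely under H0 than under H1, here is a minimal sketch. It is not taken from Lakens' post: it assumes a two-sided z-test, and the function names and the two power levels (80% and 99%) are hypothetical choices made purely for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def prob_p_below(alpha, noncentrality):
    """P(p < alpha) for a two-sided z-test whose statistic is N(noncentrality, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - noncentrality)
    lower_tail = STD_NORMAL.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail


def prob_p_between(low, high, noncentrality):
    """P(low < p < high) for the same test."""
    return prob_p_below(high, noncentrality) - prob_p_below(low, noncentrality)


def noncentrality_for_power(power, alpha=0.05):
    """Approximate shift of the test statistic that gives the requested power.

    Assumption: the negligible rejection probability in the opposite tail is ignored.
    """
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


# Under H0 the p-value is uniform, so this interval has probability 0.05 - 0.03.
under_h0 = prob_p_between(0.03, 0.05, 0.0)
# Under H1 the answer depends on the (assumed) power of the study.
under_h1_power_80 = prob_p_between(0.03, 0.05, noncentrality_for_power(0.80))
under_h1_power_99 = prob_p_between(0.03, 0.05, noncentrality_for_power(0.99))

print(f"P(0.03 < p < 0.05 | H0)            = {under_h0:.4f}")
print(f"P(0.03 < p < 0.05 | H1, 80% power) = {under_h1_power_80:.4f}")
print(f"P(0.03 < p < 0.05 | H1, 99% power) = {under_h1_power_99:.4f}")
```

Under these assumptions, the interval is more probable under H1 than under H0 at 80% power, but less probable under H1 than under H0 at 99% power. When a study is very well powered, almost all of its p-values should fall far below 0.03, so a p-value just under 0.05 fits the null better than the alternative.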
Ben Prytherch remarked that: “It isn’t just that small p-values provide weaker evidence than one would think under the false interpretation of “probability these results were due to chance”. It’s also that large p-values don’t provide anywhere near the kind of evidence in support of the null that this interpretation implies. Imagine getting p = 0.8. The popular misinterpretation of the p-value would say that there’s an 80% chance that the null is true. But this is nuts, especially considering how many nulls are point nulls that can’t possibly be true (e.g. “the population correlation is precisely zero”).”
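The point about large p-values can be illustrated with a small simulation. This is a sketch under stated assumptions rather than anything from the discussion itself: it assumes a two-sided z-test, and `true_shift=0.5` (a true effect that moves the test statistic by half a standard error) is a hypothetical value chosen only to represent a false null with a small effect.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(z):
    """P(|Z| >= |z|) assuming the null is true: the observed or more extreme data."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def share_of_large_p(true_shift, n_studies=20000, threshold=0.8, seed=1):
    """Fraction of simulated studies whose p-value is at least `threshold`.

    Each study is reduced to its z statistic, drawn from N(true_shift, 1).
    true_shift = 0 means the null is exactly true.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    large = 0
    for _ in range(n_studies):
        z = rng.gauss(true_shift, 1.0)
        if two_sided_p(z) >= threshold:
            large += 1
    return large / n_studies


share_null_true = share_of_large_p(0.0)   # theory: 0.20, since p is uniform under H0
share_null_false = share_of_large_p(0.5)  # hypothetical small true effect

print(f"Share of studies with p >= 0.8, null true:  {share_null_true:.3f}")
print(f"Share of studies with p >= 0.8, null false: {share_null_false:.3f}")
```

In this setup, p-values of 0.8 or above turn up almost as often when the null is false as when it is true, so seeing one says very little about which situation you are in. It certainly does not imply an 80% chance that the null is true.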
J. Kivikangas commented back:
This is as far as the conversation went for the time being. If you have any points you would like to make on the topic, please comment below, and I will add them.
So where does this leave us?
Whilst I don’t think we persuaded J. Kivikangas about the importance of understanding the precise definition of a p-value, I still believe it is useful. Beyond the inherent value of knowing the correct definition, I feel it has real-world significance (as I detail above), and therefore all researchers should understand what a p-value actually is. However, I can understand why others might disagree if they believe it doesn’t negatively impact the research they perform.
Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.J., Berk, R., … Johnson, V. (2017). Redefine statistical significance. Nature Human Behaviour. Available at: https://doi.org/10.17605/OSF.IO/MKY9J
Chawla, D.S. (2017). ‘One-size-fits-all’ threshold for P values under fire. Nature News. Available at: http://www.nature.com/news/one-size-fits-all-threshold-for-p-values-under-fire-1.22625#/b1 [accessed on: 02/10/2017].
Gigerenzer, G.; Krauss, S.; & Vitouch, O. (2004). The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. In: Kaplan, D. (Ed.). (2004). The Sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.
Greenland, S.; Senn, S.J.; Rothman, K.J. et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31, 337. https://doi.org/10.1007/s10654-016-0149-3
Lakens, D.; Adolfi, F.; Albers, C.; … Zwaan, R. (2017). Justify Your Alpha: A Response to “Redefine Statistical Significance”. DOI: 10.17605/OSF.IO/9S3Y6. Available at: https://psyarxiv.com/9s3y6 [accessed on: 02/10/2017]
Wagenmakers, E.J. & Gronau, Q. (2017). Bayesian Spectacles. Available at: https://www.bayesianspectacles.org/ [accessed on: 02/10/2017]
Wasserstein, R.L. & Lazar, N.A. (2016) The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70 (2), 129-133, DOI: 10.1080/00031305.2016.1154108