A few weeks ago, Nature published an article summarising the various measures and counter-measures suggested to improve statistical inferences and science as a whole (Chawla, 2017). It detailed the initial call to lower the significance threshold from 0.05 to 0.005 (Benjamin et al., 2017) and the paper published in response (Lakens et al., 2017). It was a well-written article, with one minor mistake: an incorrect definition of a p-value^{1}:

The two best sources for the correct definition of a p-value (along with its implications and examples of how a p-value can be misinterpreted) are Wasserstein & Lazar (2016)^{2} and its supplementary paper Greenland et al. (2016). A p-value has been defined as: “a statistical summary of the compatibility between the observed data and what we would predict or expect to see if we knew the entire statistical model (all the assumptions used to compute the P value) were correct” (Greenland et al., 2016). To put it another way, it tells us the probability of finding the data you have, or more extreme data, assuming the null hypothesis (along with all the other assumptions about randomness in sampling, treatment assignment, loss, and missingness, the study protocol, etc.) is true.^{3} The definition provided in the Chawla article is incorrect because it states “the smaller the p-value, the less likely it is that the results are due to chance”. This gets things backwards: the p-value is a probability deduced from a set of assumptions (e.g. that the null hypothesis is true), so it can’t also tell you the probability of those assumptions at the same time.^{4} Joachim Vandekerckhove and Ken Rothman give further evidence as to why this definition is incorrect:
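To make the definition concrete, here is a minimal simulation sketch (my own illustration in Python; the sample size, seed, and replication count are arbitrary choices, not taken from any of the cited papers). It computes p-values for samples drawn from a population where the null hypothesis really is true, showing that the p-value is calculated entirely under the null model:

```python
import math
import random

def p_value_two_sided(z):
    # Two-sided p-value for a z statistic: P(|Z| >= |z|) under H0,
    # using the standard normal CDF via math.erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(1)
n = 30
p_values = []
for _ in range(2000):
    # Sample from a population where the null (mean = 0, sd = 1) is true.
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) * math.sqrt(n)  # z = mean / (sigma / sqrt(n)), sigma = 1
    p_values.append(p_value_two_sided(z))

# When H0 (and every other assumption) holds, p is uniform on [0, 1]:
# roughly 5% of replications land below 0.05 by design.
frac_below_05 = sum(p < 0.05 for p in p_values) / len(p_values)
print(round(frac_below_05, 3))
```

Because every assumption holds, the p-values come out uniform: small p-values occur about 5% of the time by construction, which is why a p-value cannot simultaneously be the probability that the null is true.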

This definition is specifically wrong if the alternative hypothesis is unlikely a priori, which is reasonably often.

— J. Vandekerckhove (@VandekerckhoveJ) October 4, 2017

Thinking p is the probability of the null is a form of base rate neglect fallacy, which gets worse as the base rate gets lower.

— J. Vandekerckhove (@VandekerckhoveJ) October 4, 2017

…Here is a true story that shows why a P-value cannot tell you whether the null hypothesis is correct. pic.twitter.com/OjXPuIK25o

— Ken Rothman (@ken_rothman) October 4, 2017
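Vandekerckhove’s base-rate point can be put into numbers with a quick calculation (a sketch of mine; the base rate, alpha, and power below are hypothetical values chosen purely for illustration, not taken from the tweets):

```python
# All three numbers are hypothetical, chosen only to illustrate
# base rate neglect; none come from the tweets quoted above.
base_rate_h1 = 0.10   # prior probability that a tested effect is real
alpha = 0.05          # P(p < .05 | H0 true)
power = 0.80          # P(p < .05 | H1 true)

# Bayes' rule: P(H0 | significant result)
p_sig = alpha * (1 - base_rate_h1) + power * base_rate_h1
p_h0_given_sig = alpha * (1 - base_rate_h1) / p_sig
print(round(p_h0_given_sig, 2))  # 0.36: over a third of "significant" findings are false
```

With these numbers, the null is true for 36% of significant results, not 5%. Drop the base rate to 1% and the same arithmetic pushes P(H0 | significant) to roughly 86%, which is exactly the fallacy getting “worse as the base rate gets lower”.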

What’s the big deal?

I posted the above photo to the Facebook group Psychological Methods Discussion Group where it prompted some discussion. What interested me the most was one comment by J. Kivikangas which I have screen capped below.

I responded with^{5}: “I’m a non-methodologist as well… but my view is that the incorrect understanding will lead to far greater confidence in a result than is warranted. You may believe your experiment shows strong support for the alternative hypothesis, or that the result provides strong support for the theory being tested (https://twitter.com/chrisdc77/status/910614437995991042). This may lead you to pursue lines of research that appear to be well validated but aren’t. It may lead you to believe the effect is strong enough that you try and generalise it to other situations (or more likely presume that it can generalise) because you believe the hypothesis has been so powerfully demonstrated. These can have long-term impacts. Though of course I may be catastrophising about the negative impact, I’m open to being convinced otherwise.”

J. Kivikangas replied:

I then gave my last significant contribution to the discussion: “Using the traditional 0.05 cut-off point for significance, E.J. Wagenmakers & Quentin Gronau showed that p-values close to the cut-off point are only very weak evidence against the null https://www.bayesianspectacles.org/. So that’s a specific (yet prevalent) example of people rejecting the null, thinking they have found an effect when in fact there is very weak evidence in favour of it. Gigerenzer et al. give an example that I think relates: using NHST without thinking or understanding it led to the wrong conclusion being drawn http://library.mpib-berlin.mpg.de/ft/gg/gg_null_2004.pdf. Your analogy appears to be solid to me, but don’t take that as a meaningful seal of approval”.

I then asked on Twitter if anyone could provide a better answer to the question. Several people obliged, and their responses are posted below.

of finding p = 0.04 until it is LESS likely under H1 than H0. Should be known to people. But you *still* have a 5% error rate.

— Daniël Lakens (@lakens) September 22, 2017

Problem with not knowing definition, is you don't know which questions you are asking. That's a problem if you do science.

— Daniël Lakens (@lakens) September 22, 2017

Daniël Lakens also gave a link to his blog post^{6} where he explains how p-values traditionally classed as significant (e.g. 0.03 < p < 0.05) can be more likely under H0 than H1.
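Lakens’s point can be reproduced with a small simulation (a sketch of mine, not code from his post; the sample size, effect size, and seed are illustrative assumptions). With a large effect and very high power, p-values between 0.03 and 0.05 turn up more often when the null is true than when the alternative is:

```python
import math
import random

def p_two_sided(z):
    # Two-sided p-value for a z statistic under the null.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulate_p_values(true_mean, n, reps, rng):
    ps = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1) for _ in range(n)]
        z = (sum(sample) / n) * math.sqrt(n)  # known sigma = 1
        ps.append(p_two_sided(z))
    return ps

def frac_in_window(ps, lo=0.03, hi=0.05):
    return sum(lo < p < hi for p in ps) / len(ps)

rng = random.Random(7)
n, reps = 100, 20000
p_h0 = simulate_p_values(0.0, n, reps, rng)  # null is true
p_h1 = simulate_p_values(0.5, n, reps, rng)  # large effect, power ~ .999

# "Significant" p-values in (0.03, 0.05) are several times more common
# under H0 (where they occur 2% of the time) than under this H1.
print(frac_in_window(p_h0), frac_in_window(p_h1))
```

Under this (admittedly high-powered) alternative, a p-value just under 0.05 is evidence *for* the null in likelihood terms, even though the standard ritual rejects it.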

Agree with what you and Daniël wrote. + H0 vs. HA is purely a numerical matter; HA may be true for diff reasons (e.g. expectancy fx, bias)

— Jan Vanhove (@janhove) September 22, 2017

J. Kivikangas commented back:

This is as far as the conversation went for the time being. If you have any points you would like to make on the topic, please comment below, and I will add them.

So where does this leave us?

Whilst I don’t think we persuaded J. Kivikangas about the importance of understanding the precise definition of a p-value, I still believe it is useful. Beyond the inherent value of knowing the correct definition I feel it has real world significance (as I detail above) and therefore all researchers should understand what a p-value actually is. However, I can understand why others might disagree if they believe it doesn’t negatively impact the research they perform.

References:

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.J., Berk, R., … Johnson, V. (2017). Redefine statistical significance. *Nature Human Behaviour*. Available at: https://doi.org/10.17605/OSF.IO/MKY9J

Chawla, D.S. (2017). ‘One-size-fits-all’ threshold for P values under fire. *Nature News*. Available at: http://www.nature.com/news/one-size-fits-all-threshold-for-p-values-under-fire-1.22625#/b1 [accessed on: 02/10/2017].

Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. In D. Kaplan (Ed.), *The Sage handbook of quantitative methodology for the social sciences* (pp. 391–408). Thousand Oaks, CA: Sage.

Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., & Altman, D.G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. *European Journal of Epidemiology*, 31, 337–350. https://doi.org/10.1007/s10654-016-0149-3

Lakens, D.; Adolfi, F.; Albers, C.; … Zwaan, R. (2017). Justify Your Alpha: A Response to “Redefine Statistical Significance”. DOI: 10.17605/OSF.IO/9S3Y6. Available at: https://psyarxiv.com/9s3y6 [accessed on: 02/10/2017]

Wagenmakers, E.J. & Gronau, Q. (2017). Bayesian Spectacles. Available at: https://www.bayesianspectacles.org/ [accessed on: 02/10/2017]

Wasserstein, R.L. & Lazar, N.A. (2016) The ASA’s Statement on p-Values: Context, Process, and Purpose. *The American Statistician*, 70 (2), 129-133, DOI: 10.1080/00031305.2016.1154108

1. Thank you J.P. de Ruiter for highlighting this.
2. This is the publication of the American Statistical Association’s statement on p-values.
3. In probability notation, it’s P(D≥d|H0), where P means probability, D≥d means the data observed or more extreme data, | means “conditional upon” or “given”, and H0 means the null hypothesis.
4. The incorrect definition is another way of stating that the p-value is the probability of the hypothesis given the data, i.e. P(H0|D), which is not what a p-value provides.
5. I’ve paraphrased my responses as there were some irrelevant comments. You can see the full discussion on Facebook here if you’d like to check.
6. Which I highly recommend.

Jose Perezgonzalez (10 October 2017)

I’m a non-methodologist: does it matter if my definition is slightly wrong?

I actually see another mistake with the Nature article, one which is relevant to this discussion. Indeed, it not only defines the p-value as informing about the probability of the null hypothesis, but it also defines the null hypothesis as the hypothesis of chance.

Kivikangas’s questions raise an important concern, especially for practitioners (a.k.a. non-methodologists, with little to no interest in the philosophy of statistics or of science): basically, can I get to the heart of the issue without the unnecessary fluff? As far as an ideal practitioner goes, he also poses quite specific assertions which are eminently true, such as that p-values (frequentist statistics) have an underlying positive correlation with Bayes factors (Bayesian statistics), so they show some degree of evidence even if it is not as strong as the Jeffreysians would like it to be. As posed in his examples, if all you want to measure is a difference, then even the “wrong” measuring tool may give an indication of such a difference.

Kivikangas then finishes with something a practitioner would like: a rule-of-thumb, and nothing else. Actually, he is quite wise in this instance as well, as pretty much everyone works with such rules-of-thumb, including the Bayesians, the frequentists, and everyone in between. This is why Chawla (2017) also makes a mistake in his definition of p-values and of the null hypothesis.

The answer I can offer Kivikangas (and PsychBrief) is from a Human Factors point of view. A correct definition of p-values (and of anything else) leads to a better and more nuanced understanding and interpretation of the material at hand, and therefore to more precise communication of which conclusions are or are not warranted. It also helps prevent errors of interpretation and, worse, mistakes in implementation (see, for example, some real errors discussed in Perezgonzalez, 2016a, https://mro.massey.ac.nz/handle/10179/10444, and 2016b, https://mro.massey.ac.nz/handle/10179/10445).

For example, the null hypothesis is not the hypothesis of chance, as Chawla says; therefore, smaller p-values cannot be evidence that results are less likely to be due to random variation. The null hypothesis only says that there is no difference or correlation (i.e., that the effect size is ‘zero’). The null hypothesis also provides a probability model of how much random samples may show an effect size different from ‘zero’ when taken from the original population (a.k.a. the probability distribution). Thus, chance resides in the sampling, not in the null hypothesis itself. (See also Perezgonzalez, 2017, https://doi.org/10.3389/fpsyg.2017.01715.) When we use the incorrect definition, non-significant results are ‘understood’ as due to chance, while significant results are ‘understood’ as due to systematic effects, which is ridiculous because a test of significance does not separate random results from systematic ones.

From here we can move on to the definition of a p-value. The correct definition would see the p-value as a descriptive measure (where is the sample located within the probability model posed by the null hypothesis?). That is, the p-value is more of a percentile than a probability (see Perezgonzalez, 2015, https://doi.org/10.3389/fpsyg.2015.00341). Thus, it shows no ‘evidence’ of anything; the evidence is in the mind of the researcher or practitioner, not in the statistic itself.
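Perezgonzalez’s percentile reading can be sketched in code (a toy example of mine; the effect size, sample size, and seed are arbitrary choices). The p-value is simply one minus the percentile rank of the observed statistic within the null model’s distribution:

```python
import math
import random

rng = random.Random(3)
n = 25
# A hypothetical observed sample (true mean 0.4 is an arbitrary choice).
observed = [rng.gauss(0.4, 1) for _ in range(n)]
obs_mean = sum(observed) / n

# Simulate the distribution of the sample mean under the null (mean 0, sd 1).
null_means = sorted(
    sum(rng.gauss(0, 1) for _ in range(n)) / n for _ in range(10000)
)

# Percentile rank of the observed mean within the null model ...
rank = sum(m < obs_mean for m in null_means) / len(null_means)
# ... and the one-sided p-value is just what lies beyond that percentile.
p_one_sided = 1 - rank
print(round(rank, 3), round(p_one_sided, 3))
```

Nothing here mentions the probability of any hypothesis: the p-value only locates the sample within the distribution posed by the null model.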

For a p-value to work as evidence proper, there needs to be an alternative hypothesis under which the same p-value also has a probability. A simple test of significance (Fisher’s) does not even take this alternative hypothesis into account, but it implicitly exists, as it partly determines the level of significance and the reason why the null gets rejected. However, as Lakens suggests, the fact that a result is significant is not necessarily evidence that it supports the alternative hypothesis. You need an explicit alternative in order to make sure that the research has enough power and, therefore, that any statistically significant result shows more evidence towards the alternative than towards the null (Neyman-Pearson’s tests of acceptance).
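The Neyman-Pearson requirement of an explicit alternative can be illustrated with a small power calculation (a sketch under simplifying assumptions of my own: a two-sided one-sample z-test with known sigma = 1 and a hypothetical effect of 0.5 SD):

```python
import math

def norm_cdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def z_test_power(effect, n, z_crit=1.959963984540054):
    """Power of a two-sided one-sample z-test with sigma = 1:
    P(|Z| >= z_crit | true standardised effect), at alpha = .05."""
    nc = effect * math.sqrt(n)  # noncentrality of the z statistic
    return (1 - norm_cdf(z_crit - nc)) + norm_cdf(-z_crit - nc)

# Only once the alternative is quantified (here, effect = 0.5 SD,
# a hypothetical value) can we say whether a design has enough power.
print(round(z_test_power(0.5, 30), 2))   # n = 30: around .78
print(round(z_test_power(0.5, 100), 3))  # n = 100: near .999
```

Without committing to some effect size for the alternative, “power” is simply undefined, which is why Fisher’s bare test of significance cannot by itself say how much a significant result favours the alternative.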

A Bayesian approach (e.g., Jeffreys’s Bayes factors) also works with two hypotheses (albeit BFs compare two models). So in both cases, Neyman-Pearson’s tests and Bayesian approaches can inform about some evidence in support of one or the other hypothesis. With Fisher’s approach, a significant result often shows such evidence, but you may well be wrong in assuming so. Therefore, knowing which tool you are using and its limitations can help you understand research better and prevent serious mistakes in the interpretation and communication of results.

So… does the scale (Fisher) inform of a difference as well as a tape measure in centimetres (Neyman-Pearson) or in inches (Jeffreys), well enough for a practitioner? Sometimes it may, sometimes it may not. But why risk it?

The solution for a practitioner (and most researchers) is to create rules-of-thumb that work with a correct definition but are also simple and quick to apply. My rules-of-thumb would be: (1) to define the p-value as a percentile on the distribution of the null hypothesis; (2) to quantify an alternative hypothesis when interpreting evidence (this can be done using sensitiveness (Perezgonzalez, 2017, https://osf.io/preprints/psyarxiv/qd3gu), Neyman-Pearson’s approach (a.k.a. a power analysis), or Jeffreys’s approach (a.k.a. Bayes factors); the latter can be easily implemented using JASP, a statistics package that allows for both frequentist and BF analyses); and (3) to only talk of testing hypotheses when using a fully developed Bayesian approach (thus, neither with simple tests of significance, nor with tests of acceptance, nor with Bayes factors).
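As an illustration of rule (2), here is a toy Bayes factor with a closed form (my own sketch, not output from JASP): for a binomial experiment, H0 fixes the success rate at 0.5 while the alternative spreads it uniformly over (0, 1), so both marginal likelihoods can be written down exactly:

```python
import math

def bf01_binomial(k, n):
    """Bayes factor BF01 for H0: theta = 0.5 versus H1: theta ~ Uniform(0, 1),
    given k successes in n trials.
    P(k | H0) = C(n, k) * 0.5**n and P(k | H1) = 1 / (n + 1)."""
    return (n + 1) * math.comb(n, k) * 0.5 ** n

# 60 successes in 100 trials sits near the conventional significance
# threshold, yet the Bayes factor is close to 1: barely any evidence
# for either hypothesis.
print(round(bf01_binomial(60, 100), 2))
# A perfectly null-like result (50/100) favours H0 about 8 to 1.
print(round(bf01_binomial(50, 100), 1))
```

The 1/(n + 1) term comes from integrating the binomial likelihood against the uniform prior; in practice one would use JASP or similar software rather than hand-rolling this, but the toy case shows how a quantified alternative turns a p-value-adjacent result into an evidence statement.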

Thank you for such a detailed response, lots to take in and learn. I’m going to read the links you provided and make sure I fully understand what a p-value is.

Thank you for an enlightening answer! I have seen the normal distribution and the percentile interpretation many times, but this was the first time something about it clicked for me. This is now my go-to definition!

In general, you seem to support the (reformulated) point I was trying to make: that as the correct definition is so abstract and difficult, people will not use it, so it’s better to rely on rules-of-thumb that convey the crucial idea without tripping us up on the unintuitive exact wording. Adding the two additional rules – an alternative hypothesis and the approaches that utilize it, and that really testing a hypothesis needs a fully Bayesian approach (which I don’t yet know what it is) – gives good context to show the limits of the p-value and what else there is. This is how p-values should be taught in universities!

I feared my question was stupid, but it seems that once again it was better to still present it; I actually learned something.

Mathematical training gives us the mindset to think about the existence of things, rather than how common or rare that existence might be.

The twitter quotes from D. Lakens show the existence of cases where there is (1) a specific alternate hypothesis (i.e. the hypothesis specifies a value for a model parameter, not only that it differs from the null hypothesis), (2) a statistical test with extremely high power, over 99%, and (3) a dataset that is large, perhaps even larger than is strictly needed.

Under these conditions, some p-values slightly below 0.05 would actually support the null hypothesis rather than the research hypothesis, and the simplistic use of p-values gives the wrong result. To be able to recognize and correctly deal with these conditions, one should indeed have a detailed understanding of the p-value.

But how common are these conditions? How often have you faced these conditions in your career? How often would a random researcher face these conditions? Every year? Or, on the average, not even once in their lifetime?

Isn’t it overwhelmingly more common that people work with tests that have statistical power that is rather lower than desired (rather than over 99%), or with non-specific alternate hypotheses so that statistical power is not even defined, and with datasets that are rather too small than too large? So does it really matter if we use only the simplistic understanding of the p-value?

Psychology has a replication crisis, but it has to do with detecting false signals at the 0.05 significance level from too-small datasets. It has to do with publication bias. I don’t think it has to do with misinterpreting the meaning of a p-value when applying almost-too-strong statistical power to almost-too-large datasets.