“This emphasises the importance of separating the effect size from variability and not combining them into standardised effect sizes. The variability of the study is affected by the standard deviation (SD). If a study has a large SD (and thus higher variability) this reduces the power estimate, independent of the effect size you are looking for.”

Surely the effect size (if we’re still talking about Cohen’s d) is already a function of the smallest detectable difference and the standard deviation, so, for a given effect size, isn’t power independent of variability by definition? Or have I misunderstood what you mean by effect size in that section?
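To make the question concrete, here is a rough sketch of how power behaves when the raw difference and the SD are kept separate versus folded into Cohen’s d. It assumes a two-sided, two-sample test with equal group sizes and uses a normal approximation; the function name `approx_power` and the example numbers are mine, for illustration only.

```python
from statistics import NormalDist

def approx_power(raw_diff, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test.

    Illustrative helper (not from the post): assumes equal group sizes
    and a known, common standard deviation.
    """
    std = NormalDist()
    d = raw_diff / sd                       # standardised effect size (Cohen's d)
    ncp = d * (n_per_group / 2) ** 0.5      # expected value of the z statistic
    z_crit = std.inv_cdf(1 - alpha / 2)     # two-sided critical value
    # Probability of landing in either rejection region under the alternative
    return (1 - std.cdf(z_crit - ncp)) + std.cdf(-z_crit - ncp)

# Same raw difference, larger SD: d shrinks, so power drops.
power_small_sd = approx_power(raw_diff=5, sd=10, n_per_group=64)
power_large_sd = approx_power(raw_diff=5, sd=20, n_per_group=64)

# Same d (0.5) reached with a different raw difference and SD: power is unchanged.
power_same_d = approx_power(raw_diff=10, sd=20, n_per_group=64)
```

Under these assumptions, power changes with the SD only when the raw difference is held fixed; once d is fixed, the SD no longer matters on its own, which is the tension the question points at.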

]]>The Twitter quotes from D. Lakens show the existence of cases where there is (1) a specific alternative hypothesis (i.e. the hypothesis specifies a value for a model parameter, not only that it differs from the null hypothesis), (2) a statistical test with extremely high power, over 99%, and (3) a dataset that is large, perhaps even larger than is strictly needed.

Under these conditions, some p-values slightly below 0.05 would actually support the null hypothesis rather than the research hypothesis, and the simplistic use of p-values gives the wrong result. To be able to recognize and correctly deal with these conditions, one should indeed have a detailed understanding of the p-value.
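A minimal sketch of why this happens, assuming a two-sided z-test at alpha = 0.05 and a point alternative chosen so that the test has the stated power. It compares how likely the observed z statistic is under the null versus under that alternative; the function name and the p = 0.049 example are illustrative assumptions, not taken from the quoted tweets.

```python
from statistics import NormalDist

def likelihood_ratio_h0_vs_h1(p_value, power, alpha=0.05):
    """Ratio of the density of the observed z under H0 to that under H1.

    Illustrative helper: H1 is the point alternative for which a
    two-sided z-test at `alpha` has (approximately) the given power,
    ignoring the negligible rejection probability in the opposite tail.
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    ncp = z_crit + std.inv_cdf(power)        # mean of z under H1
    z_obs = std.inv_cdf(1 - p_value / 2)     # z statistic matching the p-value
    return std.pdf(z_obs) / NormalDist(mu=ncp).pdf(z_obs)

# With 99% power, p = 0.049 is more probable under the null than under H1.
lr_high_power = likelihood_ratio_h0_vs_h1(p_value=0.049, power=0.99)

# With 50% power, the same p-value favours the alternative instead.
lr_low_power = likelihood_ratio_h0_vs_h1(p_value=0.049, power=0.50)
```

A ratio above 1 means the observed result is better predicted by the null. Under these assumptions the high-power case comes out at roughly 2 to 1 in favour of the null, while the low-power case clearly favours the alternative.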

But how common are these conditions? How often have you faced them in your career? How often would a random researcher face them? Every year? Or, on average, not even once in their lifetime?

Isn’t it overwhelmingly more common that people work with tests whose statistical power is lower than desired (rather than over 99%), or with non-specific alternative hypotheses so that statistical power is not even defined, and with datasets that are too small rather than too large? So does it really matter if we use only the simplistic understanding of the p-value?

Psychology has a replication crisis, but it has to do with detecting false signals at the 0.05 significance level from datasets that are too small. It has to do with publication bias. I don’t think it has to do with misinterpreting the meaning of a p-value when applying almost too strong statistical power to almost too large datasets.

]]>In general, you seem to support the (reformulated) point I was trying to make: that as the correct definition is so abstract and difficult, people will not use it, so it’s better to rely on rules-of-thumb that convey the crucial idea without tripping us up with the unintuitive exact wording. Adding the two additional rules (an alternative hypothesis and the approaches that utilize it, and that really testing a hypothesis needs a fully Bayesian approach, which I am not yet familiar with) gives good context to show the limits of the p-value and what else there is. This is how p-values should be taught in universities!

I feared my question was stupid, but it seems that once again it was better to still present it; I actually learned something.

]]>I’m a non-methodologist, does it matter if my definition is slightly wrong?

I actually see another mistake with the Nature article, one which is relevant to this discussion. Indeed, it not only defines the p-value as informing about the probability of the null hypothesis, but it also defines the null hypothesis as the hypothesis of chance.

Kivikangas’s questions underline an important concern, especially for practitioners (a.k.a., non-methodologists, with little-to-no interest in the philosophy of statistics or of science): basically, can I get to the heart of the issue without the unnecessary fluff? As an ideal practitioner would, he also makes quite specific assertions which are eminently true, such as that p-values (frequentist statistics) have an underlying positive correlation with Bayes Factors (Bayesian statistics), so they show some degree of evidence even if it is not as strong as the Jeffreysians would like it to be. As posed in his examples, if all you want to measure is a difference, then even the “wrong” measuring tool may give an indication of such a difference.

Kivikangas then finishes with something a practitioner would like: a rule-of-thumb, and nothing else. Actually, he is quite wise in this instance as well, as pretty much everyone works with such rules-of-thumb, including the Bayesians, the frequentists, and everyone in between. This is why Chawla (2017) also makes a mistake in his definition of p-values and of the null hypothesis.

The answer I can offer Kivikangas (and PsychBrief) is from a Human Factors point of view. A correct definition of p-values (and anything else) leads to a better and more nuanced understanding and interpretation of the material at hand, and therefore to more precise communication of which conclusions are warranted and which are not. It also helps prevent errors of interpretation and, worse, mistakes in implementation (see, for example, some real errors commented on here: Perezgonzalez, 2016a, https://mro.massey.ac.nz/handle/10179/10444; 2016b, https://mro.massey.ac.nz/handle/10179/10445).

For example, the null hypothesis is not the hypothesis of chance, as Chawla says; therefore, smaller p-values cannot be evidence that results are less likely to be due to random variation. The null hypothesis only says that there is no difference or correlation (i.e., that the effect size is ‘zero’). The hypothesis also provides a probability model of how far random samples taken from the original population may show an effect size different from ‘zero’ (a.k.a., the probability distribution). Thus, chance resides in the sampling, not in the null hypothesis. (See also Perezgonzalez, 2017, https://doi.org/10.3389/fpsyg.2017.01715.) When we use the incorrect definition, non-significant results are ‘understood’ as due to chance, while significant results are ‘understood’ as due to systematic effects, which is ridiculous because a test of significance does not separate random results from systematic ones.

From here we can move on to the definition of a p-value. The correct definition would see the p-value as a descriptive measure (where is the sample located within the probability model posed by the null hypothesis?). That is, the p-value is more of a percentile than a probability (see Perezgonzalez, 2015, https://doi.org/10.3389/fpsyg.2015.00341). Thus, it shows no ‘evidence’ of anything, the evidence being in the mind of the researcher / practitioner, not in the statistic itself.
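As a small illustration of the percentile reading, the sketch below locates an observed z statistic within a standard normal null distribution and derives the two-sided p-value from that location. It assumes a z-test; the function name `percentile_and_p` is hypothetical.

```python
from statistics import NormalDist

def percentile_and_p(z_obs):
    """Locate an observed z statistic within the null distribution.

    Illustrative helper: assumes the null model is a standard normal.
    Returns the percentile of z_obs and the matching two-sided p-value.
    """
    std = NormalDist()
    percentile = std.cdf(z_obs)                   # share of null samples below z_obs
    p_two_sided = 2 * (1 - std.cdf(abs(z_obs)))   # share at least as extreme, either tail
    return percentile, p_two_sided

pct, p = percentile_and_p(1.96)
```

Here `pct` says how far out the sample sits in the null distribution, and `p` is simply the remaining share of that distribution in both tails. Neither number refers to an alternative hypothesis.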

In order to work as evidence proper, there is a need for an alternative hypothesis, where the same p-value also has a probability under that alternative hypothesis. A simple test of significance (Fisher’s) does not even take this alternative hypothesis into account, although one exists implicitly, as it more or less determines the level of significance and the reason why the null gets rejected. However, as Lakens suggests, the fact that a result is significant is not necessarily evidence that it supports the alternative hypothesis. You need an explicit alternative in order to make sure that the research has enough power and, therefore, that any statistically significant result shows more evidence towards the alternative than it does towards the null (Neyman-Pearson’s tests of acceptance).

A Bayesian approach (e.g., Jeffreys’s Bayes Factors) also works with two hypotheses (albeit BF works with two models). So both Neyman-Pearson’s tests and Bayesian approaches can inform about some evidence in support of one or the other hypothesis. With Fisher’s approach, a significant result often shows such evidence, but you may just as well be wrong in assuming so. Therefore, knowing which tool you are using and its limitations can help you understand research better and prevent serious mistakes in the interpretation and communication of results.

So… does the scale (Fisher) inform of a difference as well as a tape measure in cm (Neyman-Pearson) or in inches (Jeffreys), well enough for a practitioner? Sometimes it may, sometimes it may not. But why risk it?

The solution for a practitioner (and most researchers) is to create rules-of-thumb that both work with a correct definition and are simple and quick to apply. My rules-of-thumb would be (1) to define the p-value as a percentile on the distribution of the null hypothesis, (2) to quantify an alternative hypothesis when interpreting evidence (it can be done using sensitiveness (Perezgonzalez, 2017, https://osf.io/preprints/psyarxiv/qd3gu), Neyman-Pearson’s approach (a.k.a., a power analysis), or Jeffreys’s approach (a.k.a., Bayes Factors); the latter can be easily implemented using JASP, a statistics software package that allows for both frequentist and BF analysis), and (3) to only talk of testing hypotheses when using a fully developed Bayesian approach (thus, neither with simple tests of significance, nor with tests of acceptance, nor with Bayes Factors).

]]>