Does calling a study “under powered” help or hinder criticism?

A common criticism of research (past and present) is that it's "under powered" or "has low power". What this usually means is that the study doesn't have many participants (typically between 5 and 40) and so has low statistical power for most effect sizes in psychology

[note]This is based on the fact that most psychological effects are subtle in the real world and will have a correspondingly small effect size. If psychological effects were universally powerful, we would see them far more readily. For a fun discussion of this, see Gervais (2019).[/note]

. But something being “under powered” only makes sense when compared with an effect size. Power is determined by the effect size you want to detect, the size of your sample, and the alpha level

[note]For a more detailed explanation of how to think about power, I recommend this blog post of mine (PsychBrief, 2017).[/note]

. Sample sizes which are often labelled as “under powered” can actually have high power, depending on the hypothetical effect size.
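
To make this concrete, here is a rough sketch using Python's statsmodels (my choice of tool; any power calculator would do the same job). Holding the sample size and alpha level fixed, power swings enormously with the hypothetical effect size:

```python
# A rough sketch using statsmodels' power module. nobs1 is the number of
# participants *per group* in an independent t-test; the ratio of group
# sizes defaults to 1.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Same sample size and alpha, very different power depending on the
# hypothetical effect size d.
for d in (0.2, 0.5, 0.9):
    power = analysis.power(effect_size=d, nobs1=20, alpha=0.05,
                           alternative='two-sided')
    print(f"d = {d}: power = {power:.1%}")
```

With 20 participants per group, power is under 10% for d = 0.2 but close to 80% for d = 0.9: the label "under powered" attaches to the pairing of sample size and effect size, not to the sample size alone.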

I’ll show you “under powered”!

For example, if you are running an independent t-test and you can only collect 20 participants per group, you have 79.2% power

[note]80% is given as the usual benchmark to aspire to. But I'm starting to believe we need to dispense with such 'rules of thumb' to improve our work (Lakens, 2019).[/note]

to detect an effect size of d = 0.9. You might argue that an experiment isn't worth much if it can only detect an effect that large. But that's a different argument from whether it is under powered in a universal sense. This effect size is also slightly smaller than the median effect size (d = 0.93) for nominally statistically significant results in cognitive neuroscience and psychology papers (Szucs & Ioannidis, 2017)

[note]I don't for a second believe this result is an accurate reflection of reality; these results have very likely been inflated by measurement error (Loken & Gelman, 2017) and a host of Questionable Research Practices that lead to false positives.[/note]

. Calling an n = 20 study "under powered" is meaningless if you don't also discuss the hypothetical effect size.

[Figure: When the hypothetical effect size is d = 0.9, an n = 20 (per group) experiment has ~80% power. From: https://lakens.shinyapps.io/p-curves/]
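
You can also invert the question and ask how many participants per group you would need for 80% power at different hypothetical effect sizes (again a sketch with statsmodels; the effect sizes are illustrative):

```python
# Inverting the question (same statsmodels sketch as above): how many
# participants per group does 80% power require at a given effect size?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.9):
    n_per_group = analysis.solve_power(effect_size=d, power=0.80,
                                       alpha=0.05,
                                       alternative='two-sided')
    print(f"d = {d}: n = {n_per_group:.0f} per group")
```

An n = 20-per-group study is comfortably powered for d = 0.9 (roughly 21 per group needed), but would need roughly 64 per group for d = 0.5 and around 394 per group for d = 0.2.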

If the criticism you are making is "this study has a small sample size", just say that. Adding the extra layer of power confuses rather than clarifies the issue. I don't think many are using it as a means to unthinkingly cast aside a result. But promoting incorrect views about power[note]That a study "has" power, rather than power being a function of several parameters (one of which is sample size).[/note] to make a methodological criticism doesn't further the conversation. Consideration of power, and how it affects confidence in our statistical inferences, is important, and the broad intention of the criticism is good. But a criticism that unintentionally adds noise to the discussion is not the way to make it.

The power within

On Twitter, Roger Giner-Sorolla made a valuable point regarding within-subjects designs. Whilst he agreed that discussions of power should be framed around the effect size, he preferred saying "under powered" because focusing merely on sample size ignores the benefits of within-subjects designs (Giner-Sorolla et al., 2019). A within-subjects design can have very high power despite what is typically regarded as a low number of participants (see the sketch below). I think the main takeaway from these additional considerations is the importance of being precise in our criticisms. By going into detail, you can avoid potential misunderstandings or inappropriate critiques.
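
As an illustration (the effect size and correlations below are my assumptions, not values from Giner-Sorolla et al.): for a paired t-test the relevant effect size is d_z = d / sqrt(2(1 − r)), where r is the correlation between the two conditions, so the same underlying effect yields much higher power within-subjects when r is high:

```python
# An illustrative sketch (the effect size and correlations are assumptions,
# not values from Giner-Sorolla et al.). For a paired t-test the relevant
# effect size is d_z = d / sqrt(2 * (1 - r)), where r is the correlation
# between the two conditions, so high r buys power.
import math
from statsmodels.stats.power import TTestIndPower, TTestPower

d, n = 0.5, 20  # hypothetical between-condition effect size; participants

between = TTestIndPower().power(effect_size=d, nobs1=n, alpha=0.05)
print(f"independent groups, n = {n} per group: {between:.1%}")

for r in (0.5, 0.7, 0.9):  # assumed correlations between repeated measures
    d_z = d / math.sqrt(2 * (1 - r))
    within = TTestPower().power(effect_size=d_z, nobs=n, alpha=0.05)
    print(f"paired, n = {n}, r = {r}: {within:.1%}")
```

With r = 0.9, 20 participants give near-certain detection of an effect that an independent-groups design with the same n per group would catch only about a third of the time.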

References

Gervais, W. (2019). The WORLDBREAKER Heuristic. Available at: https://willgervais.com/blog/2019/6/7/the-worldbreaker-heuristic

Giner-Sorolla, R., Aberson, C. L., Bostyn, D. H., Carpenter, T., Conrique, B. G., Lewis, N. A., Jr., Montoya, A. K., Ng, B., Reifman, A., Schoemann, A. M., & Soderberg, C. (2019). Power to Detect What? Considerations for Planning and Evaluating Sample Size. Available at: https://osf.io/jnmya/

Lakens, D. (2019). The New Heuristics. Available at: http://daniellakens.blogspot.com/2019/03/the-new-heuristics.html

Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584-585.

PsychBrief (2017). Why you should think of statistical power as a curve. Available at: https://psychbrief.com/power-curve/

Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15(3), e2000797. https://doi.org/10.1371/journal.pbio.2000797
