The positive predictive value/false discovery rate are fundamentally flawed

The positive predictive value (PPV) was brought to a wider audience by John Ioannidis in his famous 2005 paper ‘Why Most Published Research Findings Are False’. The related concept, the false discovery rate (FDR), was popularised by Colquhoun (2014), though the idea it is based on dates back to Jeffreys (1939). Both have become very popular in recent years (e.g. Button et al., 2013) as a way to assess two things: how likely it is that a significant result from a statistical test reflects an actual effect; and the health of a field as a whole, i.e. how many of the findings published in the literature are “true”. But are they useful? Are they valid tests of how likely a statistical result is to be correct? Mayo & Morey (2017) argue no[note]As Silbert (2018, personal communication) points out, the Ioannidis (2005) paper is more of a thought experiment to consider whether the literature is full of false positives. This is in contrast with the FDR, which is an explicit claim that there are far more false positives than the alpha level should allow.[/note]. I will summarise their arguments related to the PPV and FDR, though I highly recommend you read the original article.

Theoretical background

The PPV and FDR rely on what Mayo and Morey call the Diagnostic Screening (DS) model of tests. This model underpins a criticism of how significance tests are currently performed and is intended as a replacement for the error probabilities of significance tests. It has three parts:

  1. The null hypothesis (H0), e.g. “this drug has no effect on x”, is viewed as a selection, potentially random, from a population of null hypotheses[note]“Null hypothesis” in this instance doesn’t necessarily mean a ‘hypothesis of no relation’, i.e. a true null, but ‘the hypothesis this statistical test is based on’; some of which are true nulls and some of which aren’t.[/note] (other drugs you could test).
  2. Most null hypotheses are assumed to be true nulls (most drugs won’t have an effect).
  3. Results are dichotomised into “positive” or “negative” depending on whether they pass a previously agreed upon statistical significance threshold (typically 0.05). “Positive” results are interpreted as you discovering a “real effect”.

Because your hypothesis might be a true null (based on the prior probability of choosing such a hypothesis from all the potential hypotheses you could have chosen to test), there’s a chance your statistically significant result is a false positive. But this probability will be higher than the significance level. This can be written out formally: the probability of your test rejecting the null hypothesis assuming the null hypothesis is true can be the significance level, e.g. 0.05, Pr(H0 is rejected|H0)=0.05[note]Mayo and Morey explain they use the phrase “conditional upon” and its notation “|” because Ioannidis and Colquhoun do, despite it not being technically correct. This is because the use of “conditional upon” implies H0 is a random variable when it isn’t (Wasserman, 2013). Less misleading notation like “;” is encouraged, which can be thought of as “assuming” or “under the scenario”.[/note], while the purported probability of the null hypothesis being true conditional upon the null hypothesis being rejected can be greater than 0.05, Pr(H0|H0 is rejected)>0.05, if enough null hypotheses are true nulls, e.g. 50% of them[note]This is not the same as the Type I error rate in standard NHST, which is set before the experiment begins and is the long-run rate of falsely rejecting the null hypothesis without consideration of prior probabilities, Pr(H0 is rejected|H0).[/note].
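
To make the two quantities concrete, here is a minimal simulation sketch of the DS model in Python. The numbers are hypothetical (50% true nulls, alpha = 0.05, and 80% power for the false nulls); it is an illustration of why the two probabilities come apart, not code from Mayo & Morey or Ioannidis.

```python
import numpy as np

rng = np.random.default_rng(2018)

n_hypotheses = 100_000    # hypothetical population of hypotheses to test
prop_true_nulls = 0.5     # assumed: half of all tested hypotheses are true nulls
alpha, power = 0.05, 0.8  # assumed significance level and power for false nulls

# Which hypotheses are true nulls (no real effect)?
is_true_null = rng.random(n_hypotheses) < prop_true_nulls

# A test rejects a true null with probability alpha,
# and rejects a false null with probability `power`
rejected = rng.random(n_hypotheses) < np.where(is_true_null, alpha, power)

# Pr(H0 is rejected | H0): the proportion of true nulls that get rejected
print(rejected[is_true_null].mean())   # ~0.05, i.e. alpha

# Pr(H0 | H0 is rejected): the proportion of rejections that are true nulls
print(is_true_null[rejected].mean())   # ~0.06 here, already above alpha
```

With the same alpha and power but 90% true nulls, the second number climbs to roughly 0.36 (Colquhoun’s worked example), while the first stays at 0.05.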

Written in this formal language, we can see that the probability derived from the diagnostic screening model of tests and the significance level are subtly different (the event and the condition have been swapped). The proponents of the DS criticism argue that because these values are not the same for a given test, this is a problem. The implication is that the PPV should be close to (if not exactly) 0.95 so that the FDR equals the 0.05 significance level. Using this logic, some advocates of the DS criticism argue that if a test with alpha=0.05 leads to a PPV of less than 0.95[note]And therefore an FDR of more than 0.05.[/note], the evidence against the null is “exaggerated”. But why should alpha and the FDR be equal when they are different things? So the concept of the PPV and FDR is off to a rocky start. However, as the authors point out, this isn’t enough to dismiss the DS criticism.

Definitions

The PPV is defined as the probability that a significant finding reflects a true effect, Pr(H1 is true|H0 is rejected), where H1 is the alternative hypothesis that there is an effect (Ioannidis, 2005). Its complement (the statistical opposite) is the FDR[note]To clarify what “complement” means, you can think of it in relation to an event we will call “A”; the complement of event A occurring is event A not occurring.[/note] (see the tree diagram in Colquhoun, 2014 for a visual representation of these concepts). The FDR is the proportion of statistically significant results for which the null hypothesis is in fact true[note]As we saw earlier, although this sounds exactly the same as the alpha level, it isn’t.[/note]. A key assumption of the PPV is that your hypothesis is chosen at random from all the possible hypotheses you could test. A high PPV is desirable because you can (supposedly) be more confident that a significant statistical result reflects a genuine effect. There are two ways to obtain a high PPV: assuming a high ability to correctly identify a true negative (a high probability of getting a non-significant result when the null hypothesis is true); and assuming a high prior prevalence of null hypotheses that are false, i.e. cases where there is some kind of effect.
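
As a rough numerical sketch of those two routes to a high PPV, here is the standard calculation behind Ioannidis’s and Colquhoun’s figures (ignoring Ioannidis’s bias term), with hypothetical inputs:

```python
def ppv(prior_prob_effect: float, power: float, alpha: float = 0.05) -> float:
    """PPV: probability a significant result reflects a real effect,
    given an assumed prior prevalence of real effects, power, and alpha."""
    true_positives = power * prior_prob_effect
    false_positives = alpha * (1 - prior_prob_effect)
    return true_positives / (true_positives + false_positives)

# Raising the assumed prior prevalence of real effects raises the PPV ...
print(ppv(prior_prob_effect=0.5, power=0.8))   # ~0.94, so FDR ~0.06
print(ppv(prior_prob_effect=0.1, power=0.8))   # ~0.64, so FDR ~0.36

# ... and so does lowering alpha (a higher chance of correctly
# identifying a true negative)
print(ppv(prior_prob_effect=0.1, power=0.8, alpha=0.001))   # ~0.99
```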

Conceptual problems

The PPV and FDR imagine you are drawing hypotheses to be tested from a population of possible hypotheses. For the probabilities given by these measures to make sense, you have to state which relevant hypotheses make up the population you are drawing your sample from[note]This is called the ‘reference class’ and all probabilities must be defined in relation to it.[/note], how many hypotheses in the population are true, and how many are false. But how can you define this? It is almost impossible to draw up every hypothesis that has been tested within your field, let alone every one that hypothetically could be tested. And then you have the insurmountable task of determining how many of those hypotheses are true and how many are false. This means the fundamental calculations that the PPV and the FDR rely on cannot be made.

Even if you could construct a meaningful reference class, a high PPV wouldn’t necessarily reflect strong evidence for a hypothesis. This is because almost everything you can measure correlates with everything else (dubbed the “crud factor” by Meehl, 1990). So if the “crud factor” is high (and therefore there is a high prevalence of non-null hypotheses in the population) and your result is significant, there would be a good chance your significant result reflects a “true” effect. But because the test you conducted wasn’t stringent, the result is trivially true: it stems from finding correlations in background noise. This is an example of how the PPV doesn’t index what we are actually interested in (whether our significant result is an accurate reflection of reality).

Statistical problems

The DS criticism of statistical tests rests on two related assumptions. The first is that the probabilities of H0 and H1 being true sum to 1, i.e. those options encompass every possibility[note]The power of the alternative hypothesis rests on a single value for the effect size and, therefore, this is a point-to-point comparison between the previously stated effect size and 0. This is a very narrow range of effect sizes to constrain 100% of the probability to.[/note]. The second is that the alternative has high power, e.g. 80%. But when performing significance tests, the alternative is typically a composite of many different hypotheses[note]For example: the mean of the experimental group is 2 scores higher than the control; the mean is 3 scores higher than the control; the mean is 4 scores higher; etc.[/note] and statistical power is a function of the test rather than a static number (Neyman, 1977, p. 7). When your hypotheses cover every possible outcome, the power associated with a single parameter value within H1 can be very low. But the DS criticism requires a high power for the alternative. This means the power to detect H1 isn’t simply the complement of the probability of detecting H0. The DS criticism assumes a hypothesis/effect has a single value of power that can therefore be used to help compute other test statistics. But this isn’t the case, as power depends on the hypothesised size of the discrepancy you are testing. Using the DS criticism confuses the specification of a test before data collection with the use of power, after the data has been collected, to assess the effect size of a statistically significant result.
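
To illustrate why a single power number is misleading, here is a minimal sketch of power as a curve over the hypothesised effect size (see also PsychBrief, 2017, in the references). It assumes a two-sided two-sample z-test with 50 participants per group; the numbers are purely illustrative.

```python
from scipy.stats import norm

def power_two_sample_z(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample z-test for a
    standardised effect size (Cohen's d) and a per-group sample size."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    # Ignores the negligible chance of rejecting in the wrong direction
    return 1 - norm.cdf(z_crit - noncentrality)

# The "power of the test" is not one number: it varies with the
# hypothesised effect size the test is being asked to detect
for d in (0.1, 0.2, 0.5, 0.8):
    print(f"d = {d}: power ≈ {power_two_sample_z(d, n_per_group=50):.2f}")
```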

Implications

So what are the consequences of adopting a PPV/FDR framework for deciding whether a significant result is “true” or not? Besides causing further confusion over statistical terms, one of them would be a likely increase in the resources spent on work that is either trivial, safe, or both. Ioannidis advocates that “Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high” (Ioannidis, 2005, p. 0700). This isn’t to say I believe there is no value in investing lots of resources into testing hypotheses with high prior odds of being true, or that all tested hypotheses should be novel (the collective obsession with novelty is part of the reason there is a replication crisis). But if the framework provides theoretical justification for running only safe experiments, researchers may be less willing to take the risk of testing hypotheses with lower prior odds. For more consequences of using this framework, please see Mayo & Morey (2017).

Conclusion

The main purpose of the PPV and FDR is to give a clear way of assessing the probability that a significant result reflects a “true” difference. However, they fail at this task and introduce confusion to already poorly understood concepts like the Type I error rate. Fundamentally, they use power and Type I error rates in ways that were never intended and don’t work. This isn’t to say the whole idea of the diagnostic screening model has no utility. There are some settings where it is extremely useful (the obvious example being screening for diseases). But it has not helped the situation when it comes to assessing whether a significant result reflects a true effect.

P.S.

For making it all the way through a very dry blog post, here’s a gif of a cute rabbit.

Note

The material on the diagnostic test, the delineation of tests, and most of the ideas in this blog post come from Mayo’s upcoming book “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (2018).

Author feedback

Many thanks to Adam Pegler, Alex Etz, Deborah Mayo, Noah Silbert, Crystal Steltenpohl, Rebecca Linnett, and Richard Morey for their constructive feedback (listed in the order I received their comments).

References

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. doi:10.1038/nrn3475

Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1, 140216.

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124

Jeffreys, H. (1939). Theory of Probability. Oxford: Oxford University Press.

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Mayo, D., & Morey, R. (2017). A Poor Prognosis for the Diagnostic Screening Critique of Statistical Tests. Available at: https://osf.io/tv4bq/

Meehl, P. E. (1990). Why Summaries of Research on Psychological Theories Are Often Uninterpretable. Psychological Reports, 66, 195–244.

Neyman, J. (1977). Frequentist Probability and Frequentist Statistics. Synthese, 36(1), 97–131.

PsychBrief. (2017). Why you should think of statistical power as a curve. Available at: https://psychbrief.com/why-you-should-think-of-statistical-power-as-a-curve/

Wasserman, L. (2013). Double Misunderstandings About p-values. Available at: https://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/
