Summary of a critical discussion of preregistration and Registered Reports

Preregistration and Registered Reports (RRs) are rapidly gaining popularity as a means of asking rigorous scientific questions. I think preregistration, and especially RRs, will positively shape how psychologists engage in scientific work. But critically discussing them is vital for our collective understanding. To this end, I was invited to the Cambridge ReproducibiliTea meeting to chat about them. The slides from my talk can be found here. Together with the slides, this blog post is a summary of the points from the evening. I do not claim credit for any of these ideas. This is not an exhaustive list of pros or cons. I’m focusing more on the negatives and pragmatic thoughts about implementation as I feel the positives have been widely discussed already. Thanks to everyone who was present and especially to those who contributed.

Registration of intent

The general purpose of preregistration and RRs is to demarcate what you planned to do before collecting data from what you ended up doing. This separates hypothesis-generating (exploratory) research from hypothesis-testing (confirmatory) research [zotpressInText item=”{5421944:YNPXGZTS}”]. Being open about what you originally planned and what actually happened strengthens your inferences and increases confidence in your results. Readers can see whether you fished around for a result or tested a prediction you made in advance. This means they can be more confident your result isn’t a false positive, as you haven’t inflated the false positive rate through researcher degrees of freedom [zotpressInText item=”{5421944:ZAGY4CXI}”]. Preregistration and RRs also increase the chance of conducting a severe test of your hypothesis [zotpressInText item=”{5421944:ERXU8X99}”][note]A severe test of a hypothesis is one where you would likely find a null result if that was the true state of affairs.[/note].

An RR is the bigger sibling of the preregistered study. Not only are data collection and analysis plans decided prior to collection, but the methods are also peer reviewed. Thus, studies have less chance of being dead on arrival, as potentially fatal methodological flaws are caught before any data are collected. It also increases the chance a study is properly powered, as a justification of the sample size using a power analysis must be provided before the methodology is agreed upon. The study is then published[note]As long as there aren’t any major deviations from the agreed plan.[/note] regardless of the result. This is to combat publication bias, as there are well-known incentives to hide non-significant results [zotpressInText item=”{5421944:LPNU2RJZ}”].
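To make the sample size justification step concrete, here is a minimal sketch of the kind of a priori power analysis a Stage 1 submission might include, using Python’s statsmodels. The effect size, alpha, and power values are purely illustrative assumptions on my part, not recommendations.

```python
# Minimal sketch of an a priori power analysis for a Stage 1 RR submission.
# Assumes a two-group between-subjects design analysed with an
# independent-samples t-test; all numbers are illustrative, not prescriptive.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.4,          # smallest effect size of interest (Cohen's d), assumed
    alpha=0.05,               # two-sided significance threshold
    power=0.90,               # desired probability of detecting the effect if it exists
    alternative="two-sided",
)
print(f"Required sample size per group: {math.ceil(n_per_group)}")
```

The point is simply that the justification is written down, and reviewed, before any data exist.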

There have also been recent moves to bring together funders, journals, and authors [zotpressInText item=”{5421944:7WWIVNHK}”]. Authors submit their methods once, the methods are reviewed and agreed upon, the journal commits to publishing the article, and the funders agree to fund the study. This streamlines the publication process and reduces waste, both from poorly designed studies and from well-designed studies going unfunded. Whilst these welcome changes improve the practicalities of preregistration and RRs, they don’t deal with the fundamental theoretical criticisms.

Preregistration and theory

Preregistration seeks to make a clear divide between confirmatory and exploratory research. But does such a line exist? In many areas of psychology, the theory doesn’t seem strong enough to derive predictions or hypotheses from, which is essential for confirmatory research. If we don’t have a good reason to hypothesise a certain finding, how can we “confirm” it? In psychology, there is often just as much justification for predicting that the results will go in one direction as there is for the opposite. Does it therefore make sense to preregister one analysis plan over another?

This is related to what Meehl calls the ‘methodological paradox’ [zotpressInText item=”{5421944:Q7A4UQHX}”]. Psychology typically tests a theory by making a prediction, usually a directional effect (e.g. this intervention will reduce self-reported anxiety). Because the prediction is so vague, the test of the theory is very weak. As methods become more rigorous[note]Through stronger methods, larger samples, etc.[/note], the chance of finding this effect increases. Therefore, the test of the theory becomes weaker as the methods become stronger. This contrasts with physics, where better methods allow more precise and risky predictions[note]Like a point estimate on a scale.[/note], which are stronger tests of a theory and therefore stronger corroborators if they are correct. Preregistration (arguably) increases methodological rigour, but does nothing to address the underlying issue of weak theory. Therefore, this increase in robustness cannot by itself help us test theories and make discoveries.
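To illustrate Meehl’s point (this is my own toy simulation, not something from the talk), suppose the true effect is a trivially small ‘crud’ difference. A vague directional prediction is almost guaranteed to be ‘confirmed’ once samples are large enough, so passing the test tells us very little about the theory.

```python
# Toy illustration of Meehl's methodological paradox (my own, not from the talk):
# a vague directional prediction is confirmed ever more often as samples grow,
# even when the true effect is a trivially small "crud" difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
crud = 0.1        # tiny true standardised difference, assumed for illustration
n_sims = 1000

for n in (50, 500, 5000):
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(crud, 1.0, n)
        # one-sided test of the directional prediction "treatment > control"
        p = stats.ttest_ind(treatment, control, alternative="greater").pvalue
        hits += p < 0.05
    print(f"n per group = {n:>5}: prediction 'confirmed' in {hits / n_sims:.0%} of simulated studies")
```

Bigger samples make the ‘test’ easier to pass, which is the opposite of what happens with a risky point prediction.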

Preregistration and scientific discovery

Although preregistration and RRs may increase our confidence in the results, arguably they do not help with the fundamental purpose of science. According to work in the philosophy of psychological science, psychology isn’t just about finding effects; it’s about finding the underlying mechanisms [zotpressInText item=”{5421944:RX24DYFS}”]. This is because effects are not explanations; they’re things to be explained. This process requires exploration of the theoretical space[note]The possible explanations for the underlying phenomena.[/note], which is vast and complex. Preregistration is unsuited for this [zotpressInText item=”{5421944:5QN4NMS8}”] because testing individual, specific hypotheses one at a time with ‘yes’/’no’ questions is unlikely to aid discovery [zotpressInText item=”{5421944:6TR43CXY}”].

But some argue preregistration can be useful for exploring theories. Regarding cognitive modelling[note]Using scientific models in cognitive psychology to understand cognitive processes.[/note], [zotpressInText item=”{5421944:C2HX7CYU}”] found it is suited for the more confirmatory nature of model application[note]These studies involve applying an existing cognitive model to empirical data to provide insight into how that cognitive process works.[/note], comparison[note]Where multiple models are compared to see how well they account for empirical data.[/note], and evaluation[note]Where one or more models try to explain different patterns in empirical data.[/note]. However, it is less suited to the more exploratory model development[note]Modifying models to create a new one.[/note]. Thus, preregistration can be useful for theory development by more rigorously testing models, which are the formalised ideas[note]Via quantification, visualisation, etc.[/note] derived from theories.

Yet despite the potential value of preregistration for testing theories, it still places a primacy on hypothesis testing and statistics. Many argue this is a poor way to make discoveries, as methods cannot overcome weak theory [zotpressInText item=”{5421944:5MAJQW5Q},{5421944:FHB85D4L}”]. Not only that, but this increased confidence via reproducibility may lead us astray. Reproducible results don’t necessarily lead us towards truth [zotpressInText item=”{5421944:RDC5L2Q6}”], so focusing on this aspect of research may cause us to go down theoretical dead ends that are nonetheless reproducible. On a related note, you could argue that strictly focusing on methods limits the hypothesis space to that which can be measured well. And an overemphasis on severe tests could stop people from conducting research (making the perfect the enemy of the good)[note]A side note regarding severe tests: How can you know a severe test would show a negative if that was reality without knowing the true state of nature? At which point, what’s the value in the study when you already know the answer?[/note].

Late registration

There has been much discussion around how to deal with deviations from the preregistration plan. If deviations are rarely justifiable [zotpressInText item=”{5421944:5KWNG5DL}”], then they will be thought of as a questionable research practice [zotpressInText item=”{5421944:UPYLPXK2}”]. This will incentivise hiding such deviations or the whole analysis plan. This undermines one of the main goals of preregistration, which is to increase openness and honesty. One response is that preregistration is better thought of as a public documentation system[note]This raises the question as to whether preregistrations are the best format for this kind of record.[/note]. This version of preregistration emphasises honesty and can tolerate deviations, but undermines one of the proposed benefits of preregistration and RRs: controlling the false positive rate.

But how well can these methods control the type I error rate? Some argue preregistration doesn’t limit p-hacking unless it is so strict that it tolerates no deviations at all. But then it becomes an analysis prison: you cannot do anything different, even if there is an obviously better choice to be made. This is something many advocates (including myself) have argued against. Being open about your flexibility is undoubtedly a positive, but that flexibility in light of the data still inflates your type I error rate. It seems to be up to each individual to judge how flexible they are willing to be and to be transparent about it, as I don’t think blanket rules will be useful. Preregistration therefore seems to entail a trade-off between sticking to the preregistered plan and inflating the type I error rate.
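As a rough illustration of that inflation (again my own sketch, with made-up analysis choices), simulate data where the null is true and count a result as significant if any of a few justifiable-looking analyses gives p < .05:

```python
# Toy simulation of how analytic flexibility inflates the type I error rate.
# Both groups come from the same distribution (the null is true), but a result
# counts as "significant" if ANY of three defensible-looking analyses works.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_sims, alpha = 30, 5000, 0.05
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    p_values = [
        stats.ttest_ind(a, b).pvalue,                                 # plain t-test
        stats.mannwhitneyu(a, b).pvalue,                              # non-parametric alternative
        stats.ttest_ind(a[np.abs(a) < 2], b[np.abs(b) < 2]).pvalue,   # after excluding "outliers" (data are in SD units)
    ]
    false_positives += min(p_values) < alpha                          # report whichever "worked"

print(f"Nominal alpha: {alpha:.0%}; actual false positive rate: {false_positives / n_sims:.1%}")
```

Even with only three closely related analyses, the realised error rate ends up above the nominal 5%.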

This discussion of p-values might be moot, however, as some argue your p-value was doomed from the start [zotpressInText item=”{5421944:I8XDSLSK}”]. Therefore, stopping p-hacking is not an issue because your p-value never had a chance to mean anything, so it doesn’t matter if it was hacked or not [zotpressInText item=”{5421944:BDJYJHR7}”].

Standards of preregistrations and RRs

One of the main reasons for doing a preregistration or RR is the increased robustness of, and confidence in, the results via open data and improved reproducibility. But does this bear out in reality? [zotpressInText item=”{5421944:33NXICQ2}”] analysed 62 RRs that met their inclusion criteria and found data were available for 41 articles, with 37 articles (60%) providing analysis scripts. For the 36 articles that shared both data and code, they could run the scripts for 31 analyses and reproduce the main results for 21 (58%) of those articles. This is certainly an improvement over business as usual [zotpressInText item=”{5421944:TWDLURLI}”], but you might hope for better from a methodology designed to be more open.

There have also been reports of reviewers instructing authors of RRs to deviate from the preregistration after seeing the results, which defeats the whole point of RRs. I don’t know how common this is, but such a blatant disregard for the meaning of RRs is worrying. If some journals are willing to bend the format this far, it’s likely that less extreme examples also occur. Does an RR at one journal mean the same as at another? Are some journals jumping through the hoops of offering RRs without buying into the philosophy, gaining social capital by appearing to engage in socially desirable practices without doing them properly? As a result, can we have a meaningful and consistent concept of RRs without oversight or enforcement? [zotpressInText item=”{5421944:TXISGYPZ}”] have advocated a Yelp-style feedback mechanism for RR-adopting journals to help ensure that editors comply with their own policies.

MULTI-BALL

As preregistration and RRs become more prevalent, more ways of designing them are put forward. When preregistering your analysis plan, you may preregister multiple analyses; after data collection, you are then not bound to only one. This seeks to give greater flexibility while still controlling the nominal false positive rate. But which analyses do you preregister? There are often many justifiable ways of analysing the data, and without strong theory to guide you, your analysis plan may be arbitrary. A more principled way to justify an analysis plan would be to preregister the analyses for your replication based on what you originally found[note]Assuming you have the resources to run a replication study.[/note]. A related but distinct option is multiverse analysis, a grander version of exploring multiple analysis paths in which many different analyses are run under different combinations of analytic decisions. Whilst it, along with other approaches [zotpressInText item=”{5421944:8IC6ZHFD}”], is a valuable option, it isn’t a silver bullet for robust inferences.
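For a sense of what that looks like in practice, here is a minimal sketch of the multiverse logic (the analytic decisions are hypothetical choices of my own, not a prescribed recipe): cross every defensible decision and report the whole grid of results rather than a single path.

```python
# Minimal sketch of the multiverse logic: cross every defensible analytic
# decision (here, hypothetical outlier rules and test choices) and report the
# whole grid of results rather than a single analysis path.
from itertools import product

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(0.3, 1, 80)   # simulated data standing in for a real dataset
group_b = rng.normal(0.0, 1, 80)

outlier_rules = {"keep all": np.inf, "drop |z| > 3": 3, "drop |z| > 2": 2}
tests = {"t-test": stats.ttest_ind, "Mann-Whitney": stats.mannwhitneyu}

for (rule, cutoff), (test_name, test) in product(outlier_rules.items(), tests.items()):
    a = group_a[np.abs(stats.zscore(group_a)) < cutoff]
    b = group_b[np.abs(stats.zscore(group_b)) < cutoff]
    print(f"{rule:>12} + {test_name:<12}: p = {test(a, b).pvalue:.3f}")
```

If a conclusion only survives in a handful of cells of the grid, that tells you something a single preregistered path would not.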

Looking backwards and forwards

Preregistration and RRs are a rapidly expanding feature of our scientific landscape [zotpressInText item=”{5421944:TXISGYPZ}”]. Learning more about them, their strengths and weaknesses, and the different ways to employ them, will make them a better tool. In addition, if we aren’t open about the problems and difficulties, we risk putting people off by presenting an unrealistic picture. You could also argue that opening open science conversations with preregistration, as I’ve often seen happen, sets an unrealistic standard and dissuades people from engaging. I hope this isn’t the case, as there are many strands to open science practices, all with their relative pros, cons, and opportunities.

Regardless of how we discuss preregistration and RRs, we should maintain perspective. Good work was conducted before preregistration (and indeed the scientific method). In our collective enthusiasm, it can sometimes appear like we have forgotten this important fact. Having a more humble science would likely benefit all of us.

References

[zotpressInTextBib style=”apa” sort=”ASC”]
