# Prediction markets and how to power a study

Do you think you know which studies will replicate? Do you want to help improve the replicability of science? Do you want to make some money? Then take part in this study on predicting replications![1]

But why are researchers gambling on whether a study will successfully replicate (defined as finding “an effect in the same direction as the original study and a p-value<0.05 in a two-sided test” for this study)? Because there is some evidence to suggest that a prediction market can be a good predictor of replicability, even better than individual psychology researchers.

What is a prediction market?

A prediction market is a forum where people can bet on whether something will happen. They do this by buying or selling stocks in that outcome. In this study, if you think a study will replicate, you can buy stocks in that study. This raises the price of the stock because it is in demand. Because the transactions are public, other traders can see that someone (you remain anonymous) thinks this study will replicate. They might agree with you and also buy, raising the price further. However, if you don’t think a study will replicate, you won’t buy stocks in it (or you will sell the stocks you have) and the price will go down. If you’re feeling confident, you can short sell a stock. This is where you borrow stocks in something you think will lose value (a study you don’t think will replicate, and which you believe other traders will come to doubt too), sell them to someone else, then buy them back after the price has dropped and return them to the person you borrowed from, keeping the difference.

This allows the price to be interpreted as the “predicted probability of the outcome occurring”. For each study, participants could see the current market prediction for the probability of successful replication.
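To see why the price can be read as a probability, here is a minimal sketch of a trader's incentive. It assumes a simplified contract that pays 1 unit if the study replicates and nothing otherwise; the function name and the numbers are illustrative and are not taken from the study.

```python
def expected_profit_per_share(belief: float, price: float, payout: float = 1.0) -> float:
    """Expected profit from buying one share at `price`.

    Assumption: the share pays `payout` if the study replicates and 0 otherwise,
    and `belief` is the trader's own probability that it replicates.
    """
    if not 0.0 <= belief <= 1.0:
        raise ValueError("belief must be a probability between 0 and 1")
    return belief * payout - price


# A trader who thinks replication is 70% likely sees a bargain at a price of 0.55 ...
print(expected_profit_per_share(belief=0.70, price=0.55))
# ... and has no reason to buy or sell once the price matches their belief.
print(expected_profit_per_share(belief=0.70, price=0.70))
```

Under this assumption, traders keep buying while the price is below their belief and selling while it is above, which is what pulls the price towards the traders' collective estimate.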

A brief history of prediction markets for science:

Placing bets on whether a study will replicate is a relatively new idea; it was first tested by Dreber et al. (2015)[2] with a sample of the studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015). They took 44 of the studies in the RP:P and, before the replications were conducted, asked participants (researchers who took part in the RP:P, though they weren’t allowed to bet on studies they had worked on) to rate how likely they thought each study was to replicate. They then created a market where participants could trade stocks according to how likely they thought each study was to replicate. Finally, they compared how accurately the individual researchers and the market predicted which studies would replicate. The market significantly outperformed the researchers’ prior predictions, which were not significantly better than chance (and, when weighted for self-rated expertise, performed exactly as well as flipping a coin).

This result was not replicated in a sample of 18 economics papers by Camerer et al. (2016a), who found that neither method (market or prior prediction) produced a predicted replication rate that differed significantly from the actual replication rate. However, they noted that the smaller sample size in their study may have contributed to this null finding (as well as other factors).

What the current study will do:

Camerer et al. (2016b) will take 21 studies and attempt to replicate them. Before the results are published, a sample of participants will trade stocks on the chance of these studies replicating. To calculate the number of participants (n) needed for a worthwhile replication, the researchers looked at the p-value, n, and standardised effect size (r) of the original. They then calculated how many participants they would need to have “90% power to detect 75% of the original, standardised effect size”. This means that if the original r was 0.525, with an n of 51 and a p-value of 0.000054, you would need 65 participants to have a 90% chance of detecting an effect 75% as large as the original. If the first replication isn’t successful, the researchers will conduct another replication with a larger sample size (pooled from the first replication sample and a second sample) that will have 90% power to detect 50% of the original effect.
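As a rough illustration of this kind of calculation, the sketch below uses the standard Fisher z approximation for a two-sided test of a correlation. This is an assumption on my part: the project may have used different software or an exact method, so the result should land close to, but not necessarily exactly on, the 65 quoted above. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_for_correlation(r_target: float, power: float = 0.90, alpha: float = 0.05) -> int:
    """Approximate n needed to detect a correlation of r_target.

    Uses the Fisher z approximation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3). Two-sided test at level alpha.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    effect_z = math.atanh(abs(r_target))           # target effect on the Fisher z scale
    return math.ceil(((z_alpha + z_power) / effect_z) ** 2 + 3)


original_r = 0.525
n_stage_one = n_for_correlation(0.75 * original_r)  # 90% power for 75% of the original r
n_stage_two = n_for_correlation(0.50 * original_r)  # 90% power for 50% of the original r
print(n_stage_one, n_stage_two)
```

Halving the target effect more than doubles the required sample, which is why the second stage pools the first replication sample with a new one.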

But there’s one problem: you shouldn’t use the effect sizes from previous results to determine your sample size.

Using the past to inform the future:

Morey & Lakens (2016) argue that most psychological studies are statistically unfalsifiable because the statistical power (usually defined as the probability of correctly rejecting a false null hypothesis) is often too low. This is mainly due to small sample sizes, which are a ubiquitous problem in psychological research (though, as Schwarzkopf (2016) notes, measurement error can add to this problem). Low power threatens the evidentiary value of a study because the study may not have been able to detect a real effect. Coupled with publication bias (Kühberger et al., 2014) and the fact that studies with small samples are the most susceptible to p-hacking (Head et al., 2015), this makes it difficult to judge the evidential weight of the initial finding.

So any replication is going to have to take into account the weaknesses of the original studies. This is why basing the required n on previous effect sizes is flawed: those effect sizes are likely to be inflated. Morey and Lakens give the example of a smoke alarm: “powering a study based on a previous result is like buying a smoke alarm sensitive enough to detect only the raging fires that make the evening news. It is likely, however, that such an alarm would fail to wake you if there were a moderately-sized fire in your home.” Because of publication bias, researchers will expect a larger effect size than may actually exist. This may mean they don’t recruit enough participants to detect the effect.
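A small simulation can make the inflation concrete. This is a sketch under stated assumptions, not an analysis from any of the papers cited: the true correlation (0.2), the sample size (30), and the rule that only significant results get published are all made-up illustrative choices, and the sampling distribution uses the Fisher z approximation.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_published_effects(true_r, n, studies=20000, alpha=0.05, seed=1):
    """Simulate many studies of the same true effect and 'publish' only significant ones.

    Returns the mean published r and the share of studies that were published.
    """
    rng = random.Random(seed)
    se = 1 / math.sqrt(n - 3)                      # standard error on the Fisher z scale
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    true_z = math.atanh(true_r)
    published = []
    for _ in range(studies):
        observed_z = rng.gauss(true_z, se)         # one study's observed effect
        if abs(observed_z) / se > z_crit:          # publication filter: p < alpha
            published.append(math.tanh(observed_z))
    return mean(published), len(published) / studies


mean_published_r, share_published = simulate_published_effects(true_r=0.2, n=30)
print(f"true r = 0.2, mean published r = {mean_published_r:.2f}, "
      f"share published = {share_published:.2f}")
```

In this toy setup the published studies overstate the true effect considerably, so a replication powered for the published effect size would be underpowered for the real one.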

But these researchers have taken steps to try to avoid this problem. They have powered their studies to detect 50% of the original effect size (if they are unsuccessful with their first attempt at 75%). Why 50%? Because the average effect size found in the RP:P was 50% of the original. So they are prepared to discover an effect size that is half as large as was reported in the initial study. This of course leaves the possibility that they miss a true effect because it is less than half of the originally reported effect (and given that the average effect size was 50%, we know there will very likely be smaller effects). But they have considered the problem of publication bias and attempted to mitigate it. They have also largely removed the threat of p-hacking: these are Registered Reports with preregistered protocols, so there is almost no chance of questionable research practices (QRPs) occurring.

But is their attempt to eliminate the problem of publication bias enough? Are there better methods for interpreting replication results than theirs?

There is another way:

A variety of approaches to interpreting a replication attempt have been published in the literature. I am going to focus on two: the ‘Small Telescopes’ approach (Simonsohn, 2015) and the Bayesian approach (Verhagen & Wagenmakers, 2014). The ‘Small Telescopes’ approach focuses on whether the initial study was suitably powered to detect the effect. If the replication, with its larger n, finds an effect size that is statistically significantly smaller than a benchmark “small” effect, e.g. an effect that would have given the first study only 33% power[3] (if you were to put that effect size into the power calculation of the original), then this suggests the initial study didn’t have a big enough sample size to detect the effect. Simonsohn also recommends that the replication use 2.5x the number of participants of the initial study. This gives the replication roughly 80% power to reject that benchmark effect when the true effect is zero, regardless of publication bias, the power of the original studies, and study design. Looking at the replications in the Social Sciences Replication Project, 18 of the 21 studies have at least 2.5x the number of participants of the first study by the second phase of data collection, leaving just 3 that don’t (studies 2, 6, and 11).
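To give a feel for the 33% benchmark and the 2.5x rule, here is a minimal sketch using the example original study from earlier (n = 51). It assumes a correlation as the effect size and the Fisher z approximation, and it ignores the negligible chance of a significant result in the wrong direction; Simonsohn's paper works through its own designs and examples, so treat the function names and numbers here as illustrative only.

```python
import math
from statistics import NormalDist


def power_for_correlation(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test to detect correlation r with n participants."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = math.atanh(r) * math.sqrt(n - 3)   # expected test statistic under r
    return 1 - NormalDist().cdf(z_crit - shift)


def r_giving_power(n: int, power: float = 0.33, alpha: float = 0.05) -> float:
    """Correlation that a study of size n would detect with the given power."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = z_crit + NormalDist().inv_cdf(power)
    return math.tanh(shift / math.sqrt(n - 3))


n_original = 51
r33 = r_giving_power(n_original)               # the 'small telescope' benchmark effect
n_replication = math.ceil(2.5 * n_original)    # suggested replication sample size

print(f"benchmark r = {r33:.3f}")
print(f"power of original study for that r = {power_for_correlation(r33, n_original):.2f}")
print(f"suggested replication n = {n_replication}")
```

A replication estimate that is significantly below this benchmark r would suggest the original study's "telescope" was too small to have seen the effect.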

The Bayesian test to quantify whether a replication was successful compares two hypotheses: that the effect is spurious, i.e. zero (the null hypothesis); and that the effect is consistent with what was found in the initial study (the posterior distribution from the original study is used as the prior for the replication). The larger the effect size in the first study, the larger, and the further from zero, we expect the effect size of the replication to be. We can therefore quantify how much the replication data support the null hypothesis relative to the alternative hypothesis (that the replication will find a similar effect to the original study).
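The logic of this comparison can be sketched with a simple normal approximation. To be clear, this is not the procedure from Verhagen & Wagenmakers (2014); it is an analogue I have assumed for illustration, using correlations on the Fisher z scale and taking the original study's posterior to be normal around its observed effect. The replication results plugged in at the end are hypothetical.

```python
import math
from statistics import NormalDist


def replication_bayes_factor(r_orig: float, n_orig: int, r_rep: float, n_rep: int) -> float:
    """Approximate Bayes factor for 'effect as in the original' versus 'no effect'.

    Values above 1 favour the original effect; values below 1 favour the null.
    Assumption: atanh(r) is normal with variance 1 / (n - 3), and the original
    study's posterior (used as the prior here) is centred on its observed effect.
    """
    z_orig, z_rep = math.atanh(r_orig), math.atanh(r_rep)
    var_orig, var_rep = 1 / (n_orig - 3), 1 / (n_rep - 3)
    # Null hypothesis: the true effect is exactly zero.
    like_null = NormalDist(0.0, math.sqrt(var_rep)).pdf(z_rep)
    # Replication hypothesis: the true effect is drawn from the original posterior,
    # so the prediction for the replication combines both sources of uncertainty.
    like_original = NormalDist(z_orig, math.sqrt(var_orig + var_rep)).pdf(z_rep)
    return like_original / like_null


# Hypothetical replication outcomes for the example original study (r = 0.525, n = 51).
print(replication_bayes_factor(0.525, 51, r_rep=0.40, n_rep=65))  # supports the original
print(replication_bayes_factor(0.525, 51, r_rep=0.00, n_rep=65))  # supports the null
```

The appeal of this style of test is that the output is graded: it says how strongly the replication data favour one hypothesis over the other instead of giving a yes/no verdict.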

Simonsohn argues that the use of the somewhat arbitrary 90% power to detect 75% or 50% of the original effect is inferior to both the ‘Small Telescopes’ and the Bayesian approach (personal communication, October 24, 2016). This is because the statistical properties of the other two approaches have been thoroughly analysed: we know what their results mean and how often each result occurs under a given underlying effect. The same has not been done for the method used in this study, so we can’t be as confident in its results as we can for the other two methods.

Conclusion:

I would recommend getting involved in this study (if you can) and, if you can’t, looking for the results when they are published, as they will be very interesting. But try to avoid using previous effect sizes to calculate future sample sizes because of the problem of publication bias. You also need to consider how you interpret replication results and the best means of doing this (does the method have a solid theoretical justification and a clear meaning, and have its statistical properties been thoroughly examined?).

Author feedback:

I contacted all the researchers from Camerer et al. (2016b) prior to publication to see if they had any comments. Anna Dreber had no criticisms of the post and Colin Camerer expressed familiarity with the issues raised by Morey & Lakens. Magnus Johannesson made the point that they powered the second stage to detect 50% of the original effect size precisely because of publication bias and the fact that the average effect size in the RP:P was 50% of the original. I updated my post to reflect this. I contacted Richard Morey to clarify the point about calculating the required n for a study using previous effect sizes. I contacted Uri Simonsohn about whether the actions of the researchers had overcome the problems he outlined in his ‘Small Telescopes’ paper. He argued that their approach was worse than previously explored methods, and I added this information.

References:

Almenberg, J.; Kittlitz, K.; & Pfeiffer, T. (2009). An Experiment on Prediction Markets in Science. PLoS ONE 4(12): e8500. doi:10.1371/journal.pone.0008500

Camerer, C.; Dreber, A.; Ho, T.H.; Holzmeister, F.; Huber, J.; Johannesson, M.; Kirchler, M.; Almenberg, J.; Altmejd, A.; Buttrick, N.; Chan, T.; Forsell, E.; Heikensten, E.; Hummer, L.; Imai, T.; Isaksson, S.; Nave, G.; Pfeiffer, T.; Razen, M.; & Wu, H. (2016a). Evaluating replicability of laboratory experiments in economics. Science, doi:10.1126/science.aaf0918

Camerer, C.; Dreber, A.; Ho, T.H.; Holzmeister, F.; Huber, J.; Johannesson, M.; Kirchler, M.; Nosek, B.; Altmejd, A.; Buttrick, N.; Chan, T.; Chen, Y.; Forsell, E.; Gampa, A.; Heikensten, E.; Hummer, L.; Imai, T.; Isaksson, S.; Manfredi, D.; Nave, G.; Pfeiffer, T.; Rose, J.; & Wu, H. (2016b). Social Sciences Replication Project. [online] Available at: http://www.socialsciencesreplicationproject.com/

Dreber, A.; Pfeiffer, T.; Almenberg, J.; Isaksson, S.; Wilson, B.; Chen, Y.; Nosek, B.A.; & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112 (50), 15343–15347.

Hanson, R. (1995). Could gambling save science? Encouraging an honest consensus. Social Epistemology, 9 (1).

Head, M.L.; Holman, L.; Lanfear, R.; Kahn, A.T.; & Jennions, M.D. (2015). The Extent and Consequences of P-Hacking in Science. PLoS Biol 13(3): e1002106. doi:10.1371/journal.pbio.1002106

Investopedia (2016). Short Selling. [online] Available at: http://www.investopedia.com/terms/s/shortselling.asp. Accessed on: 21/10/2016.

Kühberger, A.; Fritz, A.; & Scherndl, T. (2014). Publication Bias in Psychology: A Diagnosis Based on the Correlation between Effect Size and Sample Size. PLoS ONE 9(9): e105825. doi:10.1371/journal.pone.0105825

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349 (6251): aac4716.

Open Science Foundation (2016). Registered Reports. [online] Available at: https://osf.io/8mpji/wiki/home/

Simonsohn, U. (2015). Small Telescopes: Detectability and the Evaluation of Replication Results. Psychological Science, 26 (5), 559–569

Morey, R. & Lakens, D. (2016). Why most of psychology is statistically unfalsifiable. [online] Available at: https://github.com/richarddmorey/psychology_resolution/blob/master/paper/response.pdf

Schwarzkopf, S. (2016). Boosting power with better experiments. [online] Available at: https://neuroneurotic.net/2016/09/18/boosting-power-with-better-experiments/

Verhagen, J. & Wagenmakers, E.J. (2014). Bayesian Tests to Quantify the Result of a Replication Attempt. Journal of Experimental Psychology: General, 143 (4), 1457-1475.

1. This study has now finished collecting participants for the first phase of the study.
2. It wasn’t the first study to apply the idea of prediction markets to scientific results, though; that was done by Almenberg, Kittlitz, & Pfeiffer (2009), and the idea was most notably proposed by Hanson (1995).
3. You don’t need to use the 33% threshold to judge whether the original study was suitably powered to detect the new effect size, or indeed hypothesis testing in general. You can use different values or confidence intervals instead.