Why do psychologists leave academia?

Every once in a while in the psychology sphere of social media there’s a discussion about why people leave academia. This talking point often comes up in the context of “the open science movement” and whether more academics leave because of the culture of criticism or because of the lack of replicability of some findings. People who have left academia offer their reasons, and people who are still in give anecdotes about why someone they know left. But what seems to be lacking is actual data. So I’ve written this survey in the hope of shedding some light on the situation. It’s for people who have considered leaving, or have actually left, academia or practising psychology (educational, clinical, etc.). But this survey will only be useful if you share it with people you know who have left. So please share the survey on social media or relevant mailing lists, but especially send it directly to people you know who have left psychology. I’m writing this blog post so those who are subscribed to the methods blog feed will see the survey, hopefully increasing the number of respondents. Thank you for your help.

Survey: https://www.surveymonkey.co.uk/r/ZRS982G

PS sorry if you’ve seen this survey already!

[Guest post] How Twitter made me a better scientist

I’m a big fan of Twitter and have learned so much from the people on there1, so I’m always happy to share someone singing its praises. This article was written by Jean-Jacques Orban de Xivry2 for the University of Leuven’s blog. He talks about how he uses Twitter to find out about interesting papers, along with a whole host of other benefits. The article can be found here. His Twitter account is @jjodx.

Improving the psychological methods feed

The issue of diversity has once again been raised in relation to online discussions of psychology. Others have talked about why it may happen and the consequences of it. I have nothing to add in those areas so I’m not going to discuss them. The purpose of this post is to analyse the diversity of my main contribution to social media discussions that I have total control over: the psychological methods blog feed. How many blogs by women are featured? How many non-white authors are there? How many early-career researchers (ECRs) are shared?1

Looking at gender first (simply because that was the issue that started this), I coded the gender of each blog’s authors as male or female. If it was a blog with a wide collection of authors I excluded it from the analysis (I’m looking at you, JEPS). If there were multiple authors, I coded them individually (hence why the total number is greater than the number of blogs in the feed). So how many male and female authors are there? Compared to the very low bar set by a previous analysis, it’s not terrible (mainly because n>0). But it could (and should) be better. Here are my results:

For the ethnicity of the author (white or non-white)2 I judged whether they were white from their social media profiles and photos. Again, if the blog contains posts by a large collection of authors I excluded it, and for multiple authors I coded them individually. The results aren’t great:

Coding whether the author was an ECR, I used the definition provided by the Research Excellence Framework, which states an ECR is someone who has been hired to teach and perform research within the past 5 years (REF, 2014). To ascertain whether a blog author was an ECR I consulted their public CVs or asked them whether they believed they qualified (according to the above definition). The descriptive statistics are:

So what does this tell us? That the vast majority of blog authors in my feed are male, white, and not ECRs. Not particularly diverse. As I said, the purpose of this isn’t to show that I should be marched naked through the streets with a nun chanting “Shame!” behind me while ringing a bell. It’s to recognise I could do better and to ask for your help. I want to increase the variety of voices people hear through my blog feed, so do you have any suggestions? The blogs don’t need to focus exclusively on psychological methods but they do need to discuss them. Feel free to comment on this post, contact me via Twitter (@psybrief) or Facebook (search PsychBrief on Facebook and you’ll find me), or send me an email (www.psychbrief@gmail.com). Any other names put forward are much appreciated. Please check the blog list (http://psychbrief.com/psychological-methods-blog-feed/) before adding a suggestion, to see whether I’ve already included it.



Ledgerwood, A. (2017). Why the F*ck I Waste My Time Worrying about Equality. [online] Incurably Nuanced. Available at: http://incurablynuanced.blogspot.co.uk/2017/01/inequality-in-science.html. Accessed on 25/01/2017

Ledgerwood, A.; Haines, E.; & Ratliff, K. (2015). Guest Post: Not Nutting Up or Shutting Up. [online] sometimes i’m wrong. Available at: http://sometimesimwrong.typepad.com/wrong/2015/03/guest-post-not-nutting-up-or-shutting-up.html. Accessed on 25/01/2017

Research Excellence Framework (2014). FAQ’s. [online] Available at: http://www.ref.ac.uk/about/guidance/faq/all/. Accessed on 25/01/2017



#Load ggplot2 for the graphs
library(ggplot2)
#Create a character vector called "BlogName" with all the names of the different blogs in it
BlogName<-c("Brown", "Coyne", "Allen", "Neurobonkers", "Sakaluk", "Heino", "Kruschke", "Giner-Sorolla", "Magnusson", "Zwaan", "CogTales", "Campbell", "Vanderkerckhove", "Mayo", "Funder", "Schonbrodt", "Fried", "Coyne", "Yarkoni", "Neuroskeptic", "JEPS", "Morey", "PsychBrief", "DataColada", "Innes-Ker", "Schwarzkopf", "PIG-E", "Rousselet", "Gelman", "Bishop", "Srivastava", "Vazire", "Etz", "Bastian", "Zee", "Schimmack", "Hilgard", "Rouder", "Lakens")
#Create a numeric vector called "BlogGender" coding each author as 1, 2, or 3 (the coded values themselves were omitted from this post)
#Turn BlogGender into a factor where 1 is labelled Female, 2 Male, and 3 N/a
BlogGender<-factor(BlogGender, levels=1:3, labels=c("Female", "Male", "N/a"))
#Create a data frame of the variable BlogName by the variable BlogGender
Blogs<-data.frame(Name=BlogName, Gender=BlogGender)
#Because I couldn't work out how to create a graph straight from that data frame, I counted the male and female blog authors by hand
#(as.data.frame(table(Blogs$Gender)) would produce the counts directly)
Gender<-c("Female", "Male")
#Frequency is the vector of the two counts (the values were omitted from this post)
#Data frame of the labels and counts
Blogsdata<-data.frame(Gender=Gender, Frequency=Frequency)
#Graph object of the data frame with gender as the x axis and frequency as the y, coloured according to the variable Gender
Gender_Graph<-ggplot(Blogsdata, aes(Gender, Frequency, fill=Gender))
#Put bars on my graph object and give it a title
Gender_Graph+geom_bar(stat="identity")+ggtitle("Number of female blog authors compared to male blog authors")

#Same approach for ethnicity: Ethnlist codes each author as 1, 2, or 3 (the coded values were omitted from this post)
BlogName<-c("Brown", "Coyne", "Allen", "Neurobonkers", "Sakaluk", "Heino", "Kruschke", "Giner-Sorolla", "Magnusson", "Zwaan", "CogTales", "Campbell", "Vanderkerckhove", "Mayo", "Funder", "Schonbrodt", "Fried", "Coyne", "Yarkoni", "Neuroskeptic", "JEPS", "Morey", "PsychBrief", "DataColada", "Innes-Ker", "Schwarzkopf", "PIG-E", "Rousselet", "Gelman", "Bishop", "Srivastava", "Vazire", "Etz", "Bastian", "Zee", "Schimmack", "Hilgard", "Rouder", "Lakens")
BlogEthn<-factor(Ethnlist, levels=1:3, labels=c("White", "Non-white", "N/a"))
Ethn<-c("White", "Non-white")
Frequency<-c(39, 2)
Ethndata<-data.frame(Ethn=Ethn, Frequency=Frequency)
EthnGraph<-ggplot(Ethndata, aes(Ethn, Frequency, fill=Ethn))
EthnGraph+geom_bar(stat="identity")+ggtitle("Number of non-white blog authors compared to white blog authors")


#Same approach for ECR status: ECRlist codes each author as 1, 2, or 3 (the coded values were omitted from this post)
BlogName<-c("Brown", "Coyne", "Allen", "Neurobonkers", "Sakaluk", "Heino", "Kruschke", "Giner-Sorolla", "Magnusson", "Zwaan", "CogTales", "Campbell", "Vanderkerckhove", "Mayo", "Funder", "Schonbrodt", "Fried", "Coyne", "Yarkoni", "Neuroskeptic", "JEPS", "Morey", "PsychBrief", "DataColada", "Innes-Ker", "Schwarzkopf", "PIG-E", "Rousselet", "Gelman", "Bishop", "Srivastava", "Vazire", "Etz", "Bastian", "Zee", "Schimmack", "Hilgard", "Rouder", "Lakens")
BlogECR<-factor(ECRlist, levels=1:3, labels=c("Yes", "No", "N/a"))
ECR<-c("Yes", "No")
Frequency<-c(9, 31)
ECRdata<-data.frame(ECR=ECR, Frequency=Frequency)
ECRGraph<-ggplot(ECRdata, aes(ECR, Frequency, fill=ECR))
ECRGraph+geom_bar(stat="identity")+ggtitle("Number of non-ECR blog authors compared to ECR blog authors")

The best papers and articles of 2016

These are some of the best scientific papers and articles I’ve read this year. They’re in no particular order, not all of them were written this year, and I don’t necessarily agree with them. I’ve divided them into categories for convenience.


Current Incentives for Scientists Lead to Underpowered Studies with Erroneous Conclusions by Andrew Higginson and Marcus Munafò. How the current way of doing things in science encourages scientists to run lots of small scale studies with low evidentiary value.

Selective Publication of Antidepressant Trials and Its Influence on Apparent Efficacy by Erick Turner, Annette Matthews, Eftihia Linardatos, Robert Tell, and Robert Rosenthal. The paper that drove home for me the suppression of negative trials for the efficacy of antidepressants and how this affected our perception of them.

Why Does the Replication Crisis Seem Worse in Psychology? by Andrew Gelman. Why psychology is at the forefront of the replication crisis.

False positive psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant by Joseph Simmons, Leif D. Nelson, and Uri Simonsohn. Essential reading for everyone. An excellent demonstration of how damaging some standard research practices can be.

Power failure: why small sample size undermines the reliability of neuroscience by Katherine Button, John Ioannidis, Claire Mokrysz, Brian Nosek, Jonathan Flint, Emma Robinson, & Marcus Munafò. A discussion of the average power of neuroscience studies, what this means, and how to improve the situation. Another must read.

Recommendations for Increasing Replicability in Psychology by Jens B. Asendorpf, Mark Conner, Filip De Fruyt, Jan De Houwer, Jaap Denissen, Klaus Fiedler, Susann Fiedler, David Funder, Reinhold Kliegl, Brian Nosek, Marco Perugini, Brent Roberts, Manfred Schmitt, Marcel van Aken, Hannelore Weber, Jelte M. Wicherts.  A list of how to improve psychology.

Do Learners Really Know Best? Urban Legends in Education by Paul Kirschner & Jeroen van Merriënboer. A critique of some of the myths in education, such as “digital natives” and learning styles.

My position on “Power Poses” by Dana Carney. Dana Carney explains why she no longer believes in the well known phenomenon “power posing”. A rare and important document that should be encouraged and celebrated.

Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking by Jelte Wicherts, Coosje  Veldkamp, Hilde Augusteijn, Marjan Bakker, Robbie van Aert, and Marcel van Assen. A checklist to consult when reading or designing a study to make sure the authors haven’t engaged in p-hacking. A very useful resource.

Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond by Christopher Chambers, Eva Feredoes, Suresh Muthukumaraswamy, and Peter Etchells. What Registered Reports are & are not and why they are important for improving psychology.

Replication initiatives will not salvage the trustworthiness of psychology by James Coyne. Why replications, though important, are not enough to save psychology (open data & research methods are also essential)

Saving Science by Daniel Sarewitz. Why scientists need to make their research not only accessible to the public but also applicable, so as to stop science from “self-destructing”.

A Multilab Preregistered Replication of the Ego-Depletion Effect by Martin Hagger and Nikos Chatzisarantis. The paper that undermined the previously rock solid idea of ego-depletion and brought the replication crisis to the public.

Everything is Fucked: The Syllabus by Sanjay Srivastava. A collection of articles demonstrating many of the problems in psychology and associated methodologies.

Why summaries of research on psychological theories are often uninterpretable by Paul Meehl. Seminal paper by Meehl which discusses how ten obfuscating factors undermine psychological theories.


Donald Trump: Moosbrugger for President by David Auerbach. The best analysis of Trump’s personality I’ve read this year.

Hillary Clinton, ‘Smart Power’ and a Dictator’s Fall by Jo Becker and Scott Shane. An exposé on the Libyan intervention and the role Hillary Clinton played.

Your App Isn’t Helping The People Of Saudi Arabia by Felix Biederman. A brief history of how religion came to dominate life in Saudi Arabia, interviews with some of the people negatively affected by this, and how the involvement of tech innovations won’t help.

The Right Has Its Own Version of Political Correctness. It’s Just as Stifling by Alex Nowrasteh. A welcome antidote to the constant message that the left are the only ones who censor others.

Too Much Stigma, Not Enough Persuasion by Conor Friedersdorf. Why the left’s habit of tearing our own apart is so counterproductive.

When and Why Nationalism Beats Globalism by Jonathan Haidt. When and why globalism loses to nationalism in Western politics.

Democracies end when they are too democratic. And right now, America is a breeding ground for tyranny by Andrew Sullivan. The more democratic a nation becomes, the more vulnerable it is to a demagogue.

How Half Of America Lost Its F**king Mind by David Wong and Trump: Tribune Of Poor White People by Rod Dreher. People who went and spoke to Trump supporters explain why he appeals to them.

It’s NOT the economy, stupid: Brexit as a story of personal values by Eric Kaufmann. How personality affected voting patterns in the British referendum.


A crisis of politics, not economics: complexity, ignorance, and policy failure by Jeffrey Friedman. The libertarian explanation for the financial crash of ’08.

Capitalist Fools by Joseph Stiglitz. The more typical explanation for the ’08 crash.


P-Curve: A Key to the File-Drawer by Uri Simonsohn, Leif D. Nelson, and Joseph P. Simmons. A useful tool to test for publication bias in a series of results.

Statistical points and pitfalls by Jimmie Leppink, Patricia O’Sullivan, and Kal Winston. A series of publications on common statistical errors. Read them so you can avoid these mistakes.

Improving your statistical inferences by Daniel Lakens. I’m kind of cheating with this one (well, totally cheating) but this is the best resource I’ve used to develop my understanding of statistical tests and results.

The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant by Andrew Gelman and Hal Stern. A relatively simple statistical concept that isn’t as well known as it should be.

Small Telescopes by Uri Simonsohn. A helpful way to interpret studies and design suitably powered replications.


Book Review: Albion’s Seed by Scott Alexander. Review of a book that (kind of) explains the differences in American geopolitics by looking at the different groups of people who settled in America.

There is no language instinct by Vyvyan Evans. A dismantling of the pervasive idea that humans are born with an innate ability to interpret language.

The Failed Promise of Legal Pot by Tom James. How and why decriminalisation of marijuana can fail, as well as the way you need to approach legalisation in order for it to succeed.

Clean eating and dirty burgers: how food became a matter of morals by Julian Baggini. How we moralise food and the negative consequences of this.

The truth about the gender wage gap by Sarah Kliff. The best explanation of why there is a gap in pay between the genders that I’ve read.

Gamble for science!

Do you think you know which studies will replicate? Do you want to help improve the replicability of science? Do you want to make some money? Then take part in this study on predicting replications!1

But why are researchers gambling on whether a study will successfully replicate (defined as finding “an effect in the same direction as the original study and a p-value<0.05 in a two-sided test” for this study)? Because there is some evidence to suggest that a prediction market can be a good predictor of replicability, even better than individual psychology researchers.
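The study’s definition of a successful replication is operational enough to write down directly. Here is a minimal sketch in Python (the function name and example numbers are my own illustration, not from the study):

```python
# The study's rule: a replication succeeds if it finds an effect in the same
# direction as the original AND p < 0.05 in a two-sided test.
# Function name and inputs are illustrative, not from the study itself.

def successful_replication(original_effect, replication_effect, replication_p):
    """Return True if the replication counts as successful under the study's rule."""
    same_direction = (original_effect > 0) == (replication_effect > 0)
    return same_direction and replication_p < 0.05

print(successful_replication(0.45, 0.20, 0.01))   # same direction, significant: True
print(successful_replication(0.45, -0.10, 0.01))  # significant but wrong direction: False
```

Note that under this rule a replication can “succeed” even when its effect is much smaller than the original, which matters for the power discussion below.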

What is a prediction market?

A prediction market is a forum where people can bet on whether something will happen or not by buying or selling stocks in it. In this study, if you think a study will replicate you can buy stocks in that study. This raises the value of the stock because it is in demand. Because the transactions are public, this shows other buyers that you (though you remain anonymous) think this study will replicate. They might agree with you and also buy, further raising the value of the stock. However, if you don’t think a study will replicate you won’t buy stocks in it (or you will sell the stocks you have) and the price will go down. If you’re feeling confident, you can short sell a stock: you borrow stocks of something you think will devalue (a study that you don’t think will replicate, where you believe others will come to agree with you), sell them to someone else, then buy back the stocks after their value has dropped to return them to the person you initially borrowed from, keeping the margin.
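The mechanics of a short sale can be made concrete with some simple arithmetic. A sketch in Python (the prices and share counts are invented for illustration):

```python
# Hypothetical short sale on a study you believe will NOT replicate.
# Prices are in (invented) cents per share to keep the arithmetic exact.

borrowed_shares = 10
sell_price = 70      # cents per share when you borrow and immediately sell
buyback_price = 40   # cents per share after the market loses confidence

proceeds = borrowed_shares * sell_price           # 700 cents from selling
cost_to_return = borrowed_shares * buyback_price  # 400 cents to buy them back
profit = proceeds - cost_to_return                # the margin you keep

print(profit)  # 300
```

If the price rises instead of falling, the same arithmetic produces a loss, which is why short selling is only for the confident.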

This allows the price to be interpreted as the “predicted probability of the outcome occurring”. For each study, participants could see the current market prediction for the probability of successful replication.

A brief history of prediction markets for science:

Placing bets on whether a study will replicate is a relatively new idea; it was first tested with a sample of the studies from the Reproducibility Project: Psychology (2015) by Dreber et al. (2015)2. They took 44 of the studies in the RP:P and, prior to the replications being conducted, asked participants (researchers who took part in the RP:P, though they weren’t allowed to bet on studies they were involved in) to rate how likely they thought the studies were to replicate. They then created a market where participants could trade stocks on how likely they thought each study was to replicate. Finally, they compared how accurate the individual researchers and the market were at predicting which studies would replicate. The market significantly outperformed the researchers’ prior predictions, with the researchers’ survey predictions not performing significantly better than chance (and, when weighted for self-rated expertise, performing exactly as well as flipping a coin).

This result was not replicated in a sample of 18 economics papers by Camerer et al. (2016a), who found no significant difference between the predicted and actual replication rates for either method (market or prior prediction). However, they noted that the smaller sample size in their study may have contributed to this null finding (along with other factors).

What the current study will do:

Camerer et al. (2016b) will take 21 studies and attempt to replicate them. Before the results are published, a sample of participants will trade stocks on the chance of these studies replicating. To calculate the number of participants (n) needed for a worthwhile replication, the researchers looked at the p-value, n, and standardised effect size (r) of each original study. They then calculated how many participants they would need to have “90% power to detect 75% of the original, standardised effect size”. This means that if the original r was 0.525, with an n of 51 and a p-value of 0.000054, you would need 65 participants to have a 90% chance of detecting 75% of the original effect. If the first replication isn’t successful, the researchers will conduct another replication with a larger sample size (pooled from the original replication sample and a second sample) that will have 90% power to detect 50% of the original effect.
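As a rough check on these numbers, the required n for a correlation can be approximated with the standard Fisher z formula. A sketch in Python (this is my approximation, not necessarily the exact procedure the researchers used, so the answer differs slightly from theirs):

```python
# Approximate sample size needed to detect a correlation r with a given power:
# n ≈ ((z_{alpha/2} + z_{power}) / atanh(r))^2 + 3   (Fisher z approximation)
import math
from statistics import NormalDist

def n_for_correlation(r, power=0.90, alpha=0.05):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

# 75% of the original r = 0.525 from the example in the text
print(n_for_correlation(0.75 * 0.525))  # 64 with this approximation, close to the 65 reported
# Powering for 50% of the original effect requires far more participants
print(n_for_correlation(0.50 * 0.525))  # 149
```

The second figure shows why the fallback plan (detecting 50% of the original effect) needs the pooled, larger sample.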

But there’s one problem: you shouldn’t use the effect sizes from previous results to determine your sample size.

Using the past to inform the future:

Morey & Lakens (2016) argue that most psychological studies are statistically unfalsifiable because the statistical power (usually defined as the probability of correctly rejecting a false null hypothesis) is often too low. This is mainly due to small sample sizes, a ubiquitous problem in psychological research (though, as Sam Schwarzkopf notes here, measurement error can add to this problem). This threatens the evidentiary value of a study, as it might mean the study wasn’t suitably powered to detect a real effect. Coupled with publication bias (Kühberger et al., 2014) and the fact that studies with small sample sizes are the most susceptible to p-hacking (Head et al., 2015), this makes judging the evidential weight of the initial finding difficult.

So any replication is going to have to take into account the weaknesses of the original studies. This is why basing the required n on previous effect sizes is flawed, because they are likely to be over-inflated. Morey and Lakens give the example of a fire alarm: “powering a study based on a previous result is like buying a smoke alarm sensitive enough to detect only the raging fires that make the evening news. It is likely, however, that such an alarm would fail to wake you if there were a moderately-sized fire in your home.” Because of publication bias, the researchers will expect there to be a larger effect size than may actually exist. This may mean they don’t get enough participants to discover an effect.

But these researchers have taken steps to avoid this problem. They have powered their studies to detect 50% of the original effect size (if they are unsuccessful with their first attempt at 75%). Why 50%? Because the average effect size found in the RP:P was 50% of the original. So they are prepared to discover an effect size half as large as was reported in the initial study. This of course leaves the possibility that they miss a true effect because it is less than half of the originally reported effect (and given the average effect size was 50%, we know there will very likely be smaller effects). But they have considered the problem of publication bias and attempted to mitigate it. They have also eliminated the threat of p-hacking, as these are Registered Reports with preregistered protocols, so there is almost no chance of questionable research practices (QRPs) occurring.

But is their attempt to eliminate the problem of publication bias enough? Are there better methods for interpreting replication results than theirs?

There is another way:

There have been a variety of approaches to interpreting a replication attempt published in the literature. I am going to focus on two: the ‘Small Telescopes’ approach (Simonsohn, 2015) and the Bayesian approach (Verhagen & Wagenmakers, 2014). The ‘Small Telescopes’ approach focuses on whether the initial study was suitably powered to detect the effect. If the replication with the larger n finds a statistically significantly smaller effect size than the original, e.g. an effect that would have given the first study only 33% power3 (if you were to take the replication effect size and put it into the power calculation of the original), then this suggests the initial study didn’t have a big enough sample size to detect the effect. In the same paper Simonsohn also recommends using 2.5x the number of participants of the initial study, which ensures 80% power regardless of publication bias, the power of the original studies, and study design. Looking at the replications in the Social Sciences Replication Project, 18 of the 21 studies have at least 2.5x the number of participants of the first study by the second phase of data collection, leaving just 3 that don’t (studies 2, 6, and 11).
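The ‘Small Telescopes’ benchmark can be sketched numerically. Below is a Python illustration using a normal approximation for a two-group comparison (the example sample size is invented, and this is my simplification of Simonsohn’s procedure, not his exact code):

```python
# 'Small Telescopes' sketch: find the effect size d33 that would have given
# the ORIGINAL study only 33% power (normal approximation, two-sample design,
# two-sided alpha = .05). If the replication's estimate is significantly
# smaller than d33, the original study was too small to detect the effect.
import math
from statistics import NormalDist

def small_telescope_d33(n_per_group, alpha=0.05):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(0.33)  # negative, since power < 50%
    return (z_alpha + z_power) * math.sqrt(2 / n_per_group)

print(round(small_telescope_d33(20), 2))  # 0.48 for an invented n of 20 per group

# Simonsohn's companion rule of thumb: give the replication 2.5x the original n
print(2.5 * 20)  # 50.0 participants per group
```

The intuition matches Simonsohn’s telescope metaphor: d33 is the smallest “object” the original instrument could plausibly have seen.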

The Bayesian test to quantify whether a replication was successful compares two hypotheses: that the effect is spurious and no greater than 0 (the null hypothesis); and that the effect is consistent with what was found in the initial study (the posterior distribution from the original study is used as the prior for the replication). The larger the effect size in the first study, the larger we expect the effect size of the replication to be, and the further from zero we expect it to be. We can therefore quantify how much the replication data support the null hypothesis or the alternative hypothesis (that the replication finds an effect similar to the original study).
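Under normal approximations this replication Bayes factor has a simple closed form. The sketch below is my simplification of the idea, not Verhagen & Wagenmakers’ exact test, and all numbers are invented:

```python
# Simplified replication Bayes factor using normal approximations.
# H0: the true effect is 0.
# Hr: the true effect follows the original study's posterior,
#     approximated as Normal(d_orig, se_orig^2).
# The replication's estimate d_rep (standard error se_rep) is evaluated
# under each hypothesis; the ratio is the Bayes factor BF_r0.
import math
from statistics import NormalDist

def replication_bf(d_orig, se_orig, d_rep, se_rep):
    # Marginal likelihood under Hr: Normal(d_orig, se_orig^2 + se_rep^2)
    sd_r = math.sqrt(se_orig**2 + se_rep**2)
    like_hr = NormalDist(d_orig, sd_r).pdf(d_rep)
    # Likelihood under H0: Normal(0, se_rep^2)
    like_h0 = NormalDist(0, se_rep).pdf(d_rep)
    return like_hr / like_h0  # >1 favours "an effect like the original"

print(replication_bf(0.6, 0.2, 0.55, 0.1) > 1)  # replication agrees: True
print(replication_bf(0.6, 0.2, 0.0, 0.1) > 1)   # replication finds nothing: False
```

This captures the logic in the text: the bigger the original effect, the further from zero the replication estimate must land for the data to favour Hr over H0.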

Simonsohn argues that using the somewhat arbitrary 90% power to detect 75% or 50% of the original effect is inferior to the ‘Small Telescopes’ or the Bayesian approach (personal communication, October 24, 2016). This is because the other two approaches have been thoroughly analysed: we know what their results mean and how often they give each verdict under different underlying effects. The same has not been done for the method used in this study, so we can’t be as confident in its results as we can for the other two methods.


I would recommend getting involved in this study (if you can) and, if you can’t, looking out for the results when they are published, as they will be very interesting. But try to avoid using previous effect sizes to calculate future sample sizes, because of the problem of publication bias. You also need to consider how you interpret replication results and the best means of doing so (does the method have a solid theoretical justification and a clear meaning, and have its statistical properties been thoroughly examined?).

Author feedback:

I contacted all the researchers from Camerer (2016b) prior to publication to see if they had any comments. Anna Dreber had no criticisms of the post and Colin Camerer expressed familiarity with the issues raised by Morey & Lakens. Magnus Johannesson made the point that they used a 50% chance of finding the original effect size precisely because of publication bias and the fact the average effect size in the RP:P was 50%. I updated my post to reflect this. I contacted Richard Morey to clarify about calculating the required n for a study using previous effect sizes. I contacted Uri Simonsohn about whether the actions of the researchers had overcome the problems he outlined in his ‘Small Telescopes’ paper. He argued they were worse than previously explored methods and I added this information.


Almenberg, J.; Kittlitz, K.; & Pfeiffer, T. (2009). An Experiment on Prediction Markets in Science. PLoS ONE 4(12): e8500. doi:10.1371/journal.pone.0008500

Camerer, C.; Dreber, A.; Ho, T.H.; Holzmeister, F.; Huber, J.; Johannesson, M.; Kirchler, M.; Almenberg, J.; Altmejd, A.; Buttrick, N.; Chan, T.; Forsell, E.; Heikensten, E.; Hummer, L.; Imai, T.; Isaksson, S.; Nave, G.; Pfeiffer, T.; Razen, M.; & Wu, H. (2016a). Evaluating replicability of laboratory experiments in economics. Science, DOI: 10.1126/science.aaf0918

Camerer, C.; Dreber, A.; Ho, T.H.; Holzmeister, F.; Huber, J.; Johannesson, M.; Kirchler, M.; Nosek, B.; Altmejd, A.; Buttrick, N.; Chan, T.; Chen, Y.; Forsell, E.; Gampa, A.; Heikensten, E.; Hummer, L.; Imai, T.; Isaksson, S.; Manfredi, D.; Nave, G.; Pfeiffer, T.; Rose, J.; & Wu, H. (2016b). Social Sciences Replication Project. [online] Available at: http://www.socialsciencesreplicationproject.com/

Dreber, A.; Pfeiffer, T.; Almenbergd, J.; Isakssona, S.; Wilsone, B.; Chen, Y.; Nosek, B.A.; & Johannesson, M. (2015) Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112 (50), 15343–15347.

Hanson, R. (1995). Could gambling save science? Encouraging an honest consensus. Social Epistemology, 9 (1).

Head, M.L.; Holman, L.; Lanfear, R.; Kahn, A.T.; & Jennions, M.D. (2015). The Extent and Consequences of P-Hacking in Science. PLoS Biol 13(3): e1002106. doi:10.1371/journal.pbio.1002106

Investopedia (2016). Short Selling. [online] Available at: http://www.investopedia.com/terms/s/shortselling.asp. Accessed on: 21/10/2016.

Kühberger, A.; Fritz, A.; & Scherndl, T. (2014). Publication Bias in Psychology: A Diagnosis Based on the Correlation between Effect Size and Sample Size. PLoS ONE 9(9): e105825. doi:10.1371/journal.pone.0105825

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349 (6251): aac4716.

Open Science Foundation (2016). Registered Reports. [online] Available at: https://osf.io/8mpji/wiki/home/

Morey, R. & Lakens, D. (2016). Why most of psychology is statistically unfalsifiable. [online] Available at: https://github.com/richarddmorey/psychology_resolution/blob/master/paper/response.pdf

Schwarzkopf, S. (2016). Boosting power with better experiments. [online] Available at: https://neuroneurotic.net/2016/09/18/boosting-power-with-better-experiments/

Simonsohn, U. (2015). Small Telescopes: Detectability and the Evaluation of Replication Results. Psychological Science, 26 (5), 559–569.

Verhagen, J. & Wagenmakers, E.J. (2014). Bayesian Tests to Quantify the Result of a Replication Attempt. Journal of Experimental Psychology: General, 143 (4), 1457-1475.

Credit where credit is due

There has been a lot of tension in the psychological community recently. Replications are becoming more prevalent and many of them are finding much smaller effects or none at all. This then raises a lot of uncomfortable questions: is the studied effect real? How was it achieved in the first place? Were less than honest methods used (p-hacking etc.)? The original researchers can sometimes feel that these questions go beyond valid criticisms to full-blown attacks on their integrity and/or their abilities as a scientist. This has led to heated exchanges and some choice pejoratives being thrown about by both “sides”1.

This blog post isn’t here to pass judgement on those who are defending the original studies or the replications (and all the associated behaviour and comments). This article is here to celebrate the behaviour of a researcher whose example I think many of us should follow.

Dana Carney was one of the original researchers who investigated a phenomenon called “power posing” (Carney, Cuddy, & Yap, 2010). They supposedly found that “high-power nonverbal displays” affected your hormone levels and feelings of confidence. But after a large-scale failed replication and a re-analysis, it appears there is no effect.

So, as one of the main researchers behind this effect, what should you do when faced with this evidence? All the incentive structures currently in place2 would encourage you to hand-wave away these issues: the replication wasn’t conducted properly, there are hidden moderators that were not present in the replication, the replicators were looking to find no effect, etc. But Carney has written an article stating that she does not believe “power pose” effects are real. She goes into further detail about the problems with the original study, admitting to using “researcher degrees of freedom” to fish for a significant result and to analysing “subjects in chunks”, stopping when a significant result was found.

I find this honesty commendable and wish all researchers whose work is shown to be false were able to admit past mistakes and wrongdoing. Psychology cannot improve as a science unless we update our beliefs in the face of new evidence. As someone who is quite early in their science career, I’ve never had the experience of someone failing to replicate a finding of mine, but I imagine it is quite hard to take (for more detail I recommend this post by Daniel Lakens). Admitting that something you have discovered isn’t real, whilst difficult, helps us have a clearer picture of reality. Hopefully this acknowledgement will encourage others to be more honest with their work.

But there’s a reason why few have taken this step. The potential negative consequences can be quite daunting: loss of credibility due to admissions of p-hacking, and undermining of key publications (which may have an impact on job and tenure applications), to name a few. I understand and am (slightly) sympathetic as to why it is so rare. This is why I like Tal Yarkoni’s suggestion of an “amnesty paper” where authors could freely admit they have lost confidence in a finding of theirs and explain why. They could do so without any fear of repercussions, and because many others would be doing it, it would be less daunting. Until journals are willing to publish these kinds of articles, I would suggest a website/repository be created dedicated to such articles, so there is a publicly available record of doubts about a researcher’s findings. I also think it is important to celebrate those who do decide to publicly denounce one of their findings, as it should encourage scientists to see this admission as a sign of strength, not weakness. That will help change the culture around failed replications and past findings, which will hopefully encourage scientists to write these articles expressing doubts about their past work and journals to publish them. I believe psychology will only improve if this behaviour becomes the norm.

Notes: I contacted Dana Carney prior to publication and she had no corrections to make.


Carney, D.R.; Cuddy, A.J.C.; & Yap, A.J. (2010). Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance. Psychological Science, 21 (10) 1363–1368.

Lakens, D. (2016). Why scientific criticism sometimes needs to hurt [online] Available at: http://daniellakens.blogspot.co.uk/2016/09/why-scientific-criticism-sometimes.html

Ranehill, E.; Dreber, A.; Johannesson, M.; Leiberg, S.; Sul, S.; & Weber, R.A. (2015). Assessing the robustness of power posing: no effect on hormones and risk tolerance in a large sample of men and women. Psychological Science, 26 (5), 653–656.

Simmons, J. & Simonsohn, U. (2015). [37] Power Posing: Reassessing The Evidence Behind The Most Popular TED Talk [online] Available at: http://datacolada.org/37

Notes on Paul Meehl’s “Philosophical Psychology Session” #05

These are the notes I made whilst watching the video recording of Paul Meehl’s philosophy of science lectures. This is the fifth episode (a list of all the videos can be found here). Please note that these posts are not designed to replace or be used instead of the actual videos (I highly recommend you watch them). They are to be read alongside them, to help you understand what was said. I also do not include everything that he said (just the main/most complex points).

  • Operationism states all admissible concepts in a scientific theory must be operationally defined in observable predicates, BUT that’s incorrect: you don’t need all theoretical postulates to map to observable predicates.
  • You don’t need constants to be able to use functions and see if the components are correct. Given the function forms you can know the parameters (the ideal case is to derive the parameters). Weaker version: I can’t say what a, b, and c are, but I know they are transferable, or that a tends to be twice as big as b. If the theory permits that, it’s a risky prediction (it could be shown to be wrong). Theories are lexically organised (from higher to lower parts): you don’t ask questions about the lower points before answering the higher-up ones, in a way that makes theories comparable. If two theories have the same entities arranged in the same structure with the same connections, with the same functions describing the connections between them, and the parameters are the same, then t1 and t2 are empirically the same theory. If we can compare two theories, we can compare our theory (t1) to omniscient Jones’ theory (tOJ) and see the verisimilitude of our theory (how much it corresponds with tOJ).
  • People can become wedded to theories or methods. This results in demonising the “enemy” & an unwillingness to give up that theory/method.


  • Lakatosian defence (general model of defending a theory): (t ∧ At ∧ Cp ∧ Ai ∧ Cn) ⊢ (o1 ⊃ o2), where ⊢ is the strict turnstile of deducibility

AND, absent the theory, P(o2 | o1) on background knowledge bk is small

– this extension allows you to say you have corroborated the theory by the facts (because without this small prior the inference is formally invalid). When P is very small, it meets Salmon’s criterion for a damn strange coincidence


t= theory we are interested in

At= theoretical auxiliaries we’ve tied to our initial theory (almost always more than 1)

Cp= ceteris paribus clause (all other things being equal). No systematic other factors (they have been randomised/controlled for) but there will be individual differences.

Ai= instrumental auxiliaries. Theories about some controlling or measuring instruments. You distinguish between At and Ai by which field it’s in (if it’s in the same science then it’s an At)

Cn= conditions, experimenter describes to you what they did, very thorough methodology (often incompletely described).

*If the theory is true and the auxiliaries are true, the ceteris paribus clause is true and the instruments are accurate and you did what you said you did, it follows deductively that if you observe o1 you will observe o2

  • This only works left to right; can never deduce the scientific theory from the facts.
  • Sometimes you can’t assume the main theory to test the auxiliary theories; you are testing both of them. So if it’s corroborated, then you’ve corroborated both.
  • Can be validating a theory and validating a test at the same time. Only works if the conjunction of the two leads to a damn strange coincidence.


  • Strong use of predictions=to refute the theory.
  • Suppose we observe (o1 ∧ ~o2). Modus tollens: P ⊃ Q, ~Q, therefore ~P.
  • Lakatosian criticism: modus tollens only tells us the whole of the left side is false, not which specific part is.
  • To deny the conjunction: ~(p ∧ q ∧ r ∧ s) is equivalent to (~p ∨ ~q ∨ ~r ∨ ~s).
  • The formal equivalent of a negation over a conjunction is a disjunction of the negated statements on the left (De Morgan’s law).
  • Short form: the denial of a conjunction is a disjunction of the denial of the conjuncts.
  • So when we falsify the right side in the lab, we falsify the left side, but because it’s a conjunction this only tells us something on the left is wrong. But we are testing t, so we want to specify whether that in particular is false or not.
  • Randomness is essential for Fisherian statistics.
  • In soft psychology, probability that Cp is literally true is incredibly small.
  • If you start distributing confidence levels to the different conjuncts you work towards “robustness” and can see by how much Cp is false.
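The De Morgan step above (the denial of a conjunction is the disjunction of the denials of the conjuncts) can be checked exhaustively over all truth assignments; a minimal Python sketch, purely illustrative:

```python
from itertools import product

# Check De Morgan's law for a four-way conjunction (t, At, Cp, Ai-style):
# the denial of a conjunction equals the disjunction of the denied conjuncts.
for p, q, r, s in product([True, False], repeat=4):
    lhs = not (p and q and r and s)
    rhs = (not p) or (not q) or (not r) or (not s)
    assert lhs == rhs

print("denial of the conjunction == disjunction of the denials, for all 16 assignments")
```

This is exactly why falsifying o2 leaves us with only a disjunction on the left: any one of the conjuncts being false suffices.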


  • Often you can’t tell (from an experiment) whether a finding is due to what is reported or to a confounding variable. You have to consider all potential confounding variables and escape from the logically invalid third figure of the syllogism by exploring all of them.
  • Different methods result in different Cp’s & At’s, something not often considered.
  • Lakatosian defence of theory is only worthwhile if it has something going for it; it has been falsified in a literal sense but has enough verisimilitude that it’s worth sticking with.
  • When examining part of the conjunct, look at Cn first. Can say: “Let’s wait to see if it replicates”.
  • Ai isn’t a great place to start for psychologists.
  • Cp is good (you can almost assume it’s false). When we have different types of experiments over different qualitative domains of data, challenging Cp in one experiment doesn’t threaten the theory’s success in other domains.
  • If you challenge At, if that auxiliary plays a role in derivation chain to experiments in other domains and you try to fix up failed experiments by challenging auxiliaries then all derivation chains that worked in past will now be screwed (because you’re undermining one of the links). Cp is more likely to be domain specific (violated in different ways in different settings).
  • Can modify Cp by adding a postulate (as you don’t want to fiddle with At) because you may have changed subjects or environment etc.
  • Progressive movement: Can turn falsifier into a corroborator by adding auxiliaries that allow you to predict new (previously un-thought of) experiments. Not just post-hoc rationalisation of a falsifier (ad-hoc 1). Ad-hoc 2: when you post-hoc rationalise a falsifier by adding auxiliaries & make new predictions but those predictions are then falsified.
  • Honest ad-hocery: ad-hoc rationalisations that give new predictions (which are found to be correct) that are risky and are a damn strange coincidence.


Yonce, J. L., 2016. Philosophical Psychology Seminar (1989) Videos & Audio, [online] (Last updated 05/25/2016) Available at: http://meehl.umn.edu/video [Accessed on: 06/06/2016]

Notes on Paul Meehl’s “Philosophical Psychology Session” #04

These are the notes I made whilst watching the video recording of Paul Meehl’s philosophy of science lectures. This is the fourth episode (a list of all the videos can be found here). Please note that these posts are not designed to replace or be used instead of the actual videos (I highly recommend you watch them). They are to be read alongside them, to help you understand what was said. I also do not include everything that he said (just the main/most complex points).

  • Saying “it’s highly probable that Caesar crossed the Rubicon” is the same as “it’s true that Caesar crossed the Rubicon” (1st is object language, 2nd is meta).
  • Probability (about evidence): talking about the relation between the evidence and the theory that the evidence is for.
  • Verisimilitude (ontological concept, refers to whether the state of affairs obtains or not in eyes of omniscient Jones) is NOT a statement about the relationship to the evidence (can’t be equated with probability); it’s a statement with relation to the world, whatever is the case (on the correspondence view).
  • Caesar crossed the Rubicon is true if and only if he crossed the Rubicon.
  • Caesar crossed the Rubicon is probable if there is sufficient evidence in the Vatican Library for you to believe that he did; it is also probable for the centurion who sat next to him as he crossed it.
  • Difference between the content and the evidence in support of it
  • This distinction completely torpedoed the doctrine that “meaning is the method of its verification”.
  • Verisimilitude is a matter of degree (just as confirmation is).
  • Science theories can differ in how true they are (non-binary).
  • Logically, can argue that a science theory is false if it contains ANY false statements.
  • Falsifying any conjunct in the argument immediately falsifies the conjunction: e.g. if T = S1 ∧ S2 ∧ S3, and S1 and S2 are true but S3 is false, then T is false.
  • But we have to talk about degrees of truth to get anything done. This is unsatisfactory, but there have been (unsuccessful) attempts to quantify degrees of truth, e.g. (true statements − false statements)/total statements. Probability is also unsatisfactorily defined (logicians can’t agree whether there is one type of probability or two).
  • In psychology (when using statistics), we use the frequency concept of probability; when we are evaluating theories, we may use the other kind.
  • Kinetic theory of gases: to explain the equations about gases (their volume, temperature, etc.), you could derive them from the principles of mechanics. The theory of heat was reduced to non-thermal concepts (mass, velocity, collision). They did this by imagining a cylinder of gas containing molecules which act like billiard balls (with mass, velocity, etc.). Degree of heat (temperature) and amount of heat are different things.
  • Scientists like going down the hierarchy of explanations.
  • Kinetic theory doesn’t work under extreme conditions. According to strict Popperian falsification this is an instance of modus tollens, so kinetic theory must be rejected. But we don’t do that. To abandon a theory is not the same as to falsify it.
  • Instrumentalists don’t care about truth (only utility), realists would have to reject it (but could recognise that part of it is false and so won’t totally abandon it).
  • Thinking about how the kinetic theory was false (in its idealised Popperian form) allowed researchers to explore it further, and it became a corroborator: thinking about how it was false tells us how to rewrite the equation and fit the model to the facts much better. You can be far enough along with your theorising to know that the theory is idealised, and you use that knowledge to change the equation (you don’t need a theory powerful enough to generate parameters; this can be done in psychology).
  • One kind of adjustment is to change the theory to fit the facts (as with above example). Other is to change belief about the particulars (e.g. Planets weren’t behaving as they should, hypothesised that there might be another planet in such and such a place. Point telescope there and voila, Neptune).
  • Primitive statements are more important in some sense than others.
  • We need an idea of the centrality of a postulate (ideas that are crucial to the theory and can’t be dispensed with) and the peripherality of a postulate (those that can be amended while you still hold the theory). Any way of getting at a theory’s verisimilitude that doesn’t take this into account is unsatisfactory. This is why you can’t measure verisimilitude by a nose count/sentence count (it gives the same weight to central and peripheral postulates).
  • Core and periphery aren’t particularly well explicated (and you can’t fully do it). One attempt: can I derive the theory-language statement (which comes from pairing theoretical statements with the replicated experiments) from the postulates of the theory? If not, the theory is incomplete. You look at the derivation chain from each of the postulates to the facts and see how many derivation chains contain common postulates. If a postulate is common to all derivation chains, it can be said to be a central postulate.
  • In any theory, we have a set of theoretical postulates. We have a set of mixed postulates (postulates that contain a mixture of observational and theoretical words).
  • From those you derive statements that are all observational. This makes a theory empirical: there is at least one sentence that can be ground out by applying the laws of logic and mathematics to the theory that contains no theoretical words, only logical, mathematical, and observational words. Some words are obviously observational (black), others are not (libido), but some are fuzzy, e.g. “237 amps”. But we have an ammeter and a theory about ammeters and trust the measurements, so it’s observational (though this can be disputed). You can link observations together (and therefore observational statements) via theoretical statements.


  • Kinds of theoretical entities: Russell said there was only 1 (events). Meehl’s ontology: substances (in the chemistry sense, e.g. elements), structures (including simples, e.g. quarks), events (e.g. the neuron spikes), states (e.g. Jones is depressed, I am thirsty; it is difficult to distinguish between events and states, could say events are just states strung out over long time intervals), dispositions (if x then y; the “-ble” words, e.g. soluble, flammable, are dispositional predicates), and fields (e.g. magnetic fields). This list can be used to analyse any concept in the social sciences.
  • Important kinds of events are when a structure or substance undergoes a change in state, which then changes its dispositions. The power of a magnet to attract is a first-order disposition; iron being able to become magnetic is a second-order disposition. Supreme dispositions are dispositions an object must have in order to be that object.
  • The list helps you think about what laws and theories are present in science. Most laws turn out to be compositional, functional-dynamic, or developmental. Compositional theories state what something is made out of and how it’s arranged. Functional-dynamic laws involve Aristotelian efficient causes: if you do this, then this will happen. Developmental laws describe how changes in state result in changes in disposition over time.
  • When comparing theories for similarity, list the kinds of entities and compare them: how do they connect (compositionally and functionally)? If you’ve drawn functional connections or time changes in developmental statements, you can ask: what’s the sign of the first derivative? You don’t claim to know what the function is, only whether y goes up or down with x. You’ve got a strand in the net connecting entities. What about the sign of the second derivative?
  • Continuous case: ∂F(x1, x2)/∂x1 > ∂F(x1, x2)/∂x2 everywhere (x1 and x2 are the two inputs). This means x1 is a more potent influence on the output than x2, but it still doesn’t tell you what the function is.
  • Discontinuous case: when the influence of x2 is greater depending on x1 being small.
  • This allows you to order the partial derivatives.
  • Interaction effect: with y as the output, [y(a present) − y(a absent)] when b is present, minus [y(a present) − y(a absent)] when b is absent, is not zero.
  • That is, the effect of a on y when b is present is greater than the effect of a on y when b is absent.
  • Theories can look the same/have the same connections/have same entities, but this theory makes this influence much more powerful but this only appears when you look into the derivatives (you see that there is an interaction).
  • Fisher effects=partial derivatives for continuous case. Interaction=mixed partial derivatives for continuous case.
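The discrete interaction formula above (a difference of differences) can be made concrete with a toy 2×2 table of cell means; all the numbers below are invented purely to illustrate the arithmetic:

```python
# Hypothetical cell means for a 2x2 design: output y as a function of
# factors a and b (numbers invented for illustration).
y = {
    ('a_present', 'b_present'): 10.0,
    ('a_absent',  'b_present'):  4.0,
    ('a_present', 'b_absent'):   6.0,
    ('a_absent',  'b_absent'):   5.0,
}

# Effect of a at each level of b ("simple effects")
effect_a_given_b_present = y[('a_present', 'b_present')] - y[('a_absent', 'b_present')]  # 6.0
effect_a_given_b_absent  = y[('a_present', 'b_absent')]  - y[('a_absent', 'b_absent')]   # 1.0

# Interaction: the difference between those differences (non-zero here),
# the discrete analogue of a non-zero mixed partial derivative.
interaction = effect_a_given_b_present - effect_a_given_b_absent  # 5.0
print(interaction)
```

A zero interaction would mean the effect of a is the same at both levels of b, i.e. the mixed partial derivative vanishes in the continuous case.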


Yonce, J. L., 2016. Philosophical Psychology Seminar (1989) Videos & Audio, [online] (Last updated 05/25/2016) Available at: http://meehl.umn.edu/video [Accessed on: 06/06/2016]

The replication crisis, context sensitivity, and the Simpson’s (Paradox)

The Reproducibility Project: Psychology:

The Reproducibility Project: Psychology (OSC, 2015) was a huge effort by many different psychologists across the world to try and assess whether the effects of a selection of papers could be replicated. This was in response to the growing concern about the (lack of) reproducibility of many psychological findings with some high profile failed replications being reported (Hagger & Chatzisarantis, 2016 for ego-depletion and Ranehill, Dreber, Johannesson, Leiberg, Sul, & Weber, 2015 for power-posing). They reported that of the 100 replication attempts, only ~35 were successful. This provoked a strong reaction not only in the psychological literature but also in the popular press, with many news outlets reporting on it.

But it wasn’t without its critics: Gilbert, King, Pettigrew, & Wilson (2016) examined the RP:P’s data using confidence intervals and came to a different conclusion. They were looking for “whether the point estimate of the replication effect size [fell] within the confidence interval of the original” (Srivastava, 2016). Some of the authors from the RP:P responded (Anderson et al., 2016) by pointing out errors in the Gilbert et al. paper. Another analysis was provided by Sanjay Srivastava (2016), who highlighted that whilst Gilbert et al. use confidence intervals, they define them incorrectly (calling into question any conclusions they draw). Gilbert et al. (2016) responded by reaffirming their claim that the replicators “violated the basic rules of sampling when [they] selected studies to replicate” and that many of the replications were unfaithful to the original studies (which is a valid criticism and is related to the idea of context sensitivity, the focus of this post). Etz & Vandekerckhove (2016) reanalysed 72 of the original studies using Bayesian statistics and found that 64% of those 72 studies (both originals and replications) did not provide strong evidence for either the null or the alternative hypothesis. Simonsohn (2016) argued that rather than ~65% of the replications failing, ~30% failed to replicate and ~30% of the replications were inconclusive.1

But there is one response and one explanation for the low replication rate I want to focus on: the context sensitivity of an experiment.

Location, location, location:

Context sensitivity is the idea that where you conduct an experiment has a large impact on it. It is a type of hidden moderator, as it is a variable that affects the experiment but usually isn’t being directly manipulated or controlled by the researcher. The environment in which you perform the study plays a role in the result and should be considered when conducting a replication. It is argued that you cannot detach the “experimental manipulations… from the cultural and historical contexts that define their meanings” (Touhey, 1981). The context of an experiment is very important in social psychology and has been studied for years, with evidence that it does shape people’s behaviour (for one of many examples, see Fiske, Gilbert, & Lindzey, 2010).

Van Bavel, Mende-Siedlecki, Brady, & Reinero (2016) argue that context sensitivity partly explains the poor replication rate of the RP:P. They found that the context sensitivity of a study (rated by 3 students with high inter-rater reliability) had a statistically significant negative correlation with the success of the replication attempt (r=−0.23, P = 0.02). This means the more contextually sensitive the finding was, the less likely it was to replicate. It was still significantly associated with replication success after controlling for the sample size of the original study (which has been suggested to have a significant impact on the success of a replication; Vankov, Bowers, Munafò, 2014). It was not the best predictor of a replication though: the statistical power of the replication and how surprising the replication was were the strongest predictors. They also analysed the data to see whether the discipline of psychology the original study was taken from (either social or cognitive psychology) moderated the relationship between contextual sensitivity and replication success. They did not find a significant interaction (this last point is very important but I’m going to examine it in more detail further on).

So this study appears to show that contextual differences had a significant impact on replication rates and that it should be taken into account when considering the results of the RP:P.

There’s no such thing as…

One of the responses to the paper was by Berger (2016). He stated that “context sensitivity” is too vague a concept to be of any scientific use. There are an enormous number of ways that “context” could impact a finding, and to present it as a uni-dimensional construct (as was done in Van Bavel et al., 2016) is illogical. Context sensitivity can therefore be used to justify any unexplained variance in psychological results. He calls for a more rigorous and falsifiable definition of context sensitivity (namely lack of theory specificity and heterogeneity) and for researchers to be specific about the source of the problems, e.g. is it variation in the population, location, time-period, etc. He also argues that researchers should a priori predict the heterogeneity and effect directions so we can scientifically evaluate the effect of these hidden moderators.

The hidden variable:

Another problem with the paper was highlighted by Schimmack (2016) and Inbar (in press). When you run the analyses again and properly control for sub-discipline (rather than test for the interaction as was originally done), the significant result Van Bavel et al. found disappears (from p=0.02 to p=0.51). They also calculated the correlation within groups (so the correlation between context sensitivity and replication success for cognitive psychology studies and for social psychology studies) and again found non-significant results (r = -.04, p = .79 and r = -.08, p = .54 respectively). This suggests context sensitivity only has a significant impact on replication rates when you don’t control for sub-discipline (so some disciplines of psychology are more likely to replicate than others). Van Bavel has replied to this by arguing you can’t control for sub-discipline as it is “part of the construct of interest” (Van Bavel, 2016).

Simpson’s Paradox:

So how does Simpson’s Paradox fit into all this? (Not those Simpsons, unfortunately: Edward H. Simpson.) Well, this is a perfect example of Simpson’s paradox: a trend that appears when groups are combined but disappears or reverses when the groups are examined separately. The classic example comes from Bickel, Hammel, & O’Connell (1975). They examined the admission rates for graduate school at the University of California, Berkeley for 1973. The data appeared to show a gender bias towards men, as 44% of male applicants were admitted but only 35% of female applicants.

But when you examine the departments individually, 6 of them admitted a higher proportion of women than men (and 4 admitted a higher proportion of men). When analysed, this preference for women was shown to be statistically significant. So how does this work? It’s because of a third variable: the rate of admission within each department. As stated in the article: “The proportion of women applicants tends to be high in departments that are hard to get into and low in those that are easy to get into.”

Table showing the 6 most applied-to departments [Wikipedia]
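The reversal is easy to reproduce with a toy example. The counts below are invented for illustration (the real figures are in Bickel et al., 1975); the point is only the mechanism:

```python
# Hypothetical admissions counts illustrating Simpson's paradox.
# (admitted, applied) per department and gender; all numbers invented.
data = {
    'hard_dept': {'men': (10, 100), 'women': (30, 200)},  # 10% vs 15%
    'easy_dept': {'men': (60, 100), 'women': (35, 50)},   # 60% vs 70%
}

def rate(admitted, applied):
    return admitted / applied

# Within each department, women are admitted at a higher rate...
for dept, groups in data.items():
    assert rate(*groups['women']) > rate(*groups['men'])

# ...but aggregated, men are admitted at a higher rate, because women
# applied disproportionately to the hard department.
men_total = [sum(vals) for vals in zip(*(g['men'] for g in data.values()))]
women_total = [sum(vals) for vals in zip(*(g['women'] for g in data.values()))]
print(rate(*men_total), rate(*women_total))
```

The aggregate comparison hides the third variable (which department was applied to), exactly as in the Berkeley table above.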

This is exactly what happened in the Van Bavel et al. (2016) paper: the original significant finding (r = −0.23, p = 0.02) disappeared once you controlled for the hidden variable of sub-discipline.
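The same reversal can occur with correlations, which is the form it takes in the Van Bavel et al. case. Here is a sketch with invented numbers (not the paper’s data): within each sub-discipline the correlation between context sensitivity and replication success is exactly zero, yet pooled it is strongly negative, purely because the two groups sit at different means:

```python
# Hypothetical scores: context sensitivity (x) and replication success (y)
# for studies from two sub-disciplines. All numbers invented to illustrate
# the statistical point; not taken from Van Bavel et al. (2016).
cog_x, cog_y = [1, 2, 3], [5, 6, 5]   # "cognitive": within-group r = 0
soc_x, soc_y = [7, 8, 9], [1, 2, 1]   # "social": within-group r = 0

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Pooled across sub-disciplines the correlation is strongly negative,
# even though it is zero within each sub-discipline.
pooled = pearson_r(cog_x + soc_x, cog_y + soc_y)
print(pooled)  # strongly negative
```

This is why testing within groups (or properly controlling for the grouping variable) can make a pooled correlation vanish.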

So what does all this mean?

The purpose of this post isn’t to show that context sensitivity doesn’t have an impact on the RP:P (it almost certainly did and it will have an impact on other research). But it does show that the Van Bavel paper doesn’t tell us how much of an impact this variable has on the RP:P and that we need to be more precise in our language. Unless we are explicit in what we mean by “context sensitivity” and predict what effect it will have before the experiment (and in which direction), it will remain post-hoc hand-waving which doesn’t advance science.


Anderson, C.J.; Bahník, S.; Barnett-Cowan, M.; Bosco, F.A.; Chandler, J.; Chartier, C.R.; Cheung, F.; Christopherson, C.D.; Cordes, A.; Cremata, E.J.; Penna, N.D.; Estel, V.; Fedor, A.; Fitneva, S.A.; Frank, M.C.; Grange, J.A.; Hartshorne, J.K.; Hasselman, F.; Henninger, F.; Hulst, M. v.d.; Jonas, K.J. Lai, C.K.; Levitan, C.A.; Miller, J.K.; Moore, K.S.; Meixner, J.M.; Munafò, M.R.; Neijenhuijs, K.I.; Nilsonne, G.; Nosek, B.A.; Plessow, F.; Prenoveau, J.M.; Ricker, A.S.; Schmidt, K.; Spies, J.R.; Stieger, S.; Strohminger, N.; Sullivan, G.B.; van Aert, R.C.M.; van Assen, M.A.L.M.; Vanpaemel, W.; Vianello, M.; Voracek, M.; Zuni, K. Response to Comment on “Estimating the reproducibility of psychological science”. Science, 351 (6277), 1037.

Berger, D. (2016). There’s no Such Thing as “Context Sensitivity”. [online] Available at: https://www.dropbox.com/home/Public?preview=Berger–no-context.pdf

Bickel, P.J.; Hammel, E.A.; & O’Connell, J.W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187 (4175), 398–404.

Etz, A. & Vandekerckhove J. (2016) A Bayesian Perspective on the Reproducibility Project: Psychology. PLoS ONE, 11(2): e0149794. doi:10.1371/journal.pone.0149794

Fiske, S.T.; Gilbert, D.T.; & Lindzey, G. (2010). Handbook of Social Psychology (John Wiley & Sons, Hoboken, NJ).

Gilbert, D.T; King, G.; Pettigrew, S.; & Wilson, T.D. (2016). Comment on “Estimating the reproducibility of psychological science”. Science, 351 (6277), 1037.

Gilbert, D.T; King, G.; Pettigrew, S.; & Wilson, T.D. (2016). More on “Estimating the Reproducibility of Psychological Science”. [online] Available at: http://projects.iq.harvard.edu/files/psychology-replications/files/gkpw_post_publication_response.pdf?m=1457373897

Hagger, M.S.; Chatzisarantis, N.L.D.; H.J.E.M., Alberts; & Zwienenberg, M. (2016). A multi-lab pre-registered replication of the ego-depletion effect. Perspectives on Psychological Science. [online] Available at: http://www.psychologicalscience.org/redesign/wp-content/uploads/2016/03/Sripada_Ego_RRR_Hagger_FINAL_MANUSCRIPT_Mar19_2016-002.pdf

Inbar, Yoel. (in press). The association between “contextual dependence” and replicability in psychology may be spurious. Proceedings of the National Academy of Sciences of the United States of America.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349 (6251): aac4716. doi: 10.1126/science.aac4716. pmid:26315443

Ranehill, E.; Dreber, A.; Johannesson, M.; Leiberg, S.; Sul, S.; & Weber, R.A. (2015). Assessing the robustness of power posing: no effect on hormones and risk tolerance in a large sample of men and women. Psychological Science, 26 (5), 653–656.

Sample, I. (2015). Study delivers bleak verdict on validity of psychology experiment results. Guardian. [online] Available at: https://www.theguardian.com/science/2015/aug/27/study-delivers-bleak-verdict-on-validity-of-psychology-experiment-results

Simonsohn, U. (2016). Evaluating Replications: 40% Full ≠ 60% Empty. [online] Available at: http://datacolada.org/47

Schimmack, U. (2016). [online] Available at: https://www.facebook.com/photo.php?fbid=10153697776061687&set=gm.1022514931136210&type=3&theater

Tim. Mere Anachrony: The Simpsons Season One [online] Available at: https://npinopunintended.wordpress.com/2009/11/14/mere-anachrony-the-simpsons-season-one/

Touhey JC (1981) Replication failures in personality and social psychology negative findings or mistaken assumptions? Personality and Social Psychological Bulletin 7(4):593–595.

Van Bavel, J. (2016). [online] Available at: https://twitter.com/jayvanbavel/status/737744646311399424

Van Bavel, J.; Mende-Siedleckia, P.; Bradya, W.J.; & Reinero, D.A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences of the United States of America, 113 (23), 6454-6459.

Vankov, I.; Bowers, J.; & Munafò, M.R. (2014). On the persistence of low power in psychological science. The Quarterly Journal of Experimental Psychology, 67 (5), 1037-40.

Wikipedia. (2016). Simpson’s paradox. Available at: https://en.wikipedia.org/wiki/Simpson%27s_paradox#cite_note-freedman-10 [Accessed: 12/07/2016]

Yong, E. (2015). How Reliable Are Psychology Studies? The Atlantic. [online] Available at: http://www.theatlantic.com/science/archive/2015/08/psychology-studies-reliability-reproducability-nosek/402466/

Notes on Paul Meehl’s “Philosophical Psychology Session” #03

These are the notes I made whilst watching the video recording of Paul Meehl’s philosophy of science lectures. This is the third episode (a list of all the videos can be found here). Please note that these posts are not designed to replace or be used instead of the actual videos (I highly recommend you watch them). They are to be read alongside them, to help you understand what was said. I also do not include everything that he said (just the main/most complex points).

  • Descriptive discourse: what is.
  • Prescriptive discourse: what should be.
  • Science is descriptive; ethics/law etc. are prescriptive. Philosophy of science (metatheory) is a mixture of both and has to be in order to work properly (which the logical positivists didn’t realise).
  • External history of science (the domain of the non-rational): what effects politics, economics, etc. had on a theory.
  • Internal history of science (the domain of the rational): whether a fact had been over-stated, or how the theory interacted with other facts and theories.
  • Context of discovery: psychological and social questions about the discoverer, e.g. Kekulé’s discovery of the benzene ring. The fact that he “dreamed of the snakes” is irrelevant to the truth of the theory (the justification).
  • Context of justification: evidence, data, statistics.
  • Some say there shouldn't be a distinction, BUT: just because there is twilight doesn't mean that night and noon are not meaningful distinctions.
  • There are grey areas, e.g. a finding that we are hesitant to bring into the corpus.
  • Sometimes we have to take into account the researcher who has produced a finding, e.g. Dayton Miller and the aether.
  • Unknown/unthought-of moderators can have a significant impact. You don't have to be a fraud to not include them in a manuscript.
  • Fraud is worse than an honest mistake because it can obfuscate and mislead: you have a result in front of you, and you need enough failed replications to say "my theory no longer needs to explain this". This is why taking the context of discovery into account is important (even within the context of justification): how close to a person's heart/passion/wallet is this result? These things won't be obvious in the manuscript but can have an impact.
  • 4 examples of context impacting research:
    1. How strongly does someone feel about this result? How much is their wallet being bolstered by this finding?
    2. Literature reviews also need to have the context of discovery considered. The reviewer may not be a fraud; they may be sloppy, or the original paper may be poorly written. Meta-analysis counteracts some of these flaws, with some counterbalancing taking place that is hard to do in your head. Meehl (1954): the psychologist is no better at weighing up beta-weights than the clinician. Can be abused.
    3. The file-drawer effect, BUT also: what kind of research is being funded because it's popular/faddish? Universities get into the habit of having a large pot of government money to fund research. Doing research to get grants can mean a narrowing of research, but research can also be shelved by not being funded because it could turn up unwanted/uncomfortable results (the politics of discovery).
    4. When reading a paper, you don't know how much politics/economics has influenced it, caused it to be researched in the first place, or stopped other (potentially contradicting) research from being conducted. This affects the distribution of investigations. If a certain theory is favoured by the use of questionnaires rather than lab experiments, and the former is used due to convenience, you get a skewed picture.
  • Relying on clinical experience rather than data: clinical judgements made during observation are highly influenced by the clinician's own personal theory (experimenter effects).
  • When the power function is low, a null result doesn't tell you as much as a positive result.
  • Context of discovery is also impacted by context of justification, e.g. knowing logic means you are likely to avoid making a logical fallacy when examining research. Not all impacts will be negative.
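Meehl's point about low power can be made concrete with a quick simulation. This is my own illustrative sketch, not anything from the lecture: the effect size, sample size, and critical value below are assumptions chosen to mimic a typical under-powered psychology study.

```python
# A minimal simulation (not from Meehl's lecture) of why a null result from
# a low-powered study is weak evidence: even with a real effect of d = 0.4,
# studies with n = 20 per group usually fail to reject the null.
import math
import random
import statistics

random.seed(1)

def two_sample_t(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / na + vb / nb)
    return (statistics.mean(a) - statistics.mean(b)) / se

def estimated_power(d=0.4, n=20, sims=2000, crit=2.02):
    """Proportion of simulated studies whose |t| exceeds the (approximate)
    two-sided 5% critical value for ~38 degrees of freedom."""
    hits = 0
    for _ in range(sims):
        a = [random.gauss(d, 1) for _ in range(n)]  # group with true effect
        b = [random.gauss(0, 1) for _ in range(n)]  # control group
        if abs(two_sample_t(a, b)) > crit:
            hits += 1
    return hits / sims

print(estimated_power())  # around 0.2-0.3: most such studies miss a real effect
```

With power this low, roughly three out of four studies of a real effect come up "null", which is why a non-significant result on its own tells you so little.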


  • Scientific realist: there is a world out there that has objective qualities and it is the job of science to work them out.
  • Instrumentalism: the truth of something doesn’t matter if it has utility.
  • But fictions can be useful.
  • B.F. Skinner believed that when we could test mental processes and not just infer them, then it would become apparent which processes map on to which area.
  • 3 main theories of truth: correspondence theory of truth (view of scientific realist, that the truth of a statement is determined by how accurately it corresponds with the real state of affairs), coherence theory (truth consists of the propositions you have hanging together), and instrumental theory (fictionist, truth is what succeeds in predicting or manipulating successfully).
  • Scientific realists admit that instrumental efficacy bears on their truth. Part of the data.
  • Incoherent theory is false by definition, coherent theory can be false.
  • Caesar crossed the Rubicon (for correspondence): only one fact is needed to verify it, whether he crossed or not. Quine corners mark the sentence being mentioned: 'Caesar crossed the Rubicon' (the first half of the sentence, in the meta-language) is true if and only if Caesar crossed the Rubicon (no Quine corners; in the object language).
  • What grounds do we have for believing (epistemological)? What are the conditions for that belief to be correct (ontological; cf. verisimilitude)? The two are equivalent in their content, so if one is true the other is true, and if one is false the other is false.
  • Semantic concept of truth.
  • Knowledge is JUSTIFIED true belief (so stumbling on to a truth by chance is not knowledge).
  • Truth is a predicate of sentences and not things.
  • There was an argument among the logical positivists that the word "truth" should be removed from the empirical sciences, as you can never be totally certain that what you've said is true (i.e. remove it from the meta-language): only those predicates which we can be certain are accurate would be permissible. BUT that means removing pretty much every word in the language (all scientific language and most concrete language).
  • Verisimilitude (similarity to truth) is an ontological rather than an epistemological/evidentiary concept (it cannot be conflated with probability).
  • Scientific theories are collections of sentences and as such can have degrees of truth.


Yonce, J. L. (2016). Philosophical Psychology Seminar (1989) Videos & Audio. [online] (Last updated: 05/25/2016) Available at: http://meehl.umn.edu/video [Accessed: 06/06/2016]