The replication crisis, context sensitivity, and the Simpson’s (Paradox)

The Reproducibility Project: Psychology:

The Reproducibility Project: Psychology (OSC, 2015) was a huge effort by many psychologists across the world to assess whether the effects reported in a selection of papers could be replicated. This was in response to growing concern about the (lack of) reproducibility of many psychological findings, with some high-profile failed replications being reported (Hagger & Chatzisarantis, 2016 for ego-depletion and Ranehill, Dreber, Johannesson, Leiberg, Sul, & Weber, 2015 for power-posing). The project reported that of the 100 replication attempts, only ~35 were successful. This provoked a strong reaction not only in the psychological literature but also in the popular press, with many news outlets reporting on it.

But it wasn’t without its critics: Gilbert, King, Pettigrew, & Wilson (2016) examined the RP:P’s data using confidence intervals and came to a different conclusion. They were looking for “whether the point estimate of the replication effect size [fell] within the confidence interval of the original” (Srivastava, 2016). Some of the authors from the RP:P responded (Anderson et al., 2016) by pointing out some of the errors in the Gilbert et al. paper. Another analysis was provided by Sanjay Srivastava (2016), who highlighted that whilst Gilbert et al. use confidence intervals, they define them incorrectly (calling into question any conclusions they draw). Gilbert et al. (2016) responded by reaffirming that the RP:P authors “violated the basic rules of sampling when [they] selected studies to replicate” and that many of the replications were unfaithful to the original study (which is a valid criticism and is related to the idea of context sensitivity, which is the focus of this post). Etz & Vandekerckhove (2016) reanalysed 72 of the original studies’ data using Bayesian statistics and found that 64% of those 72 studies (both originals and replications) did not provide strong evidence for either the null or the alternative hypothesis. Simonsohn (2016) argued that rather than ~65% of the replications failing, ~30% failed to replicate and ~30% were inconclusive.

But there is one response and one explanation for the low replication rate I want to focus on: the context sensitivity of an experiment.

Location, location, location:

Context sensitivity is the idea that where you conduct an experiment has a large impact on it. It is a type of hidden moderator as it is a variable that affects the experiment that usually isn’t being directly manipulated or controlled by the researcher. The environment in which you perform the study plays a role in the result and should be considered when conducting a replication. It is argued that you cannot detach the “experimental manipulations… from the cultural and historical contexts that define their meanings” (Touhey, 1981). The context of an experiment is very important in social psychology and it has been studied for years, with evidence that it does shape people’s behaviour (for one of many examples, you can look at Fiske, Gilbert, & Lindzey, 2010).

Van Bavel, Mende-Siedlecki, Brady, & Reinero (2016) argue that context sensitivity partly explains the poor replication rate of the RP:P. They found that the context sensitivity of a study (rated by 3 students with high inter-rater reliability) had a statistically significant negative correlation with the success of the replication attempt (r = −0.23, p = 0.02). This means the more contextually sensitive the finding was, the less likely it was to replicate. Context sensitivity was still significantly associated with replication success after controlling for the sample size of the original study (which has been suggested to have a significant impact on the success of a replication; Vankov, Bowers, & Munafò, 2014). It was not the best predictor of replication success though: the strongest predictors were the statistical power of the replication and how surprising the original finding was. They also analysed the data to see whether the sub-discipline of psychology the original study was taken from (either social or cognitive psychology) moderated the relationship between contextual sensitivity and replication success. They did not find a significant interaction (this last point is very important but I’m going to examine it in more detail further on).

So this study appears to show that contextual differences had a significant impact on replication rates and that it should be taken into account when considering the results of the RP:P.

There’s no such thing as…

One of the responses to the paper was by Berger (2016). He stated that “context sensitivity” is too vague a concept to be of any scientific use. There are an enormous number of ways that “context” could impact on a finding, and to present it as a uni-dimensional construct (as was done in Van Bavel et al., 2016) is illogical. Context sensitivity can therefore be used to justify any unexplained variance in psychological results. He calls for a more rigorous and falsifiable definition of context sensitivity (namely lack of theory specificity and heterogeneity) and for researchers to be specific when it comes to the source of the problems, e.g. is it variation in the population, location, time-period, etc. He also argues that researchers should predict the heterogeneity and effect directions a priori so we can scientifically evaluate the effect of these hidden moderators.

The hidden variable:

Another problem with the paper was highlighted by Schimmack (2016) and Inbar (in press). When you run the analyses again and properly control for sub-discipline (rather than test for the interaction as was originally done), the significant result Van Bavel et al. found disappears (from p=0.02 to p=0.51). They also calculated the correlation within groups (so the correlation between context sensitivity and replication success for cognitive psychology studies and for social psychology studies) and again found non-significant results (r = -.04, p = .79 and r = -.08, p = .54 respectively). This suggests context sensitivity only has a significant impact on replication rates when you don’t control for sub-discipline (so some disciplines of psychology are more likely to replicate than others). Van Bavel has replied to this by arguing you can’t control for sub-discipline as it is “part of the construct of interest” (Van Bavel, 2016).

Simpson’s Paradox:

So how does Simpson’s Paradox fit into all this? (Not those Simpsons unfortunately, Edward H. Simpson). Well, this is a perfect example of Simpson’s paradox: where a trend is found when groups are combined but disappears or reverses when they are examined separately. The classic example comes from Bickel, Hammel, & O’Connell (1975). They examined the admission rates for graduate school at the University of California, Berkeley for 1973. The aggregate figures appeared to show a gender bias towards men: 44% of male applicants were admitted whereas only 35% of female applicants were.

But when you examine all of the departments individually they show that 6 of them admitted more women than men (and 4 admitted more men than women). When analysed, this preference for females was shown to be statistically significant. So how does this work? It’s because of a third variable: rate of admission within the department. As stated in the article: “The proportion of women applicants tends to be high in departments that are hard to get into and low in those that are easy to get into.”

Table showing the 6 most applied-to departments
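To see the reversal in numbers, here is a minimal sketch using the widely reproduced six-department tabulation of the 1973 data (the same figures shipped as R’s UCBAdmissions dataset; they are an assumption of this sketch, not copied from the post). Because it only covers the six largest departments, the pooled rates it gives (roughly 45% for men and 30% for women) differ from the campus-wide 44% and 35% quoted above. The function names are my own.

```python
# (applicants, admitted) for the six largest departments.
# Assumption: the commonly reproduced six-department tabulation of the 1973 data.
DEPARTMENTS = {
    "A": {"men": (825, 512), "women": (108, 89)},
    "B": {"men": (560, 353), "women": (25, 17)},
    "C": {"men": (325, 120), "women": (593, 202)},
    "D": {"men": (417, 138), "women": (375, 131)},
    "E": {"men": (191, 53), "women": (393, 94)},
    "F": {"men": (373, 22), "women": (341, 24)},
}


def admission_rate(applicants, admitted):
    """Proportion of applicants who were admitted."""
    return admitted / applicants


def pooled_rate(sex):
    """Admission rate for one sex with all six departments lumped together."""
    applicants = sum(dept[sex][0] for dept in DEPARTMENTS.values())
    admitted = sum(dept[sex][1] for dept in DEPARTMENTS.values())
    return admitted / applicants


def departments_favouring(sex, other):
    """Departments where `sex` has a higher admission rate than `other`."""
    return [
        name
        for name, dept in DEPARTMENTS.items()
        if admission_rate(*dept[sex]) > admission_rate(*dept[other])
    ]


if __name__ == "__main__":
    print(f"Pooled: men {pooled_rate('men'):.1%}, women {pooled_rate('women'):.1%}")
    for name, dept in DEPARTMENTS.items():
        men = admission_rate(*dept["men"])
        women = admission_rate(*dept["women"])
        print(f"Dept {name}: men {men:.1%}, women {women:.1%}")
    print("Higher rate for women in:", departments_favouring("women", "men"))
```

In this subset, women have the higher admission rate in four of the six departments even though the pooled rate favours men, because most women applied to the departments with the lowest admission rates.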

This is exactly the same thing that happened in the Van Bavel et al. (2016) paper: the original significant finding (r = −0.23, p = 0.02) disappeared once the hidden variable of sub-discipline was controlled for.
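The same pattern can be sketched for a correlation rather than a proportion. The numbers below are invented toy data chosen to make the arithmetic obvious; they are not the RP:P data, and the variable names are hypothetical. Within each made-up sub-discipline, context sensitivity is unrelated to replication success, yet pooling the two groups produces a negative correlation simply because one group is both more context sensitive and less likely to replicate.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data (NOT the RP:P data).
# sensitivity: rated context sensitivity; replicated: 1 = replicated, 0 = did not.
cognitive = {"sensitivity": [1, 2, 3, 2], "replicated": [1, 0, 1, 1]}
social = {"sensitivity": [4, 5, 6, 5], "replicated": [0, 1, 0, 0]}

r_cognitive = pearson_r(cognitive["sensitivity"], cognitive["replicated"])
r_social = pearson_r(social["sensitivity"], social["replicated"])
r_pooled = pearson_r(
    cognitive["sensitivity"] + social["sensitivity"],
    cognitive["replicated"] + social["replicated"],
)

if __name__ == "__main__":
    print(f"cognitive only: r = {r_cognitive:.2f}")
    print(f"social only:    r = {r_social:.2f}")
    print(f"pooled:         r = {r_pooled:.2f}")
```

Both within-group correlations come out at zero while the pooled correlation is about −0.45, which is the shape of the result Schimmack and Inbar describe, only exaggerated.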

So what does all this mean?

The purpose of this post isn’t to show that context sensitivity doesn’t have an impact on the RP:P (it almost certainly did and it will have an impact on other research). But it does show that the Van Bavel paper doesn’t tell us how much of an impact this variable has on the RP:P and that we need to be more precise in our language. Unless we are explicit in what we mean by “context sensitivity” and predict what effect it will have before the experiment (and in which direction), it will remain post-hoc hand-waving which doesn’t advance science.


Notes on Paul Meehl’s “Philosophical Psychology Session” #03

These are the notes I made whilst watching the video recording of Paul Meehl’s philosophy of science lectures. This is the third episode (a list of all the videos can be found here). Please note that these posts are not designed to replace or be used instead of the actual videos (I highly recommend you watch them). They are to be read alongside to help you understand what was said. I also do not include everything that he said (just the main/most complex points).

  • Descriptive discourse: what is.
  • Prescriptive discourse: what should be.
  • Science is descriptive; ethics, law etc. are prescriptive. Philosophy of science (metatheory) is a mixture of both and has to be so in order to properly work (which the logical positivists didn’t realise).
  • External history of science (domain of the non-rational). What effects politics, economics etc. had on a theory.
  • Internal history of science (domain of the rational). Whether a fact had been over-stated or how the theory interacted with other facts and theories.
  • Context of discovery: psychological and social questions about the discoverer, e.g. Kekulé’s discovery of the benzene ring. The fact he “dreamed of the snakes” is irrelevant to the truth of the story (the justification).
  • Context of justification: evidence, data, statistics.
  • Some say there shouldn’t be a distinction, BUT: Just because there is twilight, doesn’t mean that night and noon are not meaningful distinctions.
  • There are grey areas, e.g. a finding that we are hesitant to bring into the corpus.
  • Sometimes we have to take into account the researcher who has produced this finding, e.g. Dayton Miller and the aether.
  • Unknown/unthought-of moderators can have a significant impact. You don’t have to be a fraud to leave one out of a manuscript.
  • Fraud is worse than an honest mistake because it can obfuscate and mislead as you have something in front of you. You need enough failed replications to say “my theory no longer needs to explain this”. But this is why taking into account context of discovery is important (even when in context of justification); how close to a person’s heart/passion/wallet is this result? These things won’t be obvious in the manuscript but can have an impact.
  • 4 examples of context impacting research:
    How strongly does someone feel about this result? How much is their wallet being bolstered by this finding?
    Literature reviews also need to have the context of discovery considered. Reviewer may not be a fraud, may be sloppy, original paper may be poorly written. Meta-analysis counter-acts some of these flaws with some counterbalancing taking place that’s hard to do in your head. Meehl 1954 (psychologist is no better at weighing up beta-weights than the clinician). Can be abused.
    File-drawer effect BUT also what kind of research is being funded because it’s popular/faddish? University gets in habit of having large pot of money from government to fund research. Doing research to get grants can mean a narrowing of research but also some research can be shelved by not being funded because it could turn up unwanted/uncomfortable results: the politics of discovery.
    When reading a paper, you don’t know how much politics/economics has influenced it/caused it to be researched in the first place and stopped other (potentially contradicting) research being conducted. Affects distributions of investigations. If a certain theory is favoured by the use of questionnaires rather than lab experiments and the former is used due to convenience, skewed picture.
    Relying on clinical experience rather than data, their clinical judgements made during observation are highly influenced by their own personal theory (experimenter effects).
    If the power function is low, a null result doesn’t tell you as much as a positive result.
    Context of discovery is also impacted by context of justification e.g. Knowing logic means you are likely to avoid making a logical fallacy when examining research. Not all impacts will be negative.
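
The point about low power can be made concrete with Bayes’ rule. This is my own illustration rather than something from the lecture: the function name, the 50% prior and the 5% alpha are all assumptions picked for the example. It asks how likely a real effect still is after a non-significant result.

```python
def prob_effect_given_null(power, prior=0.5, alpha=0.05):
    """P(effect is real | non-significant result), by Bayes' rule.

    power: chance of a significant result when the effect is real.
    prior: assumed probability that the effect is real (illustrative).
    alpha: chance of a significant result when there is no effect (illustrative).
    """
    missed_effect = (1 - power) * prior        # real effect, test missed it
    true_negative = (1 - alpha) * (1 - prior)  # no effect, test correctly silent
    return missed_effect / (missed_effect + true_negative)


if __name__ == "__main__":
    for power in (0.2, 0.5, 0.8, 0.95):
        posterior = prob_effect_given_null(power)
        print(f"power = {power:.2f}: P(effect | null result) = {posterior:.2f}")
```

Under these assumptions, a null result from a study with 20% power leaves the chance of a real effect at about 46% (barely below the 50% prior), whereas a null from a study with 95% power drops it to about 5%.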


  • Scientific realist: there is a world out there that has objective qualities and it is the job of science to work them out.
  • Instrumentalism: the truth of something doesn’t matter if it has utility.
  • But fictions can be useful.
  • B.F. Skinner believed that when we could test mental processes and not just infer them, then it would become apparent which processes map on to which area.
  • 3 main theories of truth: correspondence theory of truth (view of scientific realist, that the truth of a statement is determined by how accurately it corresponds with the real state of affairs), coherence theory (truth consists of the propositions you have hanging together), and instrumental theory (fictionist, truth is what succeeds in predicting or manipulating successfully).
  • Scientific realists admit that instrumental efficacy bears on a theory’s truth: it is part of the data.
  • Incoherent theory is false by definition, coherent theory can be false.
  • Caesar crossed the Rubicon (for correspondence): only 1 fact needed to verify; whether he crossed or not. Quine corners denote the subject of the sentence e.g. ‘Caesar crossed the Rubicon’ (first half of the sentence is meta-language) is true if and only if Caesar crossed the Rubicon (no quine corners + in the object language).
  • What grounds do we have for believing (epistemological)? *verisimilitude* What are the conditions for that belief to be correct (ontological)? Equivalent in their content, so if one is true then the other is true/if one is false then the other is false.
  • Semantic concept of truth.
  • Knowledge is JUSTIFIED true belief (so stumbling on to a truth by chance is not knowledge).
  • Truth is a predicate of sentences and not things.
  • Argument among logical positivists that they should remove the use of the word truth for empirical sciences as you can never be totally certain that what you’ve said is true (remove from meta-language). Only those predicates which we can be certain are accurate are permissible BUT that means you remove pretty much every word in language (all scientific language and most concrete language).
  • Verisimilitude (similarity to truth) is an ontological rather than epistemological/evidentiary concept (cannot be conflated with probability).
  • Scientific theories are collections of sentences and as such can have degrees of truth.


