Reliability in psychological science: methodology in crisis?


“Scientific truth is a moving target,” wrote the editors of the Public Library of Science (PLoS) a decade ago. “But is it inevitable, as John Ioannidis argues…that the majority of findings are actually false?” In the decade since the editors posed this question, the psychological sciences have been shaken by further challenges to their credibility, including some widely reported controversies. It was August of this year, however, that brought the most significant shock waves, when the Reproducibility Project of the Open Science Collaboration announced its conclusions – that most of the articles published in leading psychological journals were unreliable. Most! This crisis in knowledge – in both its nature and its interpretations – is acutely relevant to us as teachers of Theory of Knowledge, aiming as we do to treat the human sciences with contemporary understanding.

What’s the relevance for TOK?

So, first, what’s the problem? A quick refresher here – just to put it all in TOK context!  It comes down to the reliability of the methodology of an area of knowledge, on which the justification for its knowledge claims rests.  If the methods of gaining knowledge are faulty in the psychological sciences, why should we trust any of the knowledge claims of its results?  There are good reasons for methodology to loom large in our TOK knowledge framework!

Central to the methodology of the sciences is reproducibility, the idea that single experiments have to be able to be reproduced before their results are treated as trustworthy:

  1. A research group should test its hypotheses by repeated experiments to weed out errors (“falsifying” and discarding hypotheses that are wrong) and gather a significant degree of confirmation (with statistically measured significance) before it treats any results as ready to submit for publication as shared knowledge; and
  2. Any other scientific group, within the public and collaborative process of science, should be able to replicate the experiment and reach the same results. If not, the procedure, the measurement, and the interpretation demand further scrutiny – and the results stand to be discarded.

This is how science is supposed to work: the knowledge claims from experiments are always open to further testing and alternative interpretations, always open to change if contrary data is found. But what if the findings in articles published in leading journals, after peer review, fail replication the majority of the time (#2 above)?

What is the Reproducibility Project, and what did it find?

The Reproducibility Project: Psychology, coordinated by the Center for Open Science, was an attempt to estimate the reproducibility of published results – that is, to find out just how reliable, on testing, were the shared findings of psychological experiments. The project was launched in November 2011 and published its conclusions in August 2015. In repeating the experiments of 100 studies published in highly ranked journals, a collaboration of 270 scientists following a rigorous procedure found that only 36% of the replications gave the statistically significant results that the original researchers had published. This Reproducibility Project was the “first in-depth explorations of its kind”. (Open Science Collaboration)

Project leader Brian Nosek believes “that other scientific fields are likely to have much in common with psychology.”  A reproducibility project in biology, in the study of cancer, is currently underway.

I recommend the following two articles for a good explanation of the project:

What does the failure rate mean for knowledge in the field?

The failure rate could be even higher across the field as a whole, some have argued, given that only highly respected journals were used as sources for the articles to be tested. After all, prestigious journals are believed to attract stronger work. But, even taking the 36% success rate at face value, what does a failure rate of nearly two-thirds mean for knowledge in psychology?

First, what does it NOT mean? For one thing, it doesn’t mean that the original articles were wrong: the issue tested in this research was not the truth or falsity of the knowledge claims but the methodology itself of reaching them. The point is that re-running the experiment achieved different results, regardless of whether the original turns out to be wrong, or the replication, or both. The essential problem is that published results have been presumed to be extensively tested and reliable – and it turns out they’re not.

Nevertheless, there could be explanations for some of the replication failure. Possibly, there were subtle variations in methodology by the replicating scientists – raising the need for further testing. Moreover, the fact that social psychology fared worse than cognitive psychology may itself have an explanation: the social interactions studied may have changed over time, so that the replication was not studying the same thing.

Indeed, the problem may be exaggerated: it may be clear that a replication rate of 36% is a problem, but it is much less clear what an acceptable replication rate would be, given that journals are publishing the most recent findings in a field.

In addition, it could be argued that the degree to which findings can be replicated is not the best measure of reliability in any case. Scientists Stroebe and Hewstone are critical of the Reproducibility Project – critical, that is, of the methodology of the critique of methodology – and comment that meta-analysis is more reliable than replication as a way of evaluating research:

“Reporting the percentage of successful replications is not very informative. More usefully, the project could have identified aspects of studies that predicted replication failure. But here the report disappoints. Since meta-analysis permits us to evaluate the validity of research without the need to collect new data, one can question whether the meagre results of this project justify the time investment of 270 researchers and thousands of undergraduate research participants.”

Still, however one mitigates the problem or proposes better ways of approaching it, it does not go away: the psychological sciences have considered their published results to be reproducible, and it seems that for the most part they are not.

Does the failure lie with scientists publishing their findings without testing them sufficiently? Does it lie with the peer review process of journals, where some articles are accepted for publication and some rejected? Does it lie with a flawed understanding of reproducibility itself? Or is psychology – as suspected by some in the “harder” sciences – not scientific at all?

Publication biases: How is the construction of knowledge in psychology affected by its context?

And that brings us to what may be the most important revelation following from the Reproducibility Project: the impact on scientific results of the social context of research, especially in the importance of publication.

Publication bias leans heavily toward scientists publishing only positive results, which confirm a hypothesis, and not negative results, which refute it and, for the immediate present, lead nowhere. Even though science works by systematically eliminating errors, who wants to publish, or read about, all the wrong guesses? Journals certainly aren’t looking for them!
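
To see why a filter for positive results matters, here is a toy simulation in Python. It is only a sketch: the share of true hypotheses, the effect size, and the sample size are numbers I have invented for illustration, not estimates taken from the Reproducibility Project. It publishes only the studies that reach statistical significance, reruns each published study once, and counts how often the rerun is also significant.

```python
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def significant(rng, true_effect, n_per_group, alpha=0.05):
    """Simulate one two-group study and report whether it reaches p < alpha.

    Simplification: the z-score is drawn directly, with a mean equal to the
    true effect (in standard-deviation units) scaled by the sample size.
    """
    expected_z = true_effect * (n_per_group / 2) ** 0.5
    z = rng.gauss(expected_z, 1)
    p = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    return p < alpha

def replication_rate(n_hypotheses=20000, share_true=0.3,
                     true_effect=0.4, n_per_group=30, seed=2):
    """Share of published (significant) studies whose rerun is also significant.

    All default values are hypothetical, chosen only for illustration.
    A rerun counts as a success if it is significant in either direction.
    """
    rng = random.Random(seed)
    published = replicated = 0
    for _ in range(n_hypotheses):
        # Most tested hypotheses are assumed false (effect of zero).
        effect = true_effect if rng.random() < share_true else 0.0
        if significant(rng, effect, n_per_group):  # only positive results get published
            published += 1
            replicated += significant(rng, effect, n_per_group)  # independent rerun
    return replicated / published

rate = replication_rate()
print(f"share of published findings that replicate: {rate:.2f}")
```

Under these invented assumptions, well under half of the published findings replicate, even though nobody in the simulation manipulates anything. Small studies plus a publication filter are enough to produce that result. Changing `n_per_group` or `share_true` shows how sensitive the figure is to those choices.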

Finding positive results, though, requires scientists to interpret their results to find correlations that are statistically significant – and offers the temptation of manipulating data to MAKE it significant. The significance of results has commonly been measured by their “p-value”. The manipulation of data to make results seem more significant – possibly deliberately, but possibly unconsciously (confirmation bias!) – is called “p-hacking”. This is the very kind of bias that replication is designed to correct: bias on the part of individual scientists or groups is tested and corrected by others! (For a further explanation of p-hacking, with interactive examples to try out, I recommend an article by Christie Aschwanden: “Science Isn’t Broken”.)
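
One common form of p-hacking can be sketched in a few lines of Python. In this hypothetical setup (the group size, the ten outcome measures, and the simple z-test are all my own assumptions), there is no real effect at all. An “honest” researcher tests one outcome chosen in advance, while a p-hacking researcher measures ten outcomes and reports whichever one came out best.

```python
import random
from statistics import NormalDist, mean

STANDARD_NORMAL = NormalDist()

def p_value(group_a, group_b):
    """Two-sided z-test for a difference in means, assuming unit variance in both groups."""
    n_a, n_b = len(group_a), len(group_b)
    z = (mean(group_a) - mean(group_b)) / ((1 / n_a + 1 / n_b) ** 0.5)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def simulate_study(rng, n_per_group=20, n_outcomes=10):
    """Return one p-value per outcome measure when there is NO true effect."""
    p_values = []
    for _ in range(n_outcomes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        p_values.append(p_value(a, b))
    return p_values

def false_positive_rates(n_studies=2000, alpha=0.05, seed=1):
    """Compare a pre-specified outcome with picking the best of several."""
    rng = random.Random(seed)
    honest = hacked = 0
    for _ in range(n_studies):
        ps = simulate_study(rng)
        honest += ps[0] < alpha    # only the outcome chosen in advance
        hacked += min(ps) < alpha  # report whichever outcome "worked"
    return honest / n_studies, hacked / n_studies

honest_rate, hacked_rate = false_positive_rates()
print(f"pre-specified outcome: {honest_rate:.2f}")
print(f"best of ten outcomes:  {hacked_rate:.2f}")
```

The pre-specified test should produce a false positive about 5% of the time, as the threshold promises. Picking the best of ten independent outcomes should do so roughly 40% of the time, since 1 − 0.95^10 ≈ 0.40. No data are faked, yet a “significant” finding emerges from pure noise.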

Publication bias is also heavily toward innovation – toward fresh and potentially exciting new findings. Who wants to do replication of somebody else’s work to test it when one’s own work is thereby stalled?

John Ioannidis, who flagged the problem a decade ago, insists that the reasons for insufficient replication lie in the working lives of scientists:

“with fierce competition for limited research funds and with millions of researchers struggling to make a living (publish, get grants, get promoted), we are under immense pressure to make ‘significant’, ‘innovative’ discoveries. Many scientific fields are thus being flooded with claimed discoveries that nobody ever retests. Retesting (called replication) is discouraged. In most fields, no funding is given for what is pooh-poohed as me-too efforts. We are forced to hasten from one ‘significant’ paper to the next without ever reassessing our previously claimed successes…..”

The conclusion of the Reproducibility Project acknowledges this discouragement of replication built into working conditions. It ends by emphasizing the challenge of balancing between novelty and replication, when incentives for scientists are all on the side of fresh discovery. “Journal reviewers and editors may dismiss a new test of a published idea as unoriginal”, says the report, but that test plays an important part in the development of science:

The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both.

What do the psychological sciences learn from the Reproducibility Project?

It is clear that this project, initiated and carried out by psychologists, has illuminated their own methodologies and some of their failings. The project’s significance now, though, depends on all the players – from scientists, to their employers, to their publishers, to the media. Cody Christopherson, one of the co-authors of the Reproducibility Project, points out the need for all of them to recognize their roles in the problem:

“To get hired and promoted in academia, you must publish original research, so direct replications are rarer. I hope going forward that the universities and funding agencies responsible for incentivizing this research—and the media outlets covering them—will realize that they’ve been part of the problem, and that devaluing replication in this way has created a less stable literature than we’d like.”

Some specific recommendations to fix the reproducibility problem have been put forward, including these by Stuart Buck before the project reached its conclusion: “Obvious solutions include more research on statistical and behavioral fixes for irreproducibility, activism for policy changes, and demanding more pre-registration and data sharing from grantees.”

And so…what might Theory of Knowledge take from the Reproducibility Project?

For TOK, I think the conclusions of the Reproducibility Project: Psychology demonstrate, once again, that knowledge is human and fallible. But the project and reactions to it also show that a careful methodology can make the knowledge better – and that critical scrutiny of methodology is crucial to improving its reliability. One of the most important things we learn from the sciences is how to know – and, through self-aware criticism, how we can know better.

For TOK, moreover, we might see this project as an example not of failure but of success in the development of knowledge. Taking the long view, we watch the development of whole areas of knowledge, and might recognize this crisis in the psychological sciences as a step in growing self-knowledge, toward increasing reliability.

Indeed, we may want to applaud the Reproducibility Project, and use it in class as an impressive achievement of science. As Jason Mitchell from Harvard declares,

“The work is heroic. The sheer number of people involved and the care with which it was carried out is just astonishing. This is an example of science working as it should in being very self-critical and questioning everything, especially its own assumptions, methods, and findings.”

And, finally, all of us following the way that knowledge is created, critiqued, and communicated – all of us in Theory of Knowledge – might well pause to appreciate the nature of the scientific enterprise. In the words of one commentator, “If we’re going to rely on science as a means for reaching the truth — and it’s still the best tool we have — it’s important that we understand and respect just how difficult it is to get a rigorous result.” Knowledge doesn’t come easy – and its fascination lies in how we seek to achieve it.


Christie Aschwanden, “Science Isn’t Broken: It’s just a hell of a lot harder than we give it credit for.” FiveThirtyEight, August 19, 2015.

Monya Baker, “Over half of psychology studies fail reproducibility test: Largest replication study to date casts doubt on many published positive results.” Nature, 27 August 2015.

Dorothy Bishop, “Psychology research: hopeless case or pioneering field?”, The Guardian, August 27, 2015.

Stuart Buck, “Editorial: Solving Reproducibility”, Science. 26 June 2015: Vol. 348 no. 6242 p. 1403 DOI: 10.1126/science.aac8041

Brian Handwerk, “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”, Smithsonian. August 27, 2015.

John Ioannidis, “Psychology experiments are failing the replication test – for good reason”, The Guardian, August 28, 2015.

Open Science Collaboration, “Estimating the reproducibility of psychological science”, Science 28 August 2015: Vol. 349 no. 6251. DOI:10.1126/science.aac4716.

Wolfgang Stroebe and Miles Hewstone, “What have we learned from the Reproducibility Project?” Times Higher Education, September 17, 2015

Ed Yong, “How Reliable Are Psychology Studies?”, The Atlantic, August 27, 2015.

image: geralt, pixabay, creative commons.