Since the reproducibility crisis in psychology and other sciences started a decade ago, direct replication has gained more and more traction as a valuable epistemic tool. Before the crisis, experiments in psychology were mostly aimed at building upon previous studies instead of verifying them. Researchers generally didn’t see direct replications of experiments as valuable additions to the scientific process, and replications were regularly rejected by journals [see 3 for an example]. This changed when researchers discovered that many attempts to replicate psychological studies failed to reproduce the original statistically significant results. Scientists generally believe that, for a phenomenon to be real, exact replications of an experiment should be able to consistently reproduce it. However, in psychology fewer than half of these replication attempts manage to do so. Due to this lack of reproducibility, researchers started to doubt the existence of once prominent theoretical constructs like social priming and ego depletion. Direct replication became an invaluable tool for verifying study results and psychological concepts, and turned into an important yardstick for the overall health of the discipline.
Besides gauging the scope and severity of the reproducibility crisis, direct replications are also seen as one of its solutions. The lack of reproducible studies is most commonly attributed to questionable research methods, fraud, publication bias, small sample sizes and sloppy statistics. By conducting exact replications of existing studies, reformers hope to increase the statistical power of earlier studies and rid the field of some of its bad apples. Other proposed solutions to the crisis, like promoting open data, increased statistical rigour, larger sample sizes and preregistration of studies, have also been implemented with the aim of improving replication efforts and success. However, by focusing mainly on direct replications and statistics, one might overlook other issues that underlie the crisis. Several philosophers and social scientists have pointed out that the discipline’s lack of validated measures and of consistent theoretical concepts with a clear scope also contributes to the low reproducibility rates. They argue that these conceptual issues cannot be solved by mere methodological improvements, and some even question whether replication is a meaningful concept at all. Should reformers perhaps reconsider the dominant role of direct replications?
A first point of critique is that direct replications cannot possibly exist in practice. Researchers can never recreate the exact same situation as in the original experiment because they need a new sample of participants who have not experienced the experiment before. In addition, practically all replications necessarily take place at a later time than the original experiment. Even within a few months, the norms, values, knowledge, meaning of words and phrases, and other cultural variables in the population could have changed, possibly diluting the results of the ‘direct’ replication. This might seem trivial and well known to researchers, but it is important to consider how these subtle differences can affect replication studies.
Because perfect replications are not realistic, researchers generally replicate only the factors that are relevant to the experimental setup and the theory that is being tested. However, and this is a second critique, which factors are deemed relevant or irrelevant to a specific experiment or theoretical concept depends on the judgement of the individual researcher. According to philosopher Uljana Feest, individual researchers often hold different notions of what theoretical concepts mean, what factors influence them and how they can be measured. These individual judgements are not always explicitly mentioned in psychological studies and can cause problems when researchers replicate an experiment. When a researcher conducts an original experiment, she will use her theoretical understanding of the issue at hand to describe all the factors that seem relevant to her. An experiment that tests the correlation between being outdoors in nature and intelligence, for example, could include the weather, season, and temperature of the test location, as well as a standardized IQ test and the demographics of the participants. However, this description might already harbour many assumptions that other researchers do not share. The researcher might assume that factors like the hour and day of the week, the population density of the test location, and the personal preferences of the participants are not relevant for her theory, and consequently fail to describe them. A second researcher may have completely different assumptions about how to understand and measure intelligence, and about what exactly ‘being outdoors in nature’ constitutes, which could lead to relevant differences in the replication of the original experiment.
A related problem is that many psychological concepts are vague and the methods that ought to measure them are unvalidated. Psychology uses countless tests and other methods to measure concepts like emotions, intelligence, skills, values and personality traits, but before one can use such a measure, one must confirm that the method works as intended. If a researcher wants to study intelligence, for example, he should first clearly define what construct of intelligence he is going to address. Does he want to study emotional intelligence, spatial-temporal intelligence or the capacity for logical reasoning? Then, he has to find or create a measure that has been found to reliably represent the variation in the concept he wants to study – if the concept is quantifiable at all. Only when both the concept and the measure are clearly defined and validated can experiments lead to valid and reliable results.
In reality, however, many studies in psychology seem to lack clear concepts and validated measures. A recent review by Flake and colleagues shows that a significant number of studies in personality science and social psychology do not include evidence on whether their measures actually measure the construct they claim to be studying. According to Flake, these ‘questionable measurement practices’ are common in psychology: it is often not clear what exact constructs are used, how they are measured and why the researcher adopted these specific methods. In recent decades, psychologists have brought more and more concepts and measures into existence, and it has been theorized that this reflects the inability of the discipline to reject invalid measures and concepts.
The validity issues surrounding psychological concepts and measures underscore the gravity of the third critique of direct replications, which is that direct replications don’t inform the researcher about the mechanism that underlies the effect being studied [7 and 8]. Direct replications can be informative on whether a certain experimental finding is reliable, and tell us if a random error occurred in the original experiment, but they do not help develop the concepts or significantly increase the support for the underlying theory. An experiment could be poorly designed, and use vague concepts and invalid measures, but nevertheless succeed in replication. For example, by replicating the study on being outdoors in nature and intelligence, one could discover that there is a positive relation between these concepts, but it remains uncertain which element(s) of the experiment led to this effect. A direct replication might look like the confirmation of an effect, but it obscures that our understanding of the actual concepts and mechanisms at work might still be wrong. Merely improving statistical methodology and increasing the replicability of studies is therefore not sufficient to solve the fundamental issues at hand.
The inability of direct replications to address conceptual and measurement issues is illustrated well by Lurquin & Miyake’s review of the literature on ego-depletion. This theory hypothesizes that individuals have a limited resource of self-control, but researchers haven’t been able to reproduce the original studies that once appeared to demonstrate it. According to the authors, this is not something that direct replication and increased reproducibility alone could solve. They write that the literature on ego-depletion suffers from a ‘conceptual crisis’: the field has no clear definition of self-control or of how it could be measured, and the tests that are used have not been independently confirmed to actually measure self-control.
If direct replications cannot tackle these conceptual and methodological issues underlying the reproducibility crisis, then how should they be addressed? Some authors, like Stroebe & Strack and Crandall & Sherman, argue that we should shift our attention from direct to conceptual replications. In conceptual replications, a specific theory or concept is operationalized using a different measure or a changed experimental setup, which could increase confidence in the hypothesis tested and help establish the generalizability of the theory. However, conceptual replications are also criticized because of their susceptibility to publication bias and questionable research practices – and are therefore linked to the origin of the replication crisis itself [1 and 15]. While failed conceptual replications don't offer researchers much new information, a successful conceptual replication is generally regarded as interesting and publishable because it confirms or extends an existing theory or effect. This difference in how the results of these studies are valued can lead to selective publishing of positive findings and a lack of incentives to critically investigate existing studies and concepts. The resulting proliferation of poorly researched theories and concepts is further amplified by the pressure to publish in academia and by the many researcher degrees of freedom in designing conceptual replications, which make these studies vulnerable to malpractice.
For Feest, the lack of clear concepts and ways to measure them makes it hard to value conceptual replications – regardless of whether they succeed or fail. A conceptual replication might succeed in achieving the same result as the original study, but the researchers wouldn't know what variables or mechanism could have led to this result. Conversely, a conceptual replication might fail because of the changes that the researchers made to the experimental setup, because of problems with the original experiment, or because the effect studied is less generalizable or coherent than previously thought. Instead of replication, Feest argues that researchers should focus – and in practice they often do – on what she calls ‘exploratory work’. By critically probing the concepts and phenomena that are studied, researchers narrow down the exact scope of these concepts and weed out erroneous assumptions.
Feest’s suggestions seem to match the recommendations made in the ego-depletion review. According to its authors, the local crisis surrounding this topic can be tackled if researchers test their concepts and measures extensively, report on them more transparently, and consequently develop consensus on what self-control is and how it can be measured. Although direct replication efforts are still helpful in increasing the reliability of findings, it seems that they cannot solve the reproducibility crisis on their own: conceptual work and the validation of measures might very well play an equally significant role.
[1]: Wiggins, B. J., & Christopherson, C. D. (2019). The replication crisis in psychology: An overview for theoretical and philosophical psychology. Journal of Theoretical and Philosophical Psychology, 39(4), 202.
[2]: Makel, M. C., & Plucker, J. A. (2014). Facts are more important than novelty: Replication in the education sciences. Educational Researcher, 43(6), 304-316.
[3]: Aldhous, P. (2011). Journal rejects studies contradicting precognition. New Scientist.
[4]: Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251).
[5]: O’Donnell, M., Nelson, L. D., Ackermann, E., Aczel, B., Akhtar, A., Aldrovandi, S., ... & Zrubka, M. (2018). Registered replication report: Dijksterhuis and van Knippenberg (1998). Perspectives on Psychological Science, 13(2), 268-294.
[6]: Hagger, M. S., Chatzisarantis, N. L., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., ... & Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546-573.
[7]: Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59-71.
[8]: Feest, U. (2019). Why replication is overrated. Philosophy of Science, 86(5), 895-905.
[9]: Schimmack, U. (2021). The validation crisis in psychology. Meta-Psychology, 5.
[10]: Michell, J. (2000). Normal science, pathological science and psychometrics. Theory & Psychology, 10(5), 639-667.
[11]: Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370-378.
[12]: Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456-465.
[13]: Lurquin, J. H., & Miyake, A. (2017). Challenges to ego-depletion research go beyond the replication crisis: A need for tackling the conceptual crisis. Frontiers in Psychology, 8, 568.
[14]: Crandall, C. S., & Sherman, J. W. (2016). On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology, 66, 93-99.
[15]: Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7(6), 531-536.