From Volume 8, Issue 5 of MASS
Retraction Notice: A Reminder to Keep the Evidence in Evidence-Based Fitness?
by Eric Trexler, Ph.D.
Over 5 million people heard about a study reporting beneficial effects of cold water immersion and cold showers on fat loss and psychological outcomes. That paper has been retracted, but the damage has been done.
Study Reviewed: RETRACTED: Impact of Cold Exposure on Life Satisfaction and Physical Composition of Soldiers. Néma et al. (2024)
In MASS, we typically write about newly published studies hitting the scientific journals. In this article, I’ll be doing just the opposite by writing about an article, previously reviewed in MASS, that was recently retracted (1). I’m covering this retraction for several reasons. First and foremost, the scientific literature should be treated like a living document – it changes over time, and the removal of text can be every bit as important as the addition of text. Second, it invites important reflections on what it actually means to be “evidence-based.” Third, there’s a good lesson to be learned about where you get your information. So, in this article I will first describe the original study and why it was retracted, and then I’ll address each of these points.
The Study
As I noted in a previous Research Brief, Néma and colleagues (1) sought to assess the impacts of regular cold water exposure in a sample of 49 soldiers (25 male, 24 female) between 19-30 years old. Half were randomly assigned to the control condition (no intervention), while the other half were randomly assigned to the cold water intervention. The intervention began with a four-hour educational session about the benefits of cold exposure. To assess the acute impact of cold water immersion on perceived anxiety, participants completed the Zung Self-Rating Anxiety Scale questionnaire before and after the first cold water immersion experience (submerged up to the shoulders for two minutes in 3°C water). During the eight-week trial, they did a combination of outdoor cold water immersion (at least once per week in a lake, reservoir, or running water, with a water temperature of ≤6°C [≤42.8°F] and a duration of ≥30 seconds) and cold showers (at least four times per week, with a water temperature of ≤10°C [≤50°F] and a duration of ≥30 seconds). Before and after the intervention, body composition was measured using a multifrequency bioelectrical impedance device. Participants also completed a life satisfaction questionnaire with several domains (health, job and employment, finance, leisure time, “own person,” sexuality, relationships, and living and housing). Participants were instructed to maintain their normal diet and exercise habits throughout the trial.
For the acute response to the first cold water immersion session, the researchers reported a statistically significant reduction in perceived anxiety within the intervention group (41.1 ± 11.0 to 39.3 ± 11.2 arbitrary units; p = 0.031). Body composition outcomes are presented in Table 1 for males, and Table 2 for females. The cold water exposure group experienced statistically significant within-group reductions in waist circumference and visceral fat area. For all other body composition outcomes, there were no statistically significant differences.


For life satisfaction outcomes, the researchers reported data for eight separate domains, along with an overall composite score. For the intervention group, the only significant within-group change was an increase in the sexuality domain (38.5 ± 5.5 to 40.9 ± 5.6 arbitrary units; p = 0.045). For the control group, no significant changes were observed. So, of 18 total statistical tests for life satisfaction outcomes, only one was significant. Further, its p-value was just barely below the common threshold of 0.05, and it reflected a within-group (not between-group) difference.
When I originally covered this study in MASS, I highlighted a number of substantial limitations, including:
- lack of clarity about how many study conditions actually existed;
- lack of clarity about how participants were filtered out of the study;
- duplicate participants who appear to be in both study groups (or duplicated within the same group);
- lack of clarity about whether or not the researchers were participants in their own study (which was unblinded by default);
- data formatting inconsistencies that would (in theory) likely trip up most statistical software; and
- a statistical approach that was not appropriate for the study design.
I summarized these limitations by stating that “the methods they utilized appear to fall short of widely accepted standards for ‘best practices’ in research, so I can’t put a lot of faith in the findings (which weren’t particularly flashy or positive to begin with).” As a result, I concluded that this study failed to provide convincing evidence in favor of habitual cold water immersion, and that people should move along without allocating too much attention to it. To be abundantly clear, identifying the glaring limitations in this study was not a matter of finding a needle in a haystack – very quickly after it was published, there were credible public critiques from several science communicators (one, two, three).
Retraction Isn’t Enough
The retraction notice for the paper by Néma and colleagues (1) went live on the journal website right around February 1, 2024. In this section, I’ll be referencing some dates, time frames, and cumulative numbers, so it’s important to note that I’m writing this article in early April of 2024. As I write, the journal website indicates that the abstract of this retraction notice has been viewed 683 times, the website version of its full text has been accessed 695 times, and the PDF version of its full text has been downloaded 151 times (note: these numbers will undoubtedly go up, to some extent, by the time this article is published). In comparison, the original article (prior to retraction) has received 55,178 abstract views, 391 full-text views, and 580 PDF downloads (Table 3). These numbers are interesting because they suggest that general content consumers, abstract readers, and full-text readers are fundamentally different groups of people. The vast majority of general content consumers will not check the abstract, and the vast majority of abstract checkers will not access the full text. However, the “power users” who access full texts probably follow the literature very closely, and are probably more likely to become aware of future retractions. I find it absolutely fascinating that the original article received about 8000% more abstract views than the retraction notice, yet only about 15% more full-text views (combining website views and PDF downloads).

More importantly, these numbers make it very clear that retraction simply isn’t enough to mitigate the impact of popularizing retraction-worthy research. Roughly 5 million people heard the conclusions of this study, presented without any substantive caveats, and a very small percentage of them were inclined to skim the abstract. Of the select few who did skim the abstract on the journal website, less than 2% accessed the full text. Of course, these numbers are rough estimates – the viewership numbers don’t account for overlap between tweet viewers and YouTube viewers, people who learned about this study from other outlets, people who viewed the abstract on PubMed instead of the journal site, people who got a downloaded PDF from a peer, and so on. Nonetheless, there’s not enough estimation error in the world to make up for the fact that the conclusions of this paper were disseminated to over 5 million people and less than 0.02% of them actually read the full text or the retraction notice. The result is roughly 5 million misinformed people who will most likely either 1) remain misinformed or 2) experience a sense of disillusionment when they find out that their intention to “follow the science” led them astray.
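To make those percentages concrete, here is a quick back-of-the-envelope sketch in Python (purely illustrative, using only the snapshot counts quoted above, which will have drifted since, and the article’s rough estimate of a 5 million person audience):

```python
# Back-of-the-envelope arithmetic behind the readership comparisons above.
# Counts are the early-April snapshots quoted in the text; live totals will differ.
original = {"abstract": 55_178, "web_full_text": 391, "pdf": 580}
retraction = {"abstract": 683, "web_full_text": 695, "pdf": 151}

# Abstract views: original article vs. retraction notice
abstract_ratio = original["abstract"] / retraction["abstract"]
print(f"Abstract views: {(abstract_ratio - 1) * 100:.0f}% more for the original")  # ~7979%, i.e. roughly 8000%

# Full-text access (website views + PDF downloads combined)
orig_full = original["web_full_text"] + original["pdf"]      # 971
retr_full = retraction["web_full_text"] + retraction["pdf"]  # 846
print(f"Full-text access: {(orig_full / retr_full - 1) * 100:.0f}% more for the original")  # ~15%

# Share of abstract viewers who went on to access the full text
print(f"Abstract-to-full-text conversion: {orig_full / original['abstract'] * 100:.1f}%")  # ~1.8%, i.e. under 2%

# Full-text readers as a share of a ~5 million person audience
print(f"Full-text readers per 5 million exposed: {orig_full / 5_000_000 * 100:.3f}%")  # ~0.019%
```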
Raising Evidence-Based Standards
What exactly does “evidence-based” mean? It may sound like I’m about to venture into a pedantic monologue on semantics, but this is far more meaningful. Defining what constitutes “evidence” extends beyond arguments about vocabulary into the very nature of what we claim to know about the world around us.
In the broadest possible interpretation, just about every perspective is evidence-based. Some people use the term “evidence” in reference to the scientific consensus derived from the totality of scientific literature available on a topic. Others adopt a definition more similar to what we might see in legal proceedings. In a trial, both sides present independent pieces of evidence – any individual piece may be irrelevant, unreliable, or downright fraudulent, but even these shoddy pieces of evidence are entered into the evidentiary record and used to support a claim or position that is framed as pertinent to the case. If you adopt the latter definition of evidence, then virtually all fitness is evidence-based fitness. If you consider anecdotes and personal experiences to be “evidence,” then there is likely to be some amount of evidentiary justification behind virtually every decision or adjustment you make. With this definition of evidence, an evidence-based justification for preferring a particular repetition range could be that your favorite pro bodybuilder said it’s optimal. Unfortunately, this liberal interpretation of the term “evidence-based” presents a glaring issue – if everything is evidence-based, then nothing is evidence-based.
This approach provides no practical framework for comparing strategies or deriving any conclusions more useful than “try it out, hope for the best, and see how it goes.” The entire reason I got interested in a scientific approach to fitness is because I was sick and tired of the endlessly repeating cycle of hype, optimism, experimentation, and disappointment. If we set a low bar for what constitutes “evidence” in evidence-based practice, we’re left with hardly more direction than we started with. Frankly, if you’re going to let anecdotes lead the way, I’d argue that you’re actually better off listening to the “BroScience” anecdotes of people who are accomplishing your fitness-related goals instead of the mechanistic speculations of some random YouTuber or podcaster, regardless of their educational background.
You probably won’t be surprised to hear this, but I prefer to adopt a more stringent interpretation of what constitutes “evidence-based” decision making. I do not deny the fact that anecdotes and personal experience can be informative forms of evidence. However, any claim or strategy framed as “evidence-based” should, in my opinion, maintain a healthy level of respect for the hierarchy of evidence (Figure 1). It would be misguided to completely ignore your own personal experience with a given intervention or strategy, but you should interpret and contextualize your own experiences through the prism of robust literature, including prospective human studies, randomized controlled trials, and meta-analyses. It would also be myopic to completely ignore mechanistic rodent studies, but downright foolish to ignore the large gap (in terms of robustness, generalizability, and overall confidence level) between a mechanistic rodent study and a properly conducted meta-analysis of randomized controlled trials that were designed and executed effectively.

With this in mind, I encourage you to view “evidence-based” as a spectrum rather than a dichotomy. There is minimal value in distinguishing “evidence-based” from “non-evidence-based” – rather, we should evaluate the strength of evidence supporting a particular strategy, claim, or intervention. Expert opinions and anecdotes represent very weak forms of evidence, so “I heard it from a person with a PhD” doesn’t clear the bar. Narrative review papers are basically long and detailed expert opinions, and mechanistic studies (especially those conducted in animal models or isolated cells and tissues) are similarly weak forms of evidence for fitness-related interventions. Observational human studies (such as case-control and prospective cohort studies) are certainly a step up, and randomized controlled trials are the best of the best (along with systematic reviews or meta-analyses of randomized controlled trials). However, this general hierarchy is merely the starting point for assessing the strength of evidence for a particular strategy.
A great study design with poor execution is still low-quality evidence, so it’s important to verify that a study was conducted well. Similarly, it’s critical to verify that any study used as evidence for a claim or strategy is actually relevant to that claim or strategy. For a hypothetical example, a study about the health of people in cold climates would be extremely weak evidence if used to support a claim about ice baths – clearly there’s a difference between spending several hours per day in moderate cold and spending a very brief time in extreme cold. In such a case, the “evidence” is entirely different from the strategy it’s used to prop up. For another hypothetical example, a study about caffeine’s effect on cortisol levels would be extremely weak evidence to support a claim about wakefulness or subjective energy levels. In such a case, there’s a large disconnect between the outcome being used as “evidence” (cortisol) and the outcome of interest (wakefulness). We see less egregious examples all the time in fitness-adjacent research, where proxies like short-term muscle protein synthesis rates are used to make inferences about long-term hypertrophy outcomes.
A strategy with extremely strong evidence would be supported by several well-conducted randomized controlled trials, culminating in a properly executed meta-analysis of their combined findings. You could have a high level of confidence applying such a strategy, but a YouTube channel that restricted itself to talking about these types of strategies would probably be perceived as very boring. Most of the strategies discussed on such a channel would be categorized as “obvious” recommendations that almost everyone already knows, and there would be very limited opportunities for novelty, iteration, or innovation. It would be entirely appropriate to expand beyond this evidence and discuss weaker forms of evidence, with one major caveat: the content should be presented in a manner that reflects a low or moderate level of confidence and informs the viewer when claims are supported by weak, flawed, or incomplete data. When you blur the lines between high-confidence recommendations based on robust science and low-confidence speculation based on weak science, content consumers have no choice but to throw out the metaphorical baby with the bathwater. If I know that a content creator will lead me astray 15% of the time, but I don’t have the expertise to verify which 15% of their content is misleading, I can’t apply their recommendations with confidence. No one will get it right 100% of the time, but as a creator’s accuracy rate gets lower and lower, their word gets harder and harder to lean on. Once it falls below a certain tipping point, their content simply isn’t worth taking seriously without verification from an independent source.
Buyer Beware
As an actively publishing scientist and creator of evidence-based information, I’m not afforded the luxury of staying on the sidelines. When research comes out in my area of focus, I must critically appraise it and formulate a working opinion of it. I will get things wrong sometimes – it’s an inevitable reality for anyone who does this type of work. However, “getting it wrong” can take many different forms. There’s a big difference between confidently making a bogus claim and offering a speculative idea that is appropriately layered with caveats reflecting uncertainty. There’s a big difference between missing obvious errors and limitations in a paper and getting duped (along with countless other scientists) by an elaborate instance of cleverly executed research fraud. Furthermore, the primary goal of a science communicator can’t be to get it right 100% of the time (it simply won’t happen). Rather, a good science communicator takes their duty to educate seriously – when they get it wrong, they take prompt and proactive steps to acknowledge it, update their perspective, and share their corrected perspective as enthusiastically and widely as they shared the prior content containing the misinterpretations or errors.
I’ve never been one to call out specific content creators, for two main reasons: I don’t enjoy the vitriol that ensues, and I don’t think the practice is particularly helpful. If I convince a few people to stop listening to some unreliable influencer with a huge following, their audience isn’t meaningfully impacted, and there are dozens more unreliable influencers ready to pick up the slack. Instead, I encourage content consumers to periodically audit the content creators they follow. For the purportedly evidence-based fitness content creators you follow, consider asking yourself the questions in Table 4.

There is a lot of content out there, and it’s never been easier to generate high volumes of content in relatively short amounts of time. Simultaneously, content creators have caught on to the fact that many people let their guard down and abandon their skepticism when they consume content from someone claiming to be “evidence-based.” As a result, it’s never been harder to separate the wheat from the chaff. When trying to stay up to date with the latest science, the best way to weed out unreliable information is to periodically apply the questions in Table 4 to ensure that you’re curating information from the most reliable sources possible.
References
- Néma J, Zdara J, Lašák P, Bavlovič J, Bureš M, Pejchal J, Schvach H. RETRACTED: Impact of cold exposure on life satisfaction and physical composition of soldiers. BMJ Mil Health. 2024 Jan 29:e002237.
