I’m going to do something I don’t think I’ve done before in my numerous posts (read “exceptionally few, due to laziness and a low attention span”). I’m going to talk about a specific, published, peer-reviewed study, and I’m going to say it was great, beautifully done, the kind of research all scientists should aspire to, and that it correctly identified how to think about almost all of the most important considerations scientists should take into account in their research. In other words, I’m going to positively review a research paper, something I do so rarely it is statistically significant. The paper is on the reasons why statistical significance almost never matters.
Allow me to name the paper I found so delightful: Simmons, Nelson, & Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science, 22(11), 1359–1366.
Here are the highlights: because the most frequent ways scientists test hypotheses (statistically) allow them a great deal of flexibility in how they gather data, which methods of data analysis they use, how they can keep changing their methods when they don’t get the results they want until they do, and so forth, “it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis.” (p. 1359).
How did they show such a seemingly impossible conclusion? They put their money where their mouths were. They conducted two “fake” studies (the studies were real, in that the researchers actually used real participants, real scientific and statistical methods, etc.; they just didn’t care about the hypothesis they tested). Both studies were designed to test the hypothesis that listening to music changes your beliefs about your age. In the first one, participants listened to one of two songs: a “control” song (“Kalimba,” an instrumental song that comes free with the Windows 7 operating system) or the song “Hot Potato” as performed by The Wiggles (whoever they are). The researchers then had participants rate how old they felt, and discovered that those who listened to “Hot Potato” rated themselves as feeling statistically significantly older.
In the second study, participants listened either to “When I’m Sixty-Four” (I’m not going to tell you who that’s by, because you should know) or to the same control song, and then performed an ostensibly unrelated task. In that task, they were required to give their own ages and their fathers’ ages. The researchers found that participants were ~1.5 years younger after listening to The Beatles’ song than after the control.
How did they do this? They manipulated which data they used, how they analyzed it, and how they reported it. Moreover, they did so in ways that researchers are free to do, and DO actually do, all the time.
The researchers also went into some detail about various “researcher degrees of freedom”: the ways in which researchers are able to basically keep performing studies and analyzing data until they get the results they desire. This included the (rather limited) ways they themselves manipulated what they reported having done vs. what they actually did (i.e., their report left out a lot of quite crucial information that would have made their results look as insignificant as we’d expect). They also proposed guidelines to help fix the problem, and responded to potential criticisms (one set of criticisms being that they didn’t go far enough; i.e., that the problem is worse than they made it appear).
Now, it’s kind of hard to criticize the studies’ logic or their particular statistical methods, as the researchers admitted the results were bunk; their whole point was that the methods and analyses were junk. One could look into the study they cited, which found that 70% of behavioral scientists stop collecting data based on “interim data analysis,” i.e., basically whenever they find that the data they have so far can be made to look significant. One could also look into the details they provided about the potential problems with, e.g., freely choosing when to stop data collection on the basis of such interim analyses, by examining the simulations they ran on the likelihood of obtaining false-positive results (i.e., statistical tests revealing “statistical significance” that isn’t there). But the beauty of this study is that, in addition to all the details and analyses they provided to show how the flexibility allotted to researchers allows them to always find their results statistically significant, they actually DID THIS. Twice. For real.
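To see why that kind of “peeking” inflates false positives, here is a minimal simulation of my own (an illustration of the general idea, not the authors’ code; the specific parameters, starting at 20 observations, adding 10 at a time, and giving up at 100, are my assumptions). The null hypothesis is true by construction, so any “significant” result is a false positive; yet an experimenter who tests after every new batch and stops as soon as p < .05 will find an “effect” far more often than 5% of the time.

```python
import math
import random

def two_sided_p(z):
    # Two-sided p-value for a standard normal test statistic.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def run_experiment(rng, start_n=20, step=10, max_n=100, alpha=0.05):
    # The null is true: observations are N(0, 1), so the true mean is 0
    # and any declared "effect" is a false positive.
    data = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        n = len(data)
        z = (sum(data) / n) * math.sqrt(n)  # z-test with known sd = 1
        if two_sided_p(z) < alpha:
            return True   # peeked, saw p < .05, stopped, declared significance
        if n >= max_n:
            return False  # ran out of patience; no effect found
        data.extend(rng.gauss(0, 1) for _ in range(step))

rng = random.Random(1)
trials = 2000
false_positives = sum(run_experiment(rng) for _ in range(trials))
print(f"False-positive rate with peeking: {false_positives / trials:.1%}")
```

With nine chances to stop, the realized error rate lands well above the nominal 5%; the paper’s own simulations show the same qualitative inflation from optional stopping.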
I happen to know that this happens for real (most researchers do not think it is an issue, and don’t think, for example, that changing what they do until they find their results to be significant is bad science; they think of it as “tweaking” their experiment until they get it “right”). I would say that there are ways methods and data can be manipulated that the researchers didn’t include but should have, and that these were not covered even in the section addressing the criticism that they hadn’t gone far enough. But had they covered these ways and shown, with more simulations and studies, what big issues these are, who’s to say their paper would have been published, or that the additional time, effort, and length of the resulting work wouldn’t actually have given it LESS impact, because readers would become lost in the extra detail (or not bother to read it)?
So, this is a statistically significant post about why such significance doesn’t matter. The reason that most studies I review, whether here, for friends and family who ask me about X study, or at the actual request of scientists for official reviews, receive overwhelmingly negative reviews is that:
1) In cases like research reviewed on this blog, I choose bad research to highlight problems.
2) Most of the rest of my reviews are informal, unofficial, and based upon studies that received media attention (i.e., disproportionately the studies making the most extravagant claims). In other words, a very biased sample.
3) I’m something of a perfectionist, and no study can be perfect. When other researchers or students ask for my opinion about this or that paper, I usually have to qualify many criticisms with something to the effect of “this doesn’t really matter”.