We all know that in science a hypothesis is…nothing like what most people understand. In the computational sciences, machine learning, and so on, one develops an algorithm that tests a thousand or more hypotheses just to get a program to recognize the letter “a” when it is written out by hand. Hypothesis testing in the medical, social, cognitive, psychological, and sometimes even natural sciences refers to both an experimental design and a set of statistical methods for interpreting the data these designs produce. OK, that’s more complicated than needed. So take the classic example: a medical trial in which one group gets a placebo and the other the real pill. Hypothesis testing (otherwise known as null hypothesis significance testing, or NHST), both in this example and in general (allowing for variation; the logic is the same), works as follows. Start by assuming that the treatment (the pill) will have no effect, or no “statistically significant” effect. This is called the “null hypothesis,” and it means assuming that whatever differences exist between the two groups are more likely due to chance than to the pill. For example:
1) Create a control group and a test group that are “equivalent” (e.g., if one is testing an antidepressant, then one does not compare the effect of the real pill on depressed people vs. the placebo on people with no history of mental health issues, but rather on another group of depressed people).
2) Test the pill (we’ll call it Euphorian), which is supposed to treat depression, by running a clinical trial in which a group of participants diagnosed with depression and given Euphorian report feeling a little better, in general, than a group of participants given a placebo. Ignore that this difference could arise simply because the people in the placebo group were more severely depressed, or because the treatment group had participants more prone to the placebo effect, or for any number of other reasons. The logic of hypothesis testing is to retain the null hypothesis unless the groups differ so much that the probability the difference is due to chance is tiny (e.g., ~1%). Supposedly, this is akin to scientists assuming they are wrong until the data make that assumption untenable.
3) Run the experiment and analyze the data. If it turns out the differences between the groups are so great that the probability they are due to chance is less than .05, .01, or .001 (standard “alpha levels,” i.e., thresholds at which scientists determine they can “reject the null” and accept the “alternative,” which is code for claiming they’ve shown they’re right), the result is declared “statistically significant.”
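The three steps above can be sketched in code. Everything here is invented for illustration (the “depression improvement scores” are made-up numbers), and I use a permutation test as one simple, assumption-light way to get a p-value; a clinical trial would use more elaborate machinery, but the logic is the same:

```python
import random
import statistics

# Hypothetical improvement scores for a placebo group and a treatment
# ("Euphorian") group. These numbers are invented for illustration only.
placebo = [2.1, 1.8, 0.5, 2.4, 1.2, 0.9, 1.6, 2.0]
treatment = [2.9, 3.1, 1.7, 2.6, 3.4, 2.2, 2.8, 3.0]

observed_diff = statistics.mean(treatment) - statistics.mean(placebo)

# Permutation test: under the null hypothesis the group labels are
# interchangeable, so shuffle the labels many times and count how often
# a difference at least as large as the observed one arises by chance.
random.seed(0)
pooled = placebo + treatment
n_perms = 10_000
n_extreme = 0
for _ in range(n_perms):
    random.shuffle(pooled)
    fake_placebo, fake_treatment = pooled[: len(placebo)], pooled[len(placebo):]
    if statistics.mean(fake_treatment) - statistics.mean(fake_placebo) >= observed_diff:
        n_extreme += 1

p_value = n_extreme / n_perms
alpha = 0.05
print(f"observed difference: {observed_diff:.2f}, p = {p_value:.4f}")
print("reject null" if p_value < alpha else "fail to reject null")
```

Note what the p-value is: the probability of data this extreme *assuming the null hypothesis is true*. It is not the probability that the null (or the treatment hypothesis) is true, which is exactly the confusion the rest of this post is about.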
Sounds great! After all, we can’t really prove things in the sciences, but showing that the chance a hypothesis is wrong is less than 1% is pretty good. The problem is that hypothesis testing doesn’t actually do this. “But you don’t have to take my word for it…” (does that date me?)
“One wishes to know the probability that a biological or medical hypothesis, H, is true in view of the sadly incomplete facts of the world…But the statistical tests used in many sciences (though not much in chemistry or physics) do nothing to aid such judgments. The tests that were regularized or invented in the 1920s by the great statistician and geneticist Ronald A. Fisher (1890-1962) measure the probability that the facts you are examining will occur assuming that the hypothesis is true….The mistake here is known in statistical logic as “the fallacy of the transposed conditional.” If cholera is caused not by polluted drinking water but by bad air, then economically poor areas with rotting garbage and open sewers will have large amounts of cholera. They do. So cholera is caused by bad air. If cholera is caused by person-to-person contagion, then cholera cases will often be neighbors. They are. So cholera is caused by person-to-person contact. Thus Fisherian science.”
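The fallacy of the transposed conditional that McCloskey and Ziliak describe can be made concrete with Bayes’ theorem. The numbers below are invented for illustration: data can be very probable given a hypothesis while the hypothesis remains very improbable given the data, whenever rival explanations predict the same data.

```python
# Toy numbers (invented for illustration) showing that P(data | hypothesis)
# is not P(hypothesis | data).
p_H = 0.01             # prior probability the hypothesis (e.g., "bad air") is true
p_D_given_H = 0.95     # the data are very likely if the hypothesis is true
p_D_given_notH = 0.50  # but the same data are also common if it is false
                       # (a rival theory, e.g., contagion, predicts them too)

# Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)
p_H_given_D = p_D_given_H * p_H / p_D

print(f"P(data | hypothesis) = {p_D_given_H:.2f}")   # 0.95
print(f"P(hypothesis | data) = {p_H_given_D:.3f}")   # ~0.019
```

With these (assumed) numbers, the data are 95% probable under the hypothesis, yet the hypothesis is only about 2% probable given the data. Fisherian tests report the first quantity; researchers routinely interpret them as the second.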
McCloskey, D. N., & Ziliak, S. T. (2009). The Unreasonable Ineffectiveness of Fisherian “Tests” in Biology, and Especially in Medicine. Biological Theory, 4(1), 44.
I like the title of the paper this next quote is taken from even better than the quote itself, but that’s no knock on the quote; it simply reflects the devastating criticism packed into the title:
“In a recent article, Armstrong (2007) points out that, contrary to popular belief, “there is no empirical evidence supporting the use of statistical significance tests. Despite repeated calls for evidence, no one has shown that the applications of tests of statistical significance improve decision making or advance scientific knowledge” (p. 335). He is by no means alone in arguing this. Many prominent researchers have now for decades protested NHST, arguing that it often results in the publication of peer-reviewed and journal-endorsed pseudo-science. Indeed, this history of criticism now extends back more than 90 years…It is certainly a “significant” problem for the social sciences that significance tests do not actually tell researchers what the overwhelming majority of them think they do.”
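One way to see why significance alone doesn’t license the conclusions drawn from it is a back-of-envelope calculation of the kind often used in this literature. All the rates below are assumptions chosen for illustration; the point is only that when most tested hypotheses are false, a sizeable share of “significant” results are false positives even at alpha = .05.

```python
# Back-of-envelope calculation (all rates are assumed, for illustration):
# if only a minority of tested hypotheses are true, what fraction of
# "statistically significant" results reflect a real effect?
n_hypotheses = 1000
base_rate = 0.10   # assume 10% of tested hypotheses are actually true
alpha = 0.05       # false-positive rate when the null is true
power = 0.80       # chance of detecting a real effect when one exists

true_effects = n_hypotheses * base_rate                   # 100 real effects
true_positives = true_effects * power                     # 80 detected
false_positives = (n_hypotheses - true_effects) * alpha   # 45 flukes

ppv = true_positives / (true_positives + false_positives)
print(f"Fraction of significant results that are real: {ppv:.0%}")  # 64%
```

Even with these fairly generous assumptions, roughly one in three significant findings is false, and nothing in the p-value itself tells you which. A p below .05 is simply not the same thing as a 95% chance the hypothesis is right.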
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67-90.