Part I of Part II: Null Hypothesis Significance Testing (NHST)
Many statistical tests, models, and methods aren’t really distinct from the research designs they serve. The most frequently used research design across the sciences involves:
1) Formulating a “testable” hypothesis (e.g., “X drug is effective for Y disease” or “reading blogs causes brain damage”)
2) Assuming that a rival hypothesis, the “null” hypothesis (typically “there is no effect”), is true.
3) Designing a study to look for a statistically significant result, one we would be highly unlikely to find if the “null” hypothesis were true, and treating such a result (in practice) as indicating that the original hypothesis is “probably true” (actually, this isn’t what NHST shows, but it is perhaps the most frequent and least problematic misinterpretation).
To give a concrete example, I’ll use medical studies because the placebo design is widely known: you have a control group which receives a placebo and a treatment group which receives Herpexia. You know that not everybody will react to the treatment (or to the placebo) the same way, but you can use statistical tests to determine whether the differences between the control group and the treatment group are so (statistically) “significant” that the odds that the improvement in the treatment group is due to chance are tiny.
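To make the ritual concrete, here is a minimal sketch of that two-group comparison. The group sizes, means, and effect are invented for illustration, and it assumes Python with NumPy and SciPy installed:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    placebo = rng.normal(loc=50, scale=10, size=100)    # control group outcomes
    herpexia = rng.normal(loc=53, scale=10, size=100)   # treatment group outcomes

    # Null hypothesis: both groups are drawn from populations with the same mean.
    t_stat, p_value = stats.ttest_ind(herpexia, placebo)

    # The conventional (and, as argued below, misleading) decision rule:
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"t = {t_stat:.2f}, p = {p_value:.3f} ({verdict})")

Note what the output actually is: a probability about the data under the null, nothing more. Keep that in mind for what follows.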
This is called “null hypothesis significance testing” (NHST) or significance testing (as well as statistical decision theory and other names). It is used in climate science, neuroscience, medicine, linguistics, business research, evolutionary psychology, education, biology, etc. There’s one tiny problem with it:
“In a recent article, Armstrong (2007) points out that, contrary to popular belief, “there is no empirical evidence supporting the use of statistical significance tests. Despite repeated calls for evidence, no one has shown that the applications of tests of statistical significance improve decision making or advance scientific knowledge” (p. 335). He is by no means alone in arguing this. Many prominent researchers have now for decades protested NHST, arguing that it often results in the publication of peer-reviewed and journal endorsed pseudo-science. Indeed, this history of criticism now extends back more than 90 years” (emphases added)
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67-90.
I’ve relegated a more comprehensive bibliography to the bottom for continuity reasons. There are several reasons why significance testing is so problematic. First, it is found just about everywhere (if particle physicists ever start using it, I’ll give up and join a monastery or cult or something). Second, it lacks empirical or logical support. To give a brief example, imagine we transported scientists back in time to investigate the plague which depopulated much of Europe. Each research team gets a bunch of people (some with the plague, some without) and tests various hypothesized causes of the Black Death. One group finds that its spread is statistically significantly related to ports and ships. Another group finds a statistically significant association between its spread and human contact. Another that it is due to rats. Another that it is caused by fleas. And so on.
The problem here is that a statistically significant association between X and Y is easily and often due to Z. Thus a research team can find that the plague was likely caused by sailors because they don’t bother taking into account the rats on the ships and the fleas that infested these rats.
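A toy simulation makes this concrete. Everything below is invented for illustration (think of Z as flea-infested rats, X as ship traffic, and Y as plague deaths), again assuming Python with NumPy and SciPy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    z = rng.normal(size=500)          # the real cause (flea-infested rats)
    x = z + rng.normal(size=500)      # tracks Z (ships, sailors, ports)
    y = 2 * z + rng.normal(size=500)  # also driven by Z (plague deaths)

    r, p = stats.pearsonr(x, y)
    print(f"X-Y association: r = {r:.2f}, p = {p:.2g}")   # wildly 'significant'

    # Regress Z out of both variables and the "effect" evaporates:
    bx, ax = np.polyfit(z, x, 1)
    by, ay = np.polyfit(z, y, 1)
    r2, p2 = stats.pearsonr(x - (bx * z + ax), y - (by * z + ay))
    print(f"Controlling for Z: r = {r2:.2f}, p = {p2:.2g}")

The first test screams “significant”; once the actual cause is accounted for, the association vanishes.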
Basically, this method never tells us what we want to know. Which brings us to problem 3: because it only tells us (at best) the probability that we’d observe data at least as extreme as ours given that the null hypothesis (the “our real hypothesis is wrong” hypothesis) is true, it tells us essentially nothing. This probability is not logically equivalent to the probability that the null hypothesis is actually true (or false), nor to the probability that the “real” hypothesis is true (or false), nor even to the probability of obtaining precisely the data we did given the null hypothesis. However, that doesn’t stop textbooks and researchers from treating NHST as if it tells us the probability that the null hypothesis is true or false (among other widespread misuses and fallacious inferences). In fact, part of the problem is that even with an ideal, perfect study using NHST in which all assumptions made explicitly or implicitly are satisfied (from sampling methods to assumptions inherent in the chosen statistical method), we can’t learn much of anything from a significant result. So, apparently, the best thing to do is to pretend this method shows something other than what it does.
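To see how far P(data | null) can be from P(null | data), here is a back-of-the-envelope calculation. The base rate, power, and alpha below are invented (though not implausible) numbers:

    # Why p < .05 does not mean P(null is true) < .05.
    p_effect = 0.10   # assumed base rate of real effects among tested hypotheses
    power = 0.80      # assumed P(significant | real effect)
    alpha = 0.05      # P(significant | null is true)

    true_positives = p_effect * power          # real effects correctly flagged
    false_positives = (1 - p_effect) * alpha   # true nulls flagged anyway

    # Among all "significant" results, the fraction where the null was true:
    p_null_given_sig = false_positives / (true_positives + false_positives)
    print(f"P(null true | significant) = {p_null_given_sig:.2f}")   # about 0.36

Under these assumptions, more than a third of “significant” results are cases where the null is true, even though every individual test dutifully used the sacred .05 cutoff.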
Worse still, despite the lengthy criticism and plenty of better approaches, researchers who use better methods find their studies rejected, or at least sent back for revision (basically, they are asked to use NHST). So, for example, a predictive model which tells us what we should expect if the model is right gets canned in favor of worthless probability values.
Finally, the only thing growing faster than the sheer volume of criticisms of this method is its use. Historically, the modern form of NHST emerged from a sort of ad hoc synthesis of two opposing approaches to research: one by Sir Ronald Fisher and the other by Neyman & Pearson. Each was proposed as the correct approach in contrast to the other, flawed one, yet the modern hybrid borrows from both; the criticisms of hypothesis testing thus date to its origins.

Sometimes, even when the logic underlying a method is flawed or the assumptions about the conclusions one can draw from it are invalid, the results can still be empirically supported. For example, in machine learning tens of thousands of hypotheses might be tested for a program to “learn” to recognize faces or how to weight various factors to recommend items you might like to watch on Netflix. Often, the tests used assume things about the variables that aren’t true, but they are close enough that the algorithms work. With NHST, not only is the method flawed and misused, but the kind of empirical demonstration that it works, or that its use should continue, has yet to be put forth despite 90 years of critical literature and counting.
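As a final illustration of the scale problem: run enough conventional tests and “significant” results are guaranteed even when there is nothing whatsoever to find. A quick sketch (sample sizes and counts invented), assuming Python with NumPy and SciPy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_tests, hits = 10_000, 0
    for _ in range(n_tests):
        a = rng.normal(size=30)
        b = rng.normal(size=30)       # same distribution: every null is true
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1

    print(f"{hits} 'significant' findings out of {n_tests} tests of pure noise")

Roughly five hundred “discoveries”, every one of them noise.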
Part II of…well you know: Statistical Misuse
But don’t worry! The fact that this incredibly widespread method is still being taught to a new generation of researchers hasn’t stopped the increasing misuse of particular statistical tests on top of it.
Researchers in the social & behavioral sciences aren’t generally known for their love of mathematics or mathematical proficiency. This is not to say that there aren’t plenty of sociologists or psychologists in numerous fields who have backgrounds in mathematics or even specialize primarily in applied mathematics; just that, in general, those going into graduate school programs in the social sciences can’t be expected to have taken many courses in mathematics. However, the kinds of methods in data analysis/statistics available today are legion. Some (like Bayesian analysis or permutation methods) have been around for a while but couldn’t be used, or couldn’t be used as they are now, because we lacked sufficient computing power. And while physics was the impetus for most mathematical breakthroughs, including the foundations of NHST, modern statistical methods are more likely to be developed in fields like machine learning and similar areas of computer science.

However, because even basic, traditional multivariate statistics (the “general linear model”) is essentially a collection of specific applications of linear algebra, with underlying assumptions requiring calculus, really understanding methods such as multivariate linear regression is beyond many researchers who have taken neither linear algebra nor calculus. So graduate schools, in the social sciences especially, increasingly teach researchers how to associate particular research questions with particular statistical methods which they can perform by plugging data into a software package like SPSS and hitting some buttons: “Hey Presto! The test did stuff and said our results were significant, so despite the fact that we don’t really understand what this means, we can bury it in the ‘Methods’ section of our paper and count on peer-reviewers to be equally ignorant.”
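To illustrate the linear algebra point: the regression coefficients a package like SPSS prints are, mathematically, just the solution of the normal equations. A minimal sketch with invented data, assuming Python with NumPy:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    X = np.column_stack([np.ones(n),           # intercept column
                         rng.normal(size=n),   # predictor 1
                         rng.normal(size=n)])  # predictor 2
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

    # The least-squares estimate solves the normal equations (X'X) b = X'y:
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)   # approximately [1.0, 2.0, -0.5]

That one call to a linear solver is, at bottom, what the button press performs; whether the user knows it is another matter.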
In short, a general lack of mathematical education among researchers in the social sciences (and elsewhere) has two results. First, they are mainly taught the most basic, simplest, and least powerful methods; thus the most frequently used statistics are those developed in the late 19th and early 20th centuries, and for many of these there were better methods even then. Second, to the extent that researchers do use sophisticated techniques like support-vector machines (SVMs) or spectral graph theory, they often have essentially no idea what it is that they are doing.
And, of course, even when the statistical methods are sophisticated and properly used, the results from NHST are still going to be worthless.
References on NHST:
402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests in Observational Studies (http://warnercnr.colostate.edu/~anderson/thompson1.html)
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.
Gill, J. (1999). The insignificance of null hypothesis significance testing. Political Research Quarterly, 52(3), 647-674.
Gliner, J. A., Leech, N. L., & Morgan, G. A. (2002). Problems with null hypothesis significance testing (NHST): What do the textbooks say? The Journal of Experimental Education, 71(1), 83-92.
Hobbs, N. T., & Hilborn, R. (2006). Alternatives to statistical hypothesis testing in ecology: a guide to self teaching. Ecological Applications, 16(1), 5-19.
Hubbard, R., & Armstrong, J. S. (2006). Why we don’t really know what statistical significance means: implications for educators. Journal of Marketing Education, 28(2), 114-120.
Hubbard, R., & Lindsay, R. M. (2008). Why P values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1), 69-88.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8(1), 3-7.
Johnson, D. H. (1999). The insignificance of statistical significance testing. The Journal of Wildlife Management, 63(3), 763-772.
Kline, R. B. (2013). Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. American Psychological Association.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16-26.
McCloskey, D. N., & Ziliak, S. T. (2009). The unreasonable ineffectiveness of Fisherian “tests” in biology, and especially in medicine. Biological Theory, 4(1), 44-53.
Orlitzky, M. (2011). How can significance tests be deinstitutionalized?. Organizational Research Methods, 1094428111428356.
Taagepera, R. (2008). Making Social Sciences More Scientific: The Need for Predictive Models. Oxford University Press.
Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press.