Part II: When your basic research design is flawed, use inferior statistical analyses to make things worse

Part I of Part II: Null Hypothesis Significance Testing (NHST)

A number of statistical tests, models, and methods aren’t really distinct from research designs. The most frequently used research design across the sciences involves:

1) Formulating a “testable” hypothesis (e.g., “X drug is effective for Y disease” or “reading blogs causes brain damage”)

2) Assuming an alternative—the “null” hypothesis—is true.

3) Designing a study that tests whether there is some statistically significant result that would be highly unlikely if the “null” hypothesis were true, and thus (in practice) taken to indicate that the original hypothesis is “probably true” (actually, this isn’t what NHST shows, but it is perhaps the most frequent, minimally problematic misinterpretation).

To give a concrete example, I’ll use medical studies, because the placebo design is widely known: you have a control group which receives a placebo and a treatment group which receives Herpexia. You know that not everybody will react to the treatment (or to the placebo) the same way, but you can use statistical tests to determine whether the differences between the control group and the treatment group are so (statistically) “significant” that the odds that the improvement in the treatment group is due to chance are tiny.
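The logic of such a test can be sketched with a toy permutation test (all the numbers below are made up purely for illustration): we ask how often randomly shuffled group labels would produce a difference in means at least as large as the one actually observed.

```python
import random
import statistics

random.seed(0)

# Hypothetical improvement scores (higher = better); invented for illustration.
placebo = [2.1, 1.8, 2.5, 1.9, 2.2, 2.0, 1.7, 2.3]
treatment = [2.9, 3.1, 2.6, 3.4, 2.8, 3.0, 2.7, 3.2]

observed_diff = statistics.mean(treatment) - statistics.mean(placebo)

# Permutation test: if the null hypothesis ("no treatment effect") were true,
# the group labels would be arbitrary, so we shuffle them many times and count
# how often a difference at least this large arises by chance alone.
pooled = placebo + treatment
n = len(placebo)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[n:]) - statistics.mean(pooled[:n])
    if diff >= observed_diff:
        count += 1

p_value = count / trials
print(f"observed difference: {observed_diff:.2f}, p ≈ {p_value:.4f}")
```

Note that even here, the small p only tells us the shuffled-labels world rarely produces such a gap; it says nothing by itself about *why* the gap exists.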

This is called “null hypothesis significance testing” (NHST) or significance testing (as well as statistical decision theory and other names). It is used in climate science, neuroscience, medicine, linguistics, business research, evolutionary psychology, education, biology, etc. There’s one tiny problem with it:

“In a recent article, Armstrong (2007) points out that, contrary to popular belief, “there is no empirical evidence supporting the use of statistical significance tests. Despite repeated calls for evidence, no one has shown that the applications of tests of statistical significance improve decision making or advance scientific knowledge” (p. 335). He is by no means alone in arguing this. Many prominent researchers have now for decades protested NHST, arguing that it often results in the publication of peer-reviewed and journal endorsed pseudo-science. Indeed, this history of criticism now extends back more than 90 years” (emphases added)

Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67-90.

I’ve relegated a more comprehensive bibliography to the bottom for continuity reasons. There are several reasons why significance testing is so problematic. First, it is found just about everywhere (if particle physicists ever start using it, I’ll give up and join a monastery or cult or something). Second, it lacks empirical or logical support. To give a brief example, imagine we transported scientists back in time to investigate the plague which depopulated much of Europe. Each research team gets a bunch of people (some with the plague, some without) and tests various hypothesized reasons for the cause of the Black Death. One group finds that its spread is statistically significantly related to ports and ships. Another group finds a statistically significant association between its spread and human contact. Another that it is due to rats. Another that it is caused by fleas. And so on.

The problem here is that a statistically significant association between X and Y is easily and often due to Z. Thus a research team can find that the plague was likely caused by sailors because they don’t bother taking into account the rats on the ships and the fleas that infested these rats.
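A toy simulation (with made-up probabilities) shows how a confounder Z can manufacture a strong association between X and Y even when X has no causal effect at all:

```python
import random
import statistics

random.seed(1)

# Hypothetical model: rat infestation (Z) raises both the chance a town is a
# busy port (X) and the chance of a plague outbreak (Y). X never causes Y.
towns = []
for _ in range(5000):
    rats = random.random() < 0.5                       # Z
    port = random.random() < (0.8 if rats else 0.2)    # X depends only on Z
    plague = random.random() < (0.7 if rats else 0.1)  # Y depends only on Z
    towns.append((port, plague))

# Yet in the data, ports and plague look strongly associated:
p_plague_given_port = statistics.mean(y for x, y in towns if x)
p_plague_given_no_port = statistics.mean(y for x, y in towns if not x)
print(f"P(plague | port) ≈ {p_plague_given_port:.2f}, "
      f"P(plague | no port) ≈ {p_plague_given_no_port:.2f}")
```

A significance test run on this data would happily flag the port/plague association, exactly as the time-traveling sailor-blamers above would want.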

Basically, this method never tells us what we want to know, which brings us to the third problem: because it only tells us (at best) the probability that we’d observe the data we do given that the null hypothesis (the “our real hypothesis is wrong” hypothesis) is true, it tells us essentially nothing. This probability is not logically equivalent to the probability that the null hypothesis is actually true (or false), that the “real” hypothesis is true/false, or even that, given the null hypothesis is true (false), the probability of getting data X is p. However, that doesn’t stop textbooks and researchers from treating NHST as if it told us the probability that the null hypothesis is true or false (among other widespread misuses and fallacious inferences). In fact, part of the problem is that even with an ideal, perfect study using NHST, in which all assumptions made explicitly or implicitly are satisfied (from sampling methods to assumptions inherent in the chosen statistical method), we can’t learn much of anything from a significant result. So, apparently, the best thing to do is to pretend this method shows something other than what it does.
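A quick back-of-the-envelope calculation shows why P(data | null) is not P(null | data). Assuming, purely for illustration, that only 10% of tested hypotheses are true, that studies have 50% power, and that the threshold is α = 0.05:

```python
# All numbers are hypothetical, chosen only to illustrate the arithmetic.
prior_true = 0.10   # fraction of tested hypotheses that are actually true
alpha = 0.05        # significance threshold (false-positive rate under the null)
power = 0.50        # chance a study detects a real effect

significant_false = (1 - prior_true) * alpha   # true nulls flagged anyway
significant_true = prior_true * power          # real effects detected

# Among "significant" results, the fraction that are false positives:
false_discovery = significant_false / (significant_false + significant_true)
print(f"{false_discovery:.0%} of significant results are false positives")
```

With these (hypothetical) numbers, nearly half of all “p < .05” findings are false, despite every individual test behaving exactly as advertised.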

Worse still, despite the lengthy criticism and plenty of better approaches, researchers who use better methods find their studies rejected or at least requiring revision (basically, they are asked to use NHST). So, for example, a predictive model which tells us what we should expect if the model is right gets canned in favor of worthless probability values.

Finally, the only thing growing faster than the sheer volume of criticisms of this method is its use. In fact, historically, the modern form of NHST emerged from a sort of ad hoc synthesis of two opposing approaches to research: one by Sir Ronald Fisher and the other by Neyman & Pearson. Thus the criticisms of hypothesis testing date to its origins: both approaches are still used today, and each was proposed as the correct approach in contrast to the other, flawed one.

Sometimes, even when the logic underlying a method is flawed or the assumptions about the conclusions one can draw from it are invalid, the results can still be empirically supported. For example, in machine learning tens of thousands of hypotheses might be tested for a program to “learn” to recognize faces or to weight various factors to recommend items you might like to watch on Netflix. Often, the tests used assume things about the variables that aren’t true, but they are close enough to true that the algorithms work. With NHST, not only is the method flawed and misused, but the kind of empirical demonstration that it works, or that its use should continue, has yet to be put forth despite 90 years of critical literature and counting.

Part II of…well you know: Statistical Misuse

But don’t worry! The fact that this incredibly widespread method is still being taught to a new generation of researchers hasn’t stopped the increasing misuse of particular statistical tests.

Researchers in the social & behavioral sciences aren’t generally known for their love of mathematics or mathematical proficiency. This is not to say that there aren’t plenty of sociologists or psychologists in numerous fields who have backgrounds in mathematics or even specialize primarily in applied mathematics. It is just that, in general, those going into graduate school programs in the social sciences can’t be expected to have taken many courses in mathematics.

However, the kinds of methods in data analysis/statistics available today are legion. Some (like Bayesian analysis or permutation methods) have been around for a while but couldn’t be used, or couldn’t be used as they are now, because we lacked sufficient computing power. While physics was the impetus for most mathematical breakthroughs, including the foundations of NHST, modern statistical methods are more likely to be developed in fields like machine learning or similar areas of computer science.

Yet because even basic, traditional multivariate statistics (the “generalized linear model”) is essentially a collection of specific applications of linear algebra, with underlying assumptions that require calculus, really understanding methods such as multivariate linear regression is beyond the reach of many researchers who have taken neither linear algebra nor calculus. So graduate schools, in the social sciences especially, are increasingly teaching researchers how to associate particular research questions with particular statistical methods which they can perform by plugging data into a software package like SPSS and hitting some buttons: “Hey Presto! The test did stuff and said our results were significant, so despite the fact that we don’t really understand what this means, we can bury it in the “Methods” section of our paper and count on peer-reviewers to be equally ignorant”.
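For what it’s worth, the linear algebra behind ordinary multiple regression fits in a few lines. Below is a minimal sketch (toy, noiseless data invented for illustration) that solves the normal equations (XᵀX)β = Xᵀy directly, which is essentially the computation hiding behind the button:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Toy, noiseless data generated from y = 1 + 2*x1 + 3*x2.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
X = [[1.0, x1, x2] for x1, x2 in points]        # design matrix with intercept
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in points]

# Normal equations (X'X) beta = X'y -- nothing but linear algebra.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(3)]

beta = solve(XtX, Xty)
print([round(b, 6) for b in beta])  # recovers intercept 1 and slopes 2, 3
```

The point isn’t that researchers should hand-roll regressions, but that what the software does is not magic, and the assumptions baked into it are invisible to anyone who has never seen this machinery.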

In short, a general lack of mathematical education among researchers in the social sciences (and elsewhere) has two results. First, they are mainly taught the most basic, simplest, and least powerful methods; thus the most frequently used statistics are those developed in the late 19th and early 20th centuries, and for many of these there were better methods even then. Second, to the extent that many researchers use sophisticated techniques like support-vector machines (SVMs) or spectral graph theory, they have essentially no idea what it is that they are doing.

And, of course, even when the statistical methods are sophisticated and properly used, the results from NHST are still going to be worthless.

Go team.

References on NHST:

402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests in Observational Studies

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual. In D. Kaplan (Ed.). (2004). The Sage handbook of quantitative methodology for the social sciences (pp. 391–408).

Gill, J. (1999). The insignificance of null hypothesis significance testing. Political Research Quarterly, 52(3), 647-674.

Gliner, J. A., Leech, N. L., & Morgan, G. A. (2002). Problems with null hypothesis significance testing (NHST): what do the textbooks say? The Journal of Experimental Education, 71(1), 83-92.

Hobbs, N. T., & Hilborn, R. (2006). Alternatives to statistical hypothesis testing in ecology: a guide to self teaching. Ecological Applications, 16(1), 5-19.

Hubbard, R., & Armstrong, J. S. (2006). Why we don’t really know what statistical significance means: implications for educators. Journal of Marketing Education, 28(2), 114-120.

Hubbard, R., & Lindsay, R. M. (2008). Why P values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1), 69-88.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8(1), 3-7.

Johnson, D. H. (1999). The insignificance of statistical significance testing. The Journal of Wildlife Management, 763-772.

Kline, R. B. (2013). Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. American Psychological Association.

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16.

McCloskey, D. N., & Ziliak, S. T. (2009). The Unreasonable Ineffectiveness of Fisherian “Tests” in Biology, and Especially in Medicine. Biological Theory, 4(1), 44.

Orlitzky, M. (2011). How can significance tests be deinstitutionalized? Organizational Research Methods, 1094428111428356.

Taagepera, R. (2008). Making Social Sciences More Scientific: The Need for Predictive Models. Oxford University Press.

Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press.

This entry was posted in Statistics, The Language of Science: Why most scientists don't speak it, and The Scientific Method.

5 Responses to Part II: When your basic research design is flawed, use inferior statistical analyses to make things worse

  1. Pingback: Preview of Part II on a general critique of social and (some) other sciences | Research Reviews

  2. Jeff says:

    What are your opinions on multivariate or multi-variable analysis?
    … and how do you define multivariate regression analysis?

    These questions might be off-topic, but I was just wondering what you think about them in general regardless of how they are used in academia and research.

    • Multivariate analysis is a huge collection of methods. Those most generally used (multivariate regression analysis and multivariate correlation analysis, and more broadly the GLM) are the least robust to the most common violations of standard assumptions about variables (normality, linearity, homoscedasticity, and absence of multicollinearity). Of course, for something like (multiple) regression analysis one can use it effectively whilst violating all the assumptions the method makes about the variables analyzed, if one is lucky. Put differently, whether a statistical test/method is “robust” to violations of its underlying assumptions is partly a matter of its application. I can use multiple regression in data mining, learning algorithms, and similar applications when the underlying data are nonlinearly distributed (typically in such applications we’d say that the data points aren’t linearly separable) or violate continuity and, by extension, normality, etc. But this is because if I’m using it for, e.g., supervised or unsupervised learning, then my program either works or it doesn’t. If I’m using the same methods to analyze participant responses in an fMRI study, I can’t determine the extent to which violations of underlying assumptions that I can’t in general test for render my use of a particular statistical test/method useless. That is, the outcome of any use of some statistical test/method in, e.g., cognitive neuroscience, evolutionary psychology, psychiatry, etc., can’t really be tested.
      To use a specific example, imagine I hypothesize a particular relationship between religiosity and intelligence, as hundreds of studies have/do. I can only test this hypothesis by assuming the variables “intelligence” and “religiosity” both exist and exist in particular ways. I cannot test whether these variables are normally distributed or whether a particular statistical method will fail because of multicollinearity, because I can’t even be sure the variables correspond to anything “real”. This is a huge problem in psychiatry, as the biomedical model (the standard model of psychiatry) holds that all mental disorders are discrete diseases, each with its own etiology. However, this was an assumption made in the 80s that has yet to find support in research, and against which there exists an impressive number of studies. Basically, the same brain regions and chemicals are implicated in so many distinct disorders, and comorbidity is so prevalent, that we have plenty of reasons to think the distinct disorders we’ve defined are not actually all distinct (at least not in the ways we have assumed they are). The relevant issue here, though, is that the same brain regions and neurotransmitters are implicated in such a wide-ranging number of cognitive and emotional processes, as well as mental health issues, that the “standard” multivariate methods can fail simply because they will identify whatever is tested as significant to, causally implicated in, or otherwise relevant to so many “distinct” disorders that exist by definition.
      So if I, e.g., run some statistical test to determine the relevancy/relationship between serotonin and some mood disorder, I can’t know whether or not the nature of the results will be due to the ubiquitous importance of serotonin to so many brain processes, the fact that the mood disorder in question is defined by mental states that share an underlying neurophysiological basis with everything from “normal” mental states to other mental disorders, an actual relationship between the disorder of interest and serotonin, etc.
      This is not true of e.g., pattern recognition. If I am trying to boost the performance of a product recommendation algorithm or facial recognition algorithm, and I decide that multiple regression is an adequate tool, I will know if it is or not because either my program will do what I want or it won’t work.
      This is also true in computational sciences from computational chemistry to computational neuroscience, or more generally true of modelling. If I build a model of cellular activity, cosmic rays and cloud seeding, or neural circuitry, I can compare the performance of my model with the system(s) I am trying to model. Either they work for my purposes, or they don’t (in general, we require extremely simplified models for most systems of interest). This is true even in certain social sciences, such as economics: if I use some set of statistical methods/tests on economic data to build a model, and my model predicts that some economy or the global economy will behave in a particular way, I may not know whether or not my statistical methods/tests assumed things that weren’t true of the variables that make up my model, but I can know that my model succeeded or failed.
      There really isn’t a statistical test, measure, or method that can simply be declared useless given violations of its underlying assumptions, whether we are talking about simple correlation using Pearson’s r or manifold learning. The problem is that in too many sciences we neither know the extent to which the underlying assumptions are violated, nor are we able to test whether they are and what the effects of these violations might be. That’s why the use of, and the demand for the use of, NHST is such a problem: despite the “hypothesis testing” in the name, this approach doesn’t allow us to determine if we are right, because we are only testing the outcome against the assumptions we made for these tests to begin with (given that our hypothesis is wrong, we are at best testing the probability that we’d get the data we find). All the assumptions are built into the design, unlike with models we can run and compare to actual outcomes.
      Multivariate regression analysis refers to a variety of techniques which fit points in n-dimensional space to a “line” in that space. But like the “area under the curve” definition of the integral from calculus, this basic definition is at best misleading for a large number of actual applications. For example, anything that fits points to a line can also be used for classification/clustering (determining which data points “belong” to a particular region of “space”), which is basically all of machine/statistical learning, data mining, pattern recognition, latent variable analysis, and more. Also, as multivariate regression includes regression models/methods that are more robust to violations of linearity like logistic regression or nonlinear regression, even the nature of the “line” the points in some n-dimensional space are fitted to changes.
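      To make the Pearson’s r point concrete, here is a toy sketch (data invented for illustration) in which y is a deterministic function of x, yet r is essentially zero because the relationship is nonlinear:

```python
import math

# Perfectly deterministic but nonlinear relationship: y = x**2,
# over a symmetric range of x values.
xs = [x / 10 for x in range(-50, 51)]
ys = [x * x for x in xs]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
print(f"Pearson r = {r:.4f}")  # near zero despite total dependence
```

      If your program needs to exploit that dependence and can’t, you find out immediately; if your theory about two psychological constructs rests on that r, you may never find out at all.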

  3. Jeff says:

    I was expecting a small paragraph, not an entire blog entry. Thanks!

    Speaking of blog entries though. I would like to see one on confirmation bias (relates to cognitive sciences if you take requests).

    I think a lot of people tend to make assumptions, and will agree with anything that aligns with their flow of logic no matter how flawed or illogical it is. Like with political viewpoints, where someone will contort a topic to the point of the extreme and try to make it seem valid no matter how illogical it is.

    It’s like the following scenario.

    Person 1: “It’s about to pour-down…”
    Person 2: “How do you know?”
    Person 1: “Because it’s cloudy outside and gray clouds are usually a sign of rain.”
    Person 2: “That doesn’t mean it will rain. That is just a good sign that a storm is brewing and it will rain somewhere. Not necessarily here.”

    You have two people in disagreement. Both are partially right, but might not know whether it will pour-down or not. All they can do is base it on experience or past observations.
