Have you ever been asked to fill out a survey that included some statement (e.g., “I experience back pain”, “I have difficulty concentrating”, “I write blogs nobody in their right mind would read”, etc.) and an “ordered” set of responses (e.g., “Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree”; “Always, Almost Always, Occasionally, Almost Never, Never”; etc.)? These kinds of questions and response formats are referred to as “Likert scales” or “Likert-type scales”. They are extremely common in fields ranging from neuroscience to managerial science. Individual responses are almost always treated as a single value, ranging from 1-5 or 1-7 or 1-to-however many possible options there are. Then, statistical tests are used to analyze the responses to determine things about certain subpopulations of people (say, bloggers, or those who are depressed) or even all people. One problem many researchers have found with this method is the built-in assumption that a particular response such as “Agree” is EXACTLY one “unit” more than “Strongly Agree” and one less than “Neutral”. Numerous proposals for better approaches exist in the literature, including the focus for this post:
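To make the coding concrete, here is a minimal sketch (with invented responses; nothing here comes from the study) of the standard practice just described: map each response option to an integer and then compute statistics on the integers, silently assuming that adjacent options are exactly one unit apart.

```python
# Hypothetical Likert data: the usual 1-5 integer coding.
# The coding itself smuggles in the equal-spacing assumption.
CODING = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly Agree": 5}

responses = ["Agree", "Neutral", "Agree", "Strongly Agree", "Disagree"]
scores = [CODING[r] for r in responses]

# Taking a mean treats "Agree" as exactly one unit above "Neutral",
# which is precisely the assumption researchers have questioned.
mean_score = sum(scores) / len(scores)
print(mean_score)  # 3.6
```

The arithmetic is trivial; the point is that the moment one averages, every pair of adjacent options has been declared equidistant.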
Yusoff, R., & Janor, R. M. (2014). Generation of an Interval metric scale to measure attitude. SAGE Open, 4(1), 2158244013516768.
However, I’m not interested in Likert-type response data here. I’m interested in something FAR more problematic across scientific fields. Namely, the use of statistical methods and their underlying assumptions by those who don’t understand basic probability and statistics well enough to be entrusted to calculate the probability of a fair coin toss (ok, that’s not fair; they can be trusted with such a task, and even with calculating the probability that a roll of two dice will equal some number between 2 and 12). The authors of the aforementioned study are proposing a method superior to Likert scales for evaluating variables of the kind that thousands of scientists study. In the study, they make sure to define the relevant, basic “facts” about the kinds of variables involved. For example, a variable such as number of children, or number of grades completed from kindergarten through graduate education, is considered discrete. The authors define discrete as follows:
“According to Mann (2001), a discrete variable assumes values that are obtained from counting”
This “Mann (2001)” they refer to is a textbook, Introductory Statistics, by Prem S. Mann. I don’t have the 2001 edition, but I have a later edition and it provides the following definition:
“Discrete Variable A variable whose values are countable is called a discrete variable. In other words, a discrete variable can assume only certain values with no intermediate values.
For example, the number of cars sold on any given day at a car dealership is a discrete variable because the number of cars sold must be 0, 1, 2, 3, . . . and we can count it.”
Naturally, one would think, the authors of the study are justified in saying that “[t]o obtain the values of a discrete variable, all one has to do is to count; hence, the operational procedure is counting.”
There’s just a tiny little problem here: this is absolutely, ludicrously, and completely wrong. But it isn’t the researchers’ fault: after all, Mann’s text is similar here to many other introductory and even graduate-level statistics textbooks. It assumes the reader has little or no mathematical background and that notions like continuity are foreign, so it provides a simplification, one that ends up being used in technical, peer-reviewed research.
In actuality, a discrete variable is “countable” in the following sense: either there are finitely many values it can assume (such as the number of cars sold on a Tuesday from Bernoulli Family Dealership) or there are countably infinitely many values it can assume. I’ve previously provided an informal account of the difference between countably infinite and uncountably infinite, so I won’t bother to do so here. All I wish to point out is something pretty obvious: you can’t count to infinity. “Countable”, in the sense Mann and mathematicians more generally mean, doesn’t entail that there is ANY way one can handle all discrete variables by using “counting” as “the operational procedure”.
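One way to see what mathematicians mean by “countable” is an explicit enumeration. The sketch below uses the Calkin–Wilf sequence (a standard construction, not anything from the paper) to list the positive rationals one at a time without ever repeating: the enumeration exists, which is what makes the set countable, yet no finite amount of counting ever exhausts it.

```python
from fractions import Fraction
from math import floor

def calkin_wilf(n):
    """Yield the first n terms of the Calkin-Wilf sequence, which
    enumerates every positive rational exactly once: a concrete
    'counting' of an infinite set that counting can never finish."""
    q = Fraction(1, 1)
    for _ in range(n):
        yield q
        q = 1 / (2 * floor(q) - q + 1)  # Newman's successor formula

print(list(calkin_wilf(6)))
# [Fraction(1, 1), Fraction(1, 2), Fraction(2, 1), Fraction(1, 3), Fraction(3, 2), Fraction(2, 3)]
```

Every positive rational appears somewhere in this list, but there is no “last” term: that is countable infinity, and it is a very different thing from “values you obtain by counting”.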
What is the opposite of a discrete variable? A continuous variable, which the authors describe as follows: “continuous variables are obtained by measuring and thus, assumes any value contained in an interval, for example”.
Of course, the “obtained by measuring” bit is nonsense, because we “measure” discrete variables. Far worse, however, is how the researchers seem to understand “continuous”, which they describe in greater detail as follows: “values of a continuous variable are obtained using a measuring tool or scale that implies the existence of a more elaborate operational procedure that must be clearly defined as the basis for measurement. Because of the dependence on a measuring instrument, values obtained will be subjected to measurement errors, not exact, and fall within an interval that consists of infinite points.”
Here’s the takeaway/key point: there exists a set, one you’ve been using at least since you were a teenager, that contains infinitely many points in ANY interval and which is COUNTABLE. This set is called the “rational numbers”. According to the researchers’ definition (and even many statistics textbooks), the rational numbers would be continuous. It turns out that if one tries to apply the statistical methods used by researchers in all kinds of fields to a variable that is at most as “big” and dense as the rational numbers (i.e., one that is countably infinite and can assume infinitely many values between any two values), they won’t merely give the wrong answer; they will give no answer at all. This is because continuous variables require something from calculus called integration, or integrals (actually, they really involve limits, but because integrals require limits and calculations with continuous variables require integration, I’m simplifying). In order for a function or a variable in probability theory/statistics to be integrable, it cannot be limited to countably many values/outcomes.
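A quick sketch of the density claim, using exact rational arithmetic: between any two rationals sits another rational (their midpoint), and repeating the step shows there are infinitely many rationals inside any interval, however small.

```python
from fractions import Fraction

# Between ANY two rationals there is another rational (their midpoint),
# so the rationals are dense in every interval -- yet still countable.
a, b = Fraction(1, 3), Fraction(1, 2)
between = []
for _ in range(5):
    m = (a + b) / 2
    between.append(m)
    b = m  # repeat inside the ever-smaller interval (a, m)

print(between)
# [Fraction(5, 12), Fraction(3, 8), Fraction(17, 48), Fraction(11, 32), Fraction(65, 192)]
```

The loop could run forever; each pass finds a fresh rational strictly between 1/3 and the last one found, so no interval, however tiny, runs out of them.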
To make this a bit clearer, consider the “bell curve” or “normal distribution”. As with any continuous distribution, using it to calculate probabilities, for statistical tests that assume a variable is normally distributed (i.e., all the most common tests), to determine whether some experiment yields a statistically significant outcome, to test whether two variables are correlated, or for any number of other purposes, depends upon integrating a particular normal distribution. Trying to integrate a variable that can take on at most countably infinitely many values and that is everywhere dense (i.e., between any two possible values there are infinitely many other possible values) is impossible. Ergo, if the researchers actually had to do more than look something up in an appendix or use a statistical software package to calculate statistical significance using their definitions of discrete and continuous, they couldn’t even get a wrong answer; they’d get no answer.
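The “no answer” point can be made concrete with the classic example of the Dirichlet function (1 on the rationals, 0 on the irrationals) on [0, 1]. Under the Riemann definition, the integral exists only if the Riemann sums converge to one value no matter how the sample points are chosen. The sketch below tags sample points as rational or irrational rather than testing floats (every float is technically rational, so the tag stands in for exact arithmetic): rational sample points always give 1, irrational ones always give 0, and the defining limit simply does not exist.

```python
from fractions import Fraction

def riemann_sum_dirichlet(n, rational_points):
    """Riemann sum on [0, 1] for the Dirichlet function
    (1 on rationals, 0 on irrationals) using n equal subintervals.
    Since every subinterval contains both kinds of point, we can
    legitimately choose all-rational or all-irrational samples."""
    width = Fraction(1, n)
    value = 1 if rational_points else 0  # f at every chosen sample point
    return sum(width * value for _ in range(n))

# No matter how fine the partition, the two choices never agree,
# so the limit that would define the integral does not exist.
print(riemann_sum_dirichlet(10, True), riemann_sum_dirichlet(10, False))          # 1 0
print(riemann_sum_dirichlet(10_000, True), riemann_sum_dirichlet(10_000, False))  # 1 0
```

(For what it’s worth, the more powerful Lebesgue integral does handle this function, assigning it integral 0 precisely because the rationals are countable; but the Riemann-style integration underlying introductory statistics gives no answer at all.)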
Here’s the real problem: almost none of the variables that are treated as continuously distributed can assume even countably infinitely many values; they can assume only finitely many. If an infinite set like the rationals is too “small” for the standard integration required by all the most common statistical tests, a finite set fares far worse.
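By contrast, a genuinely finite discrete variable needs no integration at all: its “operational procedure” really is counting and summing over a probability mass function. A sketch using the two-dice total mentioned earlier:

```python
from itertools import product
from fractions import Fraction

# Exact pmf of the total of two fair dice: a finite discrete variable,
# so probabilities come from counting outcomes and summing, never integrating.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)

# P(total = 7) and P(4 <= total <= 6) by plain summation.
print(pmf[7])                            # 1/6
print(sum(pmf[t] for t in range(4, 7)))  # 1/3
```

Sums like these are exact and always well defined, which is exactly why conflating this finite, discrete situation with a continuous one (where the analogous operation is an integral) matters.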