Cognitive & Personality Tests or How to Publish Pseudoscience in Mainstream Journals

I’ve seen more than one link to an article about the relationships between certain traits. Generally speaking, one of two traits is compared with intelligence: political orientation or religion (although sometimes both). My goal here is mainly to go over why the studies are almost always inadequate, inaccurate, and sometimes just downright deceptive. However, I will spend some time on a few other issues, especially how intelligence is typically approached within the sciences.

Intelligence Metrics

Often a paper will appear asking when or if we will ever encounter another intelligence in the universe, whether in the form of extraterrestrial life or strong A.I. Personally, I think this question can’t be answered because it assumes humans are intelligent. I think we’re a bunch of morons, and those who tend to be thought of as more intelligent (Kurt Gödel, Isaac Newton, Stephen Hawking, etc.) are often more idiotic than the average person. Gödel, a man with means, reputation, money, and a brother who was a medical doctor, died of starvation. Why? He refused to eat. This was the greatest logician of all time, as much a genius as anyone can be, and the guy died of starvation while surrounded by food.

However, not everybody shares my cynicism, and even those who do recognize the usefulness of a measure of cognitive abilities for everything from social policy to A.I. research. Outside of most research, intelligence is described informally in terms of traits like capacity for spatial orientation, recall, the ability to recognize abstract associations between seemingly unrelated concepts, the ability to obtain and retain information from verbal data, etc. I had to take the SAT (one of the two central US college admissions tests) again in order to work for a test prep company. This is because when I took it the first time (back when I should have had plans to go to college) it had only two parts: verbal and math. Now there are three (the verbal was split in two), and teachers are required to score at least 90% on every part of any test they teach or tutor for. We are supposed to answer questions about intelligence and the SAT with “the SAT isn’t an intelligence test,” which isn’t true, instead of the much better answer: “the SAT tells you how well you take the SAT.”

The reason I say that it is an intelligence test of sorts is best explained by comparison with the other pre-college admissions test: the ACT. The SAT math, for example, is much more basic than the ACT’s. In fact, it’s mostly mathematics one learns before and early on in high school, particularly algebra and geometry. Not only that, but the SAT gives you the formulae for calculating things like volume. The ACT does not, and it includes questions about matrices and higher-level mathematics. Yet the ACT is also the easier test. Why? Because ETS, the company behind both the GRE (the standard graduate admissions test) and the SAT, knows how students are taught. They know, for example, that math is usually taught procedurally, so they deliberately ask questions that test concepts the students have studied, but in a way they have never seen. A concrete example may help. The students I taught readily recall (or can quickly relearn) how to solve equations, what functions are, and so forth. But then they’re given a function of the form f(x) = (2x + 3x)^3 – 7. Such a question can throw students off, as they often apply a rule they learned (distribute the exponent outside the parentheses to each term inside) where it can’t apply (that rule is for terms in parentheses that are multiplied, not added or subtracted).

However, that’s something students have encountered. Almost every SAT test will have at least one question of the form

x # y = 2x + 3y^2 -7. What is the value of 3 # 5?

You are supposed to see immediately that this is really just a function of two variables written in unfamiliar notation. Basically, whatever goes on the left of # takes the place of x, and whatever goes on the right takes the place of y, so 3 # 5 = 2(3) + 3(5)^2 – 7 = 74.
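For the literal-minded, here is a minimal sketch of what that notation amounts to (Python is my choice here, nothing the test itself uses; the function name is made up):

```python
# The made-up "#" operator is just a two-argument function in disguise.
def pound(x, y):
    return 2 * x + 3 * y ** 2 - 7

print(pound(3, 5))  # 2*3 + 3*25 - 7 = 74
```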

The main reason I brought up the 2-part SAT is because when the test changed, they took away part of the verbal I loved and most people hate (liking these is probably a criterion for some mental health diagnosis): analogies. “An elephant is to a picture frame as a pillow is to a…” followed by a number of options. This is “verbal”, but it tests an abstract ability to see relations between pairs of concepts.

IQ tests used to be (and frequently still are) divided broadly into verbal and nonverbal sections. Now there are either subcategories or simply different categories, but they all relate to things like finding abstract patterns, making connections, mental “rotation” of ideas or of actual pictures of objects, and pretty much the same things the old IQ tests had. The SAT is “not an IQ test” not so much because it isn’t an intelligence test (it is designed so that the knowledge required is at least a grade or two below that of the earliest test-takers, yet it predicts scholastic aptitude; if it isn’t testing “intelligence” the way an IQ test does, how is it a predictor of scholastic aptitude?). Rather, it’s because the test is designed so that students who are not in honors English or haven’t taken trig still have all the “knowledge” necessary. What they may lack is the cognitive ability required to apply that knowledge.

But both IQ tests and the SAT still require knowledge. You can improve your IQ score the same way you can improve your SAT score: by knowing what’s on the test and practicing it.

On the other side of the fence are things like grades or level of education. Studies which relate, e.g., religiosity to intelligence frequently use GPA and (if it is a longitudinal study, or if the participants are adults) level of education completed. I’ve never seen level of education used alone, but I have seen GPA used alone. One reason for this is that about 90% of all non-medical research using human subjects uses college kids. They can be given course credit or a small amount of money, making them a cheap way to obtain participants (and an easier way to find them, as there are many ways to contact students that are not possible with the public at large). The problem is that although getting GPAs from college students is a snap, getting IQ scores is not. So SAT scores are often used in addition to GPAs.

But don’t they have, like, statistical models and stuff to deal with biases and other stuff?

Gee, I’m glad you asked that! They do have models which can help identify biases and which can be custom-made for the experimental paradigm to produce optimal results. But using them would require knowing a lot more about statistical models than most researchers asking this type of question know. The people who do this kind of research are usually social psychologists or something close (most social neuroscientists are just social psychologists who learned how to press the buttons that control some neuroimaging technology and process the signals). Psychologists are the experts when it comes to behavioral experiments; they invented them. They also did this before there were statistical software packages, and even before personal computers, so they designed experiments that required very little computational work. Interestingly, many of the standard statistical tests still used today were known to be poor, or at least known to have deficiencies that other already-existing tests did not have. However, as the more advanced methods weren’t really practical before computers, they weren’t used even by the people who developed them.

I’ll give an example from a favorite statistician, who has written two intro stats textbooks: one requires almost no mathematical background, the other not much more. This example is from the easier of the two: Basic Statistics by Rand R. Wilcox.

It concerns an actual study in which undergrads were asked how many sexual partners they wanted. The average for the males was almost 65 partners. Now, before anybody goes jumping to conclusions or thinking something like “well, waddya expect? Men are dogs who have two heads and think with the smaller one,” there’s more. For those who require a refresher, an average is computed by adding up the individual answers (in this case, summing each student’s numerical answer for how many sexual partners they wanted) and then dividing by the number of answers (which is in this case the same as the number of participants). For the males, one answer given was 6,000. The total number of males was 105, but let’s say 100 because it’s easier. First, suppose every male said they wanted 10 sexual partners. The average would then be 10: add 10 + 10 + … + 10 until you’ve added it 100 times, then divide by 100. Now let’s stick with 100 subjects, but this time replace one of the 10’s with the 6,000 guy. The average is now (99 × 10 + 6,000)/100 = 6,990/100 = 69.9. In other words, 99% of the subjects answered 10, and the average is ~70.
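A minimal sketch of the same arithmetic (in Python; the numbers are the made-up 100-subject version above, not the study’s actual data):

```python
from statistics import mean

# 99 subjects answer 10; one outlier answers 6000.
answers = [10] * 99 + [6000]

print(mean(answers[:-1]))  # without the outlier: 10
print(mean(answers))       # with the outlier: 69.9
```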

But that’s pretty easy to spot. Here’s the full list:

6 1 1 3 1 1 1 1 1 1 6 1 1 1 4

5 3 9 1 1 1 5 12 10 4 2 1 1 4 45

8 5 0 1 150 13 19 2 1 18 3 1 3 1 11

1 2 1 1 1 12 1 1 2 6 1 1 1 1 4

1 150 6 40 4 30 10 1 1 0 3 4 1 4 7

1 10 0 19 1 9 1 1 1 5 0 1 1 15 4

1 4 1 1 11 1 1 30 12 6000 1 0 1 1 15

If we took the average after throwing out the 6,000 we’d still have a bad answer (this is why you should always graph your data, never just rely on the computations). An important thing to realize is that the standard statistical inference tests rely on the sample average. There are three measures of central tendency: the mode, the mean (average), and the median. The mode tells you which value occurs most often, and the median tells you what number you’d have in the middle if you put all the values in order from lowest to highest. Here are the first three rows from above, unordered and then ordered:

6 1 1 3 1 1 1 1 1 1 6 1 1 1 4 5 3 9 1 1 1 5 12 10 4 2 1 1 4 45 8 5 0 1 150 13 19 2 1 18 3 1 3 1 11

0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5 5 5 6 6 8 9 10 11 12 13 18 19 45 150

If we used all 105 values we’d get a longer list, but with basically the same mean, median, and mode. There’s a fairly simple way to do everything the tests from the standard intro stats textbooks can do using the median instead. It involves something called quartiles, but the basic idea is to use the median (and sometimes the mode) and have, instead of a single mean, four “medians” that split the ordered data into quarters. The reason the mean is used is that, by itself, it is supposed to be a better indicator of all the values than the median or mode. However, the average value is, on average, not a very good average.
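A quick sketch using only Python’s standard library (the data are the first three rows quoted above):

```python
from statistics import mean, median, mode, quantiles

# The first three rows of the "sexual partners" data listed above.
data = [6, 1, 1, 3, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1, 4,
        5, 3, 9, 1, 1, 1, 5, 12, 10, 4, 2, 1, 1, 4, 45,
        8, 5, 0, 1, 150, 13, 19, 2, 1, 18, 3, 1, 3, 1, 11]

print(mean(data))            # ~8.2, dragged up by the 45 and the 150
print(median(data))          # 2
print(mode(data))            # 1
print(quantiles(data, n=4))  # the three cut points that split the data into quarters
```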

The mean is heavily influenced by things like outliers and clusters, and determining when a value is an outlier is not as simple as in the above example. Entire volumes are devoted to outlier detection.

Just by looking at the ordered list, it’s clear we want the 1’s to count the most (over half the entire list is within 2 of 1), yet using the mean will not do this.

Why the average matters more than average

So we know the average can be a bad metric, but there’s a reason one learns that there are three measures of central tendency: none is good on its own. More importantly, we don’t find lists of averages in studies telling us, e.g., that people who are “night owls” tend to be more intelligent (I’m not making this up; the journal Personality and Individual Differences published the study “Why night owls are more intelligent” in 2009, using the same theoretical basis one of the authors used a year later for a similar study entitled “Why Liberals and Atheists Are More Intelligent”). Instead of constant reports of averages, we find p values from tests with names like t-test or MANOVA and so on. Why, then, does it make any difference what problems there are with averages?

Perhaps the central concept that runs throughout all of the most typical statistical tests is variance. Typically, the value of the population mean isn’t known, so it is estimated from the sample under the assumption that sample means quickly approach population means (which is true), and one can even calculate how many participants are needed for a decent estimate. The most basic test uses the population mean to compute a z-score, which gives the probability of getting results like ours by chance. At some point (the alpha level, or p value), we say the chances are so small that our results are… [cue dramatic music] statistically significant. A variation on this test, the t statistic, is almost always used instead of z-scores because the population average is unknown; the t-test uses the sample mean to approximate the population mean. Because we know that sample means quickly approach population means, this (so just about every intro stats book tells us) is no problem. So we calculate the t statistic and we get a p value. The three standard acceptable alpha levels (i.e., the ones at which researchers can declare their results “statistically significant”) are .05, .01, and .001 (a 5% chance, a 1% chance, or a .1% chance). But as the t value doesn’t by itself tell us these probabilities, we have to use a program or a table, together with the number of participants (or answers, or whatever the hell kind of values we are trying to determine are statistically significant), to find the associated critical value. So, for example, if we asked 15 college frat boys how many sexual partners they wanted and computed our t value, that value would have to be at least 1.341 to reach the .10 level (one-tailed) in the table I used. With 120 frat boys, we’d only need to reach 1.289 for the same level. Notice that there isn’t much of a difference. In fact, the table I used doesn’t go past 120 because beyond that the differences never matter much.
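You can reproduce those table values with, e.g., SciPy (assuming it’s installed; the .10 one-tailed level is the one the numbers above correspond to):

```python
from scipy import stats

# Critical t values for a one-tailed test at alpha = 0.10,
# for the two table rows mentioned above.
for df in (15, 120):
    print(df, round(stats.t.ppf(0.90, df), 3))  # ~1.341 and ~1.289
```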

One of the many ways using a t value can get you garbage results comes from a step in computing it that relies on the sample mean. Basically, the t-test uses the variation of the sample as an indicator of the variation in the population, and variation is computed using the mean. So the very first step in getting a t value is to compute the sample mean. Everything else builds on that first step and therefore depends upon the average. If the average is garbage for reasons like those above, then chances are the t value will be garbage too.
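A small sketch of that dependence (standard library only; the data are made up for illustration, not taken from the study):

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0=0):
    # t = (sample mean - hypothesized mean) / (sample std / sqrt(n))
    return (mean(data) - mu0) / (stdev(data) / sqrt(len(data)))

clean = [1, 2, 1, 3, 2, 1, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 1, 3, 2, 1]
contaminated = clean + [150]   # one outlier

print(one_sample_t(clean))         # ~10: comfortably "significant"
print(one_sample_t(contaminated))  # ~1.2: one outlier buries the effect
```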

ANOVA, MANOVA, multiple regression, MANCOVA, and various component analyses are all statistical techniques grad students learn in their required multivariate statistics course, and they all use the sample average. In fact, although the effect of using the average changes, it’s there for the entirety of what’s called the general linear model (which is basically all of the statistics a person doing this kind of research will be introduced to in a class).
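One way to see that the sample average sits underneath all of these: an ordinary least-squares regression with nothing in it but an intercept just returns the mean (a sketch with NumPy, assumed available; the data reuse the made-up contaminated sample above):

```python
import numpy as np

# The made-up contaminated sample from the t-test sketch above.
y = np.array([1, 2, 1, 3, 2, 1, 2, 3, 1, 2,
              1, 1, 2, 3, 1, 2, 1, 3, 2, 1, 150], dtype=float)

# Regression with only an intercept: the design matrix is a column of ones.
X = np.ones((len(y), 1))
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef[0], y.mean())  # identical: the fitted "model" is just the sample mean
```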

So why don’t they use different tests? And how much difference does using the average make?

Normally, if you ask a statistician this question, there are two possible outcomes:

1) You are killed because you dared to question the Holy Central Limit Theorem as it is misapplied constantly

Or

2) You are told that you should be doing something called “Bayesian analysis” or “Bayesian statistics” (and if you ask why, then you will be hunted down and beaten to death by the kill squad of the Bayesian conspiracy).

The actual answer is that most of the statistical tests used are used because somebody else used them for a similar experiment before (if it is one of the “standard” tests taught in the probably only two math-like classes the social-psychology-type researcher will take, then they can just assume plenty of researchers have used it for similar experiments, rather than show this). Basically, a great many researchers in the social and behavioral sciences are taught which question types go with which statistical tests (or how to look this up without learning more math). The basis for all the standard claims that a small sample can approximate a population is what’s called the Central Limit Theorem. In simplistic terms, the theorem concerns a random variable defined over some population: the number of US citizens who have at least decent competency in a language other than English, the number of bottles of vodka each Russian drinks per day, the number of Canadians made fun of in any given year, the number of French people who hate or are hated by other people, the number of [insert group name here] it takes to screw in a light bulb, etc. It doesn’t matter if the variable is highly skewed, as in the case of US citizens who know another language (the vast majority do not): if you repeatedly take random samples from the US citizen population, the distribution of the sample means will quickly tend towards a normal distribution (a bell curve).
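A quick simulation of that claim (NumPy assumed; the 20% second-language figure is just an illustrative number, not a citation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A highly skewed variable: 1 if a "citizen" knows a second language, else 0.
population = rng.random(1_000_000) < 0.20

# Draw many random samples of size 40 and record each sample's mean.
sample_means = np.array([population[rng.integers(0, population.size, 40)].mean()
                         for _ in range(10_000)])

# The individual values are all 0 or 1 (nothing like a bell curve),
# but the sample means pile up symmetrically around 0.20.
print(sample_means.mean(), sample_means.std())
```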

Somehow, that turned into “you only need ~40 people to represent all humans.” The Central Limit Theorem involves calculus; in fact, most probability courses in college require calculus as a prerequisite. Yet, despite the obvious kinship between probability and statistics, this doesn’t hold for statistics courses. Multivariate statistics uses linear algebra and/or multivariate calculus, yet social-psychology types are typically never required to take either linear algebra or any calculus. That is, the two subjects central to the underlying logic of the statistics used in this research (linear algebra for its algebraic structure, calculus for probability distributions) are precisely the ones not required.

The basis of sampling (probability distribution functions and their limits/integrals) is presented as something more like numerology than number theory (actually, it should be “than analysis,” but numerology and number theory sounds better). For many of the more advanced multivariate statistics, eigenvalues and eigenvectors are the core of the technique (as in, e.g., principal component analysis), yet these are concepts taught rather late in a linear algebra course. They require an understanding of matrices, dimensions, matrix algebra, and other things sufficiently complex that every university with a math department will have an entire course devoted to linear algebra. And although such a course doesn’t spend much time on applications, a multivariate statistics class of the type required of those who run these intelligence studies is practically devoted to such applications, yet doesn’t teach the linear algebra.
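To make the eigenvalue remark concrete, here is a bare-bones principal component analysis as the eigendecomposition of a covariance matrix (NumPy assumed; the data are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))           # 200 "subjects", 3 measures
X[:, 2] = X[:, 0] + 0.1 * X[:, 2]       # force some correlation between measures

Xc = X - X.mean(axis=0)                 # center each column (the mean again!)
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues/eigenvectors of a symmetric matrix
order = np.argsort(eigvals)[::-1]       # largest eigenvalue first

print(eigvals[order])                   # variance "explained" by each component
components = Xc @ eigvecs[:, order]     # the principal component scores
```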
