**P**sychology has a **p**roblem. It’s not a **p**iffling **p**roblem, but a **p**ronounced **p**roblem; one **p**regnant with **p**er**p**etual **p**ain and **p**otential to induce **p**oor health. That’s right—**p**-values!

P-values are the end result of most common statistical analyses in psychology, and they “allow” the researcher to determine whether their effect is statistically “significant”; in psychology, we use a criterion of *p*<.05 as our marker of significance. (As an aside, one surprisingly common mistake students make is confusing ‘<’ with ‘>’; ‘<’ means less than. If x < y, we are saying x is less than y; just think that the arrow needs to point to the smaller value. So, x > y would mean that x is greater than y.)

Why is it problematic? There are many reasons—it is a test of a null hypothesis, but the null is never really true; it depends on sample size; it doesn’t answer the questions we, as researchers, are interested in—but I want to focus on just one: the p-value—primarily its interpretation—is severely misunderstood.

Consider this question: how would you answer it in an exam situation?

Which of the following is a definition of *p*?

1. The probability that the results are due to chance; i.e. the probability that the null is true.
2. The probability that the results are not due to chance; i.e. the probability that the null is false.
3. The probability of observing results as extreme (or more so) as the ones you have obtained, if the null hypothesis is true.
4. The probability that the results would be replicated if the experiment was conducted a second time.

You might think that 1 is the correct answer…well…give yourself a pat on the…HEAD! (The correct answer is 3.)

Don’t feel bad. I put this question to a class of level 2 undergraduates, and all of them got it wrong. I was also approached by the internal exam board at my university, which suggested that I had marked the wrong answer as correct in the end-of-semester exam. (Don’t believe for one second this is unique to my department; give the above question to professors in your department and count on one hand how many get it correct.)

Yes, this problem runs deep, but it’s not surprising; I have in front of me several undergraduate statistics textbooks, and only one—yes, ONE!—gives the correct definition.

The *p*-value is the probability of observing results as extreme as yours—or even more extreme—if the null is true. Although most people are surprised by the answer (were you?), it makes perfect intuitive sense once you think about it.
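If the wording still feels slippery, a toy calculation may help. The sketch below is not from any real study: it assumes a hypothetical experiment of 20 coin flips that produced 16 heads, with a null hypothesis that the coin is fair. The function name `two_sided_binomial_p` is my own invention for this illustration. It simply adds up the probability of every outcome at least as far from 10 heads as the one observed, which is the definition above turned into arithmetic.

```python
from math import comb


def two_sided_binomial_p(heads, flips):
    """Exact two-sided p-value under the null that the coin is fair.

    Sums the probability of every outcome that is at least as far
    from the expected count (flips / 2) as the observed count.
    """
    observed_distance = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)                      # number of sequences with k heads
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )
    # Under a fair coin, all 2**flips sequences are equally likely.
    return extreme_outcomes / 2 ** flips


# Hypothetical data: 16 heads in 20 flips.
p = two_sided_binomial_p(16, 20)
print(round(p, 4))  # 0.0118
```

Note what the number is and is not. It is the probability of data this extreme (or more so) given a fair coin. It is not the probability that the coin is fair given the data, which is what options 1 and 2 in the quiz would have you believe.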

Imagine we are interested in whether caffeine improves cognitive function (having mainlined some coffee this morning just so I feel half-human, this is a pertinent question). Let us also assume that the effect of caffeine on cognitive function is real; that is, caffeine does improve function. We give one group of participants a caffeine pill and the other a placebo, and then give both groups some arbitrary test of cognitive function. We get the following means (standard deviations in parentheses), assuming that in this arbitrary case a lower mean signifies better performance:

- Caffeine—26.75 (15.26)
- Placebo—38.55 (14.07)

We run our tests on the two means—for those interested, it would be an independent-samples t-test—and get *p*=0.008.

With our correct definition of *p* in hand, we can state that the probability of observing the scores we have (or scores more extreme) **if the null hypothesis is really true** (that is, if caffeine actually has no real effect) is 0.008; i.e., if there really is no true effect of caffeine, then it is very unlikely that we would obtain the results that we have. We can therefore suggest that caffeine does have an effect on cognitive function (if nothing else, it surely helps with writing blogs about *p*-values at 8am on a Saturday). In psychology, because the *p*-value falls below the criterion of 0.05, we would declare this a significant result. Bingo!
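For those who like to see the "if the null is true" part acted out, here is a rough simulation sketch. It builds thousands of pretend experiments in a world where caffeine does nothing, and counts how often the difference between group means is at least as large as the 11.8 points seen above. Two loud assumptions: the group size of 20 is my guess (no sample size is given above), and the sketch compares raw mean differences using a pooled standard deviation rather than running a proper t-test. So the number it prints is an illustration of the logic and need not match 0.008. The function name `simulate_null_p` is made up for this example.

```python
import random

# Summary statistics from the caffeine example.
MEAN_CAFFEINE, MEAN_PLACEBO = 26.75, 38.55
SD_CAFFEINE, SD_PLACEBO = 15.26, 14.07

# ASSUMPTION: group size is not reported; 20 per group is a guess.
N_PER_GROUP = 20


def simulate_null_p(observed_diff, sd, n, n_sims=10_000, seed=1):
    """Proportion of null-world experiments whose mean difference is
    at least as extreme (in either direction) as the observed one."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sims):
        # Under the null, both groups are drawn from the same population.
        group_a = [rng.gauss(0.0, sd) for _ in range(n)]
        group_b = [rng.gauss(0.0, sd) for _ in range(n)]
        diff = sum(group_a) / n - sum(group_b) / n
        if abs(diff) >= abs(observed_diff):
            extreme += 1
    return extreme / n_sims


observed_diff = MEAN_PLACEBO - MEAN_CAFFEINE
# ASSUMPTION: one common SD for both groups (root mean square of the two).
pooled_sd = ((SD_CAFFEINE ** 2 + SD_PLACEBO ** 2) / 2) ** 0.5

print(simulate_null_p(observed_diff, pooled_sd, N_PER_GROUP))
```

The printed proportion is small because, in a world with no caffeine effect, differences this large rarely turn up by sampling alone. That is all a small *p*-value says.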

There are many other things wrong with the *p*-value, and there are always murmurings and movements in the literature to abolish the *p*-value once and for all. I agree with these sentiments. In the meantime, though, I think communicating the true interpretation of the *p*-value will help researchers understand their data better, until a viable, accessible alternative comes along. I will discuss some alternatives through the lifetime of this blog.

Now, back to my coffee…