Category Archives: Significance testing

We’re all (naturally) Bayesian (and psi is very likely not true)

In a previous post, I extolled the virtues of Bayesian statistics. Bayesian statistics is often touted as the superior alternative to p-values, and its popularity is growing rapidly in psychology. In that post I discussed how one of the advantages of Bayesian analysis is that it allows us to assess evidence in support of a certain hypothesis, given the data (rather than what the p-value gives you, which is something quite different, as I discuss in this previous post).

However, that post neglected one critical aspect of Bayesian analysis: priors. In Bayesian statistics, you take your prior belief about something, then you view some data, and Bayes’ rule formally states how your beliefs should be updated in light of the new data; this updated belief is called the posterior. The prior not only states what your prior belief is, it also states how strongly you believe it. For example, if deciding whether a coin is fair or not, most people would have a strong prior that it is fair (i.e., that p(head) = 0.5), because most coins you encounter are fair. However, some people might believe in aliens, but the prior for this would be pretty weak. Strong priors require a lot of strong opposing evidence to overcome, whereas weak priors can be overcome with weaker evidence.


At this point, frequentist psychologists (i.e., those strongly committed to “traditional” statistics with p-values and null-hypothesis testing) will be shaking their heads in fervent disgust. They will say, “You shouldn’t allow your beliefs to enter into scientific analysis; I want objective results, not subjective results!” Do they have a point? I would argue NO!

Why we’re all Bayesian already

We are all (yes, even you) Bayesian. We let our beliefs (priors) enter our assessment of “data” all the time. Let’s use an example. If I told you I had roast chicken for dinner this weekend, your prior would likely be to believe me; therefore, you would not require much evidence from me (or any at all) to be convinced that I had roast chicken. However, if I told you I had roast chicken for dinner, and aliens joined me for dessert, you would immediately demand photo evidence, video recordings, and a whole host of scientific evidence to convince you it was true.

Priors should enter our decision-making process, and be paired with the data to reach rational conclusions. As another example, if I flipped a coin 10 times and it came up heads 8 times, you would likely not shift much from your belief that it is a fair coin. However, if you knew the coin was from a magic shop, your prior would be that the coin is biased, and the same 8 heads would look far more telling. These priors should be used for coming to conclusions.
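
To make the coin example concrete, here is a minimal Python sketch of Bayesian updating with a Beta prior on p(head). The prior strengths used here (Beta(500, 500) for a strong belief in fairness, Beta(1, 1) for a weak, flat prior) and the function names are illustrative assumptions, not values taken from this post.

```python
def update_beta(prior_heads, prior_tails, heads, tails):
    """Conjugate update: a Beta(a, b) prior plus binomial data gives Beta(a + heads, b + tails)."""
    return prior_heads + heads, prior_tails + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, i.e. the expected value of p(head)."""
    return a / (a + b)


# Hypothetical priors (assumptions for illustration only).
strong_fair_prior = (500, 500)  # strong belief that the coin is fair
weak_flat_prior = (1, 1)        # almost no prior commitment

# The data from the example: 8 heads in 10 flips.
heads, tails = 8, 2

posterior_strong = update_beta(*strong_fair_prior, heads, tails)
posterior_weak = update_beta(*weak_flat_prior, heads, tails)

print("Strong prior, posterior mean of p(head):", round(beta_mean(*posterior_strong), 3))
print("Weak prior, posterior mean of p(head):  ", round(beta_mean(*posterior_weak), 3))
```

Under these assumed priors, the strong prior barely moves away from 0.5, whereas the weak prior is pulled most of the way towards the observed proportion of heads.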


An Example from Psychology

Below is a screen-grab from a paper I have read recently. It shows a meta-analysis of a certain psychological phenomenon of interest. For those unfamiliar, a meta-analysis is a statistical aggregation of all known published results about a particular effect of interest. It allows for a much more precise estimate of the effect size of the particular phenomenon of interest. Meta-analyses are important, because individual studies have error (sampling, experimental, and measurement, with residual errors also) associated with their reported results.

The figure below shows a recent meta-analysis. Each diamond reflects the estimated effect size for the phenomenon under investigation in each study. The lines represent 95% confidence intervals around this effect size. Confidence intervals that overlap with zero do not provide convincing evidence that the effect size reported is different from zero.


[Figure: bem]

The diamond at the bottom (circled in red) is the meta-analysis effect size; that is, the estimate of effect size when the data is aggregated across all studies (this is a very simplified, and ultimately incorrect, summary of what the meta-analytic effect size represents, but it serves our purposes). You will note that the 95% confidence intervals aren’t even close to zero, suggesting that the effect under investigation is “real”.
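
For readers who want to see how study-level confidence intervals and a pooled estimate can be computed, here is a rough Python sketch using fixed-effect, inverse-variance weighting. This is one common pooling approach and is not necessarily the method used in the paper; the effect sizes and standard errors below are made-up numbers for illustration, not the values in the figure.

```python
import math


def confidence_interval(effect, se, z=1.96):
    """Approximate 95% confidence interval from an effect size and its standard error."""
    return effect - z * se, effect + z * se


def pooled_fixed_effect(effects, standard_errors):
    """Inverse-variance weighted pooled effect and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical study results (assumptions for illustration only).
effects = [0.10, 0.25, 0.05, 0.20]
standard_errors = [0.10, 0.12, 0.08, 0.09]

for effect, se in zip(effects, standard_errors):
    low, high = confidence_interval(effect, se)
    print(f"study effect {effect:.2f}, 95% CI [{low:.3f}, {high:.3f}], overlaps zero: {low <= 0 <= high}")

pooled, pooled_se = pooled_fixed_effect(effects, standard_errors)
pooled_low, pooled_high = confidence_interval(pooled, pooled_se)
print(f"pooled effect {pooled:.3f}, 95% CI [{pooled_low:.3f}, {pooled_high:.3f}]")
```

With these invented numbers, some individual intervals overlap zero while the pooled interval does not, because pooling shrinks the standard error.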

Now, to prove you are all Bayesian. Answer the following two questions:

1) How strongly do you believe in this final positive effect size if you knew the studies were investigating whether working memory capacity is related to intelligence?

2) How strongly do you believe in this final positive effect size if you knew the studies were investigating whether people are able to perceive future events? This is known as “psi”.

Now, I believe that most—if not all—would take the first question at face value, and would have faith in the final effect size. You all either have no real prior for the relationship between working memory and intelligence, or you have a prior that there is indeed a relationship. Therefore, you aren’t really surprised (or perhaps interested) in the result of the meta-analysis.

However, I would be shocked if any had faith in the final effect size knowing that it provides “positive evidence” for people being able to see into the future. That is, you all have—and if you don’t, you really should have—a strong prior against believing in being able to perceive future events reliably.

The point is, in both cases you are viewing the same data. In the former, you would accept the data without much problem. In the latter case, you still think the results don’t provide convincing evidence. That is, you are Bayesian. Accept it. Embrace it. Move forward.
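
The same logic can be sketched with the odds form of Bayes’ rule: posterior odds = prior odds × Bayes factor. In the Python sketch below, the Bayes factor and both prior probabilities are hypothetical numbers chosen purely for illustration; they are not estimates from the meta-analysis.

```python
def posterior_probability(prior_probability, bayes_factor):
    """Update a prior probability using the odds form of Bayes' rule."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


# Hypothetical: identical evidence in both cases (assumption for illustration only).
bayes_factor = 1000

# Hypothetical priors (assumptions for illustration only).
prior_working_memory = 0.5  # no strong opinion either way
prior_psi = 1e-9            # strong prior against perceiving the future

print("Working memory and intelligence:", posterior_probability(prior_working_memory, bayes_factor))
print("Psi:", posterior_probability(prior_psi, bayes_factor))
```

Identical data (the same Bayes factor) leaves the first claim almost certain and the second still extremely improbable, purely because the priors differ.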

Bem’s Psi Work

This meta-analysis is not made up; it’s from a paper by Cornell psychologist Daryl Bem. You can view the full (as yet, unpublished) manuscript here:

This paper has confirmed my prior belief (pun intended) that Bayesian analysis is the way forward in tackling how to interpret data. Priors entering the decision-making equation is not a weakness, but a strength. Despite Bem’s best efforts with this paper and his prior work, I am still not convinced that people can “feel the future”. To me, psi doesn’t exist. A frequentist should be convinced by the above meta-analysis—after all, the p-value for psi is very low!

A strong Bayesian approach to statistical decision-making, with well-defined (and defensible) priors, makes you robust against wacky claims, even if the data is “strong”.


“Extraordinary claims require extraordinary evidence”

—Carl Sagan (quite likely a Bayesian)




Understanding p-values via simulations

As I mentioned in an earlier post, p-values in psychological research are often misunderstood. Ask students (and academics!) what the definition of the p-value is and you will likely get many different responses. To jog your memory, the definition of the p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one you have observed, assuming the null is true. But, even with this definition in hand, many struggle to conceptualise what the p-value reflects. In this blog post, I take inspiration from a lovely paper I have recently read that advocates using computer simulation to understand the p-value.


Effect Sizes, FTW!

Today I give my favourite lecture of the year. Second-year students will be introduced to some of the controversies surrounding the use of p-values. Although I don’t go into great depth, I highlight that p-values, among the many problems associated with them, don’t provide answers to the type of questions we, as researchers, are asking.


What’s all this business about Bayes?

As readers of contemporary psychology journals may well know, you can’t help but keep reading about something called “Bayesian Statistics”. Researchers “in the know” seem to extol Bayesian statistics as a superior method of making inferences from data compared to traditional, “frequentist”, methods (yes, Mr p-value, I’m looking at you!).

But what is it? What can it do that standard methods can’t? It turns out the answer is just about everything you’ve ever dreamed of (well, as a researcher, anyway!).


P-off and leave me alone

Psychology has a problem. It’s not a piffling problem, but a pronounced problem; one pregnant with perpetual pain and potential to induce poor health. That’s right—p-values!

P-values are the end result of most common statistical analyses in psychology, and they “allow” the researcher to determine whether their effect is statistically “significant”; in psychology, we use a criterion of p < .05 as our marker of significance. (As an aside, one surprisingly common mistake students make is confusing ‘<’ with ‘>’; ‘<’ means less than. If x < y, we are saying x is less than y; just think that the arrow needs to point to the smaller value. So, x > y would mean that x is greater than y.)

Why is it problematic? There are many reasons—it is a test of a null hypothesis, but the null is never really true; it depends on sample size; it doesn’t answer the questions we, as researchers, are interested in—but I want to focus on just one: the p-value—primarily its interpretation—is severely misunderstood.

Consider this question: how would you answer it in an exam situation?

Which of the following is a definition of p?

  1. The probability that the results are due to chance; i.e. the probability that the null is true.
  2. The probability that the results are not due to chance; i.e. the probability that the null is false
  3. The probability of observing results as extreme as (or more extreme than) the ones you have obtained, if the null hypothesis is true.
  4. The probability that the results would be replicated if the experiment was conducted a second time.

You might think that 1 is the correct answer…well…give yourself a pat on the…HEAD!

Don’t feel bad. I ran this question in class to level 2 undergraduates, and all got it wrong. I was also approached by the internal exam board at my university who suggested that I had marked the wrong answer as correct in the end-of-semester exam. (Don’t believe for one second this is unique to my department; give the above question to professors in your department and count on one hand how many get it correct.)

Yes, this problem runs deep, but it’s not surprising; I have in front of me several undergraduate statistics textbooks, and only one—yes, ONE!—gives the correct definition.

The p-value is the probability of observing results as extreme as yours—or even more extreme—if the null is true. Although most people are surprised by the answer (were you?), it makes perfect intuitive sense once you think about it.

Imagine we are interested in whether caffeine improves cognitive function (having mainlined some coffee this morning just so I feel half-human, this is a pertinent question); let us also assume that the true effect of caffeine on cognitive function is real—that is, caffeine does improve function. We give one group of participants a caffeine pill, and the others a placebo, and then expose them to some arbitrary test of cognitive function. We get the following means (standard deviations in parentheses), assuming that in this arbitrary case a lower mean signifies better performance:

  • Caffeine—26.75 (15.26)
  • Placebo—38.55 (14.07)

We run our tests on the two means—for those interested, it would be an independent-samples t-test—and get p=0.008.

With our correct definition of p in hand, we can state that the probability of observing the scores we have (or scores more extreme) if the null hypothesis is really true (that is, if caffeine actually has no real effect) is 0.008; i.e., if there really is no true effect of caffeine, then it is very unlikely that we would obtain the results that we have. We can therefore suggest that caffeine does have an effect on cognitive function (if nothing else, it surely helps with writing blogs about p-values at 8am on a Saturday). In psychology, because the p-value falls below the criterion of 0.05, we would declare this a significant result. Bingo!
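
The definition can also be checked by brute force. The Python sketch below simulates many experiments in which the null really is true (both groups drawn from the same population) and counts how often the t statistic is at least as extreme as the observed one. The post does not report the sample size, so the 20 participants per group used here is a hypothetical assumption, as are the function names; the simulated p-value therefore need not match the 0.008 reported above.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def t_statistic(a, b):
    """Independent-samples t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_variance(a) + (nb - 1) * sample_variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def t_from_summary(mean_a, sd_a, mean_b, sd_b, n):
    """t statistic from group means and SDs, assuming n participants in each group."""
    pooled = (sd_a ** 2 + sd_b ** 2) / 2
    return (mean_a - mean_b) / math.sqrt(pooled * 2 / n)


def simulated_p_value(observed_t, n, n_sims=10000, seed=1):
    """Proportion of null-true experiments with |t| at least as large as the observed |t|."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sims):
        # Under the null, both groups come from the same population.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if abs(t_statistic(a, b)) >= abs(observed_t):
            extreme += 1
    return extreme / n_sims


N_PER_GROUP = 20  # hypothetical: the sample size is not given in the post

observed_t = t_from_summary(26.75, 15.26, 38.55, 14.07, N_PER_GROUP)
p_sim = simulated_p_value(observed_t, N_PER_GROUP)
print("Observed t:", round(observed_t, 3))
print("Simulated two-tailed p-value:", p_sim)
```

The simulated p-value is simply the proportion of “no true effect” experiments that produce a result as extreme as ours, or more extreme, which is exactly what the definition says.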

There are many other things wrong with the p-value, and there are always murmurings and movements in the literature to abolish the p-value once and for all. I agree with these sentiments. In the meantime, though, I think communicating the true interpretation of the p-value will help researchers understand their data better, until a viable, accessible, alternative comes along. I will discuss some alternatives through the lifetime of this blog.

Now, back to my coffee…
